CN115134338A - Multimedia information coding method, object retrieval method and device - Google Patents

Multimedia information coding method, object retrieval method and device

Info

Publication number
CN115134338A
Authority
CN
China
Prior art keywords
information
coding
sample
multimedia
item
Prior art date
Legal status
Granted
Application number
CN202210563346.7A
Other languages
Chinese (zh)
Other versions
CN115134338B (en)
Inventor
蔡成飞
涂荣成
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210563346.7A
Publication of CN115134338A
Application granted
Publication of CN115134338B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/60 Network streaming of media packets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of machine learning, and in particular to a multimedia information coding method, an object retrieval method, and corresponding apparatus. The multimedia information coding method includes: acquiring multimedia information to be encoded; and performing information coding on the multimedia information to be encoded based on a target coding model to obtain target coding information corresponding to the multimedia information to be encoded. The target coding model is obtained by performing model training on a coding model to be trained based on loss information; the loss information is determined based on the similar coding information and the difference coding information respectively corresponding to each item of sample coding information; and the similar coding information and the difference coding information corresponding to each item of sample coding information are determined from the sample coding information set based on the pairwise similarity information between the items of sample multimedia information. The method and the apparatus can improve the coding accuracy of multimedia information.

Description

Multimedia information coding method, object retrieval method and device
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a multimedia information encoding method, an object retrieval method, and an object retrieval device.
Background
With the rapid development of the Internet, the amount of multimedia information has grown rapidly. To facilitate the application and storage of multimedia information, a multimedia resource can be encoded by an encoding model to obtain corresponding encoding information, and the encoding information corresponding to the multimedia information can then be applied or stored.
In the prior art, when the sample information includes original sample information and transformed sample information obtained by performing similarity transformation on the original sample information, the loss function is generally determined only from the similarity relationship between each item of original sample information and its corresponding transformed sample information. The resulting loss function is inaccurate, the coding performance of a coding model trained on this loss function is therefore poor, and encoding based on such a coding model reduces the coding accuracy of the multimedia information.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a multimedia information encoding method, an object retrieval method, and corresponding apparatus that can improve the coding accuracy of the coding model and thereby improve the retrieval accuracy of object retrieval.
In order to solve the foregoing technical problem, in one aspect, an embodiment of the present application provides a multimedia information encoding method, including:
acquiring multimedia information to be encoded;
carrying out information coding on the multimedia information to be coded based on a target coding model to obtain target coding information corresponding to the multimedia information to be coded;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained.
On the other hand, an embodiment of the present application provides an object retrieval method, including:
acquiring coding information of an object to be retrieved and coding information of a candidate object; the coding information of the object to be retrieved is obtained by carrying out information coding on the multimedia information of the object to be retrieved based on a target coding model; the encoding information of the candidate object is obtained by carrying out information encoding on the multimedia information of the candidate object based on the target encoding model;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained;
carrying out information matching on the coding information of the object to be retrieved and the coding information of the candidate object to obtain an information matching result;
and determining a target retrieval object from the candidate objects based on the information matching result.
In another aspect, an embodiment of the present application provides a multimedia information encoding apparatus, including:
the first acquisition module is used for acquiring multimedia information to be coded;
the first coding module is used for carrying out information coding on the multimedia information to be coded based on a target coding model to obtain target coding information corresponding to the multimedia information to be coded;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each sample multimedia information based on the coding model to be trained.
In another aspect, an embodiment of the present application provides an object retrieval apparatus, including:
the second acquisition module is used for acquiring the coding information of the object to be retrieved and the coding information of the candidate object; the coding information of the object to be retrieved is obtained by carrying out information coding on the multimedia information of the object to be retrieved based on a target coding model; the coding information of the candidate object is obtained by carrying out information coding on the multimedia information of the candidate object based on the target coding model;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained;
the information matching module is used for performing information matching on the coding information of the object to be retrieved and the coding information of the candidate object to obtain an information matching result;
and the retrieval result determining module is used for determining a target retrieval object from the candidate objects based on the information matching result.
In another aspect, the present application provides an electronic device, which includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the multimedia information encoding method or the object retrieval method as described above.
In another aspect, the present application provides a computer storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the multimedia information encoding method or the object retrieval method as described above.
The embodiment of the application has the following beneficial effects:
In the process of training the coding model, the similar coding information and the difference coding information respectively corresponding to each item of sample coding information are determined based on the pairwise similarity information between the items of sample multimedia information, and the loss information is determined based on the similar coding information and the difference coding information. Because the loss information is determined from the pairwise similarity information between the items of sample multimedia information, the similarity information between each item of sample multimedia information and all other sample multimedia information is comprehensively covered and incorporated into the loss information, which improves the accuracy of the loss information. The method is applicable both to scenes in which the sample multimedia information contains original sample multimedia information together with transformed sample multimedia information obtained by similarity transformation, and to scenes in which the sample multimedia information contains only original sample multimedia information, so application flexibility and adaptability are improved. Furthermore, on the basis of the more accurate loss information, model training based on that loss information improves the coding performance of the target coding model and thus the coding accuracy of the multimedia information.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings described below are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for encoding multimedia information according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for training a target coding model according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for determining similar coding information and difference coding information according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for determining similar information according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for generating encoded information according to an embodiment of the present application;
FIG. 7 is a flowchart of a loss information determining method according to an embodiment of the present application;
FIG. 8 is a flowchart of another loss information determining method according to an embodiment of the present application;
FIG. 9 is a flowchart of a multimedia information transformation method according to an embodiment of the present application;
FIG. 10 is a flowchart of an object retrieval method provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a target coding model provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an object coding model provided in an embodiment of the present application;
FIG. 13 is a diagram of an apparatus for encoding multimedia information according to an embodiment of the present application;
FIG. 14 is a schematic diagram of an object retrieval apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, the present application will be further described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, a schematic diagram of an implementation environment provided by an embodiment of the present application is shown, where the implementation environment may include: at least one retrieval request terminal 110 and a retrieval processing terminal 120, wherein the retrieval request terminal 110 and the retrieval processing terminal 120 can perform data communication through a network.
Specifically, the retrieval request end 110 may send a retrieval request to the retrieval processing end 120, where the retrieval request may include object information of an object to be retrieved; the retrieval processing end 120 may determine the encoding information of the object to be retrieved based on the object information of the object to be retrieved, then perform information matching based on the encoding information of the object to be retrieved and the encoding information of the candidate retrieval object, and determine the corresponding object retrieval result based on the matching result; the search processing terminal 120 sends the object search result to the search request terminal 110.
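Purely as an illustrative sketch (not the patented implementation), the information matching between the coding information of the object to be retrieved and the coding information of the candidate objects can be pictured as nearest-neighbour ranking over fixed-length binary codes, for example by Hamming distance; the function and variable names below are hypothetical.

    import numpy as np

    def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
        # Both codes are 0/1 vectors of the same coding length r.
        return int(np.sum(code_a != code_b))

    def retrieve(query_code: np.ndarray, candidate_codes: np.ndarray, top_k: int = 5):
        # Rank candidate objects by Hamming distance to the query code (smaller = more similar).
        distances = [hamming_distance(query_code, c) for c in candidate_codes]
        order = np.argsort(distances)
        return [(int(i), distances[i]) for i in order[:top_k]]

    # Toy usage: a 16-bit query code against 100 random candidate codes.
    rng = np.random.default_rng(0)
    candidates = rng.integers(0, 2, size=(100, 16))
    query = candidates[42].copy()          # candidate 42 becomes an exact match
    print(retrieve(query, candidates, top_k=3))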
The retrieval request terminal 110 can communicate with the retrieval processing terminal 120 in a Browser/Server (B/S) mode or a Client/Server (C/S) mode. The retrieval request terminal 110 may include physical devices, and may also include software running in the physical devices, such as application programs. The operating system running on the retrieval request terminal 110 in the embodiment of the present application may include, but is not limited to, Android, iOS, Linux, Windows, and the like.
The retrieval processing end 120 and the retrieval requesting end 110 may establish a communication connection through a wire or wirelessly, and the retrieval processing end 120 may include a server operating independently, or a distributed server, or a server cluster composed of a plurality of servers, where the server may be a cloud server.
In order to solve the prior-art problem that the poor coding performance of the coding model leads to low coding accuracy for multimedia information, an embodiment of the present application provides a multimedia information encoding method. The execution subject of the method may be the retrieval processing end described above. Referring to fig. 2, the method specifically includes:
s210, multimedia information to be coded is obtained.
The multimedia information in the embodiment of the present application may include information of types such as image, text, audio, and video, so the multimedia information to be encoded may include one or more of image, text, audio, video, and other information. When the multimedia information to be encoded includes one type of multimedia information, it corresponds to a single-modal input scene; when the multimedia information to be encoded includes multiple types of multimedia information and the multiple types are description information for the same object, it corresponds to a cross-modal input scene. For example, the multimedia information to be encoded may include an image and a text, where the image shows a cat and the text reads "a cat", that is, the image and the text describe the same object.
S220, information coding is carried out on the multimedia information to be coded based on a target coding model, and target coding information corresponding to the multimedia information to be coded is obtained.
The target coding model in this embodiment performs information coding on the multimedia information to be encoded. Information coding here may be an information conversion process that converts multimedia information into digital or symbolic information according to a conversion rule, and the resulting encoded information is convenient to store, retrieve, and use. For example, different multimedia information transformed by the same conversion rule yields encoded information with the same metric and the same coding length, so that operations such as comparison and retrieval can be performed directly on the encoded information corresponding to the different multimedia information.
The information encoding manner in this embodiment may specifically be hash coding, such as MD5 (Message-Digest Algorithm 5), SHA1 (Secure Hash Algorithm 1), SHA256 (Secure Hash Algorithm 256), or SHA512 (Secure Hash Algorithm 512); hash coding converts information of variable length into a fixed-length hash value.
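To illustrate only the fixed-length property mentioned above (and not the learned target coding model of this application), the following minimal Python snippet shows how standard hash functions such as MD5 and SHA-256 map inputs of different lengths to digests of a fixed length.

    import hashlib

    for text in ("a cat", "a much longer piece of multimedia metadata"):
        md5_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        sha256_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # MD5 always yields 32 hex characters (128 bits); SHA-256 always yields 64 (256 bits).
        print(len(md5_digest), len(sha256_digest))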
The target coding model is obtained by carrying out model training on a coding model to be trained on the basis of the loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each sample multimedia information based on the coding model to be trained.
In the training process of the coding model to be trained, the corresponding loss information can be determined based on the similar coding information and the difference coding information, and the similar coding information and the difference coding information can in turn be determined based on the pairwise similarity information between the items of sample multimedia information. In other words, when the loss information is determined, the comprehensively covered similarity information between the items of sample multimedia information is incorporated into the loss information, which improves the accuracy of the loss information. Because determining the loss information involves only the similarity information between the items of sample multimedia information and is independent of the sample type, the model training process is applicable to a scene in which the sample multimedia information contains only original sample multimedia information, as well as to a scene in which the sample multimedia information contains both original sample multimedia information and transformed sample multimedia information obtained by performing similarity transformation on the original sample multimedia information, so application flexibility and adaptability are improved. Furthermore, on the basis of the more accurate loss information, model training based on that loss information improves the coding performance of the target coding model and thus the coding accuracy of the multimedia information.
Further, referring to fig. 3, a specific training method for the target coding model may include:
and S310, respectively carrying out information coding on each sample multimedia information based on the coding model to be trained to obtain a sample coding information set.
In this embodiment, the coding model to be trained may be an initialized machine learning model or a pre-trained model. The coding model to be trained may include coding submodels corresponding to the different multimedia types, such as a first coding submodel for images, a second coding submodel for texts, a third coding submodel for audio, and a fourth coding submodel for video. In a single-modal input scene, the corresponding coding submodel is determined based on the type of the input multimedia information, and the corresponding coding information is then obtained. In a cross-modal input scene, several corresponding coding submodels are determined based on the several types of input multimedia information, and several items of coding information corresponding to the cross-modal multimedia information are obtained from these submodels, the number of items being consistent with the number of input multimedia types; furthermore, to facilitate storage of the encoded information, information fusion processing can be performed on the several items of coding information to obtain fused coding information corresponding to the cross-modal multimedia information.
Based on the modality information of the sample multimedia information, each item of sample multimedia information is encoded by the coding model to be trained to obtain the sample coding information corresponding to each item of sample multimedia information, thereby obtaining the corresponding sample coding information set. The sample coding information set may include one item of coding information for each item of sample multimedia information, or multiple items of coding information for each item of sample multimedia information.
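The following is a minimal sketch of the idea of routing each modality to its own coding submodel and fusing the resulting codes in the cross-modal case. The submodels here are trivial placeholder callables and the fusion-by-averaging step is an assumption for illustration, not the patent's architecture.

    from typing import Dict, List

    # Hypothetical per-modality coding submodels; real ones would be neural encoders.
    SUBMODELS = {
        "image": lambda x: [float(len(str(x)) % 2), 1.0],   # placeholder "codes"
        "text":  lambda x: [float(len(str(x)) % 2), 0.0],
    }

    def encode_sample(sample: Dict[str, object]) -> List[float]:
        # One item of coding information per modality present in the sample.
        codes = [SUBMODELS[m](v) for m, v in sample.items() if m in SUBMODELS]
        if len(codes) == 1:                      # single-modal input scene
            return codes[0]
        # Cross-modal input scene: fuse the items of coding information by element-wise averaging.
        return [sum(vals) / len(vals) for vals in zip(*codes)]

    print(encode_sample({"image": "cat.jpg"}))
    print(encode_sample({"image": "cat.jpg", "text": "a cat"}))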
S320, based on the pairwise similar information among the sample multimedia information, determining the similar coding information and the difference coding information respectively corresponding to the sample coding information from the sample coding information set.
From the pairwise similarity information between the items of sample multimedia information, the similarity information between any two items of sample multimedia information can be determined; this similarity information represents the degree of similarity between the two items, and correspondingly the degree of similarity between each item of sample multimedia information and every other item can be determined. In this embodiment, similar multimedia information should have similar coding information, so the similar coding information and the difference coding information corresponding to each item of sample coding information can be determined.
The pairwise similarity information between the items of sample multimedia information can be represented by similarity levels, for example a first level, a second level, a third level, and so on; the higher the similarity level, the higher the corresponding degree of similarity, and conversely the lower the degree of similarity. A corresponding similarity threshold may be set, for example at the third level, so that two items of sample multimedia information whose similarity level is greater than or equal to the third level are determined to be similar sample multimedia information, and two items whose similarity level is below the third level are determined to be difference sample multimedia information. Alternatively, the degree of similarity can be expressed by a similarity value, specifically a number between 0 and 1, where a larger value indicates a higher degree of similarity and a smaller value a lower degree of similarity. A corresponding similarity threshold may be set, for example 0.8, so that two items of sample multimedia information whose similarity value is greater than or equal to 0.8 are determined to be similar sample multimedia information, and two items whose similarity value is below 0.8 are determined to be difference sample multimedia information.
Therefore, for each item of coding information, the corresponding similar coding information may be the sample coding information whose similarity level with that item is greater than or equal to a preset level, or whose similarity value with that item is greater than or equal to a preset similarity threshold; the similar coding information may include the item of coding information itself. Correspondingly, the difference coding information may be the sample coding information whose similarity level with that item is below the preset level, or whose similarity value with that item is below the preset similarity threshold.
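As a small sketch (assuming, as in the example above, similarity values in [0, 1] and a threshold of 0.8), the similar and difference sets for one item can be obtained by partitioning a row of the pairwise similarity matrix:

    import numpy as np

    def split_similar_difference(sim_row: np.ndarray, threshold: float = 0.8):
        # sim_row[j] is the pairwise similarity between the current item and item j.
        similar = [j for j, s in enumerate(sim_row) if s >= threshold]     # includes the item itself
        difference = [j for j, s in enumerate(sim_row) if s < threshold]
        return similar, difference

    sim_matrix = np.array([[1.0, 0.9, 0.3],
                           [0.9, 1.0, 0.2],
                           [0.3, 0.2, 1.0]])
    print(split_similar_difference(sim_matrix[0]))   # ([0, 1], [2])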
When all the sample multimedia information is original multimedia information, similarity calculation can be carried out on each original multimedia information and other original multimedia information in all the sample multimedia information, so that pairwise similar information between all the original multimedia information is obtained. Further, each item of original multimedia information is similar to itself.
When the items of sample multimedia information include original multimedia information and transformed multimedia information, the transformed multimedia information is obtained by performing similarity transformation on the original multimedia information; that is, the original multimedia information and the transformed multimedia information are in one-to-one correspondence, each item of original multimedia information corresponds to one item of transformed multimedia information, and items with this correspondence are similar to each other. The pairwise similarity information between the items of transformed multimedia information can be determined from the pairwise similarity information between the items of original multimedia information, and the two are consistent. For example, for original multimedia information x_i and x_j whose corresponding transformed multimedia information is x'_i and x'_j, the similarity information of x_i and x_j, of x'_i and x'_j, of x_i and x'_j, and of x'_i and x_j can be determined and is consistent. Further, each item of original multimedia information is similar to itself, and each item of transformed multimedia information is similar to itself. Therefore, the pairwise similarity information between the items of transformed multimedia information can be determined from the pairwise similarity information between the items of original multimedia information, and the pairwise similarity information between all items of sample multimedia information is then obtained, which improves the efficiency of determining pairwise similarity between the items of sample multimedia information.
In this embodiment, when determining pairwise similar information between multimedia information of each sample, the multimedia information of each sample may be pairwise combined to obtain similar information corresponding to each combination, so that repeated calculation of the similar information between any two multimedia information of the samples may be avoided.
S330, based on the similar coding information and the difference coding information respectively corresponding to the sample coding information, determining loss information.
Unsupervised training may be employed for the training of the coding model in this embodiment, and accordingly, the loss information may be determined based on the similar coding information as well as the difference coding information.
Because the pairwise similarity information between the items of sample multimedia information exists objectively, the corresponding training target is that the coding information obtained by encoding each item of sample multimedia information with the target coding model preserves that pairwise similarity. However, because the parameters of the coding model are not yet well trained, the coding results for the sample multimedia information may be inaccurate, so the coding information of actually similar sample multimedia information may be dissimilar, or the coding information of actually dissimilar sample multimedia information may be similar, which gives rise to the loss information.
And S340, performing model training on the coding model to be trained based on the loss information to obtain a target coding model.
Specifically, in each training pass, the loss information of that pass is determined based on the similar coding information and the difference coding information, and the parameters of the coding model are then updated by back-propagating the loss information. By determining the loss information and continuously optimizing the loss function, for example with stochastic gradient descent, the pairwise similarity information between the items of coding information approaches the pairwise similarity information between the items of sample multimedia information. When a preset training condition is reached, training of the coding model to be trained is complete, and the target coding model is obtained.
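A highly simplified training-loop sketch follows (PyTorch-style, with a stand-in model and a placeholder loss; the module sizes, optimizer settings, and loss are assumptions, not the patent's exact procedure). It only illustrates the cycle of computing loss information and updating the coding-model parameters by stochastic gradient descent.

    import torch
    from torch import nn

    # Stand-in coding model: maps 128-d features to 64-d code-like outputs via tanh.
    model = nn.Sequential(nn.Linear(128, 64), nn.Tanh())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # stochastic gradient descent

    def toy_loss(codes: torch.Tensor) -> torch.Tensor:
        # Placeholder for the loss information built from similar / difference coding information.
        return codes.pow(2).mean()

    for step in range(3):                        # a few illustrative training passes
        batch = torch.randn(16, 128)             # stand-in features of sample multimedia information
        codes = model(batch)
        loss = toy_loss(codes)
        optimizer.zero_grad()
        loss.backward()                          # back-propagate the loss information
        optimizer.step()                         # update the coding-model parameters
        print(step, float(loss))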
In the process of training the coding model, the similar coding information and the difference coding information respectively corresponding to each item of sample coding information are determined based on the pairwise similarity information between the items of sample multimedia information, and the loss information is determined based on the similar coding information and the difference coding information. Because the loss information is determined from the pairwise similarity information between the items of sample multimedia information, the similarity information between each item of sample multimedia information and all other sample multimedia information is comprehensively covered and incorporated into the loss information, which improves the accuracy of the loss information. Further, on the basis of the more accurate loss information, model training based on that loss information improves the coding performance of the target coding model.
For the method for determining similar coding information and differential coding information in the coding model training process, referring to fig. 4 in particular, the method may include:
s410, for each item of sample coding information in the sample coding information set, determining target sample multimedia information corresponding to each item of sample coding information.
S420, based on the pairwise similar information among the sample multimedia information, determining the similar multimedia information and the different multimedia information of the target sample multimedia information.
And S430, determining the sample coding information corresponding to the similar multimedia information in the sample coding information set as the similar coding information.
S440, determining the sample coding information corresponding to the difference multimedia information in the sample coding information set as the difference coding information.
Because the sample coding information corresponds to the sample multimedia information, the corresponding target sample multimedia information can be determined for each item of sample coding information in the sample coding information set. The sample coding information corresponding to any two similar items of sample multimedia information should likewise be similar, so the similar multimedia information and the difference multimedia information of the target sample multimedia information can be determined.
Specifically, taking the pairwise similarity information as similarity values: among the items of sample multimedia information, the items whose similarity value with the target sample multimedia information is greater than or equal to the similarity threshold are determined to be the similar multimedia information of the target sample multimedia information, and the items whose similarity value with the target sample multimedia information is below the similarity threshold are determined to be the difference multimedia information. Alternatively, the difference multimedia information corresponding to the target sample multimedia information can be determined directly as all items of sample multimedia information other than the target sample multimedia information itself and the items whose similarity with the target sample multimedia information exceeds the similarity threshold.
In one example, when each item of sample multimedia information is original multimedia information, non-contrastive learning may be performed: based on the pairwise similarity information between the items of sample multimedia information, the similar multimedia information and the difference multimedia information corresponding to each item can be determined. For each item of coding information u_i in the sample coding information set, corresponding to original multimedia information x_i, the similar multimedia information x_j of x_i can be obtained from the similarity information between x_i and the other items of original multimedia information; the similar multimedia information x_j includes x_i itself. The similar coding information u_j corresponding to the similar multimedia information x_j can then be determined, that is, the similar coding information corresponding to the coding information u_i is determined.
In another example, the sample multimedia information may include original multimedia information and transformed multimedia information; the transformed multimedia information is obtained by performing similarity transformation on the original multimedia information, that is, the original multimedia information and the transformed multimedia information are in one-to-one correspondence, each item of original multimedia information corresponds to one item of transformed multimedia information, and items with this correspondence are similar. This corresponds to contrastive learning, and the pairwise similarity information between the items of sample multimedia information can be determined through similarity calculation and similarity transformation. For each item of original multimedia information, the similarity information between it and the other items of original multimedia information can be determined by similarity calculation; for the transformed multimedia information corresponding to that item, the similarity between the item of original multimedia information and its transformed multimedia information can be set as a target similarity, where the target similarity is greater than the similarity threshold. For each item of transformed multimedia information, the similarity information between it and the other items of transformed multimedia information may be determined from the similarity information between the corresponding items of original multimedia information. For example, for original multimedia information x_i and x_j whose corresponding transformed multimedia information is x'_i and x'_j, the similarity information of x_i and x_j, of x'_i and x'_j, of x_i and x'_j, and of x'_i and x_j is consistent. The coding information u_i in the sample coding information set corresponds to sample multimedia information x_i, and the similar multimedia information having the similarity-transformation relation with x_i is x'_i, whose coding information is u'_i; likewise, the sample coding information u_j corresponds to sample multimedia information x_j, and the similar multimedia information having the similarity-transformation relation with x_j is x'_j, whose coding information is u'_j. Since the transformed multimedia information is obtained by similarity transformation, the similarity of x_i and x'_i is greater than or equal to the similarity threshold, and the similarity of x_j and x'_j is greater than or equal to the similarity threshold; if the similarity value of x_i and x_j is also greater than or equal to the preset similarity threshold, the pairwise similarity values among the coding information u_i, u_j, u'_i, and u'_j are all greater than or equal to the similarity threshold, which yields a set of similar coding information.
Therefore, when the similar coding information and the difference coding information corresponding to each item of sample coding information are determined, the target sample multimedia information corresponding to each item of coding information is determined first, and the pairwise similarity between the items of sample coding information is derived from the pairwise similarity between the items of sample multimedia information, so that the similar coding information and the difference coding information can be determined. This improves both the convenience and the accuracy of determining the similar coding information and the difference coding information.
The pairwise similarity information between the items of sample multimedia information can be determined based on the sample feature information of the items of sample multimedia information. Specifically, the coding model to be trained may include a feature extraction layer obtained by pre-training; accordingly, referring to fig. 5, a method for determining similarity information is shown, which may include:
and S510, performing feature extraction on the sample multimedia information based on the feature extraction layer to obtain sample feature information respectively corresponding to the sample multimedia information.
S520, similarity calculation is carried out on the basis of sample characteristic information respectively corresponding to the sample multimedia information, and pairwise similarity information between the sample multimedia information is obtained.
In this embodiment, the feature extraction layer in the coding model to be trained may be obtained by pre-training and therefore already has feature extraction capability. Feature extraction can thus be performed on each item of sample multimedia information based on this feature extraction layer to obtain the sample feature information corresponding to each item. The sample feature information characterizes the corresponding multimedia information, so similarity calculation can be performed on the sample feature information of the items of sample multimedia information to obtain the pairwise similarity information between the items. The sample feature information may take the form of a feature vector or a feature code.
Therefore, the sample multimedia information is subjected to feature extraction based on the feature extraction layer obtained by pre-training, the efficiency and the accuracy of sample feature information extraction are improved, and the efficiency and the accuracy of similar information calculation can be further improved.
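A small sketch of the similarity calculation (assuming the sample feature information is a plain vector per item; random vectors stand in for the output of the pre-trained feature extraction layer):

    import numpy as np

    def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
        # features: (n_samples, dim) sample feature information from the feature extraction layer.
        normed = features / np.linalg.norm(features, axis=1, keepdims=True)
        return normed @ normed.T      # entry (i, j) is the pairwise similarity of samples i and j

    features = np.random.default_rng(1).normal(size=(4, 8))   # stand-in extracted features
    S = cosine_similarity_matrix(features)
    print(np.round(S, 2))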
Further, the pairwise similarity information between each sample multimedia information can be determined based on the type of the multimedia information, for example, when the type of the sample multimedia information comprises an image, similarity calculation can be correspondingly performed on each sample image based on a pixel point comparison mode; when the type of the sample multimedia information comprises texts, similarity calculation can be correspondingly carried out on each sample text based on a text comparison mode.
The items of sample multimedia information can include original multimedia information and transformed multimedia information, where the transformed multimedia information is obtained by performing similarity transformation on the original multimedia information. The original multimedia information and the transformed multimedia information can therefore be encoded by two coding models that share model parameters; that is, the coding model to be trained includes a first coding model and a second coding model, and the first coding model and the second coding model share model parameters. Referring specifically to fig. 6, a method for generating coding information is shown, which may include:
s610, information coding is carried out on the original multimedia information based on the first coding model, and first coding information is obtained.
S620, information coding is carried out on the transformed multimedia information based on the second coding model, and second coding information is obtained.
S630, generating the sample coding information set based on the first coding information and the second coding information.
When the original multimedia information is obtained, the original multimedia information can be subjected to similar transformation to obtain transformed multimedia information, and the sample multimedia information is determined based on the original multimedia information and the transformed multimedia information, so that the requirement on the number of the original multimedia information can be reduced, and the number of the sample multimedia information is expanded.
In addition, the original multimedia information and the transformed multimedia information are in one-to-one correspondence, so that the original multimedia information and the transformed multimedia information can be respectively subjected to information coding based on a first coding model and a second coding model shared by model parameters to obtain first coding information corresponding to the original multimedia information and second coding information corresponding to the transformed multimedia information, and then sample coding information is generated based on the first coding information and the second coding information.
The original multimedia information is subjected to similarity transformation to obtain corresponding transformed multimedia information, so that the number of sample multimedia information can be increased, and the accuracy of subsequent coding model training can be improved. In addition, the information is respectively coded by the first coding model and the second coding model shared by the model parameters, so that the consistency of the coding model parameters can be ensured on the basis of improving the coding efficiency, namely, the first coding model and the second coding model are ensured to carry out information coding under the same parameter condition, and the determination of the similarity relation between the first coding information and the second coding information is facilitated.
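One common way to realize the parameter sharing described above is to apply a single module to both the original and the transformed inputs, which guarantees identical parameters by construction. The sketch below illustrates this assumption; it is not necessarily how the patent implements the first and second coding models.

    import torch
    from torch import nn

    encoder = nn.Sequential(nn.Linear(128, 64), nn.Tanh())   # one shared set of model parameters

    def encode_pair(original: torch.Tensor, transformed: torch.Tensor):
        # The first and second coding models share parameters: the same module is applied twice.
        first_codes = encoder(original)        # first coding information (Z)
        second_codes = encoder(transformed)    # second coding information (Z')
        return first_codes, second_codes

    x = torch.randn(8, 128)
    x_prime = x + 0.05 * torch.randn(8, 128)   # stand-in "similarity-transformed" inputs
    Z, Z_prime = encode_pair(x, x_prime)
    print(Z.shape, Z_prime.shape)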
Each item of sample coding information in the sample coding information set should preserve the pairwise similarity information between the corresponding items of sample multimedia information. That is, for sample multimedia information x_i and x_j with corresponding coding information u_i and u_j: if the similarity value of x_i and x_j is greater than or equal to the preset similarity threshold, the similarity value of u_i and u_j should also be greater than or equal to the preset similarity threshold; if the similarity value of x_i and x_j is below the preset similarity threshold, the similarity value of u_i and u_j should also be below the preset similarity threshold. Therefore, when the loss function is constructed, loss information items can be constructed based on the similar coding information and the difference coding information, respectively. Referring specifically to fig. 7, a method for determining loss information is shown, which may include:
s710, for each item of sample coding information in the sample coding information set, constructing a first loss information item based on similar coding information corresponding to each item of sample coding information.
And S720, constructing a second loss information item based on the difference coding information corresponding to each item of sample coding information.
S730, determining the loss information based on the first loss information item and the second loss information item.
For each item of sample coding information, a first loss information item can be constructed based on the corresponding similar coding information; based on this first loss information item, the coding model is driven during training to reduce the distance between the predicted sample coding information and the similar coding information. A second loss information item is constructed based on the difference coding information; based on this second loss information item, the coding model is driven during training to enlarge the distance between the predicted sample coding information and the difference coding information.
Therefore, loss information is constructed based on the first loss information item corresponding to the similar coding information and the second loss information item corresponding to the differential coding information, the influence of the sample coding information on model training can be considered from the similar coding dimension and the differential coding dimension respectively, and therefore the accuracy of determining the loss information is improved.
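The two roles of the loss information items can be illustrated with a generic pull/push construction over code distances (an illustration of the idea only, not the patent's formula; the margin and indexing scheme are assumptions):

    import torch

    def pull_push_loss(codes, anchor_idx, similar_idx, difference_idx, margin=1.0):
        anchor = codes[anchor_idx]
        # First loss information item: pull the anchor code toward its similar coding information.
        pull = ((codes[similar_idx] - anchor) ** 2).sum(dim=1).mean()
        # Second loss information item: push the anchor away from its difference coding information.
        push_dist = ((codes[difference_idx] - anchor) ** 2).sum(dim=1)
        push = torch.clamp(margin - push_dist, min=0).mean()
        return pull + push

    codes = torch.randn(6, 16, requires_grad=True)
    loss = pull_push_loss(codes, anchor_idx=0, similar_idx=[1, 2], difference_idx=[3, 4, 5])
    loss.backward()
    print(float(loss))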
In order to further improve model training, when the sample multimedia information includes original multimedia information and transformed multimedia information obtained by similarity transformation, the loss information can also be determined based on the similarity matrix of the original multimedia information. Referring specifically to fig. 8, another loss information determining method is shown, which may include:
s810, determining third loss information based on the first coding information, the second coding information and the similarity matrix; the similarity matrix represents the similarity among various multimedia information in the original multimedia information.
S820, determining the loss information based on the first loss information item, the second loss information item and the third loss information item.
Taking the pairwise similarity information as similarity values for illustration, the similarity matrix may include the similarity values between the items of original multimedia information. These similarity values may be obtained from the sample feature information extracted by the pre-trained feature extraction layer, or from an information comparison method appropriate to the multimedia type; this embodiment does not limit how they are obtained.
Based on the first coding information, the second coding information, and the similarity matrix, a third loss information item can be determined. The third loss information item represents the correlation between the coding information output by the first and second coding models and the similarity information of the original multimedia information. Determining the loss information by combining the third loss information item with the first and second loss information items therefore bases the loss information on more comprehensive information and improves the accuracy of determining the loss information.
In one example, the loss function determined based on the loss information is as follows:
[overall loss function: equation image not reproduced]
where cosh() is the hyperbolic cosine function, cos() is the cosine function, and α and μ are hyperparameters. The first loss information item is
[equation image not reproduced]
the second loss information item is
[equation image not reproduced]
and the third loss information item is
[equation image not reproduced]
Here Z is the first coding information, Z' is the second coding information, r is the coding length, S is the similarity matrix, and Z ∪ Z' is the sample coding information set; for each item of coding information u in Z ∪ Z', O_u denotes the set of coding information in Z ∪ Z' that should be similar to u, i.e., the corresponding similar coding information, P_u corresponds to the difference coding information, and |O_u| denotes the number of items of coding information in the set O_u.
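Since the equation images are not reproduced above, the following is only a generic illustrative stand-in written by analogy with the textual description: a term over similar coding pairs, a term over difference coding pairs, and a third term that aligns the code inner-product matrix with the similarity matrix S. The cosh/cos weighting and the exact use of α and μ in the original equations are not reconstructed here, and all function names are assumptions.

    import torch

    def third_loss_item(Z, Z_prime, S, r):
        # Align (1/r) * Z Z'^T with the similarity matrix S (a common formulation, assumed here).
        return ((Z @ Z_prime.T) / r - S).pow(2).mean()

    def total_loss(Z, Z_prime, S, similar_pairs, difference_pairs, alpha=1.0, mu=1.0):
        codes = torch.cat([Z, Z_prime], dim=0)               # Z ∪ Z'
        first = torch.stack([((codes[i] - codes[j]) ** 2).sum()
                             for i, j in similar_pairs]).mean()
        second = torch.stack([torch.clamp(1.0 - ((codes[i] - codes[j]) ** 2).sum(), min=0)
                              for i, j in difference_pairs]).mean()
        return first + alpha * second + mu * third_loss_item(Z, Z_prime, S, r=Z.shape[1])

    Z = torch.randn(4, 16, requires_grad=True)        # first coding information
    Z_prime = torch.randn(4, 16, requires_grad=True)  # second coding information
    S = torch.eye(4)                                  # stand-in similarity matrix
    loss = total_loss(Z, Z_prime, S,
                      similar_pairs=[(0, 4), (1, 5)], difference_pairs=[(0, 2), (1, 3)])
    loss.backward()
    print(float(loss))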
When similarity transformation is performed on the original multimedia information to generate the corresponding transformed multimedia information, a suitable similarity transformation mode can be determined for each item of original multimedia information based on the type of the multimedia information. Referring specifically to fig. 9, a multimedia information transformation method is shown, which may include:
s910, determining multiple preset similarity transformations based on the information types of the original multimedia information.
And S920, determining target similarity transformation from the multiple preset similarity transformations.
S930, performing the target similarity transformation on the original multimedia information to obtain the transformed multimedia information.
When the information type of the original multimedia information is an image, the corresponding preset similarity transformations can include rotation, cropping, scaling, noise superposition, covering, color transformation, filtering, and the like; when the information type of the original multimedia information is text, the corresponding preset similarity transformations may include translation, character insertion, character deletion, and the like. One or more similarity transformations can be selected from the corresponding preset similarity transformations based on the information type of the original multimedia information to form the target similarity transformation, and the target similarity transformation is performed on the original multimedia information to obtain the transformed multimedia information. In each training process, one item of original multimedia information corresponds to one item of transformed multimedia information; in different training processes, because the target similarity transformation performed on the same original multimedia information may differ, the transformed multimedia information corresponding to the same original multimedia information may also differ.
Therefore, the corresponding transformed multimedia information is obtained by performing similarity transformation on the original multimedia information, which expands the amount of sample multimedia information; furthermore, in different training processes, the same original multimedia information may correspond to different transformed multimedia information, which improves the diversity of the sample multimedia information.
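As an illustrative sketch only (the concrete transformation operations, the PIL helpers and the choice of k below are assumptions and are not mandated by this embodiment), the selection of a target similarity transformation according to the information type could look as follows:

```python
import random
from PIL import ImageFilter, ImageOps

IMAGE_TRANSFORMS = [
    lambda im: im.rotate(random.uniform(-30, 30)),                 # rotation
    lambda im: ImageOps.mirror(im),                                # flip
    lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),      # blur-style filtering
    lambda im: ImageOps.grayscale(im).convert("RGB"),              # color transformation
]

TEXT_TRANSFORMS = [
    lambda s: s[:len(s) // 2] + " " + s[len(s) // 2:],             # character insertion
    lambda s: s[:-1] if len(s) > 1 else s,                         # character deletion
]

def similarity_transform(item, info_type, k=2):
    # Pick one or more preset transforms for this information type (the target
    # similarity transformation) and apply them to the original item.
    pool = IMAGE_TRANSFORMS if info_type == "image" else TEXT_TRANSFORMS
    for t in random.sample(pool, k=min(k, len(pool))):
        item = t(item)
    return item
```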
Under the condition that the coding model to be trained comprises an information coding layer and a feature extraction layer obtained based on pre-training, when the coding model to be trained is subjected to model training, the feature extraction layer can be trained based on a first learning rate, and the information coding layer can be trained based on a second learning rate; the first learning rate is less than the second learning rate.
In this embodiment, the feature extraction layer may be obtained by pre-training, so that when the coding model to be trained is trained, the first learning rate of the feature extraction layer may be set smaller than the second learning rate of the information coding layer. The learning rate controls how quickly the model parameters are updated, so the model parameters of the information coding layer are updated faster than those of the feature extraction layer, allowing the model parameters of the information coding layer to converge sooner and improving the model training efficiency of the information coding layer.
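A minimal sketch of the two-learning-rate setup, assuming a PyTorch model whose pre-trained feature extraction layer and information coding layer are exposed as the attributes feature_extractor and coding_layer (both names are hypothetical), could be:

```python
import torch

def build_optimizer(model, first_lr=1e-5, second_lr=1e-3):
    # The pre-trained feature extraction layer updates more slowly than the coding layer.
    assert first_lr < second_lr
    return torch.optim.SGD([
        {"params": model.feature_extractor.parameters(), "lr": first_lr},
        {"params": model.coding_layer.parameters(), "lr": second_lr},
    ], momentum=0.9)
```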
The training process of the target coding model can be realized based on the following algorithm flow:
1. Input: a sample multimedia information set, the number of iterations t, a learning rate eta, and a batch size n.
2. And performing feature extraction on the sample multimedia information based on the pre-trained feature extraction layer to obtain sample feature information corresponding to the sample multimedia information.
3. And constructing pairwise similar information among various sample multimedia information based on the sample characteristic information.
4. Repeatedly executing for t times:
Randomly selecting n items of sample multimedia information from the sample multimedia information set, denoted Xc, and performing similarity transformation on the n items of sample multimedia information Xc to obtain X'c;
Inputting Xc and X'c into the coding model to be trained for information coding to obtain the corresponding coding information Zc and Z'c;
Substituting Zc and Z'c, together with the pairwise similar information between the n items of sample multimedia information, into formula (1), and updating the model parameters in the coding model to be trained with the learning rate eta by back propagation.
5. Output: the target coding model.
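A minimal sketch of this algorithm flow, assuming PyTorch tensors and treating the coding model, the loss of formula (1) and the similarity-transformation helper as externally supplied components (model, loss_fn and augment below are placeholders), could be:

```python
import random
import torch

def train_target_coding_model(model, dataset, S, t, eta, n, loss_fn, augment):
    # dataset[i] is assumed to return the i-th sample as a tensor; S is the (N, N) similarity matrix.
    optimizer = torch.optim.SGD(model.parameters(), lr=eta)
    for _ in range(t):                                             # repeat t times
        idx = random.sample(range(len(dataset)), n)                # pick n items: Xc
        Xc = torch.stack([dataset[i] for i in idx])
        Xc_aug = torch.stack([augment(dataset[i]) for i in idx])   # similarity transformation: X'c
        Zc, Zc_aug = model(Xc), model(Xc_aug)                      # coding information Zc, Z'c
        loss = loss_fn(Zc, Zc_aug, S[idx][:, idx])                 # formula (1) on this batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```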
The multimedia information encoding method of this embodiment can be specifically applied to a retrieval scenario; referring specifically to fig. 10, an object retrieval method is shown, where the method includes:
S1010, acquiring coding information of an object to be retrieved and coding information of a candidate object; the coding information of the object to be retrieved is obtained by carrying out information coding on the multimedia information of the object to be retrieved based on a target coding model; and the coding information of the candidate object is obtained by carrying out information coding on the multimedia information of the candidate object based on the target coding model.
The target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained.
S1020, performing information matching on the coding information of the object to be retrieved and the coding information of the candidate object to obtain an information matching result.
And S1030, determining a target retrieval object from the candidate objects based on the information matching result.
The target coding model used in the object retrieval process can be obtained based on the coding model training method described above in this embodiment, and is not described herein again.
As for the object, in the field of information push, the object may be push information, such as an advertisement, an article, a short video, and the like, and the corresponding multimedia information of the object to be retrieved may be information, such as an image, text, audio or video, included in the push information; in the e-commerce field, the object can be a commodity, and the corresponding multimedia information of the object to be retrieved can be an image of the commodity, or text, audio or video introducing the commodity.
In this way, the coding information of the object to be retrieved and the coding information of the candidate objects are obtained by information coding based on the target coding model, the coding information of the object to be retrieved is matched against the coding information of the candidate objects to obtain an information matching result, and the target retrieval object is determined from the candidate objects based on the information matching result.
The information coding in this embodiment encodes high-dimensional information into low-dimensional, compact coded information. On the one hand, information coding can be performed in advance on the multimedia information of each object to obtain the low-dimensional coded information corresponding to each object, which makes the coded information convenient to store and saves storage space. On the other hand, the coded information can be mapped into binary coded information; when the distance between the coded information of different objects is calculated, the calculation can be performed using the Hamming distance, and the Hamming distance between binary codes can be computed with the computer's bitwise exclusive OR (XOR) operation, which improves the efficiency of coding distance measurement and saves computing resources.
The coding information of an object may be continuous floating-point coding information, in which case the coding information of the object is a coding vector; when the coding information of the object to be retrieved is matched with the coding information of a candidate object, the coding distance or coding similarity between the coding vector of the object to be retrieved and the coding vector of the candidate object can be calculated and determined as the matching result.
The coding information of an object may also be non-continuous binary coding information, for example coding information whose entries are 0 and 1, or 1 and -1; in this case, the matching result between the object to be retrieved and a candidate object can be obtained by calculating the Hamming distance between the binary coding information of the object to be retrieved and that of the candidate object.
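For illustration, a minimal example of the XOR-based Hamming distance between two binary codes packed into integers is given below; the packing of codes into integers is an assumption made only for this example.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    # XOR marks the differing bit positions; counting the set bits gives the distance.
    return bin(code_a ^ code_b).count("1")

# Example: 0b1011 and 0b0001 differ in two bit positions.
assert hamming_distance(0b1011, 0b0001) == 2
```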
The information matching result may include matching information of the object to be retrieved and each candidate object, and a candidate object whose matching information is greater than a preset matching value may be determined as a target retrieval object corresponding to the object to be retrieved. After the target retrieval object is obtained, object recommendation, object delivery, and the like can be performed based on the target retrieval object.
For example, in an advertisement system, similarity retrieval is a crucial link that supports a number of important services such as advertisement retrieval and commodity retrieval; it also has a great influence on the subsequent recall, coarse ranking and fine ranking stages, so improving the accuracy of similarity retrieval can in turn improve the accuracy of the processing results of these subsequent stages.
Information matching is performed based on the coding information of the object to be retrieved and the coding information of the candidate objects to obtain matching information between the object to be retrieved and each candidate object, and a target retrieval object matching the object to be retrieved is then determined. The coding information of the objects can be generated based on the target coding model; since the loss information used in training the target coding model is determined with pairwise similar information between the items of sample multimedia information, the coding accuracy of the target coding model is improved, and performing object retrieval based on the coding information output by the target coding model improves the accuracy of the object retrieval result.
In the following, the case where the multimedia information is an image is taken as an example for specific description; the similarity transformation of the original multimedia information can also be understood as data enhancement of the original multimedia information. The original sample images are denoted X = {x_1, x_2, …, x_n}, where x_i represents the ith image in the original sample images; performing data enhancement processing on each image in the original sample images yields the enhanced sample images X' = {x'_1, x'_2, …, x'_n}, where x'_i represents the enhanced image corresponding to the ith image in the original sample images. The original sample images and the enhanced sample images can thus be input into the coding model. Please refer to fig. 11, which shows a schematic structural diagram of the target coding model: the target coding model includes a first coding model and a second coding model, the first coding model and the second coding model share model parameters, and the feature extraction layer may specifically be a VGG19 network pre-trained on ImageNet. The features correspondingly extracted from the original sample images X are denoted {f_1, f_2, …, f_n}, where f_i ∈ R^4096 represents the feature information corresponding to the ith image in the original sample images. Based on the features of the original sample images X, a similarity matrix S representing the similarity of the original sample images is generated, where S ∈ [-1, 1]^(n×n) and the element S_ij in the ith row and jth column represents the similarity between the ith image and the jth image in the original sample images: the closer S_ij is to 1, the more similar the images x_i and x_j are; the closer S_ij is to -1, the less similar the images x_i and x_j are. Therefore, when the model to be trained is trained, the original sample images X and the enhanced sample images X' can be respectively input into the first coding model and the second coding model to obtain the corresponding coding information Z = {z_1, z_2, …, z_n} and Z' = {z'_1, z'_2, …, z'_n}, where r is the length of each item of coding information.
The loss information is calculated based on the loss function of formula (1). Specifically, Z ∪ Z' is taken as the coding information set corresponding to the original sample images and the enhanced sample images; for each item of coding information u in Z ∪ Z', O_u denotes the set of coding information in Z ∪ Z' that should be similar to u, i.e., the similar coding information, P_u denotes the set of coding information in Z ∪ Z' that should not be similar to u, i.e., the difference coding information, and |O_u| denotes the number of items of coding information in the set O_u.
For each item of coding information u in Z ∪ Z', the corresponding O_u and P_u are constructed as follows: u is z_i or z'_i, i.e., u is the coding information corresponding to the ith image x_i or its enhanced image x'_i. When S_ij is greater than or equal to a preset similarity threshold, image x_i is considered similar to image x_j, image x_i is similar to image x'_j, image x'_i is similar to image x_j, and image x'_i is similar to image x'_j, so that u is similar to the coding information z_j and z'_j, i.e., z_j and z'_j belong to the set O_u; conversely, when S_ij is smaller than the preset similarity threshold, z_j and z'_j belong to the set P_u. In the prior art, O_u and P_u are generally constructed as follows: when u is z_i or z'_i, then O_u = {z_i, z'_i}, and the remaining coding information in Z ∪ Z' belongs to the set P_u; that is, the prior-art methods consider the coding information of an image to be similar only to the coding information of the image itself or its enhanced image, and dissimilar to the coding information of all other images. However, this treats a large number of similar pairs in the sample images as dissimilar pairs, so that the distance between the coding information of similar image pairs becomes large. For the construction of O_u and P_u in the embodiments of the present application, a preset similarity threshold is set for the determination, so that in addition to determining the enhanced image of each image as a similar image, other images in the sample images whose similarity exceeds the preset similarity threshold are also determined as similar images, and the enhanced images corresponding to such similar image pairs are likewise classified into the set O_u, thereby ensuring the comprehensiveness of determining the set O_u and further improving the accuracy of determining the loss information.
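A minimal sketch of this construction, assuming the VGG19 features of the original sample images are given as a tensor and using an assumed value of 0.8 for the preset similarity threshold, could be:

```python
import torch
import torch.nn.functional as F

def build_similarity_and_masks(features, threshold=0.8):
    # features: (n, 4096) feature information f_1 ... f_n of the original sample images.
    f = F.normalize(features, dim=1)
    S = f @ f.t()                              # S ∈ [-1, 1]^(n×n), S_ij = cosine similarity
    sim = S >= threshold                       # are x_i and x_j considered similar?
    # The coding set Z ∪ Z' is ordered as [z_1..z_n, z'_1..z'_n]; a pair of codings
    # belongs to O_u whenever the underlying original images are similar.
    O_mask = sim.repeat(2, 2)
    O_mask.fill_diagonal_(False)               # u itself is not counted in O_u
    P_mask = ~sim.repeat(2, 2)                 # the remaining pairs form P_u
    return S, O_mask, P_mask
```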
In another example, the multimedia information including both images and texts is taken as an example for specific description. Please refer to fig. 12, which shows another structural diagram of the target coding model: the target coding model includes an image coding model and a text coding model, so as to implement coding of multi-modal multimedia input information, where the image samples and the text samples form sample pairs, that is, the items of the image samples correspond one-to-one to the items of the text samples.
The image coding model comprises an image feature extraction layer, an image full-connection layer and an image information coding layer, image feature extraction is carried out on an image sample based on the image feature extraction layer to obtain image feature information, information classification is carried out on the image feature information based on the image full-connection layer to obtain image classification information, and information coding is carried out on the image classification information based on the image information coding layer to obtain an image coding information set Zx.
The text coding model comprises a text feature extraction layer, a text full-connection layer and a text information coding layer, text feature extraction is carried out on a text sample based on the text feature extraction layer to obtain text feature information, information classification is carried out on the text feature information based on the text full-connection layer to obtain text classification information, and information coding is carried out on the text classification information based on the text information coding layer to obtain a text coding information set Zy.
Further, for the image coding model and the text coding model shown in fig. 12, the image coding model may further include a first image coding model and a second image coding model, which respectively perform image coding on the original image samples and on the enhanced image samples; for the specific implementation process, reference may be made to the embodiment corresponding to fig. 11. Similarly, the text coding model may further include a first text coding model and a second text coding model, which respectively perform text coding on the original text samples and on the enhanced text samples; the implementation process is similar and is not described herein again. For the generation of the enhanced image samples and the enhanced text samples, reference may also be made to the above contents of this embodiment.
It should be noted that fig. 12 shows multi-modal input information formed by images and texts, and in the specific implementation process, the multi-modal input information may be a combined sample pair of two or more types of multimedia information in multimedia information such as images, texts, audio and video.
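Purely as an illustrative sketch of the fig. 12 structure (the backbones, dimensions and class name below are assumptions, not the networks of this embodiment), a dual image/text coding model could be organized as follows:

```python
import torch
import torch.nn as nn

class DualCodingModel(nn.Module):
    def __init__(self, code_length=64, num_classes=1000):
        super().__init__()
        # Image branch: feature extraction -> fully-connected layer -> information coding layer.
        self.image_features = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 4096), nn.ReLU())
        self.image_fc = nn.Linear(4096, num_classes)
        self.image_coding = nn.Linear(num_classes, code_length)
        # Text branch: feature extraction -> fully-connected layer -> information coding layer.
        self.text_features = nn.Sequential(nn.Linear(768, 1024), nn.ReLU())
        self.text_fc = nn.Linear(1024, num_classes)
        self.text_coding = nn.Linear(num_classes, code_length)

    def forward(self, images, texts):
        zx = torch.tanh(self.image_coding(self.image_fc(self.image_features(images))))
        zy = torch.tanh(self.text_coding(self.text_fc(self.text_features(texts))))
        return zx, zy  # image coding information set Zx and text coding information set Zy
```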
The various implementation methods provided in the above embodiments can be combined as needed based on actual application conditions, and the combined methods have the corresponding beneficial effects.
Referring to fig. 13, a multimedia information encoding apparatus is shown, including:
a first obtaining module 1310, configured to obtain multimedia information to be encoded;
a first encoding module 1320, configured to perform information encoding on the multimedia information to be encoded based on a target encoding model, so as to obtain target encoding information corresponding to the multimedia information to be encoded;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each sample multimedia information based on the coding model to be trained.
Further, the apparatus comprises:
a first determining module, configured to determine, for each item of sample encoding information in the sample encoding information set, target sample multimedia information corresponding to each item of sample encoding information;
the second determining module is used for determining similar multimedia information and different multimedia information of the target sample multimedia information based on pairwise similar information between the sample multimedia information;
a third determining module, configured to determine, as the similar encoding information, sample encoding information corresponding to the similar multimedia information in the sample encoding information set;
a fourth determining module, configured to determine, as the difference coding information, sample coding information corresponding to the difference multimedia information in the sample coding information set.
Further, the coding model to be trained comprises a feature extraction layer obtained based on pre-training;
the device further comprises:
the characteristic extraction module is used for extracting the characteristics of the sample multimedia information based on the characteristic extraction layer to obtain sample characteristic information respectively corresponding to the sample multimedia information;
and the similarity calculation module is used for calculating the similarity based on the sample characteristic information respectively corresponding to the sample multimedia information to obtain pairwise similar information between the sample multimedia information.
Further, the sample multimedia information items comprise original multimedia information and transformed multimedia information; the transformed multimedia information is obtained by performing similarity transformation on the original multimedia information; the coding model to be trained comprises a first coding model and a second coding model; the first coding model and the second coding model share model parameters;
the device further comprises:
the second coding module is used for carrying out information coding on the original multimedia information based on the first coding model to obtain first coding information;
a third encoding module, configured to perform information encoding on the transformed multimedia information based on the second encoding model to obtain second encoded information;
a set determination module to generate the set of sample coding information based on the first coding information and the second coding information.
Further, the apparatus further comprises:
a first constructing module, configured to construct, for each item of sample encoding information in the sample encoding information set, a first loss information item based on similar encoding information corresponding to the each item of sample encoding information;
the second construction module is used for constructing a second loss information item based on the difference coding information corresponding to the coding information of each sample;
a loss information determination module to determine the loss information based on the first loss information item and the second loss information item.
Further, the determining the loss information based on the first loss information item and the second loss information item includes:
a third constructing module, configured to construct a third loss information item based on the first encoding information, the second encoding information, and a similarity matrix; the similarity matrix represents the similarity among various multimedia information in the original multimedia information;
a fifth determining module for determining the loss information based on the first loss information item, the second loss information item, and the third loss information item.
Further, the apparatus further comprises:
the preset similarity transformation determining module is used for determining a plurality of preset similarity transformations based on the information type of the original multimedia information;
the target similarity transformation determining module is used for determining target similarity transformation from the multiple preset similarity transformations;
and the target similarity transformation module is used for executing the target similarity transformation on the original multimedia information to obtain the transformed multimedia information.
Further, the coding model to be trained comprises an information coding layer and a feature extraction layer obtained based on pre-training;
the device further comprises:
the model training module is used for training the feature extraction layer based on a first learning rate and training the information coding layer based on a second learning rate; the first learning rate is smaller than the second learning rate.
Referring to fig. 14, there is shown an object retrieval apparatus including:
a second obtaining module 1410, configured to obtain coding information of an object to be retrieved and coding information of a candidate object; the coding information of the object to be retrieved is obtained by carrying out information coding on the multimedia information of the object to be retrieved based on a target coding model; the encoding information of the candidate object is obtained by carrying out information encoding on the multimedia information of the candidate object based on the target encoding model;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information respectively corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained;
the information matching module 1420 is configured to perform information matching on the coding information of the object to be retrieved and the coding information of the candidate object to obtain an information matching result;
the retrieval result determining module 1430 is configured to determine a target retrieval object from the candidate objects based on the information matching result.
The device provided in the above embodiments can execute the method provided in any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method. Technical details not described in detail in the above embodiments may be referred to a method provided in any of the embodiments of the present application.
The present embodiment also provides a computer-readable storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded by a processor and executes any one of the methods described in the embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any of the methods described above in the embodiments.
Referring to fig. 15, the apparatus 1500 may include one or more central processing units (CPUs) 1522 (e.g., one or more processors), a memory 1532, and one or more storage media 1530 (e.g., one or more mass storage devices) for storing applications 1542 or data 1544. The memory 1532 and the storage medium 1530 may be transient or persistent storage. The program stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations on the device. Further, the central processor 1522 may be configured to communicate with the storage medium 1530 to execute the series of instruction operations in the storage medium 1530 on the device 1500. The device 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input/output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on. Any of the methods described above in this embodiment can be implemented based on the apparatus shown in fig. 15.
The present specification provides method steps as described in the examples or flowcharts, but more or fewer steps may be included based on conventional or non-inventive practice. The steps and sequences recited in the embodiments are only one of many possible execution orders and do not represent the only order of execution. When an actual system or product executes, the methods according to the embodiments or the figures may be executed sequentially or in parallel (for example, in the context of parallel processors or multi-threaded processing).
The configurations shown in the present embodiment are only partial configurations related to the present application, and do not constitute a limitation on the devices to which the present application is applied, and a specific device may include more or less components than those shown, or combine some components, or have an arrangement of different components. It should be understood that the methods, apparatuses, and the like disclosed in the embodiments may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a division of one logic function, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, devices or unit modules.
Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (13)

1. A method for encoding multimedia information, comprising:
acquiring multimedia information to be encoded;
performing information coding on the multimedia information to be coded based on a target coding model to obtain target coding information corresponding to the multimedia information to be coded;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained.
2. The method of claim 1, further comprising:
for each item of sample coding information in the sample coding information set, determining target sample multimedia information corresponding to each item of sample coding information;
determining similar multimedia information and different multimedia information of the target sample multimedia information based on pairwise similar information among the sample multimedia information;
determining sample coding information corresponding to the similar multimedia information in the sample coding information set as the similar coding information;
and determining sample coding information corresponding to the difference multimedia information in the sample coding information set as the difference coding information.
3. The method of claim 1, wherein the coding model to be trained comprises a feature extraction layer based on pre-training;
the method further comprises the following steps:
performing feature extraction on the sample multimedia information based on the feature extraction layer to obtain sample feature information respectively corresponding to the sample multimedia information;
and carrying out similarity calculation on the basis of sample characteristic information respectively corresponding to the sample multimedia information to obtain pairwise similar information between the sample multimedia information.
4. The method of claim 1, wherein the sample multimedia information items comprise original multimedia information and transformed multimedia information; the transformed multimedia information is obtained by performing similarity transformation on the original multimedia information; the coding model to be trained comprises a first coding model and a second coding model; the first coding model and the second coding model share model parameters;
the method further comprises the following steps:
performing information coding on the original multimedia information based on the first coding model to obtain first coding information;
performing information coding on the transformed multimedia information based on the second coding model to obtain second coding information;
generating the set of sample encoding information based on the first encoding information and the second encoding information.
5. The method of claim 4, further comprising:
for each item of sample coding information in the sample coding information set, constructing a first loss information item based on similar coding information corresponding to each item of sample coding information;
constructing a second loss information item based on the difference coding information corresponding to each item of sample coding information;
determining the loss information based on the first loss information item and the second loss information item.
6. The method of claim 5, wherein determining the loss information based on the first loss information item and the second loss information item comprises:
constructing a third loss information item based on the first encoding information, the second encoding information and a similarity matrix; the similarity matrix represents the similarity among various multimedia information in the original multimedia information;
determining the loss information based on the first loss information item, the second loss information item, and the third loss information item.
7. The method of claim 4, further comprising:
determining a plurality of preset similarity transformations based on the information type of the original multimedia information;
determining a target similarity transformation from the plurality of preset similarity transformations;
and executing the target similarity transformation on the original multimedia information to obtain the transformed multimedia information.
8. The method according to claim 1, wherein the coding model to be trained comprises an information coding layer and a feature extraction layer obtained based on pre-training;
the method further comprises the following steps:
training the feature extraction layer based on a first learning rate, and training the information coding layer based on a second learning rate; the first learning rate is smaller than the second learning rate.
9. An object retrieval method, comprising:
acquiring coding information of an object to be retrieved and coding information of a candidate object; the coding information of the object to be retrieved is obtained by carrying out information coding on the multimedia information of the object to be retrieved based on a target coding model; the encoding information of the candidate object is obtained by carrying out information encoding on the multimedia information of the candidate object based on the target encoding model;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained;
carrying out information matching on the coding information of the object to be retrieved and the coding information of the candidate object to obtain an information matching result;
and determining a target retrieval object from the candidate objects based on the information matching result.
10. An apparatus for encoding multimedia information, comprising:
the first acquisition module is used for acquiring multimedia information to be coded;
the first coding module is used for carrying out information coding on the multimedia information to be coded based on a target coding model to obtain target coding information corresponding to the multimedia information to be coded;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; and the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained.
11. An object retrieval apparatus, comprising:
the second acquisition module is used for acquiring the coding information of the object to be retrieved and the coding information of the candidate object; the coding information of the object to be retrieved is obtained by carrying out information coding on the multimedia information of the object to be retrieved based on a target coding model; the encoding information of the candidate object is obtained by carrying out information encoding on the multimedia information of the candidate object based on the target encoding model;
the target coding model is obtained by carrying out model training on a coding model to be trained on the basis of loss information; the loss information is determined based on similar coding information and difference coding information respectively corresponding to each item of sample coding information; similar coding information and difference coding information corresponding to each item of sample coding information are determined from the sample coding information set on the basis of pairwise similar information between each item of sample multimedia information; the sample coding information set is obtained by respectively carrying out information coding on each item of sample multimedia information based on the coding model to be trained;
the information matching module is used for performing information matching on the coding information of the object to be retrieved and the coding information of the candidate object to obtain an information matching result;
and the retrieval result determining module is used for determining a target retrieval object from the candidate objects based on the information matching result.
12. An electronic device, comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the multimedia information encoding method according to any one of claims 1 to 8 or the object retrieval method according to claim 9.
13. A computer storage medium, wherein at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded by a processor and executes the multimedia information encoding method according to any one of claims 1 to 8 or the object retrieval method according to claim 9.
CN202210563346.7A 2022-05-20 2022-05-20 Multimedia information coding method, object retrieval method and device Active CN115134338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210563346.7A CN115134338B (en) 2022-05-20 2022-05-20 Multimedia information coding method, object retrieval method and device

Publications (2)

Publication Number Publication Date
CN115134338A true CN115134338A (en) 2022-09-30
CN115134338B CN115134338B (en) 2023-08-11

Family

ID=83375921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210563346.7A Active CN115134338B (en) 2022-05-20 2022-05-20 Multimedia information coding method, object retrieval method and device

Country Status (1)

Country Link
CN (1) CN115134338B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246209A (en) * 2020-01-20 2020-06-05 北京字节跳动网络技术有限公司 Adaptive encoding method, apparatus, electronic device, and computer storage medium
CN113747168A (en) * 2020-05-29 2021-12-03 北京三星通信技术研究有限公司 Training method of multimedia data description model and generation method of description information
CN114510599A (en) * 2022-01-14 2022-05-17 北京有竹居网络技术有限公司 Feature coding model generation method, audio determination method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵瑞 (Zhao Rui): "Video-Text Cross-Modal Retrieval Based on Deep Learning", pages 17-18 *

Also Published As

Publication number Publication date
CN115134338B (en) 2023-08-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant