CN117725234A - Media information identification method, device, computer equipment and storage medium
- Publication number: CN117725234A
- Application number: CN202310916835.0A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The embodiment of the application discloses a media information identification method, an apparatus, a computer device, and a storage medium. In the scheme, media information to be identified is obtained, where the media information to be identified includes a picture to be identified and a text to be identified; target image features corresponding to objects in the picture to be identified are extracted based on the picture to be identified; target text features corresponding to commodity description text in the text to be identified are extracted based on the text to be identified; multi-modal features of preset commodities in a commodity library are acquired, and the target similarity between the media information to be identified and the preset commodities in the commodity library is determined based on the target image features, the target text features, and the multi-modal features; and target commodities matching the media information to be identified are screened out of the commodity library based on the target similarity. In this way, the accuracy of commodity identification from media information can be improved.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a media information identification method, an apparatus, a computer device, and a storage medium.
Background
With the rapid growth of the internet, social networks have become an integral part of internet users' lives. On some social platforms, users may post social media content, which may include multimedia content such as pictures and text, so that information can be shared with other users on the platform. For example, a user may post social media content containing commodity information, and other users may wish to identify the commodity that matches that information.
In the related art, when a user needs to identify a commodity from the commodity media information in social media content, the commodity is typically found by keyword search, for example by extracting commodity keywords and searching for them. However, searching for commodities by keyword is inefficient, and it is difficult to find commodities that closely match the commodity information in the social media content, so the accuracy of identifying commodities from social media content is low.
Disclosure of Invention
The embodiment of the application provides a media information identification method, an apparatus, a computer device, and a storage medium, which can improve the accuracy of commodity identification from media information.
The embodiment of the application provides a media information identification method, which comprises the following steps:
acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified;
extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified;
extracting target text features corresponding to commodity description texts in the text to be recognized based on the text to be recognized;
acquiring multi-modal characteristics of preset commodities in a commodity library, and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image characteristics, the target text characteristics and the multi-modal characteristics;
and screening target commodities matching the media information to be identified out of the commodity library based on the target similarity.
Correspondingly, the embodiment of the application also provides a media information identification device, which comprises:
the first acquisition unit is used for acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified;
the first extraction unit is used for extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified;
the second extraction unit is used for extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified;
the second acquisition unit is used for acquiring multi-modal characteristics of preset commodities in a commodity library and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image characteristics, the target text characteristics and the multi-modal characteristics;
and the screening unit is used for screening target commodities matched with the media information to be identified from the commodity library based on the similarity.
In some embodiments, the first extraction unit comprises:
the first determining subunit is used for carrying out target detection on the picture to be identified and determining at least one object image in the picture to be identified;
and the first extraction subunit is used for extracting the characteristics of the object image through an image model to obtain the characteristics of the target image.
In some embodiments, the second extraction unit comprises:
the first processing subunit is used for carrying out sentence dividing processing on the text to be identified to obtain at least one text sentence;
a second determining subunit configured to determine a target text sentence containing the article description text from the at least one text sentence;
and the second extraction subunit is used for extracting the features of the target text sentence through a text model to obtain the target text features.
In some embodiments, the second acquisition unit comprises:
a first computing subunit, configured to compute a first target similarity between the target image feature and the multi-modal feature;
a second computing subunit, configured to calculate a second target similarity between the target text feature and the multi-modal feature;
and the third determining subunit is used for determining the target similarity between the media information to be identified and the preset commodity based on the first target similarity and the second target similarity.
In some embodiments, the apparatus further comprises:
the first acquisition unit is used for acquiring sample media information and sample commodity information;
a construction unit configured to construct a first sample pair based on a sample text in the sample media information and the sample commodity information, and construct a second sample pair based on a sample image in the sample media information and the sample commodity information;
the training unit is used for training a preset network model based on the first sample pair and the second sample pair to obtain a trained model.
In some embodiments, the first computing subunit is specifically configured to:
and calculating the first target similarity between the target image features and the multi-mode features through the trained model.
In some embodiments, the training unit comprises:
a third extraction subunit, configured to extract a sample text feature corresponding to the sample text, a sample image feature corresponding to the sample image, and a sample multi-modal feature corresponding to the sample commodity information;
a fourth determining subunit, configured to obtain a first feature sample pair based on the sample text feature and the sample multi-modal feature, and obtain a second feature sample pair based on the sample image feature and the sample multi-modal feature;
a third computing subunit, configured to compute, through the preset network model, a first predicted similarity between a sample text feature in the first feature sample pair and the sample multi-modal feature, and compute a second predicted similarity between a sample image feature in the second feature sample pair and the sample multi-modal feature;
and the training subunit is used for adjusting the model parameters of the preset network model based on the actual similarity between the first prediction similarity and the sample text features and the sample multi-modal features in the first feature sample pair and the actual similarity between the second prediction similarity and the sample image features and the sample multi-modal features in the second feature sample pair until the preset network model converges to obtain the trained model.
In some embodiments, the second computing subunit may be specifically configured to:
and calculating second target similarity between the target text feature and the multi-modal feature based on the trained model.
In some embodiments, the screening unit comprises:
and the selecting subunit is used for selecting the commodity with the similarity between the commodity library and the media information to be identified being greater than a preset threshold value from the commodity library as the target commodity.
In some embodiments, the apparatus further comprises:
the second acquisition unit is used for acquiring commodity information of a plurality of preset commodities, wherein the commodity information comprises commodity pictures and commodity description texts;
and the third extraction unit is used for inputting the commodity information into a multi-mode model, and extracting the characteristics of the commodity picture and the commodity description text through the multi-mode model to obtain the multi-mode characteristics of the preset commodity.
Accordingly, embodiments of the present application further provide a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the media information identification method provided in any of the embodiments of the present application.
Accordingly, embodiments of the present application also provide a storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the media information identification method described above.
According to the embodiment of the application, media information to be identified is obtained, where the media information to be identified includes a picture to be identified and a text to be identified; target image features corresponding to objects in the picture to be identified are extracted based on the picture to be identified; target text features corresponding to commodity description text in the text to be identified are extracted based on the text to be identified; multi-modal features of preset commodities in a commodity library are acquired, and the target similarity between the media information to be identified and the preset commodities in the commodity library is determined based on the target image features, the target text features, and the multi-modal features; and target commodities matching the media information to be identified are screened out of the commodity library based on the target similarity. In this way, the accuracy of commodity identification from media information can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and a person skilled in the art may obtain other drawings from these drawings without inventive effort.
Fig. 1 is a flow chart of a media information identification method according to an embodiment of the present application.
Fig. 2 is an application scenario schematic diagram of a media information identification method according to an embodiment of the present application.
Fig. 3 is a block diagram of a media information identification device according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
The embodiment of the application provides a media information identification method, an apparatus, a storage medium, and a computer device. Specifically, the media information identification method of the embodiment of the application may be executed by a computer device, where the computer device may be a terminal or a server. The terminal may be a terminal device such as a smartphone, a tablet computer, a notebook computer, a touch screen, a personal computer (PC, Personal Computer), or a personal digital assistant (PDA, Personal Digital Assistant). The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
For example, the computer device may be a server that may obtain media information to be identified, including pictures to be identified and text to be identified; extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified; extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified; acquiring multi-modal characteristics of preset commodities in a commodity library, and determining target similarity between media information to be identified and the preset commodities in the commodity library based on target image characteristics, target text characteristics and the multi-modal characteristics; and screening target commodities matched with the media information to be identified from the commodity library based on the similarity.
In view of the above problems, embodiments of the present application provide a media information identification method, an apparatus, a computer device, and a storage medium, which can improve the accuracy of commodity identification from media information.
The following will describe in detail. The following description of the embodiments is not intended to limit the preferred embodiments.
The embodiment of the application provides a media information identification method, which can be executed by a terminal or a server, and the embodiment of the application is described by taking the media information identification method executed by the server as an example.
Referring to fig. 1, fig. 1 is a flowchart of a media information identification method according to an embodiment of the present application. The specific flow of the media information identification method can be as follows:
101. and acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified.
In the embodiment of the application, media information refers to content published by a user through a social network platform. The media information is presented in multimedia form and may include text content, picture content, and the like.
In particular, the social network platform may be used for users to make various social interactions, and the social network platform may include various types, such as a communication type social platform, a content propagation type social platform, and the like. The communication type social platform refers to a social platform mainly based on user communication, and the content propagation type social platform refers to a social platform mainly based on user release content propagation.
For example, the media information may be content published through a content-propagated social platform. Users may share media information to other users by uploading the media information on a content-propagated social platform.
The media information to be identified refers to media information on which commodity identification needs to be performed, that is, the commodities contained in the media information to be identified are to be identified. The picture in the media information to be identified is the picture to be identified, namely the picture content required for commodity identification; the text in the media information to be identified is the text to be identified, namely the text content required for commodity identification.
102. And extracting target image features corresponding to the objects in the picture to be identified based on the picture to be identified.
The target image features refer to image features of object images in a picture to be identified, and the picture to be identified can include at least one object image.
In some embodiments, in order to improve accuracy of image recognition, the step of extracting, based on the image to be recognized, a target image feature corresponding to an object in the image to be recognized may include the following operations:
performing target detection on the picture to be identified, and determining at least one object image in the picture to be identified;
and extracting the characteristics of the object image through the image model to obtain the characteristics of the target image.
Target detection refers to extracting all objects of interest in a picture and determining the position area of each object in the picture, so as to obtain the object images in the picture.
Specifically, target detection on the picture to be identified may be performed through a target detection model; for example, the target detection model may be a YOLO model, which can output the information of all detected objects in the picture at one time.
For example, inputting the picture to be identified into a YOLO model, detecting all objects in the picture to be identified through the YOLO model, and determining the positions of the objects in the picture to be identified, thereby obtaining the object image in the picture to be identified.
The image model may be a Resnet model, that is, a residual network model, and the Resnet model may be used to extract features of the image.
For example, image A is input into the ResNet model, feature extraction is performed on image A through the ResNet model, and the image features of image A are output. In the embodiment of the application, the image features of the object image can be extracted through the ResNet model to obtain the target image features.
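As an illustrative sketch of this detect-then-embed pipeline, the following Python code detects objects with YOLO and embeds each crop with a ResNet backbone. The YOLOv8 weights, the ResNet-50 choice, and the function name are assumptions for illustration; the embodiment does not fix particular model versions.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # assumed YOLO weights; any detector variant works
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classifier head; keep the 2048-d feature
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_target_image_features(picture_path: str) -> torch.Tensor:
    """Detect objects in the picture to be identified and embed each crop."""
    image = Image.open(picture_path).convert("RGB")
    result = detector(image)[0]  # YOLO outputs all detected objects at once
    features = []
    for box in result.boxes.xyxy.tolist():  # [x1, y1, x2, y2] per object
        crop = image.crop(tuple(box))
        with torch.no_grad():
            features.append(resnet(preprocess(crop).unsqueeze(0)).squeeze(0))
    return torch.stack(features) if features else torch.empty(0, 2048)
```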
103. And extracting target text features corresponding to the commodity description text in the text to be recognized based on the text to be recognized.
The target text features refer to text features of commodity description texts in texts to be identified. The commodity description text refers to text related to a commodity, for example, the commodity description text may be a commodity name, a commodity category, a commodity brand, or the like.
In some embodiments, in order to improve accuracy of text recognition, the step of extracting, based on the text to be recognized, a target text feature corresponding to the commodity description text in the text to be recognized may include the following operations:
Sentence dividing processing is carried out on the text to be identified, so that at least one text sentence is obtained;
determining a target text sentence containing commodity description text from at least one text sentence;
and extracting the characteristics of the target text sentence through the text model to obtain the characteristics of the target text.
Specifically, the sentence division process may divide the text to be identified into a plurality of text sentences according to punctuation marks. Using punctuation marks for sentence division is simple and can improve text processing efficiency.
For example, the text to be recognized may be: "This computer works very well, I want to recommend it to everyone, the performance is excellent." After sentence division according to the punctuation marks in the text to be recognized, the obtained text sentences may include: "This computer works very well", "I want to recommend it to everyone", and "the performance is excellent".
The target text sentence refers to commodity description text in the text to be recognized, namely text related to the commodity.
In the embodiment of the application, a commodity description text set may be preset, and the commodity description text set may include a plurality of texts related to commodities, for example, including a plurality of commodity names, a plurality of commodity types, a plurality of commodity brands, and the like.
Specifically, to determine a target text sentence containing commodity description text from the at least one text sentence, each text sentence may be matched against the texts in the commodity description text set, and a text sentence that is successfully matched with a text in the commodity description text set is determined as the target text sentence.
For example, the text sentences in the text to be recognized may include: "This computer works very well", "I want to recommend it to everyone", and "the performance is excellent". By matching each text sentence with the texts in the commodity description text set, the target text sentence may be determined to be "This computer works very well".
The text model may be a BERT model, which is a pre-trained language representation model. The BERT model employs a masked language model (MLM) objective, enabling it to generate deep bidirectional language representations that can be used to extract text features.
For example, the target text sentence is input into the BERT model, feature extraction is performed on the target text sentence through the BERT model, and the text features of the target text sentence are output. In the embodiment of the application, the text features of the target text sentence can be extracted through the BERT model to obtain the target text features.
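The following is a minimal sketch of this step. The bert-base-chinese checkpoint and the commodity description text set are assumptions for illustration; the embodiment does not name a specific checkpoint.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_model = BertModel.from_pretrained("bert-base-chinese")
text_model.eval()

# Hypothetical commodity description text set (names, categories, brands).
COMMODITY_DESCRIPTIONS = {"computer", "phone", "headphones"}

def extract_target_text_features(text_to_identify: str) -> torch.Tensor:
    # 1. Sentence division according to punctuation marks.
    sentences = [s.strip() for s in
                 re.split(r"[,.!?;，。！？；]", text_to_identify) if s.strip()]
    # 2. Keep the sentences that match the commodity description text set.
    targets = [s for s in sentences
               if any(term in s.lower() for term in COMMODITY_DESCRIPTIONS)]
    # 3. Encode each target text sentence; take the [CLS] vector as its feature.
    features = []
    for sentence in targets:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            features.append(
                text_model(**inputs).last_hidden_state[:, 0, :].squeeze(0))
    return (torch.stack(features) if features
            else torch.empty(0, text_model.config.hidden_size))
```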
104. And acquiring multi-modal characteristics of preset commodities in the commodity library, and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image characteristics, the target text characteristics and the multi-modal characteristics.
In the embodiments of the present application, the commodity library includes a plurality of preset commodities and the information of each preset commodity, where the information of a preset commodity may be a commodity title and a commodity picture downloaded from the official website of a commodity brand manufacturer, or purchased or downloaded from an e-commerce platform.
The multi-modal feature refers to a feature of the preset commodity in multiple dimensions, for example, the multi-modal feature may include a feature of an image dimension and a feature of a text dimension.
The commodity library comprises a plurality of multi-mode features corresponding to preset commodities.
For example, the preset commodities may include commodity A, commodity B, commodity C, commodity D, and so on; the multi-modal feature of commodity A is multi-modal feature A, that of commodity B is multi-modal feature B, that of commodity C is multi-modal feature C, and that of commodity D is multi-modal feature D. Acquiring the multi-modal features of the preset commodities in the commodity library then yields multi-modal feature A, multi-modal feature B, multi-modal feature C, and multi-modal feature D.
In some embodiments, in order to quickly acquire the multi-modal feature of the preset merchandise, before the step of acquiring the multi-modal feature of the preset merchandise in the merchandise library, the method may further include the following steps:
acquiring commodity information of a plurality of preset commodities;
inputting commodity information into a multi-modal model, and extracting characteristics of commodity pictures and commodity description texts through the multi-modal model to obtain multi-modal characteristics of preset commodities.
The commodity information comprises commodity pictures and commodity description texts, and the commodity description texts can comprise commodity titles, brand information and class information.
Specifically, commodity information can be collected by downloading it from the official website of a commodity brand manufacturer, or by purchasing or downloading it from an e-commerce platform, so as to obtain the commodity title, commodity picture, brand information, and category information of each preset commodity as its commodity information.
In the embodiments of the present application, the multi-modal model may be a multi-modal cross-attention model. The multi-modal cross-attention model jointly models intra-modal and inter-modal relationships between image regions and sentence words in a unified deep model with a cross-attention mechanism, which can exploit not only the intra-modal relationships within each modality but also the inter-modal relationships between image regions and sentence words, so that the matching of images and sentences complements and enhances each other. The multi-modal cross-attention model can thus be used to extract multi-modal features in both the image dimension and the text dimension.
For example, commodity information is input into a multi-mode cross attention model, and features of images and texts in the commodity information are extracted through the multi-mode cross attention model, so that multi-mode features of preset commodities are obtained.
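As a hedged illustration of the cross-attention idea, and not the patent's exact architecture, the following sketch fuses image region features and sentence word features with two cross-attention layers. The module name, dimensions, and mean pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Image regions attend to sentence words, and vice versa.
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_regions: torch.Tensor, txt_tokens: torch.Tensor):
        # img_regions: (batch, n_regions, dim); txt_tokens: (batch, n_tokens, dim)
        img_ctx, _ = self.img_to_txt(img_regions, txt_tokens, txt_tokens)
        txt_ctx, _ = self.txt_to_img(txt_tokens, img_regions, img_regions)
        # Pool each enhanced modality and fuse into one multi-modal feature.
        fused = torch.cat([img_ctx.mean(dim=1), txt_ctx.mean(dim=1)], dim=-1)
        return self.proj(fused)  # (batch, dim) multi-modal feature per commodity

# Usage: one multi-modal vector per preset commodity, stored in the library.
fusion = CrossModalFusion()
multimodal_feature = fusion(torch.randn(1, 5, 512), torch.randn(1, 12, 512))
```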
In some embodiments, to improve the accurate identification of the media information to be identified, the step of determining the target similarity between the media information to be identified and the preset merchandise in the merchandise library based on the target image feature, the target text feature, and the multi-modal feature may include the following operations:
calculating a first target similarity between the target image features and the multi-modal features;
calculating a second target similarity between the target text feature and the multi-modal feature;
and determining the target similarity between the media information to be identified and the preset commodity based on the first target similarity and the second target similarity.
The first target similarity refers to similarity between target image features of media information to be identified and multi-mode features of preset commodities in a commodity library; the second target similarity refers to the similarity between the target text feature of the media information to be identified and the multi-modal feature of the preset commodity.
In some embodiments, in order to accurately calculate the similarity between features, before the step of "calculating the first target similarity between the target image feature and the multi-modal feature", the following steps may be further included:
Collecting sample media information and sample commodity information;
constructing a first sample pair based on sample text and sample commodity information in the sample media information, and constructing a second sample pair based on sample images and sample commodity information in the sample media information;
training the preset network model based on the first sample pair and the second sample pair to obtain a trained model.
The sample media information refers to media information subjected to commodity identification, and the sample media information comprises sample text and sample images. The sample text refers to commodity description text in the sample media information, and the sample image refers to commodity images contained in the sample media information.
Specifically, the first sample pair refers to a text-commodity information pair consisting of a sample text and sample commodity information; the second sample pair refers to an image-commodity information pair composed of a sample image and sample commodity information.
Wherein in the first sample pair, the sample text is associated with the commodity description text in the sample commodity information, and in the second sample pair, the sample image is associated with the commodity picture in the sample commodity information.
For example, the sample media information may include first media information, second media information, and third media information, where the first media information may include a first text and a first image, the second media information may include a second text and a second image, and the third media information may include a third text and a third image. The sample commodity information may include first commodity information, second commodity information, third commodity information, and the like.
The first text can be matched with the commodity description text in the first commodity information, and a first sample pair can be constructed based on the first text and the first commodity information; the second text can be matched with the commodity description text in the second commodity information, and a first sample pair can be constructed based on the second text and the second commodity information; the third text may be matched with the article description text in the third article information, and a first pair of samples may be constructed based on the third text and the third article information.
The first image can be matched with the commodity picture in the first commodity information, and a second sample pair can be constructed based on the first image and the first commodity information; the second image can be matched with the commodity picture in the second commodity information, and a second sample pair can be constructed based on the second image and the second commodity information; the third image may be matched with the merchandise picture in the third merchandise information, and a second sample pair may be constructed based on the third image and the third merchandise information. In this way, a plurality of first sample pairs and a plurality of second sample pairs can be constructed.
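A minimal sketch of this pairing step follows; the data structures and names are hypothetical, since the patent does not fix a data model.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SamplePair:
    media_content: str     # a sample text, or a path/identifier of a sample image
    commodity_info: str    # the matched sample commodity information

def build_sample_pairs(media_items: List[Tuple[str, str]],
                       commodity_infos: List[str]):
    """media_items holds (sample_text, sample_image) tuples aligned one-to-one
    with commodity_infos, as in the first/second/third example above."""
    first_pairs = [SamplePair(text, info)     # text-commodity information pairs
                   for (text, _), info in zip(media_items, commodity_infos)]
    second_pairs = [SamplePair(image, info)   # image-commodity information pairs
                    for (_, image), info in zip(media_items, commodity_infos)]
    return first_pairs, second_pairs
```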
In some embodiments, the step of training the preset network model based on the first pair of samples and the second pair of samples to obtain a trained model may include the following operations:
Extracting sample text features corresponding to the sample text, sample image features corresponding to the sample image and sample multi-mode features corresponding to the sample commodity information;
obtaining a first characteristic sample pair based on the sample text characteristics and the sample multi-modal characteristics, and obtaining a second characteristic sample pair based on the sample image characteristics and the sample multi-modal characteristics;
calculating a first prediction similarity between the sample text features and the sample multi-modal features in the first feature sample pair through a preset network model, and calculating a second prediction similarity between the sample image features and the sample multi-modal features in the second feature sample pair;
and adjusting model parameters of a preset network model based on the actual similarity between the first prediction similarity and the sample text features and the sample multi-modal features in the first feature sample pair and the actual similarity between the second prediction similarity and the sample image features and the sample multi-modal features in the second feature sample pair until the preset network model converges to obtain a trained model.
Specifically, the sample text is input into the text model, and the sample text features of the sample text are output through the text model; the sample image is input into the image model, and the sample image features of the sample image are output through the image model; the sample commodity information is input into the multi-modal model, and the sample multi-modal features corresponding to the sample commodity information are output through the multi-modal model.
The first feature sample pair includes a sample text feature and a sample multi-modal feature, where the sample text feature corresponds to the sample text in a first sample pair and the sample multi-modal feature corresponds to the sample commodity information in the same first sample pair.
For example, a first sample pair may include a first sample text and first sample commodity information; the sample text feature A of the first sample text and the sample multi-modal feature A of the first sample commodity information are extracted, and a first feature sample pair is constructed, the obtained first feature sample pair including sample text feature A and sample multi-modal feature A.
The second feature sample pair includes a sample image feature and a sample multi-modal feature, where the sample image feature corresponds to the sample image in a second sample pair and the sample multi-modal feature corresponds to the sample commodity information in the same second sample pair.
For example, a second sample pair may include a second sample image and second sample commodity information; the sample image feature B of the second sample image and the sample multi-modal feature B of the second sample commodity information are extracted, and a second feature sample pair is constructed, the obtained second feature sample pair including sample image feature B and sample multi-modal feature B.
The trained model can be used for calculating the similarity between the image features and the multi-modal features and calculating the similarity between the text features and the multi-modal features. The trained model can be obtained through training of a preset network model.
Specifically, the sample text features and the sample multi-modal features in the first feature sample pair are input into the preset network model, and the similarity between the sample text features and the sample multi-modal features is calculated through the preset network model as the first predicted similarity; the sample image features and the sample multi-modal features in the second feature sample pair are input into the preset network model, and the similarity between the sample image features and the sample multi-modal features is calculated through the preset network model as the second predicted similarity.
Further, an actual similarity between the sample text features and the sample multi-modal features in the first feature sample pair and an actual similarity between the sample image features and the sample multi-modal features in the second feature sample pair are obtained. And adjusting model parameters of a preset network model based on the difference between the first prediction similarity and the actual similarity between the sample text features and the sample multi-modal features in the first feature sample pair and the difference between the second prediction similarity and the actual similarity between the sample image features and the sample multi-modal features in the second feature sample pair until the preset network model converges, so as to obtain a trained model.
In the embodiments of the present application, during the training of the preset network model, the loss function may involve two parts. One part is the loss function between the sample text features and the sample multi-modal features, which may take the following contrastive form:

L_text({D_ij}) = Σ_(i,j) [ y_ij · D_ij^2 + (1 − y_ij) · ([m − D_ij]_+)^2 ]

where D_ij represents the distance between sample text feature i and sample multi-modal feature j; y_ij = 1 indicates that the sample text feature and the sample multi-modal feature belong to the same class and form a positive sample pair; y_ij = 0 indicates that they do not belong to the same class and form a negative sample pair; [x]_+ = max(0, x) is the hinge function; and m is a preset margin hyper-parameter.
The other part is the loss function between the sample image features and the sample multi-modal features, of the same form:

L_img({D_mn}) = Σ_(m,n) [ y_mn · D_mn^2 + (1 − y_mn) · ([k − D_mn]_+)^2 ]

where D_mn represents the distance between sample image feature m and sample multi-modal feature n; y_mn = 1 indicates that the sample image feature and the sample multi-modal feature belong to the same class and form a positive sample pair; y_mn = 0 indicates that they do not belong to the same class and form a negative sample pair; [x]_+ = max(0, x) is the hinge function; and k is a preset margin hyper-parameter.
Further, the two loss functions are fused as follows:

L = α · L_text({D_ij}) + β · L_img({D_mn})

where α and β are learnable weight parameters of the two loss terms.
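A PyTorch sketch of the two contrastive losses and their fusion follows. The squared-hinge form is an assumption consistent with the hinge and margin definitions above, and the names are illustrative.

```python
import torch
import torch.nn as nn

def contrastive_loss(dist: torch.Tensor, label: torch.Tensor,
                     margin: float) -> torch.Tensor:
    # label == 1: positive pair, pull the distance toward zero;
    # label == 0: negative pair, push the distance beyond the margin.
    positive = label * dist.pow(2)
    negative = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)  # ([m-D]_+)^2
    return (positive + negative).sum()

class FusedLoss(nn.Module):
    def __init__(self, m: float = 1.0, k: float = 1.0):
        super().__init__()
        self.m, self.k = m, k                          # preset margin hyper-parameters
        self.alpha = nn.Parameter(torch.tensor(1.0))   # learnable weight for L_text
        self.beta = nn.Parameter(torch.tensor(1.0))    # learnable weight for L_img

    def forward(self, d_text, y_text, d_img, y_img):
        l_text = contrastive_loss(d_text, y_text, self.m)
        l_img = contrastive_loss(d_img, y_img, self.k)
        return self.alpha * l_text + self.beta * l_img
```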
For example, referring to fig. 2, fig. 2 is a schematic application scenario diagram of the media information identification method according to an embodiment of the present application. Fig. 2 shows the modeling flow for the sample media information and the sample commodity information, which may specifically include modeling the sample media information features: performing target detection on images in the sample media information to extract target object images and extracting image features through a deep learning model (ResNet), and extracting key texts (commodity description texts) from the texts in the sample media information and extracting text features through BERT.
It further includes modeling the sample commodity information features: the texts of the sample commodity information (commodity title, brand, and category) and the commodity pictures are input into the multi-modal cross-attention model to extract the multi-modal features.
Further, contrastive learning task 1 is performed based on the sample image features and the multi-modal features, that is, the preset network model is trained with the second feature sample pairs constructed from the sample image features and the multi-modal features; and contrastive learning task 2 is performed based on the sample text features and the multi-modal features, that is, the preset network model is trained with the first feature sample pairs constructed from the sample text features and the multi-modal features. The contrastive learning tasks train the model on the similarity between image features and multi-modal features and on the similarity between text features and multi-modal features.
In some embodiments, the step of "calculating a first target similarity between the target image feature and the multi-modal feature" may include the operations of:
and calculating the first target similarity between the target image features and the multi-modal features through the trained model.
Specifically, the target image features and the multi-modal features are input into the trained network model, and the cosine similarity between the target image features and the multi-modal features is calculated through the trained network model as the first target similarity, which expresses the image similarity between the media information to be identified and the preset commodity.
Cosine similarity evaluates the similarity of two vectors by calculating the cosine value of the angle between them, with the vectors drawn in a vector space according to their coordinate values.
In some embodiments, the step of "calculating the second target similarity between the target text feature and the multimodal feature" may include the following operations:
and calculating second target similarity between the target text features and the multi-modal features based on the trained model.
Specifically, the target text features and the multi-modal features are input into the trained network model, and the cosine similarity between the target text features and the multi-modal features is calculated through the trained network model as the second target similarity, which expresses the text similarity between the media information to be identified and the preset commodity.
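A minimal sketch of the two similarity computations follows, with illustrative feature vectors standing in for the outputs of the models above.

```python
import torch
import torch.nn.functional as F

# Illustrative 512-d features; in practice these come from the models above.
target_image_feature = torch.randn(512)
target_text_feature = torch.randn(512)
multimodal_feature = torch.randn(512)

first_target_similarity = F.cosine_similarity(
    target_image_feature, multimodal_feature, dim=0)   # image similarity
second_target_similarity = F.cosine_similarity(
    target_text_feature, multimodal_feature, dim=0)    # text similarity
```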
105. And screening target commodities matched with the media information to be identified from the commodity library based on the similarity.
In some embodiments, the step of "screening the target merchandise matching the media information to be identified from the merchandise library based on the similarity" may include the following operations:
and selecting the commodity with the similarity between the commodity library and the media information to be identified being greater than a preset threshold value from the commodity library as a target commodity.
In the embodiment of the application, a preset threshold is set for screening the commodities matched with the media information.
Specifically, selecting from the commodity library the commodity whose similarity with the media information to be identified is greater than the preset threshold may be performed as follows: from the plurality of preset commodities in the commodity library, according to the similarity between the media information to be identified and each preset commodity, the preset commodity whose similarity is greater than the preset threshold is determined as the target commodity.
The similarity between the media information to be identified and each preset commodity can be determined from the first target similarity and the second target similarity; specifically, the maximum of the two is taken as the similarity.
For example, the first target similarity between the target image features corresponding to the media information to be identified and the multi-modal features of a preset commodity may be 0.8, and the second target similarity between the target text features corresponding to the media information to be identified and the multi-modal features of that preset commodity may be 0.7; the maximum similarity, that is, the first target similarity of 0.8, may be used as the similarity between the media information to be identified and the preset commodity.
Further, according to the similarity between the media information to be identified and each preset commodity, the preset commodities whose similarity is greater than the preset threshold are selected as the target commodities. In this way, the commodities matching the media information to be identified can be identified, and the commodity information of the target commodities can be marked in the media information to be identified, so that users can conveniently view the corresponding commodity information.
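As an illustrative sketch of this screening step, the commodity identifiers and the 0.75 threshold below are assumptions.

```python
from typing import Dict, List, Tuple

def screen_target_commodities(similarities: Dict[str, Tuple[float, float]],
                              threshold: float = 0.75) -> List[str]:
    """similarities maps a commodity id to its (first, second) target similarity."""
    matched = []
    for commodity_id, (image_sim, text_sim) in similarities.items():
        # The final similarity is the maximum of the two target similarities.
        if max(image_sim, text_sim) > threshold:
            matched.append(commodity_id)
    return matched

# With the 0.8 / 0.7 example above, commodity A passes a 0.75 threshold.
print(screen_target_commodities({"commodity_A": (0.8, 0.7),
                                 "commodity_B": (0.4, 0.5)}))  # ['commodity_A']
```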
The embodiment of the application discloses a media information identification method, which includes: acquiring media information to be identified, where the media information to be identified includes a picture to be identified and a text to be identified; extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified; extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified; acquiring multi-modal features of preset commodities in a commodity library, and determining the target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features, and the multi-modal features; and screening target commodities matching the media information to be identified out of the commodity library based on the target similarity. In this way, the accuracy of commodity identification from media information can be improved.
In order to facilitate better implementation of the media information identification method provided by the embodiment of the application, the embodiment of the application also provides a media information identification device based on the media information identification method. Where the meaning of the term is the same as in the media information identification method described above, specific implementation details may be referred to in the description of the method embodiment.
Referring to fig. 3, fig. 3 is a block diagram of a media information identifying apparatus according to an embodiment of the present application, where the apparatus includes:
A first obtaining unit 301, configured to obtain media information to be identified, where the media information to be identified includes a picture to be identified and a text to be identified;
a first extracting unit 302, configured to extract, based on the picture to be identified, a target image feature corresponding to an object in the picture to be identified;
a second extracting unit 303, configured to extract, based on the text to be identified, a target text feature corresponding to a commodity description text in the text to be identified;
a second obtaining unit 304, configured to obtain multi-modal features of preset commodities in a commodity library, and determine a target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features, and the multi-modal features;
and a screening unit 305, configured to screen out, from the commodity library, a target commodity matching the media information to be identified based on the similarity.
In some embodiments, the first extraction unit 302 may include:
the first determining subunit is used for carrying out target detection on the picture to be identified and determining at least one object image in the picture to be identified;
and the first extraction subunit is used for extracting the characteristics of the object image through an image model to obtain the characteristics of the target image.
In some embodiments, the second extraction unit 303 may include:
the first processing subunit is used for carrying out sentence dividing processing on the text to be identified to obtain at least one text sentence;
a second determining subunit configured to determine a target text sentence containing the article description text from the at least one text sentence;
and the second extraction subunit is used for extracting the characteristics of the target text sentence through a text model to obtain the characteristics of the target text.
In some embodiments, the second acquisition unit 304 may include:
a first computing subunit, configured to compute a first target similarity between the target image feature and the multi-modal feature;
a second computing subunit, configured to calculate a second target similarity between the target text feature and the multi-modal feature;
and the third determining subunit is used for determining the target similarity between the media information to be identified and the preset commodity based on the first target similarity and the second target similarity.
In some embodiments, the apparatus may further comprise:
the first acquisition unit is used for acquiring sample media information and sample commodity information;
a construction unit configured to construct a first sample pair based on a sample text in the sample media information and the sample commodity information, and construct a second sample pair based on a sample image in the sample media information and the sample commodity information;
the training unit is used for training a preset network model based on the first sample pair and the second sample pair to obtain a trained model.
in some embodiments, the first computing subunit may be specifically configured to:
and calculating the first target similarity between the target image features and the multi-mode features through the trained model.
In some embodiments, the training unit may include:
a third extraction subunit, configured to extract a sample text feature corresponding to the sample text, a sample image feature corresponding to the sample image, and a sample multi-modal feature corresponding to the sample commodity information;
a fourth determining subunit, configured to obtain a first feature sample pair based on the sample text feature and the sample multi-modal feature, and obtain a second feature sample pair based on the sample image feature and the sample multi-modal feature;
a third computing subunit, configured to compute, through the preset network model, a first predicted similarity between a sample text feature in the first feature sample pair and the sample multi-modal feature, and compute a second predicted similarity between a sample image feature in the second feature sample pair and the sample multi-modal feature;
and the training subunit is used for adjusting the model parameters of the preset network model based on the first predicted similarity and the actual similarity between the sample text features and the sample multi-modal features in the first feature sample pair, and the second predicted similarity and the actual similarity between the sample image features and the sample multi-modal features in the second feature sample pair, until the preset network model converges, to obtain the trained model.
In some embodiments, the second computing subunit may be specifically configured to:
and calculating second target similarity between the target text feature and the multi-modal feature based on the trained model.
In some embodiments, the screening unit 305 may include:
and the selecting subunit is used for selecting the commodity with the similarity between the commodity library and the media information to be identified being greater than a preset threshold value from the commodity library as the target commodity.
In some embodiments, the apparatus may further comprise:
the second acquisition unit is used for acquiring commodity information of a plurality of preset commodities, wherein the commodity information comprises commodity pictures and commodity description texts;
and the third extraction unit is used for inputting the commodity information into a multi-mode model, and extracting the characteristics of the commodity picture and the commodity description text through the multi-mode model to obtain the multi-mode characteristics of the preset commodity.
The embodiment of the application discloses a media information identification device, which acquires media information to be identified through a first acquisition unit 301, wherein the media information to be identified comprises a picture to be identified and a text to be identified; the first extraction unit 302 extracts target image features corresponding to objects in the picture to be identified based on the picture to be identified; the second extracting unit 303 extracts target text features corresponding to commodity description texts in the text to be identified based on the text to be identified; the second obtaining unit 304 obtains multi-modal features of preset commodities in a commodity library, and determines target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features; the screening unit 305 screens out target commodities matching the media information to be identified from the commodity library based on the similarity. Therefore, the commodity identification accuracy of the media information can be improved.
Correspondingly, the embodiment of the application further provides a computer device, which may be a server. As shown in fig. 4, fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 500 includes a processor 501 with one or more processing cores, a memory 502 with one or more computer-readable storage media, and a computer program stored on the memory 502 and executable on the processor. The processor 501 is electrically connected to the memory 502. It will be appreciated by those skilled in the art that the computer device structure shown in the figure does not limit the computer device, which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 501 is a control center of the computer device 500, connects various parts of the entire computer device 500 using various interfaces and lines, and performs various functions of the computer device 500 and processes data by running or loading software programs and/or modules stored in the memory 502, and calling data stored in the memory 502, thereby performing overall monitoring of the computer device 500.
In the embodiment of the present application, the processor 501 in the computer device 500 loads instructions corresponding to the processes of one or more application programs into the memory 502, and the processor 501 executes the application programs stored in the memory 502, thereby implementing the following functions:
acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified;
extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified;
extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified;
acquiring multi-modal features of preset commodities in a commodity library, and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features;
and screening target commodities matched with the media information to be identified from the commodity library based on the target similarity.
In some embodiments, extracting, based on the picture to be identified, a target image feature corresponding to the object in the picture to be identified includes:
performing target detection on the picture to be identified, and determining at least one object image in the picture to be identified;
and extracting features of the object image through an image model to obtain the target image features.
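A minimal sketch of these two steps is given below, assuming PyTorch as the framework (the patent names no concrete detector or image model, so `detect_objects` and `ImageEncoder` are hypothetical stand-ins):

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Hypothetical image model: maps an object image to a feature vector."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

def detect_objects(picture: torch.Tensor) -> list:
    """Stand-in for the target-detection step: a real system would run an
    object detector and return the cropped object regions."""
    return [picture]  # fallback: treat the whole picture as one object image

encoder = ImageEncoder()
picture = torch.rand(1, 3, 224, 224)            # picture to be identified
object_images = detect_objects(picture)         # at least one object image
target_image_features = torch.stack(
    [encoder(img).squeeze(0) for img in object_images])  # (num_objects, 512)
```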
In some embodiments, extracting, based on the text to be identified, a target text feature corresponding to the commodity description text in the text to be identified includes:
performing sentence-splitting processing on the text to be identified to obtain at least one text sentence;
determining a target text sentence containing commodity description text from at least one text sentence;
and extracting features of the target text sentence through a text model to obtain the target text features.
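A minimal sketch of the sentence-splitting and text-encoding steps under the same assumptions (the splitting rule, the `is_commodity_description` filter, and the toy `TextEncoder` are all hypothetical stand-ins; a real system might select target sentences with a trained classifier):

```python
import re
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Hypothetical text model: mean-pools token embeddings into one feature."""
    def __init__(self, vocab_size: int = 30000, dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding(token_ids).mean(dim=0)

def split_sentences(text: str) -> list:
    # Split on common sentence-ending punctuation.
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def is_commodity_description(sentence: str) -> bool:
    # Stand-in selection rule for sentences that describe a commodity.
    return any(k in sentence.lower() for k in ("brand", "price", "buy"))

text_to_identify = "Great day out. This jacket is my favorite brand and a bargain price!"
target_sentences = [s for s in split_sentences(text_to_identify)
                    if is_commodity_description(s)]
token_ids = torch.randint(0, 30000, (16,))      # placeholder tokenization
target_text_feature = TextEncoder()(token_ids)  # (512,)
```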
In some embodiments, determining the target similarity of the media information to be identified to the preset merchandise in the merchandise library based on the target image feature, the target text feature, and the multi-modal feature includes:
calculating a first target similarity between the target image features and the multi-modal features;
calculating a second target similarity between the target text feature and the multi-modal feature;
and determining the target similarity between the media information to be identified and the preset commodity based on the first target similarity and the second target similarity.
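A hedged sketch of how the two similarities might be computed and combined (cosine similarity and a weighted sum are assumptions; the patent only states that the target similarity is determined from the first and second target similarities):

```python
import torch
import torch.nn.functional as F

def target_similarity(image_feat: torch.Tensor, text_feat: torch.Tensor,
                      mm_feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Score the media information against every preset commodity's
    multi-modal feature; alpha is an assumed fusion weight."""
    first = F.cosine_similarity(image_feat.unsqueeze(0), mm_feats, dim=-1)   # first target similarity
    second = F.cosine_similarity(text_feat.unsqueeze(0), mm_feats, dim=-1)   # second target similarity
    return alpha * first + (1.0 - alpha) * second                            # fused target similarity

mm_feats = torch.randn(1000, 512)   # multi-modal features, one per preset commodity
image_feat, text_feat = torch.randn(512), torch.randn(512)
scores = target_similarity(image_feat, text_feat, mm_feats)  # shape (1000,)
```

In practice, the fused scores for all preset commodities can then be ranked or thresholded directly in the subsequent screening step.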
In some embodiments, before calculating the first target similarity between the target image feature and the multi-modal feature, the method further includes:
collecting sample media information and sample commodity information;
constructing a first sample pair based on sample text and sample commodity information in the sample media information, and constructing a second sample pair based on sample images and sample commodity information in the sample media information;
training a preset network model based on the first sample pair and the second sample pair to obtain a trained model;
calculating a first target similarity between the target image feature and the multi-modal feature, comprising:
and calculating the first target similarity between the target image features and the multi-modal features through the trained model.
In some embodiments, training the pre-set network model based on the first and second pairs of samples to obtain a trained model comprises:
extracting sample text features corresponding to the sample text, sample image features corresponding to the sample image, and sample multi-modal features corresponding to the sample commodity information;
obtaining a first feature sample pair based on the sample text features and the sample multi-modal features, and obtaining a second feature sample pair based on the sample image features and the sample multi-modal features;
calculating a first prediction similarity between the sample text features and the sample multi-modal features in the first feature sample pair through a preset network model, and calculating a second prediction similarity between the sample image features and the sample multi-modal features in the second feature sample pair;
and adjusting model parameters of the preset network model based on the first prediction similarity and the actual similarity between the sample text features and the sample multi-modal features in the first feature sample pair, and based on the second prediction similarity and the actual similarity between the sample image features and the sample multi-modal features in the second feature sample pair, until the preset network model converges to obtain a trained model.
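A toy training loop illustrating this procedure (the `SimilarityHead` architecture, the MSE objective against actual-similarity labels, and the fixed iteration count are all assumptions; the patent only specifies adjusting parameters until the model converges):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """Hypothetical 'preset network model': projects each feature into a
    shared space and scores a pair by cosine similarity."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)
        self.proj_b = nn.Linear(dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return F.cosine_similarity(self.proj_a(a), self.proj_b(b), dim=-1)

model = SimilarityHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy feature sample pairs; a label of 1.0 marks an actually matching pair.
sample_text_feats, sample_image_feats, sample_mm_feats = (
    torch.randn(8, 512) for _ in range(3))
actual_text_sim = torch.ones(8)   # actual similarity, first feature sample pairs
actual_image_sim = torch.ones(8)  # actual similarity, second feature sample pairs

for _ in range(100):  # stand-in for "until the preset network model converges"
    first_pred = model(sample_text_feats, sample_mm_feats)    # first prediction similarity
    second_pred = model(sample_image_feats, sample_mm_feats)  # second prediction similarity
    loss = (F.mse_loss(first_pred, actual_text_sim)
            + F.mse_loss(second_pred, actual_image_sim))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```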
In some embodiments, calculating a second target similarity between the target text feature and the multimodal feature comprises:
and calculating second target similarity between the target text features and the multi-modal features based on the trained model.
In some embodiments, screening target merchandise matching media information to be identified from a library of merchandise based on similarity includes:
and selecting, from the commodity library, the commodity whose similarity to the media information to be identified is greater than a preset threshold value as the target commodity.
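A minimal sketch of the threshold screening (the threshold value 0.8 and the SKU identifiers are illustrative assumptions):

```python
import torch

def screen_target_commodities(scores: torch.Tensor, commodity_ids: list,
                              threshold: float = 0.8) -> list:
    """Keep every preset commodity whose target similarity to the media
    information exceeds the preset threshold."""
    keep = (scores > threshold).nonzero(as_tuple=True)[0]
    return [commodity_ids[i] for i in keep.tolist()]

scores = torch.tensor([0.91, 0.42, 0.87, 0.15])
print(screen_target_commodities(scores, ["sku-1", "sku-2", "sku-3", "sku-4"]))
# -> ['sku-1', 'sku-3']
```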
In some embodiments, before acquiring the multi-modal feature of the preset commodity in the commodity library, the method further comprises:
acquiring commodity information of a plurality of preset commodities, wherein the commodity information comprises commodity pictures and commodity description texts;
inputting the commodity information into a multi-modal model, and extracting features of the commodity pictures and commodity description texts through the multi-modal model to obtain the multi-modal features of the preset commodities.
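A minimal sketch of building the commodity library's multi-modal features (the fusion-by-concatenation design and the toy encoders are assumptions; the patent does not specify the multi-modal model's architecture):

```python
import torch
import torch.nn as nn

class MultiModalModel(nn.Module):
    """Hypothetical multi-modal model: encodes a commodity picture and its
    description text, then fuses both into one multi-modal feature."""
    def __init__(self, dim: int = 512, vocab_size: int = 30000):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.fuse = nn.Linear(2 * dim, dim)  # fuse by concatenation + projection

    def forward(self, pictures: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        image_feats = self.image_encoder(pictures)             # commodity picture features
        text_feats = self.text_encoder(token_ids).mean(dim=1)  # description text features
        return self.fuse(torch.cat([image_feats, text_feats], dim=-1))

model = MultiModalModel()
pictures = torch.rand(4, 3, 224, 224)         # commodity pictures
token_ids = torch.randint(0, 30000, (4, 16))  # tokenized commodity description texts
mm_features = model(pictures, token_ids)      # one multi-modal feature per commodity
```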
According to the embodiment of the application, the media information to be identified is obtained, and the media information to be identified comprises a picture to be identified and a text to be identified; extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified; extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified; acquiring multi-modal characteristics of preset commodities in a commodity library, and determining target similarity between media information to be identified and the preset commodities in the commodity library based on target image characteristics, target text characteristics and the multi-modal characteristics; and screening target commodities matched with the media information to be identified from the commodity library based on the similarity. Therefore, the commodity identification accuracy of the media information can be improved.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Optionally, as shown in fig. 4, the computer device 500 further includes: a touch display screen 503, a radio frequency circuit 504, an audio circuit 505, an input unit 506, and a power supply 507. The processor 501 is electrically connected to the touch display screen 503, the radio frequency circuit 504, the audio circuit 505, the input unit 506, and the power supply 507, respectively. Those skilled in the art will appreciate that the computer device structure shown in fig. 4 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
The touch display screen 503 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 503 may include a display panel and a touch panel. The display panel may be used to display information entered by the user or provided to the user, as well as various graphical user interfaces of the computer device, which may be composed of graphics, guidance information, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations by the user on or near it (such as operations performed on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and generate corresponding operation instructions that trigger the corresponding programs. Alternatively, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 501, and it can also receive commands from the processor 501 and execute them. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it passes the operation to the processor 501 to determine the type of touch event, and the processor 501 then provides a corresponding visual output on the display panel based on the type of touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 503 to implement the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to implement the input and output functions. That is, the touch display screen 503 may also serve as part of the input unit 506 to implement an input function.
The radio frequency circuit 504 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another computer device.
The audio circuit 505 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. On one hand, the audio circuit 505 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts collected sound signals into electrical signals, which the audio circuit 505 receives and converts into audio data. The audio data is then output to the processor 501 for processing and may be sent, for example, to another computer device via the radio frequency circuit 504, or output to the memory 502 for further processing. The audio circuit 505 may also include an earphone jack to provide communication between peripheral earphones and the computer device.
The input unit 506 may be used to receive input numbers, character information, or user characteristics (e.g., fingerprint, iris, facial information, etc.), as well as to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 507 is used to supply power to the various components of the computer device 500. Optionally, the power supply 507 may be logically connected to the processor 501 through a power management system, so that functions such as charging, discharging, and power-consumption management are handled through the power management system. The power supply 507 may also include one or more direct current or alternating current power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any other such components.
Although not shown in fig. 4, the computer device 500 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the computer device provided in this embodiment obtains the media information to be identified, where the media information to be identified includes the picture to be identified and the text to be identified; extracts target image features corresponding to objects in the picture to be identified based on the picture to be identified; extracts target text features corresponding to commodity description texts in the text to be identified based on the text to be identified; obtains multi-modal features of preset commodities in a commodity library, and determines target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features; and screens out target commodities matching the media information to be identified from the commodity library based on the target similarity. Therefore, the commodity identification accuracy of the media information can be improved.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods in the above embodiments may be completed by instructions, or by instructions controlling the associated hardware, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of computer programs that can be loaded by a processor to perform steps in any of the media information identification methods provided by embodiments of the present application. For example, the computer program may perform the steps of:
acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified;
extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified;
extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified;
acquiring multi-modal features of preset commodities in a commodity library, and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features;
and screening target commodities matched with the media information to be identified from the commodity library based on the target similarity.
In some embodiments, extracting, based on the picture to be identified, a target image feature corresponding to the object in the picture to be identified includes:
performing target detection on the picture to be identified, and determining at least one object image in the picture to be identified;
and extracting features of the object image through an image model to obtain the target image features.
In some embodiments, extracting, based on the text to be identified, a target text feature corresponding to the commodity description text in the text to be identified includes:
performing sentence-splitting processing on the text to be identified to obtain at least one text sentence;
determining a target text sentence containing commodity description text from at least one text sentence;
and extracting features of the target text sentence through a text model to obtain the target text features.
In some embodiments, determining the target similarity of the media information to be identified to the preset merchandise in the merchandise library based on the target image feature, the target text feature, and the multi-modal feature includes:
calculating a first target similarity between the target image features and the multi-modal features;
calculating a second target similarity between the target text feature and the multi-modal feature;
and determining the target similarity between the media information to be identified and the preset commodity based on the first target similarity and the second target similarity.
In some embodiments, before calculating the first target similarity between the target image feature and the multi-modal feature, the method further includes:
collecting sample media information and sample commodity information;
constructing a first sample pair based on sample text and sample commodity information in the sample media information, and constructing a second sample pair based on sample images and sample commodity information in the sample media information;
training a preset network model based on the first sample pair and the second sample pair to obtain a trained model;
calculating a first target similarity between the target image feature and the multi-modal feature, comprising:
and calculating the first target similarity between the target image features and the multi-modal features through the trained model.
In some embodiments, training the pre-set network model based on the first and second pairs of samples to obtain a trained model comprises:
extracting sample text features corresponding to the sample text, sample image features corresponding to the sample image, and sample multi-modal features corresponding to the sample commodity information;
obtaining a first feature sample pair based on the sample text features and the sample multi-modal features, and obtaining a second feature sample pair based on the sample image features and the sample multi-modal features;
calculating a first prediction similarity between the sample text features and the sample multi-modal features in the first feature sample pair through a preset network model, and calculating a second prediction similarity between the sample image features and the sample multi-modal features in the second feature sample pair;
and adjusting model parameters of the preset network model based on the first prediction similarity and the actual similarity between the sample text features and the sample multi-modal features in the first feature sample pair, and based on the second prediction similarity and the actual similarity between the sample image features and the sample multi-modal features in the second feature sample pair, until the preset network model converges to obtain a trained model.
In some embodiments, calculating a second target similarity between the target text feature and the multimodal feature comprises:
and calculating second target similarity between the target text features and the multi-modal features based on the trained model.
In some embodiments, screening target merchandise matching media information to be identified from a library of merchandise based on similarity includes:
and selecting, from the commodity library, the commodity whose similarity to the media information to be identified is greater than a preset threshold value as the target commodity.
In some embodiments, before acquiring the multi-modal feature of the preset commodity in the commodity library, the method further comprises:
acquiring commodity information of a plurality of preset commodities, wherein the commodity information comprises commodity pictures and commodity description texts;
inputting the commodity information into a multi-modal model, and extracting features of the commodity pictures and commodity description texts through the multi-modal model to obtain the multi-modal features of the preset commodities.
According to the embodiment of the application, the media information to be identified is obtained, and the media information to be identified comprises a picture to be identified and a text to be identified; extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified; extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified; acquiring multi-modal features of preset commodities in a commodity library, and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features; and screening target commodities matched with the media information to be identified from the commodity library based on the target similarity. Therefore, the commodity identification accuracy of the media information can be improved.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein, the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the computer program stored in the storage medium can execute the steps in any media information identification method provided in the embodiments of the present application, it can achieve the beneficial effects achievable by any media information identification method provided in the embodiments of the present application; these are detailed in the previous embodiments and are not repeated herein.
According to one aspect of the present application, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The media information identification method, apparatus, computer device and storage medium provided in the embodiments of the present application have been described above in detail. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope in light of the ideas of the present application. In view of the above, the contents of this description should not be construed as limiting the present application.
Claims (12)
1. A method for identifying media information, the method comprising:
acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified;
extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified;
extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified;
acquiring multi-modal features of preset commodities in a commodity library, and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features;
and screening target commodities matched with the media information to be identified from the commodity library based on the target similarity.
2. The method according to claim 1, wherein the extracting, based on the picture to be identified, a target image feature corresponding to an object in the picture to be identified includes:
performing target detection on the picture to be identified, and determining at least one object image in the picture to be identified;
and extracting the characteristics of the object image through an image model to obtain the characteristics of the target image.
3. The method according to claim 1, wherein the extracting, based on the text to be identified, a target text feature corresponding to a commodity description text in the text to be identified includes:
performing sentence segmentation on the text to be identified to obtain at least one text sentence;
determining a target text sentence containing commodity description text from the at least one text sentence;
and extracting the characteristics of the target text sentence through a text model to obtain the characteristics of the target text.
4. The method of claim 1, wherein the determining the target similarity of the media information to be identified to a preset commodity in the commodity library based on the target image feature, the target text feature, and the multi-modal feature comprises:
calculating a first target similarity between the target image feature and the multi-modal feature;
calculating a second target similarity between the target text feature and the multi-modal feature;
and determining the target similarity between the media information to be identified and the preset commodity based on the first target similarity and the second target similarity.
5. The method of claim 4, further comprising, prior to said calculating the first target similarity between the target image feature and the multi-modal feature:
collecting sample media information and sample commodity information;
constructing a first sample pair based on sample text in the sample media information and the sample commodity information, and constructing a second sample pair based on sample images in the sample media information and the sample commodity information;
training a preset network model based on the first sample pair and the second sample pair to obtain a trained model;
the computing a first target similarity between the target image feature and the multi-modal feature comprises:
and calculating the first target similarity between the target image features and the multi-modal features through the trained model.
6. The method of claim 5, wherein training the predetermined network model based on the first pair of samples and the second pair of samples to obtain a trained model comprises:
extracting sample text features corresponding to the sample text, sample image features corresponding to the sample image, and sample multi-modal features corresponding to the sample commodity information;
obtaining a first feature sample pair based on the sample text features and the sample multi-modal features, and obtaining a second feature sample pair based on the sample image features and the sample multi-modal features;
calculating a first prediction similarity between the sample text features and the sample multi-modal features in the first feature sample pair through the preset network model, and calculating a second prediction similarity between the sample image features and the sample multi-modal features in the second feature sample pair;
and adjusting model parameters of the preset network model based on the first prediction similarity and the actual similarity between the sample text features and the sample multi-modal features in the first feature sample pair, and based on the second prediction similarity and the actual similarity between the sample image features and the sample multi-modal features in the second feature sample pair, until the preset network model converges to obtain the trained model.
7. The method of claim 5, wherein said calculating a second target similarity between said target text feature and said multimodal feature comprises:
and calculating second target similarity between the target text feature and the multi-modal feature based on the trained model.
8. The method of claim 1, wherein the screening target merchandise from the merchandise library that matches the media information to be identified based on the similarity comprises:
and selecting, from the commodity library, the commodity whose similarity to the media information to be identified is greater than a preset threshold value as the target commodity.
9. The method of claim 1, further comprising, prior to said obtaining the multimodal features of the preset merchandise in the merchandise library:
acquiring commodity information of a plurality of preset commodities, wherein the commodity information comprises commodity pictures and commodity description texts;
inputting the commodity information into a multi-modal model, and extracting features of the commodity picture and the commodity description text through the multi-modal model to obtain the multi-modal features of the preset commodity.
10. A media information identification apparatus, the apparatus comprising:
the first acquisition unit is used for acquiring media information to be identified, wherein the media information to be identified comprises a picture to be identified and a text to be identified;
the first extraction unit is used for extracting target image features corresponding to objects in the picture to be identified based on the picture to be identified;
the second extraction unit is used for extracting target text features corresponding to commodity description texts in the text to be identified based on the text to be identified;
the second acquisition unit is used for acquiring multi-modal features of preset commodities in a commodity library and determining target similarity between the media information to be identified and the preset commodities in the commodity library based on the target image features, the target text features and the multi-modal features;
and the screening unit is used for screening target commodities matched with the media information to be identified from the commodity library based on the target similarity.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the media information identification method according to any one of claims 1 to 9 when executing the program.
12. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of identifying media information as claimed in any one of claims 1 to 9.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310916835.0A | 2023-07-24 | 2023-07-24 | Media information identification method, device, computer equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN117725234A (en) | 2024-03-19 |
Family
ID=90205833
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310916835.0A | Media information identification method, device, computer equipment and storage medium | 2023-07-24 | 2023-07-24 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN117725234A (en) |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |