CN113704513A - Model training method, information display method and device


Info

Publication number
CN113704513A
CN113704513A (application number CN202110849818.0A)
Authority
CN
China
Prior art keywords
multimedia information
information
text
target
category
Prior art date
Legal status
Granted
Application number
CN202110849818.0A
Other languages
Chinese (zh)
Other versions
CN113704513B (en)
Inventor
周鑫 (Zhou Xin)
曹佐 (Cao Zuo)
左凯 (Zuo Kai)
马潮 (Ma Chao)
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority: CN202110849818.0A
Publication of CN113704513A
Application granted
Publication of CN113704513B
Legal status: Active

Classifications

    • G — Physics
    • G06 — Computing; Calculating or Counting
    • G06F — Electric Digital Data Processing
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 — Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/44 — Browsing; Visualisation therefor
    • G06F 40/00 — Handling natural language data
    • G06F 40/30 — Semantic analysis

Abstract

The specification discloses a model training method, an information display method, and an apparatus. A training sample is obtained from a pre-constructed sample set, and the associated multimedia information corresponding to the target multimedia information in the sample is determined. Text features corresponding to the text information are then determined, and the target multimedia information is semantically enhanced through the associated multimedia information to determine its semantic features. Based on these features, a matching model to be trained determines the association degree between the text information and the target multimedia information, and the model is trained according to the association degree and the label information of the training sample. With the trained matching model, the multimedia information with the highest association degree with a piece of text information can be selected as the multimedia information for the object the text describes, ensuring a high matching degree between the selected multimedia information and the text information.

Description

Model training method, information display method and device
Technical Field
The specification relates to the technical field of computers, in particular to a model training method, an information display method and an information display device.
Background
With the rapid development of information technology, many application scenarios have emerged in recent years in which a platform must autonomously select and combine information on demand: choosing the most suitable matching picture for a news article, capturing the best cover frame for a video, screening the most relevant display picture (such as an avatar) for a merchant, and so on. In most of these scenarios the information is displayed to users as combined text and images, so that users can more easily extract the information they need and enjoy a better browsing experience.
At present, when a platform selects a matching picture for a text, candidate pictures are scored by aesthetic quality, the highest-scoring picture is chosen as the matching picture, and the text and the selected picture are displayed together. The selected picture may therefore fail to accurately reflect the text content. For example, if a foot-bath store uploads a beautiful scenery photo, that photo may be selected for display simply because of its high aesthetic score; a browsing user then cannot learn the store's services from the picture, which hampers the user's information acquisition.
Existing matching-picture selection schemes therefore suffer from the problem that the picture selected for a text does not truly reflect the content the text expresses, i.e., the matching degree between the selected picture and the text is low.
Disclosure of Invention
The present specification provides a model training method, an information displaying method and an information displaying apparatus, so as to partially solve the above problems in the prior art.
The technical scheme adopted by the specification is as follows:
the present specification provides a method of model training, comprising:
acquiring a training sample from a pre-constructed sample set, wherein the training sample comprises target multimedia information and text information corresponding to the target multimedia information, and the text information is used for describing the characteristics of an object to which the target multimedia information belongs;
determining associated multimedia information corresponding to the target multimedia information;
determining text features corresponding to the text information, and performing semantic enhancement on the target multimedia information through the associated multimedia information to determine semantic features corresponding to the target multimedia information;
determining the association degree between the text information and the target multimedia information through a matching model to be trained on the basis of the text characteristics and the semantic characteristics;
and training the matching model according to the relevance and the label information corresponding to the training sample.
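The specification does not fix the internal form of the matching model or its training loss. As a minimal, purely illustrative sketch (the function names, 3-dimensional features, and dot-product scorer are all hypothetical, not taken from the patent), the training step can be pictured as a weighted dot-product scorer trained with binary cross-entropy on the sample's label:

```python
import math

def association_degree(text_feat, sem_feat, w):
    """Score in (0, 1): weighted dot product of text and semantic features, squashed by a sigmoid."""
    z = sum(wi * t * s for wi, t, s in zip(w, text_feat, sem_feat))
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, sample, lr=0.1):
    """One gradient step of binary cross-entropy on a (text_feat, sem_feat, label) sample."""
    text_feat, sem_feat, label = sample
    p = association_degree(text_feat, sem_feat, w)
    # dBCE/dz = p - label; chain rule through the weighted dot product gives t * s per weight
    grad = [(p - label) * t * s for t, s in zip(text_feat, sem_feat)]
    return [wi - lr * g for wi, g in zip(w, grad)]

# Hypothetical 3-dimensional features for one positive pair (label 1)
sample = ([1.0, 0.5, -0.2], [0.8, 0.4, 0.1], 1)
w = [0.0, 0.0, 0.0]
for _ in range(100):
    w = train_step(w, sample)
```

A real implementation would replace this toy scorer with the feature extraction and feature matching layers described next, and train on batches of both positive and negative samples.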
Optionally, the matching model comprises: the system comprises a feature extraction layer and a feature matching layer, wherein the feature extraction layer comprises a text feature extraction layer and an image feature extraction layer;
determining text features corresponding to the text information, and performing semantic enhancement on the target multimedia information through the associated multimedia information to determine semantic features corresponding to the target multimedia information, specifically including:
inputting the text information into the text feature extraction layer to obtain text features corresponding to the text information, and inputting the target multimedia information and the associated multimedia information into the image feature extraction layer to obtain semantic features corresponding to the target multimedia information after semantic enhancement through the associated multimedia information;
determining the association degree between the text information and the target multimedia information through a matching model to be trained on the basis of the text features and the semantic features, and specifically comprises the following steps:
and inputting the text features and the semantic features into the feature matching layer, and determining the association degree between the text information and the target multimedia information.
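The three layers above can be pictured with the following toy sketch. It is purely illustrative: in practice the text and image feature extraction layers would be learned neural encoders, while the hypothetical functions here stand in with mean-pooling (semantic enhancement as averaging the target with its associated embeddings) and cosine similarity:

```python
def text_feature_layer(token_embeddings):
    """Toy text feature extraction: mean-pool the token embeddings of the text information."""
    n = len(token_embeddings)
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(len(token_embeddings[0]))]

def image_feature_layer(target_embedding, associated_embeddings):
    """Toy semantic enhancement: average the target embedding with its associated embeddings."""
    all_embs = [target_embedding] + associated_embeddings
    n = len(all_embs)
    return [sum(e[i] for e in all_embs) / n for i in range(len(target_embedding))]

def feature_matching_layer(text_feat, sem_feat):
    """Toy association degree: cosine similarity of the text and semantic features."""
    dot = sum(t * s for t, s in zip(text_feat, sem_feat))
    nt = sum(t * t for t in text_feat) ** 0.5
    ns = sum(s * s for s in sem_feat) ** 0.5
    return dot / (nt * ns) if nt and ns else 0.0
```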
Optionally, constructing a sample set specifically includes:
aiming at each preset category, acquiring each initial multimedia information under the category;
determining similarity between each other initial multimedia information except the initial multimedia information under the category and the initial multimedia information aiming at each initial multimedia information under the category;
and selecting the multimedia information for constructing the sample set of the category from the initial multimedia information of the category according to the similarity.
Optionally, according to the similarity, selecting multimedia information for constructing a sample set of the category from each initial multimedia information under the category, specifically including:
determining, for each piece of initial multimedia information, the number of other pieces of initial multimedia information whose similarity to it is greater than a set similarity, as the multimedia matching number corresponding to that piece of initial multimedia information;
and selecting the multimedia information for constructing the sample set of the category according to the multimedia matching number corresponding to each initial multimedia information under the category.
Optionally, selecting the multimedia information used for constructing the sample set of the category according to the multimedia matching number corresponding to each initial multimedia information in the category, specifically including:
sequencing the initial multimedia information under the category according to the sequence of the multimedia matching quantity corresponding to each initial multimedia information under the category from large to small to obtain a sequencing result;
taking the initial multimedia information before a first set ranking in the ranking result as a positive sample, and taking the initial multimedia information after a second set ranking in the ranking result as a negative sample, wherein the first set ranking is before the second set ranking;
and constructing a sample set of the category according to the positive sample and the negative sample.
Optionally, determining associated multimedia information corresponding to the target multimedia information specifically includes:
determining a category corresponding to an object to which the target multimedia information belongs as a target category;
and determining associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category.
Optionally, determining associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category specifically includes:
and selecting a set number of pieces of multimedia information from the front of the ranking result of the multimedia information in the multimedia information set corresponding to the target category, to obtain the associated multimedia information corresponding to the target multimedia information.
The present specification provides a method of information presentation, comprising:
acquiring text information and candidate multimedia information corresponding to an object described by the text information;
determining associated multimedia information corresponding to each candidate multimedia information;
determining text features corresponding to the text information, and performing semantic enhancement on the candidate multimedia information through the associated multimedia information to determine semantic features corresponding to the candidate multimedia information;
determining the association degree between the text information and the candidate multimedia information through a pre-trained matching model based on the text characteristics and the semantic characteristics, wherein the matching model is obtained by training through the model training method;
and selecting multimedia information used for representing the object from the candidate multimedia information according to the association degree between the text information and each candidate multimedia information, and displaying the object to a user according to the selected multimedia information.
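This final selection step can be sketched as follows (function and inputs hypothetical; `score_fn` stands in for the trained matching model applied to the text feature and each candidate's semantically enhanced feature):

```python
def select_multimedia(text_feat, candidate_feats, score_fn):
    """Return the id of the candidate whose (semantically enhanced) feature has the
    highest association degree with the text feature."""
    return max(candidate_feats, key=lambda cid: score_fn(text_feat, candidate_feats[cid]))

# Hypothetical usage: a dot product stands in for the trained matching model
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
best = select_multimedia([1.0, 0.0], {"a": [0.0, 1.0], "b": [2.0, 0.0]}, dot)
```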
The present specification provides an apparatus for model training, comprising:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a training sample from a pre-constructed sample set, the training sample comprises target multimedia information and text information corresponding to the target multimedia information, and the text information is used for describing the characteristics of an object to which the target multimedia information belongs;
the associated multimedia information determining module is used for determining associated multimedia information corresponding to the target multimedia information;
the characteristic determining module is used for determining text characteristics corresponding to the text information and performing semantic enhancement on the target multimedia information through the associated multimedia information so as to determine semantic characteristics corresponding to the target multimedia information;
the relevancy determining module is used for determining the relevancy between the text information and the target multimedia information through a matching model to be trained on the basis of the text characteristics and the semantic characteristics;
and the training module is used for training the matching model according to the association degree and the label information corresponding to the training sample.
This specification provides an apparatus for information presentation, comprising:
the acquisition module is used for acquiring text information and candidate multimedia information corresponding to an object described by the text information;
the relevant multimedia information determining module is used for determining relevant multimedia information corresponding to each candidate multimedia information;
the characteristic determining module is used for determining text characteristics corresponding to the text information and performing semantic enhancement on the candidate multimedia information through the associated multimedia information so as to determine semantic characteristics corresponding to the candidate multimedia information;
the relevancy determining module is used for determining the relevancy between the text information and the candidate multimedia information through a pre-trained matching model based on the text characteristics and the semantic characteristics, wherein the matching model is obtained through the model training method;
and the display module is used for selecting the multimedia information used for representing the object from the candidate multimedia information according to the association degree between the text information and each candidate multimedia information, and displaying the object to the user according to the selected multimedia information.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method of model training and method of information presentation.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for model training and the method for information presentation when executing the program.
The technical scheme adopted by the specification can achieve the following beneficial effects:
in the model training method and information display method provided in this specification, a training sample is obtained from a pre-constructed sample set; it includes target multimedia information and corresponding text information, where the text information describes the characteristics of the object to which the target multimedia information belongs. The associated multimedia information corresponding to the target multimedia information is then determined. Next, text features corresponding to the text information are determined, the target multimedia information is semantically enhanced through the associated multimedia information to determine its semantic features, and, based on the text features and semantic features, the association degree between the text information and the target multimedia information is determined by the matching model to be trained. Finally, the matching model is trained according to the association degree and the label information of the training sample. Before information display, the trained matching model can then be used to select, from the candidate multimedia information, the multimedia information that best matches the object described by the text information, and the object is displayed to the user accordingly.
It can be seen from the above method that a matching model for determining the association degree between text information and multimedia information can be trained in advance, so that when multimedia information is selected, the association degree between the text information and each candidate multimedia information is determined by the matching model, and multimedia information with a high association degree with the text information is then selected, according to these association degrees, as the multimedia information corresponding to the object described by the text information. This ensures a high matching degree between the selected multimedia information and the text information. Meanwhile, during model training, the associated multimedia information corresponding to the target multimedia information is determined and used to semantically enhance it, which speeds up the training of the matching model and further ensures that the trained matching model can identify multimedia information that matches the text information well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description serve to explain the specification, without limiting it. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a method for model training provided in an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating a method for displaying information provided by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an apparatus for model training provided in an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an apparatus for displaying information provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of this specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort belong to the protection scope of the present specification.
In order to solve the problem that the matching degree between the matching image selected for the text and the text is low when the matching image is selected, the specification provides a model training method, an information display method and an information display device. According to the scheme, a matching model for determining the association degree between the text information and the multimedia information is obtained through training according to a training sample, so that the matching model learns the association relation between the text information and the multimedia information. And then, when the multimedia information is selected, the multimedia information with higher association degree with the text information is selected from the candidate multimedia information by utilizing the matching model obtained by training and is used as the multimedia information corresponding to the object described by the text information, and then when the object is displayed, the object is displayed to the user according to the selected multimedia information. Therefore, the selected multimedia information can be more matched with the object described by the text information, and the high matching degree between the selected multimedia information and the text information is ensured.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for training a model provided in an embodiment of the present specification, which specifically includes the following steps:
step S100, obtaining a training sample from a pre-constructed sample set, wherein the training sample comprises target multimedia information and text information corresponding to the target multimedia information, and the text information is used for describing the characteristics of an object to which the target multimedia information belongs.
The executing body of the model training and information display scheme provided in the present specification may be a platform or a server providing service support for information display, or may be a terminal device such as a desktop computer. For convenience of description, the following description will be made by taking only the platform as an execution subject.
The target multimedia information in a training sample corresponds one-to-one with the text information, and the text information describes the characteristics of the object to which the target multimedia information belongs. When the training sample is a positive sample, the text information accurately describes those characteristics; when it is a negative sample, the description is inaccurate, that is, the object described by the content of the text information does not match the object to which the target multimedia information belongs.
The technical scheme in this specification can be applied to various application scenarios; the multimedia information involved, and the corresponding text information, differ from scenario to scenario. For example, when selecting the most suitable matching picture for news, the multimedia information may be a picture, and the corresponding text information may include the news title, subtitles in the news, and so on. When screening the most relevant display picture for a merchant, the multimedia information is likewise a picture, and the corresponding text information may include the merchant name, the category of goods the merchant offers, characteristic labels assigned by the platform, customer reviews of the merchant's goods, and so on. In the scenario of selecting the best preview clip for a video, the multimedia information is a short video, and the corresponding text information includes the video name, the video partition, the audio text information corresponding to the video, and so on. This specification therefore does not limit the specific form of the multimedia information; technicians can determine it according to specific service requirements.
In specific implementation, a sample set for training a matching model needs to be constructed by the platform before model training, and the construction process of the sample set will be described in detail below.
In an actual service scenario, the services implemented by a platform are generally given a basic classification, and the objects under each resulting category often have a great deal in common, so the multimedia information describing objects under the same category tends to be similar. Thus, the more other pieces of multimedia information under a category a given piece resembles, the more likely it is that the object to which that piece belongs truly falls under the category, and the better the piece can reflect the characteristics of the services in that category.
For example, published news generally has a specific category, such as civil news, entertainment news, military news, or science news; news contents in different categories may differ greatly, while news contents within the same category tend to share similar topics. For another example, when screening the most relevant display pictures for merchants in the catering field, merchants may be classified into categories such as dinner, fast food, halal, dessert, and drinks; the goods offered by merchants in the same category overlap greatly, while those offered by merchants in different categories differ greatly.
Therefore, when a training sample set is constructed in the present specification, a sample set corresponding to each category may be constructed based on existing classifications of services implemented by a platform, and a sample set for training a matching model is finally obtained.
In specific implementation, the platform first acquires each piece of initial multimedia information under each preset category. For each piece of initial multimedia information under a category, it then determines the similarity between that piece and every other piece of initial multimedia information under the category, and finally, according to these similarities, selects from the initial multimedia information of the category the multimedia information used to construct the sample set of the category.
When selecting the multimedia information used to construct the sample set of a category, the platform takes, for each piece of initial multimedia information, the number of other pieces of initial multimedia information whose similarity to it exceeds the set similarity as the multimedia matching number of that piece, and then selects the multimedia information used to construct the sample set of the category according to the multimedia matching numbers of the pieces of initial multimedia information under the category.
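The multimedia matching number can be sketched as follows (hypothetical helper; `similarity` is assumed to be a precomputed pairwise similarity matrix for the items of one category):

```python
def matching_numbers(similarity, threshold):
    """similarity[i][j]: similarity between items i and j within one category.
    Returns, for each item, the count of *other* items whose similarity to it
    exceeds the threshold (the item's multimedia matching number)."""
    n = len(similarity)
    return [sum(1 for j in range(n) if j != i and similarity[i][j] > threshold)
            for i in range(n)]
```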
In this specification, the platform may select multimedia information for constructing the sample set of the category from each initial multimedia information by a plurality of methods.
For example, the platform sorts the initial multimedia information under the category in descending order of their multimedia matching numbers to obtain a ranking result, then takes the initial multimedia information ranked before a first set ranking as positive samples (e.g., the first 5000 pieces) and the initial multimedia information ranked after a second set ranking as negative samples (e.g., the last 5000 pieces), and constructs the sample set of the category from the determined positive and negative samples. The first set ranking precedes the second set ranking.
For another example, the platform may identify the initial multimedia information whose multimedia matching number is greater than a first set number and select it as positive samples, and identify the initial multimedia information whose multimedia matching number is smaller than a second set number and select it as negative samples, then construct the sample set of the category from the determined positive and negative samples. The first set number is greater than the second set number.
In this specification, when there is a large amount of initial multimedia information and the positive and negative samples selected by set ranking are of good quality, either of the two methods may be used to select positive and negative samples and construct the sample set of the category. When there is little initial multimedia information, or the samples selected by set ranking are of poor quality, the platform can select positive and negative samples by the multimedia matching numbers of the initial multimedia information, to ensure the selected samples are of high quality.
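Both selection modes can be sketched as follows (hypothetical helpers operating on the per-item multimedia matching numbers, with labels 1 and 0 marking positive and negative samples):

```python
def split_by_rank(counts, top_k, bottom_k):
    """Ranking mode: the top_k items by multimedia matching number become positive
    samples (label 1), and the bottom_k items become negative samples (label 0)."""
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    pos = [(i, 1) for i in order[:top_k]]
    neg = [(i, 0) for i in order[-bottom_k:]]
    return pos, neg

def split_by_threshold(counts, hi, lo):
    """Threshold mode: matching number above `hi` -> positive sample,
    below `lo` -> negative sample (with hi > lo)."""
    pos = [(i, 1) for i, c in enumerate(counts) if c > hi]
    neg = [(i, 0) for i, c in enumerate(counts) if c < lo]
    return pos, neg
```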
In practical applications, technicians may select a sample selection mode according to actual service requirements, which is not specifically limited in this specification.
After the positive and negative samples are determined, the platform supplements them with corresponding text information. The text information can be generated automatically by the platform from the corresponding initial multimedia information. In practice, the services implemented by the platform already hold a large amount of business data (including both text information and multimedia information), so the text information can also be obtained directly from the business data of the services for which the platform is responsible.
For example, in the application scenario of screening the most relevant display pictures for merchants, the platform may obtain, for each merchant in each category, the pictures contained in the merchant's business data (including pictures uploaded by the merchant and pictures uploaded by users when reviewing the merchant) and its text information (which may include the merchant name, the category information of the goods the merchant offers, labels the platform has assigned to the merchant, user reviews of the merchant's goods, and so on; such text can reflect the merchant's business scope and characteristics), and, for each picture, take the text information of the merchant the picture belongs to as the text information corresponding to that picture.
For another example, in an application scenario in which the most suitable matching picture is selected for news, for each piece of published news in each category, the matching picture of the news and the text information corresponding to the news (which may include the news title, subtitle, and the like) may be acquired; then, for each picture, the text information of the news to which the picture belongs is taken as the text information corresponding to that picture.
In this specification, the label information of a training sample corresponding to a positive sample is 1, representing that the multimedia information in the training sample is highly associated with the text information in the training sample. The label information of a training sample corresponding to a negative sample is 0, representing that the multimedia information in the training sample has a low degree of association with the text information in the training sample.
Therefore, after the sample set is constructed in the above manner, the platform can train the matching model on the training samples in the sample set.
Step S102, determining the associated multimedia information corresponding to the target multimedia information.
In specific implementation, the platform first determines the category corresponding to the object to which the target multimedia information belongs as the target category, and then determines the associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category, where the multimedia information set is constructed in advance. Specifically, when constructing the sample set, the platform can select, for each category, a certain amount of multimedia information from the positive samples of that category to form the multimedia information set corresponding to the category.
In this way, the multimedia information in the multimedia information set corresponding to each category can, to a certain extent, represent the service under that category; that is, the multimedia information in the set is associated with the objects belonging to the category, and thus also with the text information describing those objects.
Therefore, the multimedia information in the multimedia information set corresponding to each category can be sampled and manually checked to verify whether it is representative of the category and associated with the objects belonging to the category; the set can then be screened according to the checking result, and any missing representative multimedia information can be added to the set.
Of course, in this specification, the multimedia information set may also be constructed in other manners. For example, a set number of pieces of multimedia information may be manually selected from the initial multimedia information to form the multimedia information set. For another example, in the process of constructing the sample set corresponding to each category, after determining the multimedia matching number corresponding to each piece of initial multimedia information, the platform may sort the initial multimedia information in the category in descending order of multimedia matching number to obtain a sorting result, and then select a set number of pieces of initial multimedia information from the top of the sorting result to form the multimedia information set. Other approaches are not enumerated here.
Further, when selecting multimedia information for the multimedia information set, the platform may, under the first positive-sample selection manner, preferentially select the higher-ranked positive samples into the multimedia information set. Under the second positive-sample selection manner, the platform may preferentially select the positive samples with larger multimedia matching numbers into the multimedia information set. In this way, the multimedia information in the selected set is representative of the category and associated with the objects belonging to the category.
In practical application, the platform may determine associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category in a plurality of ways.
For example, the platform may determine the multimedia matching number corresponding to each piece of multimedia information in the multimedia information set corresponding to the target category, sort the multimedia information in descending order of multimedia matching number, and then select a set number of pieces of multimedia information from the top of the sorting result to obtain the associated multimedia information corresponding to the target multimedia information.
For another example, the platform may determine, for each piece of multimedia information in the multimedia information set, the similarity between that multimedia information and the target multimedia information, sort the multimedia information in the set in descending order of this similarity, and then select a set number of pieces of multimedia information according to the sorting result to obtain the associated multimedia information corresponding to the target multimedia information.
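The similarity-ranking selection just described reduces to a top-k lookup. In this sketch, representing each piece of multimedia information by a feature vector and using cosine similarity are assumptions for illustration; the function name is hypothetical.

```python
import numpy as np

def top_k_associated(target_vec, set_vecs, k=3):
    """Rank the category's multimedia information set by cosine similarity to
    the target multimedia information and return the indices of the k most
    similar items, in descending order of similarity."""
    t = target_vec / np.linalg.norm(target_vec)
    s = set_vecs / np.linalg.norm(set_vecs, axis=1, keepdims=True)
    sims = s @ t                      # cosine similarity of each set item to the target
    return np.argsort(-sims)[:k].tolist()
```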
Of course, there are other associated multimedia information selection methods, such as random selection, which are not exemplified herein.
Step S104, determining the text features corresponding to the text information, and performing semantic enhancement on the target multimedia information through the associated multimedia information to determine the semantic features corresponding to the target multimedia information.
Step S106, based on the text features and the semantic features, determining the degree of association between the text information and the target multimedia information through the matching model to be trained.
Step S108, training the matching model according to the degree of association and the label information corresponding to the training sample.
The matching model comprises a feature extraction layer and a feature matching layer, where the feature extraction layer comprises a text feature extraction layer and an image feature extraction layer.
In specific implementation, the platform inputs the text information into the text feature extraction layer of the matching model to obtain the text features corresponding to the text information. Meanwhile, the platform inputs the target multimedia information and the associated multimedia information into the image feature extraction layer of the matching model to obtain the semantic features of the target multimedia information after semantic enhancement through the associated multimedia information. The platform then inputs the obtained text features and semantic features into the feature matching layer of the matching model to determine the degree of association between the text information and the target multimedia information, and trains the matching model according to the determined degree of association and the label information corresponding to the training sample.
After the platform inputs the text information into the text feature extraction layer of the matching model, the text feature extraction layer performs word embedding, position embedding, and segment embedding on each word in the text information to obtain the text features corresponding to that word. Similarly, after the platform inputs the target multimedia information and the associated multimedia information into the image feature extraction layer of the matching model, the image feature extraction layer performs picture embedding (image embedding), position embedding, and segment embedding on each piece of multimedia information to obtain the semantic features corresponding to that multimedia information. The picture embedding corresponding to a piece of multimedia information may be an abstract semantic feature obtained through a pre-trained Convolutional Neural Network (CNN) or Variational Auto-Encoder (VAE).
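As a minimal numeric sketch of how the three embeddings combine by element-wise summation (the table sizes, dimension, and function name are hypothetical; in practice the per-picture content vector would come from the pre-trained CNN or VAE mentioned above):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8                                     # embedding dimension (illustrative)
pos_table = rng.normal(size=(16, d))      # position embedding lookup table
seg_table = rng.normal(size=(2, d))       # segment embedding table: 0 = text, 1 = image

def input_embedding(content_vecs, positions, segments):
    """content_vecs: per-word embeddings or per-picture CNN/VAE features.
    The final input feature of each element is the element-wise sum of its
    content embedding, position embedding, and segment embedding."""
    content_vecs = np.asarray(content_vecs)
    return content_vecs + pos_table[np.asarray(positions)] + seg_table[np.asarray(segments)]
```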
Then, the platform can directly input the text feature of each word in the text information, together with the semantic features of the target multimedia information and of each piece of associated multimedia information, into the feature matching layer. The feature matching layer determines the degree of association between the semantic feature of the target multimedia information and the semantic feature of each piece of associated multimedia information, determines the degree of association between the semantic feature of the target multimedia information and the text feature of each word, and, after fusion and normalization, obtains the degree of association between the target multimedia information and the text information.
Of course, in this specification, before inputting the determined features into the feature matching layer, the platform may also determine the similarity between the target multimedia information and each piece of associated multimedia information according to their semantic features. When the similarity between the associated multimedia information and the target multimedia information is high, it can be preliminarily determined that the target multimedia information is likely to be a positive sample, so forward semantic enhancement can be performed on the target multimedia information through the associated multimedia information, making its representation lean more toward a positive sample, and thereby determining the semantic features corresponding to the target multimedia information.
Conversely, when the similarity between each piece of associated multimedia information and the target multimedia information is low, or when most pieces of associated multimedia information have low similarity to the target multimedia information, it can be preliminarily determined that the target multimedia information is likely to be a negative sample, so reverse semantic enhancement can be performed on the target multimedia information through the associated multimedia information, making its representation lean more toward a negative sample, and thereby determining the semantic features corresponding to the target multimedia information.
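One plausible way to realize the forward/reverse enhancement described in the two paragraphs above is to shift the target feature toward or away from the mean of the associated features depending on their average similarity. The thresholds, the additive update, and the function name are assumptions, not the patent's exact formulation:

```python
import numpy as np

def semantic_enhance(target, associated, high=0.7, low=0.3, alpha=0.2):
    """Forward enhancement: when mean cosine similarity to the associated
    features is high (target likely a positive sample), pull the target
    feature toward their center. Reverse enhancement: when it is low (target
    likely a negative sample), push the target feature away. Otherwise the
    feature is left unchanged."""
    t = target / np.linalg.norm(target)
    a = associated / np.linalg.norm(associated, axis=1, keepdims=True)
    mean_sim = float((a @ t).mean())
    center = associated.mean(axis=0)
    if mean_sim >= high:                  # forward semantic enhancement
        return target + alpha * center, mean_sim
    if mean_sim <= low:                   # reverse semantic enhancement
        return target - alpha * center, mean_sim
    return target, mean_sim
```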
Then, the platform inputs the semantically enhanced semantic features of the target multimedia information and the text features into the feature matching layer; the feature matching layer determines the degree of association between the semantic features of the target multimedia information and the text feature of each word, and, after fusion and normalization, obtains the degree of association between the target multimedia information and the text information.
In addition, when the matching model is trained, within each category the positive samples are multimedia information with a high degree of association with the corresponding text information, and the object described by the text information and the object to which the multimedia information belongs are in the same category, so the positive samples share many commonalities. The multimedia information serving as negative samples has a relatively low degree of association with its corresponding text information, and the object described by the text information may be completely different from the object to which the multimedia information belongs, so the commonality among negative samples may be poor. As a result, when the matching model is trained on negative samples, its training speed may be slow.
In this regard, in this specification, when the model is trained on a positive sample, the obtained loss function is multiplied by a first weight coefficient, and when the model is trained on a negative sample, the obtained loss function is multiplied by a second weight coefficient, where the first weight coefficient is smaller than the second weight coefficient. In this way, when the model is trained on negative samples, the update step determined by the matching model is larger, which can accelerate the convergence of the matching model on negative samples.
Further, since the differences among negative samples are more pronounced, some negative samples may converge quickly while others converge slowly. Therefore, in this specification, the obtained loss function may be multiplied by the second weight coefficient only when the model is trained on a negative sample and misclassifies it, so as to accelerate the matching model's learning on the negative samples it cannot yet identify correctly.
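The weighting scheme above can be sketched as a per-sample weighted binary cross-entropy; the coefficient values and the 0.5 decision threshold are illustrative assumptions:

```python
import math

def weighted_bce(pred, label, w_pos=0.5, w_neg=1.5, neg_only_when_wrong=True):
    """Binary cross-entropy where a positive sample's loss is scaled by the
    smaller first coefficient (w_pos) and a negative sample's loss by the
    larger second coefficient (w_neg); with neg_only_when_wrong, w_neg is
    applied only when the model misclassifies the negative (pred >= 0.5)."""
    eps = 1e-12
    bce = -(label * math.log(pred + eps) + (1 - label) * math.log(1 - pred + eps))
    if label == 1:
        return w_pos * bce
    if neg_only_when_wrong and pred < 0.5:   # negative classified correctly
        return bce
    return w_neg * bce                        # hard negative: larger effective step
```

Scaling the loss scales the gradient, so hard negatives drive larger parameter updates, which is the stated mechanism for speeding up convergence on them.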
After the training of the text-image matching model is completed, it can be used to select display pictures for text information; the specific process is shown in fig. 2.
Fig. 2 is a flow chart of a method for displaying information provided in the present specification.
Step S200, acquiring text information and candidate multimedia information corresponding to an object described by the text information.
Step S202, for each candidate multimedia information, determining the associated multimedia information corresponding to the candidate multimedia information.
Step S204, determining the text characteristics corresponding to the text information, and performing semantic enhancement on the candidate multimedia information through the associated multimedia information to determine the semantic characteristics corresponding to the candidate multimedia information.
Step S206, based on the text features and the semantic features, determining the association degree between the text information and the candidate multimedia information through a pre-trained matching model, wherein the matching model is obtained through the model training method.
Step S208, selecting multimedia information used for representing the object from the candidate multimedia information according to the association degree between the text information and each candidate multimedia information, and displaying the object to the user according to the selected multimedia information.
The following briefly describes the execution flow of the information display method, taking as an example the application scenario in which the most relevant display pictures are selected for a merchant.
When the platform screens the most relevant display pictures for a merchant, it first obtains from the business data the merchant's text information (the merchant name, such as "XXX small steamed buns"; the categories of the commodities the merchant sells, such as small steamed buns, wontons, soups, beverages, and snacks; the characteristic labels assigned to the merchant by the platform, such as "breakfast hot selection"; and user comments on the merchant's commodities, such as reviews of the wontons and snacks) and the candidate multimedia information corresponding to the merchant (such as picture A uploaded by the merchant and picture B uploaded by a customer in a review).
Then, for picture A, the platform determines that the merchant corresponds to the breakfast category (i.e., the target category corresponding to the object to which the multimedia information belongs), and selects the associated multimedia information corresponding to the candidate multimedia information (including picture C, picture D, and picture E) from the multimedia information set corresponding to the breakfast category.
Next, the platform inputs "XXX small steamed buns, wontons, soups, beverages, snacks, breakfast hot selection" and the rest of the text information into the text feature extraction layer of the matching model to obtain the corresponding text features, and at the same time inputs picture A, picture C, picture D, and picture E into the image feature extraction layer of the matching model to obtain the semantic features of the semantically enhanced picture A.
Then, the platform inputs the text features and the semantic features corresponding to picture A into the feature matching layer of the matching model, which outputs a matching degree of 0.7 between picture A and the text information.
Finally, the platform similarly determines that the matching degree between picture B and the text information is 0.2, selects picture A as the picture used to represent the merchant, and displays the merchant to the user with the selected picture A.
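The final selection in this example reduces to taking the candidate with the highest matching degree; a trivial sketch (the function name is hypothetical):

```python
def select_display_picture(match_scores):
    """match_scores: {candidate_id: degree of association with the text info}.
    Returns the candidate multimedia information used to represent the object."""
    return max(match_scores, key=match_scores.get)
```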
Through the above steps, the platform can train in advance a matching model for determining the degree of association between text information and multimedia information, so that when multimedia information is selected, the degree of association between the text information and each piece of candidate multimedia information is determined through the matching model; multimedia information with a higher degree of association with the text information can then be selected as the multimedia information corresponding to the object described by the text information, ensuring that the selected multimedia information is strongly associated with the text information. Meanwhile, during model training, the associated multimedia information corresponding to the multimedia information is determined, and semantic enhancement is performed on the multimedia information through the associated multimedia information, which accelerates the training of the matching model and further ensures that the trained matching model can identify multimedia information that matches the text information well.
Based on the same idea, the present specification also provides a corresponding model training apparatus and information presentation apparatus, as shown in fig. 3 and 4.
Fig. 3 is a schematic structural diagram of a model training apparatus provided in an embodiment of this specification, which specifically includes:
an obtaining module 300, configured to obtain a training sample from a pre-constructed sample set, where the training sample includes target multimedia information and text information corresponding to the target multimedia information, and the text information is used to describe a feature of an object to which the target multimedia information belongs;
an associated multimedia information determining module 301, configured to determine associated multimedia information corresponding to the target multimedia information;
a feature determining module 302, configured to determine a text feature corresponding to the text information, and perform semantic enhancement on the target multimedia information through the associated multimedia information to determine a semantic feature corresponding to the target multimedia information;
the association degree determining module 303 is configured to determine, based on the text feature and the semantic feature, an association degree between the text information and the target multimedia information through a matching model to be trained;
a training module 304, configured to train the matching model according to the association degree and the label information corresponding to the training sample.
Optionally, the matching model comprises: the system comprises a feature extraction layer and a feature matching layer, wherein the feature extraction layer comprises a text feature extraction layer and an image feature extraction layer;
the feature determining module 302 is specifically configured to input the text information into the text feature extraction layer to obtain a text feature corresponding to the text information, and input the target multimedia information and the associated multimedia information into the image semantic feature extraction layer to obtain a semantic feature corresponding to the target multimedia information after performing semantic enhancement on the associated multimedia information;
the relevance determining module 303 is specifically configured to input the text feature and the semantic feature into the feature matching layer, and determine the relevance between the text information and the target multimedia information.
Optionally, the apparatus further comprises:
a sample set constructing module 305, configured to obtain, for each preset category, each initial multimedia information in the category; determining similarity between each other initial multimedia information except the initial multimedia information under the category and the initial multimedia information aiming at each initial multimedia information under the category; and selecting the multimedia information for constructing the sample set of the category from the initial multimedia information of the category according to the similarity.
Optionally, the sample set constructing module 305 is specifically configured to determine the number of other initial multimedia information with similarity higher than a set similarity to the initial multimedia information, as the number of multimedia matches corresponding to the initial multimedia information; and selecting the multimedia information for constructing the sample set of the category according to the multimedia matching number corresponding to each initial multimedia information under the category.
Optionally, the sample set constructing module 305 is specifically configured to sort the initial multimedia information in the category according to a descending order of the matching number of the multimedia corresponding to each initial multimedia information in the category, so as to obtain a sorting result; taking the initial multimedia information before a first set ranking in the ranking result as a positive sample, and taking the initial multimedia information after a second set ranking in the ranking result as a negative sample, wherein the first set ranking is before the second set ranking; and constructing a sample set of the category according to the positive sample and the negative sample.
Optionally, the associated multimedia information determining module 301 is specifically configured to determine a category corresponding to an object to which the target multimedia information belongs, as a target category; and determining associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category.
Optionally, the associated multimedia information determining module 301 is specifically configured to select a set number of multimedia information according to the sorting result of each multimedia information in the multimedia information set corresponding to the target category, and obtain associated multimedia information corresponding to the target multimedia information.
Fig. 4 is a schematic structural diagram of an information display apparatus provided in an embodiment of the present specification, which specifically includes:
an obtaining module 400, configured to obtain text information and candidate multimedia information corresponding to an object described in the text information;
an associated multimedia information determining module 401, configured to determine, for each candidate multimedia information, associated multimedia information corresponding to the candidate multimedia information;
a feature determining module 402, configured to determine a text feature corresponding to the text information, and perform semantic enhancement on the candidate multimedia information through the associated multimedia information to determine a semantic feature corresponding to the candidate multimedia information;
a relevance determining module 403, configured to determine, based on the text feature and the semantic feature, a relevance between the text information and the candidate multimedia information through a pre-trained matching model, where the matching model is obtained through the model training method;
a displaying module 404, configured to select, according to the association degree between the text information and each candidate multimedia information, multimedia information used for representing the object from the candidate multimedia information, and display the object to the user according to the selected multimedia information.
The present specification also provides a computer-readable storage medium having stored thereon a computer program operable to execute the method of model training provided above with respect to fig. 1 and the method of information presentation provided above with respect to fig. 2.
The present specification also provides a schematic structural diagram of an electronic device, shown in fig. 5. As shown in fig. 5, at the hardware level the electronic device includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it to implement the model training method described in fig. 1 and the information display method described in fig. 2. Of course, besides software implementations, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
In the 1990s, it was clear whether an improvement to a technology was an improvement in hardware (for example, an improvement in circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement in a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, without needing a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, today, instead of manually fabricating integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used.
It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be implemented by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be considered a hardware component, and the means included in it for performing various functions may also be considered structures within the hardware component. Indeed, means for performing various functions may be regarded both as software modules for implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (12)

1. A method of model training, comprising:
acquiring a training sample from a pre-constructed sample set, wherein the training sample comprises target multimedia information and text information corresponding to the target multimedia information, and the text information is used for describing the characteristics of an object to which the target multimedia information belongs;
determining associated multimedia information corresponding to the target multimedia information;
determining text features corresponding to the text information, and performing semantic enhancement on the target multimedia information through the associated multimedia information to determine semantic features corresponding to the target multimedia information;
determining the association degree between the text information and the target multimedia information through a matching model to be trained on the basis of the text characteristics and the semantic characteristics;
and training the matching model according to the relevance and the label information corresponding to the training sample.
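The training procedure of claim 1 can be sketched as follows. This is a toy illustration only: the character-hash encoder, the mean-pooled "semantic enhancement" rule, the cosine relevance score, and the binary cross-entropy loss are all assumptions for the sketch, not details fixed by the claim.

```python
import numpy as np

def encode_text(text, dim=8):
    # Toy text encoder: hash characters into a fixed-size bag-of-chars vector.
    v = np.zeros(dim)
    for ch in text:
        v[ord(ch) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def semantic_enhance(target_vec, associated_vecs, alpha=0.5):
    # Enhance the target's representation with the mean of its associated
    # multimedia representations (one simple reading of "semantic enhancement").
    if not associated_vecs:
        return target_vec
    context = np.mean(associated_vecs, axis=0)
    v = alpha * target_vec + (1 - alpha) * context
    n = np.linalg.norm(v)
    return v / n if n else v

def relevance(text_feat, semantic_feat):
    # Association degree as the dot product of the two unit feature vectors.
    return float(np.dot(text_feat, semantic_feat))

def bce_loss(score, label):
    # Binary cross-entropy of the score against the sample's 0/1 label information.
    p = 1.0 / (1.0 + np.exp(-score))
    eps = 1e-9
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))
```

In an actual training loop, the gradient of `bce_loss` would update the parameters of the encoders; here the pieces are only wired together to make the data flow of the claim concrete.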
2. The method of claim 1, wherein the matching model comprises: the system comprises a feature extraction layer and a feature matching layer, wherein the feature extraction layer comprises a text feature extraction layer and an image feature extraction layer;
determining text features corresponding to the text information, and performing semantic enhancement on the target multimedia information through the associated multimedia information to determine semantic features corresponding to the target multimedia information, specifically including:
inputting the text information into the text feature extraction layer to obtain text features corresponding to the text information, and inputting the target multimedia information and the associated multimedia information into the image feature extraction layer to obtain semantic features corresponding to the target multimedia information subjected to semantic enhancement through the associated multimedia information;
determining the association degree between the text information and the target multimedia information through a matching model to be trained on the basis of the text features and the semantic features, and specifically comprises the following steps:
and inputting the text features and the semantic features into the feature matching layer, and determining the association degree between the text information and the target multimedia information.
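One minimal reading of the layer structure in claim 2, with the text feature extraction layer, the image feature extraction layer, and the feature matching layer each reduced to a small numpy map. All shapes, nonlinearities, and the mean-pooling of the target with its associated multimedia are illustrative assumptions, not the patent's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class MatchingModel:
    # Toy sketch of the claim-2 structure: a feature extraction layer split
    # into a text branch and an image branch, plus a feature matching layer.
    def __init__(self, in_dim=8, feat_dim=4):
        self.W_text = rng.normal(size=(feat_dim, in_dim))   # text feature extraction layer
        self.W_image = rng.normal(size=(feat_dim, in_dim))  # image feature extraction layer
        self.w_match = rng.normal(size=2 * feat_dim)        # feature matching layer

    def extract_text(self, text_vec):
        return np.tanh(self.W_text @ text_vec)

    def extract_image(self, target_vec, associated_vecs):
        # The image branch consumes the target together with its associated
        # multimedia, yielding a semantically enhanced feature.
        stack = np.vstack([target_vec] + list(associated_vecs))
        return np.tanh(self.W_image @ stack.mean(axis=0))

    def match(self, text_feat, semantic_feat):
        # Feature matching layer: a scalar association degree in (0, 1).
        z = self.w_match @ np.concatenate([text_feat, semantic_feat])
        return 1.0 / (1.0 + np.exp(-z))
```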
3. The method of claim 1, wherein constructing the sample set specifically comprises:
aiming at each preset category, acquiring each initial multimedia information under the category;
determining similarity between each other initial multimedia information except the initial multimedia information under the category and the initial multimedia information aiming at each initial multimedia information under the category;
and selecting the multimedia information for constructing the sample set of the category from the initial multimedia information of the category according to the similarity.
4. The method according to claim 3, wherein selecting multimedia information for constructing a sample set of the category from the initial multimedia information under the category according to the similarity comprises:
determining the number of other initial multimedia information with the similarity greater than the set similarity between the initial multimedia information and the initial multimedia information as the multimedia matching number corresponding to the initial multimedia information;
and selecting the multimedia information for constructing the sample set of the category according to the multimedia matching number corresponding to each initial multimedia information under the category.
5. The method of claim 4, wherein selecting the multimedia information for constructing the sample set of the category according to the multimedia matching number corresponding to each initial multimedia information in the category specifically comprises:
sorting the initial multimedia information under the category in descending order of the multimedia matching number corresponding to each initial multimedia information under the category, to obtain a ranking result;
taking the initial multimedia information before a first set ranking in the ranking result as a positive sample, and taking the initial multimedia information after a second set ranking in the ranking result as a negative sample, wherein the first set ranking is before the second set ranking;
and constructing a sample set of the category according to the positive sample and the negative sample.
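Claims 3 to 5 together describe a sample-construction pipeline: pairwise similarity within a category, a per-item match count against a similarity threshold, a descending sort, and positives/negatives taken from the two ends of the ranking. A compact sketch, with cosine similarity assumed as the similarity measure (the claims do not fix one):

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(np.dot(a, b) / (na * nb)) if na and nb else 0.0

def build_category_samples(items, sim_threshold=0.8, first_rank=2, second_rank=4):
    # items: list of (id, feature_vector) for one category.
    # 1) For each item, count the *other* items whose similarity exceeds the
    #    threshold (the "multimedia matching number").
    # 2) Sort items by that count, descending.
    # 3) Items before the first set rank become positives; items after the
    #    second set rank become negatives.
    counts = []
    for i, (_, vi) in enumerate(items):
        c = sum(
            1
            for j, (_, vj) in enumerate(items)
            if j != i and cosine(vi, vj) > sim_threshold
        )
        counts.append(c)
    order = sorted(range(len(items)), key=lambda i: counts[i], reverse=True)
    positives = [items[i][0] for i in order[:first_rank]]
    negatives = [items[i][0] for i in order[second_rank:]]
    return positives, negatives
```

The intuition: items that resemble many others in their category are representative of it (positives), while near-isolated items are not (negatives).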
6. The method of claim 1, wherein determining the associated multimedia information corresponding to the target multimedia information specifically comprises:
determining a category corresponding to an object to which the target multimedia information belongs as a target category;
and determining associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category.
7. The method of claim 6, wherein determining associated multimedia information corresponding to the target multimedia information from the multimedia information set corresponding to the target category specifically comprises:
and selecting a set number of pieces of multimedia information from front to back according to the ranking of the multimedia information in the multimedia information set corresponding to the target category, to obtain the associated multimedia information corresponding to the target multimedia information.
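Claims 6 and 7 reduce to a dictionary lookup plus a top-k slice over a pre-ranked per-category set. The sketch below additionally excludes the target itself from its own associated set, which is a reasonable assumption the claims do not state explicitly; how the per-category set is ranked is likewise left open by the claims.

```python
def associated_for(target_id, object_category, category_sets, k=3):
    # category_sets maps each category to its pre-ranked multimedia set.
    # The target's category keys into that set; the first k entries
    # (front to back) become the associated multimedia information.
    ranked = [m for m in category_sets.get(object_category, []) if m != target_id]
    return ranked[:k]
```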
8. A method of information presentation, comprising:
acquiring text information and candidate multimedia information corresponding to an object described by the text information;
determining associated multimedia information corresponding to each candidate multimedia information;
determining text features corresponding to the text information, and performing semantic enhancement on the candidate multimedia information through the associated multimedia information to determine semantic features corresponding to the candidate multimedia information;
determining the association degree between the text information and the candidate multimedia information through a pre-trained matching model based on the text characteristics and the semantic characteristics, wherein the matching model is obtained through training by the method of any one of claims 1-7;
and selecting multimedia information used for representing the object from the candidate multimedia information according to the association degree between the text information and each candidate multimedia information, and displaying the object to a user according to the selected multimedia information.
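At inference time, the display step of claim 8 is essentially an argmax of the trained model's association degree over the candidate multimedia. A hypothetical sketch, with `score_fn` standing in for the pre-trained matching model and the candidate fields named only for illustration:

```python
def pick_display_media(text_feat, candidates, score_fn):
    # Score each candidate's (already semantically enhanced) feature against
    # the text feature and return the id of the best-scoring candidate,
    # i.e. the multimedia information selected to represent the object.
    best = max(candidates, key=lambda c: score_fn(text_feat, c["semantic_feat"]))
    return best["id"]
```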
9. An apparatus for model training, comprising:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a training sample from a pre-constructed sample set, the training sample comprises target multimedia information and text information corresponding to the target multimedia information, and the text information is used for describing the characteristics of an object to which the target multimedia information belongs;
an associated multimedia information determining module, configured to determine associated multimedia information corresponding to the target multimedia information;
The characteristic determining module is used for determining text characteristics corresponding to the text information and performing semantic enhancement on the target multimedia information through the associated multimedia information so as to determine semantic characteristics corresponding to the target multimedia information;
the relevancy determining module is used for determining the relevancy between the text information and the target multimedia information through a matching model to be trained on the basis of the text characteristics and the semantic characteristics;
and the training module is used for training the matching model according to the association degree and the label information corresponding to the training sample.
10. An apparatus for information presentation, comprising:
the acquisition module is used for acquiring text information and candidate multimedia information corresponding to an object described by the text information;
the relevant multimedia information determining module is used for determining relevant multimedia information corresponding to each candidate multimedia information;
the characteristic determining module is used for determining text characteristics corresponding to the text information and performing semantic enhancement on the candidate multimedia information through the associated multimedia information so as to determine semantic characteristics corresponding to the candidate multimedia information;
a relevancy determination module, configured to determine relevancy between the text information and the candidate multimedia information through a pre-trained matching model based on the text feature and the semantic feature, where the matching model is obtained through training by the method according to any one of claims 1 to 7;
and the display module is used for selecting the multimedia information used for representing the object from the candidate multimedia information according to the association degree between the text information and each candidate multimedia information, and displaying the object to the user according to the selected multimedia information.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7 or claim 8.
12. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 or claim 8 when executing the program.
CN202110849818.0A 2021-07-27 2021-07-27 Model training method, information display method and device Active CN113704513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110849818.0A CN113704513B (en) 2021-07-27 2021-07-27 Model training method, information display method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110849818.0A CN113704513B (en) 2021-07-27 2021-07-27 Model training method, information display method and device

Publications (2)

Publication Number Publication Date
CN113704513A true CN113704513A (en) 2021-11-26
CN113704513B CN113704513B (en) 2023-03-24

Family

ID=78650562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110849818.0A Active CN113704513B (en) 2021-07-27 2021-07-27 Model training method, information display method and device

Country Status (1)

Country Link
CN (1) CN113704513B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114740751A (en) * 2022-06-15 2022-07-12 新缪斯(深圳)音乐科技产业发展有限公司 Music scene recognition method and system based on artificial intelligence
CN115170250A (en) * 2022-09-02 2022-10-11 杭州洋驼网络科技有限公司 Article information management method and device for e-commerce platform
CN116108163A (en) * 2023-04-04 2023-05-12 之江实验室 Text matching method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101382937A (en) * 2008-07-01 2009-03-11 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
US20190325023A1 (en) * 2018-04-18 2019-10-24 Microsoft Technology Licensing, Llc Multi-scale model for semantic matching
CN111859987A (en) * 2020-07-28 2020-10-30 网易(杭州)网络有限公司 Text processing method, and training method and device of target task model
CN111881973A (en) * 2020-07-24 2020-11-03 北京三快在线科技有限公司 Sample selection method and device, storage medium and electronic equipment
CN112329477A (en) * 2020-11-27 2021-02-05 上海浦东发展银行股份有限公司 Information extraction method, device and equipment based on pre-training model and storage medium
CN113158653A (en) * 2021-04-25 2021-07-23 北京智源人工智能研究院 Training method, application method, device and equipment for pre-training language model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yuan Guixia, Zhou Xianchun: "Supervised bag-of-words model based on multimedia information retrieval", Computer Engineering and Design (《计算机工程与设计》) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114740751A (en) * 2022-06-15 2022-07-12 新缪斯(深圳)音乐科技产业发展有限公司 Music scene recognition method and system based on artificial intelligence
CN114740751B (en) * 2022-06-15 2022-09-02 新缪斯(深圳)音乐科技产业发展有限公司 Music scene recognition method and system based on artificial intelligence
CN115170250A (en) * 2022-09-02 2022-10-11 杭州洋驼网络科技有限公司 Article information management method and device for e-commerce platform
CN116108163A (en) * 2023-04-04 2023-05-12 之江实验室 Text matching method, device, equipment and storage medium
CN116108163B (en) * 2023-04-04 2023-06-27 之江实验室 Text matching method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113704513B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
US10325397B2 (en) Systems and methods for assembling and/or displaying multimedia objects, modules or presentations
CN113704513B (en) Model training method, information display method and device
US11971925B2 (en) Predicting topics of potential relevance based on retrieved/created digital media files
US20140163980A1 (en) Multimedia message having portions of media content with audio overlay
US20140164507A1 (en) Media content portions recommended
US20140161356A1 (en) Multimedia message from text based images including emoticons and acronyms
US9202142B1 (en) Automatic assessment of books to determine suitability for audio book conversion
CN112770187B (en) Shop data processing method and device
US20140164371A1 (en) Extraction of media portions in association with correlated input
CN113688313A (en) Training method of prediction model, information pushing method and device
CN115203539B (en) Media content recommendation method, device, equipment and storage medium
CN112733024A (en) Information recommendation method and device
CN112164102A (en) Image processing method and device
CN114332873A (en) Training method and device for recognition model
US20140163956A1 (en) Message composition of media portions in association with correlated text
CN111144980A (en) Commodity identification method and device
WO2012145561A1 (en) Systems and methods for assembling and/or displaying multimedia objects, modules or presentations
CN111787409A (en) Movie and television comment data processing method and device
CN113297452A (en) Multi-level search method, multi-level search device and electronic equipment
CN108460131B (en) Classification label processing method and device
CN113344078B (en) Model training method and device
US20150055936A1 (en) Method and apparatus for dynamic presentation of composite media
US20140362297A1 (en) Method and apparatus for dynamic presentation of composite media
CN114331602A (en) Model training method based on transfer learning, information recommendation method and device
CN113887234A (en) Model training and recommending method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant