CN115982666A - Image-text association degree determining method and device, computer equipment and storage medium


Info

Publication number
CN115982666A
Authority
CN
China
Prior art keywords
sequence
processed
text
data
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211571282.1A
Other languages
Chinese (zh)
Inventor
陈维识
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd
Priority to CN202211571282.1A
Publication of CN115982666A
Legal status: Pending


Abstract

The disclosure provides a method and an apparatus for determining an image-text association degree, a computer device, and a storage medium. The method includes: acquiring a picture sequence to be processed and a text to be processed; performing first feature extraction processing on the picture sequence to be processed to obtain a first target feature vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence; fusing the first target feature vector sequence and the second target feature vector sequence to obtain fused feature data; and determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fused feature data.

Description

Image-text association degree determining method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for determining a degree of image-text association, a computer device, and a storage medium.
Background
With the development of self-media technology, more and more users publish their own articles on the internet. As the threshold for entering self-media becomes ever lower, more and more low-quality articles appear online, causing serious information flooding.
The image-text association degree is an important quantitative index for judging the quality of an article. However, some existing online image-text association degree models used to examine articles can only process the association degree between a single image and a single passage of text.
Disclosure of Invention
Embodiments of the present disclosure provide at least a method and an apparatus for determining an image-text association degree, a computer device, and a storage medium. In a first aspect, an embodiment of the present disclosure provides a method for determining an image-text association degree, including:
acquiring a picture sequence to be processed and a text to be processed;
performing first feature extraction processing on the picture sequence to be processed to obtain a first target feature vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
fusing the first target feature vector sequence and the second target feature vector sequence to obtain fused feature data;
and determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fused feature data.
Therefore, feature extraction processing is performed separately on the picture sequence to be processed and on the text to be processed to obtain the feature vector sequences corresponding to each of them, and the two feature vector sequences are then fused to obtain fused feature data. Based on the fused feature data, a plurality of pictures and a plurality of passages of text can be processed at one time, and a single result describing the overall image-text association degree between the picture sequence to be processed and the text to be processed is output. This improves the processing efficiency of the image-text association degree, reduces the time cost, and facilitates online deployment.
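For ease of understanding, the following is a minimal sketch of how the four steps above could be wired together. It is illustrative only: every component passed in (picture_encoder, text_encoder, fusion, relevance_head) is a hypothetical placeholder rather than a component defined by the present disclosure.
```python
# Minimal sketch of steps S101-S104; every component passed in is a hypothetical placeholder.
from typing import Callable, Sequence

def determine_image_text_association(pictures: Sequence, text: str,
                                     picture_encoder: Callable, text_encoder: Callable,
                                     fusion: Callable, relevance_head: Callable) -> float:
    # S101: the picture sequence to be processed and the text to be processed are given.
    # S102: first / second feature extraction -> the two target feature vector sequences.
    first_target_sequence = picture_encoder(pictures)   # one feature vector per picture
    second_target_sequence = text_encoder(text)         # one feature vector per word
    # S103: fuse the two sequences to obtain the fused feature data.
    fused_feature_data = fusion(first_target_sequence, second_target_sequence)
    # S104: map the fused feature data to a single image-text association degree.
    return float(relevance_head(fused_feature_data))
```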
In an alternative embodiment, the sequence of pictures to be processed and the text to be processed are from the same article;
the acquiring of the to-be-processed picture sequence and the to-be-processed text comprises:
acquiring each frame of picture to be processed from the article, and forming a picture sequence to be processed according to the sequence of each frame of picture to be processed in the article;
and acquiring each section of text from the article, and splicing each section of text according to the sequence of each section of text in the article to generate the text to be processed.
Therefore, the processing result of the overall image-text association degree of the article can be obtained by processing the multiple images and the multiple sections of texts in the article, so that the association degree between the multiple images and the multiple sections of texts in the article can be rapidly judged, the processing efficiency is high, and the online deployment requirement is met.
In an optional embodiment, the method further comprises:
determining whether the image-text association degree between the image sequence to be processed and the text to be processed is greater than or equal to a preset image-text association degree threshold value;
and in response to the image-text association degree being greater than or equal to the preset image-text association degree threshold value, generating push information for the article.
Therefore, by presetting the image-text association degree threshold value, the image-text association degree between the image sequence to be processed and the text to be processed can be screened, and articles with higher image-text association degree can be pushed, so that the quality of information pushed for users is improved in some application scenes.
In an optional embodiment, the obtaining a to-be-processed picture sequence includes:
obtaining semantic information of the text to be processed;
screening multiple frames of pictures to be processed from multiple frames of alternative pictures based on the semantic information;
and generating the picture sequence to be processed based on the screened multiple frames of pictures to be processed.
Therefore, the picture to be processed can be screened from the multiple frames of alternative pictures through the semantic information of the text to be processed, the picture to be processed meeting the semantic information of the text input by the user can be rapidly determined for the user, and the screening requirement of the user on the picture is met.
In an optional embodiment, there are a plurality of the picture sequences to be processed; the method further comprises the following steps:
determining a target picture sequence from the plurality of picture sequences to be processed based on the picture-text association degrees between the plurality of picture sequences to be processed and the text to be processed respectively;
and generating an article based on the target picture sequence and the text to be processed.
Therefore, the target picture sequence is determined according to the image-text association degrees of the plurality of picture sequences to be processed and the text to be processed, in some application scenes, pictures with higher association degrees with the text can be determined for the text to be processed with higher efficiency and accuracy, and corresponding articles can be generated according to the pictures and the text.
In an optional implementation manner, the performing first feature extraction processing on the picture sequence to be processed to obtain a first target feature vector sequence, and performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence, includes:
respectively performing feature extraction processing on each frame of picture in the picture sequence to be processed to obtain a first data sequence corresponding to the picture sequence to be processed, and performing self-attention processing on the first data sequence to obtain a first target feature vector sequence corresponding to the picture sequence to be processed; the first data sequence comprises original characteristic data corresponding to each frame of the picture; the first target feature vector sequence comprises first target feature vectors corresponding to the pictures of each frame respectively; and
converting the text to be processed into a second data sequence, and performing self-attention processing on the second data sequence to obtain a second target feature vector sequence corresponding to the text to be processed; the second data sequence comprises coded data corresponding to each vocabulary in the text to be processed; the second target feature vector sequence comprises second target feature vectors corresponding to a plurality of words respectively.
In an optional implementation manner, the performing feature extraction processing on each frame of picture in the sequence of pictures to be processed respectively to obtain a first data sequence corresponding to the sequence of pictures to be processed, and performing self-attention processing on the first data sequence to obtain the first target feature vector sequence corresponding to the sequence of pictures to be processed includes:
respectively processing each frame of picture in the picture sequence to be processed by utilizing a pre-trained picture processing model to obtain original characteristic data corresponding to each frame of picture;
the first data sequence is formed on the basis of original characteristic data corresponding to the pictures of each frame;
performing self-attention processing on the first data sequence to obtain a first intermediate characteristic sequence, and fusing the first data sequence and the first intermediate characteristic sequence to obtain a second intermediate characteristic sequence; the first intermediate characteristic sequence comprises first intermediate characteristic vectors respectively corresponding to multiple frames of pictures; the second intermediate characteristic sequence comprises second intermediate characteristic vectors respectively corresponding to multiple frames of pictures;
and carrying out full connection processing on the second intermediate characteristic sequence at least once to obtain the first target characteristic vector sequence.
In an optional embodiment, the performing self-attention processing on the first data sequence to obtain a first intermediate feature sequence, and fusing the first data sequence and the first intermediate feature sequence to obtain a second intermediate feature sequence includes:
determining self-attention weights corresponding to a plurality of original feature data in the first data sequence respectively by using a self-attention network;
weighting each original feature data by using the self-attention weight corresponding to each original feature data to obtain a first intermediate feature vector corresponding to each original feature data;
forming the first intermediate feature sequence based on first intermediate feature vectors respectively corresponding to the plurality of original feature data;
carrying out element-wise (position-aligned) addition on each original feature data in the first data sequence and the corresponding first intermediate feature vector in the first intermediate feature sequence to obtain a third intermediate feature sequence; the third intermediate feature sequence comprises third intermediate feature vectors respectively corresponding to the multiple frames of pictures;
and carrying out layer normalization processing on the third intermediate characteristic sequence to obtain the second intermediate characteristic sequence.
In an optional implementation manner, the fusing the first target feature vector sequence and the second target feature vector sequence to obtain fused feature data includes:
performing first data mapping processing on the first target characteristic vector sequence to obtain a first data matrix and a second data matrix; performing second data mapping processing on the second target characteristic vector sequence to obtain a third data matrix;
and performing cross attention processing based on the first data matrix, the second data matrix and the third data matrix to obtain the fusion characteristic data.
In a second aspect, an embodiment of the present disclosure further provides an apparatus for determining a degree of image-text association, including:
the acquisition module is used for acquiring a picture sequence to be processed and a text to be processed;
the characteristic extraction module is used for carrying out first characteristic extraction processing on the picture sequence to be processed to obtain a first target characteristic vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
the fusion module is used for fusing the first target characteristic vector sequence and the second target characteristic vector sequence to obtain fusion characteristic data;
and the determining module is used for determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fusion characteristic data.
In an alternative embodiment, the sequence of pictures to be processed and the text to be processed are from the same article;
the acquisition module is configured to:
acquiring each frame of picture to be processed from the article, and forming a picture sequence to be processed according to the sequence of each frame of picture to be processed in the article;
and acquiring each section of text from the article, and splicing each section of text according to the sequence of each section of text in the article to generate the text to be processed.
In an optional implementation manner, the apparatus further includes a generation module configured to:
determining whether the image-text association degree between the image sequence to be processed and the text to be processed is greater than or equal to a preset image-text association degree threshold value;
and responding to the image-text association degree which is larger than or equal to a preset image-text association degree threshold value, and generating the pushing information of the article.
In an optional embodiment, when acquiring the sequence of pictures to be processed, the acquiring module is configured to:
obtaining semantic information of the text to be processed;
screening multiple frames of pictures to be processed from multiple frames of alternative pictures based on the semantic information;
and generating the picture sequence to be processed based on the screened multiple frames of pictures to be processed.
In an optional embodiment, there are a plurality of the to-be-processed picture sequences, and the obtaining module is further configured to:
determining a target picture sequence from the plurality of picture sequences to be processed based on the picture-text association degrees between the plurality of picture sequences to be processed and the text to be processed respectively;
and generating an article based on the target picture sequence and the text to be processed.
In an optional implementation, the feature extraction module is configured to:
respectively performing feature extraction processing on each frame of picture in the picture sequence to be processed to obtain a first data sequence corresponding to the picture sequence to be processed, and performing self-attention processing on the first data sequence to obtain a first target feature vector sequence corresponding to the picture sequence to be processed; the first data sequence comprises original characteristic data corresponding to each frame of the pictures respectively; the first target feature vector sequence comprises first target feature vectors corresponding to the pictures of each frame respectively; and
converting the text to be processed into a second data sequence, and performing self-attention processing on the second data sequence to obtain a second target feature vector sequence corresponding to the text to be processed; the second data sequence comprises coded data corresponding to each vocabulary in the text to be processed; the second target feature vector sequence comprises second target feature vectors corresponding to a plurality of words respectively.
In an optional implementation manner, when the feature extraction module performs feature extraction processing on each frame of picture in the to-be-processed picture sequence to obtain a first data sequence corresponding to the to-be-processed picture sequence, and performs self-attention processing on the first data sequence to obtain the first target feature vector sequence corresponding to the to-be-processed picture sequence, the feature extraction module is further configured to:
respectively processing each frame of picture in the picture sequence to be processed by utilizing a pre-trained picture processing model to obtain original characteristic data corresponding to each frame of picture;
the first data sequence is formed on the basis of original characteristic data corresponding to the pictures of each frame;
performing self-attention processing on the first data sequence to obtain a first intermediate characteristic sequence, and fusing the first data sequence and the first intermediate characteristic sequence to obtain a second intermediate characteristic sequence; the first intermediate characteristic sequence comprises first intermediate characteristic vectors respectively corresponding to multiple frames of pictures; the second intermediate characteristic sequence comprises second intermediate characteristic vectors respectively corresponding to multiple frames of pictures;
and carrying out full connection processing on the second intermediate characteristic sequence at least once to obtain the first target characteristic vector sequence.
In an optional implementation manner, when the feature extraction module performs self-attention processing on the first data sequence to obtain a first intermediate feature sequence, and fuses the first data sequence and the first intermediate feature sequence to obtain a second intermediate feature sequence, the feature extraction module is further configured to:
determining self-attention weights corresponding to a plurality of original feature data in the first data sequence respectively by using a self-attention network;
weighting each original feature data by using the self-attention weight corresponding to each original feature data to obtain a first intermediate feature vector corresponding to each original feature data;
forming the first intermediate feature sequence based on first intermediate feature vectors respectively corresponding to the plurality of original feature data;
respectively carrying out element-wise (position-aligned) addition on each original feature data in the first data sequence and the corresponding first intermediate feature vector in the first intermediate feature sequence to obtain a third intermediate feature sequence; the third intermediate feature sequence comprises third intermediate feature vectors respectively corresponding to the multiple frames of pictures;
and carrying out layer normalization processing on the third intermediate characteristic sequence to obtain the second intermediate characteristic sequence.
Thus, each original feature datum and its corresponding first intermediate feature vector are added element-wise, which avoids losing the original feature information.
In an alternative embodiment, the fusion module is configured to:
performing first data mapping processing on the first target characteristic vector sequence to obtain a first data matrix and a second data matrix; performing second data mapping processing on the second target characteristic vector sequence to obtain a third data matrix;
and performing cross attention processing on the basis of the first data matrix, the second data matrix and the third data matrix to obtain the fusion characteristic data.
In a third aspect, the present disclosure further provides a computer device, including a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when executed by the processor, the machine-readable instructions cause the processor to perform the steps in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed, performs the steps in the first aspect or any one of the possible implementation manners of the first aspect.
For the description of the effects of the image-text association degree determining apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the image-text association degree determining method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for use in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, since those skilled in the art can derive other related drawings from these drawings without inventive effort.
Fig. 1 illustrates a flowchart of a method for determining a degree of image-text relevance according to some embodiments of the present disclosure;
fig. 2 illustrates a flowchart of obtaining a first target feature vector sequence according to some embodiments of the present disclosure;
fig. 3 is a diagram illustrating a specific example of a graph-text relevancy prediction model according to some embodiments of the disclosure;
fig. 4 is a diagram illustrating another exemplary graph and text relevance prediction model provided by some embodiments of the disclosure;
fig. 5 is a diagram illustrating a specific example of another graph-text relevancy prediction model according to some embodiments of the disclosure;
fig. 6 illustrates a specific example diagram of yet another graph-text relevance prediction model provided by some embodiments of the present disclosure;
fig. 7 is a schematic diagram illustrating an apparatus for determining a degree of image-text relevance according to an embodiment of the disclosure;
fig. 8 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the disclosure is not intended to limit the scope of the disclosure as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
Research shows that, as the threshold for entering self-media becomes ever lower, a large number of low-quality articles appear on the internet. In these low-quality articles the semantics of the text content and of the picture content do not match, and users are often disturbed by such articles while browsing, which makes it difficult for them to obtain the information they want and degrades their experience.
For this situation, some common methods introduce an image-text association degree model to calculate whether the semantics of the pictures and texts in an article match, and then use the quantified data obtained from this analysis as a basis for identifying low-quality articles.
However, some existing online image-text association degree models can only process the association degree between a single picture and a single piece of text. When the image-text association degree of an article is to be calculated, each picture and its corresponding text can only be fed into the existing model separately, and the image-text association degree of the whole article is then obtained through operations such as weighted averaging.
In addition, when an article is generated, in many scenarios a plurality of corresponding illustrations need to be matched to the article in addition to the text. Automatically screening illustrations from a picture library using the image-text association degree greatly facilitates use by users. However, because of the limitations of the above manner of determining the image-text association degree, determining a plurality of pictures for a text is inefficient, and generating an article is therefore also inefficient.
Based on the research, the present disclosure provides a method and an apparatus for determining a degree of association between graphics and text, a computer device, and a storage medium. The method can process a plurality of pictures and a plurality of sections of texts at one time, and further output the processing result of the picture sequence to be processed and the overall picture-text association degree of the text to be processed. The processing efficiency of determining the image-text association degree of the multi-frame images and the multi-section texts is improved, the consumed time cost is reduced, and online deployment is facilitated.
The above-mentioned drawbacks were identified by the inventor only after practice and careful study. Therefore, the discovery of the above problems, and the solutions proposed below for them, should both be regarded as contributions made by the inventor in the course of the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In addition, before the technical solutions disclosed in the following embodiments of the present disclosure are used, the type, the use range, the use scene, etc. of the personal information related to the present disclosure should be informed to the user and obtain the authorization of the user through a proper manner according to the relevant laws and regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly prompt the user that the operation requested will require acquiring and using the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server, or a storage medium, that performs the operations of the disclosed technical solution.
As an optional but non-limiting implementation manner, in response to receiving an active request from the user, the manner of sending the prompt information to the user may be, for example, a pop-up window, and the prompt information may be presented in a text manner in the pop-up window. In addition, a selection control for providing personal information to the electronic device by the user's selection of "agreeing" or "disagreeing" can be carried in the pop-up window.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
To facilitate understanding of the embodiments, a method for determining an image-text association degree disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the method for determining an image-text association degree provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, including for example a touch terminal and a personal computer (PC) terminal, or a server or other processing device. The touch terminal includes, for example, smart phones, tablet computers, and the like; the PC terminal includes, for example, desktop computers, notebook computers, and the like.
The following describes a method for determining a degree of association between pictures and texts provided by an embodiment of the present disclosure.
Referring to fig. 1, a flowchart of a method for determining an image-text relevance provided by an embodiment of the present disclosure is shown, where the method includes steps S101 to S104, where:
s101, acquiring a picture sequence to be processed and a text to be processed;
here, the manner for acquiring the to-be-processed picture sequence and the to-be-processed text includes, but is not limited to, acquiring from an article uploaded by a user or acquiring from some open-source data set.
Some of the open source data sets include some downloadable picture data sets and text data sets on the internet, etc.
In some embodiments provided by the present disclosure, the sequence of pictures to be processed and the text to be processed originate from the same article; acquiring each frame of picture to be processed from the article, and forming a picture sequence to be processed according to the sequence of each frame of picture to be processed in the article; and acquiring each section of text from the article, and splicing each section of text according to the sequence of each section of text in the article to generate the text to be processed.
For example, on a self-media platform that carries the method for determining the image-text association degree provided by the present disclosure, a user uploads a written article through the publishing function provided by the platform, and the platform performs a content review on the uploaded article. In the image-text association degree review stage, the platform obtains the layout information of the article, determines the number of pictures in the article according to the layout information, and generates the picture sequence to be processed according to the layout order of the pictures in the article. Meanwhile, the text paragraph information of the article is determined according to the layout information, and the text to be processed is generated from the multiple passages of text according to their layout order in the article. After the picture sequence to be processed and the text to be processed are obtained, they are respectively input into the image-text association degree model to obtain the image-text association degree of the article uploaded by the user. Further, the image-text association degree can serve as an important criterion for judging the quality of the article; if the image-text association degree of the article is low, the platform returns a notice that the article has not passed review and rejects the article the user attempted to publish.
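The following is a minimal sketch of this acquisition step. It assumes a simplified article representation, an ordered list of (block_type, content) pairs produced by a layout parser; this representation and all names in the sketch are illustrative assumptions, not the platform's actual data format.
```python
# Sketch of S101 under an assumed article representation: an ordered list of
# (block_type, content) pairs following the article's layout order. The format and
# all names are illustrative, not the platform's actual data structures.
from typing import List, Tuple

def split_article(blocks: List[Tuple[str, object]]) -> Tuple[List[object], str]:
    pictures, paragraphs = [], []
    for block_type, content in blocks:            # blocks already follow the layout order
        if block_type == "picture":
            pictures.append(content)              # picture sequence keeps the layout order
        elif block_type == "text":
            paragraphs.append(str(content))       # text paragraphs keep the layout order
    return pictures, "".join(paragraphs)          # splice paragraphs into the text to be processed

# Example: a three-block article yields a one-picture sequence and a spliced text.
blocks = [("text", "Opening paragraph."), ("picture", "img_0.png"), ("text", "Body paragraph.")]
picture_sequence, text_to_process = split_article(blocks)
```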
In some embodiments provided by the present disclosure, it is determined whether a degree of teletext association between the sequence of pictures to be processed and the text to be processed is greater than or equal to a preset degree of teletext association threshold; and responding to the image-text association degree larger than or equal to a preset image-text association degree threshold value, and generating the pushing information of the article.
Here, the preset image-text association degree threshold value may be determined by using a deep learning algorithm. For example, a threshold to be trained is first determined; some training samples are then input into the image-text association degree model, which outputs an image-text association degree score for each training sample. A prediction result, either associated or not associated, is determined for each training sample from its score and the threshold to be trained, and the prediction result is compared with the standard result of the training sample, which yields four possible training outcomes. In the first outcome, the standard result of the training sample is associated and the prediction result is associated; in the second, the standard result is associated and the prediction result is not associated; in the third, the standard result is not associated and the prediction result is associated; in the fourth, the standard result is not associated and the prediction result is not associated. The threshold to be trained is corrected over multiple rounds of training so that the proportion of the first and fourth outcomes among all training outcomes increases, yielding the final preset image-text association degree threshold value.
Illustratively, after the preset image-text association degree threshold value is determined, an article uploaded by a user is input into the image-text association degree model to obtain an image-text association degree score, and this score is compared with the preset image-text association degree threshold value. If the score is greater than or equal to the preset threshold value and no problem is found in the other review stages, the article is determined to have passed review, push information is generated, and the article is published.
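As one possible, simplified realization of the threshold tuning and screening described above, the following sketch sweeps candidate thresholds over scored training samples and keeps the one that maximizes the share of the first and fourth outcomes; the sample data and the sweep granularity are illustrative assumptions rather than the disclosure's training procedure.
```python
# Simplified sketch of tuning the preset threshold: sweep candidate thresholds over
# (score, label) training samples and keep the one maximizing the share of correct
# "associated" / "not associated" predictions (the first and fourth outcomes above).
# The sample data and the 0.01 sweep step are illustrative assumptions.
def tune_threshold(scores, labels, candidates=None):
    candidates = candidates or [i / 100 for i in range(1, 100)]
    def accuracy(t):
        return sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(scores)
    return max(candidates, key=accuracy)

train_scores = [0.91, 0.34, 0.76, 0.12, 0.58]    # model outputs for training samples
train_labels = [1, 0, 1, 0, 1]                   # standard results: 1 = associated
preset_threshold = tune_threshold(train_scores, train_labels)

def passes_review(article_score: float) -> bool:
    # Push information is generated only when the score reaches the preset threshold.
    return article_score >= preset_threshold
```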
The image-text relevancy model is described in detail in some embodiments of the present disclosure below, and is not described in detail herein.
In some embodiments provided by the present disclosure, semantic information of the text to be processed is obtained;
screening multiple frames of pictures to be processed from multiple frames of alternative pictures based on the semantic information;
and generating the picture sequence to be processed based on the screened multiple frames of pictures to be processed.
Illustratively, semantic information of the text to be processed can be obtained from the text input by a user; pictures are then searched for and retrieved from some open-source picture data sets according to this semantic information, and the picture sequence to be processed is generated. The semantic information of the text to be processed may contain a plurality of semantics, and for each semantic at least one corresponding picture is searched for and retrieved from the picture data set. If a plurality of pictures corresponding to each semantic are retrieved, the pictures can be grouped to generate a plurality of picture sequences to be processed, where each picture in a picture sequence to be processed corresponds to a different semantic. For example, the semantic information of the text to be processed includes semantic 1, semantic 2 and semantic 3; according to this semantic information, picture A1, picture A2 and picture A3 corresponding to semantic 1, picture B1, picture B2 and picture B3 corresponding to semantic 2, and picture C1, picture C2 and picture C3 corresponding to semantic 3 are retrieved from the picture data set. Correspondingly, picture sequence to be processed 1 is generated, including picture A1, picture B1 and picture C1; picture sequence to be processed 2 includes picture A2, picture B2 and picture C2; and picture sequence to be processed 3 includes picture A3, picture B3 and picture C3. Acquiring pictures from a picture data set according to the semantic information of the text to be processed may result in a very large number of pictures being acquired, which would produce a large number of picture sequences to be processed and increase the consumption of computing and storage resources. Therefore, an acquisition threshold can be set, so that the number of pictures acquired for each semantic in the semantic information of the text to be processed is kept within a certain range and the number of generated picture sequences to be processed is controlled. Moreover, the pictures in each picture sequence to be processed can be made to correspond one-to-one to the semantics, which avoids the situation in which some picture sequences to be processed lack a picture for one of the semantics.
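The grouping described above can be sketched as follows, reproducing the A1/B1/C1 example: for each semantic at most a capped number of candidate pictures is kept, and sequence i is formed from the i-th candidate of every semantic, so that each picture sequence to be processed holds exactly one picture per semantic. The dictionary format and the cap value are illustrative assumptions.
```python
# Sketch of grouping retrieved pictures into picture sequences to be processed:
# keep at most `cap` candidates per semantic, then let sequence i take the i-th
# candidate of every semantic, so each sequence holds one picture per semantic.
# The dictionary format and the cap value are illustrative assumptions.
def build_picture_sequences(candidates_by_semantic: dict, cap: int = 3) -> list:
    capped = {sem: pics[:cap] for sem, pics in candidates_by_semantic.items()}
    n_sequences = min(len(pics) for pics in capped.values())   # one picture per semantic
    return [[capped[sem][i] for sem in capped] for i in range(n_sequences)]

candidates = {
    "semantic 1": ["A1", "A2", "A3"],
    "semantic 2": ["B1", "B2", "B3"],
    "semantic 3": ["C1", "C2", "C3"],
}
# -> [["A1", "B1", "C1"], ["A2", "B2", "C2"], ["A3", "B3", "C3"]]
picture_sequences_to_process = build_picture_sequences(candidates)
```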
In a conceivable embodiment, a text may also be searched and derived from some open-source text libraries according to semantic information of a picture in a to-be-processed picture sequence input by a user, and a specific processing process is similar to the above process of acquiring the to-be-processed picture sequence according to the semantic information of the to-be-processed text, which is not described in detail in the present disclosure.
In some embodiments provided by the present disclosure, the to-be-processed picture sequence is multiple; determining a target picture sequence from the plurality of picture sequences to be processed based on the picture-text association degrees between the plurality of picture sequences to be processed and the text to be processed respectively; and generating an article based on the target picture sequence and the text to be processed.
Illustratively, in a conceivable application scenario, a user needs to add illustrations to an article when creating it. In this case, the user may input the written article into a target application carrying the method. The target application generates the text to be processed from the text paragraph information of the article input by the user and inputs it into the image-text association degree prediction model; on the other hand, a plurality of pre-prepared picture sequences to be processed, retrieved from a picture data set according to the semantic information of the text to be processed, are input into the image-text association degree prediction model in turn, and the model outputs a score of the image-text association degree between each picture sequence to be processed and the text to be processed. When the scores of a plurality of picture sequences to be processed are greater than or equal to the preset image-text association degree threshold value, the prediction results may be output to the user so that the user selects one target picture sequence from these picture sequences to be processed, or the picture sequence to be processed with the highest image-text association degree score may be directly selected as the target picture sequence. After the target picture sequence is determined, an article is generated according to the semantic relation between the target picture sequence and the text to be processed.
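A minimal sketch of this selection step is given below; score_fn stands in for the image-text association degree prediction model and, like the other names, is an assumed placeholder.
```python
# Sketch of selecting the target picture sequence: score every candidate sequence
# against the text to be processed, keep those reaching the preset threshold, and
# default to the highest-scoring one. `score_fn` is an assumed placeholder for the
# image-text association degree prediction model.
def pick_target_sequence(picture_sequences, text, score_fn, threshold):
    scored = [(score_fn(seq, text), seq) for seq in picture_sequences]
    passing = [(score, seq) for score, seq in scored if score >= threshold]
    if not passing:
        return None                               # no sequence is sufficiently associated
    return max(passing, key=lambda pair: pair[0])[1]
```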
Following the above S101, after the picture sequence to be processed and the text to be processed are acquired, the method further includes:
s102, performing first feature extraction processing on the picture sequence to be processed to obtain a first target feature vector sequence; and performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence.
Illustratively, the image-text association degree prediction model comprises a picture feature extraction unit and a text feature extraction unit. The picture feature extraction unit is configured to perform the first feature extraction processing on the picture sequence to be processed to obtain a first data sequence composed of picture feature data, and the text feature extraction unit is configured to perform the second feature extraction processing on the text to be processed to obtain a second data sequence composed of text feature data. After the data sequences are extracted, the subsequent processing of the first data sequence and of the second data sequence is similar; the feature extraction process for the picture sequence to be processed and the process of obtaining the first target feature vector sequence are therefore described in detail below.
Specifically, in some embodiments provided by the present disclosure, feature extraction processing is performed on each frame of picture in the to-be-processed picture sequence, so as to obtain a first data sequence corresponding to the to-be-processed picture sequence, and self-attention processing is performed on the first data sequence, so as to obtain the first target feature vector sequence corresponding to the to-be-processed picture sequence; the first data sequence comprises original characteristic data corresponding to each frame of the picture; the first target feature vector sequence comprises first target feature vectors corresponding to the pictures of each frame.
Illustratively, after the picture sequence to be processed is imported into the image-text association degree prediction model, it is first imported into the feature extraction unit for feature extraction processing. The feature extraction unit may perform the first feature extraction processing on the picture sequence to be processed by using a pre-trained neural network model; for example, a Contrastive Language-Image Pre-training (CLIP) model is used as the picture encoder, so that each picture in the picture sequence to be processed can be represented by a 512-dimensional vector. After the first feature extraction processing is performed on all pictures in the picture sequence to be processed, the first data sequence corresponding to the picture sequence to be processed is obtained.
Here, in some embodiments provided by the present disclosure, the pre-trained CLIP model is only used to perform the first feature extraction processing on the picture sequence to be processed; after the first data sequence is obtained, it is sent to the self-attention processing unit, and the CLIP model does not participate in the subsequent processing (that is, it is not trained together with the rest of the model). The CLIP model is prior art, and the 512-dimensional vector is only an example: in practical applications, a higher dimension means a larger amount of computation but less information loss, and vectors of other dimensions can be chosen according to the practical application.
In a possible implementation manner, another convolutional neural network model to be trained may also be used to perform the first feature extraction processing on the picture sequence to be processed; if such a model to be trained is used, it needs to participate in the subsequent processing, that is, it is trained together with the rest of the model.
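A sketch of this first feature extraction step is given below, assuming the open-source openai/CLIP package and its ViT-B/32 checkpoint, whose image encoder yields a 512-dimensional vector per picture; any other pre-trained picture encoder with a fixed output dimension could be substituted.
```python
# Sketch of the first feature extraction step, assuming the open-source openai/CLIP
# package and its ViT-B/32 checkpoint, whose image encoder yields one 512-dimensional
# vector per picture; any pre-trained picture encoder could be substituted.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # frozen picture processing model

@torch.no_grad()
def encode_picture_sequence(picture_paths):
    images = torch.stack([preprocess(Image.open(p)) for p in picture_paths]).to(device)
    return model.encode_image(images)             # first data sequence, shape (num_pictures, 512)
```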
Then, after receiving the first data sequence sent by the feature extraction unit, the self-attention processing unit sends the first data sequence to the self-attention processing model for self-attention processing.
Specifically, referring to the flowchart shown in fig. 2 for obtaining the first target feature vector sequence, obtaining the first target feature vector sequence corresponding to the to-be-processed picture sequence at least includes the following steps S201 to S204:
s201, respectively processing each frame of picture in the picture sequence to be processed by utilizing a pre-trained picture processing model to obtain original characteristic data corresponding to each frame of picture.
And S202, forming the first data sequence based on the original characteristic data corresponding to the pictures of each frame.
Here, the pre-trained picture processing model may be the CLIP model described above, and the specific acquisition process is not described in this disclosure.
S203, performing self-attention processing on the first data sequence to obtain a first intermediate feature sequence, and fusing the first data sequence and the first intermediate feature sequence to obtain a second intermediate feature sequence; the first intermediate characteristic sequence comprises first intermediate characteristic vectors respectively corresponding to multiple frames of pictures; the second intermediate characteristic sequence comprises second intermediate characteristic vectors respectively corresponding to the multiple frames of pictures.
Performing self-attention processing on the first data sequence to obtain a first intermediate feature sequence, and fusing the first data sequence and the first intermediate feature sequence to obtain a second intermediate feature sequence, includes at least the following steps S2031 to S2035:
s2031, determining respective self-attention weights corresponding to the plurality of original feature data in the first data sequence by using a self-attention network.
Here, the original feature data in the first data sequence may be represented by a Key-Value pair (Key-Value). Key can be thought of as the address of Value. And determining the self-attention weight for Value according to the similarity of Key and Query.
Illustratively, for example, query is a book, then the self-attention weight is determined according to the similarity between the Key in the first data sequence and the book, for example, key1 in the first data sequence is a newspaper, and its self-attention weight is 0.2; key2 is a paper with a self-attention weight of 0.15; key3 is a book whose self-attention weight is 0.65, and in this way, a plurality of original feature data in the first data sequence are respectively matched with the corresponding self-attention weights.
Here, the Query may be a plurality of queries, and each Query determines a self-attention weight for each of the plurality of raw feature data in the first data sequence.
And S2032, weighting each original feature data by using the self-attention weight corresponding to each original feature data to obtain a first intermediate feature vector corresponding to each original feature data.
Illustratively, after the self-attention weights corresponding to the plurality of original feature data in the first data sequence are determined, the original feature data are weighted to obtain the first intermediate feature vectors. Following the example in S2031 above, the first intermediate feature vectors may be expressed as 0.2 × Value1, 0.15 × Value2 and 0.65 × Value3, where Value1 corresponds to Key1, Value2 corresponds to Key2, and Value3 corresponds to Key3.
S2033, the first intermediate feature sequence is configured based on the first intermediate feature vectors corresponding to the plurality of original feature data, respectively.
S2034, performing alignment addition on each original feature data in the first data sequence and the first intermediate feature vector in the first intermediate feature sequence, respectively, to obtain a third intermediate feature sequence; the third intermediate feature sequence comprises a plurality of frames of third intermediate feature vectors respectively corresponding to the pictures.
Illustratively, the original feature data and the first intermediate feature vectors are added in a position-aligned (element-wise) manner, so that the original feature data are not lost. Following the example in S2032 above, Value1 + (0.2 × Value1), Value2 + (0.15 × Value2), and Value3 + (0.65 × Value3) are obtained; that is, the third intermediate feature vectors are obtained.
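The following tiny numeric sketch reproduces steps S2031 to S2034 with the newspaper/paper/book example above; the two-dimensional example vectors are invented for illustration and are not real feature data.
```python
# Tiny numeric sketch of S2031-S2034 using the newspaper/paper/book example;
# the two-dimensional vectors are invented for illustration, not real feature data.
import numpy as np

values = {"Key1 (newspaper)": np.array([1.0, 2.0]),
          "Key2 (paper)":     np.array([3.0, 1.0]),
          "Key3 (book)":      np.array([0.5, 4.0])}
weights = {"Key1 (newspaper)": 0.2, "Key2 (paper)": 0.15, "Key3 (book)": 0.65}

# S2032: weight each original feature datum -> first intermediate feature vectors.
first_intermediate = {key: weights[key] * value for key, value in values.items()}
# S2034: position-aligned (element-wise) addition with the original feature data,
# i.e. Value + weight * Value, giving the third intermediate feature vectors.
third_intermediate = {key: values[key] + first_intermediate[key] for key in values}
```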
And S2035, performing layer normalization processing on the third intermediate feature sequence to obtain the second intermediate feature sequence.
And performing layer normalization processing on the third intermediate feature sequence, and reducing the magnitude of different third intermediate feature vectors to facilitate calculation. For example, the result obtained by subjecting the third intermediate feature vector to the layer normalization processing is limited to be between 0 and 1, and the second intermediate feature sequence is obtained.
And S204, carrying out full connection processing on the second intermediate characteristic sequence at least once to obtain a first target characteristic vector sequence.
Here, the second intermediate feature sequence is subjected to full connection processing to fit the distribution of the second intermediate feature vectors, thereby increasing robustness. In order to achieve better robustness, the second intermediate feature sequence may be subjected to full connection processing twice. After the two full connection processes, the first target feature vector sequence is obtained.
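Steps S203 to S204 can be sketched as a single module as follows. Single-head scaled dot-product attention, the layer width of 512, and the ReLU between the two fully connected layers are assumptions made for illustration; the disclosure does not fix these choices.
```python
# Sketch of S203-S204 as one module: self-attention over the first data sequence,
# element-wise residual addition (S2034), layer normalization (S2035), then two
# fully connected layers (S204). Single-head attention, the width of 512 and the
# ReLU between the two fully connected layers are illustrative assumptions.
import torch
import torch.nn as nn

class PictureSelfAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.fc1, self.fc2 = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (num_pictures, dim)
        weights = torch.softmax(self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
        first_intermediate = weights @ self.v(x)                  # S2031-S2033
        third_intermediate = x + first_intermediate               # S2034: element-wise addition
        second_intermediate = self.norm(third_intermediate)       # S2035: layer normalization
        return self.fc2(torch.relu(self.fc1(second_intermediate)))   # S204: two FC layers

# Example: a picture sequence of 4 frames, each with a 512-dimensional original feature vector.
first_target_sequence = PictureSelfAttentionBlock()(torch.randn(4, 512))
```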
S102, performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
illustratively, after a text to be processed is led into the image-text relevance prediction model, word segmentation processing is firstly performed on the text to be processed to obtain a 64-dimensional word segmentation vector sequence, and the word segmentation vector sequence is led into the feature extraction unit to be subjected to feature extraction processing. The feature extraction unit may obtain a word segmentation Embedding sequence of the text to be processed by using embedded lookup (Embedding Look-Up), and send the word segmentation Embedding sequence to the text self-attention processing unit corresponding to the text to be processed.
Here, after the word segmentation embedding sequence is sent to the text self-attention processing unit corresponding to the text to be processed, a second target feature vector sequence is obtained, and here, since the processing process of the word segmentation embedding sequence is similar to the processing process of the picture sequence to be processed in S203 to S204, the present disclosure is not described herein in detail.
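A sketch of the embedding lookup for the text branch follows. The toy vocabulary, the whitespace word segmentation, the reading of the "64-dimensional word segmentation vector sequence" as a fixed-length sequence of 64 token ids, and the embedding width are all illustrative assumptions.
```python
# Sketch of the text branch's embedding lookup: segmented words are mapped to ids and
# then to embedding vectors. The toy vocabulary, whitespace segmentation, the reading
# of "64-dimensional" as a 64-token sequence, and the embedding width are assumptions.
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "cat": 3, "sits": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=128, padding_idx=0)

def encode_text(text: str, max_len: int = 64) -> torch.Tensor:
    tokens = text.lower().split()[:max_len]                      # word segmentation (assumed)
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    ids += [vocab["<pad>"]] * (max_len - len(ids))               # pad to a fixed length
    return embedding(torch.tensor(ids))                          # second data sequence: (64, 128)

second_data_sequence = encode_text("A cat sits")
```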
Following the above S102, after the first target feature vector sequence of the picture sequence to be processed and the second target feature vector sequence of the text to be processed are obtained, the method further includes:
s103, fusing the first target characteristic vector sequence and the second target characteristic vector sequence to obtain fused characteristic data.
Specifically, in an embodiment provided by the present disclosure, a first data mapping process is performed on the first target feature vector sequence to obtain a first data matrix K and a second data matrix V; performing second data mapping processing on the second target characteristic vector sequence to obtain a third data matrix Q; and performing cross attention processing on the basis of the first data matrix K, the second data matrix V and the third data matrix Q to obtain the fusion characteristic data.
Exemplarily, the first target feature vector sequence is subjected to one full connection processing with different parameters for each mapping, to obtain a first data matrix K and a second data matrix V with the same dimensionality; the second target feature vector sequence is subjected to one full connection processing to obtain a third data matrix Q. The K, Q and V matrices are input into the cross attention processing unit to obtain cross attention weights, and the cross attention weights are used to weight the original feature data in the K, Q and V matrices to obtain a first intermediate feature sequence. The data in this first intermediate feature sequence and the second target feature vector sequence are added element-wise, and two full connection processes are then performed to obtain a processed target matrix. After the target feature data in the target matrix pass through a global average pooling layer and a fully connected layer, the image-text association degree prediction result is output.
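The fusion and prediction steps S103 to S104 described above can be sketched as follows; all dimensions, and the ReLU between the two fully connected layers, are illustrative assumptions.
```python
# Sketch of S103-S104: the first target feature vector sequence is mapped to K and V,
# the second to Q, cross attention is applied, the result is added element-wise to the
# text-side sequence, passed through two fully connected layers, global average pooling
# and a final fully connected layer that outputs the score. Dimensions are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, pic_dim: int = 512, txt_dim: int = 128):
        super().__init__()
        self.to_k = nn.Linear(pic_dim, txt_dim)        # first data mapping -> K
        self.to_v = nn.Linear(pic_dim, txt_dim)        # first data mapping -> V
        self.to_q = nn.Linear(txt_dim, txt_dim)        # second data mapping -> Q
        self.norm = nn.LayerNorm(txt_dim)
        self.fc1, self.fc2 = nn.Linear(txt_dim, txt_dim), nn.Linear(txt_dim, txt_dim)
        self.out = nn.Linear(txt_dim, 1)               # final score head

    def forward(self, pic_seq, txt_seq):               # (num_pictures, pic_dim), (num_words, txt_dim)
        k, v, q = self.to_k(pic_seq), self.to_v(pic_seq), self.to_q(txt_seq)
        weights = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # cross attention weights
        fused = self.norm(txt_seq + weights @ v)        # element-wise addition + layer normalization
        fused = self.fc2(torch.relu(self.fc1(fused)))   # two full connection processes
        pooled = fused.mean(dim=0)                      # global average pooling over the sequence
        return self.out(pooled)                         # image-text association degree score

score = CrossAttentionFusion()(torch.randn(4, 512), torch.randn(64, 128))
```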
Following the above S103, after the fused feature data is obtained by using S103, the method for determining the image-text association degree provided by the embodiment of the present disclosure further includes:
and S104, determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fusion characteristic data.
For example, when training the model that determines the image-text association degree between the picture sequence to be processed and the text to be processed, the difference between the output of the image-text association degree prediction model and a reference value may be corrected according to a loss function. For example, if the final image-text association degree prediction model outputs a score of the image-text association degree, an MSE loss may be selected to measure the difference between the value output by the model and the target value, and the model may be corrected according to this difference.
For another example, if the final image-text association degree prediction model outputs an associated or not-associated judgment result, log loss or binary cross-entropy may be selected to measure the loss of the image-text association degree prediction model, and the model is corrected according to this loss.
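The two training objectives mentioned above can be sketched as follows; the tensors are illustrative stand-ins for model outputs and labels.
```python
# Sketch of the two training objectives: MSE loss for a score output, binary
# cross-entropy for an associated / not-associated output. Values are illustrative.
import torch
import torch.nn.functional as F

predicted_score = torch.tensor([0.72])        # regression-style model output
target_score = torch.tensor([0.90])           # target value
regression_loss = F.mse_loss(predicted_score, target_score)

predicted_logit = torch.tensor([1.3])         # classification-style model output (logit)
label = torch.tensor([1.0])                   # 1.0 = associated, 0.0 = not associated
classification_loss = F.binary_cross_entropy_with_logits(predicted_logit, label)
```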
In addition, the present disclosure also provides a specific example of a method for determining a degree of image-text relevance, referring to an example diagram of an image-text relevance prediction model described in fig. 3, the image-text relevance prediction model includes an image feature extraction unit, an image self-attention processing unit, a text feature extraction unit, a text self-attention processing unit, and a cross-attention processing unit.
The picture feature extraction unit is configured to convert, in response to receiving the picture sequence to be processed, each picture in the picture sequence to be processed into a 512-dimensional vector. After all pictures in the picture sequence to be processed have been converted into 512-dimensional vectors, a first data sequence corresponding to the picture vectors is generated and sent to the picture self-attention processing unit. Referring to another example diagram of the image-text association degree prediction model shown in fig. 4, the picture self-attention processing unit receives the first data sequence; the first data sequence passes through a self-attention layer X1, a self-attention layer X2 and a self-attention layer X3 respectively to obtain the self-attention weight corresponding to each self-attention layer, the first data sequence is weighted according to the self-attention weight corresponding to each self-attention layer to obtain weighted first data sequences, and the weighted first data sequences are spliced to obtain a first intermediate feature sequence of the first data sequence. The feature data in the first intermediate feature sequence and the feature data in the first data sequence are connected by a residual connection to obtain a third intermediate feature sequence, which then passes through layer normalization and fully connected layers A1 and A2 to output the first target feature vector sequence.
The text feature extraction unit, in response to receiving the word segmentation sequence of the text to be processed, converts each word in the word segmentation sequence into a 64-dimensional word vector, generates a second data sequence corresponding to the word vectors, and sends the second data sequence to the text self-attention processing unit. Referring to fig. 5, another exemplary diagram of the image-text relevance prediction model, the text self-attention processing unit receives the second data sequence; the second data sequence passes through a self-attention layer Y1, a self-attention layer Y2, and a self-attention layer Y3 respectively to obtain the self-attention weights corresponding to the respective self-attention layers; the second data sequence is weighted according to the self-attention weight corresponding to each self-attention layer to obtain weighted second data sequences, and the weighted second data sequences are spliced to obtain a first intermediate feature sequence of the second data sequence. The feature data in this first intermediate feature sequence and the feature data in the second data sequence are connected by a residual connection to obtain a third intermediate feature sequence, which, after layer normalization, a fully connected layer B1, and a fully connected layer B2, outputs the second target feature vector sequence.
A matrix K and a matrix V are generated according to the first target feature vector sequence, and a matrix Q is generated according to the second target feature vector sequence. Referring to the exemplary diagram of the image-text relevance prediction model shown in fig. 6, the feature data of the K, Q and V matrices pass through a cross-attention layer Z1, a cross-attention layer Z2, and a cross-attention layer Z3 of the cross-attention processing unit respectively to obtain the cross-attention weights corresponding to each cross-attention layer; the original feature data in the K, Q and V matrices are weighted according to the cross-attention weight corresponding to each cross-attention layer to obtain weighted feature data, and the weighted feature data are spliced to obtain a first intermediate feature sequence. The data in the first intermediate feature sequence and the data in the second target feature vector sequence are connected by a residual connection to obtain a third intermediate feature sequence, which is passed through a layer normalization layer, a fully connected layer C1, and a fully connected layer C2 to output a target matrix; after the target feature data in the target matrix pass through a global average pooling layer, the image-text association degree prediction result is output through a fully connected layer D.
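The following non-limiting sketch shows one way the above units could be chained end to end. The class name ImageTextRelevanceModel, the vocabulary size, and the reduction of the text branch to a plain embedding layer are illustrative assumptions; PictureSelfAttentionUnit and CrossAttentionFusion refer to the sketches given earlier.

import torch
import torch.nn as nn

class ImageTextRelevanceModel(nn.Module):
    """Top-level wiring of the sketched units; the text self-attention branch is simplified here."""
    def __init__(self, vocab_size=30000):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, 64)            # 64-dimensional word vectors
        self.picture_unit = PictureSelfAttentionUnit(dim=512)         # sketch given earlier
        self.fusion = CrossAttentionFusion(img_dim=512, txt_dim=64)   # sketch given earlier

    def forward(self, picture_vectors, token_ids):
        # picture_vectors: (B, N_pictures, 512) from a pretrained picture backbone
        # token_ids:       (B, N_tokens) word-segmented text
        img_feats = self.picture_unit(picture_vectors)
        txt_feats = self.word_embedding(token_ids)     # text self-attention unit omitted in this sketch
        return self.fusion(img_feats, txt_feats)       # image-text association degree prediction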
It will be understood by those skilled in the art that, in the above method of the specific embodiments, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order in which the steps are executed should be determined by their functions and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a device for determining a degree of image-text association corresponding to the method for determining a degree of image-text association, and since the principle of solving the problem of the device in the embodiment of the present disclosure is similar to the method for determining the degree of image-text association described in the embodiment of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 7, which is a schematic diagram of an apparatus for determining the image-text association degree provided in an embodiment of the present disclosure, the apparatus includes: an acquisition module 71, a feature extraction module 72, a fusion module 73, and a determination module 74; wherein:
an obtaining module 71, configured to obtain a to-be-processed picture sequence and a to-be-processed text;
the feature extraction module 72 is configured to perform a first feature extraction process on the to-be-processed picture sequence to obtain a first target feature vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
a fusion module 73, configured to perform fusion processing on the first target feature vector sequence and the second target feature vector sequence to obtain fusion feature data;
a determining module 74, configured to determine, based on the fusion feature data, a degree of image-text association between the to-be-processed picture sequence and the to-be-processed text.
In an alternative embodiment, the sequence of pictures to be processed and the text to be processed are from the same article;
the obtaining module 71 is configured to:
acquiring each frame of picture to be processed from the article, and forming a picture sequence to be processed according to the sequence of each frame of picture to be processed in the article;
and acquiring each section of text from the article, and splicing each section of text according to the sequence of each section of text in the article to generate the text to be processed.
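Exemplarily, the acquisition step may be sketched as follows; the representation of an article as an ordered list of (kind, content) blocks is an assumption made for illustration only.

def split_article(blocks):
    """Separate an article into an ordered picture sequence and one concatenated text.

    `blocks` is an assumed representation: a list of (kind, content) pairs kept in
    article order, where kind is either "picture" or "text".
    """
    picture_sequence = [content for kind, content in blocks if kind == "picture"]
    text_to_process = "".join(content for kind, content in blocks if kind == "text")
    return picture_sequence, text_to_process

# Example: pictures and text segments keep the order they have in the article.
blocks = [("text", "Paragraph one."), ("picture", "img_0.jpg"),
          ("text", "Paragraph two."), ("picture", "img_1.jpg")]
pictures, text = split_article(blocks)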
In an optional embodiment, the apparatus further comprises a generating module 75 configured to:
determining whether the image-text association degree between the picture sequence to be processed and the text to be processed is greater than or equal to a preset image-text association degree threshold value;
and in response to the image-text association degree being greater than or equal to the preset image-text association degree threshold value, generating push information for the article.
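A minimal sketch of this decision, assuming an illustrative threshold value and push payload, is:

RELEVANCE_THRESHOLD = 0.6   # assumed value; the preset threshold is not specified in the disclosure

def maybe_generate_push(article_id, relevance_score):
    """Generate push information for an article only when the predicted image-text
    association degree reaches the preset threshold."""
    if relevance_score >= RELEVANCE_THRESHOLD:
        return {"article_id": article_id, "action": "push"}   # illustrative push payload
    return None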
In an optional embodiment, the obtaining module 71, when obtaining the sequence of pictures to be processed, is configured to:
obtaining semantic information of the text to be processed;
screening multiple frames of pictures to be processed from multiple frames of alternative pictures based on the semantic information;
and generating the picture sequence to be processed based on the screened multiple frames of pictures to be processed.
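One plain, non-limiting reading of this screening step is to embed the text and the candidate pictures in a shared semantic space and keep the most similar candidates; the following sketch assumes such embeddings are already available and is not necessarily the mechanism adopted in the embodiments.

import torch
import torch.nn.functional as F

def select_pictures(text_embedding, candidate_embeddings, top_k=5):
    """Keep the top-k candidate pictures most similar to the text's semantic embedding.

    text_embedding:       (D,)   embedding of the text to be processed
    candidate_embeddings: (M, D) one embedding per candidate picture
    """
    sims = F.cosine_similarity(candidate_embeddings, text_embedding.unsqueeze(0), dim=-1)
    keep = torch.topk(sims, k=min(top_k, candidate_embeddings.size(0))).indices
    return keep.sort().values   # preserve the original picture order in the generated sequence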
In an optional embodiment, there are a plurality of the to-be-processed picture sequences, and the obtaining module 71 is further configured to:
determining a target picture sequence from the plurality of picture sequences to be processed based on the picture-text association degrees between the plurality of picture sequences to be processed and the text to be processed respectively;
and generating an article based on the target picture sequence and the text to be processed.
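Exemplarily, selecting the target picture sequence may be sketched as follows; relevance_model is assumed to be a callable returning the image-text association degree for a candidate picture sequence and the text, for example the prediction model sketched earlier.

def pick_target_sequence(candidate_sequences, text_to_process, relevance_model):
    """Choose, among several candidate picture sequences, the one with the highest
    predicted image-text association degree with the text to be processed."""
    scores = [relevance_model(seq, text_to_process) for seq in candidate_sequences]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidate_sequences[best_index]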
In an alternative embodiment, the feature extraction module 72 is configured to:
respectively performing feature extraction processing on each frame of picture in the picture sequence to be processed to obtain a first data sequence corresponding to the picture sequence to be processed, and performing self-attention processing on the first data sequence to obtain a first target feature vector sequence corresponding to the picture sequence to be processed; the first data sequence comprises original characteristic data corresponding to each frame of the picture; the first target feature vector sequence comprises first target feature vectors corresponding to the pictures of each frame respectively; and
converting the text to be processed into a second data sequence, and performing self-attention processing on the second data sequence to obtain a second target feature vector sequence corresponding to the text to be processed; the second data sequence comprises coded data corresponding to each vocabulary in the text to be processed; the second target feature vector sequence comprises second target feature vectors corresponding to a plurality of words respectively.
In an optional implementation manner, when the feature extraction module 72 performs feature extraction processing on each frame of picture in the to-be-processed picture sequence to obtain a first data sequence corresponding to the to-be-processed picture sequence, and performs self-attention processing on the first data sequence to obtain the first target feature vector sequence corresponding to the to-be-processed picture sequence, the feature extraction module is further configured to:
respectively processing each frame of picture in the picture sequence to be processed by utilizing a pre-trained picture processing model to obtain original characteristic data corresponding to each frame of picture;
the first data sequence is formed on the basis of the original characteristic data corresponding to the pictures of each frame;
performing self-attention processing on the first data sequence to obtain a first intermediate characteristic sequence, and fusing the first data sequence and the first intermediate characteristic sequence to obtain a second intermediate characteristic sequence; the first intermediate characteristic sequence comprises first intermediate characteristic vectors respectively corresponding to multiple frames of pictures; the second intermediate characteristic sequence comprises second intermediate characteristic vectors respectively corresponding to multiple frames of pictures;
and carrying out full connection processing on the second intermediate characteristic sequence at least once to obtain the first target characteristic vector sequence.
In an optional embodiment, when the feature extraction module 72 performs the self-attention processing on the first data sequence to obtain a first intermediate feature sequence, and fuses the first data sequence and the first intermediate feature sequence to obtain a second intermediate feature sequence, the feature extraction module is further configured to:
determining self-attention weights corresponding to a plurality of original feature data in the first data sequence respectively by using a self-attention network;
weighting each original feature data by using the self-attention weight corresponding to each original feature data to obtain a first intermediate feature vector corresponding to each original feature data;
forming the first intermediate feature sequence based on first intermediate feature vectors respectively corresponding to the plurality of original feature data;
respectively carrying out alignment addition on each original feature data in the first data sequence and the first intermediate feature vector in the first intermediate feature sequence to obtain a third intermediate feature sequence; the third intermediate feature sequence comprises third intermediate feature vectors respectively corresponding to the plurality of frames of pictures;
and carrying out layer normalization processing on the third intermediate characteristic sequence to obtain the second intermediate characteristic sequence.
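A tensor-level, non-limiting sketch of the weighting, alignment addition and layer normalization steps described above, using toy shapes of 4 pictures with 512-dimensional original feature data, is:

import torch
import torch.nn.functional as F

# Toy shapes: 4 pictures, 512-dimensional original feature data each (values are illustrative).
first_data_sequence = torch.randn(4, 512)

# Self-attention weights over the original feature data (a single-head sketch).
scores = first_data_sequence @ first_data_sequence.T / 512 ** 0.5
self_attention_weights = torch.softmax(scores, dim=-1)             # (4, 4)

# Weight the original feature data to form the first intermediate feature sequence.
first_intermediate = self_attention_weights @ first_data_sequence  # (4, 512)

# Alignment (element-wise) addition with the original data gives the third intermediate sequence,
# and layer normalization over the feature dimension gives the second intermediate sequence.
third_intermediate = first_intermediate + first_data_sequence
second_intermediate = F.layer_norm(third_intermediate, normalized_shape=(512,))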
In an alternative embodiment, the fusion module 73 is configured to:
performing first data mapping processing on the first target characteristic vector sequence to obtain a first data matrix and a second data matrix; performing second data mapping processing on the second target characteristic vector sequence to obtain a third data matrix;
and performing cross attention processing based on the first data matrix, the second data matrix and the third data matrix to obtain the fusion characteristic data.
The description of the processing flow of each module in the apparatus and the interaction flow between the modules may refer to the relevant description in the above method embodiments, and will not be described in detail here.
An embodiment of the present disclosure further provides a computer device, as shown in fig. 8, which is a schematic structural diagram of the computer device provided in the embodiment of the present disclosure, and includes:
a processor 81 and a memory 82; the memory 82 stores machine-readable instructions executable by the processor 81, and the processor 81 is configured to execute the machine-readable instructions stored in the memory 82; when the machine-readable instructions are executed by the processor 81, the processor 81 performs the following steps:
acquiring a picture sequence to be processed and a text to be processed;
performing first feature extraction processing on the picture sequence to be processed to obtain a first target feature vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
fusing the first target characteristic vector sequence and the second target characteristic vector sequence to obtain fused characteristic data;
and determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fusion characteristic data.
The memory 82 includes an internal memory 821 and an external memory 822; the internal memory 821 temporarily stores operation data in the processor 81 and data exchanged with the external memory 822, such as a hard disk, and the processor 81 exchanges data with the external memory 822 through the internal memory 821.
For the specific execution process of the above instructions, reference may be made to the steps of the method for determining the image-text association degree described in the embodiments of the present disclosure, which are not described herein again.

The embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for determining the image-text association degree in the foregoing method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The embodiments of the present disclosure further provide a computer program product carrying program code, and the instructions included in the program code may be used to execute the steps of the method for determining the image-text association degree in the foregoing method embodiments; reference may be made to the foregoing method embodiments for details, which are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
The disclosure relates to the field of augmented reality, and aims to detect or identify relevant features, states and attributes of a target object by means of various visual correlation algorithms by acquiring picture information of the target object in a real environment, so as to obtain an AR effect combining virtual and reality matched with specific applications. For example, the target object may relate to a face, a limb, a gesture, an action, etc. associated with a human body, or a marker, a marker associated with an object, or a sand table, a display area, a display item, etc. associated with a venue or a place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, picture registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, etc. The specific application can not only relate to interactive scenes such as navigation, explanation, reconstruction, virtual effect superposition display and the like related to real scenes or articles, but also relate to special effect treatment related to people, such as interactive scenes such as makeup beautification, limb beautification, special effect display, virtual model display and the like. The detection or identification processing of the relevant characteristics, states and attributes of the target object can be realized through the convolutional neural network. The convolutional neural network is a network model obtained by performing model training based on a deep learning framework.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the system and the apparatus described above may refer to the corresponding process in the foregoing method embodiments, and details are not described herein again.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative; for example, the division of the units is only one logical division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. A method for determining image-text association degree is characterized by comprising the following steps:
acquiring a picture sequence to be processed and a text to be processed;
performing first feature extraction processing on the picture sequence to be processed to obtain a first target feature vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
fusing the first target characteristic vector sequence and the second target characteristic vector sequence to obtain fused characteristic data;
and determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fusion characteristic data.
2. The method of claim 1, wherein the sequence of pictures to be processed and the text to be processed originate from the same article;
the acquiring of the picture sequence to be processed and the text to be processed includes:
acquiring each frame of picture to be processed from the article, and forming a picture sequence to be processed according to the sequence of each frame of picture to be processed in the article;
and acquiring each section of text from the article, and splicing each section of text according to the sequence of each section of text in the article to generate the text to be processed.
3. The method of claim 2, further comprising:
determining whether the image-text association degree between the picture sequence to be processed and the text to be processed is greater than or equal to a preset image-text association degree threshold value;
and responding to the image-text association degree which is larger than or equal to a preset image-text association degree threshold value, and generating the pushing information of the article.
4. The method according to claim 1, wherein the obtaining the sequence of pictures to be processed comprises:
obtaining semantic information of the text to be processed;
screening multiple frames of pictures to be processed from multiple frames of alternative pictures based on the semantic information;
and generating the picture sequence to be processed based on the screened multiple frames of pictures to be processed.
5. The method according to claim 4, wherein there are a plurality of said sequences of pictures to be processed; the method further comprises the following steps:
determining a target picture sequence from the plurality of picture sequences to be processed based on the picture-text association degrees between the plurality of picture sequences to be processed and the text to be processed respectively;
and generating an article based on the target picture sequence and the text to be processed.
6. The method according to any one of claims 1 to 5, wherein the first feature extraction processing is performed on the to-be-processed picture sequence to obtain a first target feature vector sequence; and performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence, wherein the second feature extraction processing comprises the following steps:
respectively performing feature extraction processing on each frame of picture in the picture sequence to be processed to obtain a first data sequence corresponding to the picture sequence to be processed, and performing self-attention processing on the first data sequence to obtain a first target feature vector sequence corresponding to the picture sequence to be processed; the first data sequence comprises original characteristic data corresponding to each frame of the pictures respectively; the first target feature vector sequence comprises first target feature vectors corresponding to the pictures of each frame respectively; and
converting the text to be processed into a second data sequence, and performing self-attention processing on the second data sequence to obtain a second target feature vector sequence corresponding to the text to be processed; the second data sequence comprises coded data corresponding to each vocabulary in the text to be processed; the second target feature vector sequence comprises second target feature vectors corresponding to a plurality of words respectively.
7. The method according to claim 6, wherein the performing feature extraction processing on each frame of picture in the to-be-processed picture sequence respectively to obtain a first data sequence corresponding to the to-be-processed picture sequence, and performing self-attention processing on the first data sequence to obtain the first target feature vector sequence corresponding to the to-be-processed picture sequence comprises:
respectively processing each frame of picture in the picture sequence to be processed by utilizing a pre-trained picture processing model to obtain original characteristic data corresponding to each frame of picture;
the first data sequence is formed on the basis of original characteristic data corresponding to the pictures of each frame;
performing self-attention processing on the first data sequence to obtain a first intermediate characteristic sequence, and fusing the first data sequence and the first intermediate characteristic sequence to obtain a second intermediate characteristic sequence; the first intermediate characteristic sequence comprises first intermediate characteristic vectors respectively corresponding to multiple frames of pictures; the second intermediate characteristic sequence comprises second intermediate characteristic vectors respectively corresponding to multiple frames of pictures;
and carrying out full connection processing on the second intermediate characteristic sequence at least once to obtain a first target characteristic vector sequence.
8. The method of claim 7, wherein the self-attention processing the first data sequence to obtain a first intermediate feature sequence, and the fusing the first data sequence and the first intermediate feature sequence to obtain a second intermediate feature sequence comprises:
determining self-attention weights corresponding to a plurality of original feature data in the first data sequence respectively by using a self-attention network;
weighting each original feature data by using the self-attention weight corresponding to each original feature data to obtain a first intermediate feature vector corresponding to each original feature data;
forming the first intermediate feature sequence based on first intermediate feature vectors respectively corresponding to the plurality of original feature data;
carrying out alignment addition on each original feature data in the first data sequence and the first intermediate feature vector in the first intermediate feature sequence respectively to obtain a third intermediate feature sequence; the third intermediate feature sequence comprises third intermediate feature vectors respectively corresponding to the plurality of frames of pictures;
and carrying out layer normalization processing on the third intermediate characteristic sequence to obtain the second intermediate characteristic sequence.
9. The method according to claim 1, wherein the fusing the first target feature vector sequence and the second target feature vector sequence to obtain fused feature data comprises:
performing first data mapping processing on the first target characteristic vector sequence to obtain a first data matrix and a second data matrix; performing second data mapping processing on the second target characteristic vector sequence to obtain a third data matrix;
and performing cross attention processing based on the first data matrix, the second data matrix and the third data matrix to obtain the fusion characteristic data.
10. An image-text association degree determining apparatus, comprising:
the acquisition module is used for acquiring the picture sequence to be processed and the text to be processed;
the characteristic extraction module is used for carrying out first characteristic extraction processing on the picture sequence to be processed to obtain a first target characteristic vector sequence; performing second feature extraction processing on the text to be processed to obtain a second target feature vector sequence;
the fusion module is used for fusing the first target characteristic vector sequence and the second target characteristic vector sequence to obtain fusion characteristic data;
and the determining module is used for determining the image-text association degree between the picture sequence to be processed and the text to be processed based on the fusion characteristic data.
11. A computer device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the processor being configured to execute the machine-readable instructions stored in the memory, wherein the machine-readable instructions, when executed by the processor, cause the processor to perform the steps of the method for determining the image-text association degree according to any one of claims 1 to 9.
12. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a computer device, performs the steps of the method for determining the image-text association degree according to any one of claims 1 to 9.
CN202211571282.1A 2022-12-08 2022-12-08 Image-text association degree determining method and device, computer equipment and storage medium Pending CN115982666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211571282.1A CN115982666A (en) 2022-12-08 2022-12-08 Image-text association degree determining method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211571282.1A CN115982666A (en) 2022-12-08 2022-12-08 Image-text association degree determining method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115982666A true CN115982666A (en) 2023-04-18

Family

ID=85963857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211571282.1A Pending CN115982666A (en) 2022-12-08 2022-12-08 Image-text association degree determining method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115982666A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205918A (en) * 2023-04-28 2023-06-02 锋睿领创(珠海)科技有限公司 Multi-mode fusion semiconductor detection method, device and medium based on graph convolution

Similar Documents

Publication Publication Date Title
WO2021082953A1 (en) Machine reading understanding method and apparatus, storage medium, and device
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
US10387531B1 (en) Processing structured documents using convolutional neural networks
CN107918778B (en) Information matching method and related device
CN113705313A (en) Text recognition method, device, equipment and medium
CN113254678B (en) Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof
CN111428025B (en) Text summarization method and device, electronic equipment and storage medium
CN110097010A (en) Picture and text detection method, device, server and storage medium
KR20200087977A (en) Multimodal ducument summary system and method
KR20210034679A (en) Identify entity-attribute relationships
CN115222066A (en) Model training method and device, behavior prediction method and device, and storage medium
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN113128196A (en) Text information processing method and device, storage medium
CN115982666A (en) Image-text association degree determining method and device, computer equipment and storage medium
CN115062134A (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
US10755171B1 (en) Hiding and detecting information using neural networks
CN114648032A (en) Training method and device of semantic understanding model and computer equipment
CN114330704A (en) Statement generation model updating method and device, computer equipment and storage medium
CN112132075B (en) Method and medium for processing image-text content
CN113705792A (en) Personalized recommendation method, device, equipment and medium based on deep learning model
CN112100355A (en) Intelligent interaction method, device and equipment
CN117312518A (en) Intelligent question-answering method and device, computer equipment and storage medium
KR102476334B1 (en) Diary generator using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination