CN115393849A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115393849A
Authority
CN
China
Prior art keywords
vector
entity
sample
coding
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210901461.0A
Other languages
Chinese (zh)
Inventor
张菁菁
方山城
毛震东
张志伟
陈小帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Research Institute
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Zhongke Research Institute
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Research Institute, Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Zhongke Research Institute
Priority to CN202210901461.0A priority Critical patent/CN115393849A/en
Publication of CN115393849A publication Critical patent/CN115393849A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/467Encoded features or binary features, e.g. local binary patterns [LBP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a data processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a service image to be processed and a service text associated with the service image; obtaining a visual cue vector based on visual features extracted from the service image to be processed; coding the visual cue vector and the service text to obtain coding vectors, where the coding vectors include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text; determining an entity prompt vector based on the visual coding vector and the entity coding vector corresponding to each named entity in the text coding vector; and decoding the coding vectors based on the visual cue vector and the entity prompt vector to obtain a description text corresponding to the service image to be processed. The present disclosure improves the accuracy of the named entities generated in the description text.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
At present, news articles published on the internet often contain news images, and each news image is also provided with corresponding descriptive text. To improve the efficiency of publishing news articles on the internet, descriptive text is typically automatically generated for each news image in the news article.
In a news scenario, a large number of named entities such as person names, place names, and organization names are involved, and these entities often fall outside the vocabulary of a language model (out of vocabulary, OOV). To automatically generate description texts containing named entities, the related art combines manually written entity-level description templates with a language model. However, the coverage of manually written entity-level templates is limited and only local entity words can be perceived, so the accuracy of named-entity generation is insufficient, which in turn reduces the accuracy of the description texts corresponding to news images.
Disclosure of Invention
The disclosure provides a data processing method, a data processing apparatus, an electronic device and a storage medium, which are used to at least solve the problem in the related art of poor accuracy when automatically generating the description text corresponding to a news image. The technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a data processing method, including:
acquiring a service image to be processed and a service text associated with the service image to be processed; the service text comprises named entities;
obtaining a visual cue vector based on the extracted visual characteristics of the service image to be processed;
coding the visual cue vector and the service text to obtain a coding vector; the coding vectors comprise visual coding vectors corresponding to the visual cue vectors and text coding vectors corresponding to the business texts;
determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and decoding the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the determining an entity prompt vector based on the visual encoding vector and the entity encoding vector corresponding to each named entity in the text encoding vector comprises:
determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and selecting a target entity coding vector from the entity coding vectors based on the correlation degree to form an entity prompt vector.
In an exemplary embodiment, the selecting a target entity code vector from the entity code vectors based on the degree of correlation, and forming an entity hint vector includes:
determining an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
determining the degree of dependence between the key entity code vector and each remaining entity code vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
determining a target remaining entity encoding vector based on the degree of dependence; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and splicing the key entity coding vector and the target residual entity coding vector as target entity coding vectors to obtain an entity prompt vector.
In an exemplary embodiment, the determining the degree of dependency between the key entity encoding vector and each remaining entity encoding vector includes:
taking the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual entity coding vectors into the bidirectional long-short term memory network to obtain state vectors corresponding to the residual entity coding vectors;
performing normalization processing based on the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependence between the key entity encoding vector and the corresponding remaining entity encoding vector.
In an exemplary embodiment, the decoding the coding vector based on the visual cue vector and the entity cue vector to obtain the description text corresponding to the service image to be processed includes:
splicing the visual cue vector and the entity cue vector to obtain a multi-mode cue vector;
and performing autoregressive decoding processing on the basis of the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the obtaining a visual cue vector based on the extracted visual features of the to-be-processed business image includes:
inputting the business image to be processed into a vision-language pre-training model for feature extraction to obtain extracted visual features;
mapping the visual features to an input space of a target language model based on a multilayer perception network to obtain visual prompt vectors;
wherein the target language model is used for the encoding process and the decoding process.
In an exemplary embodiment, the method is implemented based on a data processing model, the method further comprising the step of training the data processing model:
acquiring a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, and the sample service text comprises a named entity;
the visual features of the sample business image are extracted based on a visual-language pre-training model, and the visual features of the sample business image are mapped to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
inputting the sample visual cue vector and the sample service text into an encoder of a pre-training language model for encoding processing to obtain a sample encoding vector; the sample encoding vector comprises a sample visual coding vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector;
splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector; inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and adjusting model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and the training is finished to obtain the data processing model.
In an exemplary embodiment, the adjusting of the model parameter based on the difference between the prediction description text and the reference description text includes:
determining a loss value based on a difference between the prediction description text and the reference description text;
and fixing the model parameters of the vision-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
In an exemplary embodiment, the determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector comprises:
determining a degree of correlation between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form a sample entity prompt vector.
In an exemplary embodiment, the selecting a target sample entity encoding vector from the sample entity encoding vectors based on the sample correlation degree, and forming a sample entity hint vector includes:
determining a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
determining a sample dependency degree between the key sample entity encoding vector and each residual sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector except the key sample entity encoding vector;
determining a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and splicing the key sample entity coding vector and the target residual sample entity coding vector as target sample entity coding vectors to obtain a sample entity prompt vector.
In an exemplary embodiment, the determining the sample dependency between the key sample entity encoding vector and each remaining sample entity encoding vector includes:
taking the key sample entity coding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual sample entity coding vector into the bidirectional long-short term memory network to obtain a state vector corresponding to the residual sample entity coding vector;
performing normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including:
the data acquisition unit is configured to acquire a to-be-processed service image and a service text associated with the to-be-processed service image; the service text comprises named entities;
the visual cue vector determining unit is configured to obtain a visual cue vector based on the extracted visual features of the service image to be processed;
the coding unit is configured to perform coding processing on the visual cue vector and the service text to obtain a coding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
an entity prompt vector determination unit configured to perform determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and the decoding unit is configured to perform decoding processing on the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the entity hint vector determination unit includes:
a first degree of correlation determination unit configured to perform determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and the entity prompt vector determining subunit is configured to select a target entity code vector from the entity code vectors based on the correlation degree to form an entity prompt vector.
In an exemplary embodiment, the entity hint vector determination subunit includes:
the first key entity determining unit is configured to determine an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
a first dependency level determination unit configured to perform determining a dependency level between the key entity encoding vector and each remaining entity encoding vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
a first determining unit configured to perform determining a target remaining entity encoding vector based on the degree of dependency; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and the first construction subunit is configured to perform splicing on the key entity coding vector and the target residual entity coding vector as target entity coding vectors to obtain an entity prompt vector.
In an exemplary embodiment, the first dependency degree determining unit includes:
a first initialization unit configured to use the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a first state vector determining unit configured to perform input of the residual entity code vector to the bidirectional long-short term memory network, resulting in a state vector corresponding to the residual entity code vector;
the first normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependency between the key entity encoding vector and the remaining entity encoding vector.
In an exemplary embodiment, the decoding unit includes:
the multi-modal prompt vector determining unit is configured to perform splicing on the visual prompt vector and the entity prompt vector to obtain a multi-modal prompt vector;
and the decoding subunit is configured to perform autoregressive decoding processing based on the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the visual cue vector determination unit includes:
the visual feature extraction unit is configured to input the to-be-processed business image into a visual-language pre-training model for feature extraction to obtain extracted visual features;
a first mapping unit configured to perform mapping of the visual features to an input space of a target language model based on a multi-layer perceptual network, resulting in a visual cue vector;
wherein the target language model is used for the encoding process and the decoding process.
In an exemplary embodiment, the apparatus further comprises a training unit comprising:
the sample acquisition unit is configured to acquire a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity;
the sample visual cue vector determining unit is configured to extract visual features of the sample business image based on a visual-language pre-training model, and map the visual features of the sample business image to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
the sample coding unit is configured to perform coding processing by inputting the sample visual cue vector and the sample service text into an encoder of a pre-training language model to obtain a sample coding vector; the sample coding vector comprises a sample visual coding vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
a sample entity prompt vector determination unit configured to perform a sample entity prompt vector determination based on the sample visual coding vector and a sample entity coding vector corresponding to each of the named entities in the sample text coding vector;
the sample multi-modal cue vector determining unit is configured to splice the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector, and input the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and the parameter adjusting unit is configured to perform adjustment of model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and training is finished, so that the data processing model is obtained.
In an exemplary embodiment, the parameter adjusting unit includes:
a loss determination unit configured to perform determining a loss value based on a difference between the prediction description text and the reference description text;
a parameter adjusting subunit configured to keep the model parameters of the visual-language pre-training model fixed and to adjust the parameters of the initial multi-layer perceptual network and the pre-training language model based on the loss value.
In an exemplary embodiment, the sample entity hint vector determination unit includes:
a second degree of correlation determination unit configured to perform determining a degree of correlation between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and the sample entity prompt vector determining subunit is configured to select a target sample entity code vector from the sample entity code vectors based on the sample correlation degree to form a sample entity prompt vector.
In an exemplary embodiment, the sample entity hint vector determination subunit includes:
the second key entity determining unit is configured to determine a sample entity encoding vector corresponding to the maximum sample correlation degree to obtain a key sample entity encoding vector;
a second dependency level determination unit configured to perform determining a sample dependency level between the key sample entity encoding vector and each remaining sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
a second determination unit configured to perform a determination of a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and the second construction subunit is configured to perform splicing on the key sample entity encoding vector and the target residual sample entity encoding vector as target sample entity encoding vectors to obtain a sample entity prompt vector.
In an exemplary embodiment, the second dependency degree determining unit includes:
a second initialization unit configured to use the key sample entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a second state vector determination unit configured to perform input of the residual sample entity coded vector into the bidirectional long-short term memory network, resulting in a state vector corresponding to the residual sample entity coded vector;
the second normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of the first aspect described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the data processing method of the first aspect described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
The visual cue vector is obtained based on visual features extracted from the service image to be processed, and the visual cue vector and the service text associated with the service image are encoded to obtain coding vectors, which include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text. The entity cue vector is then determined based on the visual coding vector and the entity coding vector corresponding to each named entity in the text coding vector, and the coding vectors are decoded based on the visual cue vector and the entity cue vector to obtain the description text corresponding to the service image to be processed. As a result, the named entities related to the content of the service image receive more attention in the decoding process, which improves the accuracy of generating the named entities and thereby improves the accuracy of the description text.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of an application environment illustrating a method of data processing in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of data processing in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating the construction of an entity hint vector in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating another construction of an entity hint vector in accordance with an illustrative embodiment;
FIG. 5 is a schematic diagram of a data processing model shown in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating the fine tuning of a pre-trained model to arrive at the data processing model of FIG. 5, in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating the structure of a data processing apparatus according to an exemplary embodiment;
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
News image description generation aims to automatically generate descriptive text for the pictures in news articles. Because generating a news image description requires attending both to the visual subject in the image and to the background information contained in the article, conventional image description generation methods, which are limited to image input, cannot process long text information. In addition, a large number of named entities, such as names of people, places and organizations, are involved in a news scenario, and these entity words often exceed the vocabulary range (out of vocabulary, OOV) of the language model, so conventional image description generation methods cannot handle the generation of texts containing entity words.
In the related art, some methods adopt templating: all entity words in a description sentence are replaced with slots representing entity words, and entity-level matching and slot filling are then performed against the input. These methods can be subdivided into two-stage generation and one-stage generation, where two-stage generation first generates a template sentence without entity words and then predicts the entity words, and one-stage generation considers the distribution probability over non-entity words and entity words when generating each lemma so as to realize end-to-end dynamic prediction. However, these methods have the following drawbacks: 1) the generation of entity words depends heavily on the quality of the manually written entity-level templates, and in practical applications the image description sentences in news scenarios are usually natural sentences rich in information, so defining an optimal template is time-consuming and it is difficult for such templates to cover all description sentences; 2) text generation is limited to a template-based supervised training mode; multi-stage iterative generation and template filling increase the complexity of the model and cause a certain accumulation of errors, while one-stage generation can hardly alleviate the OOV problem, so the accuracy of entity word generation is poor.
In other related-art methods, in order to ensure the precision of entity words, a further refined entity word template is used as a supervision signal; that is, the entity word generation space is further constrained by introducing labels of entity word types according to the classification of the grammatical components of entity words in sentences. However, supervised training based on entity-level templates can only perceive entity words locally, which not only damages the semantic coherence of the language structure and context but also lacks perception of the internal relations of global entity words in the language space, so the entity words in the finally generated image description text are still not accurate enough.
In view of this, the embodiment of the present disclosure provides a data processing method. A visual cue vector is obtained based on visual features extracted from a service image to be processed, and the visual cue vector and the service text associated with the service image are encoded to obtain coding vectors, which include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text. An entity cue vector is then determined based on the visual coding vector and the entity coding vector corresponding to each named entity in the text coding vector, and the coding vectors are decoded based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed. In this way, the named entities related to the content of the service image receive more attention in the decoding process, which improves the accuracy of generating the named entities and thereby improves the accuracy of the image description text.
Referring to fig. 1, an application environment of a data processing method according to an exemplary embodiment is shown, where the application environment may include a terminal 110 and a server 120, and the terminal 110 and the server 120 may be connected through a wired network or a wireless network.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal 110 may have client software such as an Application (App) for providing a data processing function installed therein, and the Application may be a stand-alone Application or a sub-program in the Application. Illustratively, the application may include a news-like application or the like, such as an application having a news release function. The user of the terminal 110 may log into the application through pre-registered user information, which may include an account number and a password.
The server 120 may be a server providing a background service for the application program in the terminal 110, and specifically, the background service may be a service for processing an image to generate a description text corresponding to the image. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and big data and artificial intelligence platforms.
To elaborate on the technical solution of the embodiment of the present disclosure, fig. 2 shows a flowchart of a data processing method according to an exemplary embodiment. The method can be applied to the electronic device shown in fig. 1 and includes the following steps:
in step S201, a to-be-processed service image and a service text associated with the to-be-processed service image are obtained.
The service text associated with the service image to be processed includes named entities, such as a person name, a place name, an organization name, and the like. For example, the service image to be processed may be a news image, and the service text associated with the service image to be processed may be a news article corresponding to the news image.
In step S203, a visual cue vector is obtained based on the extracted visual features of the to-be-processed service image.
The visual feature of the service image to be processed refers to a feature obtained by feature extraction of the service image to be processed.
The visual cue vector is a vector obtained by performing dimension transformation on visual features, and the dimension transformation is used for establishing a connection between a visual concept and language prior. In a specific implementation, the dimension transformation may be performed based on the input dimension of the text information for the subsequent encoding process.
In step S205, the visual cue vector and the service text are encoded to obtain an encoded vector.
Wherein the coding vectors include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text.
Specifically, the word segmentation processing may be performed on the service text, each word is used as a token (lemma) to obtain a word sequence, then the visual cue vector and the word sequence are spliced to obtain a sequence to be coded, and then the sequence to be coded is coded.
In the embodiment of the present disclosure, the service image to be processed and the service text are encoded simultaneously based on a bidirectional self-attention mechanism so that the context information can be used; that is, the sequence to be encoded is encoded based on the bidirectional self-attention mechanism. For example, the encoding process may be expressed as:

C = SA(p_v, x_A)

where SA(·) represents the self-attention mechanism; p_v represents the visual cue vector; x_A represents the service text; and C denotes the encoding vector. C comprises C_I and C_A, where C_I represents the visual encoding vector and C_A represents the text encoding vector; C_A contains the entity encoding vectors C_E, i.e. C_E ∈ C_A.
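For illustration, a minimal PyTorch-style sketch of this encoding step is given below, assuming an encoder module that applies bidirectional self-attention and returns hidden states of the same length; all function and variable names are illustrative assumptions rather than the patent's implementation.

```python
# A minimal sketch of C = SA(p_v, x_A); "encoder" is assumed to be any module applying
# bidirectional self-attention, "embed_tokens" any lemma-embedding layer.
import torch

def encode(encoder, embed_tokens, p_v, text_token_ids):
    """p_v:            (k, hidden) visual cue vectors mapped into the language model input space
       text_token_ids: (n,)        lemma ids of the service text x_A"""
    x_a = embed_tokens(text_token_ids)               # (n, hidden) text lemma embeddings
    seq = torch.cat([p_v, x_a], dim=0)               # sequence to be encoded [p_v; x_A]
    c = encoder(seq)                                 # encoding vector C, (k + n, hidden)
    c_i, c_a = c[: p_v.size(0)], c[p_v.size(0):]     # visual encoding C_I and text encoding C_A
    return c_i, c_a
```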
In step S207, an entity prompt vector is determined based on the visual encoding vector and the entity encoding vector corresponding to each named entity in the text encoding vector.
And the entity coding vector is a vector corresponding to the named entity in the text coding vector.
In a specific implementation, all named entities in the service text may be tagged by using a natural language processing toolkit (e.g., the spaCy toolkit), so that the entity encoding vector corresponding to each named entity can be extracted from the encoding vectors based on the tagging information.
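As an illustration of this tagging step, the sketch below uses the spaCy toolkit mentioned above; the specific model name and the word-level alignment with the encoding vectors are assumptions for illustration only.

```python
# A sketch of tagging named entities with spaCy; the loaded model and the assumption of
# one lemma per word are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_spans(service_text):
    """Return (start, end, label) word spans for every named entity in the service text."""
    doc = nlp(service_text)
    return [(ent.start, ent.end, ent.label_) for ent in doc.ents]

# With one lemma per word, the spans index directly into the text encoding vectors C_A,
# giving the entity encoding vectors C_E:
# c_e = [c_a[start:end] for start, end, _ in entity_spans(text)]
```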
In an exemplary embodiment, in order to enable the entity hint vector to more accurately reflect the content of the image to be processed, the step S207 may be implemented to include:
determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and selecting a target entity coding vector from the entity coding vectors based on the correlation degree to form an entity prompt vector.
In a specific implementation, the correlation score corresponding to each entity encoding vector can be calculated by the following formula to characterize the degree of correlation:

s_i = ⟨avg(C_I), c_i⟩ / τ

where avg(C_I) represents a weighted average of the visual encoding vector C_I; c_i represents the entity encoding vector i (c_i ∈ C_E); τ represents a hyper-parameter that can be set according to practical experience; ⟨·,·⟩ represents a dot-product calculation; and s_i represents the relevance score of the entity encoding vector i.

A set of relevance scores S_E may thus be obtained, which includes the relevance score corresponding to each entity encoding vector, and a target entity encoding vector whose degree of correlation with the visual encoding vector meets a preset condition is selected from the entity encoding vectors based on these relevance scores to form an entity prompt vector.
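A small sketch of this relevance computation is given below, under the assumption that the score is the dot product between a pooled visual encoding vector and each pooled entity encoding vector scaled by the hyper-parameter τ; the pooling choice and the value of τ are illustrative.

```python
# A sketch of s_i = <avg(C_I), c_i> / tau; plain mean pooling stands in for the weighted average.
import torch

def relevance_scores(c_i, entity_vectors, tau=0.07):
    """c_i:            (num_visual, hidden) visual encoding vectors C_I
       entity_vectors: list of (len_j, hidden) entity encoding vectors, one per named entity."""
    visual_mean = c_i.mean(dim=0)                        # pooled visual encoding vector
    scores = []
    for c_e in entity_vectors:
        e = c_e.mean(dim=0)                              # pool a multi-lemma entity into one vector
        scores.append(torch.dot(visual_mean, e) / tau)   # dot product scaled by hyper-parameter tau
    return torch.stack(scores)                           # relevance score set S_E
```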
In an exemplary embodiment, when selecting a target entity code vector from the entity code vectors based on the degree of correlation, constructing an entity hint vector may include the following steps as shown in fig. 3:
in step S301, the entity encoding vector corresponding to the maximum correlation degree is determined, and a key entity encoding vector is obtained.
In a specific implementation, the key entity encoding vector C_key can be expressed by the following formula:

C_key = c_{i*}, where i* = argmax_i s_i

that is, C_key is the entity encoding vector whose relevance score s_i is the largest.
in step S303, the degree of dependency between the key entity-encoding vector and each remaining entity-encoding vector is determined.
Wherein the remaining entity encoding vectors refer to entity encoding vectors other than the key entity encoding vector.
In an exemplary embodiment, in order to obtain a context hint vector at a entity level to improve the accuracy of entity words in the generated description text, as shown in fig. 4, the step S303 may be implemented to include:
in step S401, the key entity encoding vector is used as an initial hidden state of the bi-directional long-short term memory network.
In step S403, the remaining physical code vectors are input into the bidirectional long-short term memory network, and state vectors corresponding to the remaining physical code vectors are obtained.
In step S405, normalization processing is performed based on the state vector corresponding to the residual entity encoding vector, so as to obtain a normalization result corresponding to the residual entity encoding vector.
Wherein the normalization result characterizes a degree of dependency between the key entity encoding vector and the corresponding remaining entity encoding vector.
The bidirectional long and short term memory network is initialized in step S401 so that the initial hidden state of the initialized bidirectional long and short term memory network is the key entity code vector, and then the remaining entity code vectors are input into the initialized bidirectional long and short term memory network in step S403 to be processed to obtain the state vector corresponding to the remaining entity code vector.
In a specific implementation, the state vectors can be represented by the following formula:

C_prompt = LSTM(C_key, C′_E)

Accordingly, the normalization result can be expressed by the following formula:

S_prompt = σ(fc(C_prompt))

where LSTM represents the bidirectional long-short term memory network, with C_key as its initial hidden state; C′_E represents the remaining entity encoding vectors; C_prompt represents the state vectors output by the network; fc(·) represents a single linear layer; σ(·) represents an activation function; and S_prompt represents the normalization result, a numerical value that can serve as a score.
According to the embodiment, the key entity encoding vector is used as the initial hidden state of the bidirectional LSTM, so that the potential dependency relationship between the key entity encoding vector and other entity encoding vectors can be modeled based on the context information, and the accuracy of determining the dependency relationship is improved.
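The following sketch illustrates steps S401 to S405 with PyTorch's bidirectional LSTM; the hidden sizes, the reuse of the key vector as the initial hidden state for both directions, and the sigmoid activation are illustrative assumptions.

```python
# A sketch of the dependency scoring: C_prompt = LSTM(C_key, C'_E), S_prompt = sigma(fc(C_prompt)).
import torch
import torch.nn as nn

class EntityDependencyScorer(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.lstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)   # single linear layer fc()

    def forward(self, c_key, c_rest):
        """c_key:  (hidden,)   key entity encoding vector
           c_rest: (m, hidden) remaining entity encoding vectors C'_E"""
        # Use the key entity encoding vector as the initial hidden state of both directions.
        h0 = c_key.view(1, 1, -1).repeat(2, 1, 1)                 # (num_directions, batch, hidden)
        c0 = torch.zeros_like(h0)
        c_prompt, _ = self.lstm(c_rest.unsqueeze(0), (h0, c0))    # state vectors C_prompt
        s_prompt = torch.sigmoid(self.fc(c_prompt))               # sigma(fc(C_prompt))
        return s_prompt.squeeze(-1).squeeze(0)                    # dependency score per remaining entity
```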
In step S305, a target remaining entity encoding vector is determined based on the degree of dependency.
And the dependency degree corresponding to the target residual entity coding vector is greater than a preset dependency degree threshold value.
The preset dependency threshold value can be set according to actual needs.
In a particular implementation, if S_prompt^i > ε, where ε is the preset dependency threshold, the remaining entity encoding vector c′_i corresponding to S_prompt^i is determined to be a target remaining entity encoding vector, so that the set of target remaining entity encoding vectors C′_prompt is obtained.
In step S307, the key entity encoding vector and the target remaining entity encoding vector are used as target entity encoding vectors to be spliced, so as to obtain an entity prompt vector.
In particular implementations, the entity prompt vector p_μ can be expressed by the following formula:

p_μ = [C_key; C′_prompt]

where [;] denotes splicing and C′_prompt denotes the target remaining entity encoding vectors.
the above embodiment realizes the screening and aggregation of important entity encoding vectors (namely target entity encoding vectors) from the entity encoding vectors based on the context information, and generates accurate and global entity-level-oriented context hint vectors.
In step S209, the coding vector is decoded based on the visual cue vector and the entity cue vector, so as to obtain a description text corresponding to the service image to be processed.
Specifically, the visual cue vector and the entity cue vector can be spliced to obtain a multi-modal cue vector, and then the coding vector and the multi-modal cue vector are utilized to perform decoding processing through a cross attention mechanism, so that a description text corresponding to the service image to be processed is obtained.
According to the technical scheme of the embodiment of the disclosure, the description text of the to-be-processed service image is obtained by constructing the visual cue vector and the entity cue vector and decoding the coding vector by combining the two cue vectors, so that the named entity related to the content of the service image can be focused more during decoding, the accuracy of the named entity generated in the description text is improved, and the accuracy of the generated description text is further improved.
In an exemplary embodiment, in order to improve the accuracy of the generated description text, the decoding process in step S209 may be an autoregressive decoding process, that is, the prediction result of the current time step needs to depend on the prediction result of the past time step. Therefore, the step S209 may include:
splicing the visual cue vector and the entity cue vector to obtain a multi-mode cue vector;
and performing autoregressive decoding processing on the basis of the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
In particular, the multi-modal prompt vector P may be represented as P = [p_v; p_μ] = [p_1, …, p_k], where k represents the length of the multi-modal prompt vector. In the embodiment of the present disclosure, the length k of the multi-modal prompt vector is fixed, and the specific value can be set according to practical experience.

Let t denote the time step of autoregressive decoding and y_t denote the lemma generated at time step t. When performing autoregressive decoding processing based on the multi-modal prompt vector and the encoding vector, a sequence S = [P; y] is first formed, where y = {y_1, …, y_{t-1}}, and S is processed by the self-attention mechanism to obtain an output vector SelfAtt(S); then cross-attention processing is carried out based on SelfAtt(S) and the encoding vector C, and the lemma corresponding to time step t is predicted based on the cross-attention processing result h_t.

Note that when the embodiment of the present disclosure processes S = [P; y] by the self-attention mechanism to obtain the output vector SelfAtt(S), if the input of the query Q is y and the input of the key-value pair K-V is S, then:

SelfAtt(S) = softmax( Q · [K_P; K]^T / sqrt(d_H) ) · [V_P; V]

where Q, K and V are obtained by mapping y; K_P and V_P are obtained by mapping P; [;] represents a splice; and d_H represents the feature dimension of the self-attention mechanism.

In the embodiment of the present disclosure, when performing the cross-attention processing based on SelfAtt(S) and the encoding vector C, the "query Q" input is the above SelfAtt(S), and the "key-value pair K-V" is the encoding vector C. When predicting the lemma corresponding to time step t based on the cross-attention processing result h_t, the vector corresponding to the part of S located after the multi-modal prompt vector is extracted from h_t to calculate the distribution probability of the lemma. The calculation formula of the distribution probability p(y_t) can be expressed as follows:

p(y_t) = softmax( W · h_t[|P_idx|:] )

where W represents the matrix that maps the logit distribution onto the vocabulary, and [|P_idx|:] indicates the part of the sequence located after the multi-modal prompt vector.
In the above embodiment, the autoregressive decoding processing is performed based on the multi-modal prompt vector and the encoding vector, and the multi-modal prompt vector is used to guide the generation of the entity lemmas and the non-entity lemmas in each decoding step, so that the entity words in the generated description text are more accurate.
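To make the decoding step concrete, the simplified sketch below shows one decoding step that combines prompt-conditioned self-attention with cross-attention over the encoding vector; it uses a single attention head, omits the per-layer projection matrices, and all names are illustrative assumptions rather than the patent's implementation.

```python
# A simplified sketch of one autoregressive decoding step under the multi-modal prompt.
import math
import torch

def attention(q, k, v):
    d = q.size(-1)
    return torch.softmax(q @ k.t() / math.sqrt(d), dim=-1) @ v

def prompted_decoder_step(y, p, c, w_vocab):
    """y: (t-1, d) embeddings of the lemmas generated so far
       p: (k, d)   multi-modal prompt vector P = [p_v; p_mu]
       c: (n, d)   encoding vectors C from the encoder
       w_vocab: (d, vocab) matrix mapping hidden states onto the vocabulary."""
    s = torch.cat([p, y], dim=0)        # S = [P; y]
    h = attention(y, s, s)              # SelfAtt(S): queries from y, keys/values from S
    h = attention(h, c, c)              # cross attention with the encoding vector C
    # Queries come only from y, so every output position already lies after the prompt;
    # the last position corresponds to h_t for the current time step t.
    logits = h[-1] @ w_vocab
    return torch.softmax(logits, dim=-1)   # distribution probability of lemma y_t
```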
In an exemplary embodiment, in order to improve the influence of the visual cue vector on semantic understanding so as to focus more accurately on the entity words related to the image content, step S203 may be implemented by:
inputting a service image to be processed into a vision-language pre-training model for feature extraction to obtain extracted visual features;
mapping the visual features to an input space of a target language model based on a multilayer perception network to obtain visual prompt vectors;
wherein the target language model is used for the encoding process and the decoding process.
Specifically, the business image to be processed may be encoded by the image encoder of a visual-language pre-training model to obtain encoding features, which are used as the extracted visual features. The visual-language pre-training model may be a CLIP (Contrastive Language-Image Pre-training) model; CLIP is a pre-trained model obtained through contrastive learning on a large-scale image-text data set.
The target language model is obtained by fine-tuning the pre-training language model based on the sample service image and the sample service text corresponding to the sample service image, and the target language model is used for realizing the encoding processing and the decoding processing in the embodiment of the disclosure.
Illustratively, the pre-trained language model includes a Transformer-based encoder and a Transformer-based decoder. The Transformer-based encoder may be an encoder with bidirectional encoding capability. For example, the pre-trained language model may be BART (Bidirectional and Auto-Regressive Transformers); BART uses a standard Transformer-based neural machine translation architecture and can be regarded as a generalization of pre-trained models such as BERT (a bidirectional encoder) and GPT (a left-to-right decoder).
In the embodiment, the visual features are extracted through the visual-language pre-training model, so that the visual features which are more consistent with semantic understanding can be obtained by utilizing the high-level image semantic understanding capability of the visual-language pre-training model, the influence of the obtained visual cue vector on the semantic understanding can be favorably improved, and the accuracy of describing the entity words in the text can be improved.
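The sketch below illustrates this mapping from visual features to visual cue vectors with a small multilayer perception network; the dimensions, the prompt length and the activation function are illustrative assumptions rather than the patent's configuration, and the CLIP image features are assumed to be computed separately.

```python
# A sketch of step S203: map frozen vision-language features into the language model input space.
import torch
import torch.nn as nn

class VisualPromptMapper(nn.Module):
    def __init__(self, clip_dim=512, lm_dim=768, k=4):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(                 # multilayer perception network
            nn.Linear(clip_dim, lm_dim * k),
            nn.Tanh(),
            nn.Linear(lm_dim * k, lm_dim * k),
        )

    def forward(self, clip_image_features):
        """clip_image_features: (clip_dim,) output of the vision-language model's image encoder."""
        p_v = self.mlp(clip_image_features)       # project into the target language model input space
        return p_v.view(self.k, -1)               # visual cue vectors p_v, (k, lm_dim)
```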
As can be seen from the foregoing implementation manner, the data processing method according to the embodiment of the present disclosure may be implemented based on a data processing model, where the data processing model is composed of a visual-language pre-training model, a multi-layer perceptual network, a target language model and an entity prompt vector building module, and as shown in fig. 5, the data processing model provided in the embodiment of the present disclosure is a schematic structural diagram of the data processing model, where the target language model is obtained by performing a fine-tuning process on the pre-training language model and includes a transform-based encoder and a transform-based decoder.
In specific implementation, a service image to be processed is input into a visual-language pre-training model, an image encoder of the visual-language pre-training model performs encoding processing to obtain output visual features, the visual features are used as input of a multilayer perception network, the multilayer perception network is mapped to an input space of a target language model to obtain a visual cue vector, the visual cue vector is spliced with a service text associated with the service image to be processed and then used as input of an encoder in the target language model, and the encoder of the target language model performs encoding processing to obtain an encoding vector. The coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text.
And inputting the entity coding vectors corresponding to the named entities in the visual coding vector and the text coding vector into an entity prompt vector construction model, wherein the entity prompt vector construction model selects a target entity coding vector from the entity coding vectors based on the correlation degree between the visual prompt vector and each entity coding vector to form the entity prompt vector.
After the visual cue vector and the entity cue vector are spliced, they form, together with the encoding vector output by the encoder, the input of the decoder in the target language model. The decoder may include a plurality of stacked Transformer decoding modules, and each decoding module includes two attention layers. The first attention layer takes the spliced visual cue vector and entity cue vector together with the prediction results of previous time steps as input and performs processing based on a self-attention mechanism; the second attention layer performs autoregressive decoding based on a cross-attention mechanism using the processing result of the first attention layer and the encoding vector output by the encoder. The processing manner of the self-attention mechanism of the first attention layer may refer to the related content of the foregoing step S209, which is not described herein again.
Based on this, in an exemplary implementation manner, the embodiment of the present disclosure may further include a step of training the data processing model, and specifically, training the data processing model may include:
and acquiring a sample service image text pair and a corresponding reference description text. The sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity.
And mapping the visual features of the sample business images to an input space of a pre-training language model based on the initial multilayer perception network to obtain sample visual cue vectors.
And inputting the sample visual cue vector and the sample service text into an encoder of the pre-training language model for encoding processing to obtain a sample encoding vector. The sample encoding vector includes a sample visual coding vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text.
And determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector.
Splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector; and inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text.
And adjusting model parameters based on the difference between the prediction description text and the reference description text until a preset training ending condition is reached and training is ended to obtain a data processing model.
The preset training end condition may be set according to actual needs, for example, the iteration number reaches a preset iteration number threshold, or the loss value reaches a preset loss threshold, or the like.
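By way of a non-limiting illustration, the training steps listed above may be sketched as a single training step as follows; here clip_model, mlp_mapper, lang_model and entity_prompt_builder are hypothetical stand-ins for the visual-language pre-training model, the initial multilayer perception network, the pre-training language model and the entity prompt vector construction module, and their interfaces are assumptions rather than part of the present disclosure:

```python
import torch
import torch.nn.functional as F

def train_step(batch, clip_model, mlp_mapper, lang_model,
               entity_prompt_builder, optimizer):
    image, text_ids, reference_ids = batch            # sample image, sample text, reference description
    with torch.no_grad():                             # the visual-language model stays frozen
        visual_feat = clip_model.encode_image(image)
    p_v = mlp_mapper(visual_feat)                     # sample visual cue vector
    enc = lang_model.encode(p_v, text_ids)            # sample coding vector [C_I; C_A]
    p_mu = entity_prompt_builder(enc, text_ids)       # sample entity prompt vector
    prompt = torch.cat([p_v, p_mu], dim=1)            # sample multi-modal prompt vector
    # decoder logits assumed aligned with the reference lemmas (teacher forcing)
    logits = lang_model.decode(prompt, enc, reference_ids)
    loss = F.cross_entropy(logits.flatten(0, 1), reference_ids.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```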
In the above embodiment, the sample service image is converted into the visual cue vector by using the large-scale visual-language pre-training model, the entity prompt vector is constructed by combining the sample service text, and the pre-training language model is then fine-tuned with a prompt learning mechanism using the multi-modal prompt vector constructed from both the image and the text. In this way, entity-level representations in the service text and the service image can be unified, joint learning of the two pre-training models is realized, and the data processing model of the embodiment of the present disclosure is obtained.
In an exemplary embodiment, in order to improve the efficiency of the fine-tuning, the visual-language pre-training model may be fixed so that it does not participate in parameter updating, and only the mapping network (i.e., the initial multilayer perception network) and the pre-training language model are fine-tuned together. Accordingly, adjusting the model parameters based on the difference between the prediction description text and the reference description text may include:
determining a loss value based on a difference between the prediction description text and the reference description text;
and fixing the model parameters of the vision-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
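A minimal sketch of this freezing scheme is shown below, continuing the hypothetical module names used in the earlier training sketch; the optimizer choice and learning rate are assumed placeholders:

```python
import torch

def build_optimizer(clip_model, mlp_mapper, lang_model, lr=1e-5):
    """Freeze the visual-language model; tune only the mapping network and
    the pre-training language model (illustrative sketch)."""
    for param in clip_model.parameters():
        param.requires_grad_(False)              # fixed: no parameter updates
    trainable = list(mlp_mapper.parameters()) + list(lang_model.parameters())
    return torch.optim.AdamW(trainable, lr=lr)   # lr is an assumed placeholder
```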
The loss value may be obtained based on a preset loss function, where the preset loss function may be a cross-entropy-based unidirectional language modeling loss. Specifically, the preset loss function ℒ may be expressed as follows:

ℒ = −Σ_{t=k+1}^{L} log p_θ(y_t | P, y_{τ<t}, x_A, x_I)

where x_A represents the sample service text; x_I represents the sample business image; P represents the multi-modal prompt vector, P = [p_v; p_μ] = [p_1, …, p_k]; y_{τ<t} represents the lemmas predicted before time step t, y_{τ<t} = y_1, …, y_{t−1}; p_θ(·) represents the likelihood function; and L denotes the sum of k and the number of lemmas included in the reference description text.
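The summation starting at t = k + 1 means the loss is accumulated only over positions after the k prompt vectors. A minimal sketch of this computation is given below, assuming the decoder logits over [P; y] are already aligned with the reference lemmas under teacher forcing; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def prompt_lm_loss(logits, reference_ids, prompt_len):
    """Unidirectional LM loss summed over positions after the k prompt vectors.

    logits:        (batch, k + T, vocab) decoder outputs over [P; y]
    reference_ids: (batch, T) lemma ids of the reference description text
    prompt_len:    k, the length of the multi-modal prompt vector
    """
    text_logits = logits[:, prompt_len:, :]        # keep positions t = k+1 .. L
    return F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        reference_ids.reshape(-1))
```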
In an exemplary embodiment, determining a sample entity prompt vector based on the sample visual encoding vector and a sample entity encoding vector corresponding to each of the named entities in the sample text encoding vector may include:
determining a sample correlation degree between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form a sample entity prompt vector.
For a specific determination manner of the sample correlation degree, reference may be made to the determination manner of the correlation degree in the foregoing embodiment shown in fig. 2 in the embodiment of the present disclosure, and details are not repeated here.
By the implementation mode, the sample entity prompt vector can reflect the content of the sample image more accurately, and the training effect of the model is improved.
In an exemplary embodiment, selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form the sample entity prompt vector may include:
determining a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
determining a sample dependency degree between the key sample entity encoding vector and each residual sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
determining a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and splicing the key sample entity coding vector and the target residual sample entity coding vector as target sample entity coding vectors to obtain a sample entity prompt vector.
For a specific determination manner of the sample dependency degree and the target residual sample entity encoding vector, reference may be made to the related description in the method shown in fig. 3 in the foregoing embodiment of the present disclosure, which is not described herein again.
According to the embodiment, the important entities are screened and aggregated based on the context information, and the entity-level context prompt vector is generated, so that the entity accuracy of the model prediction result is improved.
In an exemplary embodiment, determining a sample dependency between the key sample entity encoding vector and each remaining sample entity encoding vector comprises:
taking the key sample entity coding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual sample entity coding vector into the bidirectional long-short term memory network to obtain a state vector corresponding to the residual sample entity coding vector;
performing normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
In the above embodiment, the key sample entity encoding vector is used as the initial hidden state of the bidirectional LSTM, so that the potential dependency relationship between the key sample entity encoding vector and other sample entity encoding vectors can be modeled based on the context information, and the accuracy of determining the dependency relationship is improved.
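As a non-limiting illustration, the dependency scoring described above may be sketched as follows; the scoring head and dimensions are assumptions, and only the use of the key entity coding vector as the initial hidden state of the bidirectional LSTM follows the description:

```python
import torch
import torch.nn as nn

class EntityDependencyScorer(nn.Module):
    """Scores the dependency of each remaining entity coding vector on the key
    entity coding vector (illustrative sketch)."""
    def __init__(self, dim=768):
        super().__init__()
        # bidirectional LSTM whose initial hidden state is the key entity vector
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, key_entity, remaining_entities):
        # key_entity: (batch, dim); remaining_entities: (batch, n, dim)
        h0 = key_entity.unsqueeze(0).repeat(2, 1, 1)   # both directions start from C_key
        c0 = torch.zeros_like(h0)
        states, _ = self.bilstm(remaining_entities, (h0, c0))
        # normalization over the remaining entities yields the dependency degrees
        return torch.softmax(self.score(states).squeeze(-1), dim=-1)

# entities whose dependency degree exceeds the preset threshold are kept and
# spliced with the key entity coding vector to form the entity prompt vector.
```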
In order to more clearly illustrate the technical solution of the embodiment of the present disclosure, the fine-tuning process for obtaining the data processing model of the embodiment of the present disclosure is described below with reference to FIG. 6, taking a news service as an example in which the visual-language pre-training model is CLIP and the pre-training language model is BART.
As shown in FIG. 6, for a given paired data set of sample news texts and sample news images, where x_A and x_I respectively represent a sample news text and the corresponding sample news image, the sample news image x_I is input into the CLIP module of the data processing model, and the image encoder of CLIP performs feature extraction on the news image to obtain visual features. The visual features are input into the initial multilayer perception network (MLP, Multilayer Perceptron) of the data processing model, which maps them to the input space of BART to obtain the visual cue vector, denoted p_v.
The visual cue vector p_v and the sample news text x_A are taken as the input sequence of the BART encoder, denoted [p_v; x_A]. This sequence is processed by the BART encoder through the bidirectional attention mechanism, and a sample coding vector C is obtained based on the hidden-layer state vectors of the last encoder layer. C comprises a visual coding vector C_I corresponding to p_v and a sample text coding vector C_A corresponding to x_A, as shown in (a) and (b) of FIG. 6.
The sample coding vector C is input into the context entity prompt construction module for sequence modeling to obtain the entity prompt vector p_μ, and p_v and p_μ are spliced to obtain the multi-modal prompt vector P, denoted P = [p_v; p_μ] = [p_1, …, p_k].
The context entity prompt construction module automatically learns the association between the visual prompt and the text representation from the latent semantic space, constructs a global entity prompt vector sequence, and guides the generation of entity lemmas and non-entity lemmas at each decoding step. In a specific implementation, all named entities E in the sample news text are tagged with the spaCy toolkit, and the hidden state vectors C_E (C_E ∈ C_A) corresponding to the entity lemmas are extracted from the sample text coding vector C_A. The visual coding vector C_I is then weighted-averaged through a linear layer φ, and the relevance score of each entity lemma is calculated based on the averaged result, so as to obtain a relevance score set S_E, as shown in (c) of FIG. 6; each element in S_E represents the importance score of a named entity. During training, the entity coding vector with the highest score is taken as the key entity coding vector C_key through an argmax operation (denoted C_key in FIG. 6). Then, as shown in (d) of FIG. 6, C_key is used as the initial hidden state of the bidirectional LSTM model to model the latent dependency relationship between the key entity and the other entities (i.e., the scores in FIG. 6), and the target residual entity coding vectors are obtained by combining a preset dependency degree threshold η. C_key is spliced with the target residual entity coding vectors to construct the entity prompt vector p_μ.
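A non-limiting sketch of the relevance scoring and key-entity selection just described is given below; the exact form of the weighted average behind the linear layer φ is an assumption for illustration:

```python
import torch
import torch.nn as nn

def build_entity_scores(C_I, C_E, phi):
    """Relevance scores of entity lemmas against the visual coding vector.

    C_I: (batch, m, dim) visual coding vectors corresponding to p_v
    C_E: (batch, n, dim) hidden states of entity lemmas taken from C_A
    phi: nn.Linear(dim, dim), the linear layer used over the visual vectors
    """
    v = phi(C_I).mean(dim=1, keepdim=True)           # weighted average of C_I (illustrative)
    scores = torch.softmax((C_E * v).sum(-1), -1)    # relevance score set S_E
    key_index = scores.argmax(dim=-1)                # index of the key entity C_key
    return scores, key_index
```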
The multi-modal prompt vector P is added to the input sequence of the BART decoder, denoted [P; y], which is then input to the BART decoder; the decoder attends to C through the cross-attention mechanism and to P within [P; y] through the self-attention mechanism, and generates the prediction description text by autoregressive iteration (that is, the lemmas of the prediction description text are obtained in sequence). The attention layer of the BART decoder can be expressed as:
Attn(y, P) = softmax( Q [K_P; K]^T / √d_H ) [V_P; V]

where Q, K and V are obtained by mapping y; K_P and V_P are obtained by mapping P; [;] represents splicing; and d_H represents the feature dimension of the self-attention mechanism.
If P_idx represents the index numbers corresponding to the multi-modal prompt vector P in the whole decoder input sequence, and |P_idx| represents the length of P, the probability distribution output at time step t is:

p_θ(y_t | y_{τ<t}, P) = softmax( W · h_t )

where W represents the matrix that maps a vector to a logit distribution over the vocabulary, and h_t is the decoder hidden state at the t-th position of the part of the sequence located after the multi-modal prompt vector (i.e., positions [|P_idx|:]).
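A minimal sketch of how the per-step distribution can be read off after the prompt positions is shown below; the tensor layout and the name of the vocabulary projection are assumptions for illustration:

```python
import torch
import torch.nn as nn

def step_distribution(decoder_hidden, lm_head, prompt_len):
    """Probability distribution of the current output lemma.

    decoder_hidden: (batch, prompt_len + t, dim) hidden states over [P; y_<t]
    lm_head:        nn.Linear(dim, vocab_size) mapping to logits over the vocabulary
    prompt_len:     |P_idx|, the length of the multi-modal prompt vector
    """
    h_t = decoder_hidden[:, prompt_len:, :][:, -1, :]   # last position after the prompt
    return torch.softmax(lm_head(h_t), dim=-1)
```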
After the prediction description text is obtained, a cross-entropy loss is obtained based on the difference between the prediction description text and the corresponding reference description text. When the model parameters are adjusted, the CLIP module is frozen, and the parameters of the initial MLP and BART are adjusted based on the cross-entropy loss until the cross-entropy loss reaches a preset minimum value, at which point the training ends and the data processing model is obtained. The data processing model comprises the vision-language pre-training model, the fine-tuned MLP (i.e., the multilayer perception network), the fine-tuned BART (i.e., the target language model) and the context entity prompt construction module.
Fig. 7 is a block diagram illustrating a structure of a data processing apparatus according to an exemplary embodiment. Referring to fig. 7, the data processing apparatus 700 includes:
a data obtaining unit 710 configured to perform obtaining of a to-be-processed service image and a service text associated with the to-be-processed service image; the service text comprises named entities;
a visual cue vector determination unit 720, configured to perform obtaining a visual cue vector based on the extracted visual features of the to-be-processed service image;
the encoding unit 730 is configured to perform encoding processing on the visual cue vector and the service text to obtain an encoding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
an entity hint vector determination unit 740 configured to perform determining an entity hint vector based on the visual coded vector and an entity coded vector corresponding to each of the named entities in the text coded vector;
a decoding unit 750 configured to perform decoding processing on the coding vector based on the visual cue vector and the entity cue vector, so as to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the entity hint vector determination unit 740 includes:
a first degree of correlation determination unit configured to perform determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and the entity prompt vector determining subunit is configured to select a target entity code vector from the entity code vectors based on the correlation degree to form an entity prompt vector.
In an exemplary embodiment, the entity hint vector determination subunit includes:
the first key entity determining unit is configured to determine an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
a first dependency degree determining unit configured to perform determining a dependency degree between the key entity encoding vector and each remaining entity encoding vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
a first determining unit configured to perform determining a target remaining entity encoding vector based on the degree of dependency; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and the first constructing subunit is configured to splice the key entity code vector and the target residual entity code vector as target entity code vectors to obtain an entity prompt vector.
In an exemplary embodiment, the first dependency degree determining unit includes:
a first initialization unit configured to perform taking the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a first state vector determining unit configured to perform input of the residual entity code vectors into the bidirectional long-short term memory network, resulting in state vectors corresponding to the residual entity code vectors;
the first normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependency between the key entity encoding vector and the remaining entity encoding vector.
In an exemplary embodiment, the decoding unit 750 includes:
the multi-mode prompt vector determining unit is configured to perform splicing on the visual prompt vector and the entity prompt vector to obtain a multi-mode prompt vector;
and the decoding subunit is configured to perform autoregressive decoding processing based on the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the to-be-processed service image.
In an exemplary embodiment, the visual cue vector determining unit 720 includes:
the visual feature extraction unit is configured to input the to-be-processed business image into a visual-language pre-training model for feature extraction to obtain extracted visual features;
a first mapping unit configured to perform mapping of the visual features to an input space of a target language model based on a multi-layer perceptual network, resulting in a visual cue vector;
wherein the target language model is used for the encoding process and the decoding process.
In an exemplary embodiment, the apparatus further comprises a training unit, the training unit comprising:
the sample acquisition unit is configured to acquire a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity;
the sample visual cue vector determining unit is configured to perform extracting visual features of the sample business image based on a visual-language pre-training model, and mapping the visual features of the sample business image to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
the sample coding unit is configured to perform coding processing on the sample visual cue vector and the sample service text input to a coder of a pre-training language model to obtain a sample coding vector; the sample coding vectors comprise a sample visual vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
a sample entity prompt vector determination unit configured to perform a sample entity prompt vector determination based on the sample visual coding vector and a sample entity coding vector corresponding to each of the named entities in the sample text coding vector;
the sample multi-modal prompt vector determining unit is configured to perform splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal prompt vector, and inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and the parameter adjusting unit is configured to perform adjustment of model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and training is finished, so that the data processing model is obtained.
In an exemplary embodiment, the parameter adjusting unit includes:
a loss determination unit configured to perform determining a loss value based on a difference between the prediction description text and the reference description text;
a parameter adjusting subunit configured to perform fixing the model parameters of the visual-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
In an exemplary embodiment, the sample entity hint vector determination unit includes:
a second degree of correlation determination unit configured to perform determining a degree of correlation between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and the sample entity prompt vector determining subunit is configured to select a target sample entity code vector from the sample entity code vectors based on the sample correlation degree to form a sample entity prompt vector.
In one exemplary embodiment, the sample entity hint vector determination subunit includes:
the second key entity determining unit is configured to determine a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
a second dependency level determination unit configured to perform determining a sample dependency level between the key sample entity encoding vector and each remaining sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
a second determination unit configured to perform a determination of a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and the second construction subunit is configured to perform splicing on the key sample entity encoding vector and the target residual sample entity encoding vector as target sample entity encoding vectors to obtain a sample entity prompt vector.
In an exemplary embodiment, the second dependency degree determining unit includes:
a second initialization unit configured to perform taking the key sample entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a second state vector determining unit configured to perform input of the residual sample entity coded vector into the bidirectional long-short term memory network, resulting in a state vector corresponding to the residual sample entity coded vector;
the second normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, there is also provided an electronic device comprising a processor; a memory for storing processor-executable instructions; when the processor is configured to execute the instructions stored in the memory, any one of the data processing methods provided by the embodiments of the present disclosure is implemented.
The electronic device may be a terminal, a server, or a similar computing device. Taking a server as an example, fig. 8 is a block diagram of an electronic device for data processing according to an exemplary embodiment. As shown in fig. 8, the server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 810 (the processor 810 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 830 for storing data, and one or more storage media 820 (e.g., one or more mass storage devices) for storing an application program 823 or data 822. The memory 830 and the storage medium 820 may be transient or persistent storage. The program stored in the storage medium 820 may include one or more modules, each of which may include a series of instruction operations in the server. Further, the central processor 810 may be configured to communicate with the storage medium 820 to execute the series of instruction operations in the storage medium 820 on the server 800. The server 800 may also include one or more power supplies 860, one or more wired or wireless network interfaces 850, one or more input-output interfaces 840, and/or one or more operating systems 821, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so forth.
The input-output interface 840 may be used to receive or transmit data via a network. A specific example of the network may include a wireless network provided by a communication provider of the server 800. In one example, the input-output interface 840 includes a network interface controller (NIC) that may be connected to other network devices via a base station so as to communicate with the Internet. In another example, the input-output interface 840 may be a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 8 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 800 may also include more or fewer components than shown in FIG. 8, or have a different configuration than shown in FIG. 8.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 830 comprising instructions, executable by the processor 810 of the apparatus 800 to perform the method described above is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is further provided, which includes a computer program/instruction, and when executed by a processor, the computer program/instruction implements any one of the data processing methods provided by the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1. A data processing method, comprising:
acquiring a service image to be processed and a service text associated with the service image to be processed; the service text comprises named entities;
obtaining a visual prompt vector based on the extracted visual features of the service image to be processed;
coding the visual cue vector and the service text to obtain a coding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and decoding the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
2. The method of claim 1, wherein the determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each of the named entities in the text coding vector comprises:
determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and selecting a target entity coding vector from the entity coding vectors based on the correlation degree to form an entity prompt vector.
3. The method of claim 2, wherein the selecting a target entity-encoding vector from the entity-encoding vectors based on the degree of correlation to construct an entity hint vector comprises:
determining an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
determining the degree of dependence between the key entity code vector and each remaining entity code vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
determining a target remaining entity encoding vector based on the degree of dependence; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and splicing the key entity coding vector and the target residual entity coding vector as target entity coding vectors to obtain an entity prompt vector.
4. The method of claim 3, wherein determining a degree of dependency between the key entity-encoded vector and each remaining entity-encoded vector comprises:
taking the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual entity coding vectors into the bidirectional long-short term memory network to obtain state vectors corresponding to the residual entity coding vectors;
performing normalization processing on the basis of the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependency between the key entity encoding vector and the remaining entity encoding vector.
5. The method of claim 1, wherein the decoding the encoded vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed comprises:
splicing the visual cue vector and the entity cue vector to obtain a multi-mode cue vector;
and performing autoregressive decoding processing on the basis of the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
6. The method according to claim 1, wherein the deriving a visual cue vector based on the extracted visual features of the to-be-processed service image comprises:
inputting the business image to be processed into a vision-language pre-training model for feature extraction to obtain extracted visual features;
mapping the visual features to an input space of a target language model based on a multilayer perception network to obtain visual prompt vectors;
wherein the target language model is used for the encoding process and the decoding process.
7. The method according to any one of claims 1 to 6, wherein the method is implemented based on a data processing model, the method further comprising the step of training the data processing model:
acquiring a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity;
the visual features of the sample business image are extracted based on a visual-language pre-training model, and the visual features of the sample business image are mapped to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
inputting the sample visual cue vector and the sample service text into an encoder of a pre-training language model for encoding processing to obtain a sample encoding vector; the sample coding vectors comprise a sample visual vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector;
splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector; inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and adjusting model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and the training is finished to obtain the data processing model.
8. The method of claim 7, wherein the adjusting model parameters based on the difference between the predictive description text and the reference description text comprises:
determining a loss value based on a difference between the prediction description text and the reference description text;
and fixing the model parameters of the vision-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
9. The method of claim 7, wherein determining a sample entity hint vector based on the sample visual coding vector and a sample entity coding vector corresponding to each of the named entities in the sample text coding vector comprises:
determining a sample correlation degree between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form a sample entity prompt vector.
10. The method of claim 9, wherein the selecting a target sample entity code vector from the sample entity code vectors based on the sample correlation degree to form a sample entity hint vector comprises:
determining a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
determining a sample dependency degree between the key sample entity encoding vector and each residual sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
determining a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and splicing the key sample entity coding vector and the target residual sample entity coding vector as target sample entity coding vectors to obtain a sample entity prompt vector.
11. The method of claim 10, wherein determining the sample dependency between the key sample entity encoding vector and each remaining sample entity encoding vector comprises:
taking the key sample entity coding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual sample entity coding vector into the bidirectional long-short term memory network to obtain a state vector corresponding to the residual sample entity coding vector;
performing normalization processing on the basis of the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
12. A data processing apparatus, comprising:
the data acquisition unit is configured to acquire a to-be-processed service image and a service text associated with the to-be-processed service image; the service text comprises named entities;
the visual cue vector determining unit is configured to obtain a visual cue vector based on the extracted visual features of the to-be-processed service image;
the coding unit is configured to perform coding processing on the visual cue vector and the service text to obtain a coding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
an entity prompt vector determination unit configured to perform determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and the decoding unit is configured to perform decoding processing on the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of any one of claims 1 to 11.
14. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of any of claims 1 to 11.
15. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the data processing method of any of claims 1 to 11.
CN202210901461.0A 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium Pending CN115393849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210901461.0A CN115393849A (en) 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210901461.0A CN115393849A (en) 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115393849A true CN115393849A (en) 2022-11-25

Family

ID=84116138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210901461.0A Pending CN115393849A (en) 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115393849A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894089A (en) * 2023-08-11 2023-10-17 腾讯科技(深圳)有限公司 Digest generation method, digest generation device, digest generation apparatus, digest generation program, and digest generation program
CN116894089B (en) * 2023-08-11 2023-12-15 腾讯科技(深圳)有限公司 Digest generation method, digest generation device, digest generation apparatus, digest generation program, and digest generation program


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination