CN115393849A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115393849A
Authority
CN
China
Prior art keywords
vector
entity
sample
coding
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210901461.0A
Other languages
Chinese (zh)
Inventor
张菁菁
方山城
毛震东
张志伟
陈小帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Research Institute
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Zhongke Research Institute
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Research Institute, Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Zhongke Research Institute
Priority to CN202210901461.0A priority Critical patent/CN115393849A/en
Publication of CN115393849A publication Critical patent/CN115393849A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/467Encoded features or binary features, e.g. local binary patterns [LBP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a data processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a service image to be processed and a service text associated with the service image; obtaining a visual cue vector based on visual features extracted from the service image to be processed; coding the visual cue vector and the service text to obtain coding vectors, where the coding vectors include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text; determining an entity prompt vector based on the visual coding vector and the entity coding vector corresponding to each named entity in the text coding vector; and decoding the coding vectors based on the visual cue vector and the entity prompt vector to obtain a description text corresponding to the service image to be processed. The present disclosure improves the accuracy of the named entities generated in the description text.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
At present, news articles published on the internet often contain news images, and each news image is also provided with corresponding descriptive text. To improve the efficiency of publishing news articles on the internet, descriptive text is typically automatically generated for each news image in the news article.
In a news scenario, a large number of named entities such as person names, place names, and organization names are involved, and these entities often fall outside the vocabulary of a language model (out of vocabulary, OOV). To automatically generate description texts containing named entities, the related art combines manually written entity-level description templates with a language model. However, the coverage of manually written entity-level templates is limited and only local entity words can be perceived, so the accuracy of named-entity generation is insufficient, which in turn reduces the accuracy of the description texts corresponding to news images.
Disclosure of Invention
The disclosure provides a data processing method, a data processing apparatus, an electronic device and a storage medium, which are used to at least solve the problem in the related art of poor accuracy when automatically generating the description text corresponding to a news image. The technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a data processing method, including:
acquiring a service image to be processed and a service text associated with the service image to be processed; the service text comprises named entities;
obtaining a visual cue vector based on the extracted visual characteristics of the service image to be processed;
coding the visual cue vector and the service text to obtain a coding vector; the coding vectors comprise visual coding vectors corresponding to the visual cue vectors and text coding vectors corresponding to the business texts;
determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and decoding the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the determining an entity prompt vector based on the visual encoding vector and the entity encoding vector corresponding to each named entity in the text encoding vector comprises:
determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and selecting a target entity coding vector from the entity coding vectors based on the correlation degree to form an entity prompt vector.
In an exemplary embodiment, the selecting a target entity code vector from the entity code vectors based on the degree of correlation, and forming an entity hint vector includes:
determining an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
determining the degree of dependence between the key entity code vector and each remaining entity code vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
determining a target remaining entity encoding vector based on the degree of dependence; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and splicing the key entity coding vector and the target residual entity coding vector as target entity coding vectors to obtain an entity prompt vector.
In an exemplary embodiment, the determining the degree of dependency between the key entity encoding vector and each remaining entity encoding vector includes:
taking the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual entity coding vectors into the bidirectional long-short term memory network to obtain state vectors corresponding to the residual entity coding vectors;
performing normalization processing based on the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependence between the key entity encoding vector and the corresponding remaining entity encoding vector.
In an exemplary embodiment, the decoding the coding vector based on the visual cue vector and the entity cue vector to obtain the description text corresponding to the service image to be processed includes:
splicing the visual cue vector and the entity cue vector to obtain a multi-mode cue vector;
and performing autoregressive decoding processing on the basis of the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the obtaining a visual cue vector based on the extracted visual features of the to-be-processed business image includes:
inputting the business image to be processed into a vision-language pre-training model for feature extraction to obtain extracted visual features;
mapping the visual features to an input space of a target language model based on a multilayer perception network to obtain visual prompt vectors;
wherein the target language model is used for the encoding process and the decoding process.
In an exemplary embodiment, the method is implemented based on a data processing model, the method further comprising the step of training the data processing model:
acquiring a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, and the sample service text comprises a named entity;
the visual features of the sample business image are extracted based on a visual-language pre-training model, and the visual features of the sample business image are mapped to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
inputting the sample visual cue vector and the sample service text into an encoder of a pre-training language model for encoding processing to obtain a sample encoding vector; the sample encoding vector comprises a sample visual coding vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector;
splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector; inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and adjusting model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and the training is finished to obtain the data processing model.
In an exemplary embodiment, the adjusting of the model parameter based on the difference between the prediction description text and the reference description text includes:
determining a loss value based on a difference between the prediction description text and the reference description text;
and fixing the model parameters of the vision-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
In an exemplary embodiment, the determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector comprises:
determining a degree of correlation between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form a sample entity prompt vector.
In an exemplary embodiment, the selecting a target sample entity encoding vector from the sample entity encoding vectors based on the sample correlation degree, and forming a sample entity hint vector includes:
determining a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
determining a sample dependency degree between the key sample entity encoding vector and each residual sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector except the key sample entity encoding vector;
determining a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and splicing the key sample entity coding vector and the target residual sample entity coding vector as target sample entity coding vectors to obtain a sample entity prompt vector.
In an exemplary embodiment, the determining the sample dependency between the key sample entity encoding vector and each remaining sample entity encoding vector includes:
taking the key sample entity coding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual sample entity coding vector into the bidirectional long-short term memory network to obtain a state vector corresponding to the residual sample entity coding vector;
performing normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including:
the data acquisition unit is configured to acquire a to-be-processed service image and a service text associated with the to-be-processed service image; the service text comprises named entities;
the visual cue vector determining unit is configured to obtain a visual cue vector based on the extracted visual features of the service image to be processed;
the coding unit is configured to perform coding processing on the visual cue vector and the service text to obtain a coding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
an entity prompt vector determination unit configured to perform determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and the decoding unit is configured to perform decoding processing on the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the entity hint vector determination unit includes:
a first degree of correlation determination unit configured to perform determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and the entity prompt vector determining subunit is configured to select a target entity code vector from the entity code vectors based on the correlation degree to form an entity prompt vector.
In an exemplary embodiment, the entity hint vector determination subunit includes:
the first key entity determining unit is configured to determine an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
a first dependency level determination unit configured to perform determining a dependency level between the key entity encoding vector and each remaining entity encoding vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
a first determining unit configured to perform determining a target remaining entity encoding vector based on the degree of dependency; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and the first construction subunit is configured to perform splicing on the key entity coding vector and the target residual entity coding vector as target entity coding vectors to obtain an entity prompt vector.
In an exemplary embodiment, the first dependency degree determining unit includes:
a first initialization unit configured to use the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a first state vector determining unit configured to perform input of the residual entity code vector to the bidirectional long-short term memory network, resulting in a state vector corresponding to the residual entity code vector;
the first normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependency between the key entity encoding vector and the remaining entity encoding vector.
In an exemplary embodiment, the decoding unit includes:
the multi-modal prompt vector determining unit is configured to perform splicing on the visual prompt vector and the entity prompt vector to obtain a multi-modal prompt vector;
and the decoding subunit is configured to perform autoregressive decoding processing based on the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the visual cue vector determination unit includes:
the visual feature extraction unit is configured to input the to-be-processed business image into a visual-language pre-training model for feature extraction to obtain extracted visual features;
a first mapping unit configured to perform mapping of the visual features to an input space of a target language model based on a multi-layer perceptual network, resulting in a visual cue vector;
wherein the target language model is used for the encoding process and the decoding process.
In an exemplary embodiment, the apparatus further comprises a training unit comprising:
the sample acquisition unit is configured to acquire a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity;
the sample visual cue vector determining unit is configured to extract visual features of the sample business image based on a visual-language pre-training model, and map the visual features of the sample business image to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
the sample coding unit is configured to perform coding processing by inputting the sample visual cue vector and the sample service text into an encoder of a pre-training language model to obtain a sample coding vector; the sample coding vector comprises a sample visual coding vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
a sample entity prompt vector determination unit configured to perform a sample entity prompt vector determination based on the sample visual coding vector and a sample entity coding vector corresponding to each of the named entities in the sample text coding vector;
the sample multi-modal cue vector determining unit is configured to splice the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector, and input the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and the parameter adjusting unit is configured to perform adjustment of model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and training is finished, so that the data processing model is obtained.
In an exemplary embodiment, the parameter adjusting unit includes:
a loss determination unit configured to perform determining a loss value based on a difference between the prediction description text and the reference description text;
a parameter adjusting subunit configured to keep the model parameters of the visual-language pre-training model fixed and to adjust the parameters of the initial multi-layer perceptual network and the pre-training language model based on the loss value.
In an exemplary embodiment, the sample entity hint vector determination unit includes:
a second degree of correlation determination unit configured to perform determining a degree of correlation between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and the sample entity prompt vector determining subunit is configured to select a target sample entity code vector from the sample entity code vectors based on the sample correlation degree to form a sample entity prompt vector.
In an exemplary embodiment, the sample entity hint vector determination subunit includes:
the second key entity determining unit is configured to determine a sample entity encoding vector corresponding to the maximum sample correlation degree to obtain a key sample entity encoding vector;
a second dependency level determination unit configured to perform determining a sample dependency level between the key sample entity encoding vector and each remaining sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
a second determination unit configured to perform a determination of a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and the second construction subunit is configured to perform splicing on the key sample entity encoding vector and the target residual sample entity encoding vector as target sample entity encoding vectors to obtain a sample entity prompt vector.
In an exemplary embodiment, the second dependency degree determining unit includes:
a second initialization unit configured to use the key sample entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a second state vector determination unit configured to perform input of the residual sample entity coded vector into the bidirectional long-short term memory network, resulting in a state vector corresponding to the residual sample entity coded vector;
the second normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of the first aspect described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the data processing method of the first aspect described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
The visual cue vector is obtained based on visual features extracted from the service image to be processed, and the visual cue vector and the service text associated with the service image are encoded to obtain coding vectors, which include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text. The entity cue vector is then determined based on the visual coding vector and the entity coding vector corresponding to each named entity in the text coding vector, and the coding vectors are decoded based on the visual cue vector and the entity cue vector to obtain the description text corresponding to the service image to be processed. As a result, the named entities related to the content of the service image receive more attention in the decoding process, which improves the accuracy of generating the named entities and thereby improves the accuracy of the description text.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of an application environment illustrating a method of data processing in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of data processing in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating the construction of an entity hint vector in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating another construction of an entity hint vector in accordance with an illustrative embodiment;
FIG. 5 is a schematic diagram of a data processing model shown in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating the fine tuning of a pre-trained model to arrive at the data processing model of FIG. 5, in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating the structure of a data processing apparatus according to an exemplary embodiment;
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are both information and data that are authorized by the user or sufficiently authorized by various parties.
News image description generation aims to automatically generate descriptive text for the pictures in news articles. Because generating a news image description requires attending both to the visual subject in the image and to the background information contained in the article, conventional image description generation methods, which are limited to image input, cannot process long text information. In addition, a large number of named entities, such as names of people, places and organizations, are involved in a news scenario, and these entity words often exceed the vocabulary range (out of vocabulary, OOV) of the language model, so conventional image description generation methods cannot handle the generation of texts containing entity words.
In the related art, some methods adopt templating: all entity words in a description sentence are replaced with slots representing entity words, and entity-level matching and slot filling are then performed against the input. These methods can be subdivided into two-stage generation and one-stage generation, where two-stage generation first generates a template sentence without entity words and then predicts the entity words, and one-stage generation considers the distribution probability over non-entity words and entity words when generating each lemma so as to realize end-to-end dynamic prediction. However, these methods have the following drawbacks: 1) the generation of entity words depends heavily on the quality of the manually written entity-level templates, and in practical applications the image description sentences in news scenarios are usually natural sentences rich in information, so defining an optimal template is time-consuming and it is difficult for such templates to cover all description sentences; 2) text generation is limited to a template-based supervised training mode; multi-stage iterative generation and template filling increase the complexity of the model and cause a certain accumulation of errors, while one-stage generation can hardly alleviate the OOV problem, so the accuracy of entity word generation is poor.
In other related-art methods, in order to ensure the precision of entity words, a further refined entity word template is used as a supervision signal; that is, the entity word generation space is further constrained by introducing labels of entity word types according to the classification of the grammatical components of entity words in sentences. However, supervised training based on entity-level templates can only perceive entity words locally, which not only damages the semantic coherence of the language structure and context but also lacks perception of the internal relations of global entity words in the language space, so the entity words in the finally generated image description text are still not accurate enough.
In view of this, the embodiment of the present disclosure provides a data processing method. A visual cue vector is obtained based on visual features extracted from a service image to be processed, and the visual cue vector and the service text associated with the service image are encoded to obtain coding vectors, which include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text. An entity cue vector is then determined based on the visual coding vector and the entity coding vector corresponding to each named entity in the text coding vector, and the coding vectors are decoded based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed. In this way, the named entities related to the content of the service image receive more attention in the decoding process, which improves the accuracy of generating the named entities and thereby improves the accuracy of the image description text.
Referring to fig. 1, an application environment of a data processing method according to an exemplary embodiment is shown, where the application environment may include a terminal 110 and a server 120, and the terminal 110 and the server 120 may be connected through a wired network or a wireless network.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal 110 may have client software such as an Application (App) for providing a data processing function installed therein, and the Application may be a stand-alone Application or a sub-program in the Application. Illustratively, the application may include a news-like application or the like, such as an application having a news release function. The user of the terminal 110 may log into the application through pre-registered user information, which may include an account number and a password.
The server 120 may be a server providing a background service for the application program in the terminal 110, and specifically, the background service may be a service for processing an image to generate a description text corresponding to the image. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and big data and artificial intelligence platforms.
To elaborate on the technical solution of the embodiment of the present disclosure, fig. 2 shows a flowchart of a data processing method according to an exemplary embodiment. The method can be applied to the electronic device shown in fig. 1 and includes the following steps:
in step S201, a to-be-processed service image and a service text associated with the to-be-processed service image are obtained.
The service text associated with the service image to be processed includes named entities, such as a person name, a place name, an organization name, and the like. For example, the service image to be processed may be a news image, and the service text associated with the service image to be processed may be a news article corresponding to the news image.
In step S203, a visual cue vector is obtained based on the extracted visual features of the to-be-processed service image.
The visual feature of the service image to be processed refers to a feature obtained by feature extraction of the service image to be processed.
The visual cue vector is a vector obtained by performing dimension transformation on visual features, and the dimension transformation is used for establishing a connection between a visual concept and language prior. In a specific implementation, the dimension transformation may be performed based on the input dimension of the text information for the subsequent encoding process.
In step S205, the visual cue vector and the service text are encoded to obtain an encoded vector.
Wherein the coding vectors include a visual coding vector corresponding to the visual cue vector and a text coding vector corresponding to the service text.
Specifically, the word segmentation processing may be performed on the service text, each word is used as a token (lemma) to obtain a word sequence, then the visual cue vector and the word sequence are spliced to obtain a sequence to be coded, and then the sequence to be coded is coded.
In the embodiment of the present disclosure, the service image to be processed and the service text are encoded simultaneously based on a bidirectional self-attention mechanism so that the context information can be used; that is, the sequence to be encoded is encoded based on the bidirectional self-attention mechanism. For example, the encoding process may be expressed as:

C = SA(p_v, x_A)

where SA(·) represents the self-attention mechanism; p_v represents the visual cue vector; x_A represents the service text; and C denotes the encoding vector. C comprises C_I and C_A, where C_I represents the visual encoding vector and C_A represents the text encoding vector; C_A contains the entity encoding vectors C_E, i.e. C_E ∈ C_A.
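For illustration, a minimal PyTorch-style sketch of this encoding step is given below, assuming an encoder module that applies bidirectional self-attention and returns hidden states of the same length; all function and variable names are illustrative assumptions rather than the patent's implementation.

```python
# A minimal sketch of C = SA(p_v, x_A); "encoder" is assumed to be any module applying
# bidirectional self-attention, "embed_tokens" any lemma-embedding layer.
import torch

def encode(encoder, embed_tokens, p_v, text_token_ids):
    """p_v:            (k, hidden) visual cue vectors mapped into the language model input space
       text_token_ids: (n,)        lemma ids of the service text x_A"""
    x_a = embed_tokens(text_token_ids)               # (n, hidden) text lemma embeddings
    seq = torch.cat([p_v, x_a], dim=0)               # sequence to be encoded [p_v; x_A]
    c = encoder(seq)                                 # encoding vector C, (k + n, hidden)
    c_i, c_a = c[: p_v.size(0)], c[p_v.size(0):]     # visual encoding C_I and text encoding C_A
    return c_i, c_a
```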
In step S207, an entity prompt vector is determined based on the visual encoding vector and the entity encoding vector corresponding to each named entity in the text encoding vector.
And the entity coding vector is a vector corresponding to the named entity in the text coding vector.
In a specific implementation, all named entities in the service text may be tagged by using a natural language processing toolkit (e.g., the spaCy toolkit), so that the entity encoding vector corresponding to each named entity can be extracted from the encoding vectors based on the tagging information.
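As an illustration of this tagging step, the sketch below uses the spaCy toolkit mentioned above; the specific model name and the word-level alignment with the encoding vectors are assumptions for illustration only.

```python
# A sketch of tagging named entities with spaCy; the loaded model and the assumption of
# one lemma per word are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_spans(service_text):
    """Return (start, end, label) word spans for every named entity in the service text."""
    doc = nlp(service_text)
    return [(ent.start, ent.end, ent.label_) for ent in doc.ents]

# With one lemma per word, the spans index directly into the text encoding vectors C_A,
# giving the entity encoding vectors C_E:
# c_e = [c_a[start:end] for start, end, _ in entity_spans(text)]
```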
In an exemplary embodiment, in order to enable the entity hint vector to more accurately reflect the content of the image to be processed, the step S207 may be implemented to include:
determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and selecting a target entity coding vector from the entity coding vectors based on the correlation degree to form an entity prompt vector.
In a specific implementation, the correlation score corresponding to each entity encoding vector can be calculated by the following formula to characterize the degree of correlation:

s_i = ⟨avg(C_I), c_i⟩ / τ

where avg(C_I) represents a weighted average of the visual encoding vector C_I; c_i represents the entity encoding vector i (c_i ∈ C_E); τ represents a hyper-parameter that can be set according to practical experience; ⟨·,·⟩ represents a dot-product calculation; and s_i represents the relevance score of the entity encoding vector i.

A set of relevance scores S_E may thus be obtained, which includes the relevance score corresponding to each entity encoding vector, and a target entity encoding vector whose degree of correlation with the visual encoding vector meets a preset condition is selected from the entity encoding vectors based on these relevance scores to form an entity prompt vector.
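A small sketch of this relevance computation is given below, under the assumption that the score is the dot product between a pooled visual encoding vector and each pooled entity encoding vector scaled by the hyper-parameter τ; the pooling choice and the value of τ are illustrative.

```python
# A sketch of s_i = <avg(C_I), c_i> / tau; plain mean pooling stands in for the weighted average.
import torch

def relevance_scores(c_i, entity_vectors, tau=0.07):
    """c_i:            (num_visual, hidden) visual encoding vectors C_I
       entity_vectors: list of (len_j, hidden) entity encoding vectors, one per named entity."""
    visual_mean = c_i.mean(dim=0)                        # pooled visual encoding vector
    scores = []
    for c_e in entity_vectors:
        e = c_e.mean(dim=0)                              # pool a multi-lemma entity into one vector
        scores.append(torch.dot(visual_mean, e) / tau)   # dot product scaled by hyper-parameter tau
    return torch.stack(scores)                           # relevance score set S_E
```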
In an exemplary embodiment, when selecting a target entity code vector from the entity code vectors based on the degree of correlation, constructing an entity hint vector may include the following steps as shown in fig. 3:
in step S301, the entity encoding vector corresponding to the maximum correlation degree is determined, and a key entity encoding vector is obtained.
In a specific implementation, the key entity encoding vector C_key can be expressed by the following formula:

C_key = c_{i*}, where i* = argmax_i s_i

that is, C_key is the entity encoding vector whose relevance score s_i is the largest.
in step S303, the degree of dependency between the key entity-encoding vector and each remaining entity-encoding vector is determined.
Wherein the remaining entity encoding vectors refer to entity encoding vectors other than the key entity encoding vector.
In an exemplary embodiment, in order to obtain a context hint vector at a entity level to improve the accuracy of entity words in the generated description text, as shown in fig. 4, the step S303 may be implemented to include:
in step S401, the key entity encoding vector is used as an initial hidden state of the bi-directional long-short term memory network.
In step S403, the remaining physical code vectors are input into the bidirectional long-short term memory network, and state vectors corresponding to the remaining physical code vectors are obtained.
In step S405, normalization processing is performed based on the state vector corresponding to the residual entity encoding vector, so as to obtain a normalization result corresponding to the residual entity encoding vector.
Wherein the normalization result characterizes a degree of dependency between the key entity encoding vector and the corresponding remaining entity encoding vector.
The bidirectional long and short term memory network is initialized in step S401 so that the initial hidden state of the initialized bidirectional long and short term memory network is the key entity code vector, and then the remaining entity code vectors are input into the initialized bidirectional long and short term memory network in step S403 to be processed to obtain the state vector corresponding to the remaining entity code vector.
In a specific implementation, the state vectors can be represented by the following formula:

C_prompt = LSTM(C_key, C′_E)

Accordingly, the normalization result can be expressed by the following formula:

S_prompt = σ(fc(C_prompt))

where LSTM represents the bidirectional long-short term memory network, with C_key as its initial hidden state; C′_E represents the remaining entity encoding vectors; C_prompt represents the state vectors output by the network; fc(·) represents a single linear layer; σ(·) represents an activation function; and S_prompt represents the normalization result, a numerical value that can serve as a score.
According to the embodiment, the key entity encoding vector is used as the initial hidden state of the bidirectional LSTM, so that the potential dependency relationship between the key entity encoding vector and other entity encoding vectors can be modeled based on the context information, and the accuracy of determining the dependency relationship is improved.
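The following sketch illustrates steps S401 to S405 with PyTorch's bidirectional LSTM; the hidden sizes, the reuse of the key vector as the initial hidden state for both directions, and the sigmoid activation are illustrative assumptions.

```python
# A sketch of the dependency scoring: C_prompt = LSTM(C_key, C'_E), S_prompt = sigma(fc(C_prompt)).
import torch
import torch.nn as nn

class EntityDependencyScorer(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.lstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)   # single linear layer fc()

    def forward(self, c_key, c_rest):
        """c_key:  (hidden,)   key entity encoding vector
           c_rest: (m, hidden) remaining entity encoding vectors C'_E"""
        # Use the key entity encoding vector as the initial hidden state of both directions.
        h0 = c_key.view(1, 1, -1).repeat(2, 1, 1)                 # (num_directions, batch, hidden)
        c0 = torch.zeros_like(h0)
        c_prompt, _ = self.lstm(c_rest.unsqueeze(0), (h0, c0))    # state vectors C_prompt
        s_prompt = torch.sigmoid(self.fc(c_prompt))               # sigma(fc(C_prompt))
        return s_prompt.squeeze(-1).squeeze(0)                    # dependency score per remaining entity
```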
In step S305, a target remaining entity encoding vector is determined based on the degree of dependency.
And the dependency degree corresponding to the target residual entity coding vector is greater than a preset dependency degree threshold value.
The preset dependency threshold value can be set according to actual needs.
In a particular implementation, if S_prompt^i > ε, where ε is the preset dependency threshold, the remaining entity encoding vector c′_i corresponding to S_prompt^i is determined to be a target remaining entity encoding vector, so that the set of target remaining entity encoding vectors C′_prompt is obtained.
In step S307, the key entity encoding vector and the target remaining entity encoding vector are used as target entity encoding vectors to be spliced, so as to obtain an entity prompt vector.
In particular implementations, the entity prompt vector p_μ can be expressed by the following formula:

p_μ = [C_key; C′_prompt]

where [;] denotes splicing and C′_prompt denotes the target remaining entity encoding vectors.
the above embodiment realizes the screening and aggregation of important entity encoding vectors (namely target entity encoding vectors) from the entity encoding vectors based on the context information, and generates accurate and global entity-level-oriented context hint vectors.
In step S209, the coding vector is decoded based on the visual cue vector and the entity cue vector, so as to obtain a description text corresponding to the service image to be processed.
Specifically, the visual cue vector and the entity cue vector can be spliced to obtain a multi-modal cue vector, and then the coding vector and the multi-modal cue vector are utilized to perform decoding processing through a cross attention mechanism, so that a description text corresponding to the service image to be processed is obtained.
According to the technical scheme of the embodiment of the disclosure, the description text of the to-be-processed service image is obtained by constructing the visual cue vector and the entity cue vector and decoding the coding vector by combining the two cue vectors, so that the named entity related to the content of the service image can be focused more during decoding, the accuracy of the named entity generated in the description text is improved, and the accuracy of the generated description text is further improved.
In an exemplary embodiment, in order to improve the accuracy of the generated description text, the decoding process in step S209 may be an autoregressive decoding process, that is, the prediction result of the current time step needs to depend on the prediction result of the past time step. Therefore, the step S209 may include:
splicing the visual cue vector and the entity cue vector to obtain a multi-mode cue vector;
and performing autoregressive decoding processing on the basis of the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
In particular, the multi-modal prompt vector P may be represented as P = [p_v; p_μ] = [p_1, …, p_k], where k represents the length of the multi-modal prompt vector. In the embodiment of the present disclosure, the length k of the multi-modal prompt vector is fixed, and the specific value can be set according to practical experience.

Let t denote the time step of autoregressive decoding and y_t denote the lemma generated at time step t. When performing autoregressive decoding processing based on the multi-modal prompt vector and the encoding vector, a sequence S = [P; y] is first formed, where y = {y_1, …, y_{t-1}}, and S is processed by the self-attention mechanism to obtain an output vector SelfAtt(S); then cross-attention processing is carried out based on SelfAtt(S) and the encoding vector C, and the lemma corresponding to time step t is predicted based on the cross-attention processing result h_t.

Note that when the embodiment of the present disclosure processes S = [P; y] by the self-attention mechanism to obtain the output vector SelfAtt(S), if the input of the query Q is y and the input of the key-value pair K-V is S, then:

SelfAtt(S) = softmax( Q · [K_P; K]^T / sqrt(d_H) ) · [V_P; V]

where Q, K and V are obtained by mapping y; K_P and V_P are obtained by mapping P; [;] represents a splice; and d_H represents the feature dimension of the self-attention mechanism.

In the embodiment of the present disclosure, when performing the cross-attention processing based on SelfAtt(S) and the encoding vector C, the "query Q" input is the above SelfAtt(S), and the "key-value pair K-V" is the encoding vector C. When predicting the lemma corresponding to time step t based on the cross-attention processing result h_t, the vector corresponding to the part of S located after the multi-modal prompt vector is extracted from h_t to calculate the distribution probability of the lemma. The calculation formula of the distribution probability p(y_t) can be expressed as follows:

p(y_t) = softmax( W · h_t[|P_idx|:] )

where W represents the matrix that maps the logit distribution onto the vocabulary, and [|P_idx|:] indicates the part of the sequence located after the multi-modal prompt vector.
In the above embodiment, the autoregressive decoding processing is performed based on the multi-modal prompt vector and the encoding vector, and the multi-modal prompt vector is used to guide the generation of the entity lemmas and the non-entity lemmas in each decoding step, so that the entity words in the generated description text are more accurate.
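To make the decoding step concrete, the simplified sketch below shows one decoding step that combines prompt-conditioned self-attention with cross-attention over the encoding vector; it uses a single attention head, omits the per-layer projection matrices, and all names are illustrative assumptions rather than the patent's implementation.

```python
# A simplified sketch of one autoregressive decoding step under the multi-modal prompt.
import math
import torch

def attention(q, k, v):
    d = q.size(-1)
    return torch.softmax(q @ k.t() / math.sqrt(d), dim=-1) @ v

def prompted_decoder_step(y, p, c, w_vocab):
    """y: (t-1, d) embeddings of the lemmas generated so far
       p: (k, d)   multi-modal prompt vector P = [p_v; p_mu]
       c: (n, d)   encoding vectors C from the encoder
       w_vocab: (d, vocab) matrix mapping hidden states onto the vocabulary."""
    s = torch.cat([p, y], dim=0)        # S = [P; y]
    h = attention(y, s, s)              # SelfAtt(S): queries from y, keys/values from S
    h = attention(h, c, c)              # cross attention with the encoding vector C
    # Queries come only from y, so every output position already lies after the prompt;
    # the last position corresponds to h_t for the current time step t.
    logits = h[-1] @ w_vocab
    return torch.softmax(logits, dim=-1)   # distribution probability of lemma y_t
```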
In an exemplary embodiment, in order to improve the influence of the visual cue vector on semantic understanding so as to focus more accurately on the entity words related to the image content, step S203 may be implemented by:
inputting a service image to be processed into a vision-language pre-training model for feature extraction to obtain extracted visual features;
mapping the visual features to an input space of a target language model based on a multilayer perception network to obtain visual prompt vectors;
wherein the target language model is used for the encoding process and the decoding process.
Specifically, the business image to be processed may be encoded by the image encoder of a visual-language pre-training model to obtain encoding features, which are used as the extracted visual features. The visual-language pre-training model may be a CLIP (Contrastive Language-Image Pre-training) model; CLIP is a pre-trained model obtained through contrastive learning on a large-scale image-text data set.
The target language model is obtained by fine-tuning the pre-training language model based on the sample service image and the sample service text corresponding to the sample service image, and the target language model is used for realizing the encoding processing and the decoding processing in the embodiment of the disclosure.
Illustratively, the pre-trained language model includes a Transformer-based encoder and a Transformer-based decoder. The Transformer-based encoder may be an encoder with bidirectional encoding capability. For example, the pre-trained language model may be BART (Bidirectional and Auto-Regressive Transformers); BART uses a standard Transformer-based neural machine translation architecture and can be regarded as a generalization of pre-trained models such as BERT (a bidirectional encoder) and GPT (a left-to-right decoder).
In the embodiment, the visual features are extracted through the visual-language pre-training model, so that the visual features which are more consistent with semantic understanding can be obtained by utilizing the high-level image semantic understanding capability of the visual-language pre-training model, the influence of the obtained visual cue vector on the semantic understanding can be favorably improved, and the accuracy of describing the entity words in the text can be improved.
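The sketch below illustrates this mapping from visual features to visual cue vectors with a small multilayer perception network; the dimensions, the prompt length and the activation function are illustrative assumptions rather than the patent's configuration, and the CLIP image features are assumed to be computed separately.

```python
# A sketch of step S203: map frozen vision-language features into the language model input space.
import torch
import torch.nn as nn

class VisualPromptMapper(nn.Module):
    def __init__(self, clip_dim=512, lm_dim=768, k=4):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(                 # multilayer perception network
            nn.Linear(clip_dim, lm_dim * k),
            nn.Tanh(),
            nn.Linear(lm_dim * k, lm_dim * k),
        )

    def forward(self, clip_image_features):
        """clip_image_features: (clip_dim,) output of the vision-language model's image encoder."""
        p_v = self.mlp(clip_image_features)       # project into the target language model input space
        return p_v.view(self.k, -1)               # visual cue vectors p_v, (k, lm_dim)
```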
As can be seen from the foregoing implementation manner, the data processing method according to the embodiment of the present disclosure may be implemented based on a data processing model, where the data processing model is composed of a visual-language pre-training model, a multi-layer perceptual network, a target language model and an entity prompt vector building module, and as shown in fig. 5, the data processing model provided in the embodiment of the present disclosure is a schematic structural diagram of the data processing model, where the target language model is obtained by performing a fine-tuning process on the pre-training language model and includes a transform-based encoder and a transform-based decoder.
In specific implementation, a service image to be processed is input into a visual-language pre-training model, an image encoder of the visual-language pre-training model performs encoding processing to obtain output visual features, the visual features are used as input of a multilayer perception network, the multilayer perception network is mapped to an input space of a target language model to obtain a visual cue vector, the visual cue vector is spliced with a service text associated with the service image to be processed and then used as input of an encoder in the target language model, and the encoder of the target language model performs encoding processing to obtain an encoding vector. The coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text.
And inputting the entity coding vectors corresponding to the named entities in the visual coding vector and the text coding vector into an entity prompt vector construction model, wherein the entity prompt vector construction model selects a target entity coding vector from the entity coding vectors based on the correlation degree between the visual prompt vector and each entity coding vector to form the entity prompt vector.
After the visual cue vector and the entity cue vector are spliced, they form, together with the encoding vector output by the encoder, the input of the decoder in the target language model. The decoder may include a plurality of stacked Transformer decoding modules, and each decoding module includes two attention layers. The first attention layer takes the spliced visual cue vector and entity cue vector together with the prediction results of previous time steps as input and performs processing based on a self-attention mechanism; the second attention layer performs autoregressive decoding based on a cross-attention mechanism using the processing result of the first attention layer and the encoding vector output by the encoder. The processing manner of the self-attention mechanism of the first attention layer may refer to the related content of the foregoing step S209, which is not described herein again.
Based on this, in an exemplary implementation manner, the embodiment of the present disclosure may further include a step of training the data processing model, and specifically, training the data processing model may include:
and acquiring a sample service image text pair and a corresponding reference description text. The sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity.
And mapping the visual features of the sample business images to an input space of a pre-training language model based on the initial multilayer perception network to obtain sample visual cue vectors.
And inputting the sample visual cue vector and the sample service text into an encoder of the pre-training language model for encoding processing to obtain a sample encoding vector. The sample encoding vector includes a sample visual coding vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text.
And determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector.
Splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector; and inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text.
And adjusting model parameters based on the difference between the prediction description text and the reference description text until a preset training ending condition is reached and training is ended to obtain a data processing model.
The preset training end condition may be set according to actual needs, for example, the iteration number reaches a preset iteration number threshold, or the loss value reaches a preset loss threshold, or the like.
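By way of a non-limiting illustration, the training steps listed above may be sketched as a single training step as follows; here clip_model, mlp_mapper, lang_model and entity_prompt_builder are hypothetical stand-ins for the visual-language pre-training model, the initial multilayer perception network, the pre-training language model and the entity prompt vector construction module, and their interfaces are assumptions rather than part of the present disclosure:

```python
import torch
import torch.nn.functional as F

def train_step(batch, clip_model, mlp_mapper, lang_model,
               entity_prompt_builder, optimizer):
    image, text_ids, reference_ids = batch            # sample image, sample text, reference description
    with torch.no_grad():                             # the visual-language model stays frozen
        visual_feat = clip_model.encode_image(image)
    p_v = mlp_mapper(visual_feat)                     # sample visual cue vector
    enc = lang_model.encode(p_v, text_ids)            # sample coding vector [C_I; C_A]
    p_mu = entity_prompt_builder(enc, text_ids)       # sample entity prompt vector
    prompt = torch.cat([p_v, p_mu], dim=1)            # sample multi-modal prompt vector
    # decoder logits assumed aligned with the reference lemmas (teacher forcing)
    logits = lang_model.decode(prompt, enc, reference_ids)
    loss = F.cross_entropy(logits.flatten(0, 1), reference_ids.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```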
In the above embodiment, the sample service image is converted into the visual cue vector by using the large-scale visual-language pre-training model, the entity prompt vector is constructed by combining the sample service text, and the pre-training language model is then fine-tuned with a prompt learning mechanism using the multi-modal prompt vector constructed from both the image and the text. In this way, entity-level representations in the service text and the service image can be unified, joint learning of the two pre-training models is realized, and the data processing model of the embodiment of the present disclosure is obtained.
In an exemplary embodiment, in order to improve the efficiency of the fine-tuning, the visual-language pre-training model may be fixed so that it does not participate in parameter updating, and only the mapping network (i.e., the initial multilayer perception network) and the pre-training language model are fine-tuned together. Accordingly, adjusting the model parameters based on the difference between the prediction description text and the reference description text may include:
determining a loss value based on a difference between the prediction description text and the reference description text;
and fixing the model parameters of the vision-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
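A minimal sketch of this freezing scheme is shown below, continuing the hypothetical module names used in the earlier training sketch; the optimizer choice and learning rate are assumed placeholders:

```python
import torch

def build_optimizer(clip_model, mlp_mapper, lang_model, lr=1e-5):
    """Freeze the visual-language model; tune only the mapping network and
    the pre-training language model (illustrative sketch)."""
    for param in clip_model.parameters():
        param.requires_grad_(False)              # fixed: no parameter updates
    trainable = list(mlp_mapper.parameters()) + list(lang_model.parameters())
    return torch.optim.AdamW(trainable, lr=lr)   # lr is an assumed placeholder
```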
The loss value may be obtained based on a preset loss function, where the preset loss function may be a cross-entropy-based unidirectional language modeling loss. Specifically, the preset loss function ℒ may be expressed as follows:

ℒ = −Σ_{t=k+1}^{L} log p_θ(y_t | P, y_{τ<t}, x_A, x_I)

where x_A represents the sample service text; x_I represents the sample business image; P represents the multi-modal prompt vector, P = [p_v; p_μ] = [p_1, …, p_k]; y_{τ<t} represents the lemmas predicted before time step t, y_{τ<t} = y_1, …, y_{t−1}; p_θ(·) represents the likelihood function; and L denotes the sum of k and the number of lemmas included in the reference description text.
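The summation starting at t = k + 1 means the loss is accumulated only over positions after the k prompt vectors. A minimal sketch of this computation is given below, assuming the decoder logits over [P; y] are already aligned with the reference lemmas under teacher forcing; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def prompt_lm_loss(logits, reference_ids, prompt_len):
    """Unidirectional LM loss summed over positions after the k prompt vectors.

    logits:        (batch, k + T, vocab) decoder outputs over [P; y]
    reference_ids: (batch, T) lemma ids of the reference description text
    prompt_len:    k, the length of the multi-modal prompt vector
    """
    text_logits = logits[:, prompt_len:, :]        # keep positions t = k+1 .. L
    return F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        reference_ids.reshape(-1))
```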
In an exemplary embodiment, determining a sample entity prompt vector based on the sample visual encoding vector and a sample entity encoding vector corresponding to each of the named entities in the sample text encoding vector may include:
determining a sample correlation degree between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form a sample entity prompt vector.
For a specific determination manner of the sample correlation degree, reference may be made to the determination manner of the correlation degree in the foregoing embodiment shown in fig. 2 in the embodiment of the present disclosure, and details are not repeated here.
By the implementation mode, the sample entity prompt vector can reflect the content of the sample image more accurately, and the training effect of the model is improved.
In an exemplary embodiment, selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form the sample entity prompt vector may include:
determining a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
determining a sample dependency degree between the key sample entity encoding vector and each residual sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
determining a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and splicing the key sample entity coding vector and the target residual sample entity coding vector as target sample entity coding vectors to obtain a sample entity prompt vector.
For a specific determination manner of the sample dependency degree and the target residual sample entity encoding vector, reference may be made to the related description in the method shown in fig. 3 in the foregoing embodiment of the present disclosure, which is not described herein again.
According to the embodiment, the important entities are screened and aggregated based on the context information, and the entity-level context prompt vector is generated, so that the entity accuracy of the model prediction result is improved.
In an exemplary embodiment, determining a sample dependency between the key sample entity encoding vector and each remaining sample entity encoding vector comprises:
taking the key sample entity coding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual sample entity coding vector into the bidirectional long-short term memory network to obtain a state vector corresponding to the residual sample entity coding vector;
performing normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
In the above embodiment, the key sample entity encoding vector is used as the initial hidden state of the bidirectional LSTM, so that the potential dependency relationship between the key sample entity encoding vector and other sample entity encoding vectors can be modeled based on the context information, and the accuracy of determining the dependency relationship is improved.
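As a non-limiting illustration, the dependency scoring described above may be sketched as follows; the scoring head and dimensions are assumptions, and only the use of the key entity coding vector as the initial hidden state of the bidirectional LSTM follows the description:

```python
import torch
import torch.nn as nn

class EntityDependencyScorer(nn.Module):
    """Scores the dependency of each remaining entity coding vector on the key
    entity coding vector (illustrative sketch)."""
    def __init__(self, dim=768):
        super().__init__()
        # bidirectional LSTM whose initial hidden state is the key entity vector
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, key_entity, remaining_entities):
        # key_entity: (batch, dim); remaining_entities: (batch, n, dim)
        h0 = key_entity.unsqueeze(0).repeat(2, 1, 1)   # both directions start from C_key
        c0 = torch.zeros_like(h0)
        states, _ = self.bilstm(remaining_entities, (h0, c0))
        # normalization over the remaining entities yields the dependency degrees
        return torch.softmax(self.score(states).squeeze(-1), dim=-1)

# entities whose dependency degree exceeds the preset threshold are kept and
# spliced with the key entity coding vector to form the entity prompt vector.
```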
In order to more clearly illustrate the technical solution of the embodiment of the present disclosure, the fine-tuning process for obtaining the data processing model of the embodiment of the present disclosure is described below with reference to FIG. 6, taking a news service as an example in which the visual-language pre-training model is CLIP and the pre-training language model is BART.
As shown in FIG. 6, for a given paired data set of sample news texts and sample news images, where x_A and x_I respectively represent a sample news text and the corresponding sample news image, the sample news image x_I is input into the CLIP module of the data processing model, and the image encoder of CLIP performs feature extraction on the news image to obtain visual features. The visual features are input into the initial multilayer perception network (MLP, Multilayer Perceptron) of the data processing model, which maps them to the input space of BART to obtain the visual cue vector, denoted p_v.
The visual cue vector p_v and the sample news text x_A are taken as the input sequence of the BART encoder, denoted [p_v; x_A]. This sequence is processed by the BART encoder through the bidirectional attention mechanism, and a sample coding vector C is obtained based on the hidden-layer state vectors of the last encoder layer. C comprises a visual coding vector C_I corresponding to p_v and a sample text coding vector C_A corresponding to x_A, as shown in (a) and (b) of FIG. 6.
The sample coding vector C is input into the context entity prompt construction module for sequence modeling to obtain the entity prompt vector p_μ, and p_v and p_μ are spliced to obtain the multi-modal prompt vector P, denoted P = [p_v; p_μ] = [p_1, …, p_k].
The context entity prompt construction module automatically learns the association between the visual prompt and the text representation from the latent semantic space, constructs a global entity prompt vector sequence, and guides the generation of entity lemmas and non-entity lemmas at each decoding step. In a specific implementation, all named entities E in the sample news text are tagged with the spaCy toolkit, and the hidden state vectors C_E (C_E ∈ C_A) corresponding to the entity lemmas are extracted from the sample text coding vector C_A. The visual coding vector C_I is then weighted-averaged through a linear layer φ, and the relevance score of each entity lemma is calculated based on the averaged result, so as to obtain a relevance score set S_E, as shown in (c) of FIG. 6; each element in S_E represents the importance score of a named entity. During training, the entity coding vector with the highest score is taken as the key entity coding vector C_key through an argmax operation (denoted C_key in FIG. 6). Then, as shown in (d) of FIG. 6, C_key is used as the initial hidden state of the bidirectional LSTM model to model the latent dependency relationship between the key entity and the other entities (i.e., the scores in FIG. 6), and the target residual entity coding vectors are obtained by combining a preset dependency degree threshold η. C_key is spliced with the target residual entity coding vectors to construct the entity prompt vector p_μ.
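A non-limiting sketch of the relevance scoring and key-entity selection just described is given below; the exact form of the weighted average behind the linear layer φ is an assumption for illustration:

```python
import torch
import torch.nn as nn

def build_entity_scores(C_I, C_E, phi):
    """Relevance scores of entity lemmas against the visual coding vector.

    C_I: (batch, m, dim) visual coding vectors corresponding to p_v
    C_E: (batch, n, dim) hidden states of entity lemmas taken from C_A
    phi: nn.Linear(dim, dim), the linear layer used over the visual vectors
    """
    v = phi(C_I).mean(dim=1, keepdim=True)           # weighted average of C_I (illustrative)
    scores = torch.softmax((C_E * v).sum(-1), -1)    # relevance score set S_E
    key_index = scores.argmax(dim=-1)                # index of the key entity C_key
    return scores, key_index
```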
The multi-modal prompt vector P is added to the input sequence of the BART decoder, denoted [P; y], which is then input to the BART decoder; the decoder attends to C through the cross-attention mechanism and to P within [P; y] through the self-attention mechanism, and generates the prediction description text by autoregressive iteration (that is, the lemmas of the prediction description text are obtained in sequence). The attention layer of the BART decoder can be expressed as:
Attn(y, P) = softmax( Q [K_P; K]^T / √d_H ) [V_P; V]

where Q, K and V are obtained by mapping y; K_P and V_P are obtained by mapping P; [;] represents splicing; and d_H represents the feature dimension of the self-attention mechanism.
If P_idx represents the index numbers corresponding to the multi-modal prompt vector P in the whole decoder input sequence, and |P_idx| represents the length of P, the probability distribution output at time step t is:

p_θ(y_t | y_{τ<t}, P) = softmax( W · h_t )

where W represents the matrix that maps a vector to a logit distribution over the vocabulary, and h_t is the decoder hidden state at the t-th position of the part of the sequence located after the multi-modal prompt vector (i.e., positions [|P_idx|:]).
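A minimal sketch of how the per-step distribution can be read off after the prompt positions is shown below; the tensor layout and the name of the vocabulary projection are assumptions for illustration:

```python
import torch
import torch.nn as nn

def step_distribution(decoder_hidden, lm_head, prompt_len):
    """Probability distribution of the current output lemma.

    decoder_hidden: (batch, prompt_len + t, dim) hidden states over [P; y_<t]
    lm_head:        nn.Linear(dim, vocab_size) mapping to logits over the vocabulary
    prompt_len:     |P_idx|, the length of the multi-modal prompt vector
    """
    h_t = decoder_hidden[:, prompt_len:, :][:, -1, :]   # last position after the prompt
    return torch.softmax(lm_head(h_t), dim=-1)
```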
After the prediction description text is obtained, a cross-entropy loss is obtained based on the difference between the prediction description text and the corresponding reference description text. When the model parameters are adjusted, the CLIP module is frozen, and the parameters of the initial MLP and BART are adjusted based on the cross-entropy loss until the cross-entropy loss reaches a preset minimum value, at which point the training ends and the data processing model is obtained. The data processing model comprises the vision-language pre-training model, the fine-tuned MLP (i.e., the multilayer perception network), the fine-tuned BART (i.e., the target language model) and the context entity prompt construction module.
Fig. 7 is a block diagram illustrating a structure of a data processing apparatus according to an exemplary embodiment. Referring to fig. 7, the data processing apparatus 700 includes:
a data obtaining unit 710 configured to perform obtaining of a to-be-processed service image and a service text associated with the to-be-processed service image; the service text comprises named entities;
a visual cue vector determination unit 720, configured to perform obtaining a visual cue vector based on the extracted visual features of the to-be-processed service image;
the encoding unit 730 is configured to perform encoding processing on the visual cue vector and the service text to obtain an encoding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
an entity hint vector determination unit 740 configured to perform determining an entity hint vector based on the visual coded vector and an entity coded vector corresponding to each of the named entities in the text coded vector;
a decoding unit 750 configured to perform decoding processing on the coding vector based on the visual cue vector and the entity cue vector, so as to obtain a description text corresponding to the service image to be processed.
In an exemplary embodiment, the entity hint vector determination unit 740 includes:
a first degree of correlation determination unit configured to perform determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and the entity prompt vector determining subunit is configured to select a target entity code vector from the entity code vectors based on the correlation degree to form an entity prompt vector.
In an exemplary embodiment, the entity hint vector determination subunit includes:
the first key entity determining unit is configured to determine an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
a first dependency degree determining unit configured to perform determining a dependency degree between the key entity encoding vector and each remaining entity encoding vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
a first determining unit configured to perform determining a target remaining entity encoding vector based on the degree of dependency; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and the first constructing subunit is configured to splice the key entity code vector and the target residual entity code vector as target entity code vectors to obtain an entity prompt vector.
In an exemplary embodiment, the first dependency degree determining unit includes:
a first initialization unit configured to perform taking the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a first state vector determining unit configured to perform input of the residual entity code vectors into the bidirectional long-short term memory network, resulting in state vectors corresponding to the residual entity code vectors;
the first normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependency between the key entity encoding vector and the remaining entity encoding vector.
In an exemplary embodiment, the decoding unit 750 includes:
the multi-mode prompt vector determining unit is configured to perform splicing on the visual prompt vector and the entity prompt vector to obtain a multi-mode prompt vector;
and the decoding subunit is configured to perform autoregressive decoding processing based on the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the to-be-processed service image.
In an exemplary embodiment, the visual cue vector determining unit 720 includes:
the visual feature extraction unit is configured to input the to-be-processed business image into a visual-language pre-training model for feature extraction to obtain extracted visual features;
a first mapping unit configured to perform mapping of the visual features to an input space of a target language model based on a multi-layer perceptual network, resulting in a visual cue vector;
wherein the target language model is used for the encoding process and the decoding process.
In an exemplary embodiment, the apparatus further comprises a training unit, the training unit comprising:
the sample acquisition unit is configured to acquire a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity;
the sample visual cue vector determining unit is configured to perform extracting visual features of the sample business image based on a visual-language pre-training model, and mapping the visual features of the sample business image to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
the sample coding unit is configured to perform coding processing on the sample visual cue vector and the sample service text input to a coder of a pre-training language model to obtain a sample coding vector; the sample coding vectors comprise a sample visual vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
a sample entity prompt vector determination unit configured to perform a sample entity prompt vector determination based on the sample visual coding vector and a sample entity coding vector corresponding to each of the named entities in the sample text coding vector;
the sample multi-modal prompt vector determining unit is configured to perform splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal prompt vector, and inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and the parameter adjusting unit is configured to perform adjustment of model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and training is finished, so that the data processing model is obtained.
In an exemplary embodiment, the parameter adjusting unit includes:
a loss determination unit configured to perform determining a loss value based on a difference between the prediction description text and the reference description text;
a parameter adjusting subunit configured to perform fixing the model parameters of the visual-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
In an exemplary embodiment, the sample entity hint vector determination unit includes:
a second degree of correlation determination unit configured to perform determining a degree of correlation between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and the sample entity prompt vector determining subunit is configured to select a target sample entity code vector from the sample entity code vectors based on the sample correlation degree to form a sample entity prompt vector.
In one exemplary embodiment, the sample entity hint vector determination subunit includes:
the second key entity determining unit is configured to determine a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
a second dependency level determination unit configured to perform determining a sample dependency level between the key sample entity encoding vector and each remaining sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
a second determination unit configured to perform a determination of a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and the second construction subunit is configured to perform splicing on the key sample entity encoding vector and the target residual sample entity encoding vector as target sample entity encoding vectors to obtain a sample entity prompt vector.
In an exemplary embodiment, the second dependency degree determining unit includes:
a second initialization unit configured to perform taking the key sample entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
a second state vector determining unit configured to perform input of the residual sample entity coded vector into the bidirectional long-short term memory network, resulting in a state vector corresponding to the residual sample entity coded vector;
the second normalization unit is configured to perform normalization processing based on the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, there is also provided an electronic device comprising a processor; a memory for storing processor-executable instructions; when the processor is configured to execute the instructions stored in the memory, any one of the data processing methods provided by the embodiments of the present disclosure is implemented.
The electronic device may be a terminal, a server, or a similar computing device. Taking a server as an example, fig. 8 is a block diagram of an electronic device for data processing according to an exemplary embodiment. As shown in fig. 8, the server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 810 (the processor 810 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 830 for storing data, and one or more storage media 820 (e.g., one or more mass storage devices) for storing an application program 823 or data 822. The memory 830 and the storage medium 820 may be transient or persistent storage. The program stored in the storage medium 820 may include one or more modules, each of which may include a series of instruction operations in the server. Further, the central processor 810 may be configured to communicate with the storage medium 820 to execute the series of instruction operations in the storage medium 820 on the server 800. The server 800 may also include one or more power supplies 860, one or more wired or wireless network interfaces 850, one or more input-output interfaces 840, and/or one or more operating systems 821, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so forth.
The input-output interface 840 may be used to receive or transmit data via a network. A specific example of the network may include a wireless network provided by a communication provider of the server 800. In one example, the input-output interface 840 includes a network interface controller (NIC) that may be connected to other network devices via a base station so as to communicate with the Internet. In another example, the input-output interface 840 may be a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 8 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 800 may also include more or fewer components than shown in FIG. 8, or have a different configuration than shown in FIG. 8.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 830 comprising instructions, executable by the processor 810 of the apparatus 800 to perform the method described above is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is further provided, which includes a computer program/instruction, and when executed by a processor, the computer program/instruction implements any one of the data processing methods provided by the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1. A data processing method, comprising:
acquiring a service image to be processed and a service text associated with the service image to be processed; the service text comprises named entities;
obtaining a visual prompt vector based on the extracted visual features of the service image to be processed;
coding the visual cue vector and the service text to obtain a coding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and decoding the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
2. The method of claim 1, wherein the determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each of the named entities in the text coding vector comprises:
determining a degree of correlation between the visual code vector and each entity code vector in the text code vector;
and selecting a target entity coding vector from the entity coding vectors based on the correlation degree to form an entity prompt vector.
3. The method of claim 2, wherein the selecting a target entity-encoding vector from the entity-encoding vectors based on the degree of correlation to construct an entity hint vector comprises:
determining an entity coding vector corresponding to the maximum correlation degree to obtain a key entity coding vector;
determining the degree of dependence between the key entity code vector and each remaining entity code vector; the residual entity encoding vector refers to an entity encoding vector except the key entity encoding vector;
determining a target remaining entity encoding vector based on the degree of dependence; the dependency degree corresponding to the target residual entity coding vector is larger than a preset dependency degree threshold value;
and splicing the key entity coding vector and the target residual entity coding vector as target entity coding vectors to obtain an entity prompt vector.
4. The method of claim 3, wherein determining a degree of dependency between the key entity-encoded vector and each remaining entity-encoded vector comprises:
taking the key entity encoding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual entity coding vectors into the bidirectional long-short term memory network to obtain state vectors corresponding to the residual entity coding vectors;
performing normalization processing on the basis of the state vector corresponding to the residual entity coding vector to obtain a normalization result corresponding to the residual entity coding vector; the normalization result characterizes a degree of dependency between the key entity encoding vector and the remaining entity encoding vector.
5. The method of claim 1, wherein the decoding the encoded vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed comprises:
splicing the visual cue vector and the entity cue vector to obtain a multi-mode cue vector;
and performing autoregressive decoding processing on the basis of the multi-modal prompt vector and the coding vector to obtain a description text corresponding to the service image to be processed.
6. The method according to claim 1, wherein the deriving a visual cue vector based on the extracted visual features of the to-be-processed service image comprises:
inputting the business image to be processed into a vision-language pre-training model for feature extraction to obtain extracted visual features;
mapping the visual features to an input space of a target language model based on a multilayer perception network to obtain visual prompt vectors;
wherein the target language model is used for the encoding process and the decoding process.
7. The method according to any one of claims 1 to 6, wherein the method is implemented based on a data processing model, the method further comprising the step of training the data processing model:
acquiring a sample service image text pair and a corresponding reference description text; the sample service image text pair comprises a sample service image and a sample service text related to the sample service image, wherein the sample service text comprises a named entity;
the visual features of the sample business image are extracted based on a visual-language pre-training model, and the visual features of the sample business image are mapped to an input space of the pre-training language model based on an initial multilayer perception network to obtain a sample visual cue vector;
inputting the sample visual cue vector and the sample service text into an encoder of a pre-training language model for encoding processing to obtain a sample encoding vector; the sample coding vectors comprise a sample visual vector corresponding to the sample visual cue vector and a sample text coding vector corresponding to the sample business text;
determining a sample entity prompt vector based on the sample visual coding vector and a sample entity coding vector corresponding to each named entity in the sample text coding vector;
splicing the sample visual cue vector and the sample entity cue vector to obtain a sample multi-modal cue vector; inputting the sample multi-modal prompt vector and the sample coding vector into a decoder of the pre-training language model for decoding processing to obtain a prediction description text;
and adjusting model parameters based on the difference between the prediction description text and the reference description text until a preset training end condition is reached and the training is finished to obtain the data processing model.
8. The method of claim 7, wherein the adjusting model parameters based on the difference between the predictive description text and the reference description text comprises:
determining a loss value based on a difference between the prediction description text and the reference description text;
and fixing the model parameters of the vision-language pre-training model unchanged, and adjusting the parameters of the initial multilayer perception network and the pre-training language model based on the loss value.
9. The method of claim 7, wherein determining a sample entity hint vector based on the sample visual coding vector and a sample entity coding vector corresponding to each of the named entities in the sample text coding vector comprises:
determining a sample correlation degree between the sample visual coding vector and each sample entity coding vector in the sample text coding vector;
and selecting a target sample entity coding vector from the sample entity coding vectors based on the sample correlation degree to form a sample entity prompt vector.
10. The method of claim 9, wherein the selecting a target sample entity code vector from the sample entity code vectors based on the sample correlation degree to form a sample entity hint vector comprises:
determining a sample entity coding vector corresponding to the maximum sample correlation degree to obtain a key sample entity coding vector;
determining a sample dependency degree between the key sample entity encoding vector and each residual sample entity encoding vector; the residual sample entity encoding vector refers to a sample entity encoding vector other than the key sample entity encoding vector;
determining a target residual sample entity encoding vector based on the sample dependency; the sample dependency degree corresponding to the target residual sample entity encoding vector is larger than a preset dependency degree threshold value;
and splicing the key sample entity coding vector and the target residual sample entity coding vector as target sample entity coding vectors to obtain a sample entity prompt vector.
11. The method of claim 10, wherein determining the sample dependency between the key sample entity encoding vector and each remaining sample entity encoding vector comprises:
taking the key sample entity coding vector as an initial hidden state of a bidirectional long-short term memory network;
inputting the residual sample entity coding vector into the bidirectional long-short term memory network to obtain a state vector corresponding to the residual sample entity coding vector;
performing normalization processing on the basis of the state vector corresponding to the residual sample entity coding vector to obtain a sample normalization result corresponding to the residual sample entity coding vector; the sample normalization result characterizes a sample dependency between the key sample entity encoding vector and the residual sample entity encoding vector.
12. A data processing apparatus, comprising:
the data acquisition unit is configured to acquire a to-be-processed service image and a service text associated with the to-be-processed service image; the service text comprises named entities;
the visual cue vector determining unit is configured to obtain a visual cue vector based on the extracted visual features of the to-be-processed service image;
the coding unit is configured to perform coding processing on the visual cue vector and the service text to obtain a coding vector; the coding vector comprises a visual coding vector corresponding to the visual prompt vector and a text coding vector corresponding to the service text;
an entity prompt vector determination unit configured to perform determining an entity prompt vector based on the visual coding vector and an entity coding vector corresponding to each named entity in the text coding vector;
and the decoding unit is configured to perform decoding processing on the coding vector based on the visual cue vector and the entity cue vector to obtain a description text corresponding to the service image to be processed.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data processing method of any one of claims 1 to 11.
14. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of any of claims 1 to 11.
15. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the data processing method of any of claims 1 to 11.
CN202210901461.0A 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium Pending CN115393849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210901461.0A CN115393849A (en) 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210901461.0A CN115393849A (en) 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115393849A true CN115393849A (en) 2022-11-25

Family

ID=84116138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210901461.0A Pending CN115393849A (en) 2022-07-28 2022-07-28 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115393849A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894089A (en) * 2023-08-11 2023-10-17 腾讯科技(深圳)有限公司 Digest generation method, digest generation device, digest generation apparatus, digest generation program, and digest generation program
CN116894089B (en) * 2023-08-11 2023-12-15 腾讯科技(深圳)有限公司 Digest generation method, digest generation device, digest generation apparatus, digest generation program, and digest generation program


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination