CN116543170A - Image processing method, device, equipment and storage medium


Info

Publication number
CN116543170A
Authority
CN
China
Prior art keywords
image
target
features
target object
object attribute
Prior art date
Legal status
Pending
Application number
CN202310595246.7A
Other languages
Chinese (zh)
Inventor
马兰
施耀一
李振
张少雄
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202310595246.7A
Publication of CN116543170A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image processing method, apparatus, device, and storage medium, which can be applied to the technical fields of computer vision and natural language processing. The method includes: extracting image features of a target image, wherein the target image is related to a target object; performing object attribute detection on the target object in the target image to obtain object attribute features corresponding to the target object; fusing the image features and the object attribute features according to an attention mechanism to obtain fusion features; and determining, from the fusion features, descriptive text suitable for characterizing the behavior of the target object.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision and natural language processing technology, and in particular, to an image processing method, apparatus, electronic device, computer readable storage medium, and computer program product.
Background
With the continuing development of image processing technology, image processing is widely applied in technical fields such as computer vision and natural language processing. Image processing here refers to generating natural-language sentences for a specified picture that precisely describe the content covered by the picture.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following problem: existing methods use image features trained by a model while ignoring the explicit high-level semantic concepts of the image, so that it is difficult for a computer to understand the image effectively, and the generated text is consequently of low quality, monotonous in content, and poorly interpretable.
Disclosure of Invention
In view of the above, the present disclosure provides an image processing method, apparatus, electronic device, computer readable medium, and computer program product.
According to one aspect of the present disclosure, there is provided an image processing method including: extracting image features of a target image, wherein the target image is related to a target object; performing object attribute detection on the target object in the target image to obtain object attribute features corresponding to the target object; fusing the image features and the object attribute features according to an attention mechanism to obtain fusion features; and determining, from the fusion features, descriptive text suitable for characterizing the behavior of the target object.
According to an embodiment of the present disclosure, the above image processing method further includes: and (3) carrying out centering treatment on the initial image to obtain the target image.
According to an embodiment of the present disclosure, the detecting an object attribute of the target object in the target image to obtain an object attribute feature corresponding to the target object includes: inputting the target image into an object attribute detection network, and outputting a visual word corresponding to the object attribute of the target object; and encoding the visual words to obtain the object attribute characteristics.
According to an embodiment of the present disclosure, the visual word includes at least one of: visual position words representing the position of the target object, visual gesture words representing the gesture of the target object, and visual size words representing the size of the target object.
According to an embodiment of the present disclosure, the fusing the image feature and the object attribute feature according to an attention mechanism to obtain a fused feature includes: determining query characteristics according to the image characteristics; determining key characteristics and value characteristics according to the object attribute characteristics; the query feature, the key feature, and the value feature are input to an attention network, and the fusion feature is output.
According to an embodiment of the present disclosure, determining the descriptive text suitable for characterizing the behavior of the target object according to the fusion feature includes: and inputting the fusion characteristics into a text prediction network, and outputting the descriptive text.
According to an embodiment of the present disclosure, the above text prediction network includes at least one of: a long short-term memory network, a recurrent neural network, a bidirectional long short-term memory network, and a gated recurrent neural network.
According to an embodiment of the present disclosure, the above image processing method further includes: determining a service prompt message according to the description text; and sending the service prompt message to the target client.
Another aspect of the present disclosure provides a behavior description text generating apparatus, including: the first feature extraction module is used for extracting image features of a target image, wherein the target image is related to a target object; the second feature extraction module is used for detecting object attributes of the target object in the target image to obtain object attribute features corresponding to the target object; the attention mechanism fusion module is used for determining query characteristics according to the image characteristics, determining key characteristics and value characteristics according to the object attribute characteristics, and obtaining fusion characteristics according to the query characteristics, the key characteristics and the value characteristics; and the behavior description text generation module is used for determining a description text suitable for representing the behavior of the target object according to the fusion characteristics.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method.
Another aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method.
Another aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the embodiments of the present disclosure, image features of a target image related to a target object are extracted, and object attribute detection is then performed on the target object in the target image to obtain object attribute features corresponding to the target object. The image features and the object attribute features are fused according to an attention mechanism, so that the resulting fusion features focus on the image regions in the target image that are related to the object attribute features and can fully represent the attributes of the target object. Descriptive text suitable for characterizing the behavior of the target object is determined from the fusion features, so that the technical problem of the low correlation between the original image features and the text prediction is at least partially solved.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an application scenario diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an image processing method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a block diagram of an attention network according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of a behavior description text generation apparatus according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of an electronic device adapted to implement an image processing method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the data involved (including, but not limited to, users' personal information) all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
An image description is typically a natural sentence generated for a specified target image to precisely describe the content covered by the target image. In recent years, deep learning based models have demonstrated excellent performance in many different tasks, including machine translation, image recognition, and the like.
However, the inventors found that many models adopt attention mechanisms so that the model pays more attention to the relevant regions of the image when generating words. If the image features input at each moment are weighted by the attention mechanism, the most original image features cannot enter the subsequent text generation process; yet locating a part of the image requires the complete features, and if the complete image features are not received, the part locating of the image is wrong. Once the generated attention weights are inaccurate, the subsequent text generation process receives a wrong image region, which leads to inaccuracy of the output words.
An embodiment of the present disclosure provides an image processing method, including: extracting image features of a target image, wherein the target image is related to a target object; object attribute detection is carried out on a target object in the target image, and object attribute characteristics corresponding to the target object are obtained; fusing the image characteristics and the object attribute characteristics according to an attention mechanism to obtain fused characteristics; from the fusion features, descriptive text suitable for characterizing the behavior of the target object is determined.
According to the embodiments of the present disclosure, image features of a target image related to a target object are extracted, and object attribute detection is performed on the target object in the target image to obtain object attribute features corresponding to the target object. The image features and the object attribute features related to the object attributes are fused according to an attention mechanism, so that the resulting fusion features can focus on the image regions in the target image that are related to the object attribute features; the fusion features can therefore fully represent the attributes of the target object. Descriptive text suitable for characterizing the behavior of the target object is then determined from the fusion features, and the features of the target image related to the object attributes are accurately located during generation of the descriptive text. This can at least partially overcome the technical problem in the related art that the correlation between the original image features and the text prediction is low, and can avoid the problem that the accuracy of the output text is low because a wrong image region is detected during text generation. It also enhances the interpretability of the image features, so that the obtained descriptive text captures the behavior of a customer in a business hall more accurately; as a result, the customer's intention to handle business can be predicted more accurately, accurate services can be provided for the customer, and the service quality and service efficiency of places such as bank business halls are improved.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
In the technical solution of the present disclosure, the processes of collecting, storing, using, processing, transmitting, providing, disclosing, and applying users' personal information all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.
Fig. 1 schematically illustrates an application scenario diagram of an image processing method according to an embodiment of the present disclosure. It should be noted that fig. 1 illustrates only an example of an application scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but it does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments, or scenarios.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the image processing method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the behavior description text generating apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The image processing method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the behavior description text generating apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the image processing method of this embodiment includes operations S210 to S240.
In operation S210, image features of a target image are extracted, wherein the target image is related to a target object.
According to the embodiment of the present disclosure, the target object may be a customer, a pet, or the like in a place such as a bank business hall, etc., and the embodiment of the present disclosure does not limit a specific type of the target object, and a person skilled in the art may select according to actual needs.
According to the embodiments of the present disclosure, the image features of the target image may be extracted based on a neural network algorithm, for example, the image features of the target image may be extracted based on a convolutional neural network, but not limited thereto, and the image features may be extracted based on other types of neural network algorithms, and the specific algorithm type of extracting the image features is not limited in the embodiments of the present disclosure.
In an exemplary embodiment, the neural network algorithm for extracting the image features of the target image may be a residual network (ResNet) or a multilayer perceptron (MLP).
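To make this step concrete, the following is a minimal sketch of grid-feature extraction with a pretrained residual network. The use of torchvision's ResNet-50 with its classification head removed, and the 7x7 feature grid, are illustrative assumptions rather than requirements of the disclosure.

```python
# Illustrative sketch only: the backbone, its weights, and the output layout
# are assumptions; the disclosure only requires some feature extraction network.
import torch
import torchvision.models as models

def extract_image_features(target_image: torch.Tensor) -> torch.Tensor:
    """Extract grid features of a centered target image of shape (B, 3, 224, 224)."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    # Drop the average pooling and classification layers, keep the conv stages.
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
    feature_extractor.eval()
    with torch.no_grad():
        feature_map = feature_extractor(target_image)   # (B, 2048, 7, 7)
    # One 2048-dimensional feature vector per spatial cell: (B, 49, 2048).
    return feature_map.flatten(2).transpose(1, 2)
```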
In operation S220, object attribute detection is performed on the target object in the target image, so as to obtain an object attribute feature corresponding to the target object.
According to embodiments of the present disclosure, an object attribute may be information characterizing an action, identity, location, etc. of a target object. For example, the identity attribute may be child, young, elderly, etc.
According to the embodiment of the present disclosure, the object attribute detection may be performed on the object in the target image based on a neural network algorithm, for example, the object attribute detection may be performed on the object in the target image based on a convolutional neural network, but not limited thereto, and the object attribute detection may be performed on the object in the target image based on other types of neural network algorithms, and the specific algorithm type of the object attribute detection performed on the object in the target image is not limited.
In an exemplary embodiment, the neural network algorithm for performing object attribute detection on the target object in the target image may be a residual network (ResNet) or a multilayer perceptron (MLP).
In operation S230, the image features and the object attribute features are fused according to the attention mechanism, resulting in fused features.
According to the embodiment of the present disclosure, the image feature and the object attribute feature may be fused based on an attention network algorithm to obtain a fused feature, for example, an attention network based on a multi-head attention layer may be constructed to fuse the image feature and the object attribute feature to obtain a fused feature, but not limited thereto, and the image feature and the object attribute feature may be fused based on other types of attention networks to obtain a fused feature.
According to the embodiment of the disclosure, the image features and the object attribute features are fused according to the attention mechanism, so that the obtained fusion features are focused on the image area related to the object attribute features in the target image, and the fusion features can fully represent the object attributes in the target object.
In operation S240, descriptive text suitable for characterizing the behavior of the target object is determined from the fusion features.
According to embodiments of the present disclosure, the fusion features may be processed based on a neural network algorithm to determine descriptive text suitable for characterizing the behavior of the target object, and the embodiments of the present disclosure do not limit the specific neural network algorithm type to determine descriptive text suitable for characterizing the behavior of the target object.
According to the embodiments of the present disclosure, the features of the target image related to the object attributes are accurately located during generation of the descriptive text. This can at least partially overcome the technical problem that the correlation between the original image features and the text prediction is low, and can avoid the problem that the accuracy of the output text is low because a wrong image region is detected during text generation. It also enhances the interpretability of the image features, so that the obtained descriptive text captures the behavior of a customer in a business hall more accurately; therefore, the customer's intention to handle business can be predicted more accurately, accurate services can be provided for the customer, and the service quality and service efficiency of places such as bank business halls are improved.
According to an embodiment of the present disclosure, the image processing method further includes performing a centering process on the initial image to obtain a target image.
According to embodiments of the present disclosure, the initial image may include an image acquired by an image acquisition device, which may include a camera, or the like. In the case where the image capturing device is a camera having a video capturing function, at least one initial image may be determined from images captured by one or more cameras.
According to embodiments of the present disclosure, the initial image may also be cropped, for example, to a 224×224 size, prior to the centering process.
According to an embodiment of the present disclosure, the centering process may be to subtract a preset value from a pixel value of each point in the initial image, respectively, to obtain a pixel value of the target image.
In an exemplary embodiment, the pixel value (x', y', z') of each point of the target image is obtained by subtracting (104, 116, 122) from the pixel value (x, y, z) of the corresponding point in the initial image. The pixel values of the target image can be expressed by formula (1):

(x', y', z') = (x, y, z) - (104, 116, 122)    (1)
According to the embodiments of the present disclosure, centering the initial image allows the obtained target image to accelerate the convergence of the network.
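A minimal sketch of this preprocessing is shown below; the channel order and the use of resizing rather than cropping are assumptions of the sketch, while the subtracted constants (104, 116, 122) follow the exemplary embodiment above.

```python
# Sketch of the centering step: resize the initial image to 224x224 and
# subtract (104, 116, 122) per channel (channel order assumed RGB here).
import numpy as np
from PIL import Image

def center_initial_image(path: str) -> np.ndarray:
    image = Image.open(path).convert("RGB").resize((224, 224))
    pixels = np.asarray(image, dtype=np.float32)                  # (224, 224, 3)
    mean = np.array([104.0, 116.0, 122.0], dtype=np.float32)
    return pixels - mean                                          # centered target image
```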
According to an embodiment of the present disclosure, performing object attribute detection on a target object in a target image, obtaining an object attribute feature corresponding to the target object includes: inputting the target image into an object attribute detection network, and outputting a visual word corresponding to the object attribute of the target object; and encoding the visual words to obtain object attribute characteristics.
According to embodiments of the present disclosure, the object attribute detection network may be a detection network based on information of attributes of actions, identities, locations, etc. of the target object.
According to the embodiment of the disclosure, after the visual word is output, the word segmentation operation is performed on the visual word by using a word segmentation tool. For example, the visual word may be segmented by a segmentation tool such as a jieba segmentation tool, but not limited thereto, and the visual word may be segmented based on other types of segmentation tools, and the specific type of the segmentation tool is not limited by the embodiments of the present disclosure.
In an exemplary embodiment, the visual words are segmented with the jieba segmentation tool, the symbols in the visual words are replaced with spaces, a start token "<start>" and an end token "<end>" are added before and after all visual words respectively, unknown visual words are represented by "<unk>", and all visual word sequences are padded with "<end>" up to the maximum word length.
According to the embodiments of the present disclosure, the visual words are encoded, for example, an encoder such as a one-hot (one-hot) encoder, but not limited thereto, and the visual words may be encoded based on other types of encoders, and the specific types of encoders are not limited by the embodiments of the present disclosure.
According to embodiments of the present disclosure, one-hot encoding is the process of converting a categorical variable into a form that can be better used by a computer algorithm for prediction, i.e., converting a visual word into a form that can be processed by the text prediction network.
In an exemplary embodiment, the visual word is encoded as a one-hot vector, the corresponding index position of the one-hot vector is set to 1, the other positions are 0, and the position of 1 in the vector corresponds to the sequence number of the visual word in the object attribute feature.
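The following sketch puts the above steps together: jieba segmentation, the <start>/<end>/<unk> handling, and one-hot encoding. The vocabulary dictionary and the maximum word length are hypothetical inputs, not values fixed by the disclosure.

```python
# Illustrative sketch: turn a detected visual-word string into a one-hot
# object attribute feature matrix. `vocab` maps tokens (including "<unk>")
# to indices and is an assumed, externally built vocabulary.
import re
import numpy as np
import jieba

def encode_visual_words(sentence: str, vocab: dict, max_len: int) -> np.ndarray:
    cleaned = re.sub(r"[^\w\s]", " ", sentence)                   # symbols -> spaces
    tokens = [t for t in jieba.cut(cleaned) if t.strip()]
    tokens = ["<start>"] + tokens + ["<end>"]
    tokens += ["<end>"] * (max_len - len(tokens))                 # pad to maximum word length
    one_hot = np.zeros((max_len, len(vocab)), dtype=np.float32)
    for i, token in enumerate(tokens[:max_len]):
        one_hot[i, vocab.get(token, vocab["<unk>"])] = 1.0        # unknown words -> <unk>
    return one_hot
```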
According to an embodiment of the present disclosure, the visual word includes at least one of: visual position words characterizing the position of the target object, visual pose words characterizing the pose of the target object, visual size words characterizing the size of the target object.
According to embodiments of the present disclosure, the visual position words characterizing the position of the target object may be, for example, the customer's position at the entrance of the bank lobby, at the queue-number machine, at a counter, or at a waiting seat in the bank lobby.
According to embodiments of the present disclosure, the visual gesture word characterizing the gesture of the target object may be walking, running, sitting, standing, squatting, bending.
According to the embodiment of the disclosure, when the target object is an old person, a child, or a pet, the visual size word representing the size of the target object is adjusted according to the target object.
According to embodiments of the present disclosure, the visual words may further include visual gender words characterizing the gender of the target object, which may be male, female, or other; when the target object is a pet, the visual gender word is other.
According to an embodiment of the present disclosure, fusing image features and object attribute features according to an attention mechanism, the obtaining a fused feature includes: determining query features according to the image features; determining key features and value features according to the object attribute features; the query feature, key feature and value feature are input to the attention network, and the fusion feature is output.
Fig. 3 schematically illustrates a block diagram of an attention network according to an embodiment of the present disclosure.
As shown in fig. 3, the attention network includes a multi-head attention layer and a fully connected network layer. The query feature may be determined from the image features, with the key feature and the value feature determined from the object attribute features; alternatively, the query feature may be determined from the object attribute features, with the key feature and the value feature determined from the image features. The query feature, the key feature, and the value feature are input to the attention network constructed on the basis of the multi-head attention layer.
According to an embodiment of the present disclosure, the query feature is Q, the key feature is K, and the value feature is V. The attention mechanism (multi-head attention) in the multi-head attention layer projects Q, K and V through h different linear transforms and finally concatenates the different attention results together. The multi-head attention layer can be represented by the following formulas (2) and (3):

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O    (2)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (3)

where Q, K and V are the input vectors. The multi-head attention layer divides the model into a plurality of heads to form a plurality of subspaces, so that the model attends to information of different aspects; finally, the information of the plurality of heads (head_1, …, head_h) is merged as Concat(head_1, …, head_h) W^O to obtain the output result.

In the self-attention mechanism, Q, K and V are taken to be the same: Q, K and V may all be the image features at the same time, or may all be the object attribute features at the same time. The self-attention mechanism can be represented by the following formula (4):

attention_output = Attention(Q, K, V)    (4)

In addition, the self-attention mechanism is calculated with the scaled dot-product attention mechanism, which is represented by the following formula (5), where d_k is the dimension of the key vectors:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (5)

The multi-head attention layer repeats the scaled dot-product attention process h times, once per head, which is represented by the following formula (6):

head_i = softmax(Q W_i^Q (K W_i^K)^T / sqrt(d_k)) V W_i^V    (6)

Finally, the values calculated by the above formula (6) are concatenated.

The fully connected network layer can be represented by formula (7):

FCN(I) = max(0, I W_f + b_f) W_ff + b_ff    (7)

where max(0, ·) is the activation function, W_f and W_ff are learnable matrices, and b_f and b_ff are bias terms.

According to an embodiment of the present disclosure, the attention network uses the features of one modality as a guide to integrate the features of the current modality:

I_N = FCN(MutAtt(S, I))    (8)

S_N = MutAtt(I_N, S)    (9)

When N = 0, I_0 and S_0 represent the initial image features and the initial object attribute features, respectively; repeating the same process N times yields the final outputs I_N and S_N of the two stacks. First, the object attribute features S are used as the query to find the most relevant visual regions in the image features I, generating the image features I_N related to the object attribute features. Then, with the image features I_N as the query, the most relevant object attribute features are further searched, and irrelevant visual words in the object attribute features S are filtered out. The attention mechanism is executed iteratively between the image features and the object attribute features, so that the visual receptive field gradually concentrates on salient visual regions, and the original object attribute features are gradually fused to summarize the corresponding visual regions.
According to the embodiment of the disclosure, the image characteristics and the object attribute characteristics are fused by using the attention mechanism, so that the association between the image characteristics and the object attribute characteristics is realized, and the output description text can accurately describe the behavior of the target object.
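For illustration, the following is a minimal PyTorch sketch of the stacked attention fusion described by formulas (2) to (9). The module names (MutAtt, AttentionFusion), the feature width of 512, the 8 heads, and the 3 stacked blocks are assumptions of the sketch and are not fixed by the disclosure.

```python
# Sketch of the attention fusion: each MutAtt block is multi-head attention
# followed by the fully connected layer of formula (7); image and attribute
# features guide each other alternately as in formulas (8) and (9).
import torch
import torch.nn as nn

class MutAtt(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fcn = nn.Sequential(nn.Linear(dim, 2048), nn.ReLU(), nn.Linear(2048, dim))

    def forward(self, query, key_value):
        fused, _ = self.attn(query, key_value, key_value)   # Q from one modality, K/V from the other
        return self.fcn(fused)

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, n_blocks: int = 3):
        super().__init__()
        self.img_blocks = nn.ModuleList([MutAtt(dim, heads) for _ in range(n_blocks)])
        self.attr_blocks = nn.ModuleList([MutAtt(dim, heads) for _ in range(n_blocks)])

    def forward(self, image_feat, attr_feat):
        I, S = image_feat, attr_feat                        # I_0, S_0
        for img_blk, attr_blk in zip(self.img_blocks, self.attr_blocks):
            I = img_blk(S, I)                               # formula (8): query S, key/value I
            S = attr_blk(I, S)                              # formula (9): query I_N, key/value S
        return I, S                                         # fusion features I_N, S_N
```

Note that in this sketch both directions pass through the fully connected layer, whereas formula (9) applies MutAtt alone; this simplification is an assumption of the sketch.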
According to an embodiment of the present disclosure, determining descriptive text suitable for characterizing a behavior of a target object from a fusion feature includes: and inputting the fusion characteristics into a text prediction network, and outputting descriptive text.
According to an embodiment of the disclosure, the fusion features are decoded using a text prediction network, and descriptive text characterizing the target object is output.
According to the embodiments of the present disclosure, feature information can be captured while the descriptive text is output, and the vanishing-gradient problem of the standard recurrent neural network is alleviated. Meanwhile, the method can recognize customer behaviors in places such as a bank lobby, automatically generate corresponding descriptive text for those behaviors, provide accurate services for customers, understand customers' requirements, and improve the service quality and service efficiency of places such as bank lobbies.
According to an embodiment of the present disclosure, a text prediction network includes at least one of: a long short-term memory network, a recurrent neural network, a bidirectional long short-term memory network, and a gated recurrent neural network.
According to an embodiment of the present disclosure, a recurrent neural network (RNN) takes sequence data as input, performs recursion in the evolution direction of the sequence, and connects all recurrent units in a chain; recurrent neural networks are widely applied to mining the timing information and semantic information of sequence data.
According to embodiments of the present disclosure, the long short-term memory network (LSTM) is an improved recurrent neural network that defines an internal memory cell state to store long-term information. The memory cell state interacts with the previous output and the subsequent input to determine which elements of the internal state vector should be updated, maintained, or deleted.
According to the embodiments of the present disclosure, a bidirectional long short-term memory network (Bi-LSTM) consists of two independent LSTMs. The input sequence is fed into the two LSTMs in forward order and reverse order respectively for feature extraction, and the word vector formed by concatenating the two output vectors (i.e., the extracted feature vectors) is used as the final feature representation of the word. In a Bi-LSTM model, the feature data obtained at time t contains information from both the past and the future, and the efficiency and performance of this structure for text feature extraction are superior to those of a single LSTM.
According to an embodiment of the present disclosure, a gated recurrent neural network is designed to better capture dependencies across large time-step distances in a time series; it controls the flow of information through gates that can be learned.
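As a concrete illustration of the decoding step, the sketch below shows an LSTM-based text prediction network that greedily generates a word sequence from the fusion features. The vocabulary, hidden size, pooling of the fusion features, and the greedy decoding loop are assumptions of this sketch, not details fixed by the disclosure.

```python
# Sketch of a text prediction network: an LSTM decoder initialised from the
# pooled fusion features, generating one word id per step until <end>.
import torch
import torch.nn as nn

class TextPredictionNetwork(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def generate(self, fusion_feature, start_id: int, end_id: int, max_len: int = 30):
        h = self.init_proj(fusion_feature.mean(dim=1))      # pool (B, L, D) -> (B, hidden)
        c = torch.zeros_like(h)
        token = torch.full((fusion_feature.size(0),), start_id, dtype=torch.long)
        words = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(token), (h, c))
            token = self.out(h).argmax(dim=-1)              # greedy choice of the next word
            words.append(token)
            if (token == end_id).all():
                break
        return torch.stack(words, dim=1)                    # word ids of the descriptive text
```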
In an exemplary embodiment, the residual network, the attention network, and the long short-term memory network may form an integrated network. After the initial image is centered to obtain the target image, the weights of the integrated network are initialized according to a uniform distribution, as represented by the following formula (10):

W ~ U(-0.01, 0.01)    (10)

where W represents a weight of the integrated network and U(a, b) represents a uniform distribution, i.e., the weights of the integrated network obey a uniform distribution from -0.01 to 0.01.
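A sketch of this initialisation is shown below, under the assumption that the integrated network is a single nn.Module combining the three sub-networks.

```python
# Initialise every learnable parameter of the integrated network from U(-0.01, 0.01),
# as in formula (10). `integrated_network` is a hypothetical combined nn.Module.
import torch.nn as nn

def init_uniform(module: nn.Module) -> None:
    for param in module.parameters():
        nn.init.uniform_(param, -0.01, 0.01)

# usage: init_uniform(integrated_network)
```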
According to an embodiment of the present disclosure, the image processing method further includes: determining a service prompt message according to the description text; and sending a service prompt message to the target client.
According to the embodiments of the present disclosure, the service prompt message may include the descriptive text suitable for characterizing the behavior of a customer in a place such as a bank business hall; by analyzing the descriptive text in the service prompt message, the target client can learn the customer's actual requirement in time, such as a consultation requirement for a business problem.
According to the embodiment of the disclosure, the target client may be a service robot, a mobile phone, a computer or the like in a place such as a bank business hall, and the embodiment of the disclosure does not limit the specific type of the target client.
Fig. 4 schematically illustrates a schematic diagram of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 4, after the target image is acquired, the target image is input to a first feature extraction network, and image features of the target image are extracted by using the first feature extraction network, wherein the target image is related to the target object. And simultaneously, inputting the target image into a second feature extraction network, and outputting object attribute features corresponding to the target object.
According to an embodiment of the disclosure, a query feature is determined from image features, key features and value features are determined from object attribute features, the query feature, key features and value features are input to an attention network, and fusion features are output. And inputting the fusion characteristics into a text prediction network.
According to an embodiment of the present disclosure, the text prediction network determines, from the fusion features, the descriptive text suitable for characterizing the behavior of the target object.
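Tying the pieces together, the sketch below wires the helper sketches above into the pipeline of fig. 4. All names, dimensions, the placeholder visual-word string, and the untrained projection layers are illustrative assumptions; in practice the attribute detector's real output and trained weights would be used.

```python
# End-to-end sketch of fig. 4 using the hypothetical helpers sketched earlier
# (center_initial_image, extract_image_features, encode_visual_words,
# AttentionFusion, TextPredictionNetwork).
import torch
import torch.nn as nn

def describe_target_image(path: str, vocab: dict, start_id: int, end_id: int):
    target = torch.from_numpy(center_initial_image(path)).permute(2, 0, 1).unsqueeze(0)
    image_feat = extract_image_features(target)                        # (1, 49, 2048)
    visual_words = "老人 站立 门口"                                      # placeholder attribute-detector output
    attr_feat = torch.from_numpy(
        encode_visual_words(visual_words, vocab, max_len=16)).unsqueeze(0)
    # Project both modalities to the common width assumed by AttentionFusion.
    image_feat = nn.Linear(image_feat.size(-1), 512)(image_feat)
    attr_feat = nn.Linear(attr_feat.size(-1), 512)(attr_feat)
    fused_image_feat, _ = AttentionFusion()(image_feat, attr_feat)
    decoder = TextPredictionNetwork(len(vocab))
    return decoder.generate(fused_image_feat, start_id, end_id)        # ids of the descriptive text
```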
Based on the image processing method, the disclosure also provides a behavior description text generation device. The device will be described in detail below in connection with fig. 5.
Fig. 5 schematically shows a block diagram of a behavior description text generating apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the behavior description text generation apparatus 500 of this embodiment includes a first feature extraction module 510, a second feature extraction module 520, an attention mechanism fusion module 530, and a behavior description text generation module 540.
The first feature extraction module 510 is configured to extract image features of a target image, where the target image is related to a target object. In an embodiment, the first feature extraction module 510 may be configured to perform the operation S210 described above, which is not described herein.
The second feature extraction module 520 is configured to perform object attribute detection on a target object in the target image, so as to obtain an object attribute feature corresponding to the target object. In an embodiment, the second feature extraction module 520 may be used to perform the operation S220 described above, which is not described herein.
The attention mechanism fusion module 530 is configured to determine query features according to image features, determine key features and value features according to object attribute features, and obtain fusion features according to the query features, the key features and the value features. In an embodiment, the attention mechanism fusing module 530 may be used to perform the operation S230 described above, which is not described herein.
The behavior description text generation module 540 is configured to determine a description text suitable for characterizing the behavior of the target object according to the fusion feature. In an embodiment, the behavior description text generation module 540 may be used to perform the operation S240 described above, which is not described herein.
Any of the first feature extraction module 510, the second feature extraction module 520, the attention mechanism fusion module 530, and the behavior description text generation module 540 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules, according to embodiments of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first feature extraction module 510, the second feature extraction module 520, the attention mechanism fusion module 530, and the behavior description text generation module 540 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the first feature extraction module 510, the second feature extraction module 520, the attention mechanism fusion module 530, and the behavior description text generation module 540 may be at least partially implemented as a computer program module that, when executed, may perform the corresponding functions.
It should be noted that, in the embodiment of the present disclosure, the behavior description text generating device 500 corresponds to the image processing method in the embodiment of the present disclosure, and the description of the behavior description text generating device 500 specifically refers to the image processing method and is not repeated herein.
Fig. 6 schematically shows a block diagram of an electronic device adapted to implement an image processing method according to an embodiment of the disclosure. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 601 may also include on-board memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM602, and the RAM603 are connected to each other through a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM602 and/or the RAM 603. Note that the program may be stored in one or more memories other than the ROM602 and the RAM 603. The processor 601 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 600 may also include an input/output (I/O) interface 605, the input/output (I/O) interface 605 also being connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to an input/output (I/O) interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to an input/output (I/O) interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM602 and/or RAM603 and/or one or more memories other than ROM602 and RAM603 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 601. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of signals over a network medium, and downloaded and installed via the communication section 609, and/or installed from the removable medium 611. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 601. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, such as Java, c++, python, "C" or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined in a variety of combinations and/or sub-combinations, even if such combinations or sub-combinations are not explicitly recited in the present disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or sub-combined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the respective embodiments cannot be used advantageously in combination. The scope of the present disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the present disclosure, and all such alternatives and modifications are intended to fall within the scope of the present disclosure.

Claims (12)

1. An image processing method, comprising:
extracting image features of a target image, wherein the target image is related to a target object;
performing object attribute detection on a target object in the target image to obtain object attribute features corresponding to the target object;
fusing the image features and the object attribute features according to an attention mechanism to obtain fused features;
and determining, according to the fused features, a descriptive text suitable for characterizing a behavior of the target object.
2. The method of claim 1, further comprising:
performing centering processing on an initial image to obtain the target image.
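By way of illustration only, a minimal sketch of one possible reading of the centering processing in claim 2, namely a centre crop of the initial image to a fixed size; the use of PIL, the crop size, and the assumption that the initial image is at least that large are all assumptions not taken from the disclosure:

```python
from PIL import Image

def center_crop(initial_image: Image.Image, size: int = 224) -> Image.Image:
    """Illustrative centering step: crop the initial image around its centre to a fixed size."""
    w, h = initial_image.size           # assumes w >= size and h >= size
    left, top = (w - size) // 2, (h - size) // 2
    return initial_image.crop((left, top, left + size, top + size))
```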
3. The method of claim 1, wherein performing object attribute detection on the target object in the target image to obtain the object attribute features corresponding to the target object comprises:
inputting the target image into an object attribute detection network, and outputting a visual word corresponding to an object attribute of the target object;
and encoding the visual word to obtain the object attribute features.
4. The method of claim 3, wherein the visual word comprises at least one of:
a visual position word characterizing a position of the target object, a visual pose word characterizing a pose of the target object, and a visual size word characterizing a size of the target object.
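Purely as an illustration of claims 3 and 4, a sketch in which visual words describing the target object's position, pose and size are mapped to object attribute features by a learned embedding; the vocabulary, the feature dimension, and the use of PyTorch are assumptions, and the object attribute detection network itself is left unspecified:

```python
import torch
import torch.nn as nn

# Hypothetical visual-word vocabulary covering position, pose and size of the target object.
VISUAL_WORDS = ["left", "center", "right", "standing", "sitting", "walking", "small", "medium", "large"]
WORD_TO_ID = {w: i for i, w in enumerate(VISUAL_WORDS)}

class VisualWordEncoder(nn.Module):
    """Encodes detected visual words into object attribute features."""
    def __init__(self, vocab_size=len(VISUAL_WORDS), attr_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, attr_dim)

    def forward(self, words):
        # words: list of samples, each a list of visual words of equal length,
        # e.g. [["left", "walking", "medium"]] as produced by some attribute detector
        ids = torch.tensor([[WORD_TO_ID[w] for w in sample] for sample in words])
        return self.embedding(ids)          # (batch, num_words, attr_dim)
```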
5. The method of claim 1, wherein fusing the image features and the object attribute features according to the attention mechanism to obtain the fused features comprises:
determining query features according to the image features;
determining key features and value features according to the object attribute features;
and inputting the query features, the key features and the value features into an attention network, and outputting the fused features.
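As an illustrative sketch of the fusion in claim 5, cross-attention in which the query is projected from the image features and the key and value from the object attribute features; the layer names, dimensions, number of heads, and reliance on PyTorch's MultiheadAttention are all assumptions rather than details of the disclosure:

```python
import torch
import torch.nn as nn

class AttributeFusion(nn.Module):
    """Cross-attention fusion: query from image features, key/value from attribute features."""
    def __init__(self, img_dim=2048, attr_dim=512, fused_dim=512, num_heads=8):
        super().__init__()
        self.to_query = nn.Linear(img_dim, fused_dim)    # image features -> query features
        self.to_key = nn.Linear(attr_dim, fused_dim)     # attribute features -> key features
        self.to_value = nn.Linear(attr_dim, fused_dim)   # attribute features -> value features
        self.attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)

    def forward(self, image_feats, attr_feats):
        # image_feats: (batch, num_regions, img_dim); attr_feats: (batch, num_attrs, attr_dim)
        q = self.to_query(image_feats)
        k = self.to_key(attr_feats)
        v = self.to_value(attr_feats)
        fused, _ = self.attn(q, k, v)    # (batch, num_regions, fused_dim)
        return fused
```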
6. The method of claim 1, wherein determining the descriptive text suitable for characterizing the behavior of the target object according to the fused features comprises:
inputting the fused features into a text prediction network, and outputting the descriptive text.
7. The method of claim 6, wherein the text prediction network comprises at least one of:
a long short-term memory network, a recurrent neural network, a bidirectional long short-term memory network, and a gated recurrent neural network.
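For illustration of claims 6 and 7, a sketch of a text prediction network built around an LSTM (one of the options listed in claim 7) that decodes the fused features into a token sequence; greedy decoding, mean pooling of the fused features, the vocabulary size, and the omission of an end-of-sequence stop are all assumptions of this sketch:

```python
import torch
import torch.nn as nn

class LSTMCaptionDecoder(nn.Module):
    """Greedy LSTM decoder mapping fused features to a descriptive token sequence."""
    def __init__(self, fused_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(fused_dim, hidden_dim)
        self.init_c = nn.Linear(fused_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feats, max_len=20, bos_id=1):
        # fused_feats: (batch, num_regions, fused_dim); pooled to initialise the LSTM state
        pooled = fused_feats.mean(dim=1)
        h = self.init_h(pooled).unsqueeze(0)    # (1, batch, hidden_dim)
        c = self.init_c(pooled).unsqueeze(0)
        token = torch.full((fused_feats.size(0), 1), bos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):
            emb = self.word_embed(token)                  # (batch, 1, hidden_dim)
            out, (h, c) = self.lstm(emb, (h, c))
            token = self.out(out).argmax(dim=-1)          # greedy next-token choice
            tokens.append(token)
        return torch.cat(tokens, dim=1)                   # (batch, max_len) token ids
```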
8. The method of any of claims 1 to 7, further comprising:
determining a service prompt message according to the descriptive text; and
sending the service prompt message to a target client.
9. A behavior description text generation apparatus, comprising:
a first feature extraction module configured to extract image features of a target image, wherein the target image is related to a target object;
a second feature extraction module configured to perform object attribute detection on the target object in the target image to obtain object attribute features corresponding to the target object;
an attention mechanism fusion module configured to determine query features according to the image features, determine key features and value features according to the object attribute features, and obtain fused features according to the query features, the key features and the value features; and
a behavior description text generation module configured to determine, according to the fused features, a descriptive text suitable for characterizing a behavior of the target object.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 8.
11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202310595246.7A 2023-05-24 2023-05-24 Image processing method, device, equipment and storage medium Pending CN116543170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310595246.7A CN116543170A (en) 2023-05-24 2023-05-24 Image processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310595246.7A CN116543170A (en) 2023-05-24 2023-05-24 Image processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116543170A true CN116543170A (en) 2023-08-04

Family

ID=87452304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310595246.7A Pending CN116543170A (en) 2023-05-24 2023-05-24 Image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116543170A (en)

Similar Documents

Publication Publication Date Title
CN108776787B (en) Image processing method and device, electronic device and storage medium
US11775574B2 (en) Method and apparatus for visual question answering, computer device and medium
US11392792B2 (en) Method and apparatus for generating vehicle damage information
CN111259215A (en) Multi-modal-based topic classification method, device, equipment and storage medium
CN113868497A (en) Data classification method and device and storage medium
WO2023178930A1 (en) Image recognition method and apparatus, training method and apparatus, system, and storage medium
CN116720004A (en) Recommendation reason generation method, device, equipment and storage medium
CN110377733A (en) A kind of text based Emotion identification method, terminal device and medium
CN107291774B (en) Error sample identification method and device
CN112766284A (en) Image recognition method and device, storage medium and electronic equipment
Archilles et al. Vision: a web service for face recognition using convolutional network
CN117093687A (en) Question answering method and device, electronic equipment and storage medium
US11669583B2 (en) Generating interactive screenshot based on a static screenshot
CN115525781A (en) Multi-mode false information detection method, device and equipment
CN115909357A (en) Target identification method based on artificial intelligence, model training method and device
CN116451700A (en) Target sentence generation method, device, equipment and storage medium
CN116012612A (en) Content detection method and system
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium
CN116543170A (en) Image processing method, device, equipment and storage medium
US10910014B2 (en) Method and apparatus for generating video
CN115510457A (en) Data identification method, device, equipment and computer program product
CN117392379B (en) Method and device for detecting target
CN117873897A (en) Test method, test device, test equipment and storage medium
US20240127616A1 (en) Connecting vision and language using fourier transform
Xiao et al. LIDA‐YOLO: An unsupervised low‐illumination object detection based on domain adaptation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination