CN116758362A - Image processing method, device, computer equipment and storage medium

Info

Publication number
CN116758362A
Authority
CN
China
Prior art keywords
feature vector
decoding
text
target image
image
Prior art date
Legal status
Pending
Application number
CN202310854778.8A
Other languages
Chinese (zh)
Inventor
Name withheld at the applicant's request
Current Assignee
Beijing Real AI Technology Co Ltd
Original Assignee
Beijing Real AI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Real AI Technology Co Ltd filed Critical Beijing Real AI Technology Co Ltd
Priority to CN202310854778.8A
Publication of CN116758362A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application disclose an image processing method and apparatus, a computer device, and a storage medium. The method comprises the following steps: extracting features from an image to obtain an image semantic feature vector, and extracting features from the text description corresponding to the image to obtain a text feature vector; decoding a fusion feature vector to obtain a decoding feature vector, where the fusion feature vector is obtained by fusing the image semantic feature vector and the text feature vector; obtaining a recognition result corresponding to the image from the decoding feature vector; and performing structuring processing on the image based on the recognition result. The method provided by this application is applicable to a variety of scenes, and for newly added scene types or target objects the model does not need to be retrained, which effectively improves development efficiency and, in turn, the efficiency of data structuring.

Description

Image processing method, device, computer equipment and storage medium
Technical Field
The embodiments of this application relate to the technical field of data transmission, and in particular to an image processing method, an image processing apparatus, a computer device and a storage medium.
Background
Data structuring refers to organizing and managing data according to certain rules and modes, so as to facilitate storage, retrieval, analysis, processing and the like. Data structuring is an important component of modern information technology, which can help users better understand and utilize data, and increase the value and benefit of the data.
In the course of researching and practicing the prior art, the inventors of the embodiments of this application found that existing methods for structuring computer vision (CV) data usually require developing dedicated platform software, displaying the CV data on that software once it has been developed, and classifying the elements of the CV data with a pre-trained artificial intelligence model.
However, a pre-trained artificial intelligence model requires a large amount of labeled data for training, so the data currently used to train such a model must be labeled for the training to be effective. If some or all of the objects in an image are not labeled, the pre-trained model cannot effectively structure unlabeled objects of the same type in newly input images during the subsequent data structuring stage. Given that actual service requirements change continuously, this structuring approach can neither label larger amounts of data quickly enough nor guarantee that no objects in an image are missed, so it cannot adapt rapidly to changing service requirements, and overall the processing efficiency of data structuring is low.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, computer equipment and a storage medium, which can improve the processing efficiency of data structuring.
In a first aspect, an embodiment of the present application provides an image processing method, including: acquiring data to be processed, wherein the data to be processed comprises at least one target image and text descriptions corresponding to each target image, and the text descriptions comprise at least one of object text descriptions and scene text descriptions of target objects in the target image; extracting features of the target image to obtain an image semantic feature vector, and extracting features of text description corresponding to the target image to obtain a text feature vector; fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector; obtaining a recognition result corresponding to the text description according to a decoding feature vector obtained by decoding the fusion feature vector; and carrying out structuring treatment on the target image based on the identification result to obtain structured data.
In a second aspect, an embodiment of the present application provides an image processing apparatus including: an input-output unit and a processing unit; the input/output unit is used for acquiring data to be processed, wherein the data to be processed comprises at least one target image and text descriptions corresponding to each target image, and the text descriptions comprise at least one of object text descriptions and scene text descriptions of target objects in the target image; the processing unit is used for carrying out feature extraction on the target image to obtain an image semantic feature vector, and carrying out feature extraction on a text description corresponding to the target image to obtain a text feature vector; fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector; obtaining a recognition result corresponding to the text description according to a decoding feature vector obtained by decoding the fusion feature vector; and carrying out structuring treatment on the target image based on the identification result to obtain structured data.
In a third aspect, an embodiment of the present application provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the processor implements a method as described above when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement a method as described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising program instructions which, when run on a computer or a processor, cause the computer or the processor to perform the method as described above.
In a sixth aspect, an embodiment of the present application provides a chip system, including: a communication interface for inputting and/or outputting information; a processor for executing a computer-executable program to cause a device on which the chip system is installed to perform the method as described above.
Compared with the prior art, in the solution provided by the embodiments of this application, feature extraction is performed on the target image and on the text description corresponding to the target image to obtain an image semantic feature vector and a text feature vector respectively; the fusion feature vector obtained by fusing the image semantic feature vector and the text feature vector is decoded to obtain a decoding feature vector, from which a recognition result corresponding to the text description is obtained; and the target image is subjected to structuring processing based on the recognition result to obtain structured data. Because the method recognizes the target image on the basis of the target image together with its text description, no labeled data is needed to train a model, and when a newly added type of target image is processed the model does not need to be retrained: structuring of the newly added type can be achieved simply by adding the corresponding text description. This avoids the problem of methods that structure images with a model trained on labeled data, which can neither label larger amounts of data quickly nor guarantee that no objects in an image are missed, and whose data structuring efficiency is therefore low. The image processing method provided by this application can adapt rapidly to continuously changing service requirements and effectively improves the efficiency of data structuring.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of scene classification of a target image according to an embodiment of the present application;
FIG. 2 is a schematic diagram of object detection on an object image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of identifying a target image according to an embodiment of the present application;
FIG. 4 is a schematic diagram of facial recognition according to an embodiment of the present application;
FIG. 5 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 6 is a flow chart of a scene classification according to an embodiment of the present application;
FIG. 7 is a flow chart of object detection according to an embodiment of the present application;
FIG. 8 is a flowchart of a scene classification and object detection combination provided in an embodiment of the present application;
FIG. 9 is a flow chart of face recognition according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 11 is a block diagram of a terminal device according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a server structure according to an embodiment of the present application.
Detailed Description
The terms first, second and the like in the description and in the claims of embodiments of the application and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein.
With the rapid development and application of artificial intelligence technology, the propagation paths and production modes of computer vision data are becoming increasingly diverse, which poses great challenges for the management of computer vision data. There are a number of methods for structuring computer vision data. The computer vision data may be image data, including pictures, videos, and the like. Data structuring generally refers to expressing data logically in a two-dimensional table structure that follows data format and length specifications, and storing and managing it mainly through a relational database. Structuring data helps a search engine understand the information on a web page, recognize and classify the page and judge its relevance, and also allows the search engine to provide richer search result summaries and information tailored to a user's specific query, making queries more convenient for the user.
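Purely as an illustration (the table name and field names below are hypothetical and not part of this disclosure), structured data for one image could be a fixed-field record stored in such a relational table:

```python
# Hypothetical illustration of "structured" computer vision data: the recognition
# results for one image are flattened into fixed fields of a relational table row.
image_record = {
    "image_id": "img_0001",                            # identifier of the target image
    "scene_class": "sports field",                     # scene classification result
    "objects": "athlete;basketball;basketball stand",  # detected target objects
    "source": "web_crawl",                             # where the image came from
}

# A matching (hypothetical) relational schema, so records can be stored and queried:
CREATE_TABLE_SQL = """
CREATE TABLE image_index (
    image_id    VARCHAR(64) PRIMARY KEY,
    scene_class VARCHAR(64),
    objects     TEXT,
    source      VARCHAR(128)
);
"""
```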
Specifically, the process of structuring computer vision data is a process of identifying elements in computer vision data to archive the computer vision data. A brief description of a common data structuring process is provided below.
The first method is as follows: corresponding platform software is developed, the computer vision data is displayed on that platform software, and the elements of the computer vision data are identified and classified manually. However, manually identifying and classifying elements of computer vision data is inefficient; with the development of information technology, the speed at which elements of computer vision data can be identified manually falls far short of the speed at which the data is generated.
In contrast, this application realizes structuring processing of the target image based on the target image and the text description of the objects in the target image; the process does not require manually identifying the elements of the target objects, so the efficiency of data structuring processing can be effectively improved.
The second method is as follows: a pre-trained artificial intelligence (AI) model is combined with professional platform software; the computer vision data is displayed on the platform software, the AI model automatically performs batch element identification on the computer vision data, and the data is classified and archived according to the identified elements.
Compared with the first method, the second method improves the efficiency of data structuring processing to some extent, but under normal conditions it is only suitable for certain specific fields or scenarios and its overall generalization is poor. Meanwhile, training the AI model requires a large amount of labeled data; if some or all of the objects in an image are not labeled, the pre-trained model cannot effectively structure unlabeled objects of the same type in newly input images during the subsequent data structuring stage. Given that actual service requirements change continuously, this structuring approach can neither label larger amounts of data quickly enough nor guarantee that no objects in an image are missed, so it cannot adapt rapidly to changing service requirements, and overall the processing efficiency of data structuring is low.
Meanwhile, the whole iterative training process requires a long time and considerable labor cost, and for a newly added scene the second method must retrain the model before it can be applied to that scene, which further lowers the processing efficiency of data structuring.
In this application, based on the target image and the text description of the target image, the image semantic feature vector corresponding to the target image and the text feature vector corresponding to the text description are fused to obtain a fusion feature vector. Based on the fusion feature vector, a recognition result corresponding to the target image is obtained, and according to the recognition result the target image can be subjected to structuring processing to obtain structured data. The application thus realizes structuring processing of the target image based on the target image and the text description of its objects; the process does not depend on labeling large amounts of data, and when a newly added type of target image is processed the model does not need to be retrained, so the method can adapt rapidly to continuously changing business requirements and effectively improves the efficiency of data structuring processing.
Specifically, the image processing method provided by this application can be applied to scene classification of a target image, detection of target objects in a target image, and face recognition of target objects, and can be implemented based on an image-text multimodal large model. The application scenarios corresponding to the image processing method provided by this application are described below, based on the image-text multimodal large model, with reference to FIGS. 1-4.
Scene one: in the embodiments of this application, the target image can be recognized based on the target image and its corresponding scene text description, the scene corresponding to the target image is identified, and a scene classification result is obtained. The target image and the scene text description can also be matched against images in an image library to identify the scene corresponding to the target image. As shown in FIG. 1, this application provides a schematic diagram of classifying the scene of a target image. The target image and the scene text description are input into the image-text multimodal large model so as to recognize the input target image. The scene text description can be, for example, "a basketball court with a basketball stand on it". Based on the recognition result, the scene corresponding to the target image is classified as a sports field, which in turn realizes the structuring processing of the image.
Scene two: in the embodiments of this application, the target image can be recognized based on the acquired target image and the object text descriptions corresponding to it, so as to obtain the target detection result of the target image, that is, to identify the target objects contained in the target image, where the target objects correspond to the object text descriptions. FIG. 2 is a schematic diagram of object detection on a target image according to an embodiment of the present application. For example, for a target image containing target objects such as a player and a basketball stand, the electronic device can mark the player, the basketball and the basketball stand in the target image based on the target image and the corresponding object text descriptions, and thereby identify the target objects contained in the target image.
Scene three: in the embodiments of this application, the target image can be recognized based on the acquired target image together with its corresponding scene text description and object text descriptions, so as to obtain both the scene classification and the target detection result for the target image. As shown in FIG. 3, this application provides a schematic diagram of recognizing a target image. For example, for the target image in FIG. 3, which contains target objects such as athletes and a basketball stand, the electronic device can determine, based on the target image and its scene text description, that the scene classification of the target image is a basketball court, and can identify, based on the target image and the object text descriptions, that the target objects in the image include the athletes, the basketball stand, and so on.
Scene four: in the embodiments of this application, the target image can be recognized based on the acquired target image together with its corresponding scene text description and object text descriptions, and the scene classification corresponding to the target image and the target objects in the target image can be output at the same time. On this basis, the electronic device can further recognize the target objects with a face recognition algorithm. FIG. 4 is a schematic diagram of facial recognition according to an embodiment of the present application. For example, while obtaining the scene classification corresponding to the target image and the target objects it contains, the electronic device can identify the athletes in the target image based on a face recognition algorithm and determine their identity information.
It should be noted that the face recognition algorithm in this application may be a human face recognition algorithm or an animal face recognition algorithm. During the face recognition process, face recognition of a user is performed lawfully through the face recognition algorithm and only with the user's authorization.
Specifically, the embodiment of the application provides an image processing method, a related device and a storage medium, and an execution subject of the image processing method may be the image processing device provided by the embodiment of the application, or a computer device integrated with the image processing device, where the image processing device may be implemented in a hardware or software manner, and the computer device may be a terminal or a server.
When the computer device is a server, the server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like.
When the computer device is a terminal, the terminal may include, but is not limited to, devices with multimedia data processing functions (e.g., video data playback and music data playback functions), such as smart phones, tablet computers, notebook computers, desktop computers, smart televisions, smart speakers, personal digital assistants (PDAs), and smart watches.
The scheme of the embodiment of the application can be realized based on an artificial intelligence technology, and particularly relates to the fields of computer vision technology in the artificial intelligence technology, databases in cloud technology and the like, and the technical fields are respectively described below.
Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields and involves both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer vision (CV) is the science of studying how to make machines "see"; more specifically, it means using cameras and computers instead of human eyes to recognize, track and measure targets, and further performing graphic processing so that the result is an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, model robustness detection, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric techniques such as fingerprint recognition.
With research and advancement of artificial intelligence technology, research and application of artificial intelligence technology is being developed in various fields, such as common smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, smart customer service, etc., and it is believed that with the development of technology, artificial intelligence technology will be applied in more fields and with increasing importance value.
Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. It is the general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied under the cloud computing business model, and it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, image websites and other portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, every item may in the future have its own identification mark, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong supporting back-end systems, which can only be realized through cloud computing. In the embodiments of this application, the recognition results can be stored through cloud technology.
The Database (Database), which can be considered as an electronic filing cabinet, is a place for storing electronic files, and users can perform operations such as adding, inquiring, updating, deleting and the like on the data in the files. A "database" is a collection of data stored together in a manner that can be shared with multiple users, with as little redundancy as possible, independent of the application.
A database management system (DBMS) is a computer software system designed for managing databases, and generally provides basic functions such as storage, retrieval, security and backup. Database management systems can be classified according to the database model they support, e.g., relational or XML (Extensible Markup Language); according to the type of computer supported, e.g., server cluster or mobile phone; according to the query language used, e.g., SQL (Structured Query Language) or XQuery; according to the emphasis of performance impact, e.g., maximum scale or maximum operating speed; or according to other classification schemes. Regardless of the classification used, some DBMSs can span categories, for example by supporting multiple query languages simultaneously. In the embodiments of this application, the recognition results can be stored in a database management system so that the server can conveniently retrieve them.
It should be specifically noted that the service terminal in the embodiments of this application may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a wireless modem, such as a mobile telephone (or "cellular" telephone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile device that exchanges voice and/or data with a radio access network. Examples include personal communication service (PCS) telephones, cordless telephones, Session Initiation Protocol (SIP) phones, wireless local loop (WLL) stations, and personal digital assistants (PDAs).
An image processing method provided by an embodiment of the present application is described in detail below with reference to fig. 5, and fig. 5 is a flowchart of an image processing method provided by an embodiment of the present application, including S101 to S105.
S101: and obtaining data to be processed.
The data to be processed comprises at least one target image and text descriptions corresponding to the target images, wherein the text descriptions comprise at least one of object text descriptions and scene text descriptions of target objects in the target images.
Specifically, the target image may be a still picture or a dynamic video. The electronic device can input multiple target images into the image-text multimodal large model at the same time and perform structuring processing on them simultaneously, so as to improve the processing efficiency of data structuring.
S102: and extracting features of the target image to obtain an image semantic feature vector, and extracting features of the text description corresponding to the target image to obtain a text feature vector.
Wherein, the text description is a description of the target image in the form of words. Specifically, when the text description is a scene text description, the corresponding text feature vector is a scene feature vector; when the text description is an object text description, the corresponding text feature vector is an object feature vector.
S103: and fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector.
S104: and obtaining a recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fusion feature vector.
Specifically, the recognition result corresponding to the text description may be at least one of a scene classification corresponding to the target image and a type of the target object in the target image. For example, when the text description is a scene text description, the recognition result is a scene classification corresponding to the target image; when the text description is an object text description, the recognition result is the type of the target object; when the text description is a scene text description and an object text description, the recognition result may be a scene classification corresponding to the target image and a type of the target object in the target image.
S105: and carrying out structuring treatment on the target image based on the identification result to obtain structured data.
According to the embodiments of this application, the image semantic feature vector and the text feature vector are obtained by extracting features from the target image and from the text description corresponding to the target image, where the text description comprises at least one of a scene text description and an object text description. Feature fusion is performed on the image semantic feature vector and the text feature vector to obtain a fusion feature vector, the fusion feature vector is decoded to obtain a decoding feature vector, and a recognition result corresponding to the image is obtained based on the decoding feature vector, where the recognition result comprises at least one of a target detection result and a scene classification. The structuring processing of the image is then realized based on the recognition result. The method is based on the image-text multimodal large model: it takes the text description and the image as input, obtains the recognition result, and performs structuring processing on the image according to the recognition result.
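To make the flow of S101-S105 concrete, the following Python sketch strings the steps together; the encoder, fusion, decoding and head callables are hypothetical placeholders rather than the actual modules of the image-text multimodal large model.

```python
from typing import Callable, Sequence

def process_image(image,
                  text_descriptions: Sequence[str],
                  image_encoder: Callable,   # hypothetical: image -> image semantic feature vector
                  text_encoder: Callable,    # hypothetical: text  -> text feature vector
                  fuse: Callable,            # hypothetical: (image_vec, text_vec) -> fusion feature vector
                  decode: Callable,          # hypothetical: fusion feature vector -> decoding feature vector
                  head: Callable) -> dict:   # hypothetical: decoding feature vector -> recognition result
    # S102: feature extraction for the target image and for each text description
    image_vec = image_encoder(image)
    recognition_results = []
    for description in text_descriptions:
        text_vec = text_encoder(description)
        fusion_vec = fuse(image_vec, text_vec)        # S103: fusion feature vector
        decoding_vec = decode(fusion_vec)             # S104 (part 1): decoding feature vector
        recognition_results.append(head(decoding_vec))  # S104 (part 2): recognition result

    # S105: structuring based on the recognition results
    return {"image": image, "recognition": recognition_results}
```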
In one possible implementation manner, the image processing method provided by the application can be applied to a scene for classifying a target image as shown in fig. 1, so as to realize scene classification of the target image. An image processing method according to an embodiment of the present application is described below with reference to fig. 6. Fig. 6 is a flowchart of a scene classification according to an embodiment of the present application.
S201, acquiring data to be processed.
Specifically, data to be processed is obtained, wherein the data to be processed comprises at least one target image and text descriptions corresponding to the target images. The text description is a scene text description corresponding to the target image.
The target image may be a still picture or a dynamic video. Multiple target images can be input into the image-text multimodal large model at the same time and structured simultaneously, so as to improve the processing efficiency of data structuring.
Illustratively, the target image and the scene text description corresponding to the target image are input into the image-text multimodal large model. The scene text description describes, in text form, the scene corresponding to the target image. For example, for a picture of a basketball court, the corresponding scene text description may be "a basketball court with players playing ball".
A multimodal large model is a model that involves modal information of two or more modalities. The image-text multimodal large model in the embodiments of this application involves the two modalities of text and image. According to where the two modalities sit relative to the model, multimodal large models can be divided into categories such as text-to-image models and image-to-text models.
The image-text multimodal large model in the embodiments of this application means that images and text are input into the model together, and it can be applied to vision-language pre-training tasks. Unlike single-modality language or vision pre-training, multimodal joint pre-training can provide richer information to the model and thus improve its learning ability.
Meanwhile, the traditional approach of classifying scenes with a deep learning network trained under supervision needs a large amount of sample data that must be labeled before the model can be trained, so a model trained in this way is only suitable for certain specific fields or scenarios. Such a model also generalizes poorly overall: for a newly added scene type, a scene classification method based on a supervised deep learning network needs a large amount of sample data of that type in order to retrain the model, which lowers the development efficiency of the model and in turn affects the efficiency of data structuring.
This application is based on the image-text multimodal large model and takes the target image and its corresponding scene text description as input. The amount of sample data required for training is relatively small, and for a newly added scene type only the corresponding scene description needs to be added to the pre-built scene text description library; the model does not need to be retrained, which greatly improves development efficiency and in turn effectively improves the efficiency of data structuring.
S202, extracting features of the target image to obtain image semantic features.
Image semantic features represent the meaning of the image content. This meaning can be expressed in language, including natural language and symbolic language (mathematical language), and corresponds to the ways in which the human visual system understands an image. For example, for an image of a car, the image semantics may include the natural-language word "car", or a symbol representing the car in the image, where the symbol refers to a car having the same properties as the car in the image.
Specifically, in the embodiment of the application, the image semantic features of the target image can be extracted based on the CV algorithm model. The type of CV algorithm model in the application is not particularly limited.
S203, coding the image semantic features to obtain image semantic feature vectors.
Specifically, the process of encoding the image semantic features to obtain the image semantic feature vectors refers to the process of converting the image semantic features into a set of feature vectors or feature descriptors representing the image semantics, so as to realize applications such as image classification, retrieval, matching and the like. The extracted image semantic features are converted into image semantic feature vectors through coding, so that subsequent processing and calculation of images can be facilitated.
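As a purely illustrative sketch of S202-S203 (the convolutional backbone, layer sizes and PyTorch framing are assumptions, not the disclosed CV algorithm model), a small backbone can extract image semantic features and a linear projection can encode them into an image semantic feature vector:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Hypothetical CV backbone: extracts image semantic features (S202)
    and encodes them into an image semantic feature vector (S203)."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(            # feature extraction
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)      # encoding into a feature vector

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image).flatten(1)    # (batch, 128) image semantic features
        return self.proj(feats)                    # (batch, embed_dim) image semantic feature vector

# Usage: vec = ImageEncoder()(torch.randn(1, 3, 224, 224))
```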
S204, extracting features of the scene text description corresponding to the target image to obtain scene semantic features.
Wherein, similar to the image semantic features, the scene semantic features are used to represent the scene to which the image corresponds, which can be expressed in language, including natural language and symbolic language (mathematical language).
Specifically, S204 may be executed in parallel with S202, and in the embodiment of the present application, feature extraction may be performed on a scene text description corresponding to the image based on a natural language processing (Natural Language Processing, NLP) algorithm.
NLP algorithms enable machines to understand, process and analyze human language. NLP offers strong semantic understanding because it grasps the contextual meaning of data in depth. Based on this strong semantic capability of NLP, the accuracy of scene classification can be effectively improved.
S205, coding the scene semantic features to obtain scene feature vectors.
Specifically, the process of encoding the scene semantic features to obtain the scene feature vectors refers to a process of converting the scene semantic features into a set of feature vectors or feature descriptors representing the scene semantics, thereby facilitating the subsequent classification of the target image according to the scene.
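Correspondingly, S204-S205 can be sketched as a small text encoder that maps a tokenized scene text description to a scene feature vector; the transformer configuration and vocabulary size below are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Hypothetical NLP-style encoder: extracts scene semantic features from a
    scene text description (S204) and encodes them into a scene feature vector (S205)."""
    def __init__(self, vocab_size: int = 30000, embed_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embedding(token_ids))  # (batch, seq_len, embed_dim)
        return hidden.mean(dim=1)                         # pooled scene feature vector

# Usage: vec = TextEncoder()(torch.randint(0, 30000, (1, 12)))
```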
S206, fusing the semantic feature vector and the scene feature vector of the image to obtain a first fused feature vector.
Specifically, fusion can strongly associate the image semantic feature vector with the scene feature vector, so that the two are closely related to each other, and a first fusion feature vector is obtained.
S207, decoding the first fusion feature vector to obtain a first decoding feature vector.
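One possible way to realize S206-S207, given only as an assumed sketch (cross-attention fusion followed by a feed-forward decoder is an illustrative choice, not the disclosed architecture), is:

```python
import torch
import torch.nn as nn

class FuseAndDecode(nn.Module):
    """Hypothetical fusion (S206) and decoding (S207) of the image semantic
    feature vector and the scene feature vector."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, embed_dim))

    def forward(self, image_vec: torch.Tensor, text_vec: torch.Tensor) -> torch.Tensor:
        # Treat each vector as a length-1 sequence; the text query attends to the image features.
        q, kv = text_vec.unsqueeze(1), image_vec.unsqueeze(1)
        fused, _ = self.cross_attn(q, kv, kv)        # first fusion feature vector
        return self.decoder(fused.squeeze(1))        # first decoding feature vector
```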
S208, adding a classification head for the first decoding feature vector to obtain a scene classification result corresponding to the target image.
Specifically, a classification head is added to the first decoding feature vector to obtain a first decoding feature vector with the classification head added, and the scene classification corresponding to the scene text description is obtained from this vector. In general, after a feature vector is extracted with a feature extraction network, a classification head is output through a convolution layer; the classification head corresponds to the category information of a target in the image. In the embodiments of this application, the classification head corresponds to the category information of the scene in the image. After the first decoding feature vector is obtained, a classification head is added to it to obtain the first decoding feature vector with the classification head added, so that the first decoding feature vector is attributed to the corresponding scene classification.
For example, in the embodiment of the present application, a scene text description library may be pre-constructed, where the scene text description library includes text descriptions for various scenes. After the first decoding feature vector is obtained, the first decoding feature vector is matched with scene text descriptions in a scene text description library constructed in advance, so that a scene classification result is obtained.
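As an illustrative sketch of S208, matching the first decoding feature vector against pre-encoded entries of the scene text description library could be done with cosine similarity; the similarity metric and helper names here are assumptions, not a mandated implementation:

```python
import torch
import torch.nn.functional as F

def classify_scene(decoding_vec: torch.Tensor,
                   scene_library: dict[str, torch.Tensor]) -> str:
    """Hypothetical classification head: match the 1-D first decoding feature vector
    against pre-encoded scene text descriptions and return the best scene class."""
    names = list(scene_library.keys())
    lib = torch.stack([scene_library[n] for n in names])        # (num_scenes, dim)
    sims = F.cosine_similarity(decoding_vec.unsqueeze(0), lib)   # similarity per scene
    return names[int(sims.argmax())]

# Usage (with hypothetical pre-encoded descriptions):
# scene = classify_scene(decoding_vec, {"sports field": vec_a, "office": vec_b})
```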
In one possible implementation, one target image corresponds to multiple scene text descriptions. Feature extraction can be performed on the multiple scene text descriptions corresponding to the target image to obtain multiple scene feature vectors, and the image semantic feature vector is fused with each scene feature vector to obtain multiple first fusion feature vectors. For each first fusion feature vector, a recognition result corresponding to the respective scene text description is obtained from the first decoding feature vector produced by decoding that fusion feature vector, and the target image is subjected to structuring processing based on the multiple recognition results to obtain structured data.
In the embodiments of this application, the target image and the scene text description corresponding to the target image are input into the image-text multimodal large model; the image semantic feature vector corresponding to the target image and the scene feature vector corresponding to the scene text description are fused to obtain a first fusion feature vector, and the first decoding feature vector is matched with the scene text descriptions in the pre-built scene text description library to obtain the scene classification result corresponding to the image. For scene recognition of internet CV data, this application recognizes the image based on the image and its corresponding scene text description and determines the scene classification of the image in real time. Meanwhile, based on a scene text description, target images that match the description can be effectively screened from a mass image library. For a newly added scene type, only the description of the new type needs to be added to the pre-built scene text description library and the model does not need to be retrained, which effectively improves development efficiency and, in turn, the efficiency of data structuring.
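The library-screening use mentioned above could, under the simplifying assumption that image vectors and the encoded scene description are compared directly in a shared embedding space (rather than through the full fusion pipeline), be sketched as follows; all names are hypothetical:

```python
import torch
import torch.nn.functional as F

def screen_image_library(scene_description_vec: torch.Tensor,
                         library_image_vecs: dict[str, torch.Tensor],
                         top_k: int = 5) -> list[str]:
    """Hypothetical screening: rank images in a library by how well their image
    semantic feature vectors match an encoded scene text description."""
    ids = list(library_image_vecs.keys())
    vecs = torch.stack([library_image_vecs[i] for i in ids])           # (num_images, dim)
    sims = F.cosine_similarity(scene_description_vec.unsqueeze(0), vecs)
    order = sims.argsort(descending=True)[:top_k]                      # best matches first
    return [ids[int(i)] for i in order]
```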
In one possible implementation, the image processing method provided by this application can be applied to the scenario of detecting target objects in a target image shown in FIG. 2. An image processing method according to an embodiment of this application is described below with reference to FIG. 7, which is a flowchart of object detection according to an embodiment of this application. Unlike the previous embodiment, in this embodiment the target image and the object text descriptions corresponding to the target objects in the target image are input together into the image-text multimodal large model, so as to identify the target objects in the target image, or images containing the target objects are screened out of an image library according to the object text descriptions. The parts that are the same as in the previous embodiment are not repeated here.
S301, acquiring data to be processed.
Specifically, data to be processed is obtained, wherein the data to be processed comprises at least one target image and text descriptions corresponding to the target images. The text description is an object text description corresponding to the target object in the target image.
Illustratively, the target image and the object text descriptions corresponding to its target objects are input into the image-text multimodal large model. An object text description describes, in text form, a target object contained in the image. For example, for a picture of a basketball court containing a basketball, players and so on, the corresponding object text descriptions may be a description of the basketball, a description of a player, and so on.
In the image-text multimodal large model of the embodiments of this application, the target object to be identified in the target image is described through an object text description, and the target image and the corresponding object text description are taken as input, so that the specific information of the target object in the target image is inferred. In the model training process, data sets can be combined for training; because the image-text multimodal large model takes the target image and the corresponding object text description as input, label conflicts between data sets do not affect model accuracy, and model generalization can be greatly improved.
S302, extracting features of the target image to obtain image semantic features.
S303, coding the image semantic features to obtain image semantic feature vectors.
S304, extracting features of object text description corresponding to the target object in the target image to obtain object semantic features.
Wherein, like image semantic features, object semantic features are used to represent objects contained in an image, which can be expressed in languages, including natural language and symbolic language (mathematical language).
Specifically, in the embodiment of the application, feature extraction can be performed on the object text description corresponding to the image based on a natural language processing (Natural Language Processing, NLP) algorithm.
S305, encoding the object semantic features to obtain object feature vectors.
Specifically, the process of encoding object semantic features to obtain object feature vectors refers to a process of converting object semantic features into a set of feature vectors or feature descriptors representing target semantics, thereby facilitating subsequent detection of a target object in a target image.
S306, fusing the image semantic feature vector and the object feature vector to obtain a second fused feature vector.
Specifically, fusion can strongly associate the image semantic feature vector with the object feature vector, so that the two are closely related to each other, and a second fusion feature vector is obtained.
S307, decoding the second fusion feature vector to obtain a second decoding feature vector.
And S308, adding a detection head for the second decoding feature vector to obtain a target object corresponding to the object text description in the target image.
Specifically, a detection head is added to the second decoding feature vector, a second decoding feature vector with the detection head added is obtained, and a target object corresponding to the object text description in the target image is obtained according to the second decoding feature vector with the detection head added.
The detection head is responsible for detection and localization, and corresponds to the category and position information of the target objects contained in the target image. After the second decoding feature vector is obtained, a detection head is added to it to obtain the second decoding feature vector with the detection head added, thereby obtaining the target objects in the target image. For example, the target object may be the category corresponding to a detected target, such as detected face 1, … …, face n, and type 1, … …, type n among other types, in FIG. 7.
For example, in the embodiment of the application, an object text description library is pre-constructed, and the object text description library contains text descriptions of various target objects possibly existing in the image. After the second decoding feature vector is obtained, matching the second decoding feature vector with the object text description in the pre-constructed object text description library so as to perform target detection.
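As an illustrative sketch of S308 (the box parameterization, head sizes and matching strategy are assumptions), a small detection head could predict a bounding box and a score over the pre-built object text description library from the second decoding feature vector:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Hypothetical detection head (S308): from a second decoding feature vector,
    predict a bounding box and scores over the object text description library."""
    def __init__(self, embed_dim: int = 512, num_object_descriptions: int = 100):
        super().__init__()
        self.box_head = nn.Linear(embed_dim, 4)                          # (x, y, w, h)
        self.match_head = nn.Linear(embed_dim, num_object_descriptions)  # score per description

    def forward(self, decoding_vec: torch.Tensor):
        boxes = self.box_head(decoding_vec)        # predicted location of the target object
        scores = self.match_head(decoding_vec)     # match against object text descriptions
        labels = scores.argmax(dim=-1)             # index of the best-matching object description
        return boxes, labels
```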
In one possible implementation, one target image corresponds to multiple object text descriptions. Feature extraction can be performed on the multiple object text descriptions corresponding to the target image to obtain multiple object feature vectors; the image semantic feature vector is fused with each object feature vector to obtain multiple second fusion feature vectors. For each second fusion feature vector, a recognition result corresponding to the respective object text description is obtained from the second decoding feature vector produced by decoding that fusion feature vector, and the target image is subjected to structuring processing based on the multiple recognition results to obtain structured data.
In the embodiments of this application, the target image and the object text descriptions corresponding to the target image are input into the image-text multimodal large model; the image semantic feature vector corresponding to the target image and the object feature vector corresponding to an object text description are fused to obtain a second fusion feature vector, and the second decoding feature vector is matched with the object text descriptions in the pre-built object text description library to obtain the target detection result corresponding to the target image, i.e., the target objects contained in the image. After data is structured on the basis of the image-text multimodal large model, a target image can be input and the target objects in it detected, or an object text description can be input and images containing the corresponding target object can be effectively screened from an image library. For a newly added target type, only the description of the new type needs to be added to the pre-built object text description library and the model does not need to be retrained, which effectively improves development efficiency, makes the method suitable for various types of data, and improves the efficiency of data structuring.
In one possible implementation, the image processing method provided by this application can be applied to the scenario of recognizing a target image shown in FIG. 3. An image processing method according to an embodiment of this application is described below with reference to FIG. 8, which is a flowchart combining scene classification and object detection according to an embodiment of this application. Unlike the previous embodiments, in this embodiment the target image, the object text descriptions corresponding to the target objects in the target image, and the scene text description corresponding to the target image are input together into the image-text multimodal large model, so that the scene corresponding to the target image can be classified while the target objects in the image are detected. The parts that are the same as in the previous embodiments are not repeated here.
S401, acquiring data to be processed.
Specifically, data to be processed is obtained, wherein the data to be processed comprises at least one target image and text descriptions corresponding to the target images. The text description is a scene description corresponding to the target image and an object text description corresponding to the target object in the target image.
The target image, the object text descriptions corresponding to the target image, and the scene text description corresponding to the target image are input into the image-text multimodal large model. In the embodiments of this application, the object text descriptions describe the types of the target objects to be identified in the target image, and the scene text description describes the scene classification corresponding to the target image. Taking the target image and its corresponding text descriptions as input, the model determines the scene classification corresponding to the target image while inferring the specific information of the target objects in the image.
S402, extracting features of the target image to obtain image semantic features.
S403, encoding the image semantic features to obtain the image semantic feature vector.
S404, extracting features of the scene text description and the object text description corresponding to the target object in the target image to obtain scene semantic features and object semantic features.
S405, encoding the object semantic features and the scene semantic features to obtain object feature vectors and scene feature vectors.
S406, fusing the image semantic feature vector and the scene feature vector to obtain a first fused feature vector, and fusing the image semantic feature vector and the object feature vector to obtain a second fused feature vector.
S407, decoding the first fusion feature vector to obtain a first decoding feature vector, and decoding the second fusion feature vector to obtain a second decoding feature vector.
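As a concrete illustration of steps S406 and S407, the following sketch fuses the image semantic feature vector with a text feature vector and decodes the result. The concatenation-plus-projection fusion, the tanh decoder, and the embedding width are assumptions made only for this sketch; the actual model may use cross-attention or any other learned fusion and decoding mechanism.

```python
# Minimal sketch of S406-S407 under assumed fusion/decoding operators.
import numpy as np

rng = np.random.default_rng(0)
D = 256  # assumed embedding width

W_fuse = rng.standard_normal((D, 2 * D)) * 0.02   # fusion projection (assumed)
W_dec = rng.standard_normal((D, D)) * 0.02        # decoder projection (assumed)

def fuse(image_vec: np.ndarray, text_vec: np.ndarray) -> np.ndarray:
    """Fuse an image semantic feature vector with a text feature vector."""
    return W_fuse @ np.concatenate([image_vec, text_vec])

def decode(fused_vec: np.ndarray) -> np.ndarray:
    """Decode a fused feature vector into a decoding feature vector."""
    return np.tanh(W_dec @ fused_vec)

image_vec = rng.standard_normal(D)
scene_vec = rng.standard_normal(D)   # from the scene text description
object_vec = rng.standard_normal(D)  # from the object text description

first_decoded = decode(fuse(image_vec, scene_vec))    # first branch (S406/S407)
second_decoded = decode(fuse(image_vec, object_vec))  # second branch (S406/S407)
```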
S408, adding a classification head for the first decoding feature vector to obtain a scene classification result corresponding to the image.
Specifically, a classification head is added to the first decoding feature vector to obtain the first decoding feature vector with the classification head added; the scene classification corresponding to the scene text description is then obtained according to the first decoding feature vector with the classification head added. In an exemplary embodiment of the present application, a scene text description library is pre-constructed; after the first decoding feature vector is obtained, it is matched against the scene text descriptions in the pre-constructed scene text description library to obtain the scene classification result corresponding to the target image.
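A possible illustration of step S408 follows. Treating the classification head as a dot-product scoring of the first decoding feature vector against a pre-built scene text description library is an assumption of this sketch; the function names, vector dimension, and softmax scoring are illustrative only.

```python
# Sketch of a classification head scoring against a scene text description library.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_scene(first_decoded: np.ndarray,
                   scene_library: dict[str, np.ndarray]) -> tuple[str, float]:
    """Return the best-matching scene text description and its probability."""
    names = list(scene_library)
    logits = np.array([first_decoded @ scene_library[n] for n in names])
    probs = softmax(logits)
    best = int(np.argmax(probs))
    return names[best], float(probs[best])

scene_library = {"street": np.random.rand(256),
                 "indoor office": np.random.rand(256),
                 "stadium": np.random.rand(256)}
print(classify_scene(np.random.rand(256), scene_library))
```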
S409, adding a detection head for the second decoding feature vector to obtain a target object in the target image.
Specifically, a detection head is added to the second decoding feature vector to obtain the second decoding feature vector with the detection head added; the target object corresponding to the object text description in the target image is obtained according to the second decoding feature vector with the detection head added. The detection head is responsible for detection and localization, and outputs the category and position information of the target objects contained in the target image. After the second decoding feature vector is obtained, the detection head is applied to it, thereby obtaining the target objects in the image. By way of example, in the embodiment of the present application an object text description library is pre-constructed, which contains text descriptions of various objects that may exist in an image. After the second decoding feature vector is obtained, it is matched against the object text descriptions in the pre-constructed object text description library so as to perform target detection.
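The sketch below illustrates one way step S409 could behave, assuming the detection head receives one decoding query vector per candidate object, regresses a normalized box for it, and assigns the category by matching against the object text description library. The box regressor, the score threshold, and the shapes are assumptions made for this sketch.

```python
# Sketch of an assumed detection head: per-query box regression plus library matching.
import numpy as np

rng = np.random.default_rng(1)
D = 256
W_box = rng.standard_normal((4, D)) * 0.02  # regresses (cx, cy, w, h); assumed

def detect(query_vecs: np.ndarray,
           object_library: dict[str, np.ndarray],
           score_threshold: float = 0.3) -> list[dict]:
    """For each decoding query vector, predict a normalized box and the
    best-matching object text description; keep detections above threshold."""
    detections = []
    for q in query_vecs:
        box = 1.0 / (1.0 + np.exp(-(W_box @ q)))      # normalized box, assumed
        names = list(object_library)
        scores = np.array([q @ object_library[n] for n in names])
        best = int(np.argmax(scores))
        if scores[best] >= score_threshold:
            detections.append({"category": names[best],
                               "box_cxcywh": box.tolist(),
                               "score": float(scores[best])})
    return detections

# Hypothetical usage with random embeddings standing in for model outputs.
object_library = {"pedestrian": rng.random(D), "vehicle": rng.random(D)}
print(detect(rng.standard_normal((3, D)), object_library))
```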
In a possible implementation manner, in the embodiment of the present application, the face detection result obtained from the target detection may be used as the input of a face recognition algorithm, so that the face recognition scheme does not need to perform face detection separately, thereby improving the efficiency of face recognition. Fig. 9 is a flowchart of face recognition according to an embodiment of the present application.
S410, inputting the facial features in the target object and the facial features in the facial feature library into a facial comparison model.
S411, comparing the facial features in the target object with the facial features in the facial feature library to obtain a facial recognition result corresponding to the target object.
Specifically, after the target object in the target image is obtained, the facial features in the target object and the facial features in the facial feature library are input into the facial comparison model, and the face recognition result corresponding to the target object is obtained.
The embodiment of the application can recognize human faces and identify the specific identity of a person; it can also recognize animal faces. As shown in fig. 9, by comparing the facial features in the target object with the facial features in the facial feature library, the persons in the target object can be identified as person A, person B, and person C.
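The following sketch illustrates one plausible form of the facial comparison in S410-S411, assuming faces are compared by cosine similarity against a facial feature library. The threshold, the feature dimension, and the library entries are assumptions, not the application's prescribed comparison model.

```python
# Sketch of face comparison against a facial feature library (assumed cosine rule).
import numpy as np

def recognize_face(face_vec: np.ndarray,
                   face_library: dict[str, np.ndarray],
                   threshold: float = 0.6) -> str:
    """Return the identity whose reference feature best matches, or 'unknown'."""
    best_name, best_score = "unknown", threshold
    for name, ref_vec in face_library.items():
        score = float(np.dot(face_vec, ref_vec) /
                      (np.linalg.norm(face_vec) * np.linalg.norm(ref_vec) + 1e-8))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

face_library = {"person A": np.random.rand(128),
                "person B": np.random.rand(128),
                "person C": np.random.rand(128)}
print(recognize_face(np.random.rand(128), face_library))
```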
In a conventional face recognition method, the person or animal needs to be detected first, and the face is then detected within the detected person or animal. By contrast, the embodiment of the application relies on the target detection capability of the image-text multi-modal large model: the faces in the image are already detected during the model's inference stage and can be used directly for face recognition, which effectively improves the efficiency of face recognition. Meanwhile, in the process of obtaining the face recognition result, the idea of the image-text multi-modal large model is adopted: by means of the input flexibility of NLP, various scene classifications and target detections are performed on internet CV data, and, combined with the face recognition algorithm, every element can be recognized and classified, realizing structured management of internet CV data.
In summary, the embodiment of the application inputs the target image and the text descriptions corresponding to the target image into the image-text multi-modal large model, adds a detection head to the second decoding feature vector to obtain the target object in the target image, and adds a classification head to the first decoding feature vector to obtain the scene classification result corresponding to the target image. Scene classification and target detection share the feature extraction stage and only need to be output separately in the second half, which effectively improves model efficiency.
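As a hint of what the structured data produced from the recognition results might look like, the following sketch packs the scene classification, the detected objects, and the face recognition results into one record per target image. The field names and record layout are illustrative assumptions only.

```python
# Sketch of the structuring step: recognition results -> one structured record.
import json

def structure_image_record(image_id: str,
                           scene: str,
                           detections: list[dict],
                           face_ids: list[str]) -> str:
    record = {
        "image_id": image_id,
        "scene_classification": scene,
        "objects": detections,          # category + box per detected object
        "recognized_faces": face_ids,   # output of the face comparison model
    }
    return json.dumps(record, ensure_ascii=False, indent=2)

print(structure_image_record(
    "img_0001.jpg",
    "street",
    [{"category": "pedestrian", "box_cxcywh": [0.4, 0.5, 0.1, 0.3], "score": 0.82}],
    ["person A"],
))
```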
An image processing apparatus according to an embodiment of the present application can implement the steps corresponding to the image processing method performed in any of the embodiments corresponding to fig. 5 to 9 above. Referring to fig. 10, fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the application. The functions realized by the image processing apparatus may be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The image processing apparatus includes: an input-output unit 1101 and a processing unit 1102;
The input/output unit 1101 is configured to obtain data to be processed, where the data to be processed includes at least one target image, and text descriptions corresponding to each target image, where the text descriptions include at least one of an object text description of a target object in the target image, and a scene text description;
the processing unit 1102 is configured to perform feature extraction on the target image to obtain an image semantic feature vector, and perform feature extraction on a text description corresponding to the target image to obtain a text feature vector; fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector; obtaining a recognition result corresponding to the text description according to a decoding feature vector obtained by decoding the fusion feature vector; and carrying out structuring treatment on the target image based on the identification result to obtain structured data.
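As a purely illustrative sketch of how the apparatus of fig. 10 could be organized in software, the classes and method names below (including the assumed model API encode_image, encode_text, fuse, decode, and recognize) are hypothetical and only mirror the division into an input-output unit and a processing unit.

```python
# Toy organization of the apparatus; the model API used here is an assumption.
class InputOutputUnit:
    def acquire(self, source: dict) -> tuple[list, list]:
        """Return (target_images, text_descriptions) from the data source."""
        return source["images"], source["texts"]

class ProcessingUnit:
    def __init__(self, model):
        self.model = model  # the image-text multi-modal large model (assumed API)

    def process(self, image, text) -> dict:
        image_vec = self.model.encode_image(image)   # image semantic feature vector
        text_vec = self.model.encode_text(text)      # text feature vector
        fused = self.model.fuse(image_vec, text_vec) # fused feature vector
        decoded = self.model.decode(fused)           # decoding feature vector
        result = self.model.recognize(decoded, text) # recognition result
        return {"text": text, "result": result}
```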
In a possible implementation manner, the text description is a scene text description, and the processing unit 1102 is configured to: perform feature extraction on the scene text description corresponding to the target image to obtain a scene feature vector; decode a first fused feature vector to obtain a first decoding feature vector, where the first fused feature vector is obtained by fusing the image semantic feature vector and the scene feature vector; add a classification head to the first decoding feature vector to obtain the first decoding feature vector with the classification head added; and obtain the scene classification corresponding to the target image according to the first decoding feature vector with the classification head added.
In a possible implementation manner, the text description is a scene text description, the text feature vector is a scene feature vector obtained by extracting features of the scene text description, the fused feature vector is a first fused feature vector obtained by fusing the image semantic feature vector and the scene feature vector, and the decoded feature vector is a first decoded feature vector obtained by decoding the first fused feature vector;
the processing unit 1102 is configured to: adding a classification head to the first decoding feature vector to obtain a first decoding feature vector added with the classification head; and obtaining the scene classification corresponding to the scene text description according to the first decoding feature vector added with the classification head.
In one possible implementation, the one target image corresponds to a plurality of scene text descriptions; the processing unit 1102 is configured to: extracting features of a plurality of scene text descriptions corresponding to the target image to obtain a plurality of scene feature vectors; respectively fusing the image semantic feature vectors and the scene feature vectors to obtain a plurality of first fused feature vectors; aiming at each first fusion feature vector, according to a first decoding feature vector obtained by decoding the first fusion feature vector, obtaining a recognition result corresponding to each scene text description; and respectively carrying out structuring treatment on the target image based on a plurality of recognition results to obtain structured data.
In a possible implementation manner, the text description is an object text description, the text feature vector is an object feature vector obtained by extracting features of the object text description, the fused feature vector is a second fused feature vector obtained by fusing the image semantic feature vector and the object feature vector, and the decoded feature vector is a second decoded feature vector obtained by decoding the second fused feature vector;
the processing unit 1102 is configured to: adding a detection head to the second decoding feature vector to obtain a second decoding feature vector added with the detection head; and obtaining a target object corresponding to the object text description in the target image according to the second decoding feature vector added with the detection head.
In one possible implementation, the one target image corresponds to a plurality of object text descriptions; the processing unit 1102 is configured to: extracting features of a plurality of object text descriptions corresponding to the target image to obtain a plurality of object feature vectors; respectively fusing the image semantic feature vectors and the object feature vectors to obtain a plurality of second fused feature vectors; aiming at each second fusion feature vector, according to a second decoding feature vector obtained by decoding the second fusion feature vector, obtaining a recognition result corresponding to each object text description; and respectively carrying out structuring treatment on the target image based on a plurality of recognition results to obtain structured data.
In one possible implementation manner, the text description is a scene text description and an object text description, the text feature vector is a scene feature vector obtained by feature extraction of the scene text description and an object feature vector obtained by feature extraction of the object text description, the fusion feature vector is a first fusion feature vector obtained by fusion of the image semantic feature vector and the scene feature vector and a second fusion feature vector obtained by fusion of the image semantic feature vector and the object feature vector, and the decoding feature vector is a first decoding feature vector obtained by decoding the first fusion feature vector and a second decoding feature vector obtained by decoding the second fusion feature vector;
the processing unit 1102 is configured to: adding a classification head to the first decoding feature vector to obtain a first decoding feature vector added with the classification head, and adding a detection head to the second decoding feature vector to obtain a second decoding feature vector added with the detection head; and obtaining scene classification corresponding to the scene text description according to the first decoding feature vector added with the classification head, and obtaining a target object corresponding to the object text description in the target image according to the second decoding feature vector added with the detection head.
In a possible implementation manner, the processing unit 1102 is further configured to: and comparing the facial features in the target object with the facial features in the facial feature library to obtain a facial recognition result corresponding to the facial features in the target object.
The embodiment of the application also provides a terminal device, as shown in fig. 11, which is a structural diagram of the terminal device provided by the embodiment of the application. For convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments of the present application. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like. The mobile phone is taken as an example of the terminal:
Fig. 11 is a block diagram of part of the structure of a mobile phone related to the terminal device provided by an embodiment of the present application. Referring to fig. 11, the mobile phone includes: a radio frequency (RF) circuit 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (Wi-Fi) module 570, a processor 560, and a power supply 590. Those skilled in the art will appreciate that the handset configuration shown in fig. 11 does not limit the handset, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 11:
the RF circuit 510 may be used for receiving and transmitting signals during a message or a call; in particular, downlink information from the base station is received and handed to the processor 560 for processing, and uplink data is sent to the base station. Generally, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 510 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), e-mail, short message service (SMS), and the like.
The memory 520 may be used to store software programs and modules, and the processor 560 performs the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), and the like. In addition, the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also referred to as a touch screen, may collect touch operations on or near it (for example, operations performed by the user on or near the touch panel 531 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 560, and can also receive and execute commands sent by the processor 560. In addition, the touch panel 531 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 531, the input unit 530 may also include other input devices 532. Specifically, the other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 540 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 540 may include a display panel 541, and optionally the display panel 541 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 531 may cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the operation is transmitted to the processor 560 to determine the type of the touch event, and the processor 560 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although in fig. 11 the touch panel 531 and the display panel 541 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions of the mobile phone.
Wi-Fi is a short-range wireless transmission technology. Through the Wi-Fi module 570, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media and the like, providing the user with wireless broadband internet access. Although fig. 11 shows the Wi-Fi module 570, it can be understood that it is not an essential part of the mobile phone and can be omitted as needed within the scope of not changing the essence of the application.
The processor 560 is the control center of the mobile phone. It connects the various parts of the entire phone by means of various interfaces and lines, and performs the various functions of the phone and processes data by running or executing the software programs and/or modules stored in the memory 520 and invoking the data stored in the memory 520, thereby monitoring the phone as a whole. Optionally, the processor 560 may include one or more processing units; preferably, the processor 560 may integrate an application processor, which mainly handles the operating system, the user interface, applications and the like, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 560.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described herein.
In an embodiment of the present application, the processor 560 included in the mobile phone further has the function of controlling the execution of the method flows described above with reference to fig. 5-8.
Fig. 12 is a schematic diagram of a server structure according to an embodiment of the present application. The server 620 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 622 (for example, one or more processors), a memory 632, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 642 or data 644. The memory 632 and the storage medium 630 may be transitory or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 620, the series of instruction operations in the storage medium 630.
The server 620 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like.
The steps performed by the server in the above embodiments may be based on the structure of the server 620 shown in fig. 12. For example, the steps performed by the server shown in fig. 2 in the above embodiments may be based on the server structure shown in fig. 12. For example, the processor 622 performs the following operations by invoking the instructions in the memory 632: acquiring data to be processed, wherein the data to be processed comprise at least one target image and the text descriptions corresponding to each target image, and the text descriptions comprise at least one of an object text description of a target object in the target image and a scene text description; extracting features of the target image to obtain an image semantic feature vector, and extracting features of the text description corresponding to the target image to obtain a text feature vector; fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector; obtaining a recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fused feature vector; and structuring the target image based on the recognition result to obtain structured data.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and modules described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The technical solutions provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the embodiments of the present application, and the description of the above embodiments is only intended to help understand the methods and core ideas of the embodiments of the present application. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the ideas of the embodiments of the present application. In summary, the contents of this specification should not be construed as limiting the embodiments of the present application.

Claims (12)

1. An image processing method, comprising:
acquiring data to be processed, wherein the data to be processed comprises at least one target image and text descriptions corresponding to each target image, and the text descriptions comprise at least one of object text descriptions and scene text descriptions of target objects in the target image;
extracting features of the target image to obtain an image semantic feature vector, and extracting features of text description corresponding to the target image to obtain a text feature vector;
fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector;
Obtaining a recognition result corresponding to the text description according to a decoding feature vector obtained by decoding the fusion feature vector;
and carrying out structuring treatment on the target image based on the identification result to obtain structured data.
2. The method according to claim 1, wherein the text description is a scene text description, the text feature vector is a scene feature vector obtained by feature extraction of the scene text description, the fused feature vector is a first fused feature vector obtained by fusing the image semantic feature vector and the scene feature vector, and the decoded feature vector is a first decoded feature vector obtained by decoding the first fused feature vector;
the obtaining the recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fusion feature vector comprises the following steps:
adding a classification head to the first decoding feature vector to obtain a first decoding feature vector added with the classification head;
and obtaining the scene classification corresponding to the scene text description according to the first decoding feature vector added with the classification head.
3. The method of claim 2, wherein the one target image corresponds to a plurality of scene text descriptions;
The feature extraction is performed on the text description corresponding to the target image to obtain a text feature vector, which comprises the following steps:
extracting features of a plurality of scene text descriptions corresponding to the target image to obtain a plurality of scene feature vectors;
the fusing of the image semantic feature vector and the text feature vector to obtain a fused feature vector comprises the following steps:
respectively fusing the image semantic feature vectors and the scene feature vectors to obtain a plurality of first fused feature vectors;
the obtaining the recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fusion feature vector comprises the following steps:
aiming at each first fusion feature vector, according to a first decoding feature vector obtained by decoding the first fusion feature vector, obtaining a recognition result corresponding to each scene text description;
the step of carrying out structuring processing on the target image based on the recognition result to obtain structured data comprises the following steps:
and respectively carrying out structuring treatment on the target image based on a plurality of recognition results to obtain structured data.
4. The method according to claim 1, wherein the text description is an object text description, the text feature vector is an object feature vector obtained by feature extraction of the object text description, the fused feature vector is a second fused feature vector obtained by fusing the image semantic feature vector and the object feature vector, and the decoded feature vector is a second decoded feature vector obtained by decoding the second fused feature vector;
The obtaining the recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fusion feature vector comprises the following steps:
adding a detection head to the second decoding feature vector to obtain a second decoding feature vector added with the detection head;
and obtaining a target object corresponding to the object text description in the target image according to the second decoding feature vector added with the detection head.
5. The method of claim 4, wherein the one target image corresponds to a plurality of object text descriptions;
the feature extraction is performed on the text description corresponding to the target image to obtain a text feature vector, which comprises the following steps:
extracting features of a plurality of object text descriptions corresponding to the target image to obtain a plurality of object feature vectors;
the fusing of the image semantic feature vector and the text feature vector to obtain a fused feature vector comprises the following steps:
respectively fusing the image semantic feature vectors and the object feature vectors to obtain a plurality of second fused feature vectors;
the obtaining the recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fusion feature vector comprises the following steps:
Aiming at each second fusion feature vector, according to a second decoding feature vector obtained by decoding the second fusion feature vector, obtaining a recognition result corresponding to each object text description;
the step of carrying out structuring processing on the target image based on the recognition result to obtain structured data comprises the following steps:
and respectively carrying out structuring treatment on the target image based on a plurality of recognition results to obtain structured data.
6. The method according to claim 1, wherein the text description is a scene text description and an object text description, the text feature vector is a scene feature vector obtained by feature extraction of the scene text description and an object feature vector obtained by feature extraction of the object text description, the fused feature vector is a first fused feature vector obtained by fusing the image semantic feature vector and the scene feature vector and a second fused feature vector obtained by fusing the image semantic feature vector and the object feature vector, and the decoded feature vector is a first decoded feature vector obtained by decoding the first fused feature vector and a second decoded feature vector obtained by decoding the second fused feature vector;
The obtaining the recognition result corresponding to the text description according to the decoding feature vector obtained by decoding the fusion feature vector comprises the following steps:
adding a classification head to the first decoding feature vector to obtain a first decoding feature vector added with the classification head, and adding a detection head to the second decoding feature vector to obtain a second decoding feature vector added with the detection head;
and obtaining scene classification corresponding to the scene text description according to the first decoding feature vector added with the classification head, and obtaining a target object corresponding to the object text description in the target image according to the second decoding feature vector added with the detection head.
7. The method according to any one of claims 4 to 6, wherein after obtaining the recognition result corresponding to the text description according to the decoded feature vector obtained by decoding the fused feature vector, the method further comprises:
and comparing the facial features in the target object with the facial features in the facial feature library to obtain a facial recognition result corresponding to the facial features in the target object.
8. An image processing apparatus, comprising:
An input-output unit and a processing unit;
the input/output unit is used for acquiring data to be processed, wherein the data to be processed comprises at least one target image and text descriptions corresponding to each target image, and the text descriptions comprise at least one of object text descriptions and scene text descriptions of target objects in the target image;
the processing unit is used for carrying out feature extraction on the target image to obtain an image semantic feature vector, and carrying out feature extraction on a text description corresponding to the target image to obtain a text feature vector; fusing the image semantic feature vector and the text feature vector to obtain a fused feature vector; obtaining a recognition result corresponding to the text description according to a decoding feature vector obtained by decoding the fusion feature vector; and carrying out structuring treatment on the target image based on the identification result to obtain structured data.
9. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, can implement the method of any of claims 1-7.
11. A computer program product comprising instructions, characterized in that the computer program product comprises program instructions which, when run on a computer or a processor, cause the computer or the processor to perform the method of any of claims 1 to 7.
12. A chip system, comprising:
a communication interface for inputting and/or outputting information;
a processor for executing a computer-executable program to cause a device on which the chip system is installed to perform the method of any one of claims 1 to 7.
CN202310854778.8A 2023-07-12 2023-07-12 Image processing method, device, computer equipment and storage medium Pending CN116758362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310854778.8A CN116758362A (en) 2023-07-12 2023-07-12 Image processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310854778.8A CN116758362A (en) 2023-07-12 2023-07-12 Image processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116758362A true CN116758362A (en) 2023-09-15

Family

ID=87958994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310854778.8A Pending CN116758362A (en) 2023-07-12 2023-07-12 Image processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116758362A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593392A (en) * 2023-09-27 2024-02-23 书行科技(北京)有限公司 Image generation method, device, computer equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
KR102646667B1 (en) Methods for finding image regions, model training methods, and related devices
CN110704661B (en) Image classification method and device
CN111209423B (en) Image management method and device based on electronic album and storage medium
US9639633B2 (en) Providing information services related to multimodal inputs
CN111339737B (en) Entity linking method, device, equipment and storage medium
CN112685578B (en) Method and device for providing multimedia information content
CN114722937B (en) Abnormal data detection method and device, electronic equipment and storage medium
CN112214605A (en) Text classification method and related device
CN114357278B (en) Topic recommendation method, device and equipment
CN113596601A (en) Video picture positioning method, related device, equipment and storage medium
CN115859220A (en) Data processing method, related device and storage medium
CN116758362A (en) Image processing method, device, computer equipment and storage medium
CN113269279B (en) Multimedia content classification method and related device
CN112307198B (en) Method and related device for determining abstract of single text
CN113822038A (en) Abstract generation method and related device
CN116778306A (en) Fake object detection method, related device and storage medium
CN116071614A (en) Sample data processing method, related device and storage medium
CN112632222B (en) Terminal equipment and method for determining data belonging field
CN111859240A (en) Picture exporting method, related device and storage medium
CN113569043A (en) Text category determination method and related device
CN116386647B (en) Audio verification method, related device, storage medium and program product
CN115412726B (en) Video authenticity detection method, device and storage medium
CN115730030B (en) Comment information processing method and related device
CN117057345B (en) Role relation acquisition method and related products
CN115525554B (en) Automatic test method, system and storage medium for model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination