CN117711001B - Image processing method, device, equipment and medium

Image processing method, device, equipment and medium

Publication number: CN117711001B (application CN202410155582.4A)
Authority: CN (China)
Prior art keywords: image, sample, features, text, feature
Legal status: Active (granted)
Application number: CN202410155582.4A
Other languages: Chinese (zh)
Other versions: CN117711001A
Inventor: 刘刚
Assignee: Tencent Technology Shenzhen Co Ltd (current and original assignee)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410155582.4A
Publication of CN117711001A
Application granted
Publication of CN117711001B

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides an image processing method, apparatus, device and medium, which relate to the technical field of artificial intelligence and can be applied to scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method includes: acquiring image features, detection frame features and image text of an image to be analyzed; performing feature embedding on the image text to obtain image-text features; and performing image analysis on the image features, the detection frame features and the image-text features based on an image processing model to obtain an image analysis result, where the image analysis result includes multi-dimensional content labels. The image processing model is obtained by training a feature fusion network of an initial image processing model and by training a visual language generation network of the initial image processing model for analysis content generation, the visual language generation network being constructed based on a large language model. The application can improve the modeling efficiency, generalization and practicability of image processing.

Description

Image processing method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image processing method, apparatus, device, and medium.
Background
In the era of rapid internet development, as the threshold of content production falls, the upload volume of image data grows at an exponential rate and the related recommendation scenes expand rapidly. Efficient, deep and accurate image content understanding enables content auditing and classification before image data are distributed and recommended, and helps information-flow businesses build a bridge between content and users.
In related technical solutions, content understanding is realized by standardized processing of manually labeled simple tag information, but this cannot meet advanced personalized recommendation requirements and the manual labeling cost is very high. Alternatively, a large model comprising a multi-branch network is constructed to predict a model result separately for each scene, which also requires labeling a large amount of sample data separately for each scene at high cost. Moreover, current visual models are usually trained to predict and identify a limited set of object types, and this strictly supervised training mode limits the generalization and practicability of the models. Therefore, existing content understanding schemes cannot meet the demands of service and scene diversity in terms of modeling efficiency, cost and scalability.
Disclosure of Invention
The application provides an image processing method, an image processing device, image processing equipment and an image processing medium, which can remarkably improve the modeling efficiency, generalization and practicability of image processing.
In one aspect, the present application provides an image processing method, the method including:
Acquiring image features, detection frame features and image texts of an image to be analyzed, wherein the image texts at least comprise frame category texts corresponding to the detection frame features, and the frame category texts are used for indicating content categories of image areas corresponding to the detection frame features in the image to be analyzed;
Performing feature embedding on the image text to obtain image-text features;
Performing image analysis on the image features, the detection frame features and the image-text features based on an image processing model to obtain an image analysis result, wherein the image analysis result comprises a multi-dimensional content tag, and the multi-dimensional content tag is used for indicating the multi-dimensional content category of the image to be analyzed;
The image processing model is obtained, based on sample image features, sample detection frame features, sample image-text features and instruction text features corresponding to sample images and sample instruction texts, together with sample labels, by training a feature fusion network of an initial image processing model for cross-modal feature fusion and feature space alignment between the visual modality and the text modality in combination with instruction fine-tuning, and by training a visual language generation network of the initial image processing model for analysis content generation, wherein the visual language generation network is constructed based on a pre-trained large language model.
Another aspect provides an image processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire image features, detection frame features and image text of an image to be analyzed, wherein the image text at least comprises frame category text corresponding to the detection frame features, and the frame category text is used for indicating the content category of the image area corresponding to the detection frame features in the image to be analyzed;
a feature embedding module, configured to perform feature embedding on the image text to obtain image-text features;
an image analysis module, configured to perform image analysis on the image features, the detection frame features and the image-text features based on an image processing model to obtain an image analysis result, wherein the image analysis result comprises multi-dimensional content labels used for indicating the multi-dimensional content categories of the image to be analyzed;
wherein the image processing model is obtained, based on sample image features, sample detection frame features, sample image-text features and instruction text features corresponding to sample images and sample instruction texts, together with sample labels, by training a feature fusion network of an initial image processing model for cross-modal feature fusion and feature space alignment between the visual modality and the text modality in combination with instruction fine-tuning, and by training a visual language generation network of the initial image processing model for analysis content generation, wherein the visual language generation network is constructed based on a pre-trained large language model.
In another aspect there is provided a computer device comprising a processor and a memory having stored therein at least one instruction or at least one program loaded and executed by the processor to implement an image processing method as described above.
Another aspect provides a computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement an image processing method as described above.
In another aspect, a server is provided, the server including a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement an image processing method as described above.
Another aspect provides a terminal comprising a processor and a memory having stored therein at least one instruction or at least one program loaded and executed by the processor to implement an image processing method as described above.
Another aspect provides a computer program product or computer program comprising computer instructions which, when executed by a processor, implement an image processing method as described above.
The image processing method, the device, the equipment, the storage medium, the server, the terminal, the computer program and the computer program product provided by the application have the following technical effects:
According to the technical scheme of the application, image features, detection frame features and image text of an image to be analyzed are acquired, where the image text at least comprises frame category text corresponding to the detection frame features and the frame category text indicates the content category of the image area corresponding to the detection frame features in the image to be analyzed. Feature embedding is performed on the image text to obtain image-text features. Image analysis is then performed on the image features, the detection frame features and the image-text features based on the image processing model to obtain an image analysis result, where the image analysis result comprises multi-dimensional content labels indicating the multi-dimensional content categories of the image to be analyzed. The image processing model is obtained, based on sample image features, sample detection frame features, sample image-text features and instruction text features corresponding to sample images and sample instruction texts, together with sample labels, by training the feature fusion network of an initial image processing model for cross-modal feature fusion and feature space alignment between the visual and text modalities in combination with instruction fine-tuning, and by training the visual language generation network of the initial image processing model for analysis content generation, the visual language generation network being constructed based on a pre-trained large language model. In this way, the detection frame features and the corresponding frame category text are provided in addition to the image feature input, so that the image processing model can learn fine-grained region information in the image content; this enlarges the range of image content labels that can be described and depicted and yields content label results with better coverage and accuracy. The introduction of fine-grained auxiliary information such as the detection frame features and frame category text addresses the bottleneck that pre-trained large multi-modal language models are insensitive to fine-grained information such as position, quantity and small objects in image processing tasks, and improves the understanding of image objects and topics. Moreover, the scheme fully utilizes the knowledge and logical reasoning capability of the large language model, realizes the alignment of image and text features, and provides fine-grained understanding support of regional images by adding fine-grained target region detection results, so that fine-grained content understanding covering more dimensions and the generation of multi-dimensional content labels can be supported, thereby improving the modeling efficiency, generalization and practicability of image processing while significantly reducing processing cost.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 4 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 5 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 6 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an image processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a frame of an image processing apparatus according to an embodiment of the present application;
FIG. 9 is a block diagram of a hardware structure of an electronic device for performing an image processing method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or sub-modules is not necessarily limited to those steps or sub-modules expressly listed, but may include other steps or sub-modules that are not expressly listed or that are inherent to such process, method, article, or apparatus.
Before describing the embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application are described; these terms and terminology are applicable to the following explanations.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, and mechatronics. The pre-training model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, detection and measurement on targets, and further performs graphic processing so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multi-dimensional data. Large-model technology has brought an important transformation to the development of computer vision: pre-trained models in the vision field such as Swin Transformer, ViT, V-MoE and MAE can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing involves natural language, i.e., the language people use daily, and is closely related to linguistic research as well as to computer science and mathematics. An important technique for model training in the artificial intelligence domain, the pre-training model, was developed from large language models (Large Language Model) in the NLP field. Through fine-tuning, a large language model can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
The pre-training model (PTM), also called a foundation model or large model, refers to a deep neural network (DNN) with a large number of parameters that is trained on massive unlabeled data. The function approximation capability of the large-parameter DNN enables the PTM to extract common features from the data, and the model is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT) and prompt-tuning. Therefore, the pre-training model can achieve good results in few-shot or zero-shot scenarios. PTMs can be classified according to the data modality they process into language models (ELMo, BERT, GPT), visual models (Swin Transformer, ViT, V-MoE), speech models (VALL-E), multi-modal models (ViLBERT, CLIP, Flamingo, Gato), etc., where a multi-modal model refers to a model that builds a representation of the features of two or more data modalities. The pre-trained model is an important tool for producing Artificial Intelligence Generated Content (AIGC) and can also serve as a general interface for connecting multiple specific task models.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. The pre-training model is the latest development of deep learning and integrates these techniques.
Short video: a short video is an internet content transmission mode, referring to video content with a duration of less than 5 minutes that is transmitted on new internet media. With the popularization of mobile terminals and the speed-up of networks, short, fast, high-traffic content has gradually become widely spread.
LLM: a large language model (Large Language Model, LLM) refers to a computer model capable of processing and generating natural language. It represents a significant advance in the field of artificial intelligence and is expected to change the field through the knowledge it learns. An LLM can predict the next word or sentence by learning the statistical rules and semantic information of language data, and as the input dataset and parameter space continue to expand, the capability of the LLM improves correspondingly. It is used in many application fields such as robotics, machine learning, machine translation, speech recognition and image processing; when such a model also handles modalities such as images, it is called a multimodal large language model (MLLM).
Instruction Tuning: instruction fine-tuning generates an instruction for each task, fine-tunes the model on a number of full-shot tasks, and then evaluates its generalization capability (zero-shot) on specific unseen tasks. It is usually performed by unfreezing the pre-trained model parameters and fine-tuning on a large number of public NLP task datasets, so as to stimulate the understanding capability of the language model: by giving more explicit instructions, the model is made to understand the task and give correct feedback.
Prompt tuning: prompt learning is a class of learning methods in machine learning in which, without significantly changing the structure and parameters of the pre-trained language model, the effect of the model is greatly improved by adding prompt information to the input as an enhancement. The prompt can be regarded as an instruction for the task and is also a reuse of the large model's pre-training objective.
RLHF: Reinforcement Learning with Human Feedback (RLHF) is an extension of reinforcement learning (RL) that incorporates human feedback into the training process, providing a natural, humanized interactive learning process for the machine. In addition to the reward signal, an RLHF agent obtains feedback from humans and can learn with a wider view and higher efficiency, similar to the way humans learn from another person's expertise. By building a bridge between the agent and humans, RLHF allows humans to guide the machine and allows the machine to master decision elements embedded in human experience. As an efficient alignment technique, RLHF can to some extent help mitigate harmful content generated by large language models (LLMs) and improve information integrity.
In recent years, with the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, digital twins, virtual humans, robots, artificial intelligence generated content (AIGC), conversational interaction, smart medical care, smart customer service and game AI. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
Referring to fig. 1, fig. 1 is a schematic diagram of an application environment provided in an embodiment of the present application, and as shown in fig. 1, the application environment may at least include a terminal 01 and a server 02. In practical applications, the terminal 01 and the server 02 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The server 02 in the embodiment of the present application may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), and basic cloud computing services such as big data and artificial intelligence platforms.
Specifically, cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network so as to realize the computation, storage, processing and sharing of data. Cloud technology can be applied in many fields, such as medical cloud, cloud internet of things, cloud security, cloud education, cloud conferencing, artificial intelligence cloud services, cloud applications, cloud calling and cloud social networking, and it is an application based on the cloud computing business model: cloud computing distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services on demand. The network that provides the resources is called the "cloud"; to the user, the resources in the "cloud" appear infinitely expandable and can be obtained, used, expanded and paid for on demand. As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short, generally referred to as an IaaS (Infrastructure as a Service) platform) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to select and use. The cloud computing resource pool mainly includes: computing devices (virtualized machines, including operating systems), storage devices, and network devices.
Specifically, the server 02 may include an entity device, may include a network communication sub-module, a processor, a memory, and the like, may include software running in the entity device, and may include an application program and the like.
Specifically, the terminal 01 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, an intelligent voice interaction device, an intelligent home appliance, an intelligent wearable device, a vehicle-mounted terminal device, and other types of entity devices, and may also include software running in the entity devices, such as an application program, and the like.
In the embodiment of the application, the terminal 01 may be used for receiving uploaded image data, such as videos and pictures, and sending the image data to the server 02 for image processing such as content understanding. The server 02 preprocesses the image data to obtain an image to be analyzed, which may be a key content frame of a video, and acquires the image features, detection frame features and image text of the image to be analyzed, where the image text at least comprises the frame category text corresponding to the detection frame features and the frame category text indicates the content category of the image area corresponding to the detection frame features in the image to be analyzed. Image analysis based on the image features, the detection frame features and the image text is then performed by an image processing model to obtain an image analysis result.
Further, it should be understood that fig. 1 illustrates only an application environment of an image processing method, and the application environment may include more or fewer nodes, and the present application is not limited herein.
Referring to fig. 3, the present application further provides an image processing system, which includes a multi-modal feature extraction module, a feature fusion network and a visual language generation network. The multi-modal feature extraction module takes the image to be analyzed and the image text as input and outputs the image features, the detection frame features and the text features; the feature fusion network generates fusion features based on the image features, the detection frame features and the text features; and the visual language generation network performs image analysis based on the fusion features to obtain the image analysis result.
The application environment, or the terminal 01 and the server 02 in the application environment, may be a distributed system formed by a client and a plurality of nodes (computing devices of any form in an access network, such as servers and user terminals) connected through network communication. The distributed system may be a blockchain system, which may provide the image processing services, model training services and related data storage services described above.
In the era of rapid internet development, as the threshold of content production decreases, the upload volume of image data such as video grows at an exponential rate. These image data originate from various content creation institutions, and the daily upload peak of a single source has already exceeded the million level. The current content distribution flow of image data such as short videos (from the start of uploading until the content successfully enters user consumption) includes: the video is shot with a terminal shooting tool and then uploaded through the terminal or the B side; during uploading the video file is normalized by re-transcoding, the meta-information of the video is stored, and the playback compatibility of the video on each platform is improved. The video is then manually audited; while it is audited manually, a machine can obtain auxiliary features such as the classification and labels of the content through algorithms. Finally, a recommendation engine performs content recommendation based on object requirements through recommendation algorithms such as collaborative recommendation, matrix factorization, the supervised Logistic Regression model, deep-learning-based models, Factorization Machines and GBDT (Gradient Boosting Decision Tree). The video is then clicked and consumed, and object information is accumulated through the interaction between users and the content and deposited onto the classification and label information corresponding to the image content.
In addition, in a recommendation system, due to users' personalized requirements, the system needs to accumulate object interest models, keep the complete context and have complete semantic granularity, so that it can well characterize users' interests and interest trends. Conventional content understanding performs standardized processing of manually labeled simple tag information; this cannot well meet advanced personalized recommendation requirements, the cost of manually labeling dedicated tags is very high, and in particular subdivided multi-granularity classification cannot be supported. Alternatively, a model is built separately for each scene, or a large model comprising a multi-branch network is built to predict model results separately for each scene, but labeling a large amount of fine-grained sample data separately for each scene is also costly. Meanwhile, current computer vision models are usually trained to predict and identify a limited set of object types; this strictly supervised training mode limits the generalization and practicability of the models and in particular makes it difficult to adapt to the understanding and identification of fine-grained objects and their interrelationships in images. Such models usually also need additional labeled data to handle visual concepts not seen in training; many visual tasks are difficult to express in text, and performance on some fine-grained classifications is poor. In addition, because the tag words in different scenes come from natural-language descriptions, they are multi-angle, all-round, numerous in taxonomy and different in granularity, and the existing schemes cannot realize diversified understanding and generation of tags.
In summary, existing image content understanding mainly uses entity class labels and cannot well describe object interest points and content label characteristics in terms of generalization, richness and cost; meanwhile, its modeling efficiency, cost and scalability cannot meet the requirements of service and scene diversification.
In order to solve at least one of the above problems, the technical solution of the present application is described below based on the above application environment. The embodiments of the present application may be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, assisted driving and the like. Referring to fig. 2, fig. 2 is a schematic flow chart of an image processing method according to an embodiment of the present application. The present specification provides the method operation steps according to the embodiment or the flowchart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible orders of execution and does not represent the only order of execution. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be executed sequentially or in parallel (for example, in a parallel processor or multithreaded environment). Specifically, as shown in fig. 2, the method may include the following steps S201 to S205:
S201: Acquiring image features, detection frame features and image text of an image to be analyzed.
Specifically, the image to be analyzed may be a picture whose content is to be analyzed, for example a video cover image or a key video frame of a video. The image features can be generated by a visual encoder of the multi-modal feature extraction module performing image feature extraction on the image to be analyzed. The detection frame features can be generated based on the detection frame information of the target detection frames of the image to be analyzed; the detection frame information can be obtained by performing target detection on the image to be analyzed and indicates the position information of a target detection frame in the image. Specifically, numbers in natural language can be used to represent object positions, for example [x_min, y_min, x_max, y_max] to represent a bounding box and [x_center, y_center] to represent the center point of the region where the object is located, and the coordinate information can be position information normalized by the size of the image. In this way the model can learn fine-grained position and quantity information of the entity objects in the image data content, the range of image content labels that can be depicted and described is enlarged, and label results with better coverage and accuracy are obtained. By introducing such fine-grained auxiliary information, the bottleneck that pre-trained multi-modal large language models are insensitive to fine-grained information such as position, quantity and small objects in downstream tasks can be addressed, and the understanding of picture figures and subjects becomes more accurate and profound.
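As an illustration of this coordinate representation, the following sketch normalizes a detection box by the image size and renders its bounding box and center point as natural-language number strings in the [x_min, y_min, x_max, y_max] and [x_center, y_center] formats described above; the helper name and the rounding precision are assumptions for illustration, not part of the patented scheme.

```python
# Minimal sketch of serializing a detection box as normalized natural-language
# coordinates. Function name and precision are assumptions.

def serialize_box(box_xyxy, image_w, image_h, precision=3):
    """box_xyxy: (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box_xyxy
    # Normalize by the image size so coordinates are resolution independent.
    nx_min, nx_max = x_min / image_w, x_max / image_w
    ny_min, ny_max = y_min / image_h, y_max / image_h
    # Center point of the region where the object is located.
    x_c, y_c = (nx_min + nx_max) / 2, (ny_min + ny_max) / 2
    fmt = lambda v: f"{v:.{precision}f}"
    bbox_text = f"[{fmt(nx_min)}, {fmt(ny_min)}, {fmt(nx_max)}, {fmt(ny_max)}]"
    center_text = f"[{fmt(x_c)}, {fmt(y_c)}]"
    return bbox_text, center_text

# Example: a 1280x720 frame with a dog detected at pixels (320, 180)-(960, 660).
bbox_text, center_text = serialize_box((320, 180, 960, 660), 1280, 720)
print(bbox_text)    # [0.250, 0.250, 0.750, 0.917]
print(center_text)  # [0.500, 0.583]
```

Such serialized coordinates can then be treated as ordinary text tokens alongside the frame category text.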
Specifically, the image text at least comprises the frame category text corresponding to the detection frame features. The frame category text indicates the content category of the image area corresponding to the detection frame features in the image to be analyzed. It can be understood that a detection frame frames an image area containing an entity object (such as a foreground object or a background region) in the image, and the detection frame features comprise the pixel information, semantic information and the like of the corresponding image area. The category of the entity object corresponding to the detection frame can be obtained through image detection; the category can be a major category or a subdivided category. The major category may indicate whether the detection frame is foreground or background, or the major content category of the corresponding entity object, such as "dog", while the subdivided category may be a subdivided class subordinate to that major content category, such as "Shiba Inu". The frame category text may be generated based on the frame category information of the target detection frame. Illustratively, if the image area framed by a target detection frame contains a dog and the frame category information indicates the categories "foreground", "dog" and the subdivided category "Shiba Inu", the corresponding frame category text is "foreground", "dog" and "Shiba Inu". In some embodiments, depending on the actual data carried by the image to be analyzed, the image text may also include additional text of other categories, which may include but is not limited to recognized text in the image to be analyzed (e.g., text recognized based on OCR), title text (e.g., a picture title or a video title), topic label (hashtag) text, and the like. It can be appreciated that, in addition to the frame category text, the image to be analyzed may or may not include one or more of the above other categories of text, as determined by the data actually carried by the image.
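The composition of the image text can be pictured with the following sketch, which joins the frame category texts with whatever additional text the image actually carries (OCR text, a title, hashtag text); the field names and the separator are illustrative assumptions.

```python
# Sketch of assembling the image text from frame category texts plus optional
# additional text (OCR text, title, topic labels). All field names are assumed.

def build_image_text(box_categories, ocr_text=None, title=None, hashtags=None):
    """box_categories: list of category strings per detection box,
    e.g. ["foreground", "dog", "Shiba Inu"]."""
    parts = list(box_categories)          # frame category text is always present
    if ocr_text:
        parts.append(ocr_text)            # text recognized inside the image
    if title:
        parts.append(title)               # picture or video title
    if hashtags:
        parts.extend(hashtags)            # topic label (hashtag) text
    return " ; ".join(parts)

image_text = build_image_text(
    ["foreground", "dog", "Shiba Inu"],
    title="My dog's morning walk",
    hashtags=["#shiba", "#pets"],
)
print(image_text)
# foreground ; dog ; Shiba Inu ; My dog's morning walk ; #shiba ; #pets
```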
In some embodiments, S201 may include S301-S305:
S301: Performing target detection on the image to be analyzed to obtain image features, detection frame information of a target detection frame, and frame category information, wherein the frame category information is used for indicating identification information of the frame category text;
S303: Performing feature representation on the detection frame information to obtain the detection frame features;
S305: Generating the frame category text based on the frame category information.
Specifically, a target detection model may be used to perform target detection on the image to be analyzed to obtain a target detection result, where the target detection result includes the detection frame information and the frame category information. The image area marked by a target detection frame may contain an entity object such as a face, an animal, or an object such as a car, and the detection frame information characterizes the region position of the target detection frame in the image. The frame category information may be coding information or identification information of the corresponding entity object category, and the corresponding frame category text is determined based on the correspondence between the frame category information and category texts. It can be appreciated that more than one target detection model may be used to perform target detection on the image to be analyzed; for example, face detection and other types of target detection may be performed by a face detection model and an object detection model respectively, so as to improve the comprehensiveness of target detection.
Specifically, the detection frame features may be formed by directly splicing the position sub-information of the detection frame information (such as the bounding-box coordinates and the region center coordinates), or may be obtained by encoding the detection frame information with a visual encoder.
In some embodiments, the image features extracted by the target detection model during target detection may be used as the input of the image processing model. Correspondingly, referring to fig. 3, the visual encoder is constructed with the target detection model: it takes the image to be analyzed as input and directly outputs the image features and the target detection result, and feature representation is then performed on the detection frame information to obtain the detection frame features. The image features include the regional image features of each image area corresponding to each target detection frame, and may also include the overall image features of the image to be analyzed. In this way, a visual encoder is adopted for target detection and image feature extraction, fine-grained visual and text information is introduced through the processing of visual information, and the detection frame information of the main targets (such as objects, faces and background in a picture) is serialized and then fed into the image processing model, which enlarges the range of image content labels that can be depicted and described and yields label results with better coverage and accuracy.
In other embodiments, the image to be analyzed and the target detection result are provided to a visual encoder of the image processing system for image feature extraction and feature representation of the detection frame information, so as to generate the image features and detection frame features that are input into the image processing model.
It will be appreciated that the image features and the detection frame features are stitched together and then input into the image processing model.
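A minimal sketch of this splicing step is shown below, assuming per-region image features from the visual encoder and a simple linear feature representation of the detection frame information; the tensor shapes and the projection are assumptions for illustration only.

```python
# Sketch of splicing image features and detection frame features before they
# enter the image processing model. Shapes and the projection are assumptions.
import torch
import torch.nn as nn

d_model = 768
num_regions = 5

# Region image features from the visual encoder: (num_regions, d_model).
region_feats = torch.randn(num_regions, d_model)

# Detection frame information per region: [x_min, y_min, x_max, y_max, x_c, y_c].
box_info = torch.rand(num_regions, 6)
box_proj = nn.Linear(6, d_model)          # feature representation of box info
box_feats = box_proj(box_info)

# Concatenate along the sequence dimension: (2 * num_regions, d_model).
visual_sequence = torch.cat([region_feats, box_feats], dim=0)
print(visual_sequence.shape)  # torch.Size([10, 768])
```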
S203: Performing feature embedding on the image text to obtain image-text features.
Specifically, referring to fig. 3, the image text is text-encoded through the text embedding network to obtain its vectorized representation, thereby obtaining the image-text features; for example, the frame category text is feature-represented to obtain frame category text features. As described above, the image text may further include additional text; correspondingly, the text embedding network performs feature embedding on each additional text to obtain the corresponding additional text features, and the frame category text features and the additional text features are spliced to obtain the image-text features as input of the image processing model. The additional text includes at least one of text carried in the image content of the image to be processed and accompanying descriptive text of the image to be processed; it may include, but is not limited to, recognized text in the image to be processed (e.g., text recognized based on OCR) and additional descriptive words or sentences carried by the image to be processed, including but not limited to title text (e.g., a picture title or a video title), topic label (hashtag) text, and the like.
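For illustration, the following sketch embeds the frame category text and the additional text with a toy vocabulary and splices the results into one image-text feature sequence; the vocabulary, tokenization and embedding size are assumptions rather than the text embedding network actually used.

```python
# Sketch of feature-embedding the image text (frame category text plus any
# additional text) and splicing the results into image-text features.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "foreground": 1, "dog": 2, "shiba": 3, "inu": 4, "walk": 5}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

def embed_text(text):
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return embed(torch.tensor(ids))          # (num_tokens, 768)

box_category_feats = embed_text("foreground dog shiba inu")
additional_feats = embed_text("my dog's morning walk")

# Spliced image-text features used as model input: (num_tokens_total, 768).
image_text_feats = torch.cat([box_category_feats, additional_feats], dim=0)
print(image_text_feats.shape)  # torch.Size([8, 768])
```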
S205: Performing image analysis on the image features, the detection frame features and the image-text features based on the image processing model to obtain an image analysis result.
Specifically, the image analysis result can indicate fine-grained content information of the image to be analyzed. It may specifically comprise multi-dimensional content labels of the image to be analyzed, which indicate the multi-dimensional content categories of the image to be analyzed. The multi-dimensional content categories may include the entity object category in the image to be analyzed, the image emotion category of the image to be analyzed, the image content category or other reference categories of the image to be analyzed, and the like. For example, the image analysis result of an image of a Shiba Inu may be "lovely pet/happy mood/Shiba Inu/lovely pet dog"; the image analysis result of an image of two high-speed trains may be "high-speed rail/train/two trains racing/speed model" or the like.
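Because the analysis result is generated as text, the multi-dimensional content labels can be recovered by splitting the generated string on the separator used in the examples above; this small parsing step is an illustrative assumption, not a required part of the method.

```python
# Sketch of turning a generated analysis string into a list of content labels.
# The "/" separator follows the examples above.

result_text = "lovely pet/happy mood/Shiba Inu/lovely pet dog"
labels = [label.strip() for label in result_text.split("/") if label.strip()]
print(labels)  # ['lovely pet', 'happy mood', 'Shiba Inu', 'lovely pet dog']
```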
In some embodiments, referring to fig. 3, the image processing model may include a feature fusion network and a visual language generation network. The image processing model is obtained, based on the sample image features, sample detection frame features, sample image-text features and instruction text features corresponding to sample images and sample instruction texts, together with the sample labels, by training the feature fusion network of an initial image processing model for cross-modal feature fusion and feature space alignment between the visual modality and the text modality in combination with instruction fine-tuning, and by training the visual language generation network of the initial image processing model for analysis content generation.
Specifically, the sample images are similar to the image to be analyzed; they may come from image data actually received in an image service scene or from images in an available image-text pair dataset. It can be appreciated that the sample image features, sample detection frame features and sample image-text features are similar to those described above. The sample label may describe the ground-truth fine-grained content information of the sample image and may include ground-truth multi-dimensional content labels, such as the text label in an image-text pair of an existing dataset, or a manually corrected label comprising a plurality of fine-grained category labels; for example, the sample label of a Shiba Inu pet image may be "lovely pet/happy mood/Shiba Inu/lovely pet dog/cute and adorable".
The sample instruction text is a set of instructions (<Instruction>) configured for the business scene to guide the image processing model in understanding the image content. It may be randomly selected from predefined templates, for example templates that ask the model to distinguish the quantity, color, action, category, orientation and relation of specific objects (objects, faces and the like) in the visual information, or to describe the picture in detail; template styles include "Count objects in this image in detail" and "Could you describe the contents of this image in detail". A text embedding network is adopted to perform feature encoding on the sample instruction text to obtain the instruction text features in vector representation. It can be understood that one group of sample inputs may include more than one sample instruction text, for example both of the above templates may serve as the instructions for a sample.
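The following sketch shows one plausible way to assemble such an instruction-tuning sample from a predefined template set; the template wording and field names are assumptions.

```python
# Sketch of assembling one instruction-tuning sample from a predefined template
# set, as described above. Template wording and field names are assumptions.
import random

INSTRUCTION_TEMPLATES = [
    "Count objects in this image in detail.",
    "Could you describe the contents of this image in detail?",
]

def build_sample(sample_image_text, sample_label, num_instructions=2):
    instructions = random.sample(
        INSTRUCTION_TEMPLATES,
        k=min(num_instructions, len(INSTRUCTION_TEMPLATES)),
    )
    return {
        "image_text": sample_image_text,   # frame category text + additional text
        "instructions": instructions,      # <Instruction> set for this sample
        "label": sample_label,             # ground-truth multi-dimensional labels
    }

sample = build_sample(
    "foreground ; dog ; Shiba Inu",
    "lovely pet/happy mood/Shiba Inu/lovely pet dog",
)
print(sample["instructions"])
```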
Specifically, the visual language generation network is constructed based on a pre-trained large language model. The pre-trained large language model (LLM) adopted by the visual language generation network serves as the content understanding basis of the image processing model, and a base model with partial capability and alignment is obtained through instruction fine-tuning. The visual language generation network may comprise at least one frozen large language model, for example several large language models connected together. Building the visual language generation network on a pre-trained large language model based on the Transformer architecture can greatly improve the parallelism of the model without losing, or even while improving, the effect.
In summary, the image processing model learns fine-grained information such as entity positions and quantities in the image content, the range of image content labels that can be described is enlarged, and content label results with better coverage and accuracy are obtained. Introducing fine-grained auxiliary information such as the detection frame features and the frame category text addresses the bottleneck that pre-trained large multi-modal language models are insensitive to fine-grained information such as position, quantity and small objects in image processing tasks, and improves the understanding of image objects and subjects. In addition, the scheme fully utilizes the knowledge and logical reasoning capability of the large language model, realizes the alignment of image and text features, and provides fine-grained understanding support by adding fine-grained target region detection results, so that the generation of fine-grained content labels covering more advanced concepts and combinations can be supported, thereby improving the modeling efficiency, generalization and practicability of image processing while significantly reducing processing cost.
In some embodiments, referring to FIG. 4, S205 includes S401-S402:
S401: Inputting the image features, the detection frame features and the image-text features into the feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain fusion features;
S402: Inputting the fusion features into the visual language generation network of the image processing model to perform image content analysis, so as to obtain an image analysis result in the text modality.
Specifically, the fusion features are obtained by performing feature cross-extraction on the image features, the detection frame features and the image-text features. Extracting the input features through the feature fusion network cross-fuses the visual content and the text content, so that the model learns the relevance between the visual content and the description content, which further improves the content analysis accuracy of the visual language generation and the fine-grained accuracy of the analysis results, such as the understanding of regional content.
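Conceptually, S401-S402 can be sketched as the following data flow, in which the fusion network, the mapping layer and the language model are stand-ins whose interfaces are assumed for illustration and which do not reproduce the patented architecture.

```python
# Conceptual sketch of S401-S402: fuse the three feature groups, then let the
# visual language generation network decode the analysis text. All interfaces
# below are assumptions used only to show the data flow.
import torch
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    """Stand-in: fusion network -> fully connected mapping -> language model."""
    def __init__(self, fusion_network, projection, language_model):
        super().__init__()
        self.fusion_network = fusion_network    # feature fusion network
        self.projection = projection            # fully connected mapping layer
        self.language_model = language_model    # wrapper around a frozen LLM

    def forward(self, image_feats, box_feats, image_text_feats):
        # S401: feature extraction and cross-modal feature fusion.
        fused = self.fusion_network(
            torch.cat([image_feats, box_feats, image_text_feats], dim=0))
        prefix = self.projection(fused)         # map into the LLM embedding space
        # S402: generate the text-modality image analysis result.
        return self.language_model(prefix)

# Toy stand-ins just to make the sketch runnable (not real components).
d = 768
model = ImageProcessingModel(
    fusion_network=nn.Linear(d, d),
    projection=nn.Linear(d, d),
    language_model=lambda prefix: f"labels decoded from {tuple(prefix.shape)} prefix",
)
print(model(torch.randn(10, d), torch.randn(10, d), torch.randn(8, d)))
```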
In some embodiments, referring to fig. 3, the feature fusion network includes a first coding module, a second coding module and a fully connected layer; the first coding module and the second coding module share a first attention sub-module, and the second coding module further includes a second attention sub-module based on a cross-layer attention mechanism. The first coding module receives the output of the visual encoder and the output of the text embedding network and performs feature fusion to output fusion features; the second coding module receives the output of the text embedding network and performs feature extraction to obtain text extraction features. The first attention sub-module performs the first fusion of the image-text features with the image features and detection frame features of the visual modality, and the second attention sub-module fuses the result of the first fusion with the image features and detection frame features of the visual modality a second time to obtain the fusion features. The fully connected layer performs fusion extraction on the fusion features and the text extraction features to serve as the input of the visual language generation network. Accordingly, referring to fig. 5, S401 includes S4011-S4013:
S4011: Carrying out feature fusion corresponding to the first coding module on the image features, the detection frame features and the image-text features based on the first attention sub-module, and carrying out feature extraction corresponding to the second coding module on the image-text features, to obtain initial fusion features and text extraction features.
Specifically, the first attention sub-module may be a multi-head self-attention network, which performs feature extraction on the image features, the detection frame features and the image-text features based on a multi-head self-attention mechanism to obtain the initial fusion features. The vector representations of the frame category text and the additional text within the image-text features are cross-extracted to obtain the text extraction features. Self-attention is applied to the input feature sequence by the first attention sub-module so as to simultaneously mine the correlations between each item and all other items in the sequence and mine information from different vector subspaces. In addition, the first coding module and the second coding module also include a feed-forward network: a feed-forward network layer is added after the attention layer, which endows the model with nonlinear expression capability and can mine the interaction relationships among different dimensions. It can be understood that the first attention sub-module may comprise a plurality of Transformer (encoder) layers, one Transformer layer consisting of a multi-head self-attention layer and a position-wise feed-forward network layer, where the multi-head self-attention layer and the FFN use a residual connection at the output and are layer normalized. Stacking multiple encoder layers can learn more complex and higher-order interaction information and improves the comprehensiveness and accuracy of information extraction.
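A minimal sketch of one such encoder layer, assuming common hyper-parameter choices, is given below; it shows multi-head self-attention followed by a position-wise feed-forward network, each with a residual connection and layer normalization.

```python
# Sketch of one encoder layer of the first attention sub-module: multi-head
# self-attention plus a position-wise feed-forward network, each with a residual
# connection and layer normalization. Hyper-parameters are assumptions.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # mine correlations between items
        x = self.norm1(x + attn_out)            # residual + layer norm
        x = self.norm2(x + self.ffn(x))         # feed-forward, residual + norm
        return x

# Input: spliced sequence of image, detection frame and image-text features.
x = torch.randn(1, 28, 768)                     # (batch, sequence, d_model)
layer = EncoderLayer()
print(layer(x).shape)                           # torch.Size([1, 28, 768])
```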
S4012: Based on the second attention sub-module, performing cross-modal feature fusion on the image features, the detection frame features and the initial fusion features to obtain intermediate fusion features.
Specifically, the second attention sub-module can be constructed based on a cross-attention network, and the visual-modality features and the text-modality features are recombined based on a cross-attention mechanism. The cross-attention mechanism extends the self-attention mechanism: whereas self-attention mainly captures the relevance of different positions within one input sequence, cross-attention takes the spliced sequence of image features and detection frame features as an additional input sequence, so as to fuse the information of these two different sources (the additional input sequence and the fused features) and achieve more accurate modeling and feature characterization. The corresponding expression is of the form $A_n = \mathrm{softmax}\!\left(Q_n K_n^{\top}/\sqrt{d} + A_{n-1}\right)$, with the layer output obtained by weighting the value features as $A_n V_n$, where $Q_n$ is the query feature of the $n$-th attention layer, $K_n$ and $V_n$ are the key-value features of the $n$-th attention layer, $A_n$ is the weight distribution output by the $n$-th attention layer, $A_{n-1}$ is the weight distribution output by the previous attention layer, and $n$ is the index of the attention layer; the greater $n$ is, the farther the layer is from the input layer.
Accordingly, in some embodiments, S4012 comprises: inputting the image features and the initial fusion features into a second attention sub-module, taking the image features and the detection frame features as query features, taking the initial fusion features as key value features, and performing cross-attention feature representation to obtain intermediate fusion features. Therefore, the method realizes the acquisition and mining of the association information between the visual mode characteristics and the visual/text fusion mode characteristics with different sources, and improves the modeling and characteristic characterization accuracy.
The second attention sub-module can also adopt a multi-layer Transformer structure. It can be understood that, from the perspective of the whole model structure, such a multi-layer structure optimizes information interaction and fusion; however, the closer a layer is to the output, the more difficult it is for it to obtain shallow historical semantic information because the distance from the input is too far, which causes a knowledge-forgetting problem and affects the effect of the final model. Accordingly, in the cross-attention feature representation process, the historical weight distribution corresponding to the previous attention layer of the second attention sub-module is subjected to weight attenuation and then used as an input of the next attention layer. That is, the historical weight distribution of the previous attention layer is multiplied by an attenuation coefficient before being input into the next attention layer, where the attenuation coefficient is a number greater than 0 and less than 1, so that the historical information of each layer carries a different weight and the weight becomes smaller as the distance grows. Therefore, by adopting a cross-layer attention connection mechanism with attenuation, cross-layer semantic information can be better captured, so that attention layers farther from the input can still obtain enough shallow semantic information, the knowledge-forgetting problem is alleviated, and the model effect is improved.
In one embodiment, the expression of the second attention sub-module used in the present application is $A_n=\mathrm{softmax}\!\left(\frac{Q_n K_n^{\top}}{\sqrt{d_k}}\right)V_n + a\cdot A_{n-1}$, where a is the attenuation coefficient, a hyper-parameter that may be set empirically (for example, 0.5) and indicates the degree to which the result of the next layer is affected by the result of the previous layer.
Specifically, the calculation process of the cross-attention mechanism adopted by the second attention sub-module comprises the following steps: mapping the query features and the key features to different spaces; obtaining a relevance distribution by calculating the similarity between the query features and the key features; multiplying the relevance distribution with the value features and summing to obtain a cross-attention representation that fuses the information of the two different input sequences, thereby obtaining a fused representation; the fused representation may be normalized to obtain the final weight distribution. The intermediate fusion feature is finally obtained from the Transformer layer adjacent to the output layer.
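A minimal single-head sketch of the attenuated cross-layer attention described above is given here; per-layer projections and normalization are omitted for brevity, so this is an illustration under simplifying assumptions rather than the patented implementation.

import torch
import torch.nn.functional as F

def decayed_cross_attention(q, k, v, prev_out=None, a=0.5):
    # q: query features (image + detection-box tokens); k, v: initial fusion features
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity between queries and keys
    weights = F.softmax(scores, dim=-1)             # relevance distribution
    out = weights @ v                               # weighted sum over value features
    if prev_out is not None:                        # attenuated history of the previous layer
        out = out + a * prev_out
    return out

def second_attention(q, kv, num_layers=3, a=0.5):
    out = None
    for _ in range(num_layers):                     # multi-layer stack; history decays with depth
        out = decayed_cross_attention(q, kv, kv, prev_out=out, a=a)
    return out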
S4013: and inputting the intermediate fusion features and the text extraction features into a full-connection layer to perform feature mapping aiming at the visual language generation network, so as to obtain fusion features.
Specifically, the fully-connected layer may be a linear fully-connected layer, so as to serve as a connection with the visual language generating network, and the fully-connected layer maps the intermediate fusion feature and the text extraction feature to a feature space of the visual language generating network, so that the intermediate fusion feature and the text extraction feature form fusion features with the same feature dimension as the text embedding of the visual language generating network.
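As an illustration, assuming a fusion width of 768 and an LLM embedding width of 4096 (both hypothetical values), the mapping performed by the fully-connected layer can be sketched as follows.

import torch
import torch.nn as nn

d_fusion, d_llm = 768, 4096                  # assumed feature widths
proj = nn.Linear(d_fusion, d_llm)            # the linear fully-connected layer

intermediate = torch.randn(1, 32, d_fusion)  # intermediate fusion features
text_extract = torch.randn(1, 16, d_fusion)  # text extraction features
fusion = proj(torch.cat([intermediate, text_extract], dim=1))  # tokens in the LLM embedding space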
Based on the above part or all of the embodiments, in the embodiment of the present application, the method further includes a training method of an image processing model, referring to fig. 6, including S501-S507:
S501: an initial image processing model, a plurality of sample images, sample additional text and sample labels corresponding to the sample images, and sample instruction text are acquired.
Specifically, the visual language generation network in the initial image processing model adopts a pre-trained model, and the first encoding module and the second encoding module in the feature fusion network are also constructed from pre-trained models. The data sources of the sample images and sample tags can be image-text pairs of an open-source data set (such as video covers/key-frame contents with text descriptions), or image-text pairs from actual business scenes. The sample tag may include a content tag describing the fine-grained content of the sample image, such as "lovely pet/happy mood/Shiba Inu/cute pet dog/adorable". The sample additional text includes at least one of text carried in the image content of the sample image and accompanying descriptive text of the sample image: the former may include, but is not limited to, text recognized in the sample image (e.g., text recognized based on OCR), and the latter may be additional descriptive words or sentences carried by the image to be processed, including, but not limited to, the headline text of the sample image (e.g., a picture title or a video title), topic tag (Hashtag) text, and the like. The sample instruction text is used to provide instructional information for the image content understanding required by the initial image processing model when performing the image content analysis task. The sample instruction text indicates the questions/descriptions that the initial image processing model should follow during learning so as to excite the understanding capability of the large language model, and the explicit instructions given by the sample instruction text guide the model to output correct results for image content understanding and classification. A title/description instruction refers to the prompting instruction required when the sample label, or the image description content corresponding to the sample label, is taken as the answer; for example, the sample instruction text may include instruction information and its answer text, such as the instruction "Observe the picture: how many trains are in the picture? Options: A. 1  B. 2" with the answer text "2".
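For illustration only, one training sample could be organized as follows; the field names and values are hypothetical and are not the patent's data schema.

sample = {
    "image_path": "samples/0001.jpg",                       # sample image
    "additional_text": {                                    # sample additional text
        "ocr": "text recognized in the image",
        "title": "video title",
        "hashtags": ["#pets"],
    },
    "instruction": "Observe the picture: how many trains are in it? Options: A. 1  B. 2",
    "answer": "2",                                          # answer text of the instruction
    "label": "lovely pet/happy mood/Shiba Inu/cute dog",    # fine-grained content tag
}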
S503: and carrying out target detection on the sample image to obtain sample image characteristics, sample detection frame characteristics of the sample detection frame and sample frame category text.
It is to be understood that S503 is similar to S201 described above, and the same points are not repeated.
In some embodiments, the sample box category text may be text corresponding to sample box category information.
In other embodiments, the sample box category text may be newly added subdivision category information obtained after the sample box category information is determined and then manually expanded. For example, coarse-grained detection box element categories such as "bird", "orange" and "dog" are subdivided and expanded: "bird" into sub-categories such as "hawk", and "dog" into "pet dog", "police dog" or "shepherd dog". In this way, related samples are constructed through an additional target detection system with manual assistance, which improves sample construction efficiency and refines tag granularity, and the element category granularity is expanded to cover finer sub-categories, so that different content understanding and descriptions can be generated and service scenes such as video distribution are better adapted. Accordingly, the target detection phase may specifically include: encoding the input sample image to generate semantic regions as candidate semantic regions; performing label expansion through manual expansion based on the detected sample frame category information to obtain sample labels; and calculating the similarity between the candidate semantic regions and the sample labels to complete alignment.
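A minimal sketch of this similarity-based alignment between candidate semantic regions and the manually expanded sample labels, assuming both have already been embedded into a common feature space by their respective encoders:

import torch
import torch.nn.functional as F

def align_regions_to_labels(region_feats, label_feats):
    # region_feats: [R, d] candidate semantic regions; label_feats: [L, d] expanded label texts
    region_feats = F.normalize(region_feats, dim=-1)
    label_feats = F.normalize(label_feats, dim=-1)
    sim = region_feats @ label_feats.t()        # cosine-similarity matrix [R, L]
    best = sim.argmax(dim=-1)                   # label index assigned to each region
    return sim, best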
S505: and performing feature embedding on the sample additional text, the sample box type text and the sample instruction text to obtain sample graphic features and instruction text features.
Specifically, referring to fig. 3, a text embedding network is adopted to perform feature embedding on a sample additional text and a sample box type text to obtain sample image-text features, and feature embedding is performed on a sample instruction text to obtain instruction text features.
S507: based on sample image features, sample detection frame features, sample image-text features, instruction text features and sample labels, combining instruction fine tuning to perform cross-mode feature fusion and feature space alignment training of a visual mode and a text mode on a feature fusion network, and performing analysis content generation training on a visual language generation network to obtain an image processing model.
In this way, the capability of the multi-modal large language model is fully exploited: the knowledge and the logical reasoning capability of the large language model are fully utilized, alignment of image and text features is achieved by adding fine-grained target region detection, and the multi-modal large language model is fine-tuned with sample instruction text built on vision-text aligned data, which provides fine-grained understanding support for region images, enables content understanding in more dimensions and the generation of multi-dimensional content labels beyond basic entity labels, and improves the quality tagging effects obtainable by combining multi-modal understanding and reasoning. Moreover, the detection frame features of the main targets are serialized and then fed into the model, for example by directly using numbers in natural language to represent object positions, which helps the model learn fine-grained position and quantity information of objects in the image content, enlarges the range of image content labels that can be depicted and described, and yields label results with better coverage and accuracy. By introducing such fine-grained auxiliary information, the bottleneck of pre-trained multi-modal language models being insensitive to fine-grained information such as positions, quantities and small objects in downstream tasks can be resolved, and fine-grained content such as text in pictures and subjects can be understood more accurately.
In some embodiments, the above model training may be performed in two stages. In the first stage, the network parameters of the visual language generation network are fixed, and the sample image features, sample detection frame features, sample image-text features and instruction text features are input into the feature fusion network for feature extraction and cross-modal feature fusion to obtain sample fusion features; the sample fusion features are obtained in a similar manner to the fusion features, except that the generation of the corresponding sample intermediate fusion features includes the instruction text features as an input, and the generation of the sample text extraction features also includes the instruction text features as an input. Then, the sample fusion features and the instruction text features are input into the visual language generation network, the image content is analyzed based on the instruction text prompt, and a first sample analysis result is generated. A first loss is generated based on the first sample analysis result and the sample label to adjust the network parameters of the feature fusion network and obtain an updated feature fusion network, and the above steps are repeated to realize iterative training until the first training stage ends, thereby obtaining an updated image processing model. The training process of the second stage comprises: fixing the parameters of the updated feature fusion network of the image processing model, and adjusting the network parameters of the visual language generation network based on the sample input and the sample labels until the second training stage ends, so as to obtain the image processing model. In a preferred embodiment, in the second training stage the input sample data further includes a content description text constructed based on the sample image; correspondingly, the content description text, the sample additional text, the sample box category text and the sample instruction text are feature-embedded through the text embedding network to obtain sample image-text features including description text features, so that prompt learning of the pre-trained large language model of the visual language generation network is realized by using the content description text, i.e., the content description text serves as a soft prompt in the instruction fine-tuning stage, improving the model training effect and efficiency.
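A minimal two-stage training sketch matching the description above: stage 1 freezes the visual language generation network and trains the feature fusion network; stage 2 freezes the fusion network and fine-tunes the language network. The fusion_net, llm and loader objects, their call signatures and the batch keys are assumptions for illustration, not the patented training code.

import torch

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_stage(fusion_net, llm, loader, train_fusion=True, epochs=1, lr=1e-4):
    # stage 1: train_fusion=True (LLM frozen); stage 2: train_fusion=False (fusion net frozen)
    set_requires_grad(fusion_net, train_fusion)
    set_requires_grad(llm, not train_fusion)
    trainable = fusion_net if train_fusion else llm
    opt = torch.optim.AdamW(trainable.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            fused = fusion_net(batch["image_feats"], batch["box_feats"],
                               batch["text_feats"], batch["instruction_feats"])
            # llm is assumed to return a language-modeling loss against the sample labels
            loss = llm(fused, batch["instruction_feats"], batch["label_ids"])
            opt.zero_grad()
            loss.backward()
            opt.step()

# first training stage, then second training stage:
# train_stage(fusion_net, llm, loader, train_fusion=True)
# train_stage(fusion_net, llm, loader, train_fusion=False)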
In other embodiments, the content generation network may be employed for the auxiliary training of the image processing model, and accordingly, S507 may include S601-S609:
S601: a content generation network is acquired.
Specifically, the content generation network is constructed based on a pre-trained large language model, is a language expert model, and can decode and analyze the input features into text analysis results.
S603: inputting the sample image features, the sample detection frame features, the sample image-text features and the instruction text features into a feature fusion network of an initial image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain first sample fusion features.
It can be understood that the first sample fusion feature is similar to the aforementioned fusion feature acquisition manner, and the details are not repeated, and the difference is that the input of the second encoding module in the training process increases the instruction text feature, so that the initial fusion feature corresponding to the first sample fusion feature is generated based on the sample image feature, the sample detection frame feature, the sample image-text feature and the instruction text feature, and the text extraction feature corresponding to the first sample fusion feature is generated based on the sample image-text feature and the instruction text feature.
S605: inputting the first sample fusion characteristic and the instruction text characteristic into a content generation network to perform image content analysis, and obtaining a first sample analysis result and a sample content description text.
Specifically, the instruction for content understanding is indicated by the instruction text features, so that the content generation network generates a first sample analysis result aligned with the sample tag, and at least a sample content description text (Description) obtained by expanding the content of the input image text including the sample box category text; for example, the sample content description text may be "the picture shows a high-speed train running at high speed".
S607: and under the condition of fixing network parameters of the content generation network, adjusting the network parameters of the feature fusion network of the initial image processing model based on the first sample analysis result and the sample label until the first training ending condition is met, and obtaining the feature fusion network of the image processing model.
Specifically, a model loss is generated based on the difference between the first sample analysis result and the sample label, so that, with the network parameters of the content generation network fixed, the network parameters of the feature fusion network are updated to obtain an updated feature fusion network. The above steps are then repeated as the iterative process of the first training stage until the preset number of iterations is reached or the model loss is smaller than the preset loss, so that the final feature fusion network of the image processing model is obtained and the updated image processing model is obtained.
Referring to fig. 7, in the multi-modal content feature extraction stage, visual processing and text processing are performed on all sample images and the associated text, respectively, to obtain fine-grained visual and text information. In the first training stage, cross-modal fusion and fine-grained alignment learning are realized: the content generation network is connected with the visual encoder by constructing an interface between the feature fusion network for visual-text fusion and the content generation network based on pre-trained language models (LLMs), and the visual features extracted by the visual encoder are mapped to fixed-length features through the feature fusion network so as to connect the fusion features representing the visual content with the pre-trained LLM. The feature dimension of the fusion features can be converted through the fully-connected layer into the same dimension as the text embedding of the LLM to realize fused embedding, and the interface is trained so that the alignment of the visual modality and the language modality is completed, the problems of losing visual information and spatio-temporal complexity information are effectively avoided, and an efficient and learnable image data understanding system (such as a video understanding system) is obtained. The feature fusion network can be initialized with the pre-trained weights of a pre-trained Transformer base, so as to accelerate the convergence of the model.
It can be appreciated that the sequence coding representations (patch tokens) of the object image blocks in the image can be obtained through target detection; each formed region corresponds to a complete visual concept in the image and therefore strongly influences the judgment of the matching relationship between the sample image and the sample image text, and the object detection frame and the label text describing its region content can be obtained from the semantic features of the image-block region and the sample detection frame information (including position information) of the sample detection frame (BBox). In the training process, the regional visual features output by the visual encoder interact, in the cross-modal stage, with the fusion features covering the text content through a cross-attention mechanism to realize finer-grained alignment, and the contrastive objective can be guided by the maximum token-level similarity between the visual features (including the sample image features and the sample detection frame features) and the sample label, thereby realizing loss calculation and parameter adjustment in the first training stage. Specifically, the training process of the first stage may include: performing target detection and encoding on the sample image, and generating the semantic regions corresponding to the sample detection frames as candidate objects; performing cross-modal feature fusion and feature extraction based on the input sample image text (including the category text of the sample detection box and the like) and the sample instruction text to generate fusion features, and, in combination with the instruction text features, inputting them into the content generation network (LLM) with a predefined sample template as a soft visual prompt, so that the original sample label is expanded into a sample content description text while the feature fusion network is trained, realizing expansion of the input text; this training stage completes alignment by calculating the similarity between the candidate semantic regions corresponding to the sample detection frames and the sample label text.
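A minimal sketch of the token-level maximum-similarity signal mentioned above, under the assumption that the visual tokens (image patches and detection boxes) and the label-text tokens have already been embedded; each visual token is scored against all label tokens and the per-token maxima are averaged into an alignment score that can guide the contrastive objective.

import torch
import torch.nn.functional as F

def token_max_similarity(visual_tokens, label_tokens):
    # visual_tokens: [Nv, d]; label_tokens: [Nt, d]
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(label_tokens, dim=-1)
    sim = v @ t.t()                        # [Nv, Nt] token-to-token similarities
    return sim.max(dim=-1).values.mean()   # average of the per-visual-token maxima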
S609: and taking the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features corresponding to the sample content description text as the input of a feature fusion network of the image processing model, and performing constraint training of image content analysis on the initial image processing model based on a human feedback reinforcement learning method until the second training ending condition is met, so as to obtain the image processing model.
Specifically, in the process of constraint training for image content analysis, the network parameters of the feature fusion network are fixed; that is, the second training stage performs instruction fine-tuning on the visual language generation network in combination with the sample content description text.
In this way, the content generating network of the large language model is adopted as an auxiliary module for training the feature fusion network, knowledge capacity of the large language model is fully utilized, and a fine-granularity image detection result is combined to realize visual text association learning and alignment of the feature fusion network, so that the feature fusion network can obtain fine-granularity visual content mining and extraction capacity, and image content understanding accuracy is improved.
In some embodiments, the second training phase corresponding to S609 includes S6091-S6094:
S6091: and inputting the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features into a feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain a second sample fusion feature.
It will be appreciated that the second sample fusion features are obtained similarly to the first sample fusion features, except that the input of the feature fusion network additionally includes the description text features obtained by feature-embedding the sample content description text output by the content generation network; correspondingly, the input for generating the intermediate fusion features corresponding to the second sample fusion features includes the description text features, and the input for generating the text extraction features corresponding to the second sample fusion features also includes the description text features.
In some embodiments, in the visual-text interface instruction fine-tuning stage of the second training stage, the data sources may include image-text pairs of the open-source data set together with corresponding sample content description text (such as video cover/key-frame content and text description), or image-text pairs from the actual business scene together with corresponding sample content description text, the latter being the final results generated by the first training stage. The total number of samples is on the order of thousands for fine-tuning the visual language generation network. In a preferred embodiment, the input samples are also corrected based on manual inspection to obtain the input of the updated image processing model in S609. Specifically, a sample may be constructed based on a predefined template using the sample content description text updated by manual verification and auditing, and the input sample is thus generated in the form "###Human: <Img><ImageFeature></Img> <Instruction> ###Assistant:". The training objective of the second training stage is to generate the corresponding text content from the constructed prompt.
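A minimal sketch of assembling a second-stage sample from the predefined template quoted above; the placeholder strings follow the text, while the helper itself and the optional soft-prompt handling are only illustrative.

TEMPLATE = "###Human: <Img><ImageFeature></Img> {instruction} ###Assistant:"

def build_prompt(instruction, description=None):
    # the sample content description text can be prepended as a soft prompt
    prompt = TEMPLATE.format(instruction=instruction)
    if description is not None:
        prompt = description + " " + prompt
    return prompt

print(build_prompt("How many trains are in the picture? Options: A. 1  B. 2"))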
S6092: and inputting the second sample fusion features and the instruction text features into a visual language generation network to analyze the image content, so as to obtain a second sample analysis result.
It will be appreciated that S6092 is similar to S605 described above, except that in S6092 the visual language generation network of the image processing model performs content understanding and analysis on the input to obtain a second sample analysis result consistent with the form of the sample tag. Specifically, the instruction text features are used as instructions to help the visual language generation network adjust its parameters and generate the second sample analysis result, which can be fine-grained multi-dimensional label text, such as "lovely pet/happy mood/Shiba Inu/cute dog". It will be appreciated that the second sample analysis result output by the large language model may include a plurality of fine-grained label texts, such as "lovely pet/happy mood/Shiba Inu/cute dog", "pet dog/happy/Shiba Inu/cute pet dog", or "cute pet dog/lovely pet dog", etc.
S6093: and adjusting network parameters of the visual language generating network based on the difference between the sample label and the second sample analysis result to perform iterative training until the second training ending condition is met, so as to obtain an intermediate image processing model.
Specifically, a model loss is generated based on the difference between the sample label and the second sample analysis result, the network parameters of the visual language generation network are updated while the network parameters of the feature fusion network are fixed, and the steps are repeated until the number of iterations reaches the preset number or the model loss is smaller than the preset loss, so that an intermediate image processing model is obtained.
S6094: and performing fine tuning training on a visual language generation network of the intermediate image processing model based on a human feedback reinforcement learning method to obtain the image processing model.
Specifically, in order to ensure that the output of the final actual task is aligned with human expectations, a reinforcement learning from human feedback (RLHF) method is introduced: after the intermediate image processing model is obtained and before the model formally goes online, the understanding results output by the visual language generation network are manually scored and aligned with human expectations, and the results are improved through reinforcement learning, so that the accuracy and scene suitability of the content output by the model are improved.
Specifically, the samples used in the reinforcement learning stage may be all the samples used in the foregoing S6091-S6093, or may be partial samples thereof, or may be generated by a test analysis image and a test image text on the user side in the case of an online test. Sample content description text generation of test analysis images can be performed in conjunction with the content generation network described previously to construct new test samples as input to the reinforcement learning stage. After the intermediate image processing model outputs a second sample analysis result based on the sample data in the reinforcement learning stage, human feedback information corresponding to the second sample analysis result is received, and under the condition that the network parameters of the feature fusion network are fixed, the network parameters of the visual language generating network are adjusted based on the second sample analysis result and the human feedback information, so that the image processing model is obtained.
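A heavily simplified sketch of the human-feedback stage: human scores for the second sample analysis results are used to re-weight the fine-tuning loss of the visual language generation network while the fusion network stays frozen. Full RLHF would normally train a reward model and use a policy-gradient method such as PPO; this reweighting, and the llm call signature, are only illustrative stand-ins.

import torch

def feedback_step(llm, opt, fused_feats, target_ids, human_score):
    # human_score in [0, 1]: higher means the output matched human expectations better
    loss = llm(fused_feats, target_ids)      # language-modeling loss (assumed interface)
    weighted = (1.0 - human_score) * loss    # push harder on poorly rated outputs
    opt.zero_grad()
    weighted.backward()
    opt.step()
    return weighted.item()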
By adopting the technical scheme, the hidden multi-dimensional information of image data such as video can be fully utilized, the granularity of tag information is refined, the description of content is deeply expanded, independent modeling for each scene is not needed, and the manual labeling cost, training cost and modeling cost are reduced. The obtained image processing model has strong generalization capability, and label content expansion is performed with content description text during training, so that fine-grained content understanding and output are realized in the application stage, the diversity of the label content is improved, and the multi-scene requirements of label result application and the varying degrees of abstraction required in the distribution and recommendation process are met. Meanwhile, the Zero-Shot capability of the application is not limited by a predefined number of classes, and the training and inference effects are better.
In addition, through fine-grained fusion and alignment of the visual and text information of the image content in a unified space, richer feature expressions for describing and understanding the visual content are obtained, including multi-level fine-grained classification information of the image; this assists content distribution in recommendation systems, reduces modeling cost, improves modeling efficiency, avoids annotating large amounts of fine-grained sample data and consuming large amounts of manpower, and effectively improves research and development efficiency. At the same time, the natural language processing capability and stored knowledge of the multi-modal large language model are fully utilized to assist the understanding of context information and semantic relationships, emphasize the semantic association and expansion of various high-level labels, and enrich the understanding of relationships among entities, so that better understanding and coverage of high-level labels can be provided, descriptions become more comprehensive and accurate, and content distribution efficiency is improved. Furthermore, the image processing method can go beyond the limitation of the main entity content, realize multi-level and multi-dimensional classification labels and descriptions of images, and support content question answering, which facilitates expansion into more dimensions and increases the generalization and richness of the labels. In a word, through the introduction of fine-grained detection information and multi-modal instruction fine-tuning information, the generation of content understanding labels is completed in a unified manner while good expansibility is maintained.
The embodiment of the application also provides an image processing apparatus 800, as shown in fig. 8, fig. 8 shows a schematic structural diagram of the image processing apparatus provided in the embodiment of the application, where the apparatus may include the following modules.
The acquisition module 10: the method comprises the steps that image characteristics, detection frame characteristics and image texts of an image to be analyzed are obtained, wherein the image texts at least comprise frame category texts corresponding to the detection frame characteristics, and the frame category texts are used for indicating content categories of image areas corresponding to the detection frame characteristics in the image to be analyzed;
feature embedding module 20: the method is used for carrying out feature embedding on the image text to obtain image-text features;
image analysis module 30: the image processing module is used for carrying out image analysis on the image characteristics, the detection frame characteristics and the image-text characteristics based on the image processing model to obtain an image analysis result, wherein the image analysis result comprises a multi-dimensional content label which is used for indicating the multi-dimensional content category of the image to be analyzed;
the image processing model is obtained based on sample image features, sample detection frame features, sample image-text features, instruction text features and sample labels corresponding to sample instruction texts corresponding to sample images, and is obtained by combining instruction fine adjustment to train cross-mode feature fusion and feature space alignment of a visual mode and a text mode of a feature fusion network of an initial image processing model and train analysis content generation of a visual language generation network of the initial image processing model, wherein the visual language generation network is constructed based on a pre-trained large language model.
In some embodiments, image analysis module 30 includes:
feature fusion submodule: the method comprises the steps of inputting image features, detection frame features and image-text features into a feature fusion network of an image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain fusion features;
Content analysis sub-module: the visual language generating network is used for inputting the fusion features into the visual language generating network of the image processing model to analyze the image content, and an image analysis result of the text mode is obtained.
In some embodiments, the feature fusion network comprises a first encoding module, a second encoding module, and a full connection layer, the first encoding module and the second encoding module sharing a first attention sub-module, the second encoding module further comprising a second attention sub-module based on a cross-layer attention mechanism; the feature fusion submodule comprises:
A first attention unit: the method comprises the steps of carrying out feature fusion corresponding to a first coding module on image features, detection frame features and image-text features based on a first attention sub-module, and carrying out feature extraction corresponding to a second coding module on the image-text features to obtain initial fusion features and text extraction features;
A second attention unit: the method comprises the steps of performing cross-modal feature fusion on image features, detection frame features and initial fusion features based on a second attention sub-module to obtain intermediate fusion features;
Full connection unit: and the method is used for inputting the intermediate fusion features and the text extraction features into a full-connection layer to perform feature mapping aiming at the visual language generation network, so as to obtain fusion features.
In some embodiments, the second attention unit is specifically configured to: inputting the image features and the initial fusion features into a second attention sub-module, taking the image features and the detection frame features as query features, and taking the initial fusion features as key value features to perform cross-attention feature representation to obtain intermediate fusion features; in the cross-attention characteristic representation process, the historical weight distribution corresponding to the previous attention layer of the second attention sub-module is used as the input of the next attention layer after weight attenuation.
In some embodiments, the acquisition module 10 includes:
a target detection sub-module: the method comprises the steps of performing target detection on an image to be analyzed to obtain image characteristics, detection frame information of a target detection frame and frame type information, wherein the frame type information is used for indicating identification information of frame type texts;
The feature representation sub-module: the method comprises the steps of carrying out feature representation on detection frame information to obtain detection frame features;
a box text generation sub-module: for generating a box category text based on the box category information.
In some embodiments, the apparatus further comprises:
Sample acquisition module: the method comprises the steps of acquiring an initial image processing model, a plurality of sample images, sample additional text and sample labels corresponding to the sample images, and sample instruction text, wherein the sample additional text comprises at least one of text carried in image contents of the sample images and attached description text of the sample images, and the sample instruction text is used for providing instruction information for understanding the image contents required by the initial image processing model when performing image content analysis tasks;
Sample detection module: the method comprises the steps of performing target detection on a sample image to obtain sample image characteristics, sample detection frame characteristics of a sample detection frame and sample frame category texts;
sample text embedding module: the method comprises the steps of performing feature embedding on a sample additional text, a sample box type text and a sample instruction text to obtain sample image-text features and instruction text features;
Training module: the method is used for carrying out cross-mode feature fusion and feature space alignment training of a visual mode and a text mode on a feature fusion network based on sample image features, sample detection frame features, sample image-text features, instruction text features and sample labels in combination with instruction fine adjustment, and carrying out analysis content generation training on a visual language generation network to obtain an image processing model.
In some embodiments, the training module comprises:
The network construction submodule: the method comprises the steps of acquiring a content generation network, and constructing the content generation network based on a pre-trained large language model;
sample fusion submodule: the method comprises the steps of inputting sample image features, sample detection frame features, sample image-text features and instruction text features into a feature fusion network of an initial image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain first sample fusion features;
sample analysis submodule: the method comprises the steps of inputting first sample fusion characteristics and instruction text characteristics into a content generation network to perform image content analysis to obtain a first sample analysis result and a sample content description text;
a first training sub-module: the method comprises the steps of adjusting network parameters of a feature fusion network of an initial image processing model based on a first sample analysis result and a sample label under the condition of fixing network parameters of a content generation network until a first training ending condition is met, so as to obtain the feature fusion network of the image processing model;
a second training sub-module: the method comprises the steps of taking sample image features, sample detection frame features, sample image-text features, instruction text features and description text features corresponding to sample content description texts as input of a feature fusion network of an image processing model, performing constraint training of image content analysis on an initial image processing model based on a human feedback reinforcement learning method until a second training ending condition is met, obtaining the image processing model, and fixing network parameters of the feature fusion network in the process of the constraint training of the image content analysis.
In some embodiments, the second training submodule includes:
Feature fusion unit: the method comprises the steps of inputting sample image features, sample detection frame features, sample image-text features, instruction text features and description text features into a feature fusion network of an image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain second sample fusion features;
Analysis unit: the method comprises the steps of inputting a second sample fusion feature and an instruction text feature into a visual language generation network to analyze image content, and obtaining a second sample analysis result;
A second training unit: the method comprises the steps of adjusting network parameters of a visual language generating network based on the difference between a sample label and a second sample analysis result to perform iterative training until a second training ending condition is met, and obtaining an intermediate image processing model;
Reinforcement learning unit: the method is used for performing fine tuning training on the visual language generation network of the intermediate image processing model based on the human feedback reinforcement learning method to obtain the image processing model.
It should be noted that the above apparatus embodiments and method embodiments are based on the same implementation manner.
The embodiment of the application provides equipment, which can be a terminal or a server, and comprises a processor and a memory, wherein at least one instruction or at least one section of program is stored in the memory, and the at least one instruction or the at least one section of program is loaded and executed by the processor to realize the image processing method or the neural network training method provided by the embodiment of the method.
The memory may be used to store software programs and modules; the processor performs various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for functions, and the like, and the data storage area may store data created according to the use of the device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The method embodiment provided by the embodiment of the application can be executed in electronic equipment such as a mobile terminal, a computer terminal, a server or similar computing devices. Fig. 9 is a block diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 910 (the processor 910 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, etc.), a memory 930 for storing data, and one or more storage media 920 (e.g., one or more mass storage devices) for storing applications 923 or data 922. The memory 930 and the storage medium 920 may be transitory or persistent storage. The program stored on the storage medium 920 may include one or more modules, each of which may include a series of instruction operations in the electronic device. Still further, the central processor 910 may be configured to communicate with the storage medium 920 and execute on the electronic device 900 a series of instruction operations in the storage medium 920. The electronic device 900 may also include one or more power supplies 960, one or more wired or wireless network interfaces 950, one or more input/output interfaces 940, and/or one or more operating systems 921, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input-output interface 940 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of the electronic device 900. In one example, the input-output interface 940 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices through a base station to communicate with the internet. In one example, the input/output interface 940 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
It will be appreciated by those skilled in the art that the configuration shown in fig. 9 is merely illustrative and is not intended to limit the configuration of the electronic device. For example, electronic device 900 may also include more or fewer components than shown in FIG. 9, or have a different configuration than shown in FIG. 9.
Embodiments of the present application also provide a computer-readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program related to implementing the image processing method in the method embodiments, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the image processing method provided in the method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one network server among a plurality of network servers of the computer network. Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a usb disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in the various alternative implementations described above.
The image processing method, the device, the equipment, the storage medium, the server, the terminal and the program product provided by the application acquire the image characteristics, the detection frame characteristics and the image text of the image to be analyzed, wherein the image text at least comprises a frame type text corresponding to the detection frame characteristics, and the frame type text is used for indicating the content type of the image area corresponding to the detection frame characteristics in the image to be analyzed; performing feature embedding on the image text to obtain image-text features; image analysis is carried out on the image characteristics, the detection frame characteristics and the image-text characteristics based on the image processing model, so that an image analysis result is obtained, wherein the image analysis result comprises a multi-dimensional content tag, and the multi-dimensional content tag is used for indicating the multi-dimensional content category of an image to be analyzed; the image processing model is obtained based on sample image features, sample detection frame features, sample image-text features, instruction text features and sample labels corresponding to sample instruction texts corresponding to sample images, and is obtained by combining instruction fine adjustment to train cross-mode feature fusion and feature space alignment of a visual mode and a text mode of a feature fusion network of an initial image processing model and train analysis content generation of a visual language generation network of the initial image processing model, wherein the visual language generation network is constructed based on a pre-trained large language model. In this way, the detection frame features and the corresponding frame type texts are added outside the image feature input, so that an image processing model can learn fine-granularity region information in image content, the image content label range capable of being described and depicted is enlarged, a content label result with better coverage and accuracy is obtained, the bottleneck problem that a pre-trained large-scale multi-mode language model is insensitive to fine-granularity information such as positions, quantity and small objects in an image processing task can be solved through the introduction of fine-granularity auxiliary information such as the detection frame features and the frame type texts, and the understanding degree of image objects and topics is improved; moreover, the method and the system fully utilize knowledge and logic reasoning capability in a large language model, realize the alignment of images and text features and provide fine-granularity understanding support of regional images by increasing fine-granularity target region detection results, so that fine-granularity content understanding covering more dimensionalities and generation of multidimensional content labels can be supported, and modeling efficiency, generalization and practicability of image processing are improved while processing cost is obviously reduced.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for apparatus, devices and storage medium embodiments, the description is relatively simple as it is substantially similar to method embodiments, with reference to the description of method embodiments in part.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program indicating that the relevant hardware is implemented, and the program may be stored in a computer readable storage medium, where the storage medium may be a read only memory, a magnetic disk or optical disk, etc.
The foregoing is only illustrative of the present application and is not to be construed as limiting thereof, but rather as various modifications, equivalent arrangements, improvements, etc., within the spirit and principles of the present application.

Claims (19)

1. An image processing method, the method comprising:
Acquiring image features, detection frame features and image texts of an image to be analyzed, wherein the image texts at least comprise frame category texts corresponding to the detection frame features, and the frame category texts are used for indicating content categories of image areas corresponding to the detection frame features in the image to be analyzed;
Performing feature embedding on the image text to obtain image-text features;
Performing image analysis on the image features, the detection frame features and the image-text features based on an image processing model to obtain an image analysis result, wherein the image analysis result comprises a multi-dimensional content tag, and the multi-dimensional content tag is used for indicating the multi-dimensional content category of the image to be analyzed;
The image processing model is obtained based on sample image features, sample detection frame features, sample image-text features, instruction text features and sample labels corresponding to sample instruction texts, and is obtained by performing cross-mode feature fusion and feature space alignment training of a visual mode and a text mode on a feature fusion network of an initial image processing model in combination with instruction fine adjustment, and performing analysis content generation training on a visual language generation network of the initial image processing model, wherein the visual language generation network is constructed based on a pre-training large language model;
the sample image-text feature is obtained by feature embedding of sample additional text and the box type text, and the sample additional text comprises at least one of text carried in the image content of the sample image and attached descriptive text of the sample image.
2. The method of claim 1, wherein performing image analysis on the image feature, the detection frame feature, and the image-text feature based on the image processing model, to obtain an image analysis result comprises:
Inputting the image features, the detection frame features and the image-text features into a feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion to obtain fusion features;
And inputting the fusion characteristics into a visual language generation network of the image processing model to perform image content analysis, so as to obtain the image analysis result of the text mode.
3. The method of claim 2, wherein the feature fusion network comprises a first encoding module, a second encoding module, and a fully-connected layer, the first encoding module and the second encoding module sharing a first attention sub-module, the second encoding module further comprising a second attention sub-module based on a cross-layer attention mechanism; inputting the image features, the detection frame features and the image-text features into a feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion, and obtaining fusion features comprises the following steps:
Based on the first attention sub-module, carrying out feature fusion corresponding to the first coding module on the image features, the detection frame features and the image-text features, and carrying out feature extraction corresponding to the second coding module on the image-text features to obtain initial fusion features and text extraction features;
based on the second attention sub-module, performing cross-modal feature fusion on the image features, the detection frame features and the initial fusion features to obtain intermediate fusion features;
And inputting the intermediate fusion feature and the text extraction feature into the full-connection layer to perform feature mapping aiming at a content analysis network, so as to obtain the fusion feature.
4. The method of claim 3, wherein the cross-modal feature fusion of the image feature, the detection frame feature, and the initial fusion feature based on the second attention sub-module, the obtaining an intermediate fusion feature comprises:
Inputting the image features and the initial fusion features into the second attention sub-module, taking the image features and the detection frame features as query features, and taking the initial fusion features as key value features to perform cross-attention feature representation to obtain the intermediate fusion features; in the process of representing the cross-attention characteristic, the historical weight distribution corresponding to the previous attention layer of the second attention sub-module is used as the input of the next attention layer after the weight is attenuated.
5. The method of claim 1, wherein the acquiring image features, detection frame features, and image text of the image to be analyzed comprises:
Performing target detection on the image to be analyzed to obtain the image characteristics, detection frame information of a target detection frame and frame type information, wherein the frame type information is used for indicating identification information of the frame type text;
Performing feature representation on the detection frame information to obtain the detection frame features;
the box category text is generated based on the box category information.
6. The method according to any one of claims 1-5, further comprising:
acquiring the initial image processing model, a plurality of sample images, and sample additional texts, sample labels and sample instruction texts corresponding to the sample images, wherein the sample instruction text is used for providing instruction information for the image content understanding required by the initial image processing model when executing an image content analysis task;
performing target detection on the sample image to obtain sample image features, sample detection frame features of a sample detection frame, and sample box category text;
performing feature embedding on the sample additional text, the sample box category text and the sample instruction text to obtain sample image-text features and instruction text features;
and, based on the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the sample labels, and in combination with instruction fine-tuning, training the feature fusion network on cross-modal feature fusion and feature space alignment between the visual modality and the text modality, and training the visual language generation network on analysis content generation, so as to obtain the image processing model.
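The feature embedding step of claim 6 could, for instance, reuse the input embedding table of the underlying language model, as in the hedged sketch below; the `gpt2` checkpoint is only a stand-in, since the patent does not name a specific pre-trained large language model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoint; any causal LLM with an input embedding table would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2")

def embed_texts(additional_text: str, box_category_text: str, instruction_text: str):
    """Feature embedding of the sample texts into the LLM's embedding space."""
    def embed(text: str) -> torch.Tensor:
        ids = tokenizer(text, return_tensors="pt").input_ids
        return llm.get_input_embeddings()(ids)               # (1, L, D)

    # Sample image-text features: additional text + box category text, concatenated.
    image_text_feat = torch.cat([embed(additional_text), embed(box_category_text)], dim=1)
    instruction_feat = embed(instruction_text)               # instruction text features
    return image_text_feat, instruction_feat
```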
7. The method of claim 6, wherein the training method of the image processing model comprises:
Acquiring a content generation network, wherein the content generation network is constructed based on a pre-trained large language model;
inputting the sample image features, the sample detection frame features, the sample image-text features and the instruction text features into a feature fusion network of the initial image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain first sample fusion features;
Inputting the first sample fusion feature and the instruction text feature into the content generation network to perform image content analysis to obtain a first sample analysis result and a sample content description text;
with the network parameters of the content generation network fixed, adjusting network parameters of the feature fusion network of the initial image processing model based on the first sample analysis result and the sample label until a first training ending condition is met, to obtain the feature fusion network of the image processing model;
and taking the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features corresponding to the sample content description text as the input of the feature fusion network of the image processing model, and performing constraint training of image content analysis on the initial image processing model based on a human feedback reinforcement learning method until a second training ending condition is met, to obtain the image processing model, wherein the network parameters of the feature fusion network are fixed during the constraint training of the image content analysis.
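The first training stage of claim 7 amounts to freezing the content generation network and updating only the feature fusion network against the sample labels. The sketch below assumes a LLaVA-style prefix supervision scheme and an AdamW optimizer; neither is stated in the patent, and the batch keys are hypothetical.

```python
import torch

def train_stage1(fusion_net, content_gen_net, dataloader, epochs: int = 1):
    """Stage-1 sketch for claim 7: the content generation network (pre-trained LLM)
    is frozen and only the feature fusion network is updated."""
    for p in content_gen_net.parameters():
        p.requires_grad_(False)                               # fix the LLM parameters

    optimizer = torch.optim.AdamW(fusion_net.parameters(), lr=1e-4)
    for _ in range(epochs):
        for batch in dataloader:
            # Cross-modal fusion of the sample features (claim 7 lists four inputs).
            fused = fusion_net(batch["image_feat"], batch["box_feat"],
                               batch["image_text_feat"], batch["instruction_feat"])
            # Supervise only the label tokens: the fused visual prefix is masked out
            # of the language-modeling loss with the ignore index -100.
            label_ids = batch["label_ids"]                    # (B, Ll)
            label_emb = content_gen_net.get_input_embeddings()(label_ids)
            inputs = torch.cat([fused, label_emb], dim=1)
            ignore = torch.full(fused.shape[:2], -100, dtype=torch.long,
                                device=label_ids.device)
            labels = torch.cat([ignore, label_ids], dim=1)
            loss = content_gen_net(inputs_embeds=inputs, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return fusion_net
```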
8. The method of claim 7, wherein the performing constraint training of image content analysis on the initial image processing model based on a human feedback reinforcement learning method, with the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features corresponding to the sample content description text as inputs to the feature fusion network of the image processing model, until a second training ending condition is met, comprises:
inputting the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features into a feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain second sample fusion features;
inputting the second sample fusion features and the instruction text features into the visual language generation network to perform image content analysis to obtain a second sample analysis result;
adjusting network parameters of the visual language generation network based on the difference between the sample label and the second sample analysis result to perform iterative training until the second training ending condition is met, so as to obtain an intermediate image processing model;
and performing fine-tuning training on the visual language generation network of the intermediate image processing model based on a human feedback reinforcement learning method to obtain the image processing model.
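The human feedback reinforcement learning fine-tune of claim 8 is sketched below as a simplified REINFORCE-style update; a production pipeline would typically add a preference-trained reward model, PPO and a KL penalty, none of which the patent specifies. The `reward_model` callable (returning one scalar reward per sample) is hypothetical.

```python
import torch

def rlhf_step(vl_generator, tokenizer, reward_model, fused, optimizer):
    """One simplified policy-gradient update standing in for claim 8's
    human-feedback fine-tune; the fusion features `fused` come from the
    frozen feature fusion network."""
    # 1. Sample an analysis text from the current policy (no gradients here).
    with torch.no_grad():
        sampled = vl_generator.generate(inputs_embeds=fused, do_sample=True,
                                        max_new_tokens=64)            # (B, T)
    texts = tokenizer.batch_decode(sampled, skip_special_tokens=True)
    reward = reward_model(texts).detach()                             # (B,)

    # 2. Re-run a differentiable forward pass over the sampled tokens.
    tok_emb = vl_generator.get_input_embeddings()(sampled)            # (B, T, D)
    logits = vl_generator(inputs_embeds=torch.cat([fused, tok_emb], dim=1)).logits
    gen_logits = logits[:, fused.size(1) - 1:-1, :]                   # predict sampled tokens
    logprobs = torch.log_softmax(gen_logits, dim=-1)
    token_lp = logprobs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1) # (B, T)

    # 3. Policy-gradient step: raise the likelihood of high-reward analyses.
    loss = -(reward.unsqueeze(-1) * token_lp).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```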
9. An image processing apparatus, characterized in that the apparatus comprises:
an acquisition module: used for acquiring image features, detection frame features and image text of an image to be analyzed, wherein the image text at least comprises box category text corresponding to the detection frame features, and the box category text is used for indicating the content category of the image area corresponding to the detection frame features in the image to be analyzed;
a feature embedding module: used for performing feature embedding on the image text to obtain image-text features;
and an image analysis module: used for performing image analysis on the image features, the detection frame features and the image-text features based on an image processing model to obtain an image analysis result, wherein the image analysis result comprises a multi-dimensional content label, and the multi-dimensional content label is used for indicating the multi-dimensional content category of the image to be analyzed;
wherein the image processing model is obtained, based on sample image features, sample detection frame features, sample image-text features, instruction text features corresponding to sample instruction text, and sample labels, by performing cross-modal feature fusion and feature space alignment training between the visual modality and the text modality on a feature fusion network of an initial image processing model in combination with instruction fine-tuning, and by performing analysis content generation training on a visual language generation network of the initial image processing model, the visual language generation network being constructed based on a pre-trained large language model;
and the sample image-text features are obtained by feature embedding of sample additional text and the sample box category text, the sample additional text comprising at least one of text carried in the image content of the sample image and attached descriptive text of the sample image.
10. The apparatus of claim 9, wherein the image analysis module comprises:
a feature fusion submodule: used for inputting the image features, the detection frame features and the image-text features into the feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain fusion features;
and a content analysis submodule: used for inputting the fusion features into the visual language generation network of the image processing model to perform image content analysis, so as to obtain the image analysis result in the text modality.
11. The apparatus of claim 10, wherein the feature fusion network comprises a first encoding module, a second encoding module, and a fully-connected layer, the first encoding module and the second encoding module sharing a first attention sub-module, the second encoding module further comprising a second attention sub-module based on a cross-layer attention mechanism; the feature fusion submodule comprises:
a first attention unit: used for performing, based on the first attention sub-module, feature fusion corresponding to the first encoding module on the image features, the detection frame features and the image-text features, and feature extraction corresponding to the second encoding module on the image-text features, to obtain initial fusion features and text extraction features;
a second attention unit: used for performing, based on the second attention sub-module, cross-modal feature fusion on the image features, the detection frame features and the initial fusion features to obtain intermediate fusion features;
and a full connection unit: used for inputting the intermediate fusion features and the text extraction features into the fully-connected layer to perform feature mapping for the content analysis network, so as to obtain the fusion features.
12. The apparatus according to claim 11, wherein the second attention unit is specifically configured to:
inputting the image features, the detection frame features and the initial fusion features into the second attention sub-module, and performing cross-attention feature representation with the image features and the detection frame features as query features and the initial fusion features as key-value features, to obtain the intermediate fusion features; wherein, during the cross-attention feature representation, the historical weight distribution of a previous attention layer of the second attention sub-module is attenuated and then used as an input of a next attention layer.
13. The apparatus of claim 9, wherein the acquisition module comprises:
a target detection sub-module: used for performing target detection on the image to be analyzed to obtain the image features, detection frame information of a target detection frame, and box category information, wherein the box category information is used for indicating identification information of the box category text;
a feature representation sub-module: used for performing feature representation on the detection frame information to obtain the detection frame features;
and a box text generation sub-module: used for generating the box category text based on the box category information.
14. The apparatus according to any one of claims 9-13, wherein the apparatus further comprises:
a sample acquisition module: used for acquiring an initial image processing model, a plurality of sample images, and sample additional texts, sample labels and sample instruction texts corresponding to the sample images, wherein the sample instruction text is used for providing instruction information for the image content understanding required by the initial image processing model when performing an image content analysis task;
a sample detection module: used for performing target detection on the sample image to obtain sample image features, sample detection frame features of a sample detection frame, and sample box category text;
a sample text embedding module: used for performing feature embedding on the sample additional text, the sample box category text and the sample instruction text to obtain sample image-text features and instruction text features;
and a training module: used for, based on the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the sample labels, and in combination with instruction fine-tuning, training the feature fusion network on cross-modal feature fusion and feature space alignment between the visual modality and the text modality, and training the visual language generation network on analysis content generation, so as to obtain the image processing model.
15. The apparatus of claim 14, wherein the training module comprises:
a network construction submodule: used for acquiring a content generation network, wherein the content generation network is constructed based on a pre-trained large language model;
a sample fusion submodule: used for inputting the sample image features, the sample detection frame features, the sample image-text features and the instruction text features into the feature fusion network of the initial image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain first sample fusion features;
a sample analysis submodule: used for inputting the first sample fusion features and the instruction text features into the content generation network to perform image content analysis, so as to obtain a first sample analysis result and a sample content description text;
a first training sub-module: used for, with the network parameters of the content generation network fixed, adjusting network parameters of the feature fusion network of the initial image processing model based on the first sample analysis result and the sample label until a first training ending condition is met, to obtain the feature fusion network of the image processing model;
and a second training sub-module: used for taking the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features corresponding to the sample content description text as the input of the feature fusion network of the image processing model, and performing constraint training of image content analysis on the initial image processing model based on a human feedback reinforcement learning method until a second training ending condition is met, to obtain the image processing model, wherein the network parameters of the feature fusion network are fixed during the constraint training of the image content analysis.
16. The apparatus of claim 15, wherein the second training submodule comprises:
a feature fusion unit: used for inputting the sample image features, the sample detection frame features, the sample image-text features, the instruction text features and the description text features into the feature fusion network of the image processing model to perform feature extraction and cross-modal feature fusion, so as to obtain second sample fusion features;
an analysis unit: used for inputting the second sample fusion features and the instruction text features into the visual language generation network to perform image content analysis, so as to obtain a second sample analysis result;
a second training unit: used for adjusting network parameters of the visual language generation network based on the difference between the sample label and the second sample analysis result to perform iterative training until the second training ending condition is met, so as to obtain an intermediate image processing model;
and a reinforcement learning unit: used for performing fine-tuning training on the visual language generation network of the intermediate image processing model based on a human feedback reinforcement learning method to obtain the image processing model.
17. A computer-readable storage medium, characterized in that at least one instruction or at least one program is stored in the storage medium, the at least one instruction or the at least one program being loaded and executed by a processor to implement the image processing method according to any one of claims 1-8.
18. A computer device, characterized in that it comprises a processor and a memory in which at least one instruction or at least one program is stored, which is loaded and executed by the processor to implement the image processing method according to any of claims 1-8.
19. A computer program product comprising computer instructions which, when executed by a processor, implement the image processing method according to any of claims 1-8.
CN202410155582.4A 2024-02-04 2024-02-04 Image processing method, device, equipment and medium Active CN117711001B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410155582.4A CN117711001B (en) 2024-02-04 2024-02-04 Image processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117711001A (en) 2024-03-15
CN117711001B (en) 2024-05-07

Family

ID=90159266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410155582.4A Active CN117711001B (en) 2024-02-04 2024-02-04 Image processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117711001B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861917A (en) * 2021-01-14 2021-05-28 西北工业大学 Weak supervision target detection method based on image attribute learning
WO2021184396A1 (en) * 2020-03-19 2021-09-23 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for recognizing image-based content presented in a structured layout
CN114118408A (en) * 2021-11-11 2022-03-01 北京达佳互联信息技术有限公司 Training method of image processing model, image processing method, device and equipment
CN114429636A (en) * 2022-04-06 2022-05-03 中国科学院自动化研究所 Image scanning identification method and device and electronic equipment
KR20220133141A (en) * 2022-03-10 2022-10-04 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Text extraction method, text extraction model training method, apparatus and device
CN116578738A (en) * 2023-07-14 2023-08-11 深圳须弥云图空间科技有限公司 Graph-text retrieval method and device based on graph attention and generating countermeasure network

Similar Documents

Publication Publication Date Title
CN111930992B (en) Neural network training method and device and electronic equipment
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111046275B (en) User label determining method and device based on artificial intelligence and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN112307351A (en) Model training and recommending method, device and equipment for user behavior
CN115115913A (en) Data processing method and device, electronic equipment and storage medium
CN110795944A (en) Recommended content processing method and device, and emotion attribute determining method and device
CN116664719B (en) Image redrawing model training method, image redrawing method and device
CN116824278B (en) Image content analysis method, device, equipment and medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN112131430A (en) Video clustering method and device, storage medium and electronic equipment
CN112257841A (en) Data processing method, device and equipment in graph neural network and storage medium
CN112015896B (en) Emotion classification method and device based on artificial intelligence
CN113761220A (en) Information acquisition method, device, equipment and storage medium
CN114201516A (en) User portrait construction method, information recommendation method and related device
CN116956116A (en) Text processing method and device, storage medium and electronic equipment
Park et al. An effective 3D text recurrent voting generator for metaverse
CN117494051A (en) Classification processing method, model training method and related device
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN117711001B (en) Image processing method, device, equipment and medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN116205700A (en) Recommendation method and device for target product, computer equipment and storage medium
CN113535946A (en) Text identification method, device and equipment based on deep learning and storage medium
CN114840697B (en) Visual question-answering method and system for cloud service robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant