CN116977684A - Image recognition method, device, equipment and storage medium - Google Patents

Image recognition method, device, equipment and storage medium

Info

Publication number
CN116977684A
CN116977684A
Authority
CN
China
Prior art keywords
image
text
difference data
information
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211476632.6A
Other languages
Chinese (zh)
Inventor
Liu Gang (刘刚)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211476632.6A
Publication of CN116977684A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification


Abstract

The embodiment of the application provides an image recognition method, apparatus, device and storage medium, wherein the method comprises the following steps: acquiring an image to be identified; invoking a target recognition model to acquire feature representation information of the image to be identified, and invoking the target recognition model to process the feature representation information to obtain a recognition result indicating whether the image to be identified is of a target category. The target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data: the first difference data is determined according to a first classification result of the training sample by the image recognition module and an image classification label; the second difference data is determined according to a second classification result of the training sample by the text recognition module and a text classification label; the third difference data is determined according to a third classification result of the training sample by the image-text recognition module and an image-text classification label. By adopting the embodiment of the application, the accuracy of unhealthy content identification can be improved.

Description

Image recognition method, device, equipment and storage medium
Technical Field
The present application relates to computer technologies, and in particular, to an image recognition method, apparatus, device, and storage medium.
Background
With the development of internet technology, the amount of published content is growing rapidly, and browsing videos and image-text content on platforms has become one of people's most common habits. The quality of content published on a platform varies, and some of it may involve unhealthy content of a specific type. Unhealthy content of a specific type refers to content related to intimate behavior, improper dress, minors, and the like, which does not include direct exposure of private parts or directly depicted unhealthy behavior. For example, a specific type of unhealthy content may be an expression package (meme) composed of an image of a minor and unsuitable text.
At present, in the process of platforms auditing uploaded content, artificial intelligence (AI) algorithms have a good recognition effect on certain types of unhealthy content that have clear characteristics, but a poor recognition effect on unhealthy content of the specific type described above. Therefore, such content is mostly checked manually by auditors. Manual auditing is subjective, some obscure content requires certain background knowledge, and the recognition effect is poor, so unhealthy content of the specific type cannot be accurately identified.
Therefore, how to improve the accuracy of identifying unhealthy content of a specific type is a technical problem to be solved.
Disclosure of Invention
The embodiment of the application provides an image recognition method, apparatus, device and storage medium, which can improve the accuracy and the effect of identifying unhealthy content of a specific type.
In a first aspect, an embodiment of the present application provides an image recognition method, including:
acquiring an image to be identified;
invoking a target recognition model to acquire feature representation information of the image to be recognized, wherein the feature representation information comprises text feature representation information and image semantic representation information;
invoking the target recognition model to process the feature representation information of the image to be recognized to obtain a recognition result of the image to be recognized, wherein the recognition result is used for indicating whether the image to be recognized is of a target category;
the target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data, and the initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module; the first difference data is determined according to a first classification result of the training sample and an image classification label by the image recognition module; the second difference data is determined according to a second classification result and a text classification label of the training sample by the text recognition module; the third difference data is determined according to a third classification result of the training sample by the image-text recognition module and an image-text classification label, and the image-text classification label is used for indicating whether the training sample is of the target class.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including:
the acquisition unit is used for acquiring the image to be identified;
the calling unit is used for calling the target recognition model to acquire the characteristic representation information of the image to be recognized, wherein the characteristic representation information comprises text characteristic representation information and image semantic representation information;
the calling unit is further used for calling the target recognition model to process the feature representation information of the image to be recognized to obtain a recognition result of the image to be recognized, wherein the recognition result is used for indicating whether the image to be recognized is of a target category;
the target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data, and the initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module; the first difference data is determined according to a first classification result of the training sample and an image classification label by the image recognition module; the second difference data is determined according to a second classification result and a text classification label of the training sample by the text recognition module; the third difference data is determined according to a third classification result of the training sample by the image-text recognition module and an image-text classification label, and the image-text classification label is used for indicating whether the training sample is of the target class.
In a third aspect, an embodiment of the present application provides an image recognition apparatus, including a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory stores executable program code, and the processor is configured to call the executable program code to perform the image recognition method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer readable storage medium, where instructions are stored, which when executed on a computer, cause the computer to perform the image recognition method of the first aspect.
In a fifth aspect, embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored on a computer readable storage medium. The processor of the terminal device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the terminal device performs the image recognition method of the first aspect.
In the embodiment of the application, the image to be identified is input into the target recognition model, the text features and image features of the image to be identified are extracted through the target recognition model, and the extracted features are processed to obtain a recognition result indicating whether the image to be identified is of the target category. In this way, the target recognition model can process multi-dimensional feature information such as text content and image content, which eliminates the subjective factors of manual auditing and removes the need for auditors to have background knowledge of unhealthy content of a specific type, thereby improving recognition accuracy and recognition effect. In addition, compared with manual processing, identifying whether unhealthy content of a specific type exists through the target recognition model is faster, which can greatly improve recognition efficiency and thus content auditing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for the description of the embodiments will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of the architecture of an image recognition system according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the architecture of another image recognition system according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of an image recognition method according to an embodiment of the present application;
Fig. 4 is another schematic flowchart of an image recognition method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an initial recognition model according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of another initial recognition model according to an embodiment of the present application;
Fig. 7 is a schematic diagram of training an initial recognition model according to an embodiment of the present application;
Fig. 8 is a schematic flowchart of an image recognition method according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Before describing the embodiments of the present application in further detail, the terms and terminology involved in the embodiments are explained below.
1. Modality
Modalities refer to data in different forms of existence or from different information sources, such as text, images, video, audio, and the like. Data can be single-modal or multi-modal: single-modal refers to data with one form of existence or one information source, while multi-modal refers to data composed of two or more modalities. Multi-modal data fusion addresses the problem of fusing heterogeneous types of data and can be applied to scenarios of collaborative reasoning over heterogeneous modal data. In the field of artificial intelligence, a computer comprehensively processes multi-modal data by fusing the information of all modalities to perform target prediction.
As used in the present application, multi-modal may refer to data composed of images and text, video and text, or audio and text; for example, unhealthy content of a specific type is multi-modal data composed of images and text. The target recognition model can process multi-modal data to judge whether the data belongs to the target category, i.e., whether it is unhealthy content of the specific type.
2. Key frame
Key frame is a term from computer animation, where a frame is the smallest unit of a single image picture in an animation, equivalent to a single shot on film. A key frame can be understood as an original drawing in two-dimensional animation: the frame in which a key action in the motion or change of a character or object is located.
In the application, the image frames in the video to be identified can be used for determining the identification result of the video to be identified, and the key frames can be the image frames comprising different scene information in the video to be identified.
3. Professionally Generated Content (PGC)
PGC refers to content produced professionally, for example by video websites, or content produced by experts, for example on blogs. It broadly denotes content that is personalized, diverse in viewpoint, and virtualized in social relationships. It is also known as PPC (Professionally-Produced Content), in contrast to User Generated Content (UGC). An organization or individual that achieves stable monetization by guaranteeing continuous output of PGC content may be called a Multi-Channel Network (MCN).
The content production terminal in the present application can be a terminal device of a video website, of an expert, or of an ordinary user.
4. Content
Content refers to the substantive material that a thing contains.
In the application, content can refer to articles and videos that a content publishing platform can recommend to users for reading; the articles can be edited and published by users through their accounts, and the videos are actively published by PGC or UGC creators.
5. Message source (Feeds)
Message sources, which may also be referred to as feeds, information offerings, summaries, sources, news feeds or web feeds, are a data format through which websites push up-to-date information to users. Message sources are generally arranged along a timeline, which is their most basic presentation form. A website provides a source of messages, and users browse and subscribe to the website, thereby browsing the message sources on it. Combining message sources is called aggregation, and software for aggregation is called an aggregator: software specifically used to subscribe to websites, also commonly referred to as a Really Simple Syndication (RSS) reader, feed reader, news reader, etc.
In the present application, a user can browse various contents arranged in a time axis manner.
The embodiment of the application provides an image recognition scheme, which can be applied to various content publishing applications or content publishing platforms, where a content publishing application or system refers to an application or system with which users publish content and which provides that content to users. Specifically, the content publishing application or system can process the image to be identified by calling the target recognition model to obtain feature representation information comprising text feature representation information and image semantic representation information, and further process the feature representation information through the target recognition model to obtain a recognition result of the image to be identified, where the recognition result can be used for indicating whether the image to be identified is of the target category. In this way, text features and image features are obtained through the target recognition model for multi-modal classification, which can improve both the recognition accuracy of unhealthy content of a specific type and the recognition effect. In addition, using the target recognition model for recognition improves recognition efficiency.
The image recognition scheme provided by the embodiment of the application relates to artificial intelligence, machine learning, computer vision and other technologies, wherein:
Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating (interactive) systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning (deep learning) and other directions.
Machine Learning (ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine Learning and Deep Learning (DL) generally includes techniques such as artificial neural networks, confidence networks, reinforcement Learning, transfer Learning, induction Learning, teaching Learning, and the like.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition and measurement on a target, and further performs graphic processing so that the resulting image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theory and technology in an attempt to build artificial intelligence systems that can acquire information from images or multi-dimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of an image recognition system according to an embodiment of the present application. The image recognition system may include a plurality of terminal devices, for example, a content production terminal 101 and a content audit terminal 103, and may also include two electronic devices, namely an image recognition device 102 and a model training device 104. The image recognition device 102 may be directly or indirectly connected to the terminal devices in a wired or wireless manner, and the model training device 104 and the image recognition device 102 may likewise be directly or indirectly connected in a wired or wireless manner. Alternatively, the image recognition device 102 and the model training device 104 may be the same electronic device or two different electronic devices, which is not limited in this application. It should be noted that the number and form of the devices shown in fig. 1 are examples and do not limit the embodiments of the present application: in practice the image recognition system may include only two terminal devices (one content production terminal and one content audit terminal), or three or more content production terminals and three or more content audit terminals, and may also include at least two image recognition devices. The embodiment of the present application is drawn and explained taking three content production terminals 101, three content audit terminals 103, and one electronic device (i.e., the image recognition device 102 and the model training device 104 are the same electronic device) as an example.
As shown in fig. 1, the three terminal devices of the content production terminal 101 may belong to three different users; a user may publish content through the content production terminal 101 and may also browse content published by other users through it. When a user publishes content including an image through the content production terminal 101, the image recognition device 102 may identify the uploaded content: it takes the image as the image to be identified, calls the target recognition model to obtain the feature representation information of the image to be identified, and processes the feature representation information to obtain a recognition result indicating whether the image to be identified is of the target category. The recognition result of the image recognition device 102 may be acquired and checked by an auditor of the content publishing application or system through the content audit terminal 103. If the check is passed, or if the image recognition device 102 identifies the image as not being of the target category, the image may be displayed on the content publishing application or system and pushed to the content production terminal 101 for browsing. The target category simply means that the image to be identified includes unhealthy content of a specific type.
The target recognition model called by the image recognition device 102 may be obtained by the model training device 104 training an initial recognition model according to first difference data, second difference data and third difference data, where the initial recognition model includes an image recognition module, a text recognition module and an image-text recognition module. The first difference data is determined according to a first classification result of the training sample by the image recognition module and an image classification label; the second difference data is determined according to a second classification result of the training sample by the text recognition module and a text classification label; the third difference data is determined according to a third classification result of the training sample by the image-text recognition module and an image-text classification label, where the image-text classification label is used for indicating whether the training sample is of the target category. The model training device 104 may be used to construct and train the target recognition model.
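For illustration only, the joint training objective described above can be sketched as follows. This is a minimal sketch assuming all three difference data are cross-entropy losses combined with equal weights; the patent text only states that the model is trained according to the three difference data, so the loss form and weighting are assumptions, and all names are illustrative.

```python
import torch.nn.functional as F

def joint_loss(img_logits, txt_logits, fused_logits,
               img_label, txt_label, itm_label):
    # First difference data: image branch output vs. image classification label
    loss_img = F.cross_entropy(img_logits, img_label)
    # Second difference data: text branch output vs. text classification label
    loss_txt = F.cross_entropy(txt_logits, txt_label)
    # Third difference data: image-text branch output vs. image-text label
    # (the image-text label indicates whether the sample is of the target category)
    loss_fused = F.cross_entropy(fused_logits, itm_label)
    # Equal weighting is an assumption, not specified in the patent
    return loss_img + loss_txt + loss_fused
```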
Any of the above-mentioned terminal devices (e.g., the content production terminals 101 and the content audit terminals 103) may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The image recognition device 102 and the model training device 104 may likewise be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc.; they may also be servers, for example independent physical servers, a server cluster or distributed system formed by a plurality of physical servers, or cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data and artificial intelligence platforms.
Referring to fig. 2 together, fig. 2 is a schematic diagram of the architecture of another image recognition system according to an embodiment of the application. As shown in fig. 2, the image recognition system may include a content production terminal 201, a content scheduling device 202, an image recognition device 205, a model training device 206, and a content audit terminal 207. The image recognition system may further include a metadata database 203 for storing metadata of the content, where metadata refers to the file size, file format, title, publication time, publishing account identity information, and the like of the content. The image recognition system may also include a content database 204 for storing the various published content, and a training sample database 208 used by the model training device 206 to train an initial recognition model.
In a specific implementation, for example in a scenario where a platform publishes content, a user may install and run an application program with content publishing and browsing functions through the content production terminal 201, and publish content through the publishing function in the application program. Alternatively, the content may be an image, or a video including a plurality of image frames, and the image may include text information. Specifically, the user uploads the content (e.g., an image) to be published to the content scheduling device 202, and the content scheduling device 202 may obtain the metadata of the image and store it in the metadata database 203, where the metadata of the image may include its file size, file format, title, publication time, publishing account identity information, and so on. The content scheduling device 202 may also store the source file of the image in the content database 204 for subsequent file downloads. The content scheduling device 202 may then schedule the image recognition device 205 to identify the image and determine whether the image is of the target category.
The image recognition device 205 may invoke a target recognition model to identify the image, where the target recognition model may be obtained by the model training device 206 training an initial recognition model using training samples obtained from the training sample database 208. An auditor of the platform can review the image identified by the image recognition device 205 through the content audit terminal 207; when the image is confirmed to belong to the target category, it can be stored in the training sample database 208 and used as a negative sample to train the initial recognition model. Optionally, the content audit terminal 207 may also receive users' reports and complaints about content; the auditor confirms through the content audit terminal 207 whether the reported content is of the target category, and if so, the reported content may be added to the training sample database 208. Furthermore, when an auditor browsing the content published on the platform through the content audit terminal 207 finds content of the target category, that content can be handled and added to the training sample database, so as to keep the content provided by the platform healthy.
Through the image recognition system, the image to be identified is input into the target recognition model, text features and image features of the image to be identified are extracted through the target recognition model, and the extracted features are processed to obtain a recognition result indicating whether the image to be identified is of the target category. In this way, the target recognition model can process multi-dimensional feature information such as text content and image content, which eliminates the subjective factors of manual auditing and removes the need for auditors to have background knowledge of unhealthy content of a specific type, thereby improving recognition accuracy and recognition effect. In addition, compared with manual processing, identifying whether unhealthy content of a specific type exists through the target recognition model is faster, which can greatly improve recognition efficiency and thus content auditing efficiency.
In one implementation, the training samples, the images to be processed and the published content can be stored in a blockchain, so that they cannot be tampered with. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. It is essentially a decentralized database: a series of data blocks generated in association using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
It may be understood that the image recognition system described in the embodiment of the present application is intended to describe the technical solution of the embodiment more clearly and does not constitute a limitation on it; those skilled in the art will know that, with the evolution of system architectures and the appearance of new service scenarios, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.
Based on the above image recognition scheme and image recognition system, the embodiment of the present application provides an image recognition method, which may be performed by an electronic device; the electronic device may be the image recognition device 102 in the image recognition system shown in fig. 1 or the image recognition device 205 in the image recognition system shown in fig. 2. If the image recognition device is a server, it may be a dedicated server or an internet application server, through which not only the relevant steps of the embodiment of the present application may be executed, but also other services may be provided. Referring to fig. 3, fig. 3 is a flowchart of an image recognition method according to an embodiment of the present application; the image recognition method includes the following steps S301 to S303:
S301, acquiring an image to be identified.
In the embodiment of the application, the image to be identified may be an image included in content that a user wants to publish on the platform. The platform's electronic device checks the image to be identified and determines whether it is of the target category; if so, the image to be identified is unhealthy content of a specific type and cannot be published on the platform. The image to be identified can be uploaded by a content production terminal, such as a user's mobile phone, or obtained by the electronic device from a content database, for example for inspection or re-checking. The image to be identified may include text information, since an image that is not itself unhealthy can become unhealthy content, for example unhealthy content of a specific type, once unsuitable text information is added to it. Likewise, adding different text information to the same background image can yield either normal content (with proper text) or unhealthy content of a specific type (with improper text).
The background image may include animals, people, landscapes, and the like. The people in the background image can be of various ages, for example adults and minors. If the background image includes content such as exposure of private parts, the background image itself is unhealthy content; if it only includes animals, minors or scenery, it is a normal image. The text information may include normal content such as news or jokes, unhealthy content such as direct descriptions of unhealthy behavior, and content with improper hints or obscure implications. Adding text with improper hints or obscure implications to a normal background image may produce either healthy or unhealthy content. For example, adding obscure content to a normal background image of animals may still yield normal content, while adding text with improper hints or obscure implications to a background image of minors yields unhealthy content of a specific type. Adding the text "let's play in the water" to a background image of a puppy produces normal content; replacing the background image with one of minors produces unhealthy content of a specific type.
Therefore, the embodiment of the application can identify images that include text information and takes multi-modal features into account, so that whether the image to be identified is unhealthy content of a specific type can be accurately determined.
In an actual service scenario, the content production terminal may publish a video, and the image to be identified may be an image frame in the video to be identified, where some of the image frames may include unhealthy content of a specific type. Identifying a video therefore involves frame-level result sampling and frame result fusion: frame-level result sampling refers to extracting some of the image frames from the video to be identified and obtaining their recognition results, and frame result fusion refers to determining the recognition result of the video to be identified from the recognition results of those image frames.
In an information flow service, videos can have different durations, so image frames to be identified can be extracted from videos of different durations, and the recognition result of a video can be determined from the recognition results of the extracted frames. For example, the video to be identified may be split into a plurality of image frames, and a portion of the frames extracted as images to be identified; these are the key frames of the video. The extracted images are then identified, and the recognition result of the video is determined from their recognition results. When framing a video, the electronic device may sample it at intervals of 0.1 seconds, i.e., one frame every 0.1 seconds.
In one possible implementation, if the duration of the video to be identified is less than a preset duration (for example, 15 seconds), that is, the video is relatively short, N image frames can be selected uniformly at equal time intervals as the image frames to be identified, where N is an integer greater than 1. The N image frames may include the first and last frames of the video, where the first frame is the first frame seen visually and the last frame the last. For example, with N equal to 5 and a video duration of less than 15 seconds, 5 frames may be acquired uniformly at equal intervals, including the first and last frames of the video. Alternatively, the video may be divided into 5 equal-length parts and one image frame selected from each part, giving 5 image frames to be identified.
In another possible implementation, if the duration of the video to be identified is greater than or equal to the preset duration, the electronic device may select M image frames containing different scene information as the image frames to be identified, where M is an integer greater than 1. Alternatively, M may be greater than or equal to N. For example, the electronic device may invoke the extraction tool FFmpeg to extract M image frames containing different scene information, e.g. via a scene-change filter such as ffmpeg -i 1.mp4 -vf "select='gt(scene,0.5)'", where 0.5 is the threshold on the scene-change score above which a frame is treated as belonging to a new scene. Optionally, the M image frames may also include the first and last frames of the video to be identified.
If fewer than M image frames containing different scene information are selected from the video to be identified, the electronic device can traverse the video again and continue extracting frames until M frames have been extracted. Taking M equal to 9 as an example: if, after selecting one frame from each distinct scene of the video, the electronic device has obtained fewer than 9 frames (e.g., 5 frames), it needs to supplement them; conversely, if it has obtained 9 or more frames (i.e., at least M frames), all the extracted image frames are taken as the image frames to be identified.
Specifically, when traversing the image frames of the video to be identified, the electronic device may find, among the 5 extracted frames, the two adjacent frames with the longest interval between them, for example the 2nd and 3rd frames, and then select 1 frame from the image frames between them as a complementary frame; for example, the middle frame between the 2nd and 3rd frames may be selected, or any 1 frame between them. It then judges whether the number of selected image frames (e.g., 6 frames) is still smaller than M; if so, it continues traversing, again finding the two frames with the longest interval and selecting 1 frame between them, until the number of selected image frames equals M, yielding 9 image frames.
In yet another possible implementation, when the video to be identified is longer still, extracting only 9 image frames may miss too many frames; for example, for a video of 34 seconds, M may be set to 12 when the duration is 30 seconds or more. The values of N and M and the preset durations can be set according to the specific service scenario; the above is only an example and not a limitation on how the image frames to be identified of the video to be identified are obtained.
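As an illustration of the sampling strategy above, the following minimal sketch selects frame indices. The helper names, the use of a scene-change index list (e.g. as produced by the FFmpeg select filter mentioned above), and the values N=5, M=9 and the 15-second cutoff follow the examples given here and are not mandated by the patent.

```python
def sample_uniform(num_frames: int, n: int = 5) -> list[int]:
    """Pick n frame indices at equal intervals, always keeping
    the first and last frame of the video."""
    if num_frames <= n:
        return list(range(num_frames))
    step = (num_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]

def fill_to_m(indices: list[int], m: int) -> list[int]:
    """Repeatedly insert a complementary frame in the middle of the widest
    gap between two adjacent selected frames until m frames are chosen."""
    indices = sorted(indices)
    while len(indices) < m:
        gaps = [(b - a, i) for i, (a, b) in enumerate(zip(indices, indices[1:]))]
        widest, pos = max(gaps)
        if widest < 2:          # no room left to insert new frames
            break
        indices.insert(pos + 1, indices[pos] + widest // 2)
    return indices

def select_keyframes(duration_s: float, num_frames: int,
                     scene_change_idx: list[int],
                     n: int = 5, m: int = 9) -> list[int]:
    if duration_s < 15:                       # short video: uniform sampling
        return sample_uniform(num_frames, n)
    picked = sorted(set(scene_change_idx) | {0, num_frames - 1})
    if len(picked) >= m:                      # enough scene changes already
        return picked
    return fill_to_m(picked, m)               # pad widest gaps up to m frames
```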
S302, calling a target recognition model to acquire the characteristic representation information of the image to be recognized.
In the embodiment of the application, the target recognition model is a model for identifying the image to be identified. The feature representation information is obtained by the target recognition model processing the image to be identified and comprises text feature representation information and image semantic representation information; it can be understood as the image features and text features extracted by the target recognition model. Specifically, after acquiring the image frame to be identified, the electronic device may call the target recognition model to acquire the feature representation information of the image to be identified. The target recognition model comprises an image recognition module and a text recognition module: the image semantic representation information is obtained by the image recognition module processing the image to be identified, and the text feature representation information is obtained by the text recognition module processing the text information of the image to be identified.
In one possible implementation, the electronic device inputs the image to be identified into the image recognition module of the target recognition model for processing to obtain the image semantic representation information, which represents the image. The electronic device may also obtain the text information and tag information associated with the image to be identified: the text information may be the text contained in the image, and the tag information may be tags related to the content of the image. For example, if the image to be identified shows a mobile phone, the tag information may include the phone's model, brand, functions, and so on. The electronic device can acquire the text information contained in the image by calling an OCR service. The tag information and the text information are then input into the text recognition module of the target recognition model for processing to obtain the text feature representation information of the image to be identified. The electronic device can then call the target recognition model to process the text feature representation information and the image semantic representation information to obtain a recognition result for the image to be identified.
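The feature-extraction step S302 can be sketched as follows. This is a minimal sketch: `vit_encoder`, `bert_encoder`, `tokenizer` and `run_ocr` are hypothetical stand-ins for the ViT-style image branch, the BERT-style text branch and the OCR service mentioned in this description, not actual APIs of any library.

```python
import torch

def extract_features(image: torch.Tensor, tag_text: str,
                     vit_encoder, bert_encoder, tokenizer, run_ocr):
    """Return (image semantic representation, text feature representation)
    for one image to be identified. All callables are assumed helpers."""
    # Image branch: a ViT-style encoder produces the image semantic representation
    image_repr = vit_encoder(image.unsqueeze(0))          # shape [1, D_img]

    # Text branch: OCR text extracted from the image plus the image's tag info
    ocr_text = run_ocr(image)                             # text inside the image
    token_ids = tokenizer(tag_text + " " + ocr_text)
    text_repr = bert_encoder(token_ids)                   # shape [1, T, D_txt]

    return image_repr, text_repr
```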
S303, calling the target recognition model to process the characteristic representation information of the image to be recognized, and obtaining a recognition result of the image to be recognized.
In the embodiment of the application, the recognition result is used for indicating whether the image to be identified is of the target category, where the target category can be understood as meaning that the image includes unhealthy content of a specific type. Specifically, the electronic device may call the target recognition model to process the feature representation information of the image to be identified to obtain the recognition result. The target recognition model also comprises an image-text recognition module: the text feature representation information and the image semantic representation information of the image to be identified are fused based on the self-attention mechanism of the image-text recognition module to obtain fused feature representation information, and the fused feature representation information is processed by the image-text recognition module to obtain the recognition result.
In one possible implementation, the self-attention mechanism of the image-text recognition module may be multi-head self-attention, for example including two heads that process the image semantic representation information and the text feature representation information respectively, after which the processed image semantic representation information and text feature representation information can be fused to obtain the fused feature representation information. Optionally, before this fusion, the text feature representation information and the image semantic representation information of the image to be identified can be initially combined, for example by splicing them together, and the initially combined image-text features are then fused based on the self-attention mechanism of the image-text recognition module to obtain the fused feature representation information. The fused feature representation information is then processed, e.g. classified, by the image-text recognition module to obtain a recognition result indicating whether the image to be identified is of the target category.
It should be noted that the target recognition model may be a multi-modal fusion recognition model (a Transformer) based on a multi-head self-attention mechanism, where the image recognition module and the text recognition module are both single-modal classification branches, i.e., an image branch and a text branch; the multi-modal classification of the image-text recognition module is the main task, with the single-modal classifications serving as auxiliary tasks to achieve global optimization, i.e., the best multi-modal classification effect. The target recognition model is explained here with two single-modal branches, those of the image recognition module and the text recognition module, plus the multi-modal branch of the image-text recognition module. From the viewpoint of multi-modal learning, the text modality and the image modality are fused: text and image share the input of the target recognition model, semantic interaction across modalities is performed using the global nature of the self-attention mechanism, and the problem of joint image-text classification is solved by fusing image features and text features, which can improve the effect of joint image-text classification.
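A minimal PyTorch sketch of such an image-text fusion module is shown below. The splice-then-fuse structure follows the description above; the hidden sizes, number of layers, head count and mean pooling are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Sketch of the image-text recognition module: image and text
    representations are projected to a shared width, concatenated as one
    token sequence, fused by multi-head self-attention, and classified."""

    def __init__(self, d_img=768, d_txt=768, d_model=512, n_heads=8):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_model)
        self.txt_proj = nn.Linear(d_txt, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_head = nn.Linear(d_model, 2)     # target category vs. normal

    def forward(self, img_repr, txt_tokens):
        # img_repr: [B, 1, d_img]; txt_tokens: [B, T, d_txt]
        seq = torch.cat([self.img_proj(img_repr),
                         self.txt_proj(txt_tokens)], dim=1)  # splice, then fuse
        fused = self.fusion(seq)                  # self-attention across modalities
        return self.cls_head(fused.mean(dim=1))  # pooled sequence -> 2-way logits
```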
In another possible implementation, the target recognition model may further include a face recognition model branch in addition to the branches of the image recognition module and the text recognition module. The face recognition model can be used to identify the face region in the image to be identified, the key points of the face, and attribute information such as age and gender. This branch can strengthen the classification effect of the image-modality branch, for example by enhancing the feature response of the face region, which benefits the accuracy of the multi-modal classification. Optionally, the face recognition model may be trained using images in which the face detection boxes of minors are annotated as masks, so that the feature response of minors' face regions is enhanced and the category of images containing minors' faces is better identified. It should be noted that the face recognition model is an optional branch module that can run in parallel with the image recognition module, with its single-modal classification serving as an auxiliary task to achieve global optimization.
Specifically, the electronic device inputs the image to be identified into the image recognition module to obtain its image semantic representation information, inputs the image to be identified into the face recognition model to obtain its face semantic representation information, and inputs the tag information of the image and the text information it contains into the text recognition module to obtain its text feature representation information. Based on the self-attention mechanism of the image-text recognition module, the image semantic representation information, text feature representation information and face semantic representation information are fused to obtain fused feature representation information including the face semantics, which is then input into the image-text recognition module for processing to obtain the recognition result.
The image recognition module included in the target recognition model may be a convolutional neural network (CNN) or another neural network, for example the image classification model ViT (Vision Transformer); the ViT model acquires image semantic representation information well, is lightweight, and scales easily. The text recognition module may be a natural language processing (NLP) model such as BERT, for example the LICHEE model pre-trained by Tencent on large-scale Chinese corpora; as a pre-trained model trained on large-scale corpora, it has strong semantic understanding capability and can extract semantic features from text well. The image-text recognition module may be a model containing multi-head self-attention (a Transformer); it is a cross-modal model that can make full use of the interaction among the text, image and face features. Compared with building a separate model for each modality, this approach has low deployment cost and low resource consumption while achieving a better recognition effect.
Alternatively, the face recognition model may be the object detection model RetinaFace, a single-stage (one-stage) model, meaning that the category and location of the objects present can be obtained directly from the input image. The RetinaFace model can be trained on a dataset annotated with facial key points (e.g. the WIDER FACE dataset) and can be used for identifying face regions and multiple facial key points, classifying faces, and detecting difficult faces.
In one possible implementation, the recognition result is the recognition result of a single image to be recognized. If the image to be recognized is an image frame in a video to be recognized, the electronic device may acquire the recognition result of each image frame to be recognized included in the video, and determine the recognition result of the video based on the recognition result of each image frame and a preset hit rule, where the recognition result of the video to be recognized indicates whether the video is of the target category.
The preset hit rule may be a rule configured in the electronic device for determining the recognition result of the video to be recognized from the recognition results of its image frames. For example, if any one of the recognition results of the image frames included in the video indicates that the category of the corresponding frame is the target category, the electronic device may determine that the video is of the target category. As another example, if the recognition results of at least a certain proportion (e.g., 5%) of the image frames are the target category, the electronic device may determine that the video is of the target category. The application does not limit this; the hit rule can be defined according to the strictness the auditors require for the actual business scenario.
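As an illustration, the two example hit rules above can be sketched as follows; the function name, the boolean per-frame representation and the 5% default threshold are assumptions.

```python
def video_is_target(frame_results, rule="any", ratio_threshold=0.05):
    """frame_results: list of booleans, True if a frame was recognized as the
    target category. Two example rules from the text: 'any' (a single hit
    marks the whole video) and 'ratio' (at least a given proportion of
    frames hit). Real deployments may tune the threshold per scenario."""
    if not frame_results:
        return False
    if rule == "any":
        return any(frame_results)
    if rule == "ratio":
        return sum(frame_results) / len(frame_results) >= ratio_threshold
    raise ValueError(f"unknown rule: {rule}")
```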
In the embodiment of the application, the image to be recognized is input into the target recognition model, the text features and image features of the image to be recognized are extracted by the target recognition model, and the extracted features are processed to obtain a recognition result indicating whether the image to be recognized is of the target category. Because the target recognition model processes multi-dimensional feature information such as text content and image content, the subjective factors of manual auditing are eliminated, and auditors do not need background knowledge of the specific type of unhealthy content, which improves recognition accuracy and recognition effect. In addition, compared with manual processing, recognizing whether a specific type of unhealthy content exists through the target recognition model is faster, which can greatly improve recognition efficiency and, in turn, content auditing efficiency.
Referring to fig. 4, fig. 4 is another flow chart of an image recognition method according to an embodiment of the present application. The image recognition method may be performed by an electronic device, which may be the image recognition device 102 or the model training device 104 in the image recognition system shown in fig. 1 (the image recognition device 102 and the model training device 104 may be the same electronic device), or the image recognition device 205 or the model training device 206 in the image recognition system shown in fig. 2 (the image recognition device 205 and the model training device 206 may be the same electronic device). The image recognition method includes steps S401 to S403:
S401, acquiring training samples and tag information.
In an embodiment of the present application, there may be a plurality of training samples, and the label information may be training labels indicating the category to which a training sample belongs. The label information may include an image classification label, a text classification label, and an image-text classification label. The image classification label is the classification label of the image recognition module branch, the text classification label is the classification label of the text recognition module branch, and the image-text classification label is the classification label of the image-text recognition module branch. The image classification label may be the domain category to which the image semantic representation information belongs, for example landscape, sports, selfie, dance, or the like. The image classification label may also be, for example, normal image or induced image, where an induced image refers to an unhealthy image. The text classification label may be the domain category to which the text content belongs, for example comedy, news, society, etc. The text classification labels may also be, for example, unhealthy text of a specific type, suspected unhealthy text of the specific type, and normal text. The image-text classification label comprises two classification results: the target category and not the target category.
It should be noted that, because classification is somewhat subjective, it cannot always be determined conclusively whether text is unhealthy text of a specific type or normal text; adding a "suspected" class therefore improves the flexibility of the application policy.
Optionally, the face classification label is used to indicate whether the training sample includes a face of the target type. For example, the face region may be annotated with coordinates, key points, and attributes such as the age and gender of the face. It should be noted that the classification of each single-modal branch serves as an auxiliary task that helps the multi-modal branch extract better fusion feature representation information, so that classification can be performed better. The specific image classification labels and text classification labels can be user-defined; for example, they can be classification data obtained through a manually annotated multi-level classification tree that indicates whether unhealthy content exists in the image and in the text information respectively, or an open-source data set can be selected, for example labels for the domain category to which the image semantics and the text feature representation information belong.
Training samples of the specific type of unhealthy content are scarce and the business scenario is sparse; moreover, this specific type of unhealthy content involves images or text information of minors, so effective samples are difficult to obtain, and data synthesis can be used to increase the number of training samples. Because the images and the text information can be combined arbitrarily, the resulting images that include text information fall into different categories. Thus, when most of the image and text information is not aligned, it may be appropriate to train the initial recognition model with synthesized training samples.
In one possible implementation, an electronic device may obtain a text information set including a plurality of pieces of text information, each carrying a text mark that labels the corresponding text information as unhealthy or unsuitable text information. Further, the electronic device may obtain an image set, each image carrying an image mark, which may label the domain classification of the image. This is because combining an image with different text information in different scenes can yield different classification results; by marking the domain to which an image belongs, the specific category of a synthesized image can be determined later. Each image included in the image set contains no text information, i.e., the images used for synthesis are text-free images. The electronic device then synthesizes each piece of text information in the text information set with each image in the image set to obtain a plurality of images that include text information. The electronic device can call a visual geometry group synthetic text (Visual Geometry Group Network SynthText, VGG SynthText) model to combine the text-free images and the text information, quickly producing a large number of training samples (particularly negative samples) for training.
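A minimal sketch of this synthesis step is given below, using the Pillow library as a simple stand-in for the VGG SynthText model; the function name and the naive text placement are assumptions.

```python
from PIL import Image, ImageDraw

def synthesize_samples(images, texts):
    """images: list of (PIL.Image, image_mark); texts: list of (str, text_mark).
    Renders each text string onto each text-free image, keeping both marks
    for the later positive/negative labeling step."""
    samples = []
    for img, img_mark in images:
        for text, text_mark in texts:
            composed = img.copy()
            draw = ImageDraw.Draw(composed)
            draw.text((10, 10), text, fill="white")  # naive fixed placement
            samples.append({"image": composed,
                            "image_mark": img_mark,
                            "text_mark": text_mark})
    return samples
```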
Further, the electronic device may determine whether a synthesized image including text information is a positive or negative sample according to whether its text mark and image mark match a preset mark. If the text mark and the image mark of a first image among the plurality of images including text information match the preset mark, the first image is determined to be a negative sample among the training samples, and the image-text classification label of the negative sample is the target category; otherwise, if the text mark and the image mark of a second image among the plurality of images including text information do not match the preset mark, the second image is determined to be a positive sample among the training samples. It will be appreciated that the text marks and image marks are used to determine the positive and negative samples among the training samples. A preset mark is mark information formed by determining domain keywords and descriptive matching words based on business background knowledge of the unhealthy content; each preset mark specifies an image mark, a text mark and the corresponding category (i.e., positive or negative sample).
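The labeling rule just described can be sketched as follows; the dictionary keys and the representation of preset marks as (image mark, text mark) pairs are assumptions.

```python
def label_sample(sample, preset_marks):
    """preset_marks: set of (image_mark, text_mark) pairs derived from business
    background knowledge (domain keywords plus descriptive matching words).
    A match yields a negative sample of the target category; otherwise the
    synthesized sample is treated as positive."""
    if (sample["image_mark"], sample["text_mark"]) in preset_marks:
        sample["image_text_label"] = "target"      # negative sample
    else:
        sample["image_text_label"] = "not_target"  # positive sample
    return sample
```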
In one possible implementation, the electronic device may enrich the text corpus through multiple rounds of similar-text matching expansion using an NLP text similarity service. Specifically, one round of similar-text matching expansion may proceed as follows: the electronic device acquires a seed text and a set of text information to be marked, where the seed text carries a text mark, which may indicate that the corresponding text information is unhealthy or unsuitable content, or that it is normal content. The seed text may be hard-to-obtain text mined with focused effort by technicians, and may be normal text or unhealthy content. The electronic device then vectorizes the seed text and each piece of text information to be marked, and computes the distances between the vectors, i.e., determines the similarity between the seed text and each piece of text information in the set to be marked, so as to find similar texts and decide whether to add a text mark to each piece of text information in the set.
Further, when the electronic device determines that the similarity between a piece of text information and the seed text is high, it may add a text mark to that text information, i.e., use the seed text's mark as the mark of that text information; this applies especially when the seed text's mark indicates unhealthy content. Alternatively, when the electronic device determines that the similarity between a piece of text information and a seed text marked as unhealthy content is low, it may decide not to add a text mark to that text information, or to add the mark opposite to that of the seed text. After the expanded text information is obtained, the electronic device can acquire a new set of text information to be marked, vectorize the seed texts and each piece of text information to be marked again, compute the distances between the vectors to determine similarity, and again decide whether to add text marks. The text information can be vectorized by calling a word2vec model or a Bert (or Bert-variant) text vectorization model, which the application does not limit. The electronic device can then use the seed texts and the text information with added marks as the text information set to be synthesized with the images in the image set to obtain training samples.
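One round of the expansion can be sketched as follows, assuming the text vectors have already been produced by word2vec or a Bert-style encoder; the cosine-similarity measure and the 0.9 threshold are assumptions.

```python
import numpy as np

def expand_seed_texts(seed_vecs, seed_marks, cand_vecs, threshold=0.9):
    """Propagate a seed text's mark to any candidate whose cosine similarity
    with that seed exceeds the threshold; unmatched candidates stay unmarked."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    new_marks = {}
    for i, cand in enumerate(cand_vecs):
        for seed, mark in zip(seed_vecs, seed_marks):
            if cos(seed, cand) >= threshold:
                new_marks[i] = mark   # candidate inherits the seed's mark
                break
    return new_marks   # marked texts join the seed set for the next round
```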
In one possible implementation, the electronic device may similarly expand the images in the image set: it may acquire an image to be marked and determine the image mark of that image through object detection and an OCR recognition service. For example, the electronic device may invoke a you-only-look-once (YOLO) model (e.g., YOLOv5) to detect objects in the image to be marked, and may invoke an OCR service to obtain the text in the image, so that the domain classification to which the image belongs can be determined or recognized.
In one possible implementation, if a certain image style is common, stylization processing, for example cartoonization or another stylization, may be performed on each image in the image set, and both the original images and the stylized images are used as the image set. The application takes image cartoonization as an example: the electronic device can call a StyleGAN model or a similar technique to cartoonize the faces in the images. In the StyleGAN model, "style" can refer to the main attribute information of a face in an image, such as the person's pose, which amounts to the style of the face, including shape-level attributes such as expression, face orientation and hairstyle, as well as texture-level details such as skin color and facial illumination. Stylizing the images effectively increases the generalization of the samples, which amounts to sample augmentation and realizes low-resource learning optimization.
After the training samples are obtained, the classification labels of the three branches (image classification, text classification and image-text classification) can be annotated separately. The text content on the image does not need to be annotated, because the electronic device obtains it, together with the tag information, by calling the OCR service. Optionally, the tag information may include the title of the training sample. The text classification label may be determined from the text mark, from the tag information (for example the domain category to which the text information belongs), or from a preset text mark; the image classification label may be determined by the electronic device according to whether the image mark matches a preset image mark, or directly from the image mark; and the image-text classification label can be determined according to whether the text mark and the image mark match the preset mark. In this way the training samples and the label information are obtained.
S402, calling an initial recognition model to process the training sample, and obtaining a first classification result, a second classification result and a third classification result of the training sample.
In one possible implementation, the electronic device may invoke the initial recognition model to process the training sample to obtain a first classification result, a second classification result and a third classification result. The initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module. The first classification result is the classification result output by the image recognition module, the second classification result is the classification result output by the text recognition module, and the third classification result is the classification result output by the image-text recognition module.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an initial recognition model according to an embodiment of the present application. As shown in fig. 5, the initial recognition model includes an image recognition module, a text recognition module and an image-text recognition module. The training sample is input into the initial recognition model, which is called to obtain the feature representation information of the sample: the training sample can be input into the image recognition module for processing to obtain its image semantic representation information. The electronic device can also obtain the Tag information of the training sample, which may specifically include tags such as "lovely", "by hand", "sister", "go-go", "old man", etc. The electronic device may obtain the content title, account nickname and text information included in the image, such as title/Puin_Name, by invoking an OCR service. As shown in fig. 5, the text information recognized in the training sample by the OCR service may include "vacate hands, hands are not valid", "play to play", "do not play together" and so on.
Specifically, the training sample is input into the initial recognition model, and the image recognition module in the initial recognition model is called to obtain the image semantic representation information of the training sample and the first classification result output by the image recognition module. The tag information of the training sample and the text information included in the training sample are input into the text recognition module in the initial recognition model for processing, obtaining the text feature representation information of the training sample and the second classification result output by the text recognition module. The image semantic representation information and the text feature representation information are then input into the image-text recognition module in the initial recognition model for fusion processing to obtain fusion feature representation information, and the fusion feature representation information is processed by the image-text recognition module to obtain the third classification result output by the image-text recognition module. The initial recognition model is trained according to the first classification result, the second classification result, the third classification result and the label information to obtain the target recognition model.
Further, first difference data is determined based on the first classification result and the image classification label, second difference data is determined based on the second classification result and the text classification label, and third difference data is determined based on the third classification result and the image-text classification label. The parameters of the initial recognition model are adjusted according to the first difference data, the second difference data and the third difference data to obtain the target recognition model. Difference data refers to the difference between a classification result and the label information and may be, for example, a loss parameter (Loss).
Optionally, when the image recognition module in the initial recognition model is the ViT model and the text recognition module is the native Bert model, analyzing the multi-modal fusion classification effect from the perspective of overall optimization shows that the classification effects of the two single-modal branches are unbalanced, which can be seen by comparing the confusion matrices of the text classification branch and the image classification branch. When the text classification labels are the specific type of unhealthy content, suspected unhealthy content of the specific type, and normal text, the recall of the specific type of unhealthy content is low (for example, below 20%, with the recall of suspected unhealthy content below 75%), and only a small number of normal samples are falsely detected as suspected unhealthy content of the specific type; that is, recognition of the unhealthy samples is poor while accuracy is high. When the image classification labels are induced image and normal image, accuracy is low (e.g., 30% of normal images are falsely detected as picture-induced) while recall is high (e.g., 97%). Thus the text classification branch has higher accuracy but lower recall, so the recall of the multi-modal branch is limited by the text modality features; the image branch has higher recall but lower accuracy, so the accuracy of the multi-modal branch is mainly affected by the image modality features. Therefore, recall is enhanced for the text modality features and accuracy for the image modality features, which enhances the final recall or accuracy of the multi-modal features.
In one possible implementation, to improve the recall of the text classification branch, the backbone network of the text pre-training model may be strengthened; for example, the original Bert may be replaced by the litchi model, which is obtained through Chinese pre-training and knowledge distillation. Because the litchi model is trained on large-scale text information from the service field, it adapts better to the business scenario, recalls more potentially problematic content, and, in effect, has learned the domain background knowledge in a targeted manner; using it can therefore enhance recall, reduce the model size, and improve the model's processing speed. To improve the accuracy of the image classification branch, a face recognition model can, for example, be added in parallel with the image recognition module: the training sample is input into the face recognition model and processed to obtain the face semantic representation information and a fourth classification result output by the face recognition model. The initial recognition model is then trained according to the first classification result, the second classification result, the third classification result, the fourth classification result and the label information to obtain the target recognition model.
Referring to fig. 6, fig. 6 is another schematic structural diagram of an initial recognition model according to an embodiment of the present application. As shown in fig. 6, the initial recognition model includes an image recognition module, a text recognition module, a face recognition model and an image-text recognition module. On the basis of fig. 5, a face recognition model branch is added: the training sample is input into the face recognition model for processing to obtain the face semantic representation information of the training sample and the fourth classification result output by the face recognition model. The image-text recognition module can fuse the text feature representation information, the image semantic representation information and the face semantic representation information to obtain fusion feature representation information that includes the face semantic representation information, and process it to obtain the third classification result.
Further, first difference data is determined based on the first classification result and the image classification label, second difference data based on the second classification result and the text classification label, third difference data based on the third classification result and the image-text classification label, and fourth difference data based on the fourth classification result and the face classification label. The parameters of the initial recognition model are adjusted according to the first, second, third and fourth difference data to obtain the target recognition model.
And S403, training the initial recognition model based on the first classification result, the second classification result, the third classification result and the label information to obtain a target recognition model.
In one possible implementation, the electronic device may determine first difference data based on the first classification result and the image classification label, second difference data based on the second classification result and the text classification label, and third difference data based on the third classification result and the image-text classification label. Further, target difference data is determined based on the first, second and third difference data, the parameters of the initial recognition model are adjusted, and the adjusted initial recognition model is used as the target recognition model. The target difference data may be a weighted sum of the first, second and third difference data, as shown in Equation 1:
L_total = L_multi + w * L_text + w * L_image    (Equation 1)
Here L_total is the target difference data, L_image the first difference data, L_text the second difference data, and L_multi the third difference data. w is the weight of the first and second difference data; through experimental comparison its value may be, for example, 0.25. The loss functions producing the first, second and third difference data may be cross-entropy loss functions. Since negative samples are few, a balanced cross-entropy loss function (focal loss) can be used to mitigate the positive/negative sample imbalance. Alternatively, resampling may be used, i.e., copying the scarcer samples several times, to balance the positive and negative samples. Optionally, the imbalance can also be mitigated by actively mining hard-to-classify samples and manually re-annotating them. These three approaches (resampling, a balanced cross-entropy loss function, and active mining of hard samples) address the positive/negative sample imbalance from different angles, and one or more of them can be used together.
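A minimal PyTorch sketch of the balanced cross-entropy (focal) loss mentioned above follows; the alpha and gamma values are common defaults rather than values fixed by the text.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                        # probability of the true class
    loss = alpha * (1.0 - p_t) ** gamma * ce    # down-weight easy examples
    return loss.mean()
```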
Alternatively, the model parameters of the initial recognition model may be adjusted by gradient descent. When updating the model parameters with gradient descent, the gradient of the difference data (e.g., of the loss function) is computed, and the model parameters are updated iteratively along the gradient so that the initial recognition model converges gradually and its prediction accuracy improves.
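A generic sketch of one such update step in PyTorch follows; the function signature is an assumption, and any gradient-descent optimizer (e.g., SGD) can be passed in.

```python
def train_step(model, optimizer, inputs, labels, loss_fn):
    logits = model(inputs)
    loss = loss_fn(logits, labels)   # the difference data (loss)
    optimizer.zero_grad()
    loss.backward()                  # compute the gradient of the loss
    optimizer.step()                 # adjust parameters along the gradient
    return loss.item()
```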
Optionally, the electronic device may determine fourth difference data based on the fourth classification result and the face classification label. The face recognition model is called to process the training sample to obtain the fourth classification result of the training sample and a facial thermodynamic diagram. The facial thermodynamic diagram may also be called an attention map, heat map or class activation map (CAM); the CAM activation map is introduced to assist classification training. The class activation map is a class response map generated from the face recognition model: using only image-level label annotations, it can roughly locate the discriminative object regions in the image, map the response magnitude of the feature map back to the original image, and make the model's behavior easier to understand intuitively. For example, the face region of a minor may be obtained, so that the response of the minor's face region can be enhanced and the probability of a normal image being falsely detected is reduced.
The electronic device may determine the fourth difference data based on the fourth classification result, the facial thermodynamic diagram, and the face classification label of the training sample. Specifically, the fourth difference data may consist of two parts: one part is determined from the fourth classification result and the face classification label of the training sample, and the other part is determined from the facial thermodynamic diagram.
Referring to fig. 7, fig. 7 is a schematic structural diagram of training the initial recognition model. As shown in fig. 7, the training sample is input into the face recognition model, the face semantic representation information of the training sample is extracted by the face recognition model, the extracted features are processed by global average pooling (GAP) and classified to obtain the classification results (logits), difference data is determined from the classification results and the face classification label, and difference data can also be determined from the facial thermodynamic diagram, i.e., the class response map generated from the hidden features in the face recognition model. As shown in fig. 7, the facial thermodynamic diagram may highlight the face region included in the training sample. Specifically, the electronic device may determine the difference data from the facial thermodynamic diagram as shown in Equation 2:
Loss_cam = -sum(mask_outside[mask_outside < 0])    (Equation 2)
Here Loss_cam is the difference data calculated from the facial thermodynamic diagram. Each point in the facial thermodynamic diagram corresponds to a value whose magnitude represents that point's contribution to the recognition result. Alternatively, a smaller value may indicate a larger contribution to the recognition result. To strengthen the face region of the person, other regions can be suppressed: mask_outside[mask_outside < 0] selects, within the regions other than the face, the values less than 0 to be excluded, so that the face portion is enhanced individually and the influence of non-face regions is removed.
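Under the interpretation above, Equation 2 can be sketched in PyTorch as follows; the tensor names and the binary face-mask representation are assumptions, and the exact formulation in a real implementation may differ.

```python
import torch

def cam_loss(cam, face_mask):
    """cam: class activation map; face_mask: 1.0 inside the annotated face
    region, 0.0 elsewhere, with the same spatial shape as `cam`."""
    outside = cam * (1.0 - face_mask)   # responses outside the face region
    negative = outside[outside < 0]     # values less than 0 outside the face
    return -negative.sum()              # Loss_cam = -sum(...), non-negative
```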
Further, the electronic device may determine the target difference data based on the first, second, third and fourth difference data, for example based on their weighted sum, as shown in Equation 3:
L_total = L_multi + w * L_text + w * L_image + k * Loss_face + k * Loss_cam    (Equation 3)
Here L_total is the target difference data, L_image the first difference data, L_text the second difference data, and L_multi the third difference data. w is the weight of the first and second difference data, and k is the weight of the fourth difference data, whose two parts are Loss_face, determined from the fourth classification result and the face classification label, and Loss_cam, the difference data calculated from the facial thermodynamic diagram to enhance the classification of the auxiliary branch. The value of k may be determined experimentally and is not limited by the application; k may be 1, for example. In this way, end-to-end overall optimization is achieved through the collaborative training of the single-modal branches and the multi-modal branch, finally yielding a better fused feature representation of the image and the text.
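Assembling the pieces, Equation 3 can be written directly; w = 0.25 and k = 1 are the example values mentioned in the text.

```python
def total_loss(l_multi, l_text, l_image, loss_face, loss_cam, w=0.25, k=1.0):
    # Target difference data: multi-modal loss plus weighted single-modal
    # auxiliary losses (text, image, face classification, CAM suppression).
    return l_multi + w * l_text + w * l_image + k * loss_face + k * loss_cam
```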
The electronic device adjusts the parameters of the initial recognition model based on the target difference data and uses the adjusted initial recognition model as the target recognition model, completing training. Optionally, after the target recognition model is obtained, training samples may be acquired periodically to retrain and update it.
The target recognition model provided by the application can be trained end to end: instead of pre-training an initial model on training samples and then fine-tuning the pre-trained model on a small number of samples, the model directly classifies the input samples to obtain classification results and adjusts its parameters accordingly, without a separate fine-tuning stage. Compared with the two-stage pre-training-and-fine-tuning scheme, end-to-end training helps improve recall, and the single-modal classification branches serve as auxiliary objectives that help optimize the multi-modal classification. End-to-end optimization can jointly optimize the three processing modules for text information, images and image-text fusion, so that the target recognition model can learn how an individual modality affects the final classification result as well as how the fusion of image and text information affects it, achieving global optimization and avoiding local overfitting.
In the embodiment of the application, the image to be recognized is input into the target recognition model, the text features and image features of the image to be recognized are extracted by the target recognition model, and the extracted features are processed to obtain a recognition result indicating whether the image to be recognized is of the target category. Because the target recognition model processes multi-dimensional feature information such as text content and image content, the subjective factors of manual auditing are eliminated, and auditors do not need background knowledge of the specific type of unhealthy content, which improves recognition accuracy and recognition effect. In addition, compared with manual processing, recognizing whether a specific type of unhealthy content exists through the target recognition model is faster, which can greatly improve recognition efficiency and, in turn, content auditing efficiency.
Referring to fig. 8, a video type determining method according to an embodiment of the application is further described below. Fig. 8 is a schematic flow chart of a video type determining method according to an embodiment of the present application.
In the overall information flow (feeds) content circulation and processing, on one hand, the electronic device can call the initial recognition model to process training samples, and samples whose manual labels and model outputs disagree are manually re-labeled and used for further model training, so that several rounds of iteration rapidly improve sample labeling quality. On the other hand, user negative feedback (such as reported and complained content) and the specific type of unhealthy content found by active patrol are continuously collected and added as reliable training samples, forming a closed loop. User feedback such as complaint reports and auditors' active inspection is very important for collecting samples of unhealthy content, and is a good source and channel for verifying the technical approach and the final business effect.
In one possible implementation, the content production end may obtain the interface address of the uplink-and-downlink content interface server through the mobile end or the back-end interface application programming interface (API), provide the published content to the uplink-and-downlink content interface server, and, when doing so, select or upload a corresponding content cover image and optimize the content. The content consumption end can obtain the published content for browsing by calling the content delivery outlet server to obtain index information of the published content, such as a download address or a uniform resource locator (URL). While the content production end provides published content and the content consumption end obtains it, behavior data of the browsing user such as reading speed, completion rate, reading time, playback stutter, loading time and play clicks can be reported.
The content production end may include PGC, UGC, MCN and professional-user generated content (PUGC, a combination of PGC and UGC) producers. Through the mobile end (such as a terminal device) or the back-end interface API system, they provide local or newly shot content, or authored image-text content (such as official-account articles and image collections), and the user publishing the content can choose to upload the cover image of the corresponding content; the content uploaded by the content production end is the main content source for subsequent distribution. Through communication between the content production end and the uplink-and-downlink content interface service, the content production end obtains the interface address of the uplink-and-downlink content interface server for uploading published content and then uploads the local file; when the published content is video content, the publishing user can select matching music, a filter template, video beautification functions and the like for the local video content during shooting.
The content consumption end communicates with the content distribution outlet server to obtain the index information of the corresponding content, such as a download address or URL. When obtaining delivered video content, the content consumption end downloads the corresponding streaming media file and plays it through the local player; when obtaining delivered image-text content, the content consumption end can communicate directly with the edge-deployed content delivery network (CDN) service. The content consumption end usually browses published content as a feeds stream, and provides an entry for directly reporting complaints and feedback, particularly for the specific type of unhealthy content; this entry can be connected to the manual auditing system so that auditors can confirm and review the content, and the review results can be stored in the sample database as a source of training samples for subsequent training of the initial recognition model.
The content submitted from the content production end generally includes the title, publisher, summary, cover image, publication time and the like. The published content may be image-text, video, and so on, and enters the server through the uplink-and-downlink content interface server, which stores the published content file in the content storage service. The uplink-and-downlink content interface server also writes the meta information of the published content, such as the file size, cover image link, code rate, file format, title, publication time and author uploaded by the content production end, into the content database, and submits the uploaded file and the content meta information to the dispatch center service for subsequent content processing and circulation.
The content database is the core database of the content; the meta information of the content published by all content production ends is stored in this service database. The meta information of the content itself includes, for example, the file size, cover image link, code rate, file format, title, publication time, author, video file size, video format, whether the content carries an originality mark, and the classification of the content in the manual review process, specifically the first-, second- and third-level classification and the tag information. For example, for a video explaining a certain brand of mobile phone, the first-level classification may be science and technology, the second-level classification smartphone, the third-level classification domestic mobile phone, and the tag information the brand and model of the phone. During manual auditing, auditors can read information from the content database, and the results and status of manual auditing are written back to the content database. The processing of the published content by the dispatch center mainly includes machine processing and manual auditing. Machine processing refers to various quality judgments, such as low-quality filtering and content tagging (e.g., the three-level classification and tag information); content deduplication can also be done by machine, and the machine-processing results are written into the content database as well. Optionally, fully duplicated content is not sent for repeated secondary manual processing. When tag information is subsequently acquired, the meta information of the content can be read from the content database, which is also the source of the meta information acquired for the specific type of unhealthy content.
The dispatch center service is mainly used for the circulation of published content, such as video and image-text content, i.e., the whole scheduling process of published content: it receives content entering the database through the uplink-and-downlink content interface server and then acquires the meta information of the content from the content meta-information database. As the actual dispatcher of the image-text and video links, the dispatch center service can, according to the content type, schedule the image recognition service system to process the picture content in the links and directly filter it or tag it for down-weighting or restricted distribution. The dispatch center service also schedules the manual auditing system and the machine processing system and controls their ordering and priority. In addition, the dispatch center service enables content through the manual auditing system and then provides the content index information, obtained by the content consumption end through the content outlet distribution service (usually a recommendation engine, a search engine, or operations), to the content consumers on terminals through direct presentation pages.
The manual auditing system is usually a World Wide Web (WEB) system used to receive the machine-processing results on the link, manually confirm and recheck them, and write the recheck results into the meta-information database; the manual recheck results also allow online evaluation of the actual effect of the machine processing and filtering models. The server can record detailed auditing flow including the source of each task, the audit result, the audit start time, the audit end time, and so on. The manual auditing system can connect to the complaint and content-report entry at the content consumption end and to the reviewers' recheck system, so as to process with high priority the specific type of unhealthy content found through complaints, reports and active patrol; meanwhile, the manual audit results provide, in the sample database, a data foundation for subsequently building the target recognition model.
The content storage service is usually a group of widely distributed storage servers accessed by users from nearby nodes; optionally, CDN acceleration servers can be deployed for distributed cache acceleration. It stores the video and picture content uploaded by content producers through the uplink-and-downlink content interface server. After the content consumption end obtains the content index information, it can also directly access the video content storage service to download the corresponding content. Besides serving as the data source for external services, the content storage service also serves as an internal data source from which the download file system obtains original content data for related processing; the internal and external data paths are usually deployed separately to avoid mutual interference.
The sample database acquires manually audited and marked content from the meta-information database and the content storage service as the prototype database for building the sample library, and stores the specific unhealthy samples discovered through reports and auditors' manual patrol. The sample database may periodically (e.g., weekly) extract new unhealthy content of the specific type for updating the model.
The target recognition model is the multi-modal fusion image recognition model based on a multi-head attention mechanism, constructed with the model structures, training-sample acquisition and training methods shown in fig. 5 and fig. 6. The training samples and meta information used for model training come from the sample database and the content database.
The image recognition service turns the target recognition model constructed above into a service that can be invoked on the link to recognize and tag the specific type of unhealthy content.
The download file system downloads the original published content from the content storage server and controls the download speed and progress. It may be a cluster of parallel servers with the associated task scheduling and distribution clusters. When the published content is a video, the frame extraction service can be invoked to obtain key frames of the video file from the video source file for the subsequent target recognition model service.
The frame extraction service extracts frames from, and where needed supplements frames of, the video according to the frame extraction and supplement method. The download file system downloads the published video source file from the content storage service for the primary processing of file characteristics, i.e., video frame extraction, and the extracted image frames serve as the image frames to be recognized, i.e., the input, when the image recognition service is subsequently invoked.
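A minimal frame-extraction sketch using OpenCV follows; evenly spaced sampling that always includes the first and last frames is an assumption, since the text does not fix the extraction strategy.

```python
import cv2

def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    n = min(num_frames, total)
    # Evenly spaced indices, always covering the first and last frame.
    indices = [0] if n == 1 else [round(i * (total - 1) / (n - 1)) for i in range(n)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```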
The statistical server receives the consumption flow reports of the content auditing end and the content consumption end; the consumption flow can include the content browsed by the user, the browsing time point, and, when the content is a video, behavior data such as comments, likes and forwards on the published content. The statistical server also statistically mines and analyzes the reported flow and provides the dispatch center service with monitoring and analysis of the content enablement rate and the backlog delay of content auditing.
In the embodiment of the application, the image to be recognized is input into the target recognition model, the text features and image features of the image to be recognized are extracted by the target recognition model, and the extracted features are processed to obtain a recognition result indicating whether the image to be recognized is of the target category. Because the target recognition model processes multi-dimensional feature information such as text content and image content, the subjective factors of manual auditing are eliminated, and auditors do not need background knowledge of the specific type of unhealthy content, which improves recognition accuracy and recognition effect. In addition, compared with manual processing, recognizing whether a specific type of unhealthy content exists through the target recognition model is faster, which can greatly improve recognition efficiency and, in turn, content auditing efficiency.
It can be understood that, in the specific embodiments of the present application, data such as the image to be processed, sample data, the content browsed by a user, browsing time points, the playback completion progress watched by a user, and users' comments, likes and forwards on published content are involved; when the above embodiments of the present application are applied to specific products or technologies, the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. The image recognition apparatus 90 according to an embodiment of the present application may be disposed on an electronic device, which may be the image recognition device in fig. 1. The image recognition apparatus 90 includes the following units:
an acquisition unit 901 for acquiring an image to be recognized;
a calling unit 902, configured to call a target recognition model to obtain feature representation information of the image to be recognized, where the feature representation information includes text feature representation information and image semantic representation information;
the calling unit 902 is further configured to call the target recognition model to process the feature representation information of the image to be recognized, so as to obtain a recognition result of the image to be recognized, where the recognition result is used to indicate whether the image to be recognized is of a target class;
the target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data, and the initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module; the first difference data is determined according to a first classification result of the training sample and an image classification label by the image recognition module; the second difference data is determined according to a second classification result and a text classification label of the training sample by the text recognition module; the third difference data is determined according to a third classification result of the training sample by the image-text recognition module and an image-text classification label, and the image-text classification label is used for indicating whether the training sample is of the target class.
In one implementation manner, the invoking unit 902 is configured to invoke the object recognition model to obtain the feature representation information of the image to be recognized, and specifically is configured to:
inputting the image to be identified into the image identification module for processing to obtain image semantic representation information of the image to be identified;
acquiring label information of the image to be identified and text information included in the image to be identified;
and inputting the label information of the image to be identified and the text information included in the image to be identified into the text identification module for processing to obtain the text characteristic representation information of the image to be identified.
In one implementation manner, the calling unit 902 is configured to call the target recognition model to process the feature representation information of the image to be recognized to obtain a recognition result of the image to be recognized, and specifically is configured to:
based on the self-attention mechanism of the image-text recognition module, carrying out fusion processing on the text feature representation information and the image semantic representation information of the image to be recognized to obtain fusion feature representation information;
and inputting the fusion characteristic representing information into the image-text recognition module for processing to obtain the recognition result.
In one implementation manner, the acquiring unit 901 is further configured to acquire a training sample and tag information, where the tag information includes an image classification tag, a text classification tag, and an image-text classification tag;
the calling unit 902 is further configured to call an initial recognition model to process the training sample, so as to obtain a first classification result, a second classification result, and a third classification result of the training sample;
the training unit 903 is configured to train the initial recognition model based on the first classification result, the second classification result, the third classification result, and the tag information, so as to obtain a target recognition model.
In one implementation manner, the training unit 903 is configured to train the initial recognition model based on the first classification result, the second classification result, the third classification result, and the tag information to obtain a target recognition model, which is specifically configured to:
determining first difference data based on the first classification result and the image classification label;
determining second difference data based on the second classification result and the text classification label;
determining third difference data based on the third classification result and the image-text classification label;
Determining target difference data based on the first difference data, the second difference data, and the third difference data;
and adjusting parameters of the initial recognition model based on the target difference data, and taking the adjusted initial recognition model as a target recognition model.
In one implementation, the initial recognition model further includes a face recognition model; the training unit 903 is configured to determine target difference data based on the first difference data, the second difference data, and the third difference data, and specifically configured to:
invoking a face recognition model to process the training sample to obtain a fourth classification result and a facial thermodynamic diagram of the training sample;
determining fourth difference data based on the fourth classification result, the facial thermodynamic diagram, and a facial classification label of the training sample, the facial classification label being used to indicate whether the training sample includes a target type of face;
target difference data is determined based on the first difference data, the second difference data, the third difference data, and the fourth difference data.
In one implementation manner, the obtaining unit 901 is further configured to obtain a text information set, where each text information in the text information set carries a text mark;
The acquiring unit 901 is further configured to acquire an image set, where each image in the image set carries an image mark; each of the images does not include text information;
a synthesizing unit 904, configured to synthesize each text information in the text information set with each image in the image set, so as to obtain a plurality of images including the text information;
a determining unit 905, configured to determine that a first image in the plurality of images including text information is a negative sample in the training samples if the text mark and the image mark of the first image match a preset mark, where the image-text classification label of the negative sample is the target category;
the determining unit 905 is further configured to determine that a second image in the plurality of images including text information is a positive sample in the training samples if the text mark and the image mark of the second image do not match the preset mark.
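The synthesis and labeling logic can be sketched as follows; the rendering call, the mark containers, and the numeric label convention (1 for the target category) are illustrative assumptions.

```python
from PIL import Image, ImageDraw

def synthesize_samples(marked_texts, marked_images, preset_marks):
    """Illustrative only: cross-combine marked texts and text-free marked images."""
    samples = []
    for text, text_mark in marked_texts:            # each text carries a text mark
        for path, image_mark in marked_images:      # each image carries an image mark
            img = Image.open(path).convert("RGB")
            ImageDraw.Draw(img).text((10, 10), text, fill="red")  # render the text onto the image
            # both marks match the preset marks -> negative sample of the target category
            negative = text_mark in preset_marks and image_mark in preset_marks
            samples.append((img, 1 if negative else 0))  # 1: target category, 0: positive sample
    return samples
```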
In one implementation manner, the acquiring unit 901 is configured to acquire a text information set, and specifically is configured to:
acquiring a seed text and a text information set to be marked, wherein the seed text carries text marks;
determining whether to add text marks to each text message in the text message set to be marked according to the similarity between the seed text and each text message in the text message set to be marked;
and taking the seed text and the text information added with the text mark as the text information set.
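A minimal sketch of this seed-text expansion, assuming a string-similarity measure and a fixed threshold (both assumptions; a production system might instead use embedding similarity):

```python
from difflib import SequenceMatcher

def expand_text_set(seed_texts, candidates, threshold=0.8):
    """Illustrative only: add text marks to candidates similar to a marked seed text."""
    marked = list(seed_texts)                       # seed texts already carry text marks
    for cand in candidates:
        best = max(SequenceMatcher(None, seed, cand).ratio() for seed in seed_texts)
        if best >= threshold:                       # similar enough to inherit the text mark
            marked.append(cand)
    return marked                                   # the resulting text information set
```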
In one implementation, the image to be identified is an image frame to be identified in the video to be identified, and the apparatus further includes:
a selecting unit 906, configured to select M image frames including different scene information from the video to be identified as the image frames to be identified if the duration of the video to be identified is greater than or equal to a preset duration; M is an integer greater than 1; the M image frames include the first frame and the last frame of the video to be identified.
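One way to realize such scene-aware frame selection is sketched below using histogram comparison; the histogram configuration and the scene threshold are assumptions, not taken from the patent.

```python
import cv2

def select_frames(video_path, m=5, scene_threshold=0.4):
    """Illustrative only: pick up to M scene-distinct frames, keeping first and last."""
    cap = cv2.VideoCapture(video_path)
    candidates, prev_hist, last_frame = [], None, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        last_frame = frame
        hist = cv2.calcHist([frame], [0], None, [64], [0, 256])
        cv2.normalize(hist, hist)
        # a large histogram distance is treated as a scene change
        if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > scene_threshold:
            candidates.append(frame)                # the first frame always lands here
            prev_hist = hist
    cap.release()
    selected = candidates[: m - 1]
    if last_frame is not None and (not selected or selected[-1] is not last_frame):
        selected.append(last_frame)                 # guarantee the last frame is included
    return selected
```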
In one implementation manner, the acquiring unit 901 is configured to acquire a recognition result of each image frame to be recognized included in the video to be recognized;
the determining unit 905 is configured to determine the recognition result of the video to be identified based on the recognition result of each image frame to be identified and a preset hit rule, where the recognition result of the video to be identified is used to indicate whether the video to be identified is of the target category.
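The preset hit rule is not fixed by the patent; a minimal sketch, assuming the rule simply counts flagged frames, could look like this:

```python
def video_recognition_result(frame_results, min_hits=1):
    """Illustrative only: a counting-based hit rule over per-frame recognition results."""
    hits = sum(1 for flagged in frame_results if flagged)
    return hits >= min_hits  # True: the video is treated as the target category
```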
According to one embodiment of the application, the steps involved in the methods of fig. 3 and fig. 4 may be performed by the units in the image recognition device of fig. 9. For example, step S301 shown in fig. 3 is performed by the acquisition unit 901 shown in fig. 9, and steps S302 and S303 are performed by the calling unit 902 shown in fig. 9. As another example, step S401 shown in fig. 4 is performed by the acquisition unit 901 shown in fig. 9, step S402 is performed by the calling unit 902 shown in fig. 9, and step S403 is performed by the training unit 903 shown in fig. 9.
According to another embodiment of the present application, the units in the image recognition apparatus shown in fig. 9 may be separately or jointly combined into one or several other units, or one (or more) of them may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present application, the image recognition device may also include other units, and in practical applications these functions may be implemented with the assistance of, and through the cooperation of, multiple other units.
In the embodiment of the application, the image to be identified is input into the target recognition model, the text features and image features of the image to be identified are extracted by the target recognition model, and the extracted features are processed to obtain a recognition result indicating whether the image to be identified is of the target category. Because the target recognition model processes multi-dimensional feature information such as text content and image content, the subjective factors of manual auditing are removed, and auditors are not required to have background knowledge of specific types of unhealthy content, which improves recognition accuracy and recognition effect. In addition, compared with manual processing, identifying whether a specific type of unhealthy content exists through the target recognition model increases processing speed, greatly improving recognition efficiency and, with it, content auditing efficiency.
Based on the above description of the embodiments of the image recognition method, the embodiment of the present application further discloses an electronic device. Referring to fig. 10, the electronic device may include at least a processor 1001, a communication interface 1002, and a computer storage medium 1003, which may be connected by a bus or in other ways.
The computer storage medium 1003 is a memory device in the electronic device for storing programs and data. It is understood that the computer storage medium 1003 here may include both a built-in storage medium of the electronic device and an extended storage medium supported by the electronic device. The computer storage medium 1003 provides storage space that stores the operating system of the electronic device, as well as one or more instructions adapted to be loaded and executed by the processor 1001; these instructions may be one or more computer programs (including program code). The computer storage medium here may be a high-speed RAM, or at least one computer storage medium remote from the foregoing processor. The processor 1001, which may be referred to as a central processing unit (Central Processing Unit, CPU), is the computing core and control center of the electronic device, and is adapted to load and execute the one or more instructions to implement the corresponding method flow or function.
In one implementation, the processor 1001 may load and execute one or more first instructions stored in the computer storage medium to implement the corresponding steps of the image recognition method embodiment described above; in a specific implementation, the one or more first instructions in the computer storage medium are loaded by the processor 1001 to perform the following steps:
acquiring an image to be identified;
invoking a target recognition model to acquire feature representation information of the image to be recognized, wherein the feature representation information comprises text feature representation information and image semantic representation information;
the target recognition model is called to process the feature representation information of the image to be recognized to obtain a recognition result of the image to be recognized, and the recognition result is used for indicating whether the image to be recognized is of a target category;
the target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data, and the initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module; the first difference data is determined according to a first classification result of the image recognition module on the training sample and an image classification label; the second difference data is determined according to a second classification result of the text recognition module on the training sample and a text classification label; the third difference data is determined according to a third classification result of the image-text recognition module on the training sample and an image-text classification label, and the image-text classification label is used for indicating whether the training sample is of the target category.
In one implementation, the processor 1001 loads and executes one or more first instructions stored in the computer storage medium to call the target recognition model to obtain the feature representation information of the image to be recognized, and the one or more first instructions are specifically used to:
inputting the image to be identified into the image recognition module for processing to obtain image semantic representation information of the image to be identified;
acquiring label information of the image to be identified and text information included in the image to be identified;
and inputting the label information of the image to be identified and the text information included in the image to be identified into the text recognition module for processing to obtain the text feature representation information of the image to be identified.
In one implementation manner, the processor 1001 loads and executes one or more first instructions stored in the computer storage medium to call the target recognition model to process the feature representation information of the image to be recognized to obtain the recognition result of the image to be recognized, and the one or more first instructions are specifically used to:
based on the self-attention mechanism of the image-text recognition module, carrying out fusion processing on the text feature representation information and the image semantic representation information of the image to be recognized to obtain fusion feature representation information;
and inputting the fusion feature representation information into the image-text recognition module for processing to obtain the recognition result.
In one implementation, one or more computer programs in the computer storage media described above are loaded by the processor 1001 and perform the steps of:
acquiring training samples and label information, wherein the label information comprises an image classification label, a text classification label and an image-text classification label;
calling an initial recognition model to process the training sample to obtain a first classification result, a second classification result and a third classification result of the training sample;
training the initial recognition model based on the first classification result, the second classification result, the third classification result and the label information to obtain a target recognition model.
In one implementation, the processor 1001 loads and executes one or more first instructions stored in a computer storage medium to train the initial recognition model based on the first classification result, the second classification result, the third classification result, and the tag information to obtain a target recognition model, which is specifically used to:
determining first difference data based on the first classification result and the image classification label;
determining second difference data based on the second classification result and the text classification label;
determining third difference data based on the third classification result and the image-text classification label;
determining target difference data based on the first difference data, the second difference data, and the third difference data;
and adjusting parameters of the initial recognition model based on the target difference data, and taking the adjusted initial recognition model as a target recognition model.
In one implementation, the initial recognition model further includes a face recognition model; the processor 1001 loads and executes one or more first instructions stored in a computer storage medium to determine target difference data based on the first difference data, the second difference data, and the third difference data, specifically:
invoking the face recognition model to process the training sample to obtain a fourth classification result and a facial heat map of the training sample;
determining fourth difference data based on the fourth classification result, the facial heat map, and a facial classification label of the training sample, the facial classification label being used to indicate whether the training sample includes a target type of face;
target difference data is determined based on the first difference data, the second difference data, the third difference data, and the fourth difference data.
In one implementation, one or more computer programs in the computer storage media described above are loaded by the processor 1001 and perform the steps of:
acquiring a text information set, wherein each text information in the text information set carries a text mark;
acquiring an image set, wherein each image in the image set carries an image mark; each of the images does not include text information;
combining each text message in the text message set with each image in the image set to obtain a plurality of images comprising the text message;
if the text mark and the image mark of a first image in the plurality of images comprising text information are matched with a preset mark, determining that the first image is a negative sample in the training sample, and the image-text classification label of the negative sample is the target category;
and if the text mark and the image mark in the second image in the plurality of images comprising the text information are not matched with the preset mark, determining the second image as a positive sample in the training samples.
In one implementation, the processor 1001 loads and executes one or more first instructions stored in the computer storage medium to obtain a text information set, specifically for:
acquiring a seed text and a text information set to be marked, wherein the seed text carries text marks;
determining whether to add text marks to each text message in the text message set to be marked according to the similarity between the seed text and each text message in the text message set to be marked;
and taking the seed text and the text information added with the text mark as the text information set.
In one implementation, the image to be identified is an image frame to be identified in the video to be identified, and one or more computer programs in the computer storage medium are loaded by the processor 1001 and perform the following steps:
if the duration of the video to be identified is greater than or equal to a preset duration, selecting M image frames comprising different scene information from the video to be identified as the image frames to be identified; M is an integer greater than 1; the M image frames include the first frame and the last frame of the video to be identified.
In one implementation, the image to be identified is an image frame to be identified in the video to be identified, and one or more computer programs in the computer storage medium are loaded by the processor 1001 and perform the following steps:
acquiring the recognition result of each image frame to be identified included in the video to be identified;
and determining the recognition result of the video to be identified based on the recognition result of each image frame to be identified and a preset hit rule, wherein the recognition result of the video to be identified is used for indicating whether the video to be identified is of the target category.
The specific implementation of each step executed by the processor 1001 in the embodiment of the present application may refer to the description of the related content in the foregoing embodiment, and the same technical effects may be achieved, which is not repeated herein.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and a processor runs the computer program to enable the electronic device to execute the method provided by the previous embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device performs the method provided by the foregoing embodiment.
The order of the steps in the method of the embodiment of the application may be adjusted, and steps may be combined or deleted, according to actual needs.
Those skilled in the art will appreciate that the processes implementing all or part of the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, and the program may include the processes of the embodiments of the methods as above when executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.
The above disclosure is only a preferred embodiment of the present application, and it should be understood that the scope of the application is not limited thereto; those skilled in the art will appreciate that equivalent changes made in accordance with the claims still fall within the scope of the present application.

Claims (13)

1. An image recognition method, the method comprising:
acquiring an image to be identified;
invoking a target recognition model to acquire feature representation information of the image to be recognized, wherein the feature representation information comprises text feature representation information and image semantic representation information;
invoking the target recognition model to process the feature representation information of the image to be recognized to obtain a recognition result of the image to be recognized, wherein the recognition result is used for indicating whether the image to be recognized is of a target category;
the target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data, and the initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module; the first difference data is determined according to a first classification result of the image recognition module on the training sample and an image classification label; the second difference data is determined according to a second classification result of the text recognition module on the training sample and a text classification label; the third difference data is determined according to a third classification result of the image-text recognition module on the training sample and an image-text classification label, and the image-text classification label is used for indicating whether the training sample is of the target category.
2. The method according to claim 1, wherein the invoking the target recognition model to obtain the feature representation information of the image to be recognized includes:
inputting the image to be identified into the image recognition module for processing to obtain image semantic representation information of the image to be identified;
acquiring label information of the image to be identified and text information included in the image to be identified;
and inputting the label information of the image to be identified and the text information included in the image to be identified into the text recognition module for processing to obtain the text feature representation information of the image to be identified.
3. The method according to claim 2, wherein the calling the target recognition model to process the feature representation information of the image to be recognized to obtain the recognition result of the image to be recognized includes:
based on a self-attention mechanism of the image-text recognition module, carrying out fusion processing on text feature representation information and image semantic representation information of the image to be recognized to obtain fusion feature representation information;
and inputting the fusion feature representation information into the image-text recognition module for processing to obtain the recognition result.
4. A method according to any one of claims 1-3, wherein the method further comprises:
acquiring training samples and label information, wherein the label information comprises an image classification label, a text classification label and an image-text classification label;
invoking an initial recognition model to process the training sample to obtain a first classification result, a second classification result and a third classification result of the training sample;
training the initial recognition model based on the first classification result, the second classification result, the third classification result and the label information to obtain a target recognition model.
5. The method of claim 4, wherein training the initial recognition model based on the first classification result, the second classification result, the third classification result, and the tag information to obtain a target recognition model comprises:
determining first difference data based on the first classification result and the image classification label;
determining second difference data based on the second classification result and the text classification label;
determining third difference data based on the third classification result and the image-text classification label;
determining target difference data based on the first difference data, the second difference data, and the third difference data;
and adjusting parameters of the initial recognition model based on the target difference data, and taking the adjusted initial recognition model as a target recognition model.
6. The method of claim 5, wherein the initial recognition model further comprises a face recognition model; the determining target difference data based on the first difference data, the second difference data, and the third difference data includes:
invoking the face recognition model to process the training sample to obtain a fourth classification result and a facial heat map of the training sample;
determining fourth difference data based on the fourth classification result, the facial heat map, and a facial classification label of the training sample, the facial classification label being used to indicate whether the training sample includes a target type of face;
target difference data is determined based on the first difference data, the second difference data, the third difference data, and the fourth difference data.
7. A method according to any one of claims 1-3, wherein the method further comprises:
acquiring a text information set, wherein each text information in the text information set carries a text mark;
acquiring an image set, wherein each image in the image set carries an image mark; each of the images does not include text information;
synthesizing each text message in the text message set with each image in the image set to obtain a plurality of images comprising the text message;
if the text mark and the image mark of a first image in the plurality of images comprising text information are matched with a preset mark, determining that the first image is a negative sample in the training samples, and determining that the image-text classification label of the negative sample is the target category;
and if the text mark and the image mark in a second image in the plurality of images comprising the text information are not matched with the preset mark, determining that the second image is a positive sample in the training samples.
8. The method of claim 7, wherein the obtaining a set of text information comprises:
acquiring a seed text and a text information set to be marked, wherein the seed text carries text marks;
determining whether to add text marks to each text message in the text message set to be marked according to the similarity between the seed text and each text message in the text message set to be marked;
and taking the seed text and the text information added with the text mark as the text information set.
9. The method of claim 1, wherein the image to be identified is an image frame to be identified in a video to be identified, the method further comprising:
if the duration of the video to be identified is greater than or equal to a preset duration, selecting M image frames comprising different scene information from the video to be identified as the image frames to be identified; M is an integer greater than 1; wherein the M image frames include the first frame and the last frame of the video to be identified.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
acquiring a recognition result of each image frame to be recognized, which is included in the video to be recognized;
and determining the recognition result of the video to be identified based on the recognition result of each image frame to be identified and a preset hit rule, wherein the recognition result of the video to be identified is used for indicating whether the video to be identified is of the target category.
11. An image recognition apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the image to be identified;
the calling unit is used for calling the target recognition model to acquire the characteristic representation information of the image to be recognized, wherein the characteristic representation information comprises text characteristic representation information and image semantic representation information;
the calling unit is further used for calling the target recognition model to process the feature representation information of the image to be recognized to obtain a recognition result of the image to be recognized, wherein the recognition result is used for indicating whether the image to be recognized is of a target category;
the target recognition model is obtained by training an initial recognition model according to first difference data, second difference data and third difference data, and the initial recognition model comprises an image recognition module, a text recognition module and an image-text recognition module; the first difference data is determined according to a first classification result of the image recognition module on the training sample and an image classification label; the second difference data is determined according to a second classification result of the text recognition module on the training sample and a text classification label; the third difference data is determined according to a third classification result of the image-text recognition module on the training sample and an image-text classification label, and the image-text classification label is used for indicating whether the training sample is of the target category.
12. An image recognition device comprising a processor, a communication interface and a memory, the processor, the communication interface and the memory being interconnected, wherein the memory stores executable program code, the processor being adapted to invoke the executable program code to perform the method of any of claims 1-10.
13. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, performs the method of any of claims 1-10.
CN202211476632.6A 2022-11-23 2022-11-23 Image recognition method, device, equipment and storage medium Pending CN116977684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211476632.6A CN116977684A (en) 2022-11-23 2022-11-23 Image recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116977684A true CN116977684A (en) 2023-10-31

Family

ID=88471926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211476632.6A Pending CN116977684A (en) 2022-11-23 2022-11-23 Image recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116977684A (en)

Legal Events

Date Code Title Description
PB01 Publication