CN111709398A - Image recognition method, and training method and device of image recognition model

Image recognition method, and training method and device of image recognition model

Info

Publication number: CN111709398A
Application number: CN202010668385.4A
Authority: CN (China)
Prior art keywords: image, feature vector, text, trained, recognized
Other languages: Chinese (zh)
Inventor: 邓强 (Deng Qiang)
Current Assignee / Original Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010668385.4A
Publication of CN111709398A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/148 - Segmentation of character regions
    • G06V30/153 - Segmentation of character regions using recognition of characters or words

Abstract

The application discloses an artificial-intelligence-based image recognition method, which comprises the following steps: acquiring an image to be recognized; if the image to be recognized comprises a picture, acquiring an image feature vector according to the image to be recognized; if the image to be recognized comprises text, acquiring a text feature vector according to the image to be recognized; and acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector. The application also discloses a training method and apparatus for the image recognition model. In the process of recognizing an image, the extracted image feature vector and text feature vector are jointly used as the basis for predicting the image type, and an image recognition result is finally obtained. Because this result synthesizes features from two dimensions, image and text, an image recognition result with higher accuracy can be obtained.

Description

Image recognition method, and training method and device of image recognition model
Technical Field
The present application relates to the field of computer vision, and in particular to an image recognition method and a training method and apparatus for an image recognition model.
Background
Because images are intuitive and can carry a large amount of information, they appear widely in all kinds of media content. In information recommendation products, for example, images to be recognized may be inserted into the middle of an article or placed at its beginning, which affects the efficiency and experience of reading the article to some extent. As another example, in order to attract traffic, bad actors may generate or spread large numbers of illegal images, such as vulgar content.
Therefore, rapidly recognizing massive numbers of images while ensuring the accuracy of the recognition results is a technical problem that urgently needs to be solved. At present, image recognition generally starts from the original features of an image: image features are extracted, a deep learning model then recognizes those features, and an image recognition result is finally generated.
However, the information in an image is varied and may include both pictures and text. Recognizing an image only from its original features is therefore prone to deviation, resulting in low recognition accuracy.
Disclosure of Invention
The embodiments of the present application provide an image recognition method and a training method and apparatus for an image recognition model. In the process of recognizing an image, the extracted image feature vector and text feature vector can be jointly used as the basis for predicting the image type, and an image recognition result is finally obtained. Because this result synthesizes features from two dimensions, image and text, an image recognition result with higher accuracy can be obtained.
In view of the above, an aspect of the present application provides an image recognition method, including:
acquiring an image to be recognized;
if the image to be recognized comprises a picture, acquiring an image feature vector according to the image to be recognized, wherein the image feature vector comprises a basic feature vector, and the basic feature vector represents size information of the picture;
if the image to be recognized comprises text, acquiring a text feature vector according to the image to be recognized, wherein the text feature vector is determined according to a semantic feature vector of the text and a position feature vector of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters in the text;
and acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector.
Another aspect of the present application provides a training method for an image recognition model, including:
acquiring an image set to be trained, wherein the image set to be trained comprises at least one image to be trained, each image to be trained corresponds to a label, and the labels are used for representing the category of the image to be trained;
acquiring an image feature vector corresponding to an image to be trained based on any image to be trained in the image set to be trained, wherein the image feature vector comprises a basic feature vector and a statistical feature vector, the basic feature vector represents size information of a picture in the image to be trained, and the statistical feature vector represents the number of characters in the text of the image to be trained;
acquiring a text feature vector corresponding to an image to be trained based on any one image to be trained, wherein the text feature vector is determined according to a semantic feature vector of a text in the image to be trained and size information of the text in the image to be trained;
based on an image feature vector and a text feature vector corresponding to any image to be trained, obtaining category probability distribution corresponding to the image to be trained through an image recognition model to be trained;
and updating the model parameters of the image recognition model to be trained according to the class probability distribution and the label corresponding to at least one image to be trained until the model convergence condition is reached, thereby obtaining the image recognition model.
Another aspect of the present application provides a method for recognizing an image to be recognized, including:
acquiring an article to be processed, wherein the article to be processed comprises an image to be recognized and article information;
if the image to be recognized comprises a picture, acquiring an image feature vector according to the image to be recognized, wherein the image feature vector comprises a basic feature vector, and the basic feature vector represents size information of the picture;
if the image to be recognized comprises text, acquiring a text feature vector according to the image to be recognized, wherein the text feature vector is determined according to a semantic feature vector of the text and size information of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters in the text;
acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector;
and if the image recognition result is the advertisement category, processing the article to be processed.
Another aspect of the present application provides an image recognition apparatus, including:
the acquisition module is used for acquiring an image to be recognized;
the acquisition module is further used for acquiring an image feature vector according to the image to be recognized if the image to be recognized comprises a picture, wherein the image feature vector comprises a basic feature vector, and the basic feature vector represents size information of the picture;
the acquisition module is further used for acquiring a text feature vector according to the image to be recognized if the image to be recognized comprises text, wherein the text feature vector is determined according to a semantic feature vector of the text and a position feature vector of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters in the text;
and the recognition module is used for acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector.
In one possible design, in an implementation manner of another aspect of the embodiment of the present application, the image feature vector further includes at least one of an object feature vector, a scene feature vector, a two-dimensional code feature vector, and a template feature vector;
the acquisition module is specifically used for acquiring an object feature vector through an object detection model based on the image to be recognized, wherein the object feature vector represents the object category probability corresponding to the picture;
or the obtaining module is specifically configured to obtain a scene feature vector through a scene recognition model based on the image to be recognized, where the scene feature vector represents a scene probability corresponding to the image;
or the obtaining module is specifically configured to obtain a two-dimensional code feature vector through a two-dimensional code recognition model based on the image to be recognized, where the two-dimensional code feature vector represents category probability and size information corresponding to the two-dimensional code;
or the obtaining module is specifically configured to generate the template feature vector according to the occurrence frequency of the picture in the image to be recognized.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for performing Optical Character Recognition (OCR) on the image to be recognized to obtain a text and position information of the text, wherein the text comprises P words, and P is an integer greater than or equal to 1;
generating a semantic feature vector according to the text, wherein the semantic feature vector comprises P word embedding vectors, and the word embedding vectors and the words have corresponding relations;
generating a position feature vector according to the position information of the text;
and generating a text feature vector according to the semantic feature vector and the position feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically configured to process the semantic feature vector and the position feature vector based on a multi-head attention mechanism to obtain a text feature vector, where the semantic feature vector belongs to a query corresponding to the multi-head attention mechanism, and the position feature vector belongs to a key corresponding to the multi-head attention mechanism.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is further used for, after the image to be recognized is acquired, acquiring a text feature vector and a first feature vector according to the image to be recognized if the image to be recognized does not meet the image feature extraction condition;
the acquisition module is further used for acquiring, based on the first feature vector and the text feature vector, the image recognition result corresponding to the image to be recognized through the image recognition model.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is further used for acquiring an image to be recognized, and acquiring an image feature vector and a second feature vector according to the image to be recognized if the image to be recognized does not meet the text feature extraction condition;
and the obtaining module is specifically used for obtaining an image recognition result corresponding to the image to be recognized through the image recognition model based on the image feature vector and the second feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a target image feature vector through a shallow network included in the image recognition model based on the image feature vector;
based on the text feature vector, obtaining a target text feature vector through a deep network included in the image recognition model;
based on the target image feature vector and the target text feature vector, acquiring the class probability distribution of the image to be recognized through a multilayer perceptron included in the image recognition model;
and determining an image recognition result corresponding to the image to be recognized according to the class probability distribution of the image to be recognized.
Another aspect of the present application provides a model training apparatus, including:
the device comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring an image set to be trained, the image set to be trained comprises at least one image to be trained, each image to be trained corresponds to a label, and the label is used for representing the category of the image to be trained;
the acquisition module is further used for acquiring an image feature vector corresponding to the image to be trained based on any image to be trained in the image set to be trained, wherein the image feature vector comprises a basic feature vector and a statistical feature vector, the basic feature vector represents size information of a picture in the image to be trained, and the statistical feature vector represents the number of characters in the text of the image to be trained;
the acquisition module is further used for acquiring a text feature vector corresponding to the image to be trained based on any one image to be trained, wherein the text feature vector is determined according to the semantic feature vector of the text in the image to be trained and the size information of the text in the image to be trained;
the acquisition module is also used for acquiring the class probability distribution corresponding to the image to be trained through the image recognition model to be trained based on the image characteristic vector and the text characteristic vector corresponding to any image to be trained;
and the training module is used for updating the model parameters of the image recognition model to be trained according to the class probability distribution and the label corresponding to at least one image to be trained until the model convergence condition is reached, so as to obtain the image recognition model.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a target image feature vector through a shallow network included in the to-be-trained image recognition model based on an image feature vector corresponding to any one to-be-trained image;
acquiring a target text feature vector through a deep network included in the image recognition model to be trained based on the text feature vector corresponding to any one image to be trained;
and obtaining the class probability distribution corresponding to the image to be trained through a multilayer perceptron included in the image recognition model to be trained based on the target image feature vector and the target text feature vector.
Another aspect of the present application provides an advertisement image recognition apparatus, including:
the device comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring an article to be processed, and the article to be processed comprises an image to be recognized and article information;
the acquisition module is further used for acquiring an image feature vector according to the image to be recognized if the image to be recognized comprises a picture, wherein the image feature vector comprises a basic feature vector, and the basic feature vector represents size information of the picture;
the acquisition module is further used for acquiring a text feature vector according to the image to be recognized if the image to be recognized comprises text, wherein the text feature vector is determined according to a semantic feature vector of the text and size information of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters in the text;
the acquisition module is further used for acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector;
and the processing module is used for processing the article to be processed if the image identification result is the advertisement category.
Another aspect of the present application provides a computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein, the memory is used for storing programs;
a processor for executing the program in the memory, the processor for performing the above-described aspects of the method according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the various alternative implementations of the aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides an image identification method, which includes the steps of firstly obtaining an image to be identified, obtaining an image characteristic vector according to the image to be identified if the image to be identified comprises a picture, obtaining a text characteristic vector according to the image to be identified if the image to be identified comprises a text, and finally obtaining an image identification result corresponding to the image to be identified according to the image characteristic vector and the text characteristic vector. By adopting the mode, in the process of identifying the image, the image and the text can be extracted from the image to be used as the basis of image identification, the extracted image characteristic vector and the text characteristic vector are jointly used as the basis of a predicted image type, and the image identification result is finally obtained, belongs to the characteristics of two dimensions of the comprehensive image and the text, so that the image identification result with higher accuracy can be obtained.
Drawings
FIG. 1 is a schematic architecture diagram of an image recognition system in an embodiment of the present application;
FIG. 2 is a schematic view of an interaction flow of an image recognition method in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of an image recognition method in an embodiment of the present application;
FIG. 4 is a schematic diagram of an image to be recognized in the embodiment of the present application;
FIG. 5 is a schematic structural diagram of an image recognition model in an embodiment of the present application;
FIG. 6 is a schematic diagram of generating text feature vectors based on a multi-head attention mechanism in an embodiment of the present application;
FIG. 7 is a schematic diagram of another structure of an image recognition model in the embodiment of the present application;
FIG. 8 is a diagram illustrating an embodiment of an image recognition model training method in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of the method for recognizing an image to be recognized in an embodiment of the present application;
FIG. 10 is a schematic diagram of an image to be recognized based on an article to be processed in the embodiment of the present application;
FIG. 11 is another schematic diagram of an image to be identified based on an article to be processed in the embodiment of the present application;
FIG. 12 is another schematic diagram of an image to be identified based on an article to be processed in the embodiment of the present application;
FIG. 13 is a schematic view of a portal interface for viewing articles in an embodiment of the application;
FIG. 14 is a schematic diagram of an embodiment of an image recognition apparatus in an embodiment of the present application;
FIG. 15 is a schematic diagram of an embodiment of a model training apparatus according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an embodiment of an advertisement image recognition apparatus in an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a server in an embodiment of the present application;
fig. 18 is a schematic structural diagram of a terminal device in the embodiment of the present application.
Detailed Description
The embodiments of the present application provide an image recognition method and a training method and apparatus for an image recognition model. In the process of recognizing an image, the extracted image feature vector and text feature vector can be jointly used as the basis for predicting the image type, and an image recognition result is finally obtained. Because this result synthesizes features from two dimensions, image and text, an image recognition result with higher accuracy can be obtained.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that the present application provides an Artificial Intelligence (AI)-based image recognition method that is particularly suitable for image recognition scenarios. With the continuous development of network information, publishing information via the Internet has become a popular form; information can be presented as plain text or as plain images, and much of it combines pictures and text. In an illegal-image scenario, because vulgar, disgusting, or horrifying images cause discomfort, an information publishing platform needs to detect and process such inappropriate images, for example by pixelating, masking, or reporting them. In an advertisement-image scenario, an illustrated article may contain not only images related to the article but also advertisement images; if advertisement images appear at the beginning of the article, appear in large numbers, or occupy too large a share of the article, the reading experience is likely to be affected. The information publishing platform therefore needs to detect and process such advertisement images, for example by adjusting the positions at which they appear in the article, or by filtering them out of the article directly.
Based on this, the present application provides an image recognition method that first decomposes an image into features of multiple dimensions, which fall mainly into two groups: image feature vectors and text feature vectors. The image feature vector and the text feature vector are then jointly used as the input of an image recognition model, and this end-to-end model makes a comprehensive judgment and outputs an image recognition result.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of an image recognition system in an embodiment of the present application. As shown in the figure, the image recognition system includes a server and a terminal device. The server in this application may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms. The terminal device may be a smartphone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, and the like, but is not limited thereto.
In the image recognition system, the terminal device and the server may communicate with each other through a wireless network, a wired network, or a removable storage medium. The wireless network uses standard communication techniques and/or protocols. It is typically the Internet, but may be any network, including but not limited to Bluetooth, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile network, a private network, or any combination of virtual private networks. In some embodiments, custom or dedicated data communication techniques may be used in place of, or in addition to, the techniques described above. The removable storage medium may be a Universal Serial Bus (USB) flash drive, a removable hard disk, or another removable storage medium, which is not limited in this application.
Referring to fig. 2, fig. 2 is a schematic view of an interaction flow of an image recognition method in an embodiment of the present application, and as shown in the figure, specifically:
in step S1, taking the information publishing platform as an example, the server side obtains an article to be processed, where the article to be processed includes an image to be recognized, and the image to be recognized may include text and pictures.
In step S2, if the image to be recognized includes both text and pictures, text feature vectors and image feature vectors are extracted from them. If the image to be recognized includes text and pictures but some picture features cannot be extracted, the missing entries in the image feature vector can be padded with '0', or every element of the image feature vector can be set to a preset value. If the image includes only pictures and no text, the text feature vector can likewise be padded with '0', or every element of the text feature vector can be set to a preset value, as sketched below.
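As a concrete illustration of this padding rule, the following Python sketch builds the model input from whichever modalities are available; the vector dimensions and the helper name are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

IMAGE_DIM = 16   # hypothetical length of the image feature vector
TEXT_DIM = 300   # hypothetical length of the text feature vector

def build_model_input(image_vec=None, text_vec=None, pad_value=0.0):
    """Concatenate image and text features, padding a missing modality."""
    if image_vec is None:
        image_vec = np.full(IMAGE_DIM, pad_value, dtype=np.float32)
    if text_vec is None:
        text_vec = np.full(TEXT_DIM, pad_value, dtype=np.float32)
    return np.concatenate([image_vec, text_vec])
```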
In step S3, the text feature vector and the image feature vector are used together as input of the image recognition model, and then the corresponding image recognition result is output by the image recognition model.
In step S4, based on the image recognition result, the article to be processed is processed; for example, the article is directly filtered out, or its weight is reduced so that it is ranked lower.
In step S5, the server pushes the article to be processed to the terminal device used by the user, or the server no longer pushes the article to be processed to the terminal device used by the user.
In step S6, if the server pushes the article to be processed to the terminal device, the user can view the article to be processed through the terminal device.
Based on the above introduction, the image recognition method, the image recognition model training method, and the method for recognizing images to be recognized provided by the present application can be implemented with artificial intelligence techniques, and specifically involve Computer Vision (CV) and Machine Learning (ML). Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level techniques. Basic artificial intelligence infrastructure includes sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and further performing image processing so that the processed image is better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine learning is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
With reference to the above description, the scheme provided in the embodiments of the present application involves technologies such as computer vision, Image Semantic Understanding (ISU), and machine learning. The image recognition method provided by the present application is described below with reference to fig. 3; one embodiment of the image recognition method in the embodiments of the present application includes:
101. Acquiring an image to be recognized;
In this embodiment, the image recognition apparatus obtains an image to be recognized, where the image to be recognized may be an image crawled from a website, an image inserted in an article, an image posted in a forum or a microblog, and the like, which is not limited here. For ease of understanding, please refer to fig. 4, which is a schematic diagram of an image to be recognized in an embodiment of the present application. As shown in the figure, the image to be recognized may generally include two parts, a picture and a text, where the picture may further include an ordinary picture and a two-dimensional code picture. In fig. 4, A1 indicates an ordinary picture, A2 indicates text, and A3 indicates a two-dimensional code picture.
The image recognition apparatus is disposed in a computer device, which may be a server or a terminal device. The present application takes the computer device as an example for description, but this should not be construed as a limitation of the present application.
102. If the image to be recognized comprises a picture, acquiring an image feature vector according to the image to be recognized, wherein the image feature vector comprises a basic feature vector, and the basic feature vector represents size information of the picture;
in this embodiment, when the image to be recognized includes a picture, corresponding image feature vectors may be extracted, where the image feature vectors include basic feature vectors, and the basic feature vectors are used to represent size information of a general picture.
Specifically, the following describes a manner of obtaining the basis feature vector:
Taking an ordinary picture as an example, assume that the picture is 30 pixels wide and 70 pixels long. These values can be normalized: the normalized width is 30/70 = 0.43 and the normalized length is 70/70 = 1, so the basis feature vector can be represented as (0.43, 1). Alternatively, assume that the image to be recognized containing the picture is 80 wide and 200 long; the length and width of the picture can be normalized against it, giving a normalized width of 30/80 = 0.375 and a normalized length of 70/200 = 0.35, so the basis feature vector can be represented as (0.43, 1, 0.375, 0.35). Alternatively, assume the picture is an illustration in an article that is 7000 pixels long and 500 pixels wide; the proportion of the article that the picture occupies can be calculated, 30/500 = 0.06 in width and 70/7000 = 0.01 in length, so the basis feature vector can be represented as (0.43, 1, 0.375, 0.35, 0.06, 0.01).
It should be noted that the basis feature vector may be represented in any of the above ways, and one or more of its dimensions may also be selected as the basis feature vector, which is not limited here.
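The worked example above can be reproduced with a short Python sketch; the function name and the rounding are illustrative assumptions:

```python
def basic_feature_vector(pic_w, pic_h, img_w, img_h, art_w, art_h):
    """Six-dimensional basis feature vector as in the worked example:
    (w, h normalized by the picture's longer side,
     w, h normalized by the containing image,
     w, h as a proportion of the article)."""
    longer = max(pic_w, pic_h)
    return [
        round(pic_w / longer, 2), round(pic_h / longer, 2),
        round(pic_w / img_w, 3),  round(pic_h / img_h, 3),
        round(pic_w / art_w, 3),  round(pic_h / art_h, 3),
    ]

# Reproduces the example: a 30x70 picture inside an 80x200 image,
# illustrated in a 500x7000 article -> (0.43, 1, 0.375, 0.35, 0.06, 0.01)
print(basic_feature_vector(30, 70, 80, 200, 500, 7000))
```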
103. If the image to be recognized comprises text, acquiring a text feature vector according to the image to be recognized, wherein the text feature vector is determined according to a semantic feature vector of the text and a position feature vector of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters in the text;
In this embodiment, when the image to be recognized includes text, the corresponding text feature vector, and the statistical feature vector within the image feature vector, may be extracted. The text feature vector may include the semantic feature vector and the position feature vector of the text; that is, the text feature vector may be obtained by splicing the two. The text feature vector may also be generated from the semantic feature vector and the position feature vector in other ways, for example by processing them with an attention mechanism to obtain the text feature vector.
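For the attention-based combination, a minimal PyTorch sketch is given below, assuming the semantic vectors serve as the query and the position vectors as the key, as described in the optional implementations; the choice of value and all dimensions are assumptions, since the disclosure does not fix them.

```python
import torch
import torch.nn as nn

P, SEM_DIM, HEADS = 48, 300, 4                 # illustrative sizes

pos_proj = nn.Linear(6, SEM_DIM)               # lift 6-d position features to SEM_DIM
attn = nn.MultiheadAttention(embed_dim=SEM_DIM, num_heads=HEADS, batch_first=True)

semantic = torch.randn(1, P, SEM_DIM)          # stand-in for P word-embedding vectors
position = pos_proj(torch.randn(1, P, 6))      # stand-in for P position feature vectors

# query = semantic, key = position; the position vectors are reused as
# the value here, which is an assumption for illustration.
text_feature, _ = attn(query=semantic, key=position, value=position)
print(text_feature.shape)                      # torch.Size([1, 48, 300])
```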
Specifically, the manner of obtaining the position feature vector will be described below:
Taking a piece of text as an example, the content of the text portion is recognized from the image to be recognized using OCR. Assume that every 48 characters form one text segment, and segments with fewer than 48 characters are zero-padded. Assume the text is 10 pixels wide and 20 pixels long; these values can be normalized: the normalized width is 10/20 = 0.5 and the normalized length is 20/20 = 1, so the position feature vector can be represented as (0.5, 1). Alternatively, assume that the image to be recognized containing the text is 80 wide and 200 long; normalizing against it gives a width of 10/80 = 0.125 and a length of 20/200 = 0.1, so the position feature vector can be represented as (0.5, 1, 0.125, 0.1). Alternatively, assume the article is 2000 pixels long and 500 pixels wide; the proportion of the article occupied by the text is 10/500 = 0.02 in width and 20/2000 = 0.01 in length, and the position feature vector is represented as (0.5, 1, 0.125, 0.1, 0.02, 0.01).
Specifically, the following describes a manner of obtaining the semantic feature vector:
Taking a piece of text as an example, the semantic feature vector corresponding to the text is generated based on a Word to vector (word2vec) model. The word2vec model maps each word to a vector that can represent word-to-word relationships; the vector is taken from a hidden layer of the neural network.
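As one concrete realization of this step, the following sketch trains a word2vec model with gensim and looks up a 300-dimensional word-embedding vector; the toy corpus, parameters, and the use of gensim itself are illustrative assumptions.

```python
from gensim.models import Word2Vec

# A toy corpus of pre-segmented sentences; in the described pipeline
# the words would come from OCR text after word segmentation.
sentences = [["scan", "the", "code", "to", "order"],
             ["business", "hotline", "number"]]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1)

# One 300-d word-embedding vector per word; P of these make up the
# semantic feature vector of a text segment.
vec = model.wv["hotline"]
print(vec.shape)  # (300,)
```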
Specifically, the manner of obtaining the statistical feature vector will be described below:
Taking a piece of text as an example, the content of the text portion is recognized from the image to be recognized using OCR, and the numbers of characters in the text are counted. For example, if there are 10 English characters, 20 Chinese characters, and 3 punctuation marks, the statistical feature vector can be represented as (10, 20, 3).
It should be noted that the position feature vector and the statistical feature vector may be represented in any of the above ways, and one or more of their dimensions may also be selected, which is not limited here.
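A minimal Python sketch of the character-count statistic above, assuming the three dimensions are English letters, Chinese (CJK) characters, and punctuation marks; the punctuation set is an illustrative assumption:

```python
import string

def statistical_feature_vector(text):
    """Counts of (English letters, CJK characters, punctuation marks),
    matching the worked example above."""
    english = sum(c in string.ascii_letters for c in text)
    cjk = sum('\u4e00' <= c <= '\u9fff' for c in text)
    punct = sum(c in string.punctuation or c in "，。！？；：" for c in text)
    return (english, cjk, punct)

print(statistical_feature_vector("扫码下单OK！"))  # (2, 4, 1)
```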
104. And acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector.
In this embodiment, the image recognition device may determine the image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector.
Specifically, in one implementation, the image recognition apparatus splices the obtained image feature vector and text feature vector, inputs the spliced vector to the image recognition model, and the image recognition model outputs a class probability distribution (a, b), where a + b = 1, a represents the probability that the image to be recognized belongs to the first type, and b represents the probability that it belongs to the second type. For example, if the class probability distribution is (0.3, 0.7), the image recognition result of the image to be recognized is the second type. For another example, if the class probability distribution is (0.8, 0.2), the image recognition result is the first type. In addition, the model may also output a single class probability value; for example, a class probability value of 0.8 indicates that the image recognition result of the image to be recognized is the first type.
In another implementation, the image recognition apparatus splices the obtained image feature vector and text feature vector into a fixed-length feature to be predicted, and then compares the feature to be predicted with an existing feature library to find the most similar feature, so that the type corresponding to that feature in the feature library is determined as the image recognition result. As an example, assume the feature to be predicted is (1, 0, 1, 1, 0) and the feature library contains two groups of features, (1, 1, 1, 1, 1) and (0, 0, 0, 0, 0), where (1, 1, 1, 1, 1) corresponds to the first type and (0, 0, 0, 0, 0) corresponds to the second type. Comparison shows that the feature to be predicted (1, 0, 1, 1, 0) is more similar to (1, 1, 1, 1, 1), so the image recognition result can be determined as the first type, the type corresponding to (1, 1, 1, 1, 1).
It can be understood that the first type may be the advertisement type and the second type the non-advertisement type, or vice versa; or the first type may be an illegal type and the second type a legal type. This can be set flexibly according to the actual situation and is not limited here.
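A minimal sketch of this feature-library comparison, reproducing the worked example; Euclidean distance is used as the similarity measure, which is an assumption since the disclosure does not name a specific metric:

```python
import numpy as np

def nearest_type(feature, library):
    """Return the type whose library feature is closest to `feature`
    (Euclidean distance; smaller distance = higher similarity)."""
    dists = {t: np.linalg.norm(np.asarray(feature) - np.asarray(f))
             for t, f in library.items()}
    return min(dists, key=dists.get)

library = {"first type": [1, 1, 1, 1, 1], "second type": [0, 0, 0, 0, 0]}
print(nearest_type([1, 0, 1, 1, 0], library))  # first type
```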
For ease of understanding, please refer to fig. 5, which is a schematic structural diagram of an image recognition model in an embodiment of the present application. As shown in the figure, features of the image to be recognized are extracted first; assuming the image to be recognized includes both a picture and text, an image feature vector and a text feature vector can be extracted. The two vectors are then input to the image recognition model together, and the model outputs the corresponding class probability distribution, from which the image recognition result is obtained.
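A minimal PyTorch sketch of such a fusion model, following the shallow-network / deep-network / multilayer-perceptron split described in the optional implementations above; all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ImageRecognitionModel(nn.Module):
    def __init__(self, img_dim=16, txt_dim=300, hidden=128, num_classes=2):
        super().__init__()
        self.shallow = nn.Sequential(            # shallow network for image features
            nn.Linear(img_dim, hidden), nn.ReLU())
        self.deep = nn.Sequential(               # deeper network for text features
            nn.Linear(txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mlp = nn.Sequential(                # fusion MLP -> class distribution
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, image_vec, text_vec):
        fused = torch.cat([self.shallow(image_vec), self.deep(text_vec)], dim=-1)
        return torch.softmax(self.mlp(fused), dim=-1)  # (a, b) with a + b = 1

model = ImageRecognitionModel()
probs = model(torch.randn(1, 16), torch.randn(1, 300))
print(probs, probs.argmax(dim=-1))  # e.g. tensor([[0.3, 0.7]]) -> second type
```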
The embodiments of the present application thus provide an image recognition method: first an image to be recognized is acquired; if it comprises a picture, an image feature vector is acquired from it; if it comprises text, a text feature vector is acquired from it; finally, an image recognition result corresponding to the image to be recognized is acquired according to the image feature vector and the text feature vector. In this way, during recognition, both the picture and the text can be extracted from the image as the basis of recognition: the extracted image feature vector and text feature vector are jointly used as the basis for predicting the image type, and the final image recognition result synthesizes features from two dimensions, image and text, so that an image recognition result with higher accuracy can be obtained.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the image feature vector further includes at least one of an object feature vector, a scene feature vector, a two-dimensional code feature vector, and a template feature vector;
the method comprises the following steps of obtaining an image feature vector according to an image to be identified:
based on an image to be identified, obtaining a category characteristic vector through an object detection model, wherein the category characteristic vector represents category probability corresponding to the image;
or, acquiring an image feature vector according to the image to be identified, specifically comprising the following steps:
acquiring a scene characteristic vector through a scene recognition model based on an image to be recognized, wherein the scene characteristic vector represents the scene probability corresponding to the image;
or, acquiring an image feature vector according to the image to be identified, specifically comprising the following steps:
acquiring a two-dimensional code characteristic vector through a two-dimensional code identification model based on an image to be identified, wherein the two-dimensional code characteristic vector represents category probability and size information corresponding to a two-dimensional code;
or, acquiring an image feature vector according to the image to be identified, specifically comprising the following steps:
and generating a template feature vector according to the occurrence times of the pictures in the image to be recognized.
In this embodiment, the ways of obtaining the object feature vector, the scene feature vector, the two-dimensional code feature vector, and the template feature vector for a picture are described. Since the definition of the image to be recognized is relatively broad, take images of the advertisement category as an example: close-up shots of text or of a two-dimensional code in an image may indicate an advertisement, so many feature dimensions are involved in determining the image category. The image feature vectors in these 4 dimensions are described below.
Firstly, object feature vectors;
and after the image to be recognized is input into the object detection model, the probability scores of various objects are output through the object detection model. Assuming that the object detection model can detect 6 object categories, which are respectively "coke", "potato chip", "clothes", "ball", "egg", and "television", each object category corresponds to a result of one dimension, for example, the object feature vector is (1,0,0,0,0,0), which indicates that the object category of the common picture in the image to be recognized is "coke". For another example, if the output object feature vector is (0.1,0.2,0.7,0,0,0), it indicates that the object type of the normal picture in the image to be recognized is "clothing".
It should be noted that the object detection model may specifically be a Convolutional Neural Network (CNN) model, and the type of the CNN model is not limited herein.
Secondly, a scene feature vector;
and after the image to be recognized is input into the scene recognition model, the probability scores of various scenes are output through the scene recognition model. Assuming that the scene recognition model can detect 2 scene categories, which are "documented scene" and "natural scene", respectively, and each scene category corresponds to a result of one dimension, for example, the output scene feature vector is (1,0), which indicates that the scene category of the common picture in the image to be recognized is "documented scene". For another example, if the output object feature vector is (0.1,0.9), it indicates that the scene type of the normal picture in the image to be recognized is "natural scene".
It should be noted that the scene recognition model may specifically be a CNN model, and the type of the CNN model is not limited here. It can be understood that an "edited scene" is a scene that has undergone post-processing, while a "natural scene" is a directly photographed natural scene.
Thirdly, the two-dimensional code feature vector;
and after the image to be recognized is input into the scene recognition model, whether the two-dimensional code exists in the image to be recognized or not, the type and the size of the two-dimensional code and the like are output through the two-dimensional code recognition model. Assuming that there is a two-dimensional code picture, the two-dimensional code feature vector is represented as (1,0), and if there is no two-dimensional code picture, the two-dimensional code feature vector is represented as (0, 1). Further, assuming that the two-dimensional code recognition model can detect 2 two-dimensional code categories, which are respectively a "personal category" and a "public number category", and each two-dimensional code category corresponds to a result of one dimension, for example, if the output two-dimensional code feature vector is (1,0), it indicates that the scene category of the two-dimensional code picture in the image to be recognized is a "personal category". For another example, if the output two-dimensional code feature vector is (0.1,0.9), it indicates that the scene type of the two-dimensional code picture in the image to be recognized is the "public number type". Assuming that the width of the two-dimensional code picture is 50 pixels and the length is 50 pixels, normalization processing may be performed on the value, and if the normalized width is 50/50 equal to 1 and the normalized length is 50/50 equal to 1, the two-dimensional code feature vector may be represented as (1, 1).
In summary, the two-dimensional code feature vector corresponding to a "personal category" two-dimensional code picture can be represented as (1, 0, 1, 0, 1, 1).
It should be noted that the two-dimensional code recognition model may specifically be a CNN model, and the type of the CNN model is not limited herein.
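A short sketch assembling the two-dimensional code feature vector from the three parts described above (presence, category probabilities, normalized size); the function and argument names are illustrative assumptions:

```python
def qr_feature_vector(present, personal_prob, official_prob, w, h):
    """Concatenate presence one-hot, category probabilities, and
    normalized size, as in the worked example above."""
    presence = [1, 0] if present else [0, 1]
    longer = max(w, h) if present else 1
    return presence + [personal_prob, official_prob] + [w / longer, h / longer]

# A 50x50 "personal category" two-dimensional code -> (1, 0, 1, 0, 1, 1)
print(qr_feature_vector(True, 1.0, 0.0, 50, 50))
```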
Fourthly, template feature vectors;
the number of the images to be recognized appearing on the page (or article) is detected by adopting a template matching mode, if the same images to be recognized do not appear on the page (or article), the template feature vector can be represented as (1,0,0,0), if the same images to be recognized appear on the page (or article) for 1 time, the template feature vector can be represented as (0,1,0,0), if the same images to be recognized appear on the page (or article) for 2 times, the template feature vector can be represented as (0,0,1,0), and if the same images to be recognized appear on the page (or article) for 3 times, the template feature vector can be represented as (0,0,0, 1).
It should be noted that the template feature vector may also be represented by using other number of dimensions, which is only an illustration here and should not be understood as a limitation to the present application.
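A minimal sketch of the template feature vector as a one-hot over occurrence counts, matching the example above; the cap at 3 occurrences is taken from the four-dimensional example and is an assumption:

```python
def template_feature_vector(occurrences, max_tracked=3):
    """One-hot over how many more times the same image appears on the
    page or in the article (0, 1, 2, or 3+)."""
    vec = [0] * (max_tracked + 1)
    vec[min(occurrences, max_tracked)] = 1
    return vec

print(template_feature_vector(0))  # [1, 0, 0, 0] -> image does not repeat
print(template_feature_vector(2))  # [0, 0, 1, 0] -> appears 2 more times
```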
It can be understood that, in practical application, the image to be recognized may be divided into a plurality of small blocks to be recognized respectively, so as to obtain a result with a finer granularity.
Secondly, in the embodiment of the application, a mode for acquiring an object feature vector, a scene feature vector, a two-dimensional code feature vector and a template feature vector for a picture is provided, and by adopting the mode, an image is decomposed from multiple dimensions, and features in each dimension are continuously enriched, so that the overall generalization capability and recall capability of a model are improved.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the obtaining a text feature vector according to an image to be recognized specifically includes the following steps:
performing OCR on an image to be recognized to obtain a text and position information of the text, wherein the text comprises P words, and P is an integer greater than or equal to 1;
generating a semantic feature vector according to the text, wherein the semantic feature vector comprises P word embedding vectors, and the word embedding vectors and the words have corresponding relations;
generating a position feature vector according to the position information of the text;
and generating a text feature vector according to the semantic feature vector and the position feature vector.
In this embodiment, a manner of obtaining a semantic feature vector and a position feature vector for a text is introduced. There is often a correlation between the size and position of text and its semantics: for example, as long as a business hotline appears in the image to be recognized, it is considered to indicate the advertisement category regardless of its size; for another example, a price appearing in the image to be recognized needs to be considered as indicating the advertisement category when it is displayed relatively large. The semantic feature vector and the position feature vector are described separately below.
Firstly, semantic feature vectors;
the method includes the steps that text content in an image to be recognized is obtained through an OCR technology, and special symbols (such as expression symbols), traditional Chinese characters, English lowercase or English uppercase may exist in the extracted text, so that text preprocessing is needed, for example, the special symbols are filtered, the traditional Chinese characters are converted into simplified Chinese characters, the lowercase is converted into uppercase in a unified mode, or the uppercase is converted into lowercase in a unified mode. And performing word segmentation on the preprocessed text, for example, obtaining P words, and then mapping each word to a high-dimensional word embedding vector, where each word embedding vector may be 300 dimensions, and is not limited herein.
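The preprocessing and word-embedding steps described above might be sketched as follows; the traditional-to-simplified converter and the embedding table are stubbed, since this application does not prescribe a particular implementation.

```python
import re

def preprocess_text(text, to_simplified=lambda s: s):
    """Filter special symbols (e.g. emoticons), convert traditional Chinese
    characters to simplified (stubbed here), and unify letter case."""
    text = re.sub(r"[^\w\u4e00-\u9fff]", "", text)  # drop special symbols
    text = to_simplified(text)                      # traditional -> simplified
    return text.lower()                             # unify to lowercase

def embed_words(words, embedding_table, dim=300):
    """Map each of the P segmented words to a word embedding vector
    (e.g. 300-dimensional); unknown words fall back to a zero vector."""
    zero = [0.0] * dim
    return [embedding_table.get(word, zero) for word in words]
```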
Secondly, position feature vectors;
Determine the text content in the image to be recognized by using an OCR technology. Assuming that the width of the text is 10 pixels and the length is 20 pixels, these values are normalized: the normalized width is 10/20 = 0.5 and the normalized length is 20/20 = 1. The image to be recognized where the text is located has a width of 80 and a length of 200, so the length and width of the text can also be normalized against the image: the normalized width is 10/80 = 0.125 and the normalized length is 20/200 = 0.1. The article has a length of 2000 pixels and a width of 500 pixels, so the proportion of the text within the article can be calculated: the proportion in width is 10/500 = 0.02 and the proportion in length is 20/2000 = 0.01. Based on this, the position feature vector can be expressed as (0.5, 1, 0.125, 0.1, 0.02, 0.01).
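Following the arithmetic of this example, the position feature vector could be computed as in the sketch below; the parameter names are illustrative.

```python
def position_feature_vector(text_w, text_h, img_w, img_h, art_w, art_h):
    """Build the 6-dimensional position feature vector from the example:
    text size normalized by its longer side, then relative to its image,
    then relative to the whole article."""
    longer = max(text_w, text_h)
    return (
        text_w / longer,  # 10 / 20  = 0.5
        text_h / longer,  # 20 / 20  = 1
        text_w / img_w,   # 10 / 80  = 0.125
        text_h / img_h,   # 20 / 200 = 0.1
        text_w / art_w,   # 10 / 500 = 0.02
        text_h / art_h,   # 20 / 2000 = 0.01
    )

assert position_feature_vector(10, 20, 80, 200, 500, 2000) == (
    0.5, 1.0, 0.125, 0.1, 0.02, 0.01)
```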
Secondly, in the embodiment of the application, a mode for acquiring semantic feature vectors and position feature vectors for texts is provided, and by adopting the mode, the texts are decomposed from multiple dimensions, and features in each dimension are continuously enriched, so that the overall generalization capability and recall capability of the model are improved.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, the generating a text feature vector according to the semantic feature vector and the position feature vector specifically includes the following steps:
and processing the semantic feature vector and the position feature vector based on the multi-head attention mechanism to obtain a text feature vector, wherein the semantic feature vector belongs to a query corresponding to the multi-head attention mechanism, and the position feature vector belongs to a key corresponding to the multi-head attention mechanism.
In this embodiment, a method for obtaining a text feature vector based on a multi-head attention mechanism is described. The essence of the attention function can be described as a mapping from a query (Query) to a series of key-value (Key-Value) pairs. The attention calculation mainly comprises three steps: the first step is to calculate the similarity between the query and each key to obtain weights, where common similarity functions include the dot product, splicing, a perceptron, and the like; the second step is typically to normalize these weights using a softmax function; the third step is to perform a weighted summation of the weights and the corresponding values to obtain the final attention result. It will be appreciated that, in general, key and value are the same, i.e., key = value.
Based on this, the present application can use a multi-head attention (Multi-head attention) mechanism to process the semantic feature vector and the position feature vector. Referring to fig. 6, fig. 6 is a schematic diagram of generating a text feature vector based on the multi-head attention mechanism in an embodiment of the present application. As shown in the figure, in order to relate the position of the text to the semantics of the text, the semantic feature vector and the position feature vector of the text may be combined as shown in the following formula:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

where Q and K represent the result of splicing the semantic feature vector and the position feature vector, and the i-th word after splicing is represented as $(e_i, o_i)$; V denotes the semantic feature vector, and the semantic feature vector of the i-th word is denoted $e_i$; $d_k$ represents the dimension of Q and K. When $d_k$ is relatively small, multiplicative attention and additive attention perform almost the same; when $d_k$ is larger, if the scaling factor $1/\sqrt{d_k}$ is not used, additive attention performs better, because the multiplication result becomes large and easily enters the saturation region of the softmax function.
Under the multi-head attention mechanism, the semantic feature vector and the position feature vector of the text are first combined; the weight between the two feature vectors is then calculated, namely matrix multiplication (matmul) processing is performed; scaling (scale) processing is then performed using the scaling factor indicated above; a mask operation is then performed; and finally the semantic feature vector and the result output by softmax are subjected to matrix multiplication (namely dot multiplication) to obtain the text feature vector.
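A single-head sketch of this data flow, applying the scaled dot-product formula above; the mask step is omitted, and treating the inputs as plain arrays (rather than MLP outputs) is a simplifying assumption.

```python
import numpy as np

def attention_text_features(E, O):
    """E: (P, d_e) word embeddings (semantic features); O: (P, d_o) matching
    position features. Q = K = concat(E, O) per word and V = E, so text size
    and position are related to text semantics as described above."""
    QK = np.concatenate([E, O], axis=-1)            # splice semantics + position
    d_k = QK.shape[-1]
    scores = QK @ QK.T / np.sqrt(d_k)               # matmul, then scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ E                              # weighted sum of values

# A true multi-head version projects Q, K, V with separate learned matrices
# per head and concatenates the head outputs; this sketch shows one head.
```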
In the embodiment of the application, a method for acquiring the text feature vector based on the multi-head attention mechanism is provided. With this method, the multi-head attention mechanism can learn the relationship between the size of a text block and the text semantics in combination, which further improves the effect of the model and effectively reduces the interference of text size and text position on the expression of text semantics. In addition, combined with the content of the foregoing embodiments, the image recognition model can automatically extract features of the image to be recognized in each dimension to obtain the corresponding image recognition result, thereby ensuring precision while maintaining recall, and also taking the calculation speed into account.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, after the image to be recognized is acquired, the method may further include the following steps:
if the image to be recognized does not meet the image feature extraction condition, acquiring a text feature vector and a first feature vector according to the image to be recognized;
acquiring an image recognition result corresponding to an image to be recognized according to the image feature vector and the text feature vector, and specifically comprising the following steps of:
and acquiring an image recognition result corresponding to the image to be recognized through the image recognition model based on the first feature vector and the text feature vector.
In this embodiment, a processing method for the case where the image to be recognized does not satisfy the image feature extraction condition is introduced. In practical situations, there may be no picture in the image to be recognized, or it may be difficult to extract all features of the picture; for example, without a two-dimensional code picture, no two-dimensional code feature vector can be extracted. However, the feature vector input to the image recognition model still has fixed dimensions; therefore, it is necessary to fill the features that were not extracted with zeros or preset values to obtain a first feature vector, which is then used as the image feature vector and input to the image recognition model together with the text feature vector.
For easy understanding, please refer to table 1, which shows a feature extraction case based on an image to be recognized.
TABLE 1
Feature dimension | Feature value
1 | none
2 | none
3 | 0.5
4 | 0.125
5 | 0.7
6 | 0.3
7 | 0.1
8 | 0.1
As can be seen from table 1, the feature values for feature dimension 1 and feature dimension 2 were not extracted; therefore, the feature values corresponding to these two feature dimensions may be set to 0 or another preset value. Taking 0 as an example, the first feature vector may be represented as (0,0,0.5,0.125,0.7,0.3,0.1,0.1), and in this case, the first feature vector may be regarded as the image feature vector.
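A minimal sketch of this zero/preset-value padding, assuming the fixed 8-dimension layout of Table 1; the mapping-based interface is an illustrative assumption.

```python
def pad_feature_vector(extracted, num_dims=8, fill=0.0):
    """Fill feature dimensions that could not be extracted with a preset
    value (0 by default), so the vector fed to the image recognition model
    keeps fixed dimensions. `extracted` maps 1-based dimension -> value."""
    return tuple(extracted.get(d, fill) for d in range(1, num_dims + 1))

# Table 1: dimensions 1 and 2 are missing, giving the first feature vector.
vec = pad_feature_vector({3: 0.5, 4: 0.125, 5: 0.7, 6: 0.3, 7: 0.1, 8: 0.1})
assert vec == (0.0, 0.0, 0.5, 0.125, 0.7, 0.3, 0.1, 0.1)
```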
Secondly, in the embodiment of the application, a processing mode that the image to be recognized does not satisfy the image feature extraction condition is provided, and by adopting the mode, even if the image to be recognized does not satisfy the image feature extraction condition, the image feature vector can be obtained in a zero padding or preset value filling mode.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, after the image to be recognized is acquired, the method may further include the following steps:
if the image to be recognized does not meet the text feature extraction condition, acquiring an image feature vector and a second feature vector according to the image to be recognized;
obtaining an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector, wherein the image recognition result comprises the following steps:
and acquiring an image identification result corresponding to the image to be identified through the image identification model based on the image feature vector and the second feature vector.
In this embodiment, a processing method for the case where the image to be recognized does not satisfy the text feature extraction condition is introduced. In practical situations, there may be no text in the image to be recognized, or it may be difficult to extract all features of the text; for example, without text, no text feature vector can be extracted, and for another example, the text may be too scattered for its positions to be recognized. However, the feature vector input to the image recognition model still has fixed dimensions; therefore, it is necessary to fill the features that were not extracted with zeros or preset values to obtain a second feature vector, which is then used as the text feature vector and input to the image recognition model together with the image feature vector.
For ease of understanding, please refer to table 2, where table 2 is a text feature case extracted based on the image to be recognized.
TABLE 2
Feature dimension | Feature value
1 | 1
2 | 1
3 | 0.5
4 | 0.3
5 | none
6 | none
7 | none
8 | none
As can be seen from table 2, the feature values with the feature dimension of 5, the feature dimension of 6, the feature dimension of 7, and the feature dimension of 8 are not extracted, and therefore, the feature values corresponding to these four feature dimensions may be set to 0 or other preset values. Taking the setting of 0 as an example, the second feature vector may be represented as (1,1,0.5,0.3,0,0,0,0), and in this case, the second feature vector may be regarded as a text feature vector.
For example, in practical applications, there may be no text in the image to be recognized, and thus each feature value in the second feature vector may be set to 0, or set to another value, which is not limited herein.
Secondly, in the embodiment of the application, a processing mode that the image to be recognized does not satisfy the text feature extraction condition is provided, and by adopting the mode, even if the image to be recognized does not satisfy the text feature extraction condition, the text feature vector can be obtained in a zero padding or preset value filling mode.
Optionally, on the basis of the embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, obtaining an image recognition result corresponding to an image to be recognized according to the image feature vector and the text feature vector includes:
based on the image feature vector, obtaining a target image feature vector through a shallow network included in an image recognition model;
based on the text feature vector, obtaining a target text feature vector through a deep network included in the image recognition model;
based on the target image feature vector and the target text feature vector, acquiring the class probability distribution of the image to be recognized through a multilayer perceptron included in the image recognition model;
and determining an image recognition result corresponding to the image to be recognized according to the class probability distribution of the image to be recognized.
In this embodiment, a manner of obtaining an image recognition result based on a wide and deep (wide & deep) model is provided. As can be seen from the foregoing embodiments, after the image feature vector and the text feature vector are obtained, they may be spliced together and input to the image recognition model, where the image feature vector is usually represented as a numerical feature and the text feature vector as an embedded (embedding) feature; the two may be combined by splicing, for example using a splicing (concat) layer.
The image recognition model can output the class probability distribution of the image to be recognized based on the input image feature vector and the text feature vector, and the probability that the image to be recognized belongs to a certain class can be known based on the class probability distribution.
Specifically, the image recognition model may be a wide & deep model, which mainly includes two parts, namely a shallow network and a deep network. The shallow network is usually a linear model, for example a Logistic Regression (LR) model; it can efficiently realize memorization by using cross features, so as to achieve accurate recommendation, and achieves some generalization capability by adding some broad classes of features. The deep network is generally a deep model, for example a Deep Neural Network (DNN) model or a Factorization Machine (FM) model; it realizes the generalization capability of the model through learned low-dimensional dense vectors, including generalized recommendation of unseen content.
The wide & deep model can well balance the memorization ability, i.e. finding correlations between features from historical data, and the generalization ability, i.e. the transfer of correlations, finding new feature combinations that appear rarely or never in historical data.
Based on the content introduced in the foregoing embodiment, for convenience of understanding, please refer to fig. 7, where fig. 7 is another schematic structural diagram of an image recognition model in the embodiment of the present application, and as shown in the figure, an image to be recognized is first obtained, then features of the image are extracted in multiple dimensions, specifically, features related to the image are extracted through an object detection model, a scene recognition model and a two-dimensional code recognition model, and then a semantic feature vector and a position feature vector are extracted by using an OCR technology. The image feature vector comprises an object feature vector, a scene feature vector, a two-dimensional code feature vector, a template feature vector, a basic feature vector and a statistical feature vector. The semantic feature vector and the position feature vector are respectively input to a Multi-layer perceptron (MLP) layer, and an output result is processed by adopting a Multi-head attention mechanism to obtain a text feature vector. And finally, inputting the text feature vector into a deep network in the wide & deep model, thereby outputting the target text feature vector. And inputting the image feature vector into a shallow network in the wide & deep model, thereby outputting a target image feature vector. And outputting the comprehensive target text characteristic vector and the target image characteristic vector by the wide & deep model, outputting the category probability distribution of the image to be recognized after inputting the comprehensive target text characteristic vector and the target image characteristic vector into the MLP, and finally determining the image recognition result corresponding to the image to be recognized according to the category probability distribution.
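For intuition, the inference path of fig. 7 might be sketched as follows; the parameter names, layer sizes, and the use of a small two-layer MLP for the deep part are assumptions rather than this application's prescribed implementation.

```python
import numpy as np

def wide_and_deep_forward(image_feats, text_feats, params):
    """Numerical image features pass through the shallow (wide, linear) part,
    text features through the deep part, and an MLP head maps the spliced
    result to a class probability distribution. `params` is a hypothetical
    dict of learned weight matrices."""
    relu = lambda x: np.maximum(x, 0.0)

    wide_out = params["W_wide"] @ image_feats                        # shallow, linear
    deep_out = relu(params["W2"] @ relu(params["W1"] @ text_feats))  # deep MLP

    combined = np.concatenate([wide_out, deep_out])                  # concat layer
    logits = params["W_head"] @ combined                             # MLP head
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                                       # e.g. (0.8, 0.2)
```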
Further, in the embodiment of the application, a mode for obtaining an image recognition result based on the wide & deep model is provided, by adopting the mode, the accuracy and the efficiency of image recognition can be improved by adopting the wide & deep model as the image recognition model, and the wide & deep model has the capabilities of memorizing and generalization and is suitable for jointly learning numerical features and text embedded features. Under the scene provided by the application, the number of images needing to be processed is very large, and the wide & deep model is simple in structure, so that the method is suitable for large-scale prediction scenes.
With reference to fig. 8, an embodiment of a training method for an image recognition model in the present application includes:
201. acquiring an image set to be trained, wherein the image set to be trained comprises at least one image to be trained, each image to be trained corresponds to a label, and the labels are used for representing the category of the image to be trained;
in this embodiment, before performing model training, the model training apparatus acquires an image set to be trained, where the image set to be trained includes at least one image to be trained; for example, it may include 3,000 images to be trained taken from public number articles. Each image to be trained corresponds to a manually labeled tag used to indicate the category of the image to be trained; for example, an image to be trained belonging to the advertisement category has the label "1", and an image belonging to the non-advertisement category has the label "0". For another example, an image to be trained belonging to the sensitive category has the label "1", and an image belonging to the non-sensitive category has the label "0".
It should be noted that the model training apparatus may be deployed in a computer device, where the computer device may be a server or a terminal device, and is not limited herein.
202, acquiring an image feature vector corresponding to an image to be trained based on any image to be trained in an image set to be trained, wherein the image feature vector comprises a basic feature vector and a statistical feature vector, the basic feature vector represents size information of a picture in the image to be trained, and the statistical feature vector represents the number of characters of a text in the image to be trained;
in this embodiment, an example of any image to be trained in the image set to be trained is described, and in the actual training process, the content described in step 202 to step 204 needs to be executed on each image to be trained in the image set to be trained, so as to obtain the class probability distribution corresponding to each image to be trained in the image set to be trained.
In the case where the image to be trained includes a picture, the model training device may extract a corresponding image feature vector, where the image feature vector includes a basic feature vector, and the basic feature vector is used to represent size information of a general picture.
It is understood that the manner of extracting the basic feature vector may refer to the content described in step 102 in the foregoing embodiment, and therefore, the description thereof is omitted here.
203. Acquiring a text feature vector corresponding to an image to be trained based on any one image to be trained, wherein the text feature vector is determined according to a semantic feature vector of a text in the image to be trained and size information of the text in the image to be trained;
in this embodiment, under the condition that the image to be trained includes a text, the model training device may extract a corresponding text feature vector and a statistical characteristic vector in the image feature vector, where the text feature vector may include a semantic feature vector and a position feature vector of the text, that is, the text feature vector may be obtained by splicing the semantic feature vector and the position feature vector of the text. The text feature vector may also be generated based on the semantic feature vector and the position feature vector of the text, for example, the semantic feature vector and the position feature vector of the text are processed by using an attention mechanism to obtain the text feature vector.
It is to be understood that the manner of extracting the text feature vector and the manner of extracting the statistical characteristic vector may refer to the content described in step 103 in the foregoing embodiment, and therefore, the details are not described herein. In addition, the text feature vector may be trained based on text of a fixed character length, e.g., 48 characters in fixed length, with 0 padding for samples having a sequence length less than the fixed character length.
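A minimal sketch of this fixed-length padding, assuming the text has already been converted to token ids; the 48-character length comes from the text above, everything else is illustrative.

```python
def pad_tokens(token_ids, fixed_len=48, pad_id=0):
    """Pad sequences shorter than the fixed character length with 0 and
    truncate longer ones, so every training sample has the same length."""
    return (token_ids + [pad_id] * fixed_len)[:fixed_len]

assert len(pad_tokens([5, 9, 13])) == 48
```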
204. Based on an image feature vector and a text feature vector corresponding to any image to be trained, obtaining category probability distribution corresponding to the image to be trained through an image recognition model to be trained;
in this embodiment, after splicing the obtained image feature vector and text feature vector, the model training device inputs the spliced vector to the image recognition model to be trained, which outputs a category probability distribution. The category probability distribution may be represented as (a, b), where a + b = 1, a represents the probability that the image to be trained belongs to the first category, and b represents the probability that it belongs to the second category.
Taking recognition of the advertisement category as an example, if the category probability distribution is (0.3, 0.7), the image recognition result of the image to be trained is the non-advertisement category. For another example, if the category probability distribution is (0.8, 0.2), the image recognition result of the image to be trained is the advertisement category. Furthermore, the category probability distribution may also be expressed as a single category probability value; for example, if the category probability value is 0.8 and is greater than the category probability threshold of 0.5, the image to be trained belongs to the advertisement category.
205. And updating the model parameters of the image recognition model to be trained according to the class probability distribution and the label corresponding to at least one image to be trained until the model convergence condition is reached, thereby obtaining the image recognition model.
In this embodiment, after obtaining the class probability distribution corresponding to each image to be trained, the model training device takes the class probability distribution as the predicted value and the manually labeled tag as the true value, and optimizes the objective by using the adaptive moment estimation (Adam) optimization algorithm and a binary cross entropy loss function, so as to update the model parameters of the image recognition model to be trained, and outputs the image recognition model once the model convergence condition is reached.
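One optimization step of this procedure could be sketched with PyTorch as follows, assuming the model ends in a sigmoid so that it outputs an advertisement probability directly; the helper and its interface are illustrative, not this application's implementation.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, features, labels):
    """Single Adam + binary cross entropy update: the class probability is
    the predicted value, the manually labeled tag the true value."""
    criterion = nn.BCELoss()         # binary cross entropy loss
    optimizer.zero_grad()
    probs = model(features)          # predicted probability, in [0, 1]
    loss = criterion(probs, labels)  # labels: 1.0 = advert, 0.0 = non-advert
    loss.backward()                  # backpropagate
    optimizer.step()                 # Adam parameter update
    return loss.item()

# Usage: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3), then
# iterate train_step over mini-batches until the convergence condition is met.
```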
The embodiment of the application provides a method for training an image recognition model: first, a set of images to be trained is obtained; then, based on any one image to be trained in the set, the image feature vector corresponding to that image is obtained, and the text feature vector corresponding to that image is obtained; next, based on the image feature vector and the text feature vector corresponding to any one image to be trained, the category probability distribution corresponding to the image is obtained through the image recognition model to be trained; and finally, the model parameters of the image recognition model to be trained are updated according to the category probability distribution and the label corresponding to at least one image to be trained until a model convergence condition is reached, so as to obtain the image recognition model. In this manner, a model for identifying image categories can be obtained through training, so that in the image recognition process the extracted image feature vector and text feature vector jointly serve as the basis for predicting the image type; the resulting image recognition result synthesizes features of the two dimensions of picture and text, which is beneficial to obtaining an image recognition result with higher accuracy.
Optionally, on the basis of the embodiment corresponding to fig. 8, in another optional embodiment provided in the embodiment of the present application, based on an image feature vector and a text feature vector corresponding to any one of the images to be trained, the method for obtaining the class probability distribution corresponding to the image to be trained through the image recognition model to be trained specifically includes the following steps:
acquiring a target image feature vector through a shallow network included in an image recognition model to be trained based on an image feature vector corresponding to any one image to be trained;
acquiring a target text feature vector through a deep network included in the image recognition model to be trained based on a text feature vector corresponding to any one image to be trained;
and obtaining the class probability distribution corresponding to the image to be trained through a multilayer perceptron included in the image recognition model to be trained based on the target image feature vector and the target text feature vector.
In this embodiment, a training mode based on the wide & deep model is introduced. As can be seen from the foregoing embodiments, after the image feature vector and the text feature vector are obtained, they may be spliced together and input to the image recognition model to be trained, where the image feature vector is usually represented as a numerical feature and the text feature vector as an embedding feature; the two may be combined by splicing, for example using a concat layer.
The to-be-trained image recognition model can output class probability distribution of the to-be-trained image based on the input image characteristic vector and the text characteristic vector, and the probability that the to-be-trained image belongs to a certain class can be known based on the class probability distribution.
Specifically, the image recognition model to be trained may be a wide & deep model to be trained, which mainly includes two parts, namely a shallow network and a deep network. The shallow network is usually a linear model, for example an LR model; it can efficiently realize memorization by using cross features, so as to achieve accurate recommendation, and achieves some generalization capability by adding some broad classes of features. The deep network is generally a deep model, for example a DNN model or an FM model; it realizes the generalization capability of the model through learned low-dimensional dense vectors, including generalized recommendation of unseen content. The wide & deep model can well balance the memorization ability, i.e. finding correlations between features from historical data, and the generalization ability, i.e. the transfer of correlations, finding new feature combinations that appear rarely or never in historical data.
Taking any image to be trained as an example, firstly obtaining the image to be trained, then extracting the characteristics of the image to be trained on multiple dimensions, specifically, extracting the characteristics related to the image through an object detection model, a scene recognition model and a two-dimensional code recognition model respectively, and then extracting a semantic characteristic vector and a position characteristic vector by using an OCR technology. The image feature vector comprises an object feature vector, a scene feature vector, a two-dimensional code feature vector, a template feature vector, a basic feature vector and a statistical feature vector. And the semantic feature vector and the position feature vector are respectively input into an MLP layer, and then an output result is processed by adopting a multi-head attention mechanism to obtain a text feature vector. And finally, inputting the text feature vector into a deep network in the wide & deep model to be trained, thereby outputting the target text feature vector. And inputting the image feature vector into a shallow network in the wide & deep model to be trained, thereby outputting the target image feature vector. And outputting the comprehensive target text characteristic vector and the target image characteristic vector by the wide & deep model to be trained, inputting the comprehensive target text characteristic vector and the target image characteristic vector into the MLP, outputting the category probability distribution of the image to be recognized, and finally determining the image recognition result corresponding to the image to be recognized according to the category probability distribution.
Based on this, after the class probability distributions corresponding to the plurality of images to be trained are obtained, the model parameters of the deep network and of the shallow network can be updated using mini-batch stochastic optimization. The deep network may learn and update its model parameters using an adaptive gradient (AdaGrad) method, and the shallow network may learn and update its model parameters using the Follow-The-Regularized-Leader (FTRL) method with L1 regularization.
After the wide & deep model is obtained through training, the image recognition model is obtained. The following description takes the processing of images to be recognized in public number articles as an example:
the method comprises the following steps: firstly, an image to be processed in a public number article is acquired, and preprocessing and feature extraction are performed on the image to be processed.
And secondly, inputting the processed text feature vectors and the processed image feature vectors into a trained image recognition model (specifically, a wide & deep model), thereby calculating the probability score of the image to be recognized belonging to the advertisement category.
And thirdly, judging whether the image belongs to the advertisement category according to a preset rule based on the probability score. For example, if the probability score is greater than or equal to 0.5, the image to be identified is determined to belong to the advertisement category, and the image is then further processed: for example, it is judged whether the proportion of the image to be identified within the public number article is large enough, and if that proportion is greater than or equal to a proportion threshold, the image to be identified is filtered, or the article in which it appears is filtered directly. For another example, the position of the image to be identified within the public number article is determined, and if the image is at the head or in the middle of the article, it needs to be filtered.
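The preset rules of this third step might be sketched as follows; the 0.5 probability score comes from the text, while the proportion threshold value and the position encoding are illustrative assumptions.

```python
def handle_suspected_advert(prob_ad, area_ratio, position,
                            score_threshold=0.5, ratio_threshold=0.2):
    """Decide what to do with an image judged against the preset rules:
    `area_ratio` is the image's proportion of the article, `position` one of
    "head", "middle", "tail" (a hypothetical encoding)."""
    if prob_ad < score_threshold:
        return "keep"                           # not judged an advertisement
    if area_ratio >= ratio_threshold:
        return "filter image or whole article"  # advert occupies too much space
    if position in ("head", "middle"):
        return "filter image"                   # head/middle adverts are filtered
    return "keep"                               # tail adverts are more acceptable
```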
Secondly, in the embodiment of the application, a training mode based on the wide & deep model is provided, and the wide & deep model is used as an image recognition model in the mode, so that the method has better interpretability. Meanwhile, the wide & deep model has the capabilities of memorizing and generalization, and is suitable for jointly learning numerical features and text embedded features. Under the scene provided by the application, the number of images needing to be processed is very large, and the wide & deep model is simple in structure, so that the method is suitable for large-scale prediction scenes.
With reference to the above description, the method for recognizing an image to be recognized provided by the present application will be described below, and referring to fig. 9, an embodiment of the method for recognizing an image to be recognized in the embodiment of the present application includes:
301. acquiring an article to be processed, wherein the article to be processed comprises an image to be identified and article information;
in this embodiment, taking an information publishing platform as an example, the advertisement image recognition device may obtain a large number of articles; for ease of description, any one of these articles, namely the article to be processed, is taken as an example. Specifically, the article to be processed may be an article to be published on a website or an article to be published to a public number, and the like, which is not limited herein. The image to be identified and the article information are extracted from the article to be processed; the image to be identified can generally include two parts, namely picture and text, and the picture may further include a common picture and a two-dimensional code picture. The article information is the text content in the article to be processed.
It should be noted that the advertisement image recognition apparatus is disposed in a computer device, where the computer device may be a server or a terminal device; the application takes the computer device being a server as an example, but this should not be construed as a limitation to the application.
302. If the image to be identified comprises a picture, acquiring an image feature vector according to the image to be identified, wherein the image feature vector comprises a basic feature vector which represents the size information of the picture;
in this embodiment, when the image to be recognized includes a picture, corresponding image feature vectors may be extracted, where the image feature vectors include basic feature vectors, and the basic feature vectors are used to represent size information of a general picture.
It is understood that the manner of extracting the basic feature vector may refer to the content described in step 102 in the foregoing embodiment, and therefore, the description thereof is omitted here.
303. If the image to be recognized comprises a text, acquiring a text feature vector according to the image to be recognized, wherein the text feature vector is determined according to a semantic feature vector of the text and size information of the text, the image feature vector also comprises a statistical characteristic vector, and the statistical characteristic vector represents the number of characters of the text;
in this embodiment, under the condition that the image to be recognized includes a text, the corresponding text feature vector and the statistical characteristic vector in the image feature vector may be extracted, and the text feature vector may include a semantic feature vector and a position feature vector of the text, that is, the text feature vector may be obtained by splicing the semantic feature vector and the position feature vector of the text. The text feature vector may also be generated based on the semantic feature vector and the position feature vector of the text, for example, the semantic feature vector and the position feature vector of the text are processed by using an attention mechanism to obtain the text feature vector.
It is to be understood that the manner of extracting the text feature vector and the manner of extracting the statistical characteristic vector may refer to the content described in step 103 in the foregoing embodiment, and therefore, the details are not described herein.
304. Acquiring an image identification result corresponding to the image to be identified according to the image characteristic vector and the text characteristic vector;
in this embodiment, the advertisement image recognition apparatus concatenates the obtained image feature vector and text feature vector, inputs the concatenated vector to the image recognition model, and the image recognition model outputs a category probability distribution, which may be represented as (a, b), where a + b = 1, a represents the probability that the image to be recognized belongs to the advertisement category, and b represents the probability that it does not. For example, if the category probability distribution is (0.3, 0.7), the image recognition result of the image to be recognized is the non-advertisement category; for another example, if the category probability distribution is (0.8, 0.2), the image recognition result is the advertisement category. Furthermore, the category probability distribution may also be expressed as a single category probability value; for example, if the category probability value is 0.8 and is greater than the category probability threshold of 0.5, the image to be recognized belongs to the advertisement category.
305. And if the image recognition result is the advertisement category, processing the article to be processed.
In this embodiment, if the image recognition result is the advertisement category, the advertisement image recognition apparatus needs to process the article to be processed. For an information product, the quality of articles greatly influences the reading experience of users, and various types of advertisements appearing in articles in the article candidate pool impair that experience. Therefore, such articles (i.e., articles to be processed) need to be processed, as described below in conjunction with three cases.
Referring to fig. 10, fig. 10 is a schematic diagram of an image to be recognized in an article to be processed in an embodiment of the present application. As shown in the drawing, the image to be recognized indicated by B1 contains a picture and text, and the picture includes a two-dimensional code picture. If it is determined that the image to be recognized belongs to the advertisement category, the position of the image in the article is further obtained. The image to be recognized in fig. 10 is located at the top of the article, where the user sees it immediately after opening the article, which easily causes objection; an advertisement appearing at the tail of the article is more easily accepted. Therefore, the image to be recognized may be filtered out of the article to be processed, or moved to the tail of the article to be processed, or the article to be processed may be removed from the article candidate pool directly, or the weight of the article to be processed may be reduced so that it appears later in the ranking.
Referring to fig. 11, fig. 11 is another schematic diagram of an image to be recognized based on an article to be processed in the embodiment of the present application, as shown in the drawing, a picture and a text exist in the image to be recognized indicated by C1, and the picture includes a two-dimensional code picture, if it is determined that the image to be recognized belongs to an advertisement category, the position of the image to be recognized in the article is further obtained, and the image to be recognized in fig. 11 is located at the tail of the article. Therefore, no processing of the image to be recognized is required.
Referring to fig. 12, fig. 12 is another schematic diagram of an image to be recognized based on an article to be processed in the embodiment of the present application, as shown in the figure, a picture and a text exist in the image to be recognized indicated by D1, and the picture includes a two-dimensional code picture, and if it is determined that the image to be recognized belongs to an advertisement category, a position of the image to be recognized in the article and an area occupied by the image to be recognized are further obtained. The image to be recognized in fig. 12 is located in the middle of the article and occupies a larger area, for example, 1/5 of the entire article to be processed, at this time, the entire advertisement image occupies a larger area and has a worse influence, so the image to be recognized may be filtered from the article to be processed, or the image to be recognized may be adjusted to the tail of the article to be processed, or the article to be processed may be directly removed from the candidate pool of articles, or the weight of the article to be processed may be reduced, so that the articles to be processed appear in a later order.
Optionally, the advertisement image recognition device adjusts the article to be processed and then releases the article to the information platform. For easy understanding, please refer to fig. 13, fig. 13 is a schematic view of a portal interface for viewing articles in the embodiment of the present application, as shown in fig. 13, a portal for viewing articles is provided in fig. 13 (a), that is, a "see-at-a-look" portal indicated by E1, and after the portal is triggered by a user, the user may enter the interface shown in fig. 13 (B), that is, at least one article is displayed on the interface, and the user reads the article of interest by clicking the interface.
The embodiment of the application provides a method for identifying an image to be identified: first, an article to be processed is obtained; if the image to be identified includes a picture, an image feature vector is obtained according to the image to be identified; if the image to be identified includes text, a text feature vector is obtained according to the image to be identified; an image identification result corresponding to the image to be identified is then obtained according to the image feature vector and the text feature vector; and finally the article to be processed is processed. In this manner, during recognition both the picture and the text can be extracted from the image to be identified as the basis of image identification, and the extracted image feature vector and text feature vector jointly serve as the basis for predicting the image type; the final image identification result synthesizes features of the two dimensions of picture and text, which helps obtain an image identification result with higher accuracy, effectively identify various advertisement images hidden in articles, and filter advertisement images that seriously affect the user's reading.
Referring to fig. 14, fig. 14 is a schematic diagram of an embodiment of an image recognition apparatus in an embodiment of the present application, in which the image recognition apparatus 40 includes:
an obtaining module 401, configured to obtain an image to be identified;
the obtaining module 401 is further configured to obtain an image feature vector according to the image to be recognized if the image to be recognized includes a picture, where the image feature vector includes a basic feature vector, and the basic feature vector represents size information of the picture;
the obtaining module 401 is further configured to obtain a text feature vector according to the image to be recognized if the image to be recognized includes a text, where the text feature vector is determined according to a semantic feature vector of the text and a position feature vector of the text, the image feature vector further includes a statistical characteristic vector, and the statistical characteristic vector represents the number of characters of the text;
the identifying module 402 is configured to obtain an image identifying result corresponding to the image to be identified according to the image feature vector and the text feature vector.
Optionally, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the image recognition apparatus 40 provided in the embodiment of the present application, the image feature vector further includes at least one of an object feature vector, a scene feature vector, a two-dimensional code feature vector, and a template feature vector;
the acquisition module is specifically used for acquiring object feature vectors through an object detection model based on the image to be identified, wherein the object feature vectors represent the category probabilities corresponding to the picture;
or the obtaining module is specifically configured to obtain a scene feature vector through a scene recognition model based on the image to be recognized, where the scene feature vector represents a scene probability corresponding to the image;
or the obtaining module is specifically configured to obtain a two-dimensional code feature vector through a two-dimensional code recognition model based on the image to be recognized, where the two-dimensional code feature vector represents category probability and size information corresponding to the two-dimensional code;
or the obtaining module is specifically configured to generate the template feature vector according to the occurrence frequency of the picture in the image to be recognized.
Alternatively, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the image recognition apparatus 40 provided in the embodiment of the present application,
the acquiring module 401 is specifically configured to perform Optical Character Recognition (OCR) on an image to be recognized to obtain a text and position information of the text, where the text comprises P words and P is an integer greater than or equal to 1;
generating a semantic feature vector according to the text, wherein the semantic feature vector comprises P word embedding vectors, and the word embedding vectors and the words have corresponding relations;
generating a position feature vector according to the position information of the text;
and generating a text feature vector according to the semantic feature vector and the position feature vector.
Alternatively, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the image recognition apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to process the semantic feature vector and the position feature vector based on the multi-head attention mechanism to obtain a text feature vector, where the semantic feature vector belongs to a query corresponding to the multi-head attention mechanism, and the position feature vector belongs to a key corresponding to the multi-head attention mechanism.
Alternatively, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the image recognition apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is further configured to, after obtaining the image to be recognized, obtain a text feature vector and a first feature vector according to the image to be recognized if the image to be recognized does not satisfy the image feature extraction condition;
the obtaining module 401 is further configured to obtain an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector, and includes:
the obtaining module 401 is further configured to obtain, based on the first feature vector and the text feature vector, an image recognition result corresponding to the image to be recognized through the image recognition model.
Alternatively, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the image recognition apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is further configured to, after obtaining the image to be recognized, obtain an image feature vector and a second feature vector according to the image to be recognized if the image to be recognized does not satisfy the text feature extraction condition;
the obtaining module 401 is specifically configured to obtain, based on the image feature vector and the second feature vector, an image recognition result corresponding to the image to be recognized through the image recognition model.
Alternatively, on the basis of the embodiment corresponding to fig. 14, in another embodiment of the image recognition apparatus 40 provided in the embodiment of the present application,
an obtaining module 401, specifically configured to obtain a target image feature vector through a shallow network included in the image recognition model based on the image feature vector;
based on the text feature vector, obtaining a target text feature vector through a deep network included in the image recognition model;
based on the target image feature vector and the target text feature vector, acquiring the class probability distribution of the image to be recognized through a multilayer perceptron included in the image recognition model;
and determining an image recognition result corresponding to the image to be recognized according to the class probability distribution of the image to be recognized.
Referring to fig. 15, fig. 15 is a schematic view of an embodiment of the model training device in the embodiment of the present application, and the model training device 50 includes:
an obtaining module 501, configured to obtain an image set to be trained, where the image set to be trained includes at least one image to be trained, and each image to be trained corresponds to a label, and the label is used to indicate a category of the image to be trained;
the obtaining module 501 is further configured to obtain an image feature vector corresponding to an image to be trained based on any one image to be trained in the image set to be trained, where the image feature vector includes a basic feature vector and a statistical feature vector, the basic feature vector represents size information of a picture in the image to be trained, and the statistical feature vector represents the number of characters of a text in the image to be trained;
the obtaining module 501 is further configured to obtain a text feature vector corresponding to an image to be trained based on any one of the images to be trained, where the text feature vector is determined according to a semantic feature vector of a text in the image to be trained and size information of the text in the image to be trained;
the obtaining module 501 is further configured to obtain, based on an image feature vector and a text feature vector corresponding to any one image to be trained, category probability distribution corresponding to the image to be trained through an image recognition model to be trained;
the training module 502 is configured to update a model parameter of the image recognition model to be trained according to the class probability distribution and the label corresponding to the at least one image to be trained until a model convergence condition is reached, so as to obtain the image recognition model.
Alternatively, on the basis of the embodiment corresponding to fig. 15, in another embodiment of the model training device 50 provided in the embodiment of the present application,
the obtaining module 501 is specifically configured to obtain a target image feature vector through a shallow network included in an image recognition model to be trained based on an image feature vector corresponding to any one image to be trained;
acquiring a target text feature vector through a deep network included in an image recognition model to be trained based on an image feature vector corresponding to any one image to be trained;
and obtaining the class probability distribution corresponding to the image to be trained through a multilayer perceptron included in the image recognition model to be trained based on the target image feature vector and the target text feature vector.
Referring to fig. 16, fig. 16 is a schematic view of an embodiment of an advertisement image recognition apparatus in an embodiment of the present application, and an advertisement image recognition apparatus 60 includes:
the acquisition module 601 is configured to acquire an article to be processed, where the article to be processed includes an image to be identified and article information;
the obtaining module 601 is further configured to obtain an image feature vector according to the image to be recognized if the image to be recognized includes a picture, where the image feature vector includes a basic feature vector, and the basic feature vector represents size information of the picture;
the obtaining module 601 is further configured to obtain a text feature vector according to the image to be recognized if the image to be recognized includes a text, where the text feature vector is determined according to a semantic feature vector of the text and size information of the text, the image feature vector further includes a statistical feature vector, and the statistical feature vector represents the number of characters of the text;
the obtaining module 601 is further configured to obtain an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector;
the processing module 602 is configured to process the article to be processed if the image identification result obtained by the obtaining module 601 is the advertisement category.
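Putting the modules of the advertisement image recognition apparatus together, a hedged end-to-end sketch might look like the following; the feature extractors, the article layout, the index of the advertisement category, and the blocking action are all assumptions, not details fixed by the embodiment:

```python
import torch

def screen_article(article, model, extract_image_features, extract_text_features, ad_class=1):
    """Hypothetical glue code for the acquisition and processing modules above;
    the extractor callables and the ad_class index are assumed stand-ins."""
    image = article["image_to_be_recognized"]
    img_vec = extract_image_features(image)              # basic + statistical feature vectors
    txt_vec = extract_text_features(image)               # semantic features + size information of the text
    with torch.no_grad():
        probs = model(img_vec, txt_vec).softmax(dim=-1)  # class probability distribution
    if int(probs.argmax(dim=-1)) == ad_class:            # recognition result: advertisement category
        article["status"] = "blocked"                    # process the article to be processed
    return article
```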
The embodiment of the application also provides another image recognition device, model training device, and advertisement image recognition device, which can be deployed in computer equipment, and the computer equipment can be a server. Referring to fig. 17, fig. 17 is a schematic structural diagram of a server according to an embodiment of the present disclosure. The server 700 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 722 (e.g., one or more processors), a memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) storing an application program 742 or data 744. The memory 732 and the storage medium 730 may each be transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 722 may be configured to communicate with the storage medium 730 and execute, on the server 700, the series of instruction operations stored in the storage medium 730.
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 17.
The embodiment of the application also provides another image recognition device, model training device, and advertisement image recognition device, which can be deployed in computer equipment, and the computer equipment can be terminal equipment. As shown in fig. 18, for convenience of explanation, only the parts related to the embodiments of the present application are shown; for technical details that are not disclosed here, please refer to the method part of the embodiments of the present application. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like. The following description takes a mobile phone as an example:
Fig. 18 is a block diagram illustrating a partial structure of a mobile phone related to the terminal device provided in an embodiment of the present application. Referring to fig. 18, the mobile phone includes: a radio frequency (RF) circuit 810, a memory 820, an input unit 830, a display unit 840, a sensor 850, an audio circuit 860, a wireless fidelity (WiFi) module 870, a processor 880, and a power supply 890. Those skilled in the art will appreciate that the handset structure shown in fig. 18 is not limiting; the handset may include more or fewer components than those shown, may combine some components, or may arrange the components differently.
The following describes each component of the mobile phone in detail with reference to fig. 18:
The RF circuit 810 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, the RF circuit 810 forwards it to the processor 880 for processing; in addition, it transmits uplink data to the base station. In general, the RF circuit 810 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. The RF circuit 810 may also communicate with networks and other devices via wireless communication, which may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), etc.
The memory 820 may be used to store software programs and modules, and the processor 880 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 820. The memory 820 may mainly include a program storage area and a data storage area: the program storage area may store the operating system, application programs required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the phone (such as audio data or a phonebook). Further, the memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 830 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the phone. Specifically, the input unit 830 may include a touch panel 831 and other input devices 832. The touch panel 831, also referred to as a touch screen, can collect touch operations performed by a user on or near it (for example, operations performed with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 831 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 880, and can also receive and execute commands from the processor 880. The touch panel 831 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 831, the input unit 830 may include other input devices 832, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 840 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 840 may include a display panel 841, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 831 can overlay the display panel 841; when the touch panel 831 detects a touch operation on or near it, it passes the operation to the processor 880 to determine the type of touch event, and the processor 880 then provides a corresponding visual output on the display panel 841 according to that type. Although in fig. 18 the touch panel 831 and the display panel 841 are shown as two separate components implementing the input and output functions of the phone, in some embodiments they may be integrated to implement both functions.
The handset may also include at least one sensor 850, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 841 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 841 and/or the backlight when the phone is moved to the ear. As one type of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the phone's posture (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection); other sensors that can be configured on the phone, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described further here.
The audio circuit 860, speaker 861, and microphone 862 may provide an audio interface between the user and the mobile phone. The audio circuit 860 can transmit the electrical signal converted from received audio data to the speaker 861, which converts it into a sound signal for output; conversely, the microphone 862 converts collected sound signals into electrical signals, which the audio circuit 860 receives and converts into audio data. The audio data are then output to the processor 880 for processing and transmitted, for example, to another mobile phone via the RF circuit 810, or output to the memory 820 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 870, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing wireless broadband Internet access. Although fig. 18 shows the WiFi module 870, it is understood that it is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 880 is the control center of the mobile phone. It connects the various parts of the entire phone using various interfaces and lines, and performs the phone's functions and processes data by running or executing the software programs and/or modules stored in the memory 820 and calling the data stored in the memory 820, thereby monitoring the phone as a whole. Optionally, the processor 880 may include one or more processing units; optionally, the processor 880 may integrate an application processor and a modem processor, where the application processor primarily handles the operating system, user interfaces, and application programs, and the modem processor primarily handles wireless communication. It will be appreciated that the modem processor need not be integrated into the processor 880.
The mobile phone also includes a power supply 890 (e.g., a battery) for supplying power to the various components. Optionally, the power supply may be logically connected to the processor 880 via a power management system, so as to manage charging, discharging, and power consumption through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
The steps performed by the terminal device in the above-described embodiment may be based on the terminal device configuration shown in fig. 18.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the steps described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising a program, which when run on a computer, causes the computer to perform the steps as described in the previous embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part contributing over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of image recognition, comprising:
acquiring an image to be identified;
if the image to be identified comprises a picture, obtaining an image feature vector according to the image to be identified, wherein the image feature vector comprises a basic feature vector which represents the size information of the picture;
if the image to be recognized comprises a text, acquiring a text feature vector according to the image to be recognized, wherein the text feature vector is determined according to a semantic feature vector of the text and a position feature vector of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters of the text;
and acquiring an image identification result corresponding to the image to be identified according to the image characteristic vector and the text characteristic vector.
2. The method of claim 1, wherein the image feature vector further comprises at least one of an object feature vector, a scene feature vector, a two-dimensional code feature vector, and a template feature vector;
the obtaining of the image feature vector according to the image to be identified comprises:
acquiring the object feature vector through an object detection model based on the image to be identified, wherein the object feature vector represents the category probability corresponding to the picture;
or, the obtaining of the image feature vector according to the image to be recognized includes:
acquiring the scene characteristic vector through a scene recognition model based on the image to be recognized, wherein the scene characteristic vector represents the scene probability corresponding to the picture;
or, the obtaining of the image feature vector according to the image to be recognized includes:
acquiring the two-dimension code characteristic vector through a two-dimension code identification model based on the image to be identified, wherein the two-dimension code characteristic vector represents category probability and size information corresponding to the two-dimension code;
or, the obtaining of the image feature vector according to the image to be recognized includes:
and generating the template characteristic vector according to the occurrence times of the pictures in the image to be recognized.
3. The method according to claim 1, wherein the obtaining a text feature vector according to the image to be recognized comprises:
performing Optical Character Recognition (OCR) on the image to be recognized to obtain the text and the position information of the text, wherein the text comprises P words, and P is an integer greater than or equal to 1;
generating the semantic feature vector according to the text, wherein the semantic feature vector comprises P word embedding vectors, and the word embedding vectors and the words have corresponding relations;
generating the position characteristic vector according to the position information of the text;
and generating the text feature vector according to the semantic feature vector and the position feature vector.
4. The method of claim 3, wherein the generating the text feature vector from the semantic feature vector and the location feature vector comprises:
and processing the semantic feature vector and the position feature vector based on a multi-head attention mechanism to obtain the text feature vector, wherein the semantic feature vector belongs to a query corresponding to the multi-head attention mechanism, and the position feature vector belongs to a key corresponding to the multi-head attention mechanism.
5. The method of claim 1, wherein after the acquiring the image to be identified, the method further comprises:
if the image to be recognized does not meet the image feature extraction condition, acquiring a text feature vector and a first feature vector according to the image to be recognized;
the obtaining of the image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector includes:
and acquiring an image recognition result corresponding to the image to be recognized through an image recognition model based on the first feature vector and the text feature vector.
6. The method of claim 1, wherein after the acquiring the image to be identified, the method further comprises:
if the image to be recognized does not meet the text feature extraction condition, acquiring an image feature vector and a second feature vector according to the image to be recognized;
the obtaining of the image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector includes:
and acquiring an image identification result corresponding to the image to be identified through an image identification model based on the image feature vector and the second feature vector.
7. The method according to any one of claims 1 to 6, wherein the obtaining an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector includes:
based on the image feature vector, acquiring a target image feature vector through a shallow network included in an image recognition model;
based on the text feature vector, obtaining a target text feature vector through a deep network included in the image recognition model;
based on the target image feature vector and the target text feature vector, obtaining the class probability distribution of the image to be recognized through a multilayer perceptron included in the image recognition model;
and determining an image recognition result corresponding to the image to be recognized according to the class probability distribution of the image to be recognized.
8. A training method of an image recognition model is characterized by comprising the following steps:
acquiring an image set to be trained, wherein the image set to be trained comprises at least one image to be trained, each image to be trained corresponds to a label, and the label is used for representing the category of the image to be trained;
acquiring an image feature vector corresponding to an image to be trained based on any image to be trained in the image set to be trained, wherein the image feature vector comprises a basic feature vector and a statistical feature vector, the basic feature vector represents size information of a picture in the image to be trained, and the statistical feature vector represents the number of characters of a text in the image to be trained;
acquiring a text feature vector corresponding to the image to be trained based on the any one image to be trained, wherein the text feature vector is determined according to a semantic feature vector of a text in the image to be trained and size information of the text in the image to be trained;
based on the image feature vector and the text feature vector corresponding to any one image to be trained, obtaining the class probability distribution corresponding to the image to be trained through an image recognition model to be trained;
and updating the model parameters of the image recognition model to be trained according to the class probability distribution and the label corresponding to the at least one image to be trained until the model convergence condition is reached, thereby obtaining the image recognition model.
9. The method according to claim 8, wherein the obtaining of the class probability distribution corresponding to the image to be trained through the image recognition model to be trained based on the image feature vector and the text feature vector corresponding to any one of the images to be trained comprises:
based on the image feature vector corresponding to any one image to be trained, acquiring a target image feature vector through a shallow network included in the image recognition model to be trained;
based on the text feature vector corresponding to the any one image to be trained, acquiring a target text feature vector through a deep network included in the image recognition model to be trained;
and obtaining the class probability distribution corresponding to the image to be trained through a multilayer perceptron included in the image recognition model to be trained based on the target image feature vector and the target text feature vector.
10. An advertisement image recognition method is characterized by comprising the following steps:
acquiring an article to be processed, wherein the article to be processed comprises an image to be identified and article information;
if the image to be identified comprises a picture, obtaining an image feature vector according to the image to be identified, wherein the image feature vector comprises a basic feature vector which represents the size information of the picture;
if the image to be recognized comprises a text, acquiring a text feature vector according to the image to be recognized, wherein the text feature vector is determined according to a semantic feature vector of the text and size information of the text, the image feature vector further comprises a statistical feature vector, and the statistical feature vector represents the number of characters of the text;
acquiring an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector;
and if the image identification result is the advertisement category, processing the article to be processed.
11. An image recognition apparatus, comprising:
the acquisition module is used for acquiring an image to be identified;
the obtaining module is further configured to obtain an image feature vector according to the image to be recognized if the image to be recognized includes a picture, where the image feature vector includes a basic feature vector, and the basic feature vector represents size information of the picture;
the obtaining module is further configured to obtain a text feature vector according to the image to be recognized if the image to be recognized includes a text, where the text feature vector is determined according to a semantic feature vector of the text and a position feature vector of the text, the image feature vector further includes a statistical feature vector, and the statistical feature vector represents the number of characters of the text;
and the identification module is used for acquiring an image identification result corresponding to the image to be identified according to the image characteristic vector and the text characteristic vector.
12. A model training apparatus, comprising:
the training device comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring an image set to be trained, the image set to be trained comprises at least one image to be trained, each image to be trained corresponds to a label, and the label is used for representing the category of the image to be trained;
the obtaining module is further configured to obtain an image feature vector corresponding to the image to be trained based on any one image to be trained in the image set to be trained, where the image feature vector includes a basic feature vector and a statistical feature vector, the basic feature vector represents size information of a picture in the image to be trained, and the statistical feature vector represents the number of characters of a text in the image to be trained;
the obtaining module is further configured to obtain a text feature vector corresponding to the image to be trained based on the any one image to be trained, where the text feature vector is determined according to a semantic feature vector of a text in the image to be trained and size information of the text in the image to be trained;
the acquisition module is further configured to acquire a class probability distribution corresponding to the image to be trained through an image recognition model to be trained based on the image feature vector and the text feature vector corresponding to the any one image to be trained;
and the training module is used for updating the model parameters of the image recognition model to be trained according to the class probability distribution and the label corresponding to the at least one image to be trained until a model convergence condition is reached, so as to obtain the image recognition model.
13. An advertisement image recognition apparatus, comprising:
the device comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring an article to be processed, and the article to be processed comprises an image to be recognized and article information;
the obtaining module is further configured to obtain an image feature vector according to the image to be recognized if the image to be recognized includes a picture, where the image feature vector includes a basic feature vector, and the basic feature vector represents size information of the picture;
the obtaining module is further configured to obtain a text feature vector according to the image to be recognized if the image to be recognized includes a text, where the text feature vector is determined according to a semantic feature vector of the text and size information of the text, the image feature vector further includes a statistical feature vector, and the statistical feature vector represents the number of characters of the text;
the obtaining module is further configured to obtain an image recognition result corresponding to the image to be recognized according to the image feature vector and the text feature vector;
and the processing module is used for processing the article to be processed if the image identification result is the advertisement category.
14. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory and, according to instructions in the program code, perform the method of any one of claims 1 to 7, or the training method of claim 8 or 9, or the method of claim 10;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7, or perform the training method of claim 8 or 9, or perform the method of claim 10.

Priority Applications (1)

Application Number: CN202010668385.4A; Priority Date: 2020-07-13; Filing Date: 2020-07-13; Title: Image recognition method, and training method and device of image recognition model


Publications (1)

Publication Number: CN111709398A; Publication Date: 2020-09-25

Family ID: 72545537

Family Applications (1)

Application Number: CN202010668385.4A (Pending); Priority Date: 2020-07-13; Filing Date: 2020-07-13; Title: Image recognition method, and training method and device of image recognition model

Country Status (1)

Country: CN; Document: CN111709398A

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329660A (en) * 2020-11-10 2021-02-05 浙江商汤科技开发有限公司 Scene recognition method and device, intelligent equipment and storage medium
CN112650876A (en) * 2020-12-30 2021-04-13 北京嘀嘀无限科技发展有限公司 Image processing method, image processing apparatus, electronic device, storage medium, and program product
CN113127663A (en) * 2021-04-01 2021-07-16 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN113127663B (en) * 2021-04-01 2024-02-27 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN113052257A (en) * 2021-04-13 2021-06-29 中国电子科技集团公司信息科学研究院 Deep reinforcement learning method and device based on visual converter
CN113052257B (en) * 2021-04-13 2024-04-16 中国电子科技集团公司信息科学研究院 Deep reinforcement learning method and device based on visual transducer
CN113591883A (en) * 2021-09-30 2021-11-02 中国人民解放军国防科技大学 Image recognition method, system, device and storage medium based on attention mechanism
CN113591883B (en) * 2021-09-30 2021-12-03 中国人民解放军国防科技大学 Image recognition method, system, device and storage medium based on attention mechanism
CN114140673A (en) * 2022-02-07 2022-03-04 人民中科(济南)智能技术有限公司 Illegal image identification method, system and equipment

Similar Documents

Publication Publication Date Title
WO2020199932A1 (en) Model training method, face recognition method, device and apparatus, and storage medium
EP3940638B1 (en) Image region positioning method, model training method, and related apparatus
CN108280458B (en) Group relation type identification method and device
CN111709398A (en) Image recognition method, and training method and device of image recognition model
CN111291190B (en) Training method of encoder, information detection method and related device
CN111428091B (en) Encoder training method, information recommendation method and related device
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN111914113A (en) Image retrieval method and related device
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN113378556A (en) Method and device for extracting text keywords
CN113822427A (en) Model training method, image matching device and storage medium
CN114328906A (en) Multistage category determination method, model training method and related device
CN116935188B (en) Model training method, image recognition method, device, equipment and medium
CN114092920A (en) Model training method, image classification method, device and storage medium
CN111738000B (en) Phrase recommendation method and related device
CN114281936A (en) Classification method and device, computer equipment and storage medium
CN113486260B (en) Method and device for generating interactive information, computer equipment and storage medium
CN114722937A (en) Abnormal data detection method and device, electronic equipment and storage medium
CN113569042A (en) Text information classification method and device, computer equipment and storage medium
CN113569002A (en) Text search method, device, equipment and storage medium
CN113569043A (en) Text category determination method and related device
CN117373093A (en) Image recognition method, device, equipment and storage medium based on artificial intelligence
CN117725234A (en) Media information identification method, device, computer equipment and storage medium
CN117009504A (en) Entity extraction method, device, equipment, storage medium and product

Legal Events

PB01: Publication
REG: Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40028108)
SE01: Entry into force of request for substantive examination