CN116030301A - Image content classification method, device, equipment, medium and product - Google Patents


Info

Publication number
CN116030301A
Authority
CN
China
Prior art keywords
target image
neural network
feedforward neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211726229.4A
Other languages
Chinese (zh)
Inventor
向东
王巍
李捷
杨昀欣
王慧
石明
程海波
徐柯文
王迪
Current Assignee
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Application filed by Shanghai Pudong Development Bank Co Ltd filed Critical Shanghai Pudong Development Bank Co Ltd
Priority to CN202211726229.4A
Publication of CN116030301A

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to an image content classification method, an image content classification apparatus, a computer device, a storage medium and a computer program product. The method comprises the following steps: first, a target image is acquired; next, the output size of the target image is determined according to a preset step size parameter; then, target text features of each information item in the target image are extracted according to the output size, an attention network and a feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer; finally, each information item in the target image is classified according to the target text features. By extracting target text features from the output size of the target image together with the attention network and the feedforward neural network, the method can classify similar texts in an image.

Description

Image content classification method, device, equipment, medium and product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image content classification method, apparatus, computer device, storage medium, and computer program product.
Background
A large amount of image data is generated in financial business activities and production. With the continuous development of OCR (Optical Character Recognition) technology and related deep learning, research on character recognition in images has steadily advanced, and image classification, as an important preliminary step for subsequent character recognition and extraction, has become increasingly important to practitioners in the financial industry.
An existing image classification method inputs preprocessed image data into a multi-path output network framework, where the framework uses two path-selection algorithms as the classifier of the network, then determines an optimal path between the two paths according to the path-selection algorithms, and finally classifies the content in the image according to the optimal path.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image content classification method, apparatus, computer device, computer readable storage medium, and computer program product that are capable of classifying similar text in an image.
In a first aspect, the present application provides a method for classifying image content, the method comprising:
acquiring a target image;
determining the output size of the target image according to the preset step size parameter;
extracting target text characteristics of each information item in a target image according to the output size, an attention network and a feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer;
and classifying each information item in the target image according to the target text characteristics.
In one embodiment, the determining the output size of the target image according to the preset step size parameter includes:
determining a sampling rate according to the preset step size parameter;
downsampling the target image according to the sampling rate;
and determining the output size according to the downsampling result.
In one embodiment, the extracting the target text feature of each information item in the target image according to the output size, the attention network and the feedforward neural network includes:
according to a batch normalization algorithm, respectively carrying out normalization processing on first input data input into the attention network and second input data input into the feedforward neural network, wherein the first input data and the second input data respectively represent text data of a target image input into the attention network and text data of a target image input into the feedforward neural network;
and extracting target text characteristics of each part of content in the target image according to the normalization processing result, the output size, the attention network and the feedforward neural network.
In one embodiment, the extracting the target text feature of each information item in the target image according to the output size, the attention network and the feedforward neural network further includes:
acquiring the input size of a target image;
determining the spatial resolution of the target image according to the input size;
determining the number of output channels of the target image according to the spatial resolution;
and extracting target text characteristics of each part of content in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
In one embodiment, the extracting the target text feature of each information item in the target image according to the output size, the attention network and the feedforward neural network further includes:
extracting initial text features of information items in a target image according to the output size, the attention network and the feedforward neural network;
and processing the initial text feature according to a preset activation function to obtain the target text feature.
In one embodiment, the classifying each information item in the target image according to the target text feature includes:
normalizing all target text features of the target image according to a layer normalization algorithm;
and classifying the contents of each part in the target image according to the normalization processing result.
In a second aspect, the present application further provides an image content classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target image;
the determining module is used for determining the output size of the target image according to the preset step size parameter;
the extraction module is used for extracting target text characteristics of each information item in the target image according to the output size, the attention network and the feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer;
and the classification module is used for classifying each information item in the target image according to the target text characteristics.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that implements the steps of the method of any of the embodiments described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the embodiments described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the method of any of the embodiments described above.
With the image content classification method, apparatus, computer device, storage medium and computer program product, a target image is first acquired; the output size of the target image is then determined according to a preset step size parameter; target text features of each information item in the target image are then extracted according to the output size, the attention network and the feedforward neural network, where the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer; and finally each part of the content in the target image is classified according to the target text features. By extracting target text features from the output size of the target image together with the attention network and the feedforward neural network, the method can classify similar texts in an image.
Drawings
FIG. 1 is a flow diagram of a method for classifying image content in one embodiment;
FIG. 2 is a flow chart of a method of output sizing in one embodiment;
FIG. 3 is a schematic diagram of a multi-stage visual attention network in another embodiment;
FIG. 4 is a block diagram showing the structure of an image content classifying apparatus according to an embodiment;
fig. 5 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, an image content classification method is provided. The method is described here as applied to a terminal by way of illustration; it is understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the terminal and the server.
In this embodiment, the method includes the steps of:
s102, acquiring a target image.
The target image refers to an image of a financial bill in a financial business scene, and includes a plurality of information items, for example, the information items include a title of the bill, an amount of the bill, a customer name, and a customer account number, and in other embodiments, the information items may include other information items, which are not particularly limited herein.
S104, determining the output size of the target image according to the preset step size parameter.
The preset step size parameter is used for controlling the sampling rate of downsampling, and the output size refers to the size of the target image after being processed by the attention network and the feedforward neural network.
Specifically, the terminal determines a downsampling rate according to the preset step size parameter, downsamples the target image according to that sampling rate, and determines the output size of the target image from the downsampling result, where the output size of the target image processed by the attention network is consistent with the output size of the target image processed by the feedforward neural network.
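As a concrete illustration (not from the patent itself), the relationship between a stride-style step size parameter and the resulting output size can be sketched as follows; the image dimensions are hypothetical:

```python
def output_size(height, width, step):
    # The step size parameter acts as the downsampling rate: a strided
    # convolution (or pooling) with stride `step` and matching padding
    # reduces each spatial dimension by that factor (floor division).
    return height // step, width // step

# A step size of 4 on a hypothetical 224x224 bill image yields a 56x56 map.
print(output_size(224, 224, 4))
```

Because both branches are driven by the same step size parameter, the attention network and the feedforward neural network naturally produce the same output size.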
S106, extracting target text characteristics of each information item in the target image according to the output size, the attention network and the feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer.
The attention network is used to adaptively select input features, so that feature extraction by the deep network is more targeted and the accuracy of related tasks is improved. The feedforward neural network is the simplest neural network: all neurons are arranged in layers, and each neuron is connected only to neurons of the previous layer. The target text features are the text features of each information item of the target image. The spatial local convolution layer is a depth-wise convolution and can identify image text context information; the spatial remote convolution layer is a depth-wise dilated convolution and can capture long-range dependencies across the image, where the dilation rate of the spatial remote convolution layer is d and its kernel size is (2d-1) x (2d-1); the channel convolution layer is a convolution of size 1x1 and can capture relations along the channel dimension of the image.
Specifically, the terminal firstly controls the attention network to primarily extract text features of the target image, and then controls the feedforward neural network to process the primarily extracted text features to obtain the target text features.
S108, classifying each information item in the target image according to the target text characteristics.
Specifically, after the target text features of the target image are determined, the terminal normalizes all the target text features in the target image using a layer normalization algorithm, and classifies each information item in the target image according to the normalized target text features.
In the image content classification method, a target image is first acquired; the output size of the target image is then determined according to the preset step size parameter; target text features of each information item in the target image are then extracted according to the output size, the attention network and the feedforward neural network, where the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer; and finally the contents of each part in the target image are classified according to the target text features. By extracting target text features from the output size of the target image together with the attention network and the feedforward neural network, the method can classify similar texts in an image.
In some embodiments, as shown in fig. 2, fig. 2 is a flowchart of an output size determining method in one embodiment, determining an output size of a target image according to a preset step size parameter, including: determining a sampling rate according to a preset step size parameter; downsampling the target image according to the sampling rate; and determining the output size according to the sampling result of the downsampling.
In this step, the sampling rate refers to the sampling frequency of downsampling, which characterizes the number of samples extracted per second from a continuous signal to form a discrete signal; downsampling, i.e. decimation, is a basic operation in multi-rate signal processing.
The method provided in this step determines the output size according to the downsampling result, laying a foundation for subsequent image content classification.
In some embodiments, extracting target text features for each information item in the target image based on the output size, the attention network, and the feedforward neural network includes: according to a batch normalization algorithm, respectively carrying out normalization processing on first input data of an input attention network and second input data of an input feedforward neural network, wherein the first input data and the second input data respectively represent text data of a target image of the input attention network and text data of a target image of the input feedforward neural network; and extracting target text characteristics of each part of content in the target image according to the normalization processing result, the output size, the attention network and the feedforward neural network.
In this step, a batch normalization algorithm (Batch Normalization, BN) is a technique for improving the performance and stability of artificial neural networks.
Specifically, before the first input data is input into the attention network, the terminal normalizes the first input data using the batch normalization algorithm. The attention network processes the first input data to produce the second input data, which the terminal likewise normalizes with the batch normalization algorithm before inputting it into the feedforward neural network; the feedforward neural network then processes the second input data to obtain the target text features.
The method provided in this step can accelerate the convergence of the attention network and the feedforward neural network and avoid overfitting.
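A minimal numpy sketch of the batch normalization step described above (the learnable scale and shift parameters are omitted for brevity; shapes and names are illustrative):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x has shape (batch, channels, height, width); each channel is
    # normalized to zero mean and unit variance over the batch and
    # spatial axes, which stabilizes and speeds up training.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
features = rng.normal(3.0, 2.0, size=(8, 4, 16, 16))  # toy feature maps
normalized = batch_norm(features)
```

After normalization, each channel's statistics are near zero mean and unit variance regardless of the scale of the incoming features.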
In some embodiments, extracting target text features for each information item in the target image based on the output size, the attention network, and the feedforward neural network further includes: acquiring the input size of a target image; determining the spatial resolution of the target image according to the input size; determining the number of output channels of the target image according to the spatial resolution; and extracting target text characteristics of each part of content in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
In this step, the input size refers to the original size of the target image; the spatial resolution refers to the size of the smallest unit whose details can be distinguished in the image; and the number of output channels refers to the number of channels after the convolution operation, which determines the number of convolution kernels.
In the method provided in this step, the target text features are determined according to the number of output channels, which can improve the accuracy of determining the target text features.
In some embodiments, extracting target text features for each information item in the target image based on the output size, the attention network, and the feedforward neural network further includes: extracting initial text characteristics of each information item in the target image according to the output size, the attention network and the feedforward neural network; and processing the initial text features according to a preset activation function to obtain target text features.
In this step, the initial text features are text features that have not yet been processed by a preset activation function. An activation function is a function running on a neuron of an artificial neural network and is responsible for mapping the neuron's input to its output; for example, the activation function may be a GELU (Gaussian Error Linear Unit) function. After the text data of the target image is processed by the attention network and the feedforward neural network, the data is further processed by the activation function.
The method provided in this step uses the activation function to introduce nonlinear factors into the neurons, so that the neural network can approximate any nonlinear function.
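For illustration, the widely used tanh approximation of the GELU activation can be sketched as follows (a minimal sketch, not the patent's own implementation):

```python
import math

def gelu(x):
    # Gaussian Error Linear Unit (tanh approximation): maps a neuron's
    # input to its output while introducing the nonlinearity discussed
    # above; behaves like the identity for large positive inputs and
    # suppresses large negative inputs toward zero.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```

Stacking such nonlinearities between the linear convolution and feedforward layers is what lets the network approximate nonlinear decision boundaries.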
In some embodiments, classifying the information items in the target image according to the target text feature includes: normalizing all target text features of the target image according to a layer normalization algorithm; and classifying each information item in the target image according to the normalization processing result.
In this step, the layer normalization algorithm normalizes the elements in each sample.
The method provided in this step uses a layer normalization algorithm to make the classification of each information item more accurate.
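A minimal numpy sketch of layer normalization, which, unlike batch normalization, normalizes the elements within each sample independently of the rest of the batch (names and shapes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis of each sample independently,
    # so the statistics do not depend on the other samples in the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
features = rng.normal(size=(3, 64))   # 3 samples, 64 features each
normalized = layer_norm(features)
```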
In one embodiment, another image content classification method is provided, implemented by a multi-stage visual attention network. The multi-stage visual attention network comprises a sequence of four stages, each stage decreasing the spatial resolution and increasing the channel number; the four stages have feature map sizes H/4 x W/4, H/8 x W/8, H/16 x W/16 and H/32 x W/32, respectively, where H and W represent the height and width of the input image. The four stages are shown schematically in fig. 3, where FFN is a feedforward neural network, BN (Batch Normalization) is a batch normalization layer, GELU (Gaussian Error Linear Unit) is an activation function, Attention is an attention network, LCC is a large kernel decomposition convolution, and d is a dilation rate.
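The per-stage feature map sizes quoted above follow from cumulative downsampling; a small sketch (the per-stage strides of 4, 2, 2, 2 are an assumption consistent with the H/4 ... H/32 sequence):

```python
def stage_shapes(h, w, strides=(4, 2, 2, 2)):
    # Each stage downsamples the previous one, giving feature maps of
    # H/4 x W/4, H/8 x W/8, H/16 x W/16 and H/32 x W/32.
    shapes = []
    for s in strides:
        h, w = h // s, w // s
        shapes.append((h, w))
    return shapes

print(stage_shapes(224, 224))
```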
A large kernel decomposition convolution of size K x K is composed of a depth-wise convolution, a depth-wise dilated convolution of size (2d-1) x (2d-1), and a channel convolution (a 1x1 convolution), where K is the geometric dimension of the large kernel decomposition convolution. The depth-wise convolution makes full use of the local context information of the image, the depth-wise dilated convolution captures long-range dependencies in the image, and the channel convolution captures relations along the channel dimension of the image. The large kernel decomposition convolution thus absorbs the advantages of both convolution and the attention feature map, obtaining the local structure information, long-range dependencies and adaptability of the image. It can be expressed by the following formulas:
Attention = Conv 1x1 (DW-D-Conv(DW-Conv(F)))
Output = Attention ⊙ F
where Attention is the attention feature map, DW-Conv is the depth-wise convolution, DW-D-Conv is the depth-wise dilated convolution, F is the input feature map, Output is the output feature map, and ⊙ denotes the element-wise product.
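A minimal numpy sketch of the decomposition in the formulas above: a depth-wise convolution, a depth-wise dilated convolution, then a 1x1 channel convolution, with the resulting attention map gating the input element-wise. The kernel sizes, the plain-loop implementation, and all names are illustrative, not the patent's code:

```python
import numpy as np

def dw_conv(x, kernel, dilation=1):
    # Depth-wise 2D convolution with 'same' padding; x is (C, H, W) and
    # kernel is (C, kh, kw): one filter per channel, no channel mixing.
    c, h, w = x.shape
    kh, kw = kernel.shape[1:]
    ph, pw = dilation * (kh - 1) // 2, dilation * (kw - 1) // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += (kernel[:, i:i + 1, j:j + 1]
                    * xp[:, i * dilation:i * dilation + h,
                         j * dilation:j * dilation + w])
    return out

def large_kernel_attention(x, k_local, k_dilated, w_channel, d):
    # Attention = Conv_1x1(DW-D-Conv(DW-Conv(F))); Output = Attention ⊙ F
    a = dw_conv(x, k_local)                     # spatial local convolution
    a = dw_conv(a, k_dilated, dilation=d)       # spatial remote convolution
    a = np.einsum('oc,chw->ohw', w_channel, a)  # channel (1x1) convolution
    return a * x                                # element-wise gating of F

# Sanity check with identity kernels: the attention map equals the input,
# so the output is the element-wise square of the input.
c, d = 2, 3
k_local = np.zeros((c, 5, 5)); k_local[:, 2, 2] = 1.0
k_dilated = np.zeros((c, 2 * d - 1, 2 * d - 1)); k_dilated[:, d - 1, d - 1] = 1.0
x = np.arange(2 * 4 * 4, dtype=float).reshape(c, 4, 4)
out = large_kernel_attention(x, k_local, k_dilated, np.eye(c), d)
```

The element-wise product at the end is what makes this an attention mechanism rather than a plain convolution: the convolved map re-weights the input features adaptively.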
The image content classification method in this embodiment specifically includes the following: the input image is downsampled, with the downsampling rate controlled by the step size parameter; after downsampling, all layers under the same module keep the same output size; text features of the input image are extracted by several groups of batch normalization, activation functions, large kernel decomposition convolutions and feedforward networks; layer normalization is applied to the text features; and the content of the input image is classified according to the result of layer normalization.
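To make the overall flow concrete, here is a toy end-to-end sketch under stated assumptions: average pooling stands in for the attention and feedforward blocks described above, and all names, sizes and the linear classifier are hypothetical:

```python
import numpy as np

def avg_pool(x, s):
    # Stand-in for one strided stage: average each s x s patch of a (C, H, W) map.
    c, h, w = x.shape
    return (x[:, :h - h % s, :w - w % s]
            .reshape(c, h // s, s, w // s, s).mean(axis=(2, 4)))

def classify(x, w_cls, strides=(4, 2, 2, 2)):
    # Downsample stage by stage, layer-normalize the pooled feature
    # vector, then apply a linear classifier with a softmax.
    for s in strides:
        x = avg_pool(x, s)               # stand-in feature extraction
    feats = x.mean(axis=(1, 2))          # global average over space
    feats = (feats - feats.mean()) / (feats.std() + 1e-5)  # layer norm
    logits = w_cls @ feats
    p = np.exp(logits - logits.max())
    return p / p.sum()

rng = np.random.default_rng(2)
image = rng.normal(size=(3, 224, 224))            # hypothetical input image
probs = classify(image, rng.normal(size=(4, 3)))  # 4 information-item classes
```

The real network replaces the pooling stand-in with the batch-normalized attention and feedforward blocks, but the control flow (downsample, extract, layer-normalize, classify) is the same.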
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, an embodiment of the application also provides an image content classification apparatus for implementing the image content classification method described above. The implementation of the solution provided by the apparatus is similar to the implementation described in the method above, so for the specific limitations in the embodiments of the image content classification apparatus provided below, reference may be made to the limitations of the image content classification method above; they are not repeated here.
In one embodiment, as shown in fig. 4, there is provided an image content classification apparatus 400 including: an acquisition module 401, a determination module 402, an extraction module 403, and a classification module 404, wherein:
an acquisition module 401 is configured to acquire a target image.
A determining module 402, configured to determine an output size of the target image according to a preset step size parameter.
An extraction module 403, configured to extract a target text feature of each information item in the target image according to the output size, an attention network, and a feedforward neural network, where the attention network includes a spatial local convolution layer, a spatial remote convolution layer, and a channel convolution layer.
And the classification module 404 is used for classifying each information item in the target image according to the target text characteristics.
In some embodiments, the determining module 402 is further configured to: determining a sampling rate according to the preset step size parameter; downsampling the target image according to the sampling rate; and determining the output size according to the downsampling result.
In some embodiments, the extraction module 403 is further configured to: according to a batch normalization algorithm, respectively carrying out normalization processing on first input data input into the attention network and second input data input into the feedforward neural network, wherein the first input data and the second input data respectively represent text data of a target image input into the attention network and text data of a target image input into the feedforward neural network; and extracting target text characteristics of each part of content in the target image according to the normalization processing result, the output size, the attention network and the feedforward neural network.
In some embodiments, the extraction module 403 is further configured to: acquiring the input size of a target image; determining the spatial resolution of the target image according to the input size; determining the number of output channels of the target image according to the spatial resolution; and extracting target text characteristics of each part of content in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
In some embodiments, the extraction module 403 is further configured to: extracting initial text features of information items in a target image according to the output size, the attention network and the feedforward neural network; and processing the initial text feature according to a preset activation function to obtain the target text feature.
In some embodiments, classification module 404 is further to: normalizing all target text features of the target image according to a layer normalization algorithm; and classifying the contents of each part in the target image according to the normalization processing result.
The respective modules in the above-described image content classification apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory: the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by the processor to implement an image content classification method. The display unit of the computer device is used to form a visual picture and can be a display screen, a projection device or a virtual reality imaging device; the display screen can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen, a key, a track ball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of: acquiring a target image; determining the output size of the target image according to the preset step size parameter; extracting target text characteristics of each information item in a target image according to the output size, an attention network and a feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer; and classifying each information item in the target image according to the target text characteristics.
In one embodiment, determining the output size of the target image according to the preset step size parameter, which is implemented when the processor executes the computer program, includes: determining a sampling rate according to the preset step size parameter; downsampling the target image according to the sampling rate; and determining the output size according to the downsampling result.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the processor executes the computer program, comprises: respectively normalizing, according to a batch normalization algorithm, first input data input into the attention network and second input data input into the feedforward neural network, wherein the first input data and the second input data represent the text data of the target image input into the attention network and into the feedforward neural network, respectively; and extracting the target text features of each information item in the target image according to the normalization result, the output size, the attention network and the feedforward neural network.
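The batch normalization step can be sketched as follows; this minimal version standardizes each feature dimension across the batch to zero mean and unit variance, and omits the learnable scale and shift parameters of a full implementation:

```python
def batch_norm(batch, eps=1e-5):
    """Normalize each feature dimension across the batch (list of vectors)
    to zero mean and unit variance; eps guards against division by zero."""
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][d] = (batch[i][d] - mean) / (var + eps) ** 0.5
    return out
```

In the described embodiment this normalization would be applied once to the data entering the attention network and once to the data entering the feedforward neural network.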
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the processor executes the computer program, further comprises: acquiring the input size of the target image; determining the spatial resolution of the target image according to the input size; determining the number of output channels of the target image according to the spatial resolution; and extracting the target text features of each information item in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
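The mapping from input size to spatial resolution and output channel count is not fixed by the application; the sketch below assumes the common convention that the spatial resolution halves and the channel count doubles from stage to stage:

```python
def stage_shapes(input_size, base_channels=64, patch_size=4, min_resolution=7):
    """Per-stage (spatial resolution, output channels), assuming resolution
    halves and channels double at each stage -- a common convention, assumed
    here purely for illustration."""
    resolution = input_size // patch_size
    channels = base_channels
    stages = []
    while resolution >= min_resolution:
        stages.append((resolution, channels))
        resolution //= 2
        channels *= 2
    return stages

print(stage_shapes(224))  # [(56, 64), (28, 128), (14, 256), (7, 512)]
```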
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the processor executes the computer program, further comprises: extracting initial text features of each information item in the target image according to the output size, the attention network and the feedforward neural network; and processing the initial text features according to a preset activation function to obtain the target text features.
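The application does not name the preset activation function; GELU (tanh approximation) is a common choice in attention-based networks and is used below purely as an illustrative stand-in:

```python
import math

def gelu(x):
    """GELU activation, tanh approximation -- an assumed stand-in for the
    unspecified preset activation function."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu(0.0))  # 0.0
```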
In one embodiment, classifying each information item in the target image according to the target text features, as implemented when the processor executes the computer program, comprises: normalizing all target text features of the target image according to a layer normalization algorithm; and classifying each information item in the target image according to the normalization result.
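The layer normalization and classification steps can be sketched as a normalization pass over each feature vector followed by a linear head with an argmax readout; the weight matrix and the readout are illustrative assumptions:

```python
def layer_norm(features, eps=1e-5):
    """Normalize one feature vector across its dimensions (zero mean, unit variance)."""
    n = len(features)
    mean = sum(features) / n
    var = sum((v - mean) ** 2 for v in features) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in features]

def classify(features, class_weights):
    """Layer-normalize the target text features, score each class with a
    linear head, and return the index of the highest-scoring class."""
    normed = layer_norm(features)
    scores = [sum(w * v for w, v in zip(row, normed)) for row in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)

# With an identity weight matrix, the larger feature wins class index 1.
print(classify([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]]))  # 1
```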
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, implements the steps of: acquiring a target image; determining the output size of the target image according to a preset step size parameter; extracting target text features of each information item in the target image according to the output size, an attention network and a feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer; and classifying each information item in the target image according to the target text features.
In one embodiment, determining the output size of the target image according to the preset step size parameter, as implemented when the computer program is executed by the processor, comprises: determining a sampling rate according to the preset step size parameter; downsampling the target image according to the sampling rate; and determining the output size according to the downsampling result.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the computer program is executed by the processor, comprises: respectively normalizing, according to a batch normalization algorithm, first input data input into the attention network and second input data input into the feedforward neural network, wherein the first input data and the second input data represent the text data of the target image input into the attention network and into the feedforward neural network, respectively; and extracting the target text features of each information item in the target image according to the normalization result, the output size, the attention network and the feedforward neural network.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the computer program is executed by the processor, further comprises: acquiring the input size of the target image; determining the spatial resolution of the target image according to the input size; determining the number of output channels of the target image according to the spatial resolution; and extracting the target text features of each information item in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the computer program is executed by the processor, further comprises: extracting initial text features of each information item in the target image according to the output size, the attention network and the feedforward neural network; and processing the initial text features according to a preset activation function to obtain the target text features.
In one embodiment, classifying each information item in the target image according to the target text features, as implemented when the computer program is executed by the processor, comprises: normalizing all target text features of the target image according to a layer normalization algorithm; and classifying each information item in the target image according to the normalization result.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of: acquiring a target image; determining the output size of the target image according to a preset step size parameter; extracting target text features of each information item in the target image according to the output size, an attention network and a feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer; and classifying each information item in the target image according to the target text features.
In one embodiment, determining the output size of the target image according to the preset step size parameter, as implemented when the computer program is executed by the processor, comprises: determining a sampling rate according to the preset step size parameter; downsampling the target image according to the sampling rate; and determining the output size according to the downsampling result.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the computer program is executed by the processor, comprises: respectively normalizing, according to a batch normalization algorithm, first input data input into the attention network and second input data input into the feedforward neural network, wherein the first input data and the second input data represent the text data of the target image input into the attention network and into the feedforward neural network, respectively; and extracting the target text features of each information item in the target image according to the normalization result, the output size, the attention network and the feedforward neural network.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the computer program is executed by the processor, further comprises: acquiring the input size of the target image; determining the spatial resolution of the target image according to the input size; determining the number of output channels of the target image according to the spatial resolution; and extracting the target text features of each information item in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
In one embodiment, extracting the target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, as implemented when the computer program is executed by the processor, further comprises: extracting initial text features of each information item in the target image according to the output size, the attention network and the feedforward neural network; and processing the initial text features according to a preset activation function to obtain the target text features.
In one embodiment, classifying each information item in the target image according to the target text features, as implemented when the computer program is executed by the processor, comprises: normalizing all target text features of the target image according to a layer normalization algorithm; and classifying each information item in the target image according to the normalization result.
It should be noted that the user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
Those skilled in the art will appreciate that all or part of the processes of the above-described method embodiments may be implemented by a computer program stored on a non-volatile computer readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of technical features, it should be considered to fall within the scope of this specification.
The above examples represent only a few embodiments of the present application, which are described in relative detail but are not therefore to be construed as limiting the scope of the patent application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of classifying image content, the method comprising:
acquiring a target image;
determining an output size of the target image according to a preset step size parameter;
extracting target text features of each information item in the target image according to the output size, an attention network and a feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer;
and classifying each information item in the target image according to the target text features.
2. The method of claim 1, wherein determining the output size of the target image according to the preset step size parameter comprises:
determining a sampling rate according to the preset step size parameter;
downsampling the target image according to the sampling rate;
and determining the output size according to the downsampling result.
3. The method of claim 1, wherein the extracting target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network comprises:
respectively normalizing, according to a batch normalization algorithm, first input data input into the attention network and second input data input into the feedforward neural network, wherein the first input data and the second input data represent the text data of the target image input into the attention network and into the feedforward neural network, respectively;
and extracting the target text features of each information item in the target image according to the normalization result, the output size, the attention network and the feedforward neural network.
4. The method of claim 1, wherein the extracting target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network further comprises:
acquiring the input size of a target image;
determining the spatial resolution of the target image according to the input size;
determining the number of output channels of the target image according to the spatial resolution;
and extracting the target text features of each information item in the target image according to the number of output channels, the output size, the attention network and the feedforward neural network.
5. The method of claim 1, wherein the extracting target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network further comprises:
extracting initial text features of each information item in the target image according to the output size, the attention network and the feedforward neural network;
and processing the initial text features according to a preset activation function to obtain the target text features.
6. The method of claim 1, wherein the classifying each information item in the target image according to the target text features comprises:
normalizing all target text features of the target image according to a layer normalization algorithm;
and classifying each information item in the target image according to the normalization result.
7. An image content classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target image;
the determining module is used for determining the output size of the target image according to the preset step size parameter;
the extraction module is used for extracting target text features of each information item in the target image according to the output size, the attention network and the feedforward neural network, wherein the attention network comprises a spatial local convolution layer, a spatial remote convolution layer and a channel convolution layer;
and the classification module is used for classifying each information item in the target image according to the target text features.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202211726229.4A 2022-12-30 2022-12-30 Image content classification method, device, equipment, medium and product Pending CN116030301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211726229.4A CN116030301A (en) 2022-12-30 2022-12-30 Image content classification method, device, equipment, medium and product


Publications (1)

Publication Number Publication Date
CN116030301A true CN116030301A (en) 2023-04-28

Family

ID=86080701


Country Status (1)

Country Link
CN (1) CN116030301A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination