CN116612466B - Content identification method, device, equipment and medium based on artificial intelligence - Google Patents


Info

Publication number
CN116612466B
Authority
CN
China
Prior art keywords: image, text, content, feature representation, image quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310890786.8A
Other languages
Chinese (zh)
Other versions
CN116612466A (en)
Inventor
王翔翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310890786.8A
Publication of CN116612466A
Application granted
Publication of CN116612466B


Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06V: Image or video recognition or understanding
    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 10/467: Encoded features or binary features, e.g. local binary patterns [LBP]
    • G06V 10/50: Extraction of features by operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; projection analysis
    • G06V 10/765: Classification using rules for classification or partitioning the feature space
    • G06V 10/778: Active pattern-learning, e.g. online learning of image or video features
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82: Recognition or understanding using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

The application discloses a content identification method, device, equipment and medium based on artificial intelligence, relating to the field of computer technology. The method comprises the following steps: acquiring an image feature representation corresponding to a text image; performing feature enhancement on the image feature representation to obtain a corresponding encoded feature representation; acquiring an image quality score for the text image based on the image feature representation; and performing text content recognition on the encoded feature representation based on the image quality score to obtain a content recognition result corresponding to the text content. By introducing the image quality score, the participation weight of the language model is adaptively adjusted according to image sharpness, so that the text content recognition process does not depend entirely on the language model. This ensures that the result recognized from a sufficiently sharp image is consistent with the text content actually in the image, thereby improving the accuracy of text content recognition.

Description

Content identification method, device, equipment and medium based on artificial intelligence
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a content identification method, device, equipment and medium based on artificial intelligence.
Background
In the context of content recognition, text content recognition means that a text image containing text content is input, the text content in the image is recognized, and a text representation result corresponding to that content is output. For example, when a target image contains both a picture of a tree and the text content "Tree", performing text content recognition on that image outputs the English word "Tree".
In the related art, a text recognition model is usually trained in advance; a text image containing text content is input into the model, which outputs a text prediction result corresponding to that content. The text recognition model is typically implemented as a language model that predicts the text content in the text image from its surrounding context.
However, predicting the text content in the text image with such a text recognition model depends too heavily on the model itself: when the text content in the image is misspelled, the model still outputs the correctly spelled word, so the prediction no longer matches the image and the accuracy of text recognition is reduced.
Disclosure of Invention
The embodiments of the application provide a content identification method, device, equipment and medium based on artificial intelligence, which can improve the accuracy of text recognition. The technical solution is as follows.
In one aspect, an artificial-intelligence-based content recognition method is provided, the method comprising:
acquiring an image feature representation corresponding to a text image, wherein the text image comprises text content;
performing feature enhancement on the image feature representation to obtain an encoded feature representation corresponding to the image feature representation;
acquiring an image quality score corresponding to the text image based on the image feature representation, wherein the image quality score indicates the image sharpness of the text image;
and performing text content recognition on the encoded feature representation based on the image quality score to obtain a content recognition result corresponding to the text content, wherein the image quality score determines the participation weight of a pre-trained language model in recognizing the encoded feature representation, and the content recognition result represents the text content recognized in the text image.
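The quality-score-gated recognition step above can be sketched as follows. This is a minimal illustrative sketch, not the patent's disclosed implementation: it assumes the visual decoder and the pre-trained language model each emit a per-position probability distribution over characters, and uses a simple linear weighting in which the language model's participation weight shrinks as the image quality score rises. The function name `recognize` and the exact weighting formula are assumptions for illustration only.

```python
def recognize(vision_probs, lm_probs, quality_score):
    """Fuse per-position character distributions from a visual decoder
    and a language model. quality_score in [0, 1] indicates sharpness;
    the language model's participation weight is 1 - quality_score
    (an illustrative scheme, not the patent's exact formula)."""
    lm_weight = 1.0 - quality_score
    result = []
    for v_pos, l_pos in zip(vision_probs, lm_probs):
        fused = [quality_score * v + lm_weight * l
                 for v, l in zip(v_pos, l_pos)]
        result.append(fused.index(max(fused)))  # most likely character id
    return result

# Toy distributions over a 3-character alphabet at 2 positions:
# the visual decoder reads characters [0, 1]; the language model
# would "correct" them to [1, 2].
vision_probs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]
lm_probs = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

sharp_reading = recognize(vision_probs, lm_probs, quality_score=0.9)
blurry_reading = recognize(vision_probs, lm_probs, quality_score=0.1)
```

With a high quality score the fused result follows the visual evidence (preserving, for example, a genuinely misspelled word present in the image), while with a low quality score the language model's prior dominates, which matches the adaptive-weighting behaviour described above.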
In another aspect, an artificial-intelligence-based content recognition apparatus is provided, the apparatus comprising:
an acquisition module, configured to acquire an image feature representation corresponding to a text image, wherein the text image comprises text content;
an enhancement module, configured to perform feature enhancement on the image feature representation to obtain an encoded feature representation corresponding to the image feature representation;
the acquisition module being further configured to acquire an image quality score corresponding to the text image based on the image feature representation, wherein the image quality score indicates the image sharpness of the text image;
and a recognition module, configured to perform text content recognition on the encoded feature representation based on the image quality score to obtain a content recognition result corresponding to the text content, wherein the image quality score determines the participation weight of a pre-trained language model in recognizing the encoded feature representation, and the content recognition result represents the text content recognized in the text image.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the artificial-intelligence-based content recognition method of any of the above embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the artificial-intelligence-based content recognition method of any of the above embodiments of the present application.
In another aspect, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the storage medium and executes them, causing the computer device to perform the artificial-intelligence-based content recognition method of any of the above embodiments.
The technical solutions provided by the embodiments of the application yield at least the following beneficial effects:
An encoded feature representation is obtained by performing feature enhancement on the image feature representation extracted from the text image, and an image quality score for the text image is obtained from the same image feature representation. The quality score then adjusts the participation weight of the language model when text content recognition is performed on the encoded feature representation, producing a content recognition result corresponding to the text content. In other words, by introducing an image quality score, the participation weight of the language model is adaptively adjusted according to image sharpness, so that recognition does not depend entirely on the language model; for a sufficiently sharp image the recognition result remains consistent with the text content actually in the image, which improves the accuracy of text content recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flowchart of an artificial intelligence based content identification method provided in an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a method for artificial intelligence based content identification provided in another exemplary embodiment of the application;
FIG. 4 is a schematic diagram of a decoder decoding process provided by another exemplary embodiment of the present application;
FIG. 5 shows text content recognition results at different image quality scores provided by another exemplary embodiment of the present application;
FIG. 6 is a flowchart of a method for artificial intelligence based content identification in accordance with yet another exemplary embodiment of the present application;
FIG. 7 is a diagram of image quality score acquisition provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a second classifier training process provided by an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for artificial intelligence based content identification provided in accordance with yet another exemplary embodiment of the present application;
FIG. 10 is a block diagram of an artificial intelligence based content recognition device according to an exemplary embodiment of the present application;
FIG. 11 is a block diagram of an artificial intelligence based content recognition device according to another exemplary embodiment of the present application;
fig. 12 is a schematic diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial intelligence (AI): the theory, methods, techniques and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines capable of reacting in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technology. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, pre-trained models, operation/interaction systems, and mechatronics. Pre-trained models, also called large models or foundation models, can be fine-tuned and then widely applied to downstream tasks across all major directions of artificial intelligence. AI software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV): the science of how to make machines "see"; that is, using cameras and computers instead of human eyes to recognize and measure targets, and further performing graphics processing so that the result is an image better suited for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies the theory and technology needed to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important change to computer vision: pre-trained models in many vision fields can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping (SLAM), as well as common biometric techniques such as face recognition and fingerprint recognition.
Natural language processing (NLP): an important direction in computer science and artificial intelligence that studies theories and methods enabling effective communication between humans and computers in natural language, the language people use in daily life, and is therefore closely tied to linguistics. Pre-trained models, an important model-training technology at the intersection of computer science, mathematics and artificial intelligence, developed out of large language models (LLMs) in the NLP field; through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph technology.
Machine learning (ML): a multi-disciplinary field drawing on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of AI. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. Pre-trained models are the latest development in deep learning and integrate many of these techniques.
Autonomous driving technology enables a vehicle to drive itself without operation by a driver. It typically includes high-precision maps, environmental perception, computer vision, behavioral decision-making, path planning, and motion control. Development paths include single-vehicle intelligence, vehicle-road coordination, and networked cloud control. Autonomous driving has broad application prospects; current fields include logistics, public transportation, taxis and intelligent transportation, with further development expected.
With the research and progress of artificial intelligence technology, AI has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, digital twins, virtual humans, robots, artificial intelligence generated content (AIGC), conversational interaction, smart healthcare, smart customer service, and game AI. It is believed that, as technology develops, artificial intelligence will be applied in more fields and play an increasingly important role.
The solutions provided by the embodiments of the present application involve technologies such as artificial-intelligence-based content recognition, and are described in detail through the following embodiments.
First, an implementation environment corresponding to the present application will be described. Referring to fig. 1, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown, and as shown in fig. 1, the implementation environment includes a terminal 110 and a server 120, where the terminal 110 and the server 120 are connected through a communication network 130.
A target application program with a content recognition function is installed on the terminal 110, for example at least one of a search-engine application, a shopping application, a word-processor application, or a similar program type. While the terminal 110 runs the target application, upon receiving a selection operation on a text image it generates a text content recognition request and sends it to the server 120; the request asks the server to recognize the text content in the text image.
When the server 120 receives the text content recognition request, it first extracts the image feature representation corresponding to the text image, then performs feature enhancement on it to obtain the corresponding encoded feature representation, and obtains an image quality score for the text image from the image feature representation. The image quality score determines the participation weight of the language model in recognizing the encoded feature representation, so text content recognition is performed on the encoded feature representation according to the image quality score, finally yielding a content recognition result corresponding to the text content in the text image. The content recognition result is fed back to the terminal 110 for display.
The terminal 110 includes at least one of a smartphone, tablet computer, portable laptop, desktop computer, smart speaker, smart wearable device, smart voice-interaction device, smart home appliance, vehicle-mounted terminal or similar device, and the server 120 can be applied in scenarios in fields such as intelligent transportation, vehicle-mounted terminals, and the Internet of Things.
It should be noted that the above-mentioned communication network 130 may be implemented as a wired network or a wireless network, and the communication network 130 may be implemented as any one of a local area network, a metropolitan area network, or a wide area network, which is not limited in the embodiment of the present application.
It should be noted that the server 120 may be implemented as a cloud server. Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide-area or local-area network to realize the computation, storage, processing and sharing of data.
In some embodiments, the server 120 described above may also be implemented as a node in a blockchain system.
It should be noted that, the content identification method based on artificial intelligence provided by the embodiment of the present application may be implemented by the terminal 110 alone, or by the server 120 alone, or by the cooperation of the terminal 110 and the server 120.
When the terminal 110 and the server 120 cooperatively implement the scheme provided by the embodiment of the present application, the terminal 110 and the server 120 may be directly or indirectly connected through a wired or wireless communication manner, which is not limited in the embodiment of the present application.
In the embodiment of the present application, the artificial intelligence based content recognition method provided in the server 120 is implemented as a business service in the application service layer.
It should be noted that, before and while collecting user-related data, the present application may display a prompt interface or pop-up window, or output a voice prompt, informing the user that their data is about to be collected. The steps that obtain user-related data begin only after the user confirms the prompt; otherwise (i.e., when no confirmation is obtained), those steps end and no user data is obtained. In other words, all user data collected by the present application is collected with the user's consent and authorization, and the collection, use and processing of such data comply with the relevant laws, regulations and standards of the relevant region.
Following the terminology and implementation environment described above, the artificial-intelligence-based content recognition method of the embodiments of the present application is described below, taking execution by a server as an example. Referring to fig. 2, a flowchart of the method provided by an exemplary embodiment of the present application is shown; the method includes the following steps.
Step 210, obtaining an image feature representation corresponding to the text image.
Wherein the text image comprises text content.
Illustratively, a text image refers to an image that contains text content.
Optionally, the text content includes text of at least one natural language type, for example: image a contains the Chinese text content "flower", and image b contains the English text content "good".
In some embodiments, the text content in a text image is a text result presented in image form. That is, for an image a containing the text content "evayday", the content is essentially an image, so word-processing operations such as deleting or adding characters cannot be performed on it.
Optionally, the text image is represented in a picture format (e.g., with .jpg as the file suffix); alternatively, the text image is represented in Portable Document Format (PDF). This is not limited.
Optionally, the text image is obtained in at least one of the following ways:
1. the user selects a designated image from the local album as a text image;
2. the user downloads the image from a website after agreeing to the download authorization;
3. the user captures a screenshot of a designated image after granting image-processing authorization;
4. the user photographs the text content after granting shooting authorization, generating the text image.
It should be noted that the above-mentioned manner of acquiring the text image is merely an illustrative example, and the embodiment of the present application is not limited thereto.
Optionally, at least one text content is included in the text image.
In some embodiments, feature extraction is performed on the text image by a feature extraction module, so as to obtain an image feature representation corresponding to the text image.
Optionally, the image feature representation acquisition mode includes at least one of the following modes:
1. extracting a direction gradient histogram (Histogram of Oriented Gradient, HOG) corresponding to the text image, namely, carrying out color space standardization on the grey image after grey image of the text image, calculating gradients (including sizes and directions) of each pixel point in the text image to obtain contour information of the text image, and counting the gradient histogram of each pixel to finally obtain the direction gradient histogram corresponding to the text image;
2. Extracting a local binary pattern (Local Binary Pattern, LBP) representation corresponding to the text image, namely dividing the text image into a specified number (for example, 16×16) of small areas, comparing the gray values of eight adjacent pixels with the target pixel in each small area, and marking the position of the target pixel as 1 if the gray value of the adjacent pixel is larger than that of the target pixel, otherwise, marking the position of the target pixel as 0. In this way, all pixel points in the small area can be compared to generate multi-bit binary numbers, namely LBP values of the small area are obtained, then a histogram of the small area, namely the frequency of each digital occurrence is calculated, then normalization processing is carried out on the histogram, and finally the obtained statistical histogram of each small area is connected into a feature vector, namely LBP representation of the whole text image;
3. extracting the image feature representation corresponding to the text image with a convolutional neural network, that is, using a convolutional neural network (for example, a network structure such as ResNet-50, ResNet-50-CD5 or ResNet-101) as the feature extraction network, inputting the text image into the feature extraction network, and outputting the image feature representation corresponding to the text image.
It should be noted that the above-mentioned acquisition manner of the image feature representation is only an illustrative example, and the embodiment of the present application is not limited thereto.
Optionally, the image feature representation is expressed in the form of a feature map (Feature Map); alternatively, the image feature representation is expressed in the form of a feature vector; alternatively, the image feature representation is expressed in one-hot form, which is not limited herein.
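For concreteness, the LBP extraction in option 2 above can be sketched in plain NumPy. This is a minimal illustration only; the 8-neighbor ordering, the region grid size and the 256-bin histogram are assumptions the passage leaves open:

```python
import numpy as np

def lbp_value(region: np.ndarray, r: int, c: int) -> int:
    """Compare the 8 neighbors of pixel (r, c) with the pixel itself:
    neighbor > center -> bit 1, otherwise bit 0 (clockwise from top-left)."""
    center = region[r, c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    bits = 0
    for dr, dc in offsets:
        bits = (bits << 1) | int(region[r + dr, c + dc] > center)
    return bits

def lbp_histogram(gray: np.ndarray, cells: int = 4) -> np.ndarray:
    """Split the grayscale image into cells x cells regions, compute the LBP
    value of every interior pixel, and concatenate the normalized per-region
    histograms into one feature vector (the LBP representation)."""
    h, w = gray.shape
    ch, cw = h // cells, w // cells
    feats = []
    for i in range(cells):
        for j in range(cells):
            region = gray[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            vals = [lbp_value(region, r, c)
                    for r in range(1, region.shape[0] - 1)
                    for c in range(1, region.shape[1] - 1)]
            hist, _ = np.histogram(vals, bins=256, range=(0, 256))
            feats.append(hist / max(hist.sum(), 1))  # normalize each region
    return np.concatenate(feats)
```

Each region contributes a 256-bin histogram, so a 4×4 grid yields a 4096-dimensional feature vector for the whole image.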
And 220, carrying out feature enhancement on the image feature representation to obtain a coding feature representation corresponding to the image feature representation.
In some embodiments, feature enhancement refers to data enhancement of image feature representations by semantic information contained in a text image.
The semantic information of the text image is divided into visual layer information, object layer information and concept layer information. The visual layer information includes information such as the colors, textures and shapes corresponding to the text image, that is, the visual layer is usually low-level information. The object layer is an intermediate layer and generally includes attribute features of the text image, for example, the state of an object in the image at a certain moment. The concept layer is a high layer, and refers to what the image expresses that is closest to human understanding. For example, image A contains a beach, seawater and blue sky: the visual layer information is the image regions of the beach, the seawater and the blue sky; the object layer information is that image A contains the three objects beach, seawater and blue sky; and the concept layer is "beach", which is the image content expressed by image A.
In some embodiments, the encoded feature representation includes both a visual feature representation (feature representation corresponding to visual layer information) corresponding to the text image and a contextual feature representation (feature representation corresponding to object layer information and feature representation corresponding to conceptual layer information) corresponding to the text image.
Illustratively, the text image contains some image content with a large degree of deformation or that is unclear; based on this partial image content, feature enhancement is performed on the image feature representation to obtain the encoding feature representation corresponding to the image feature representation. For example: image A contains the text content "Green", where the image content corresponding to the letter "n" has low definition; because the linguistic information corresponding to image A is the English word "Green", data enhancement is performed on the image feature representation corresponding to the image content of the letter "n" in combination with the English word "Green", so as to obtain the encoding feature representation corresponding to the final image feature representation.
In some embodiments, when the image feature representation is represented in the form of a feature map, each pixel point in the text image corresponds to a feature value in the feature map, and feature enhancement is performed on part of the image content in the text image, that is, weight is added to the feature value corresponding to the pixel point of the image content, so that the weighted image feature representation is obtained and used as the coding feature representation.
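The pixel-wise weighting just described can be sketched as follows. This is a minimal illustration under assumptions: the multiplicative form of the weighting and the boost factor are not fixed by the embodiment:

```python
import numpy as np

def build_weight_mask(shape, unclear_positions, boost=2.0):
    """Weight mask over the feature map: positions covering unclear or
    deformed image content receive an extra weight (boost is an assumption)."""
    mask = np.ones(shape)
    for r, c in unclear_positions:
        mask[r, c] = boost
    return mask

def enhance_feature_map(feature_map, weight_mask):
    """Feature enhancement as described above: add weight to the feature
    values at the pixels of the targeted image content, yielding the
    weighted image feature representation used as the encoding feature."""
    return feature_map * weight_mask
```

The weighted map keeps untouched feature values identical and scales only the flagged positions.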
At step 230, an image quality score corresponding to the text image is obtained based on the image feature representation.
Wherein the image quality score is used to indicate the image sharpness of the text image.
In some embodiments, an image quality score corresponding to a text image obtained from the image feature representation is used to evaluate the image sharpness of the text image, with higher image quality scores representing higher sharpness of the text image and lower image quality scores representing lower sharpness of the text image.
In one possible implementation, a classifier is trained in advance, in which a plurality of candidate content categories are stored, where a content category is used to represent the category corresponding to image content (including text content) in an input image (including text images). For example: classifier A contains four candidate content categories, namely "cat", "dog", "kitten" and "puppy".
The image feature representation corresponding to the text image is input into the classifier, and the classification probabilities corresponding to the plurality of candidate content categories are output. For example: the text content in text image a is "kitten"; text image a is input into the classifier, and the output classification probability of text image a for "cat" is 30%, for "dog" is 5%, for "kitten" is 80%, and for "puppy" is 10%.
The average value of the classification probabilities corresponding to the plurality of candidate content categories is taken as the image quality score corresponding to the text image. For example: the image quality score finally corresponding to text image a is (30% + 5% + 80% + 10%) / 4 = 31.25%.
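The averaging step above is straightforward; a minimal sketch reproducing the worked example:

```python
def image_quality_score(class_probs):
    """Average of the classification probabilities over all candidate
    content categories, used as the image quality score (step 230)."""
    return sum(class_probs) / len(class_probs)

# The example from the passage: probabilities for cat/dog/kitten/puppy.
score = image_quality_score([0.30, 0.05, 0.80, 0.10])  # 0.3125
```

The intuition is that a sharp image produces confident, well-separated class probabilities, while a blurry one spreads mass thinly, lowering the average.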
In another possible case, a plurality of different candidate quality scores are preset, the different candidate quality scores correspond to different image definitions, the candidate quality scores matched with the image definition of the text image are determined by analyzing the image definition of the input text image, and the candidate quality scores are used as the image quality scores corresponding to the text image.
It should be noted that the above steps 220 and 230 may be performed simultaneously or sequentially, that is, the execution sequence of the steps 220 and 230 is not limited.
And 240, carrying out text content recognition on the coded characteristic representation based on the image quality score to obtain a content recognition result corresponding to the text content.
The image quality score is used for determining model participation weight in the text content recognition process of the pre-trained language model on the coding feature representation, and the content recognition result is used for representing text content obtained by recognition in the text image.
Illustratively, the language model refers to a neural network model that predicts the content of a text image according to the coding feature representation, so as to predict the text representation result corresponding to the text content in the text image, that is, the result output by the language model is the result predicted according to the context feature representation corresponding to the coding feature representation, and is not the result directly predicted according to the visual feature representation of the text image.
In some embodiments, after the image quality score corresponding to the text image is obtained, the model participation weight of the language model in the text content recognition process of the text image is adjusted according to the image quality score, so that the recognition degree by means of the language model in the text content recognition is different according to the image quality score of the input image.
Optionally, the language model includes at least one of model types such as a Long Short-Term Memory network (LSTM) and a Gated Recurrent Unit (GRU).
Illustratively, in the case of high image sharpness, text content recognition is performed on the text image only by visual feature representations among the encoded feature representations; and when the image definition is low, performing text content identification on the text image through the language model on the context feature representations in the coding feature representations.
Illustratively, the content recognition result refers to converting text content represented in an image form in a text image into text content in a text form, for example: the text content "flower" in the text image is converted into the chinese word "flower".
In summary, according to the artificial-intelligence-based content recognition method provided by the embodiments of the present application, on the basis of obtaining the image feature representation corresponding to the text image, feature enhancement is performed on the image feature representation to obtain the corresponding encoding feature representation, and the image quality score corresponding to the text image is obtained from the image feature representation. The model participation weight for text content recognition on the encoding feature representation is then adjusted according to the image quality score, and text content recognition is performed on the encoding feature representation to obtain the content recognition result corresponding to the text content. That is, by introducing the image quality score, the model participation weight of the language model is adaptively adjusted according to the image definition, so that the text content recognition process does not depend entirely on the language model, and the text representation result obtained by recognition when the image definition is high is guaranteed to be consistent with the text content in the text image, thereby improving the accuracy of text content recognition.
In an alternative embodiment, text content recognition is performed on the encoding feature representation by a decoder. Referring schematically to fig. 3, which shows a flowchart of an artificial-intelligence-based content recognition method according to an exemplary embodiment of the present application, step 240 includes step 241, and the method includes the following steps, as shown in fig. 3.
Step 241, inputting the image quality score and the encoding feature representation into a pre-trained decoder for text content recognition, and outputting the content recognition result corresponding to the text content.
The decoder is used for identifying text content of the coded characteristic representation through the language model and the image quality score.
Illustratively, text content recognition is performed on a text image by a pre-trained decoder in combination with image quality scores and coded feature representations.
In some embodiments, the text content includes n characters, the content recognition result includes content recognition results corresponding to the n characters, and n is a positive integer.
Illustratively, the text content comprises a plurality of characters, and the decoder recognizes the text content of the previous character to obtain a content recognition result corresponding to the previous character, so that the decoder recognizes the text content of the next character according to the content recognition result corresponding to the previous character, and recursive text content recognition is realized.
In the following, two different recursive text content recognition are described in detail.
First, the encoding feature representation, the content recognition result corresponding to the (i-1)-th character and the image quality score are input into the decoder, and the content recognition result corresponding to the i-th character is output, where 2 ≤ i ≤ n and i is an integer.
Illustratively, in the first case, starting from the second character of the plurality of characters corresponding to the text content, only the content recognition result corresponding to the i-1 th character needs to be input into the decoder in addition to the coding feature representation and the image quality score, so that the text content recognition is performed on the i-th character, and the content recognition result corresponding to the i-th character is output. Wherein, the content recognition result corresponding to the i-1 th character is also obtained by the decoder.
In some embodiments, the decoder includes a feature fusion module and a first classifier. The content recognition result corresponding to the (i-1)-th character is input into the language model, and the language feature representation corresponding to the (i-1)-th character is output; the language feature representation corresponding to the (i-1)-th character and the encoding feature representation are weighted and fused based on the image quality score by the feature fusion module, so as to obtain the (i-1)-th fusion feature representation; the (i-1)-th fusion feature representation is input into the first classifier, and the content recognition result corresponding to the i-th character is output.
Illustratively, in the process of inputting the coding feature representation, the content recognition result corresponding to the i-1 th character and the image quality score into the decoder to perform text content recognition, firstly inputting the content recognition result corresponding to the i-1 th character into a language model trained in advance in the decoder, and obtaining the language feature representation corresponding to the i-1 th character through the language model recognition, namely, the language feature representation corresponding to the i-1 th character is the output result of the language model corresponding to the i-1 th character.
After the language feature representation corresponding to the i-1 character is obtained, inputting the image quality score, the language feature representation corresponding to the i-1 character and the coding feature representation into a feature fusion module in a decoder, wherein the feature fusion module is used for carrying out feature fusion on the language feature representation corresponding to the i-1 character and the coding feature representation through the image quality score, so as to obtain a fusion feature representation corresponding to the i-1 character, and inputting the fusion feature representation corresponding to the i-1 character into a first classifier in the decoder for recognition to obtain a content recognition result corresponding to the i character.
For the feature fusion module in the decoder, the language feature representation corresponding to the (i-1)-th character and the encoding feature representation are weighted and fused according to the image quality score, so as to obtain the fusion feature representation corresponding to the (i-1)-th character; the specific fusion process may refer to Formula 1.
Formula 1:

g_{i-1} = (1 − s) · l_{i-1} + s · f

where g_{i-1} denotes the fusion feature representation corresponding to the (i-1)-th character, s denotes the image quality score (typically expressed in decimal form), l_{i-1} denotes the language feature representation corresponding to the (i-1)-th character, and f denotes the encoding feature representation.
As can be seen from Formula 1, when the image quality score is high, the weight of the language feature representation corresponding to the (i-1)-th character in the fusion feature representation corresponding to the (i-1)-th character is lower, so that in the content recognition result corresponding to the i-th character obtained through the first classifier, the feature participation degree of the language feature representation corresponding to the (i-1)-th character is lower, that is, the model participation degree of the language model in the text content recognition process is lower.
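A minimal sketch of the weighted fusion just described, assuming the fusion is a convex combination in which a high quality score shifts weight away from the language feature representation and toward the encoding feature representation (consistent with the behavior stated above):

```python
import numpy as np

def fuse(language_feat, encoding_feat, quality_score):
    """Weighted fusion: the higher the image quality score, the lower the
    weight of the language feature representation (i.e. the lower the
    language-model participation) and the higher the weight of the
    encoding feature representation."""
    s = float(quality_score)  # decimal form, assumed 0 <= s <= 1
    return (1.0 - s) * language_feat + s * encoding_feat
```

With a score of 0.75, the encoding feature dominates the fusion three to one; with a score near 0, the language model's prediction dominates.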
In this embodiment, the first classifier includes a plurality of candidate classification results, the fusion feature representation corresponding to the i-1 th character is input into the first classifier, and the prediction probabilities corresponding to the plurality of candidate classification results corresponding to the i-th character are output, where the candidate classification result with the highest prediction probability is used as the content recognition result corresponding to the i-th character.
In some embodiments, a start code feature corresponding to a start character is obtained; inputting the coding feature representation, the initial coding feature and the image quality score into a decoder, and outputting a content identification result corresponding to the initial character.
In this embodiment, for the first character in the text content, that is, the start character, there is no content recognition result corresponding to a previous character, so a custom start character is used; that is, a start encoding feature corresponding to the custom start character is recorded. The start encoding feature and the encoding feature representation are weighted and fused according to the image quality score to obtain the fusion feature representation corresponding to the start character; the fusion feature representation corresponding to the start character is input into the first classifier, and the content recognition result corresponding to the first character (that is, the start character) is output.
Referring to fig. 4, which shows a schematic diagram of a decoding process of a decoder according to an exemplary embodiment of the present application: taking the (i-1)-th character as an example, the content recognition result 401 corresponding to the (i-1)-th character is currently obtained; the content recognition result 401 is input into the language model 402, and the language feature representation 403 corresponding to the (i-1)-th character is output; the language feature representation 403, the image quality score 404 and the encoding feature representation 405 are input into the feature fusion module 410; the language feature representation 403 and the encoding feature representation 405 are weighted and fused through the image quality score 404 to obtain the fusion feature representation 406 corresponding to the (i-1)-th character; the fusion feature representation 406 is input into the first classifier 420, and the content recognition result 407 corresponding to the i-th character is output. The content recognition result 407 then serves as the input of the next decoding step.
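The per-character recursion in fig. 4 can be sketched end to end with stub components. The stub language model, stub classifier and toy vocabulary below are illustrative assumptions, not the trained modules of the embodiment:

```python
import numpy as np

VOCAB = ["<s>", "k", "i", "t", "e", "n", "</s>"]  # assumed toy vocabulary

def stub_language_model(char_id, dim=8):
    """Stand-in for the pre-trained language model: a deterministic
    embedding derived from the previous character's id."""
    rng = np.random.default_rng(char_id)
    return rng.standard_normal(dim)

def stub_classifier(fused, weights):
    """Stand-in for the first classifier: linear scores -> argmax category."""
    return int(np.argmax(weights @ fused))

def decode(encoding_feat, quality_score, weights, max_len=10):
    """Recursive decoding: the result for character i-1 feeds the language
    model, is fused with the encoding feature by the quality score, and the
    classifier emits character i; decoding starts from the start character."""
    out, prev = [], VOCAB.index("<s>")
    for _ in range(max_len):
        lang = stub_language_model(prev)
        fused = (1.0 - quality_score) * lang + quality_score * encoding_feat
        prev = stub_classifier(fused, weights)
        if VOCAB[prev] == "</s>":
            break
        out.append(VOCAB[prev])
    return "".join(out)
```

Because every component here is deterministic, repeated calls with the same inputs produce the same recognition result, mirroring the recursive loop in the figure.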
Second, the coding feature representation, the content recognition result corresponding to the first i-1 characters and the image quality score are input into a decoder, and the content recognition result corresponding to the ith character is output.
Illustratively, in the second case, starting from the second character of the plurality of characters corresponding to the text content, inputting the content recognition results corresponding to the first i-1 characters into the decoder in addition to the coding feature representation and the image quality score, so as to perform text content recognition on the ith character, and outputting the content recognition results corresponding to the ith character. Wherein, the content recognition result corresponding to the first i-1 characters is also obtained by the decoder recognition.
Illustratively, in the process of inputting the coding feature representation, the content recognition result corresponding to the first i-1 characters and the image quality score into the decoder to perform text content recognition, firstly inputting the content recognition results corresponding to the first i-1 characters into a language model trained in advance in the decoder, and obtaining the language feature representation corresponding to the first i-1 characters through the language model recognition, namely, the language feature representation corresponding to the first i-1 characters is the output result of the language model corresponding to the first i-1 characters.
After the language feature representation corresponding to the first i-1 characters is obtained, inputting the image quality score, the language feature representation corresponding to the first i-1 characters and the coding feature representation into a feature fusion module in a decoder, wherein the feature fusion module is used for carrying out feature fusion on the language feature representation corresponding to the first i-1 characters and the coding feature representation through the image quality score, so that fusion feature representation corresponding to the first i-1 characters is obtained, and the fusion feature representation corresponding to the first i-1 characters is input into a first classifier in the decoder and used for identifying and obtaining a content identification result corresponding to the ith character.
For the feature fusion module in the decoder, the language feature representations corresponding to the first i-1 characters and the encoding feature representation are weighted and fused according to the image quality score, so as to obtain the fusion feature representation corresponding to the first i-1 characters; the specific fusion process may refer to Formula 1.
In one possible case, the model participation degree of the language model is consistent with the image quality score as in Formula 1; in another possible case, the model participation degree of the language model is obtained based on the image quality score but is not identical to it. That is, in response to the image quality score reaching a first score threshold, the model participation degree of the language model in the decoder is set as a first participation weight; in response to the image quality score not reaching the first score threshold, the model participation degree of the language model in the decoder is set as a second participation weight, where the first participation weight is lower than the second participation weight.
In this embodiment, when the model participation degree of the language model is different from the image quality score, a first score threshold is preset, when the image quality score reaches the first score threshold, the model participation degree of the language model in the decoder is set as a first participation weight, and when the image quality score does not reach the first score threshold, the model participation degree of the language model in the decoder is set as a second participation weight. The first participation weight and the second participation weight are used for carrying out weighted fusion on the language characteristic representation and the coding characteristic representation corresponding to the characters.
In other realizable cases, a plurality of different score thresholds are set, with different score thresholds corresponding to different participation weights; the target score threshold matching the image quality score is determined, and the participation weight corresponding to the target score threshold is used as the target participation weight of the language model.
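The threshold-to-weight mapping described above can be sketched as follows. The specific threshold and weight values are assumptions chosen only to illustrate that higher quality maps to lower language-model participation:

```python
# Assumed (score_threshold, participation_weight) pairs, sorted by
# descending threshold: sharper image -> lower language-model participation.
THRESHOLDS = [(0.8, 0.1), (0.5, 0.4), (0.0, 0.9)]

def participation_weight(quality_score, thresholds=THRESHOLDS):
    """Return the participation weight of the first threshold the image
    quality score reaches (the 'target score threshold')."""
    for score_threshold, weight in thresholds:
        if quality_score >= score_threshold:
            return weight
    return thresholds[-1][1]  # fallback below every threshold
```

A two-threshold table reduces this to the first/second participation weight case described earlier; more entries give a finer-grained schedule.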
Referring to fig. 5, which shows text content recognition results under different image quality scores according to an exemplary embodiment of the present application: as shown in fig. 5, text image 510 is currently recognized to obtain text representation result 501, and text image 520 is recognized to obtain text representation result 502. As can be seen from fig. 5, since the image definition of text image 510 is higher than that of text image 520, text content recognition for text image 510 relies mainly on the visual feature representation corresponding to the text image. Therefore, when the text content in text image 510 itself contains a spelling error, the text representation result 501 obtained through the visual feature representation is also misspelled, but this is the desired recognition result in this case. Text content recognition for text image 520 relies mainly on the context feature representation corresponding to the text image, so that recognition errors in the case of low image definition can be avoided.
In summary, according to the artificial-intelligence-based content recognition method provided by the embodiments of the present application, on the basis of obtaining the image feature representation corresponding to the text image, feature enhancement is performed on the image feature representation to obtain the corresponding encoding feature representation, and the image quality score corresponding to the text image is obtained from the image feature representation. The model participation weight for text content recognition on the encoding feature representation is then adjusted according to the image quality score, and text content recognition is performed on the encoding feature representation to obtain the content recognition result corresponding to the text content. That is, by introducing the image quality score, the model participation weight of the language model is adaptively adjusted according to the image definition, so that the text content recognition process does not depend entirely on the language model, and the text representation result obtained by recognition when the image definition is high is guaranteed to be consistent with the text content in the text image, thereby improving the accuracy of text content recognition.
In the embodiment, the pre-trained decoder is arranged, so that the image quality score and the coding characteristic representation are input into the decoder to identify the text content, and the text content identification efficiency and accuracy can be improved.
In this embodiment, by setting the language model in the decoder to correspond to different participation weights according to different image quality scores, it is possible to improve accuracy of text content recognition while avoiding text recognition errors caused by excessive reliance on the language model.
In this embodiment, the encoding feature representation, the content recognition result corresponding to the previous character and the image quality score are input into the decoder, so that the content recognition result corresponding to the current character is obtained, the text content can be recognized word by word, and the accuracy of text content recognition is improved.
In this embodiment, the encoding feature representation, the content recognition results corresponding to the previous characters and the image quality score are input into the decoder, so that the content recognition results corresponding to the current characters are obtained, text content recognition can be performed on the current characters by combining the content recognition results corresponding to the recognized characters, and the accuracy of text content recognition is improved.
In this embodiment, the content recognition result corresponding to the previous character is input into the language model to obtain the language feature representation corresponding to the previous character; the language feature representation corresponding to the previous character and the encoding feature representation are then weighted and fused according to the image quality score, and the fusion feature representation obtained by the weighted fusion is input into the first classifier for prediction. In this way, the participation degree of the language model differs when predicting text content for images of different definition, which improves the adaptability of the language model to different application scenarios in text content recognition.
In this embodiment, the start encoding feature is set for the first character, that is, the start character, so that the content recognition result corresponding to the start character is obtained according to the start encoding feature, the encoding feature representation and the image quality score; this ensures that text content recognition is performed on every character in the text content, improving the accuracy of text content recognition.
In some embodiments, feature extraction is performed on a text image by a feature extraction module, feature enhancement is performed on an image feature representation by a self-attention module, and a corresponding image quality score of the text image is obtained by a second classifier, and referring to fig. 6, schematically, a flowchart of a content recognition method based on artificial intelligence according to an exemplary embodiment of the present application is shown, that is, step 210 includes step 211, step 220 includes step 221, step 230 includes step 231, and the method includes the following steps, as shown in fig. 6.
Step 211, inputting the text image into a feature extraction module for feature extraction, and outputting to obtain image feature representation corresponding to the text image.
Illustratively, inputting the text image with the image size of H×W into a feature extraction module for feature extraction, thereby obtaining an image feature representation corresponding to the text image, wherein the feature size of the image feature representation is H/8×W/8.
In this embodiment, a convolutional neural network is used as a feature extraction module to perform feature extraction on a text image, and the image feature representation obtained by output is the convolutional image feature.
Step 221, inputting the image characteristic representation into the self-attention module for characteristic enhancement, and outputting to obtain the coding characteristic representation.
Illustratively, an image feature representation of feature size H/8 XW/8 is input to the self-attention module and an encoded feature representation of feature size H/8 XW/8 is output.
And 231, inputting the image characteristic representation into a second classifier, and outputting the image quality score corresponding to the obtained text image.
In some embodiments, the second classifier includes a plurality of candidate content categories therein; inputting the image characteristic representation into a second classifier, and outputting classification probabilities respectively corresponding to the text image on a plurality of candidate content categories; and taking the probability average value of the classification probabilities corresponding to the text image on the plurality of candidate content categories as the image quality score corresponding to the text image.
In this embodiment, a second classifier is trained in advance, where a plurality of candidate content categories are stored in the second classifier, after the image feature representation is input into the second classifier, classification probabilities corresponding to the plurality of candidate content categories corresponding to the text image are output, so that an average value result corresponding to the plurality of classification probabilities is used as an image quality score corresponding to the text image.
Referring to fig. 7, which is a schematic diagram of image quality score acquisition provided by an exemplary embodiment of the present application: a text image 701 is obtained, the text image is input into a feature extraction module 702 for feature extraction, an image feature representation 703 is output, the image feature representation 703 is input into a second classifier 704, and an image quality score 705 corresponding to the text image is output. The second classifier 704 may also be a classifier stored in the image quality evaluation module.
Illustratively, the feature extraction module, the self-attention module, and the second classifier belong to a preset module in the encoder.
It should be noted that the first classifier and the second classifier belong to different modules, the first classifier belongs to a decoder, and the second classifier belongs to an encoder, so that the candidate content category stored in the first classifier may be the same as or different from the candidate content category stored in the second classifier.
Next, a training process of the second sample classifier will be described in detail.
In some embodiments, an image feature representation corresponding to a sample text image is obtained, wherein the sample text image includes sample text content and the sample text image is labeled with an image quality label; the image feature representation is input into a second sample classifier, and the image quality score corresponding to the sample text image is output; and the second sample classifier is trained based on the difference between the image quality score and the image quality label to obtain the second classifier.
Illustratively, a sample text image is acquired, the sample text image is input into a feature extraction module for feature extraction, and an image feature representation corresponding to the sample text image is output. The sample text image is marked with an image quality label, and the image quality label is used for indicating a true value of the image definition of the sample text image.
And inputting the image characteristic representation corresponding to the sample text image into a second sample classifier, and outputting to obtain the image quality score corresponding to the sample text image, so that a first loss function is determined according to the difference between the image quality score and the image quality label, and training the second sample classifier by using the first loss function to obtain a trained second classifier.
Wherein the first loss function is implemented as a Connectionist Temporal Classification (CTC) loss, a probability-based loss function.
Referring to fig. 8, a schematic diagram of a second classifier training process provided by an exemplary embodiment of the present application is shown in fig. 8, in which a sample text image 801 is currently obtained, the sample text image 801 is input to a feature extraction module for feature extraction, an image feature representation 802 corresponding to the sample text image 801 is output, the image feature representation is input to a second sample classifier 803, an image quality score corresponding to the sample text image 801 is output, a first loss function 804 is determined according to a difference between the image quality score and an image quality label marked on the sample text image 801, and the first loss function 804 is used to train the second sample classifier 803, so as to obtain a trained second classifier.
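The training step above can be sketched as follows. This is a deliberately simplified stand-in: the patent specifies a CTC loss, whereas this sketch swaps in a linear scorer trained with a squared-error loss on the score/label difference, which still illustrates "train on the difference between the image quality score and the image quality label". All names and dimensions are illustrative assumptions.

```python
import numpy as np

def train_quality_classifier(features, labels, lr=0.1, epochs=2000):
    """Simplified sketch of the second-sample-classifier training loop.

    features: (n, d) pooled image feature representations;
    labels:   (n,)   image quality labels.
    The patent trains with a CTC loss; here a squared-error loss on the
    score/label difference is used as a clearly swapped-in stand-in.
    """
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        scores = features @ w                       # predicted quality scores
        grad = features.T @ (scores - labels) / len(labels)
        w -= lr * grad                              # gradient step on the difference
    return w

rng = np.random.default_rng(1)
feats = rng.normal(size=(20, 3))
true_w = np.array([0.5, -0.2, 0.1])
labels = feats @ true_w                             # synthetic quality labels
w = train_quality_classifier(feats, labels)
print(np.round(w, 3))  # recovers approximately [ 0.5 -0.2  0.1]
```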
In some embodiments, the sample text content in the sample text image is also tagged with a text content tag; feature enhancement is performed on the image feature representation to obtain an encoded feature representation corresponding to the image feature representation; the encoded feature representation and the image quality score are input into a sample decoder, and a content prediction result corresponding to the sample text content is output; and the sample decoder is trained based on the difference between the content prediction result and the text content tag to obtain the decoder.
In this embodiment, in addition to training the second classifier in the encoder, the sample decoder is also trained through the sample text image. The sample text image includes sample text content, and the sample text content is further labeled with a text content tag. The image feature representation corresponding to the sample text image is input into the self-attention module for feature enhancement, and the encoded feature representation corresponding to the image feature representation is output. The encoded feature representation and the image quality score are then input into the sample decoder, and a content prediction result corresponding to the sample text content is output, so that a second loss function is determined according to the difference between the content prediction result and the text content tag, and the sample decoder is trained through the second loss function, thereby obtaining a trained decoder.
Wherein the second loss function employs a cross entropy loss function.
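A minimal sketch of this second loss function for one decoding step, assuming the text content tag reduces to a single target character index and `logits` stands in for the decoder's predicted character distribution:

```python
import numpy as np

def cross_entropy(logits, target_index):
    """Cross-entropy between the decoder's predicted character
    distribution (raw logits) and the text content tag, here taken as a
    single target character index for one decoding step."""
    z = logits - logits.max()                     # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target_index])

# The loss falls as the logit of the correct character grows.
assert cross_entropy(np.array([5.0, 0.0, 0.0]), 0) < cross_entropy(np.array([1.0, 0.0, 0.0]), 0)
```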
In summary, according to the content recognition method based on artificial intelligence provided by the embodiment of the application, on the basis of obtaining the image feature representation corresponding to the text image, the image feature representation is enhanced to obtain the corresponding encoded feature representation, and the image quality score corresponding to the text image is obtained from the image feature representation. The model participation weight of the language model in text content recognition on the encoded feature representation is then adjusted according to the image quality score, and text content recognition is performed on the encoded feature representation to obtain the content recognition result corresponding to the text content. That is, by introducing the image quality score, the model participation weight of the language model is adaptively adjusted according to the image definition, so that the text content recognition process does not depend entirely on the language model, and the recognition result obtained when the image definition is high is kept consistent with the text content in the text image, thereby improving the accuracy of text content recognition.
In this embodiment, by setting the corresponding structures of the encoder and the decoder and setting the feature extraction module, the self-attention module and the second classifier in the encoder, the text content recognition efficiency can be improved.
In this embodiment, the second classifier is trained through the CTC loss function. Since the CTC loss function is not affected by the language model, the repeated attention or attention jumps that the language model can produce during text content recognition are avoided, further improving the accuracy of text content recognition.
In this embodiment, the decoder is trained through the cross entropy loss function, so that the recognition accuracy can be improved, and the generalization capability of text content recognition is improved.
In some embodiments, an application of the artificial intelligence based content recognition method to an English dictation scenario is described as an example. Referring to fig. 9, a flowchart of an artificial intelligence based content recognition method according to an exemplary embodiment of the application is shown, and the method includes the following steps as shown in fig. 9.
A text image 901 including English dictation text content is obtained, where the text image 901 is an image generated by shooting the English dictation content with a specified shooting device under authorization; that is, the text content included in the text image 901 is English dictation content handwritten by a user.
The text image 901 is input into a feature extraction module 902 in the encoder 910, and an image feature representation corresponding to the text image 901 is extracted. The image feature representation is input into both a self-attention module 903 and a second classifier 904 in the encoder 910: the self-attention module 903 enhances the image feature representation to obtain an encoded feature representation, and the second classifier 904 outputs an image quality score corresponding to the text image 901. The encoded feature representation and the image quality score are input into a decoder 920 for text content recognition, and the English word finally recognized is "importat". Although the intended English word is "important", the misspelling "importat" is what is actually written in the text image 901; because the recognition faithfully reproduces the misspelled content rather than letting the language model correct it, the text content recognition result is still "recognition correct".
In summary, according to the content recognition method based on artificial intelligence provided by the embodiment of the application, the image feature representation is enhanced on the basis of obtaining the image feature representation corresponding to the text image to obtain the corresponding coding feature representation, and the image quality score corresponding to the text image is obtained according to the image feature representation, so that the model participation weight of the text content recognition for the coding feature representation is adjusted according to the image quality score, and further the text content recognition is performed for the coding feature representation to obtain the content recognition result corresponding to the text content, that is, the model participation weight of the language model is adaptively adjusted according to the image definition in a mode of adding the image quality score, so that the text content recognition process does not completely depend on the language model, and the text representation result obtained by recognition under the condition of higher image definition is ensured to be consistent with the text content in the text image, thereby improving the accuracy of text content recognition.
In the scheme of the application, the participation degree of the language model is determined by the image quality. The feature extraction module completes the preliminary mining of the image feature representation, and the second classifier uses the image features to score the quality of the image. The convolutional image features are passed through the self-attention module to further attenuate the interference caused by noise. The image quality score is input into the decoder, and the decoder adjusts the participation degree of the language model according to the image quality: for clear text images, the decoder relies more on visual features for decoding, while for blurred text images, the decoder uses the language model to assist in judging the text content in the image.
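The quality-gated participation can be sketched as a convex blend. The patent does not give an exact fusion formula, so this rule is an assumption: a score s in [0, 1] leans the fused feature toward the visual (encoded) feature when the image is sharp (s high) and toward the language-model feature when it is blurred (s low).

```python
import numpy as np

def fuse(visual, language, quality_score):
    """Hypothetical fusion rule for the feature fusion module: the image
    quality score s gates the language model's participation. A simple
    convex blend is assumed here; the patent only states that the weight
    adapts to image definition, not the exact formula.
    """
    s = float(np.clip(quality_score, 0.0, 1.0))
    return s * visual + (1.0 - s) * language

v = np.array([1.0, 0.0])   # visual (encoded) feature for a character
l = np.array([0.0, 1.0])   # language-model feature for the same character
print(fuse(v, l, 0.9))  # sharp image, mostly visual: [0.9 0.1]
print(fuse(v, l, 0.2))  # blurred image, mostly language: [0.2 0.8]
```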
FIG. 10 is a block diagram illustrating an artificial intelligence based content recognition device according to an exemplary embodiment of the present application, as shown in FIG. 10, the device including:
an obtaining module 1010, configured to obtain an image feature representation corresponding to a text image, where the text image includes text content;
the enhancement module 1020 is configured to perform feature enhancement on the image feature representation to obtain an encoded feature representation corresponding to the image feature representation;
The obtaining module 1010 is further configured to obtain an image quality score corresponding to the text image based on the image feature representation, where the image quality score is used to indicate image sharpness of the text image;
the identifying module 1030 is configured to identify text content of the encoded feature representation based on the image quality score, and obtain a content identification result corresponding to the text content, where the image quality score is used to determine a model participation weight in the text content identification process of the encoded feature representation by using a pre-trained language model, and the content identification result is used to represent text content identified in the text image.
In some embodiments, as shown in fig. 11, the identification module 1030 includes:
and an input unit 1031, configured to input the image quality score and the coding feature representation into a pre-trained decoder for performing the text content recognition, and output a content recognition result corresponding to the text content, where the decoder includes the language model, and the decoder is configured to perform text content recognition on the coding feature representation through the language model and the image quality score.
In some embodiments, the identification module 1030 further comprises:
a setting unit 1032 configured to set a model participation degree of the language model in the decoder as a first participation weight in response to the image quality score reaching a first score threshold;
the setting unit 1032 is further configured to set, in response to the image quality score not reaching the first score threshold, a model participation degree of the language model in the decoder to a second participation weight, where the first participation weight is lower than the second participation weight.
In some embodiments, the text content includes n characters, the content recognition result includes content recognition results corresponding to the n characters respectively, and n is a positive integer;
the input unit 1031 is further configured to input the encoded feature representation, the content identification result corresponding to the i-1 th character, and the image quality score to the decoder, and output the content identification result corresponding to the i th character, where i is greater than or equal to 2 and less than or equal to n, and i is an integer.
In some embodiments, the input unit 1031 is further configured to input the encoding feature representation, the content recognition result corresponding to the first i-1 characters, and the image quality score to the decoder, and output the content recognition result corresponding to the i-th character.
In some embodiments, the decoder includes a feature fusion module and a first classifier therein;
the input unit 1031 is further configured to input the content recognition result corresponding to the i-1 th character into the language model, and output a language feature representation corresponding to the i-1 th character; perform weighted fusion, through the feature fusion module, of the language feature representation corresponding to the i-1 th character and the encoded feature representation based on the image quality score to obtain an i-1 th fused feature representation; and input the i-1 th fused feature representation into the first classifier, and output a content identification result corresponding to the i th character.
In some embodiments, the n characters include a start character arranged at a start position;
the input unit 1031 is further configured to obtain a start code feature corresponding to the start character; inputting the coding feature representation, the initial coding feature and the image quality score into the decoder, and outputting a content identification result corresponding to the initial character.
In some embodiments, the obtaining module 1010 is further configured to input the text image into a feature extraction module to perform feature extraction, and output the image feature representation corresponding to the text image;
The enhancement module 1020 is further configured to input the image feature representation to the self-attention module for the feature enhancement, and output the image feature representation to obtain the encoded feature representation;
the obtaining module 1010 is further configured to input the image feature representation into a second classifier, and output an image quality score corresponding to the text image.
In some embodiments, the second classifier includes therein a plurality of candidate content categories;
the obtaining module 1010 is further configured to input the image feature representation into the second classifier, and output classification probabilities of the text image respectively corresponding to the plurality of candidate content categories; and take the probability average value of the classification probabilities respectively corresponding to the text image on the plurality of candidate content categories as the image quality score corresponding to the text image.
In some embodiments, the obtaining module 1010 is further configured to obtain an image feature representation corresponding to a sample text image, where the sample text image includes sample text content, and the sample text image is labeled with an image quality tag;
an input module 1040, configured to input the image feature representation into a second sample classifier, and output an image quality score corresponding to the sample text image;
A training module 1050, configured to train the second sample classifier based on the difference between the image quality score and the image quality label, to obtain the second classifier.
In some embodiments, the sample text content in the sample text image is also tagged with a text content tag;
the enhancement module 1020 is further configured to perform feature enhancement on the image feature representation to obtain an encoded feature representation corresponding to the image feature representation;
the input module 1040 is configured to input the coding feature representation and the image quality score into a sample decoder, and output a content prediction result corresponding to the sample text content;
the training module 1050 is configured to train the sample decoder based on the difference between the content prediction result and the text content label, to obtain the decoder.
In summary, according to the content recognition device based on artificial intelligence provided by the embodiment of the application, on the basis of obtaining the image feature representation corresponding to the text image, the image feature representation is enhanced to obtain the corresponding encoded feature representation, and the image quality score corresponding to the text image is obtained from the image feature representation. The model participation weight of the language model in text content recognition on the encoded feature representation is then adjusted according to the image quality score, and text content recognition is performed on the encoded feature representation to obtain the content recognition result corresponding to the text content. That is, by introducing the image quality score, the model participation weight of the language model is adaptively adjusted according to the image definition, so that the text content recognition process does not depend entirely on the language model, and the recognition result obtained when the image definition is high is kept consistent with the text content in the text image, thereby improving the accuracy of text content recognition.
It should be noted that: the content recognition device based on artificial intelligence provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the content recognition device based on artificial intelligence provided in the above embodiment and the content recognition method embodiment based on artificial intelligence belong to the same concept, and detailed implementation processes of the content recognition device based on artificial intelligence are detailed in the method embodiment, and are not described herein again.
Fig. 12 is a schematic diagram showing a structure of a server according to an exemplary embodiment of the present application. Specifically, the following is said:
the server 1200 includes a central processing unit (Central Processing Unit, CPU) 1201, a system Memory 1204 including a random access Memory (Random Access Memory, RAM) 1202 and a Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system Memory 1204 and the central processing unit 1201. The server 1200 also includes a mass storage device 1206 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The mass storage device 1206 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1206 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1206 may include a computer readable medium (not shown) such as a hard disk or compact disk read only memory (Compact Disc Read Only Memory, CD-ROM) drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read Only Memory, EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 1204 and mass storage device 1206 described above may be collectively referred to as memory.
According to various embodiments of the application, the server 1200 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the server 1200 may be connected to the network 1212 through a network interface unit 1211 coupled to the system bus 1205, or alternatively, the network interface unit 1211 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application also provide a computer device including a processor and a memory having at least one instruction, at least one program, code set, or instruction set stored therein, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by the processor to implement the artificial intelligence based content identification method provided by the above method embodiments.
Embodiments of the present application also provide a computer readable storage medium having at least one instruction, at least one program, a code set, or an instruction set stored thereon, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the artificial intelligence based content identification method provided by the above method embodiments.
Embodiments of the present application also provide a computer program product, or computer program, comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the artificial intelligence based content identification method according to any of the above embodiments.
Alternatively, the computer-readable storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), solid state disk (SSD, solid State Drives), or optical disk, etc. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (12)

1. A method of content identification based on artificial intelligence, the method comprising:
obtaining image characteristic representations corresponding to text images, wherein the text images comprise text contents, the text contents comprise n characters, the content recognition results comprise content recognition results respectively corresponding to the n characters, and n is a positive integer;
performing feature enhancement on the image feature representation to obtain a coding feature representation corresponding to the image feature representation;
acquiring an image quality score corresponding to the text image based on the image characteristic representation, wherein the image quality score is used for indicating the image definition of the text image;
inputting the coding feature representation, the content recognition result corresponding to the i-1 th character and the image quality score into a pre-trained decoder, and outputting the content recognition result corresponding to the i th character, wherein the decoder comprises a pre-trained language model, the image quality score is used for determining model participation weights of the language model in the text content recognition process on the coding feature representation, the decoder is used for carrying out text content recognition on the coding feature representation through the language model and the image quality score, the content recognition result is used for representing the text content obtained by recognition in the text image, i is more than or equal to 2 and less than or equal to n, and i is an integer.
2. The method according to claim 1, wherein the method further comprises:
setting a model participation degree of the language model in the decoder as a first participation weight in response to the image quality score reaching a first score threshold;
and setting the model participation degree of the language model in the decoder as a second participation weight in response to the image quality score not reaching the first score threshold, wherein the first participation weight is lower than the second participation weight.
3. The method according to claim 1, wherein the method further comprises:
and inputting the coding feature representation, the content identification result corresponding to the first i-1 characters and the image quality score into the decoder, and outputting to obtain the content identification result corresponding to the ith character.
4. The method of claim 1, wherein the decoder includes a feature fusion module and a first classifier;
inputting the coding feature representation, the content identification result corresponding to the i-1 th character and the image quality score into the decoder, and outputting the content identification result corresponding to the i th character, wherein the method comprises the following steps:
inputting the content recognition result corresponding to the i-1 th character into the language model, and outputting to obtain language characteristic representation corresponding to the i-1 th character;
performing, through the feature fusion module, weighted fusion of the language feature representation corresponding to the i-1 th character and the coding feature representation based on the image quality score to obtain an i-1 th fusion feature representation;
inputting the i-1 th fusion feature representation into the first classifier, and outputting to obtain a content identification result corresponding to the i th character.
5. The method of claim 1, wherein the n characters include a start character arranged at a start position;
the method further comprises the steps of:
acquiring initial coding features corresponding to the initial characters;
inputting the coding feature representation, the initial coding feature and the image quality score into the decoder, and outputting a content identification result corresponding to the initial character.
6. The method according to any one of claims 1 to 5, wherein the obtaining an image feature representation corresponding to the text image comprises:
inputting the text image into a feature extraction module for feature extraction, and outputting to obtain the image feature representation corresponding to the text image;
the step of carrying out feature enhancement on the image feature representation to obtain a coding feature representation corresponding to the image feature representation comprises the following steps:
Inputting the image characteristic representation into a self-attention module for the characteristic enhancement, and outputting to obtain the coding characteristic representation;
the obtaining the image quality score corresponding to the text image based on the image characteristic representation comprises the following steps:
and inputting the image characteristic representation into a second classifier, and outputting the image quality score corresponding to the text image.
7. The method of claim 6, wherein the second classifier includes a plurality of candidate content categories therein;
inputting the image characteristic representation into the second classifier, and outputting to obtain the image quality score corresponding to the text image, wherein the method comprises the following steps:
inputting the image characteristic representation into the second classifier, and outputting classification probabilities respectively corresponding to the text image on the plurality of candidate content categories;
and taking the probability average value of the classification probabilities respectively corresponding to the text image on the plurality of candidate content categories as the image quality score corresponding to the text image.
8. The method of claim 6, wherein before said inputting the image feature representation into the second classifier and outputting the image quality score corresponding to the text image, the method further comprises:
Obtaining image characteristic representations corresponding to sample text images, wherein the sample text images comprise sample text contents, and the sample text images are marked with image quality labels;
inputting the image characteristic representation into a second sample classifier, and outputting the image quality score corresponding to the sample text image;
training the second sample classifier based on the difference between the image quality score and the image quality label to obtain the second classifier.
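The training step of claim 8 can be sketched as follows — a hypothetical stand-in, not the patented procedure: a linear scorer is fitted by gradient descent so that its predicted quality score matches the annotated image quality label, using the squared difference as the loss. The features and labels are synthetic substitutes for sample text images.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_samples = 16, 64
feats = rng.normal(size=(n_samples, dim))        # image feature representations
true_w = rng.normal(size=dim)
labels = 1.0 / (1.0 + np.exp(-feats @ true_w))   # image quality labels in (0, 1)

w = np.zeros(dim)                                # "second sample classifier"
losses = []
for _ in range(300):
    pred = 1.0 / (1.0 + np.exp(-feats @ w))      # predicted quality score
    # Gradient of mean squared difference through the sigmoid.
    grad = feats.T @ ((pred - labels) * pred * (1 - pred)) / n_samples
    w -= 2.0 * grad
    losses.append(float(np.mean((pred - labels) ** 2)))

print(losses[0], losses[-1])
```

The loss shrinks as the scorer's output approaches the quality labels, which is the "difference between the image quality score and the image quality label" the claim trains on.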
9. The method of claim 8, wherein the sample text content in the sample text image is further annotated with a text content label;
before the image quality score and the coding feature representation are input into the decoder to recognize the text content and the content recognition result corresponding to the text content is output, the method further comprises:
performing feature enhancement on the image feature representation of the sample text image to obtain a coding feature representation corresponding to the image feature representation;
inputting the coding feature representation and the image quality score into a sample decoder, and outputting a content prediction result corresponding to the sample text content;
training the sample decoder based on the difference between the content prediction result and the text content label to obtain the decoder.
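The decoder-training signal in claim 9 can be sketched with the standard loss for this kind of difference, cross-entropy between the content prediction (a distribution over a character vocabulary) and the text content label. The decoder internals are reduced here to a single linear layer over synthetic coding features; everything below is a stand-in, not the patented architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim, n_chars = 10, 16, 32
coding_feats = rng.normal(size=(n_chars, dim))
char_labels = rng.integers(0, vocab, size=n_chars)   # text content labels

w = np.zeros((dim, vocab))                           # toy "sample decoder"

def cross_entropy(w):
    logits = coding_feats @ w
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(log_probs[np.arange(n_chars), char_labels].mean())

loss_before = cross_entropy(w)
for _ in range(200):
    logits = coding_feats @ w
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n_chars), char_labels] -= 1.0    # d(loss)/d(logits)
    w -= 0.5 * coding_feats.T @ probs / n_chars
loss_after = cross_entropy(w)
print(round(loss_before, 3), round(loss_after, 3))
```

With zero weights the prediction is uniform over the vocabulary, so the starting loss is exactly log(vocab); training pulls the prediction toward the labels and the loss down.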
10. An artificial-intelligence-based content recognition device, the device comprising:
an acquisition module, configured to acquire an image feature representation corresponding to a text image, wherein the text image comprises text content, the text content comprises n characters, the content recognition result comprises content recognition results respectively corresponding to the n characters, and n is a positive integer;
an enhancement module, configured to perform feature enhancement on the image feature representation to obtain a coding feature representation corresponding to the image feature representation;
the acquisition module being further configured to obtain, based on the image feature representation, an image quality score corresponding to the text image, the image quality score indicating the image sharpness of the text image;
a recognition module, configured to input the coding feature representation, the content recognition result corresponding to the (i-1)-th character and the image quality score into a pre-trained decoder, and output the content recognition result corresponding to the i-th character, wherein the decoder comprises a pre-trained language model, the image quality score determines the participation weight of the pre-trained language model in the text content recognition performed on the coding feature representation, the decoder performs text content recognition on the coding feature representation through the language model and the image quality score, the content recognition result represents the text content recognized in the text image, 2 ≤ i ≤ n, and i is an integer.
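One plausible reading of the quality-gated decoding in claim 10, sketched minimally: at each character step the decoder mixes a distribution driven by the visual coding features with the pre-trained language model's next-character distribution, and the image quality score sets the mix — a blurry image (low score) shifts weight toward the language model's prior. The distributions and the linear gate below are invented for illustration only.

```python
import numpy as np

def fuse(p_visual, p_lm, quality):
    """Mix visual and language-model predictions by image quality score."""
    p = quality * p_visual + (1.0 - quality) * p_lm
    return p / p.sum()

# Toy 3-character vocabulary for one decoding step.
p_visual = np.array([0.1, 0.7, 0.2])   # what the glyph looks like
p_lm = np.array([0.8, 0.1, 0.1])       # what the language context expects

sharp = fuse(p_visual, p_lm, quality=0.9)
blurry = fuse(p_visual, p_lm, quality=0.2)
print(sharp.argmax(), blurry.argmax())
```

On the sharp image the visual evidence wins (character 1); on the blurry image the same visual evidence is down-weighted and the language model's candidate (character 0) is selected instead — the behavior the "model participation weight" in the claim describes.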
11. A computer device, comprising a processor and a memory, wherein the memory stores at least one program that is loaded and executed by the processor to implement the artificial-intelligence-based content recognition method of any one of claims 1 to 9.
12. A computer-readable storage medium, wherein the storage medium stores at least one program that is loaded and executed by a processor to implement the artificial-intelligence-based content recognition method of any one of claims 1 to 9.
CN202310890786.8A 2023-07-20 2023-07-20 Content identification method, device, equipment and medium based on artificial intelligence Active CN116612466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310890786.8A CN116612466B (en) 2023-07-20 2023-07-20 Content identification method, device, equipment and medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310890786.8A CN116612466B (en) 2023-07-20 2023-07-20 Content identification method, device, equipment and medium based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN116612466A CN116612466A (en) 2023-08-18
CN116612466B true CN116612466B (en) 2023-09-29

Family

ID=87685752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310890786.8A Active CN116612466B (en) 2023-07-20 2023-07-20 Content identification method, device, equipment and medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN116612466B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858286A (en) * 2018-08-23 2020-03-03 杭州海康威视数字技术股份有限公司 Image processing method and device for target recognition
WO2020062191A1 (en) * 2018-09-29 2020-04-02 华为技术有限公司 Image processing method, apparatus and device
CN112801085A (en) * 2021-02-09 2021-05-14 沈阳麟龙科技股份有限公司 Method, device, medium and electronic equipment for recognizing characters in image
CN114067327A (en) * 2021-11-18 2022-02-18 北京有竹居网络技术有限公司 Text recognition method and device, readable medium and electronic equipment
CN115131789A (en) * 2021-03-26 2022-09-30 华为技术有限公司 Character recognition method, character recognition equipment and storage medium
CN116012859A (en) * 2023-02-15 2023-04-25 达而观信息科技(上海)有限公司 Text image rejection judgment method, device and equipment based on definition index
CN116110033A (en) * 2022-12-09 2023-05-12 以萨技术股份有限公司 License plate generation method and device, nonvolatile storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8712157B2 (en) * 2011-04-19 2014-04-29 Xerox Corporation Image quality assessment


Also Published As

Publication number Publication date
CN116612466A (en) 2023-08-18

Similar Documents

Publication Publication Date Title
US10740865B2 (en) Image processing apparatus and method using multi-channel feature map
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN114596566B (en) Text recognition method and related device
CN111680753A (en) Data labeling method and device, electronic equipment and storage medium
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN110457523B (en) Cover picture selection method, model training method, device and medium
CN114581710A (en) Image recognition method, device, equipment, readable storage medium and program product
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN117217368A (en) Training method, device, equipment, medium and program product of prediction model
CN116612466B (en) Content identification method, device, equipment and medium based on artificial intelligence
CN116189130A (en) Lane line segmentation method and device based on image annotation model
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN113610080A (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN112597925A Handwriting recognition/extraction and erasing method, handwriting erasing system and electronic equipment
CN111242114A (en) Character recognition method and device
CN117011616B (en) Image content auditing method and device, storage medium and electronic equipment
CN116050428B (en) Intention recognition method, device, equipment and storage medium
CN116824308B (en) Image segmentation model training method and related method, device, medium and equipment
CN114708467B (en) Bad scene identification method, system and equipment based on knowledge distillation
CN114332884B (en) Document element identification method, device, equipment and storage medium
CN117557871B (en) Three-dimensional model labeling method, device, equipment and storage medium
CN117877047A Chinese text recognition method based on Vision Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant