CN113610082A - Character recognition method and related equipment thereof - Google Patents

Character recognition method and related equipment thereof

Info

Publication number
CN113610082A
Authority
CN
China
Prior art keywords
coding
recognized
feature
images
coded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110925424.9A
Other languages
Chinese (zh)
Inventor
蔡悦
黄灿
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110925424.9A
Publication of CN113610082A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The application discloses a character recognition method and related equipment thereof. In the method, after a plurality of images to be recognized that include the same character information are obtained, a first coding processing is performed on each image to be recognized to obtain the coding feature of each image to be recognized; a second coding processing is then performed on the coding features of all the images to be recognized to obtain the coding feature of the text to be recognized. As a result, the coding feature of the text to be recognized accurately represents the character information carried by all the images to be recognized and, in turn, each character in the text to be recognized, so the character recognition result of the text to be recognized determined based on this coding feature is more accurate, which helps improve the character recognition accuracy of multi-frame text line recognition.

Description

Character recognition method and related equipment thereof
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a character recognition method and a related device.
Background
With the development of character recognition technology, its range of application has become wider and wider. Character recognition technology is used to recognize the characters appearing in an image.
However, some character recognition technologies (e.g., Optical Character Recognition (OCR)) have shortcomings that lead to low recognition accuracy in certain application scenarios (e.g., multi-frame text line recognition). Here, "multi-frame text line recognition" refers to recognizing the same text line as it appears in a plurality of images (especially a plurality of consecutive video frames in one video).
Disclosure of Invention
In order to solve the technical problem, the application provides a character recognition method and related equipment thereof, which can improve the character recognition accuracy of multi-frame text line recognition.
In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
the embodiment of the application provides a character recognition method, which comprises the following steps:
acquiring a plurality of images to be identified; wherein the plurality of images to be recognized include the same character information;
respectively carrying out first coding processing on each image to be recognized to obtain the coding characteristics of each image to be recognized;
performing second coding processing on the coding features of the multiple images to be recognized to obtain the coding features of the text to be recognized, so that the coding features corresponding to the text to be recognized are used for representing the character information carried by the multiple images to be recognized;
and decoding the coding characteristics of the text to be recognized to obtain a character recognition result of the text to be recognized.
In a possible implementation manner, the process of determining the coding features of the text to be recognized includes:
comparing the number of the images to be identified with the number of the coding layers to be used to obtain a comparison result;
determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the plurality of images to be recognized;
and performing third coding processing on the features to be coded of the number of the coding layers to be used by using the coding layers to be used of the number of the coding layers to be used to obtain the coding features of the text to be recognized.
In a possible implementation manner, the number of the images to be recognized is N;
the determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the plurality of images to be recognized includes:
if the comparison result shows that the number of the images to be recognized is equal to the number of the coding layers to be used, determining the coding feature of the n-th image to be recognized as the n-th feature to be coded; wherein n is a positive integer and n is less than or equal to N.
In a possible implementation manner, the determining, according to the comparison result and the coding features of the plurality of images to be recognized, the features to be coded of the number of coding layers to be used includes:
if the comparison result shows that the number of the images to be identified is larger than the number of the coding layers to be used, splicing the coding features of the images to be identified to obtain coding features to be segmented;
and carrying out segmentation processing on the coding features to be segmented according to the number of the coding layers to be used, to obtain the features to be coded of the number of the coding layers to be used.
In a possible implementation manner, the number of the images to be identified is N, and the number of the coding layers to be used is J;
the determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the plurality of images to be recognized includes:
if the comparison result shows that the number of the images to be identified is smaller than the number of the coding layers to be used, determining the coding feature of the ith image to be identified as the ith feature to be coded; wherein i is a positive integer, i is not more than N-1;
and determining the Nth to the J-th to-be-coded features by using the coding features of the Nth to-be-recognized image.
In a possible implementation manner, the number of the coding layers to be used is J;
the third encoding processing is performed on the features to be encoded of the number of the encoding layers to be used by using the encoding layers to be used of the number of the encoding layers to be used, so as to obtain the encoding features of the text to be recognized, and the third encoding processing includes:
utilizing the 1st coding layer to be used to encode the 1st feature to be coded, to obtain the coding processing result of the 1st feature to be coded;
encoding the j-th feature to be coded and the coding processing result of the (j-1)-th feature to be coded by using the j-th coding layer to be used, to obtain the coding processing result of the j-th feature to be coded; wherein the coding processing result of the (j-1)-th feature to be coded refers to the output result of the (j-1)-th coding layer to be used; j is a positive integer, and j is greater than or equal to 2 and less than or equal to J;
and determining the coding processing result of the J-th feature to be coded as the coding feature of the text to be recognized.
In a possible implementation, the 1st coding layer to be used comprises a self-attention module and a feedforward neural network module;
the encoding of the 1st feature to be coded by the 1st coding layer to be used, to obtain the coding processing result of the 1st feature to be coded, includes:
carrying out coding preprocessing on the 1st feature to be coded to obtain a preprocessing result of the 1st feature to be coded;
inputting the preprocessing result of the 1st feature to be coded into the self-attention module in the 1st coding layer to be used, to obtain the self-attention processing result of the 1st feature to be coded output by the self-attention module;
inputting the self-attention processing result of the 1st feature to be coded into the feedforward neural network module in the 1st coding layer to be used, to obtain the coding processing result of the 1st feature to be coded output by the feedforward neural network module.
In a possible implementation, the j-th coding layer to be used includes two self-attention modules and one feedforward neural network module;
the encoding of the j-th feature to be coded and the coding processing result of the (j-1)-th feature to be coded by the j-th coding layer to be used, to obtain the coding processing result of the j-th feature to be coded, includes the following steps (a sketch of such a layer follows this list):
carrying out coding preprocessing on the j-th feature to be coded to obtain a preprocessing result of the j-th feature to be coded;
inputting the preprocessing result of the j-th feature to be coded into the first self-attention module in the j-th coding layer to be used, to obtain the first self-attention processing result of the j-th feature to be coded output by the first self-attention module;
inputting the coding processing result of the (j-1)-th feature to be coded and the first self-attention processing result of the j-th feature to be coded into the second self-attention module in the j-th coding layer to be used, to obtain the second self-attention processing result of the j-th feature to be coded output by the second self-attention module;
and inputting the second self-attention processing result of the j-th feature to be coded into the feedforward neural network module in the j-th coding layer to be used, to obtain the coding processing result of the j-th feature to be coded output by the feedforward neural network module.
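The following is a minimal, non-authoritative sketch of such a j-th coding layer to be used, assuming PyTorch. The layer width, the head count, and the use of nn.MultiheadAttention for both attention modules are illustrative assumptions, and treating the second module as attention over the previous layer's output is only one plausible reading of the steps above.

    # Illustrative sketch only; sizes and module choices are assumptions.
    import torch
    import torch.nn as nn

    class JthCodingLayer(nn.Module):
        def __init__(self, d_model: int = 256, n_heads: int = 8):
            super().__init__()
            self.self_attn_1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.self_attn_2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, feat_j: torch.Tensor, prev_result: torch.Tensor) -> torch.Tensor:
            # First self-attention over the preprocessed j-th feature to be coded.
            attn1, _ = self.self_attn_1(feat_j, feat_j, feat_j)
            # Second attention combines the (j-1)-th layer's output with the first
            # result (modelled here as attention with queries from attn1 and
            # keys/values from the previous result).
            attn2, _ = self.self_attn_2(attn1, prev_result, prev_result)
            # The feedforward module produces the coding processing result of the
            # j-th feature to be coded.
            return self.ffn(attn2)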
In a possible implementation manner, the number of the images to be recognized is N;
the process for determining the coding features of the nth image to be recognized comprises the following steps:
performing feature extraction on the nth image to be recognized to obtain visual features of the nth image to be recognized; wherein N is a positive integer, N is not more than N, and N is a positive integer;
and performing fourth coding processing on the visual features of the nth image to be recognized to obtain the coding features of the nth image to be recognized.
In a possible implementation, the process of acquiring the plurality of images to be identified includes:
clustering a plurality of candidate images to obtain at least one candidate image set, so that all the candidate images in the candidate image set comprise the same character information;
and determining the plurality of images to be identified according to the image set to be identified in the at least one candidate image set.
In a possible implementation, the determining of the at least one candidate image includes:
performing text detection on a plurality of frames of video images in a video to be processed to obtain a text detection result of the plurality of frames of video images;
and respectively carrying out image cutting on the multi-frame video images according to the text detection results of the multi-frame video images to obtain the plurality of candidate images.
An embodiment of the present application further provides a character recognition apparatus, including:
the image acquisition unit is used for acquiring a plurality of images to be identified; wherein the plurality of images to be recognized include the same character information;
the first coding unit is used for respectively carrying out first coding processing on each image to be identified to obtain the coding characteristics of each image to be identified;
the second coding unit is used for carrying out second coding processing on the coding features of the images to be recognized to obtain the coding features of the text to be recognized, so that the coding features corresponding to the text to be recognized are used for representing the character information carried by the images to be recognized;
and the characteristic decoding unit is used for decoding the coding characteristics of the text to be recognized to obtain a character recognition result of the text to be recognized.
An embodiment of the present application further provides an apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing a computer program;
the processor is used for executing any implementation mode of the character recognition method provided by the embodiment of the application according to the computer program.
The embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute any implementation manner of the character recognition method provided in the embodiment of the present application.
The embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal device, the terminal device is enabled to execute any implementation manner of the character recognition method provided by the embodiment of the present application.
Compared with the prior art, the embodiment of the application has at least the following advantages:
According to the technical solutions provided by the embodiments of the application, after a plurality of images to be recognized that all include the same character information are obtained, a first coding processing is performed on each image to be recognized to obtain the coding feature of each image to be recognized; a second coding processing is then performed on the coding features of all the images to be recognized to obtain the coding feature of the text to be recognized. As a result, the coding feature of the text to be recognized accurately represents the character information carried by all the images to be recognized and, in turn, each character in the text to be recognized, so the character recognition result of the text to be recognized determined based on this coding feature is more accurate, which helps improve the character recognition accuracy of multi-frame text line recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of a character recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a multi-frame video image according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a text image according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a first coding network according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a second coding network according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating generation of a coding feature to be segmented according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a second encoding process according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a character recognition apparatus according to an embodiment of the present application.
Detailed Description
In research on character recognition technology, the inventor found that, for multi-frame text line recognition, after a plurality of images including the same text line are acquired, character recognition can be performed on each image by OCR to obtain a character recognition result for each image, and the character recognition results of all the images can then be combined according to a preset rule to obtain the character recognition result of the text line. However, because the text line may suffer from different defects in different images (e.g., occlusion, displacement, distortion, missing characters, etc.), the character recognition results of the individual images are inaccurate, so the character recognition result determined by combining them is also inaccurate, and the character recognition accuracy of multi-frame text line recognition is therefore low.
Based on the above findings, in order to solve the technical problems in the background art section, an embodiment of the present application provides a character recognition method, including: acquiring a plurality of images to be recognized, wherein the images to be recognized comprise the same character information; respectively carrying out first coding processing on each image to be recognized to obtain the coding characteristics of each image to be recognized; carrying out second coding processing on the coding features of the images to be recognized to obtain the coding features of the texts to be recognized; and decoding the coding characteristics of the text to be recognized to obtain a character recognition result of the text to be recognized.
Therefore, since the coding features of the text to be recognized are determined according to the coding features of all the images to be recognized, they can accurately represent the character information carried by all the images to be recognized and hence each character in the text to be recognized; the character recognition result of the text to be recognized determined based on these coding features is accordingly accurate, which improves the character recognition accuracy of multi-frame text line recognition.
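For orientation, the four steps summarized above can be sketched roughly as follows. This is a non-authoritative illustration assuming PyTorch; the names MultiFrameTextRecognizer, first_encoder, second_encoder and decoder are hypothetical stand-ins for the networks described later, not components defined by this application.

    # Illustrative sketch only; all class and attribute names are hypothetical.
    import torch
    import torch.nn as nn

    class MultiFrameTextRecognizer(nn.Module):
        def __init__(self, first_encoder: nn.Module, second_encoder: nn.Module, decoder: nn.Module):
            super().__init__()
            self.first_encoder = first_encoder    # per-image first coding processing
            self.second_encoder = second_encoder  # cross-image second coding processing
            self.decoder = decoder                # decoding into a character recognition result

        def forward(self, images: list) -> torch.Tensor:
            # The N images to be recognized carry the same character information.
            coding_features = [self.first_encoder(img.unsqueeze(0)) for img in images]
            text_feature = self.second_encoder(coding_features)  # coding feature of the text
            return self.decoder(text_feature)                    # character recognition result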
In addition, the embodiment of the present application does not limit the execution subject of the character recognition method, and for example, the character recognition method provided by the embodiment of the present application may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like. The server may be a stand-alone server, a cluster server, or a cloud server.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For convenience of understanding and explaining the technical solution of the present application, the character recognition method provided by the embodiment of the present application is described below by taking a multi-frame text line recognition process for N images to be recognized as an example.
Method embodiment
Referring to fig. 1, the figure is a flowchart of a character recognition method according to an embodiment of the present application.
The character recognition method provided by the embodiment of the application comprises the following steps of S1-S4:
s1: and acquiring N images to be identified. Wherein N is a positive integer.
The N images to be recognized are used for representing images needing multi-frame text line recognition, and carry the same character information.
Note that the "same character information" may take the following forms. In one case, all characters appearing in the n-th image to be recognized are completely the same as all characters appearing in any other of the N images to be recognized; wherein n is a positive integer, n is less than or equal to N, and N is a positive integer. In another case, all characters appearing in some of the N images to be recognized are completely the same, while dropped or missing characters may occur in the other images to be recognized, so that those other images include only most of the characters appearing in the former images. In yet another case, although the characters appearing in all of the N images to be recognized are identical, there are certain differences in their positions (for example, a displacement) or presentation effects (for example, different degrees of distortion, or different colors) across the different images to be recognized.
In addition, the embodiment of the present application does not limit the N images to be recognized, for example, the N images to be recognized may be the 1 st frame video image to the N th frame video image shown in fig. 2. For another example, the N images to be recognized may be text images corresponding to the 1 st frame of video image to text images corresponding to the nth frame of video image shown in fig. 3.
And the text image corresponding to the nth frame of video image is obtained by carrying out image segmentation on the nth frame of video image according to the text detection result of the nth frame of video image. N is a positive integer, N is not more than N, and N is a positive integer. It should be noted that the text detection result of the nth frame of video image may be implemented by using any existing text detection method, which is not specifically limited in this embodiment of the present application.
In addition, the embodiment of the present application does not limit the 1 st frame video image to the N th frame video image, for example, the 1 st frame video image to the N th frame video image may refer to consecutive N frames of video images in one video data (e.g., hereinafter, "video to be processed").
The text to be recognized is used to represent the character information appearing in the N images to be recognized. For example, if the N images to be recognized are the N images shown in fig. 2 or fig. 3, the text to be recognized may be "this is a line of text with the same content".
In addition, the present application does not limit the acquiring process of the N images to be identified (i.e., the implementation manner of S1), for example, in a possible implementation manner, the S1 may specifically include S11 to S13:
S11: a plurality of candidate images are acquired.
The candidate image refers to image data required to be used when N images to be identified are screened; also, the number of candidate images is not less than the number of images to be recognized (i.e., the number of candidate images is not less than N).
In addition, the embodiment of the present application also does not limit the candidate image, for example, the candidate image may be a video image of one frame in one video data, or may be a text image corresponding to a video image of one frame in one video data.
The embodiment of the present application does not limit the implementation of S11; for ease of understanding, the following description covers two cases.
In case 1, if the candidate image is a frame of video image, S11 may specifically include: after the video to be processed is acquired, extracting multiple frames of video images from the video to be processed to serve as the multiple candidate images. The video to be processed refers to video data needing multi-frame text line identification.
In some cases, after the video to be processed is acquired, multiple frames of video images in the video to be processed may be directly determined as multiple candidate images (for example, each frame of video image in the video to be processed may be determined as a candidate image), so that each candidate image is a video image in the video to be processed.
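A minimal sketch of case 1 is given below, assuming OpenCV is available; the sampling stride is an illustrative choice and not something this application prescribes.

    # Illustrative sketch: take frames of the video to be processed as candidate images.
    import cv2

    def extract_candidate_frames(video_path: str, stride: int = 1) -> list:
        frames = []
        cap = cv2.VideoCapture(video_path)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % stride == 0:  # keep every stride-th frame as a candidate image
                frames.append(frame)
            index += 1
        cap.release()
        return frames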
In case 2, if the candidate image is a text image corresponding to one frame of video image, S11 may specifically include S111-S112:
S111: And performing text detection on the multi-frame video image in the video to be processed to obtain a text detection result of the multi-frame video image.
The text detection result of one frame of video image is used for indicating the position of the text in the frame of video image.
In addition, the embodiment of the present application is not limited to the implementation of "text detection" in S111, and may be implemented by any existing or future text detection method.
Based on the related content of S111, after the to-be-processed video is obtained, text detection may be performed on the multiple frames of video images in the to-be-processed video to obtain a text detection result of the multiple frames of video images (for example, text detection is performed on each frame of video image in the to-be-processed video to obtain a text detection result of each frame of video image in the to-be-processed video), so that a text image corresponding to each frame of video image can be determined based on the text detection result of each frame of video image in the following process.
S112: and respectively carrying out image cutting on the multi-frame video images according to the text detection results of the multi-frame video images to obtain a plurality of candidate images.
In the embodiment of the application, if the "multi-frame video images" include T frames of video images, then after the text detection result of the t-th frame of video image is obtained, image segmentation may be performed on the t-th frame of video image according to its text detection result to obtain the text image corresponding to the t-th frame of video image, so that this text image accurately represents the character information carried by the t-th frame of video image; the text image corresponding to the t-th frame of video image is then determined as the t-th candidate image. The size of the text image corresponding to the t-th frame of video image is smaller than that of the t-th frame of video image, so the text image carries less non-character information than the frame does, and character recognition based on this text image is therefore more accurate. Here t is a positive integer, t is less than or equal to T, and T is a positive integer.
Based on the related contents of S111 to S112, in some cases, after the video to be processed is obtained, text detection may be performed on multiple frames of video images in the video to be processed first, so as to obtain a text detection result of the multiple frames of video images; respectively cutting out a text image corresponding to each frame of video image from each frame of video image according to the text detection result of each frame of video image; and finally, determining the text images corresponding to the multiple frames of video images as candidate images.
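A minimal sketch of S111-S112 follows, under the assumption that the text detection result of each frame is a single axis-aligned box (x, y, w, h); real detectors may return several boxes or polygons per frame, in which case one candidate image would be cut per detected region.

    # Illustrative sketch: cut a text image out of each frame per its detection result.
    def crop_text_images(frames: list, boxes: list) -> list:
        """frames[t] is an H x W x 3 image array; boxes[t] is (x, y, w, h) for frame t."""
        candidates = []
        for frame, (x, y, w, h) in zip(frames, boxes):
            text_image = frame[y:y + h, x:x + w]  # smaller than the frame, so it carries
            candidates.append(text_image)         # less non-character information
        return candidates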
Based on the above-mentioned related content of S11, in some application scenarios, multiple candidate images may be determined according to multiple frames of video images in one video data (e.g., a video to be processed), so that multiple images with the same text content can be subsequently screened from the multiple candidate images for performing multiple frames of text line recognition.
S12: and clustering the candidate images to obtain at least one candidate image set, so that all the candidate images in each candidate image set comprise the same character information.
Wherein the y-th candidate image set refers to the set of candidate images including the y-th text. Here y is a positive integer, y is less than or equal to Y, Y is a positive integer, and Y represents the number of candidate image sets.
In addition, the embodiment of the present application is not limited to the implementation of "clustering" in S12, and may be implemented by using any existing or future clustering method.
Based on the related content of S12, after the T candidate images are obtained, clustering may first be performed on the T candidate images, so that candidate images carrying the same character information are grouped into the same class and candidate images carrying different character information are grouped into different classes, thereby obtaining Y classes of candidate images, where all candidate images in the y-th class include the y-th text; the set of all candidate images in the y-th class is then determined as the y-th candidate image set. Wherein y is a positive integer and y is less than or equal to Y.
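A minimal sketch of S12 is given below, assuming each candidate image has already been mapped to a feature vector (for example, the visual feature from S21 below); DBSCAN and its parameters are illustrative assumptions, since this application does not fix a particular clustering method.

    # Illustrative sketch: group candidate images that carry the same character information.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_candidates(features: np.ndarray) -> dict:
        """features: (T, D) array, one row per candidate image."""
        labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
        clusters = {}
        for idx, label in enumerate(labels):
            clusters.setdefault(label, []).append(idx)  # indices of same-text candidates
        return clusters  # each value corresponds to one candidate image set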
S13: and determining N images to be identified according to the image set to be identified in the at least one candidate image set.
Wherein the image set to be identified is used to represent any one of the candidate image sets.
It can be seen that after the y-th candidate image set is obtained, each candidate image in the y-th candidate image set may be determined as an image to be recognized, so that the character recognition result corresponding to the y-th candidate image set (i.e., the character recognition result of the y-th text) can be determined subsequently by using the following S2-S4. Wherein y is a positive integer, y is less than or equal to Y, and Y is a positive integer.
Based on the related content of S1, for some application scenarios, after the to-be-processed video is acquired, N to-be-recognized images may be determined according to multiple frames of video images in the to-be-processed video (for example, the 1 st frame of video image to the nth frame of video image in fig. 2), so that the N to-be-recognized images can represent character information carried by the multiple frames of video images in the to-be-processed video (for example, "this is a line of text with the same content"), so that multiple frames of text line recognition can be performed on the N to-be-recognized images subsequently.
S2: and carrying out first coding processing on the nth image to be recognized to obtain the coding characteristics of the nth image to be recognized. Wherein N is a positive integer and is less than or equal to N.
Here, the "first encoding process" is used to perform an encoding process on image data (in particular, on character information carried by the image data).
In addition, the present embodiment is not limited to the "first encoding process", and for example, the "first encoding process" may be implemented by any method that can perform encoding processing on image data (in particular, on character information carried by the image data), existing or appearing in the future. For example, the "first encoding process" may be performed in any of embodiments S21 to S22.
The "coding feature of the nth image to be recognized" is used to indicate the character information carried by the nth image to be recognized.
In addition, the present embodiment does not limit the determination process of the "coding feature of the nth image to be recognized" (i.e., the implementation manner of S2), for example, in one possible implementation manner, S2 may specifically include S21 to S22:
S21: And performing feature extraction on the nth image to be recognized to obtain the visual feature of the nth image to be recognized.
The visual characteristics of the nth image to be recognized are used for representing the image information carried by the nth image to be recognized.
In addition, the embodiment of the present application is not limited to the implementation of "feature extraction" in S21, and may be implemented by any method that can perform feature extraction on image data, existing or appearing in the future.
For example, in one possible implementation, S21 may specifically include: and inputting the nth image to be recognized into a pre-constructed convolutional neural network to obtain the visual characteristics of the nth image to be recognized output by the convolutional neural network.
The convolutional neural network is used for performing visual feature extraction on input data of the convolutional neural network, and the embodiment of the present application is not limited to the convolutional neural network, and may be implemented by using any existing or future convolutional neural network (e.g., Deep residual network (ResNet)). In addition, the convolutional neural network can be constructed in advance according to the first sample image and the actual visual features of the first sample image.
Based on the above-mentioned related content of S21, after the nth image to be recognized is acquired, feature extraction may be performed on the nth image to be recognized to obtain the visual feature of the nth image to be recognized, so that the visual feature of the nth image to be recognized can accurately represent the image information carried by the nth image to be recognized.
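A minimal sketch of S21 using a ResNet backbone (ResNet being the example the text gives) follows; collapsing the height dimension into a left-to-right feature sequence is a common OCR convention and an assumption here, not a requirement of this application.

    # Illustrative sketch: extract a visual feature sequence from one image to be recognized.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class VisualFeatureExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            backbone = resnet18(weights=None)
            self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            # image: (B, 3, H, W) -> feature map (B, C, H', W')
            fmap = self.cnn(image)
            fmap = fmap.mean(dim=2)        # collapse height: (B, C, W')
            return fmap.permute(0, 2, 1)   # sequence of W' visual features, (B, W', C)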
S22: and performing fourth coding processing on the visual features of the nth image to be recognized to obtain the coding features of the nth image to be recognized.
Wherein, the "fourth encoding processing" is used for encoding processing for a visual feature of one image data; the present embodiment is not limited to the "fourth encoding process", and may be implemented by any encoding method, which is currently available or will come in the future, for example. As another example, any of the embodiments shown below in S221-S223 may be used.
In addition, in order to enable the "encoding characteristic of the nth image to be recognized" to more accurately represent the character information carried by the nth image to be recognized, the embodiment of the present application further provides a possible implementation manner of S22, which may specifically include S221 to S223:
S221: And carrying out position coding on the visual features of the nth image to be recognized to obtain the position features of the nth image to be recognized.
The position feature of the nth image to be recognized is used for representing the position information carried in the nth image to be recognized.
The embodiment of the present application does not limit the implementation of the "position coding" in S221; for example, it may be implemented by any existing or future position encoding method (for example, the position encoding (Positional Encoding) module in the Transformer model).
S222: and performing feature fusion on the position feature of the nth image to be recognized and the visual feature of the nth image to be recognized to obtain the fusion feature of the nth image to be recognized.
In this embodiment of the application, after the position feature of the nth image to be recognized is obtained, feature fusion (e.g., splicing or adding) may be performed on the position feature of the nth image to be recognized and the visual feature of the nth image to be recognized, so as to obtain a fusion feature of the nth image to be recognized, so that the fusion feature of the nth image to be recognized can more accurately represent character-related information (e.g., information such as each character and the arrangement order of each character) carried by the nth image to be recognized.
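A minimal sketch of S221-S222 follows, assuming the sinusoidal encoding of the Transformer's Positional Encoding module (one of the options the text mentions) and fusion by addition (the text also allows splicing); the feature dimension D is assumed to be even.

    # Illustrative sketch: position encoding plus feature fusion by addition.
    import math
    import torch

    def add_position_encoding(visual: torch.Tensor) -> torch.Tensor:
        """visual: (B, L, D) visual features; returns the fused feature, also (B, L, D)."""
        b, length, dim = visual.shape
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)        # (L, 1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / dim))                       # (D/2,)
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        # Feature fusion by addition of the position feature and the visual feature.
        return visual + pe.unsqueeze(0).to(visual.device)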
S223: and inputting the fusion characteristics of the nth image to be recognized into a pre-constructed coding network to obtain the coding characteristics of the nth image to be recognized output by the coding network.
The coding network is used for coding input data of the coding network. In addition, the embodiment of the present application is not limited to the encoding network, and for example, the present application may be implemented by using any existing encoding network or a future encoding network. As another example, the first encoding network shown in fig. 4 or the second encoding network shown in fig. 5 may be used for implementation.
For the first encoding network shown in FIG. 4, the first encoding network may include M1 coding layers, and each coding layer comprises a multi-head self-attention module (Multi-Head Self Attention Module), a feed-forward neural network (Feed Forward Module), and two summation-and-normalization modules (Add & Norm). In addition, in the first coding network, the input data of the 1st coding layer is the input data of the first coding network (for example, the fusion feature of the n-th image to be recognized), and the input data of the m1-th coding layer is the output data of the (m1-1)-th coding layer, where m1 is a positive integer, 2 ≤ m1 ≤ M1, and M1 is a positive integer (e.g., M1 = 6).
The embodiment of the present application does not limit the implementation of the first coding network; for example, the first coding network may be implemented by using the Encoder network in the Transformer model.
For the second encoding network shown in FIG. 5, the second encoding network includes M2 coding layers, and each coding layer comprises a multi-head self-attention module (MHA), two feed-forward neural networks (FFN), a convolution module (Convolution Module), and a normalization module (LayerNorm). In addition, in the second coding network, the input data of the 1st coding layer is the input data of the second coding network (for example, the fusion feature of the n-th image to be recognized), and the input data of the m2-th coding layer is the output data of the (m2-1)-th coding layer, where m2 is a positive integer, 2 ≤ m2 ≤ M2, and M2 is a positive integer (e.g., M2 = 7).
The embodiment of the present application does not limit the implementation of the second coding network; for example, it may be implemented by using a Conformer network. In addition, in order to save computation, the convolution module in the second coding network may be implemented by combining channel convolution (Pointwise Conv) and spatial convolution (Depthwise Conv).
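A minimal sketch of the first coding network of fig. 4 built from stock PyTorch Transformer encoder layers (multi-head self-attention, feed-forward, residual Add & Norm) is shown below; M1 = 6 follows the example above, while the model width and head count are illustrative assumptions.

    # Illustrative sketch: the per-image coding network as a stack of encoder layers.
    import torch.nn as nn

    def build_first_coding_network(d_model: int = 256, n_heads: int = 8,
                                   num_layers: int = 6) -> nn.Module:
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # Input: the fusion feature of the n-th image to be recognized, (B, L, d_model).
        # Output: the coding feature of the n-th image to be recognized, same shape.
        return nn.TransformerEncoder(layer, num_layers=num_layers)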
Based on the related content of S223, after the fusion feature of the nth image to be recognized is obtained, the fusion feature of the nth image to be recognized may be input into a pre-constructed coding network, so that the coding network performs coding processing on the fusion feature of the nth image to be recognized, and obtains and outputs the coding feature of the nth image to be recognized, so that the coding feature of the nth image to be recognized can accurately represent character information carried by the nth image to be recognized.
Based on the above-mentioned related content of S2, after the N images to be recognized are obtained, the encoding processing may be performed on each image to be recognized, so as to obtain the encoding features of the 1 st image to be recognized to the encoding features of the nth image to be recognized, so that the encoding features of the text to be recognized may be determined by referring to the encoding features of the N images to be recognized in the following.
S3: and carrying out second coding processing on the coding features of the N images to be recognized to obtain the coding features of the text to be recognized.
Wherein the "second encoding processing" is for performing encoding processing for encoding characteristics of the plurality of image data; the present embodiment is not limited to the "second encoding process", and may be implemented, for example, by using any existing or future encoding method. As another example, the method can be implemented by any of the embodiments shown in the following S31-S33.
The "encoding characteristic of the text to be recognized" is used to represent the character information carried by the "N images to be recognized".
In addition, in order to enable the "encoding characteristics of the text to be recognized" to more accurately represent the character information carried by the above "N images to be recognized", the embodiment of the present application further provides a possible implementation manner of S3, which may specifically include S31-S33:
S31: The number of images to be recognized (i.e., N) is compared with the number of coding layers to be used, resulting in a comparison result.
The number of the coding layers to be used is the number of the coding layers to be used when the second coding processing is performed on the coding features of the N images to be recognized; in addition, the embodiment of the present application does not limit "the number of coding layers to be used", and may specifically be J, for example. J is a positive integer.
The "comparison result" is used to describe the relative size relationship between the above-described "number of images to be recognized" and the above-described "number of encoding layers to be used" (that is, the relative size relationship between N and J).
S32: and determining the characteristics to be coded of the number of coding layers to be used according to the comparison result and the coding characteristics of the N images to be recognized.
The h-th feature to be encoded refers to data that needs to be input to the h-th encoding layer to be used. h is a positive integer, h is less than or equal to J, J is a positive integer, and J represents the number of the coding layers to be used.
The embodiment of the present application does not limit the implementation of S32; for ease of understanding, the following description covers three cases.
Case 1: when the above "comparison result" indicates that the number of images to be recognized is equal to the number of coding layers to be used (that is, N = J), S32 may specifically include: determining the coding feature of the n-th image to be recognized as the n-th feature to be coded; wherein n is a positive integer and n is less than or equal to N.
As can be seen, after the comparison result is obtained, if the comparison result indicates that the number of images to be recognized is equal to the number of coding layers to be used, the coding features of the N images to be recognized are set, in the chronological order of the N images to be recognized, as the input data of the J coding layers to be used. The setting process may specifically include: determining the coding feature of the 1st image to be recognized as the 1st feature to be coded, so that the 1st feature to be coded is subsequently input into the 1st coding layer to be used; determining the coding feature of the 2nd image to be recognized as the 2nd feature to be coded, so that the 2nd feature to be coded is subsequently input into the 2nd coding layer to be used; …… (and so on); and determining the coding feature of the N-th image to be recognized as the N-th feature to be coded, so that the N-th feature to be coded is subsequently input into the N-th coding layer to be used.
Case 2: when the above "comparison result" indicates that the number of images to be recognized is greater than the number of encoding layers to be used (i.e., N > J), S32 may specifically include steps 11 to 12:
step 11: and splicing the coding features of the N images to be identified to obtain the coding features to be segmented.
The coding features to be segmented are used to represent the splicing result of the coding features of the N images to be recognized; in addition, the embodiment of the present application does not limit the manner of obtaining the "coding features to be segmented", which may be obtained, for example, by the splicing manner shown in fig. 6.
Step 12: and carrying out segmentation processing on the coding features to be segmented according to the number of the coding layers to be used, to obtain the features to be coded of the number of the coding layers to be used.
The segmentation processing is used for carrying out segmentation processing on the coding features to be segmented; the present embodiment is not limited to the embodiment of the "segmentation process", and may be implemented, for example, by using any feature segmentation method that is currently available or will come into existence in the future. As another example, any of the embodiments shown in steps 121-122 below may be used.
In order to facilitate understanding of the above-described "division processing", the following description is made with reference to an example.
As an example, when the feature lengths of the features to be encoded are the same and the feature lengths of the encoding features of the images to be recognized are also the same, step 12 may specifically include steps 121 to 122:
step 121: and determining segmentation parameters according to the number of the images to be identified and the number of the coding layers to be used.
The "segmentation parameter" refers to parameter information that needs to be referred to when the coding feature to be segmented is segmented; furthermore, the embodiments of the present application do not limit "segmentation parameters," and may specifically include segmentation intervals, for example. Wherein, the "division interval" is used to indicate the distance between two adjacent division positions.
In addition, the embodiment of the present application does not limit the determination process of the "partition interval", and for example, the determination process may specifically include steps 21 to 23:
step 21: and (3) rounding the ratio of the number of the images to be recognized to the number of the coding layers to be used to obtain the multiple to be used (as shown in a formula (1)).
Multiple_use = [N / J]    (1)
In the formula, Multiple_use represents the multiple to be used; N represents the number of images to be recognized; J represents the number of coding layers to be used; and [N/J] denotes rounding N/J.
Step 22: the candidate interval set is determined according to the multiple to be used and the above-mentioned "feature length of the coding feature of the image to be recognized" (as shown in formula (2)).
A_alternate = {1×Len_figure, 2×Len_figure, ……, Multiple_use×Len_figure}    (2)
In the formula, A_alternate represents the candidate interval set; Multiple_use represents the multiple to be used; and Len_figure represents the above "feature length of the coding feature of the image to be recognized".
Step 23: a segmentation interval is selected from a set of candidate intervals.
In this embodiment, after the candidate interval set is obtained, a segmentation interval may be selected from it; for example, a candidate interval may be randomly selected from the candidate interval set and determined as the segmentation interval, or a candidate interval may be selected according to a preset interval screening condition.
Based on the related content of the above step 121, after the number of the images to be recognized and the number of the coding layers to be used are obtained, a segmentation parameter (for example, a segmentation interval) may be determined according to a ratio between the number of the images to be recognized and the number of the coding layers to be used, so that the subsequent segmentation process can be performed according to the segmentation parameter.
Step 122: and carrying out segmentation processing on the coding features to be segmented according to the segmentation parameters, to obtain the features to be coded of the number of the coding layers to be used.
In the embodiment of the application, after the segmentation parameters are obtained, the segmentation processing can be performed on the coding features to be segmented according to the segmentation parameters, so that a 1 st block segmentation unit to a J th block segmentation unit are obtained in sequence; then determining the 1 st block segmentation unit as the 1 st feature to be coded; determining a 2 nd block segmentation unit as a 2 nd feature to be coded; … … (and so on); and determining the J-th block division unit as the J-th feature to be coded.
Based on the relevant content of the above steps 11 to 12, after the "comparison result" is obtained, if it is determined that the "comparison result" indicates that the number of the images to be identified is greater than the number of the encoding layers to be used, in order to improve the subsequent encoding efficiency, splicing and cutting may be sequentially performed on the "encoding features of the N images to be identified" to obtain J features to be encoded, so that each feature to be encoded includes the encoding features of a plurality of images to be identified, and thus each subsequent encoding layer to be used can perform encoding processing on the encoding features of the plurality of images to be identified at the same time, which is favorable for improving the encoding efficiency.
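A minimal sketch of case 2 (steps 11-12 together with equations (1)-(2)) follows, assuming PyTorch tensors of equal feature length per image; letting the last block absorb the remainder so that exactly J features result is an assumption, since the text only states that J blocks are obtained.

    # Illustrative sketch: splice the N coding features, then cut into J features to be coded.
    import torch

    def split_for_coding_layers(coding_feats: list, num_layers: int) -> list:
        """coding_feats: N tensors of shape (L, D) with equal L; returns J tensors."""
        n, seg_len = len(coding_feats), coding_feats[0].shape[0]
        to_split = torch.cat(coding_feats, dim=0)                # (N * L, D), as in fig. 6
        multiple_use = n // num_layers                            # equation (1), floor rounding
        candidates = [k * seg_len for k in range(1, multiple_use + 1)]  # equation (2)
        interval = candidates[-1]                                 # one admissible interval choice
        blocks = [to_split[j * interval:(j + 1) * interval] for j in range(num_layers - 1)]
        blocks.append(to_split[(num_layers - 1) * interval:])     # J-th block takes the remainder
        return blocks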
Case 3: when the above "comparison result" indicates that the number of images to be recognized is smaller than the number of encoding layers to be used (i.e., N < J), S32 may specifically include steps 31 to 32:
step 31: determining the coding feature of the ith image to be identified as the ith feature to be coded; wherein i is a positive integer, and i is not more than N-1.
Step 32: determining the f-th feature to be coded by using the coding feature of the N-th image to be recognized; wherein f is a positive integer, and N is not less than f and not more than J.
Based on the relevant contents of the above steps 31 to 32, after the "comparison result" is obtained, if it is determined that the "comparison result" indicates that the number of images to be recognized is less than the number of coding layers to be used, the J features to be coded may be determined by padding the "coding features of the N images to be recognized" with the "coding feature of the N-th image to be recognized"; and the determining process may specifically include: determining the coding feature of the 1st image to be recognized as the 1st feature to be coded; determining the coding feature of the 2nd image to be recognized as the 2nd feature to be coded; …… (and so on); determining the coding feature of the N-th image to be recognized as the N-th feature to be coded; moreover, in order to make up for the shortfall in the "coding features of the images to be recognized" (i.e., N < J), the coding feature of the N-th image to be recognized may also be determined as the (N+1)-th feature to be coded; the coding feature of the N-th image to be recognized may be determined as the (N+2)-th feature to be coded; …… (and so on); and the coding feature of the N-th image to be recognized may be determined as the J-th feature to be coded.
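Putting the three cases of S32 together, a minimal, non-authoritative sketch might look as follows; split_for_coding_layers is the case-2 helper sketched above.

    # Illustrative sketch: build the J features to be coded from the N coding features.
    def build_features_to_encode(coding_feats: list, num_layers: int) -> list:
        n = len(coding_feats)
        if n == num_layers:                      # case 1: one coding feature per layer
            return list(coding_feats)
        if n > num_layers:                       # case 2: splice, then segment into J blocks
            return split_for_coding_layers(coding_feats, num_layers)
        # Case 3: reuse the N-th coding feature for the N-th through J-th features to be coded.
        return list(coding_feats) + [coding_feats[-1]] * (num_layers - n)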
Based on the above-mentioned related content of S32, after the comparison result between the number of images to be identified and the number of coding layers to be used is obtained, the comparison result and the "coding features of the N images to be identified" may be referred to determine each feature to be coded, so that each feature to be coded may be input into each coding layer to be used in the following.
S33: and performing third coding processing on the features to be coded of the number of the coding layers to be used by using the coding layers to be used of the number of the coding layers to be used to obtain the coding features of the text to be recognized.
The coding layers to be used, of which there are as many as the number of coding layers to be used, are configured to perform the third coding processing on the corresponding features to be coded; furthermore, the embodiment of the present application does not limit these coding layers to be used, which may be implemented, for example, by the third coding network shown in fig. 7.
In addition, "the to-be-used coding layers of the number of the to-be-used coding layers" may be constructed in advance; in addition, the embodiment of the present application does not limit the construction process of the "to-be-used coding layers with the number of to-be-used coding layers", and may be implemented by any existing or future construction method.
The "third encoding process" refers to an encoding process implemented by "the number of to-be-used encoding layers.
In addition, the embodiment of the present application is not limited to the implementation of S33, for example, in a possible implementation, when "the number of coding layers to be used" is J, S33 may specifically include S331 to S333:
S331: And carrying out coding processing on the 1st feature to be coded by utilizing the 1st coding layer to be used, to obtain the coding processing result of the 1st feature to be coded.
The "1 st to-be-used coding layer" is used to perform coding processing on input data of the 1 st to-be-used coding layer.
In addition, the embodiment of the present application is not limited to the "1 st to-be-used coding layer", for example, the "1 st to-be-used coding layer" may include a self-attention module and a feedforward neural network module; and the input data of the feedforward neural network module comprises the output data of the self-attention module. The self-attention module is used for performing self-attention processing on input data of the self-attention module; furthermore, the embodiments of the present application are not limited to a "self-attention module" (e.g., a multi-head self-attention mechanism may be used for implementation).
In addition, to facilitate understanding of the working principle of the 1st coding layer to be used, the determination process of the coding processing result of the 1st feature to be coded is described below as an example.
As an example, when the 1st coding layer to be used includes a self-attention module and a feedforward neural network module, the determination process of the coding processing result of the 1st feature to be coded may specifically include steps 41 to 43:
step 41: and carrying out coding pretreatment on the 1 st feature to be coded to obtain a pretreatment result of the 1 st feature to be coded.
The encoding preprocessing refers to presetting at least one encoding processing process; furthermore, the embodiment of the present application does not limit "encoding preprocessing", and for example, it may include a feature encoding process and a position encoding process.
The feature encoding processing is used to encode the character information carried by a feature to be coded; the embodiment of the present application does not limit the feature encoding processing, and it may be implemented by any existing or future method capable of performing feature encoding (for example, the Input Embedding network in a Transformer model).
The position encoding processing is used to encode the position information carried by a feature to be coded; the embodiment of the present application does not limit the position encoding processing, and it may be implemented by any existing or future method capable of performing position encoding (for example, the Positional Encoding network in a Transformer model).
The above "preprocessing result of the 1 st feature to be coded" is used to indicate a result obtained by performing coding preprocessing on the 1 st feature to be coded.
In addition, the embodiment of the present application does not limit the implementation of step 41. For example, as shown in fig. 7, when the coding preprocessing includes feature encoding processing and position encoding processing, step 41 may specifically include steps 411 to 414:
step 411: feature coding processing is performed on the 1 st feature to be coded, so as to obtain a feature coding result of the 1 st feature to be coded (e.g., "feature coding result 1" in fig. 7).
Step 412: and fusing the 1 st feature to be coded and the feature coding result of the 1 st feature to be coded to obtain the feature fusion result of the 1 st feature to be coded.
Step 413: and carrying out position coding processing on the 1 st feature to be coded to obtain a position coding result of the 1 st feature to be coded (for example, "position coding result 1" in fig. 7).
Step 414: and fusing the feature fusion result of the 1 st feature to be coded and the position coding result of the 1 st feature to be coded to obtain the preprocessing result of the 1 st feature to be coded.
In the embodiment of the present application, the "fusion processing" in step 412 and step 414 is not limited, and it may be performed by any existing or future feature fusion method (for example, the fusion processing used in a Transformer model).
Based on the related content of step 41, for the 1st coding layer to be used, after the 1st feature to be coded is obtained, coding preprocessing may be performed on the 1st feature to be coded to obtain the preprocessing result of the 1st feature to be coded, so that this preprocessing result can more accurately represent the information (e.g., character information and position information) carried by the 1st feature to be coded; a sketch of this preprocessing is given after this paragraph.
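As an illustration only, steps 411 to 414 might be realized as in the following sketch; element-wise addition is assumed as the fusion processing, and the module name EncodePreprocess, the linear feature-encoding layer, and the learned position embedding are assumptions of this sketch rather than details fixed by the embodiment:

```python
import torch
import torch.nn as nn

class EncodePreprocess(nn.Module):
    """Sketch of the coding preprocessing (steps 411-414), under assumed module choices."""
    def __init__(self, dim, max_len=512):
        super().__init__()
        self.feature_encoding = nn.Linear(dim, dim)            # stands in for feature encoding processing
        self.position_embedding = nn.Embedding(max_len, dim)   # stands in for position encoding processing

    def forward(self, feature):                                 # feature: (seq_len, batch, dim)
        seq_len = feature.size(0)
        feature_code = self.feature_encoding(feature)                     # step 411: feature coding result
        fused = feature + feature_code                                    # step 412: additive fusion (assumed)
        positions = torch.arange(seq_len, device=feature.device)
        position_code = self.position_embedding(positions).unsqueeze(1)   # step 413: position coding result
        return fused + position_code                                      # step 414: preprocessing result
```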
Step 42: inputting the preprocessing result of the 1 st feature to be coded into a self-attention module in the 1 st coding layer to be used, and obtaining the self-attention processing result of the 1 st feature to be coded output by the self-attention module.
In this embodiment of the application, for the above 1st coding layer to be used, after the preprocessing result of the 1st feature to be coded is obtained, the self-attention module in the 1st coding layer to be used (for example, the MHA in the 1st gray solid-line frame, counted from left to right, in fig. 7) may perform self-attention processing (for example, multi-head self-attention processing) on the preprocessing result of the 1st feature to be coded, so as to obtain and output the self-attention processing result of the 1st feature to be coded.
Step 43: inputting the self-attention processing result of the 1 st feature to be coded into the feedforward neural network module in the 1 st coding layer to be used, and obtaining the coding processing result of the 1 st feature to be coded output by the feedforward neural network module.
In this embodiment of the application, for the "1 st to-be-used coding layer", after the self-attention processing result of the 1 st to-be-coded feature is obtained, the self-attention processing result of the 1 st to-be-coded feature may be processed by a feedforward neural network module (for example, an FFN in a 1 st gray solid-line frame counted from left to right in fig. 7) in the "1 st to-be-used coding layer", so as to obtain and output the coding processing result of the 1 st to-be-coded feature.
Based on the related content of S331, after the 1st feature to be coded is obtained, the 1st coding layer to be used may perform coding processing on the 1st feature to be coded to obtain and output the coding processing result of the 1st feature to be coded, so that the 2nd coding layer to be used can continue coding on the basis of the coding processing result of the 1st feature to be coded.
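As a non-authoritative sketch, the 1st coding layer to be used (one self-attention module followed by a feedforward module) could be written as below; the head count, hidden sizes, and the omission of residual connections and normalization are assumptions of this sketch, since the embodiment does not fix them:

```python
import torch.nn as nn

class FirstEncodingLayer(nn.Module):
    """Sketch of the 1st coding layer to be used: one self-attention module plus one FFN."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: preprocessed 1st feature, (seq_len, batch, dim)
        attn_out, _ = self.self_attn(x, x, x)   # step 42: Q = K = V = preprocessing result
        return self.ffn(attn_out)               # step 43: coding processing result of the 1st feature
```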
S332: and coding the j to-be-coded feature and the j-1 to-be-coded feature by using the j to-be-used coding layer to obtain a coding processing result of the j to-be-coded feature. The j-1 th encoding processing result of the feature to be encoded refers to the j-1 th output result of the encoding layer to be used. J is a positive integer, J is more than or equal to 2 and less than or equal to J.
The j-th encoding layer to be used is used for encoding the input data of the j-th encoding layer to be used; and the input data of the j-th coding layer to be used comprises the coding processing result of the j-th characteristic to be coded and the j-1-th characteristic to be coded (namely, the output result of the j-1-th coding layer to be used).
In addition, the embodiment of the present application does not limit the j-th coding layer to be used. For example, the j-th coding layer to be used may include two self-attention modules and one feedforward neural network module, where the input data of the first self-attention module includes the j-th feature to be coded; the input data of the second self-attention module includes the output data of the first self-attention module and the coding processing result of the (j-1)-th feature to be coded; and the input data of the feedforward neural network module includes the output data of the second self-attention module.
In addition, to facilitate understanding of the working principle of the j-th coding layer to be used, the determination process of the coding processing result of the j-th feature to be coded is described below as an example.
As an example, when the j-th coding layer to be used includes two self-attention modules and one feedforward neural network module, the determination process of the coding processing result of the j-th feature to be coded may specifically include steps 51 to 54:
Step 51: perform coding preprocessing on the j-th feature to be coded to obtain the preprocessing result of the j-th feature to be coded.
It should be noted that, the step 51 may be implemented by any embodiment of the above step 41, and only the "1 st feature to be encoded" in any embodiment of the above step 41 needs to be replaced by the "jth feature to be encoded".
Step 52: inputting the preprocessing result of the jth feature to be coded into a first self-attention module in the jth coding layer to be used, and obtaining a first self-attention processing result of the jth feature to be coded, which is output by the first self-attention module.
In this embodiment of the application, for the j-th coding layer to be used, after the preprocessing result of the j-th feature to be coded is obtained, the first self-attention module in the j-th coding layer to be used (similar to an MHA in the 2nd gray solid-line frame, counted from left to right, in fig. 7) may perform self-attention processing (e.g., multi-head self-attention processing) on the preprocessing result of the j-th feature to be coded, so as to obtain and output the first self-attention processing result of the j-th feature to be coded.
It should be noted that, if the above-mentioned "first self-attention module in the jth to-be-used encoding layer" is implemented by using a multi-head self-attention mechanism, the Q, K, V parameters involved in the multi-head self-attention mechanism may all be set as the preprocessing result of the jth to-be-encoded feature.
Step 53: inputting the coding processing result of the (j-1) th feature to be coded and the first self-attention processing result of the (j) th feature to be coded into a second self-attention module in the (j) th coding layer to be used, and obtaining a second self-attention processing result of the (j) th feature to be coded, which is output by the second self-attention module.
In this embodiment, for the j-th coding layer to be used, after the coding processing result of the (j-1)-th feature to be coded and the first self-attention processing result of the j-th feature to be coded are obtained, the second self-attention module in the j-th coding layer to be used (similar to an MHA in the 2nd gray solid-line frame, counted from left to right, in fig. 7) may perform self-attention processing (e.g., multi-head self-attention processing) on these two results, so as to obtain and output the second self-attention processing result of the j-th feature to be coded.
It should be noted that, as shown in fig. 7, if the second self-attention module in the j-th coding layer to be used is implemented with a multi-head self-attention mechanism, the Q parameter of the multi-head self-attention mechanism may be set to the first self-attention processing result of the j-th feature to be coded, and the K and V parameters may be set to the coding processing result of the (j-1)-th feature to be coded (i.e., the output result of the (j-1)-th coding layer to be used).
Step 54: input the second self-attention processing result of the j-th feature to be coded into the feedforward neural network module in the j-th coding layer to be used, so as to obtain the coding processing result of the j-th feature to be coded output by the feedforward neural network module.
In this embodiment of the application, for the "jth to-be-used coding layer", after the second self-attention processing result of the jth to-be-coded feature is obtained, the second self-attention processing result of the jth to-be-coded feature may be processed by a feedforward neural network module (similar to the FFN in the 2 nd gray solid-line frame from left to right in fig. 7) in the "jth to-be-used coding layer", so as to obtain and output the coding processing result of the jth to-be-coded feature.
Based on the related content of S332, after the j-th feature to be coded and the coding processing result of the (j-1)-th feature to be coded (i.e., the output result of the (j-1)-th coding layer to be used) are obtained, the j-th coding layer to be used may perform coding processing on them, so as to obtain and output the coding processing result of the j-th feature to be coded, so that the (j+1)-th coding layer to be used can continue coding on the basis of the coding processing result of the j-th feature to be coded. Here j is a positive integer, and 2 ≤ j ≤ J.
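Again only as a sketch, a j-th coding layer to be used (j ≥ 2), with two self-attention modules and one feedforward module, might look as follows; the Q/K/V wiring follows the description of fig. 7 above, while the head count, hidden sizes, and omitted residual connections are assumptions of this sketch:

```python
import torch.nn as nn

class LaterEncodingLayer(nn.Module):
    """Sketch of a j-th coding layer to be used (j >= 2): two self-attention modules + one FFN."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, num_heads)
        self.attn2 = nn.MultiheadAttention(dim, num_heads)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, feature_j, prev_layer_out):   # both: (seq_len, batch, dim)
        # step 52: Q = K = V = preprocessing result of the j-th feature to be coded
        a1, _ = self.attn1(feature_j, feature_j, feature_j)
        # step 53: Q = first self-attention result; K = V = output of the (j-1)-th layer
        a2, _ = self.attn2(a1, prev_layer_out, prev_layer_out)
        # step 54: feedforward module yields the coding processing result of the j-th feature
        return self.ffn(a2)
```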
S333: and determining the encoding processing result of the J-th feature to be encoded as the encoding feature of the text to be recognized.
In this embodiment of the present application, after the coding processing result of the jth feature to be coded output by the jth coding layer to be used is obtained, the coding processing result of the jth feature to be coded may be directly determined as the coding feature of the text to be recognized (as shown in fig. 7), so that the decoding processing may be continued based on the "coding feature of the text to be recognized" in the following.
Based on the above-mentioned related contents of S31 to S33, after the encoding features of the N images to be recognized are acquired, the J encoding layers to be used may be used to perform the second encoding process on the encoding features of the N images to be recognized, so as to obtain and output the encoding features of the text to be recognized, so that the "encoding features of the text to be recognized" can accurately represent the character information carried by the "N images to be recognized".
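To tie S331 to S333 together, the following sketch simply chains the layers; it reuses the illustrative FirstEncodingLayer and LaterEncodingLayer classes above, and the preprocess callable stands in for the coding preprocessing of steps 41 and 51 (all names are assumptions of this sketch):

```python
def second_encoding(features_to_encode, preprocess, first_layer, later_layers):
    """features_to_encode: the J features to be coded; later_layers: layers for j = 2 .. J."""
    out = first_layer(preprocess(features_to_encode[0]))       # S331: 1st layer on the 1st feature
    for j, layer in enumerate(later_layers, start=2):          # S332: j-th layer, j = 2 .. J
        out = layer(preprocess(features_to_encode[j - 1]), out)
    return out                                                  # S333: coding feature of the text to be recognized
```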
S4: and decoding the coding characteristics of the text to be recognized to obtain a character recognition result of the text to be recognized.
The character recognition result of the text to be recognized is used for representing the shared character information in the N images to be recognized.
In addition, the embodiment of the present application does not limit the implementation of the "decoding processing" in S4, and it may be implemented by, for example, the Decoder network in a Transformer model.
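For illustration only, a decoding step along these lines could be sketched with a standard Transformer decoder; the dimensions, vocabulary size, layer count, and the output projection are placeholder assumptions of this sketch, not values given by the embodiment:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 6000                      # illustrative sizes (assumptions)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=2)
output_proj = nn.Linear(d_model, vocab_size)         # maps decoder states to character scores

text_encoding = torch.randn(20, 1, d_model)          # stands in for the coding feature of the text
target_embed = torch.randn(10, 1, d_model)           # embeddings of previously decoded characters
char_logits = output_proj(decoder(target_embed, memory=text_encoding))  # per-character scores
```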
Based on the related contents of S1 to S4, it can be seen that, with the character recognition method provided in the embodiment of the present application, after a plurality of images to be recognized, each of which includes a text to be recognized, are acquired, first encoding processing is performed on each image to be recognized, so as to obtain an encoding feature of each image to be recognized; and then carrying out second coding processing on the coding features of all the images to be recognized to obtain the coding features of the text to be recognized, so that the coding features of the text to be recognized can accurately represent character information carried by all the images to be recognized, the coding features of the text to be recognized can more accurately represent each character in the text to be recognized, and further the character recognition result of the text to be recognized determined based on the coding features of the text to be recognized is more accurate, thereby being beneficial to improving the character recognition accuracy of multi-frame text line recognition.
Based on the character recognition method provided by the above method embodiment, the embodiment of the present application further provides a character recognition apparatus, which is explained below with reference to the accompanying drawings.
Device embodiment
Please refer to the above method embodiment for the technical details of the character recognition apparatus provided by the apparatus embodiment.
Referring to fig. 8, the figure is a schematic structural diagram of a character recognition apparatus according to an embodiment of the present application.
The character recognition apparatus 800 provided in the embodiment of the present application includes:
an image acquisition unit 801 for acquiring a plurality of images to be recognized; wherein the plurality of images to be recognized include the same character information;
a first encoding unit 802, configured to perform first encoding processing on each image to be identified respectively to obtain an encoding feature of each image to be identified;
a second encoding unit 803, configured to perform a second encoding process on the encoding features of the multiple images to be recognized, so as to obtain the encoding features of the text to be recognized, where the encoding features corresponding to the text to be recognized are used to represent character information carried by the multiple images to be recognized;
the feature decoding unit 804 is configured to decode the coding features of the text to be recognized to obtain a character recognition result of the text to be recognized.
In a possible implementation, the second encoding unit 803 includes:
the number comparison subunit is used for comparing the number of the images to be identified with the number of the coding layers to be used to obtain a comparison result;
the first determining subunit is used for determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the images to be identified;
and the first coding subunit is used for performing third coding processing on the features to be coded of the number of the coding layers to be used by using the coding layers to be used of the number of the coding layers to be used to obtain the coding features of the text to be recognized.
In a possible implementation manner, the number of the images to be recognized is N;
the first determining subunit is specifically configured to: if the comparison result shows that the number of the images to be recognized is equal to the number of the coding layers to be used, determine the coding feature of the n-th image to be recognized as the n-th feature to be coded; wherein n is a positive integer and n is less than or equal to N.
In a possible implementation manner, the first determining subunit is specifically configured to: if the comparison result shows that the number of the images to be recognized is greater than the number of the coding layers to be used, splice the coding features of the images to be recognized to obtain a coding feature to be segmented; and segment the coding feature to be segmented according to the number of the coding layers to be used, so as to obtain the features to be coded, whose number equals the number of the coding layers to be used.
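A minimal sketch of this N > J branch follows; concatenation along the sequence dimension and an even split via torch.chunk are assumptions of this sketch, since the embodiment does not fix the splitting rule:

```python
import torch

def split_features(image_features, num_layers_j):
    """image_features: N coding-feature tensors (N > num_layers_j), each of shape (seq_len, dim)."""
    to_segment = torch.cat(image_features, dim=0)              # splice into the coding feature to be segmented
    return list(torch.chunk(to_segment, num_layers_j, dim=0))  # J features to be coded (even split assumed)
```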
In a possible implementation manner, the number of the images to be identified is N, and the number of the coding layers to be used is J;
the first determining subunit is specifically configured to: if the comparison result shows that the number of the images to be identified is smaller than the number of the coding layers to be used, determining the coding feature of the ith image to be identified as the ith feature to be coded; wherein i is a positive integer, i is not more than N-1; and determining the Nth to the J-th to-be-coded features by using the coding features of the Nth to-be-recognized image.
In a possible implementation manner, the number of the coding layers to be used is J;
the first encoding subunit includes:
the second coding subunit is configured to perform coding processing on the 1 st to-be-coded feature by using the 1 st to-be-used coding layer, so as to obtain a coding processing result of the 1 st to-be-coded feature;
the third coding subunit is used for coding the j-th feature to be coded and the coding processing result of the (j-1)-th feature to be coded by using the j-th coding layer to be used to obtain the coding processing result of the j-th feature to be coded; wherein the coding processing result of the (j-1)-th feature to be coded refers to the output result of the (j-1)-th coding layer to be used; j is a positive integer, and 2 ≤ j ≤ J;
and the second determining subunit is used for determining the encoding processing result of the J-th feature to be encoded as the encoding feature of the text to be recognized.
In a possible implementation, the 1 st to-be-used coding layer comprises a self-attention module and a feedforward neural network module;
the second encoding subunit is specifically configured to:
carrying out coding pretreatment on the 1 st feature to be coded to obtain a pretreatment result of the 1 st feature to be coded;
inputting the preprocessing result of the 1 st feature to be coded into a self-attention module in the 1 st coding layer to be used, and obtaining a self-attention processing result of the 1 st feature to be coded output by the self-attention module;
inputting the self-attention processing result of the 1 st feature to be coded into a feedforward neural network module in the 1 st coding layer to be used, and obtaining the coding processing result of the 1 st feature to be coded output by the feedforward neural network module.
In a possible implementation, the j to-be-used coding layer includes two self-attention modules and one feedforward neural network module;
the third encoding subunit is specifically configured to:
carrying out coding pretreatment on the jth feature to be coded to obtain a pretreatment result of the jth feature to be coded;
inputting the preprocessing result of the jth feature to be coded into a first self-attention module in the jth coding layer to be used, and obtaining a first self-attention processing result of the jth feature to be coded, which is output by the first self-attention module;
inputting the coding processing result of the (j-1) th feature to be coded and the first self-attention processing result of the (j) th feature to be coded into a second self-attention module in the (j) th coding layer to be used, and obtaining a second self-attention processing result of the (j) th feature to be coded, which is output by the second self-attention module;
and inputting the second self-attention processing result of the j-th feature to be coded into the feedforward neural network module in the j-th coding layer to be used to obtain the coding processing result of the j-th feature to be coded output by the feedforward neural network module.
In a possible implementation manner, the number of the images to be recognized is N;
the first encoding unit 802 is specifically configured to:
performing feature extraction on the n-th image to be recognized to obtain the visual feature of the n-th image to be recognized; wherein n is a positive integer, n is not more than N, and N is a positive integer;
performing fourth coding processing on the visual feature of the n-th image to be recognized to obtain the coding feature of the n-th image to be recognized; wherein n is a positive integer, n is not more than N, and N is a positive integer.
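Purely as an illustrative sketch of the two stages named here (visual feature extraction followed by fourth coding processing), one possible arrangement is shown below; the convolutional backbone and the bidirectional GRU are assumptions of this sketch and are not specified by this unit's description:

```python
import torch
import torch.nn as nn

class FirstEncoding(nn.Module):
    """Sketch of per-image first encoding: feature extraction + fourth coding (both modules assumed)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))                    # collapse height, keep width as a sequence
        self.rnn = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, image):                                    # image: (batch, 3, H, W)
        feat = self.backbone(image).squeeze(2).permute(0, 2, 1)  # visual feature: (batch, W', dim)
        encoded, _ = self.rnn(feat)                              # fourth coding -> coding feature
        return encoded
```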
In a possible implementation, the image obtaining unit 801 is specifically configured to: clustering a plurality of candidate images to obtain at least one candidate image set, so that all the candidate images in the candidate image set comprise the same character information; and determining the plurality of images to be identified according to the image set to be identified in the at least one candidate image set.
In a possible implementation, the determining of the at least one candidate image includes: performing text detection on a plurality of frames of video images in a video to be processed to obtain a text detection result of the plurality of frames of video images; and respectively carrying out image cutting on the multi-frame video images according to the text detection results of the multi-frame video images to obtain the plurality of candidate images.
Based on the related content of the character recognition device 800, for the character recognition device 800, after acquiring a plurality of images to be recognized, which all include the same character information, first encoding processing is performed on each image to be recognized, so as to obtain the encoding characteristics of each image to be recognized; and then carrying out second coding processing on the coding features of all the images to be recognized to obtain the coding features of the text to be recognized, so that the coding features of the text to be recognized can accurately represent character information carried by all the images to be recognized, the coding features of the text to be recognized can more accurately represent each character in the text to be recognized, and further the character recognition result of the text to be recognized determined based on the coding features of the text to be recognized is more accurate, thereby being beneficial to improving the character recognition accuracy of multi-frame text line recognition.
Further, an embodiment of the present application further provides an apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing a computer program;
the processor is used for executing any implementation mode of the character recognition method provided by the embodiment of the application according to the computer program.
Further, an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, where the computer program is used to execute any implementation manner of the character recognition method provided in the embodiment of the present application.
Further, an embodiment of the present application also provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation of the character recognition method provided in the embodiment of the present application.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been described above with reference to preferred embodiments, these are not intended to be limiting. Using the methods and technical content disclosed above, those skilled in the art can make numerous possible variations and modifications to the technical solution of the present invention, or derive equivalent embodiments through equivalent changes, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (15)

1. A method of character recognition, the method comprising:
acquiring a plurality of images to be identified; wherein the plurality of images to be recognized include the same character information;
respectively carrying out first coding processing on each image to be recognized to obtain the coding characteristics of each image to be recognized;
performing second coding processing on the coding features of the multiple images to be recognized to obtain the coding features of the text to be recognized, so that the coding features corresponding to the text to be recognized are used for representing character information carried by the multiple images to be recognized;
and decoding the coding characteristics of the text to be recognized to obtain a character recognition result of the text to be recognized.
2. The method according to claim 1, wherein the process of determining the encoding characteristics of the text to be recognized comprises:
comparing the number of the images to be identified with the number of the coding layers to be used to obtain a comparison result;
determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the plurality of images to be recognized;
and performing third coding processing on the features to be coded of the number of the coding layers to be used by using the coding layers to be used of the number of the coding layers to be used to obtain the coding features of the text to be recognized.
3. The method according to claim 2, characterized in that the number of the images to be recognized is N;
the determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the plurality of images to be recognized includes:
if the comparison result shows that the number of the images to be recognized is equal to the number of the coding layers to be used, determining the coding feature of the n-th image to be recognized as the n-th feature to be coded; wherein n is a positive integer and n is less than or equal to N.
4. The method according to claim 2, wherein the determining the features to be encoded of the number of encoding layers to be used according to the comparison result and the encoding features of the plurality of images to be recognized comprises:
if the comparison result shows that the number of the images to be identified is larger than the number of the coding layers to be used, splicing the coding features of the images to be identified to obtain coding features to be segmented;
and carrying out segmentation processing on the coding feature to be segmented according to the number of the coding layers to be used to obtain the features to be coded, the number of which equals the number of the coding layers to be used.
5. The method according to claim 2, wherein the number of the images to be identified is N, and the number of the coding layers to be used is J;
the determining the features to be coded of the number of the coding layers to be used according to the comparison result and the coding features of the plurality of images to be recognized includes:
if the comparison result shows that the number of the images to be identified is smaller than the number of the coding layers to be used, determining the coding feature of the ith image to be identified as the ith feature to be coded; wherein i is a positive integer, i is not more than N-1;
and determining the Nth to the J-th to-be-coded features by using the coding features of the Nth to-be-recognized image.
6. The method according to any one of claims 2-5, wherein the number of coding layers to be used is J;
the third encoding processing is performed on the features to be encoded of the number of the encoding layers to be used by using the encoding layers to be used of the number of the encoding layers to be used, so as to obtain the encoding features of the text to be recognized, and the third encoding processing includes:
utilizing a 1 st coding layer to be used to code the 1 st feature to be coded to obtain a coding processing result of the 1 st feature to be coded;
coding the j-th feature to be coded and the coding processing result of the (j-1)-th feature to be coded by using the j-th coding layer to be used to obtain the coding processing result of the j-th feature to be coded; wherein the coding processing result of the (j-1)-th feature to be coded refers to the output result of the (j-1)-th coding layer to be used; j is a positive integer, and 2 ≤ j ≤ J;
and determining the encoding processing result of the J-th feature to be encoded as the encoding feature of the text to be recognized.
7. The method of claim 6, wherein the 1 st coding layer to be used comprises a self-attention module and a feedforward neural network module;
the method for obtaining the coding processing result of the 1 st to-be-coded feature by using the 1 st to-be-used coding layer to code the 1 st to-be-coded feature includes:
carrying out coding pretreatment on the 1 st feature to be coded to obtain a pretreatment result of the 1 st feature to be coded;
inputting the preprocessing result of the 1 st feature to be coded into a self-attention module in the 1 st coding layer to be used, and obtaining a self-attention processing result of the 1 st feature to be coded output by the self-attention module;
inputting the self-attention processing result of the 1 st feature to be coded into a feedforward neural network module in the 1 st coding layer to be used, and obtaining the coding processing result of the 1 st feature to be coded output by the feedforward neural network module.
8. The method of claim 6, wherein the jth to-be-used coding layer comprises two self-attention modules and one feedforward neural network module;
the encoding processing of the j to-be-encoded feature and the j-1 to-be-encoded feature by using the j to-be-used encoding layer to obtain the encoding processing result of the j to-be-encoded feature includes:
carrying out coding pretreatment on the jth feature to be coded to obtain a pretreatment result of the jth feature to be coded;
inputting the preprocessing result of the jth feature to be coded into a first self-attention module in the jth coding layer to be used, and obtaining a first self-attention processing result of the jth feature to be coded, which is output by the first self-attention module;
inputting the coding processing result of the (j-1) th feature to be coded and the first self-attention processing result of the (j) th feature to be coded into a second self-attention module in the (j) th coding layer to be used, and obtaining a second self-attention processing result of the (j) th feature to be coded, which is output by the second self-attention module;
and inputting the second self-attention processing result of the j-th feature to be coded into the feedforward neural network module in the j-th coding layer to be used to obtain the coding processing result of the j-th feature to be coded output by the feedforward neural network module.
9. The method according to claim 1, characterized in that the number of the images to be recognized is N;
the process for determining the coding features of the nth image to be recognized comprises the following steps:
performing feature extraction on the n-th image to be recognized to obtain the visual feature of the n-th image to be recognized; wherein n is a positive integer, n is not more than N, and N is a positive integer;
and performing fourth coding processing on the visual features of the nth image to be recognized to obtain the coding features of the nth image to be recognized.
10. The method according to claim 1, wherein the process of acquiring the plurality of images to be identified comprises:
clustering a plurality of candidate images to obtain at least one candidate image set, so that all the candidate images in the candidate image set comprise the same character information;
and determining the plurality of images to be identified according to the image set to be identified in the at least one candidate image set.
11. The method of claim 10, wherein the determining of the at least one candidate image comprises:
performing text detection on a plurality of frames of video images in a video to be processed to obtain a text detection result of the plurality of frames of video images;
and respectively carrying out image cutting on the multi-frame video images according to the text detection results of the multi-frame video images to obtain the plurality of candidate images.
12. A character recognition apparatus, comprising:
the image acquisition unit is used for acquiring a plurality of images to be identified; wherein the plurality of images to be recognized include the same character information;
the first coding unit is used for respectively carrying out first coding processing on each image to be identified to obtain the coding characteristics of each image to be identified;
the second coding unit is used for carrying out second coding processing on the coding features of the images to be recognized to obtain the coding features of the text to be recognized, so that the coding features corresponding to the text to be recognized are used for representing character information carried by the images to be recognized;
and the characteristic decoding unit is used for decoding the coding characteristics of the text to be recognized to obtain a character recognition result of the text to be recognized.
13. An apparatus, comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1-11 in accordance with the computer program.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-11.
15. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-11.
CN202110925424.9A 2021-08-12 2021-08-12 Character recognition method and related equipment thereof Pending CN113610082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110925424.9A CN113610082A (en) 2021-08-12 2021-08-12 Character recognition method and related equipment thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110925424.9A CN113610082A (en) 2021-08-12 2021-08-12 Character recognition method and related equipment thereof

Publications (1)

Publication Number Publication Date
CN113610082A true CN113610082A (en) 2021-11-05

Family

ID=78340530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110925424.9A Pending CN113610082A (en) 2021-08-12 2021-08-12 Character recognition method and related equipment thereof

Country Status (1)

Country Link
CN (1) CN113610082A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609560A (en) * 2017-09-27 2018-01-19 北京小米移动软件有限公司 Character recognition method and device
CN111126386A (en) * 2019-12-20 2020-05-08 复旦大学 Sequence field adaptation method based on counterstudy in scene text recognition
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics
CN112861861A (en) * 2021-01-15 2021-05-28 珠海世纪鼎利科技股份有限公司 Method and device for identifying nixie tube text and electronic equipment
CN113221792A (en) * 2021-05-21 2021-08-06 北京声智科技有限公司 Chapter detection model construction method, cataloguing method and related equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241468A (en) * 2021-12-21 2022-03-25 北京有竹居网络技术有限公司 Text recognition method and related equipment thereof
CN115438183A (en) * 2022-08-31 2022-12-06 广州宝立科技有限公司 Business website monitoring system based on natural language processing

Similar Documents

Publication Publication Date Title
CN110738090B (en) System and method for end-to-end handwritten text recognition using neural networks
CN111476067A (en) Character recognition method and device for image, electronic equipment and readable storage medium
EP3923183A1 (en) Method and system for video analysis
CN113610082A (en) Character recognition method and related equipment thereof
CN111931736B (en) Lip language identification method and system using non-autoregressive model and integrated discharge technology
CN114092742B (en) Multi-angle-based small sample image classification device and method
CN114067143A (en) Vehicle weight recognition method based on dual sub-networks
CN116978011B (en) Image semantic communication method and system for intelligent target recognition
CN114021646A (en) Image description text determination method and related equipment thereof
US20220292877A1 (en) Systems, methods, and storage media for creating image data embeddings to be used for image recognition
CN112035701A (en) Internet short video source tracing method and system
Guan et al. Improving handwritten OCR with augmented text line images synthesized from online handwriting samples by style-conditioned GAN
CN109063772B (en) Image personalized semantic analysis method, device and equipment based on deep learning
CN114005019A (en) Method for identifying copied image and related equipment thereof
CN112597997A (en) Region-of-interest determining method, image content identifying method and device
CN112966676A (en) Document key information extraction method based on zero sample learning
CN113221792B (en) Chapter detection model construction method, cataloguing method and related equipment
CN113255829B (en) Zero sample image target detection method and device based on deep learning
CN112200840B (en) Moving object detection system in visible light and infrared image combination
CN114694209A (en) Video processing method and device, electronic equipment and computer storage medium
CN113657370A (en) Character recognition method and related equipment thereof
CN114495230A (en) Expression recognition method and device, electronic equipment and storage medium
CN113610081A (en) Character recognition method and related equipment thereof
CN108287817B (en) Information processing method and device
CN116863456B (en) Video text recognition method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination