CN114429636A - Image scanning identification method and device and electronic equipment - Google Patents

Image scanning identification method and device and electronic equipment

Info

Publication number
CN114429636A
Authority
CN
China
Prior art keywords
picture
local picture
spliced
current frame
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210353468.3A
Other languages
Chinese (zh)
Other versions
CN114429636B (en)
Inventor
王金桥
葛国敬
朱贵波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210353468.3A priority Critical patent/CN114429636B/en
Publication of CN114429636A publication Critical patent/CN114429636A/en
Application granted granted Critical
Publication of CN114429636B publication Critical patent/CN114429636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image scanning identification method and device and electronic equipment. The method comprises the following steps: in the current iteration, performing image block matching between the mask region and the next frame of local picture, splicing the pictures, and updating the mask region required by the next iteration, the initial mask region being obtained from the result of text detection on the initial frame of local picture; for the spliced whole picture, performing text detection if it meets a preset condition, and performing text recognition on the spliced picture if the text detection result meets a preset recognition condition; continuing with the next iteration until the scanning pen stops scanning; and obtaining the image scanning result of the scanning pen from the text recognition results of the spliced pictures obtained in each iteration. The invention achieves good real-time performance on low-configuration devices with limited computing and storage resources, and improves the efficiency and accuracy of image splicing and recognition.

Description

Image scanning identification method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an image scanning identification method and apparatus, and an electronic device.
Background
With the development of internet technology, instant image scanning devices (such as electronic scanning pens) have become an indispensable part of the learning process of primary and secondary school students, and even of university students. Such instant scanning devices rely mainly on image splicing and on text detection and recognition technologies.
Image splicing is the technique of joining multiple images acquired by the instant scanning device, mainly images with overlapping parts (that is, images captured by the same sensor at short time intervals and with small viewing-angle changes), into a seamless panoramic picture or a high-resolution image.
The image splicing technology mainly comprises three parts: feature point extraction and matching, image registration, and image fusion. Feature point extraction is usually realized with operators such as SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features) or ORB (Oriented FAST and Rotated BRIEF), and the image registration stage needs to compute the inverse of the transformation matrix of the spliced image.
However, the hardware configuration of instant scanning devices is generally low; a common configuration is an RK3326 chip with a main frequency of 1.5 GHz. Because the hardware performance of such low-configuration embedded devices is limited, the computation easily exceeds the device's capability as the spliced picture grows longer and longer, and the scanning result cannot be output accurately in real time.
Disclosure of Invention
The invention provides an image scanning identification method and device and electronic equipment, to overcome the defect in the prior art that, because the hardware performance of low-configuration embedded devices is limited, the scanning result cannot be output accurately in real time once the spliced picture becomes too long, so that the image scanning result can be output accurately in real time on low-configuration embedded devices.
The invention provides an image scanning identification method, which comprises the following steps:
acquiring a current frame local picture scanned by a scanning pen, and acquiring a mask area of the current frame local picture;
according to the mask area of the current frame local picture, splicing the next frame local picture after image block matching to obtain a spliced picture corresponding to the next frame local picture, and updating the mask area matched with the next frame local picture;
performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition;
taking the next frame of local picture as a new current frame of local picture, and continuing to execute the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning;
and acquiring an image scanning recognition result of the scanning pen according to the text recognition result of the spliced image obtained in each iteration process.
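The iteration described in the steps above can be outlined in code. This is an illustrative sketch only: every callable is a placeholder for a step named in the method, not an API from the patent.

```python
def scan_and_recognize(frames, detect_text, recognize_text,
                       match_blocks, splice, update_mask,
                       should_detect, should_recognize):
    """Sketch of the per-frame iteration described above.

    All callables are placeholders for the named steps; this outline is
    illustrative, not the patented implementation.
    """
    results = []
    frames = iter(frames)
    current = next(frames)
    mask = detect_text(current)          # initial mask from text detection
    spliced = current
    for nxt in frames:                    # iterate until the pen stops scanning
        block = match_blocks(mask, nxt)   # match mask against the next frame
        spliced = splice(spliced, nxt, block)
        mask = update_mask(mask, nxt, block)
        if should_detect(spliced):        # preset detection condition
            boxes = detect_text(spliced)
            if should_recognize(boxes):   # preset recognition condition
                results.append(recognize_text(spliced))
        current = nxt
    return results
```

With trivial stand-in callables, the loop visits every frame after the first and records one recognition result per qualifying iteration.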
According to an image scanning and identifying method provided by the present invention, the obtaining of the mask region of the current frame local picture includes:
under the condition that the current frame is a starting frame, performing text detection on the local picture of the current frame;
under the condition that the text detection result of the current frame local picture comprises a text detection box, acquiring a mask region of the current frame local picture according to the text detection box of the current frame local picture;
and under the condition that the current frame is an intermediate frame, acquiring a mask region of the current frame local picture according to a matching region of the current frame local picture and a previous frame local picture.
According to the image scanning and identifying method provided by the present invention, in a case that the current frame is an intermediate frame, obtaining a mask region of the current frame local picture according to a matching region of the current frame local picture and a previous frame local picture, includes:
determining whether the position of a target image block is positioned at the right boundary of the current frame local picture under the condition that the current frame is an intermediate frame and a mask region exists in the previous frame local picture; the target image block is an image block which is most matched with a mask area of a previous frame of local picture in the current frame of local picture;
taking the mask area of the previous frame local picture as the mask area of the current frame local picture under the condition that the position of the target image block is positioned at the right boundary of the current frame local picture;
and under the condition that the position of the target image block is not positioned at the right boundary of the current frame local picture, updating the mask area of the previous frame local picture, and taking the updated mask area as the mask area of the current frame local picture.
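The right-boundary rule above can be sketched as follows, assuming a mask represented as an `(x, y, w, h)` rectangle in frame coordinates (a representation chosen for illustration; the patent does not specify one):

```python
def next_mask(prev_mask, match_x, frame_width, shift):
    """Decide the current frame's mask from the previous frame's mask.

    prev_mask: (x, y, w, h) of the previous frame's mask region (assumed layout).
    match_x:   x-position of the best-matching image block in the current frame.
    shift:     how far the mask moves when the match is interior (illustrative).

    Rule from the text: if the best-matching block sits on the right boundary,
    reuse the previous mask unchanged; otherwise update it.
    """
    x, y, w, h = prev_mask
    if match_x + w >= frame_width:        # match touches the right boundary
        return prev_mask                  # keep the previous frame's mask
    new_x = min(x + shift, frame_width - w)
    return (new_x, y, w, h)               # updated mask for the current frame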
According to the image scanning identification method provided by the invention, according to the mask area of the current frame local picture, the next frame local picture is spliced after image block matching to obtain a spliced picture corresponding to the next frame local picture, and the method comprises the following steps:
performing image block matching on the next frame of local picture according to the mask area of the current frame of local picture, and acquiring an image block which is most matched with the mask area of the current frame of local picture in the next frame of local picture;
under the condition that the most matched image block in the next frame of local picture is positioned at the right boundary of the next frame of local picture, picture splicing is not carried out;
under the condition that the most matched image block in the next frame of local picture is not positioned at the right boundary of the next frame of local picture, acquiring a region to be spliced in the next frame of local picture according to the most matched image block;
and splicing the to-be-spliced area with the spliced picture corresponding to the current frame local picture to obtain the spliced picture corresponding to the next frame local picture.
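A minimal sketch of this splicing decision, assuming grayscale frames as NumPy arrays and a left-to-right scan (both assumptions made for illustration):

```python
import numpy as np

def splice_next_frame(spliced, next_frame, match_x, block_w):
    """Append the unseen part of next_frame to the spliced picture.

    match_x and block_w locate the block of next_frame that best matches the
    current mask. If that block sits on the right boundary there is nothing
    new to add; otherwise everything to its right is the region to be spliced.
    (Illustrative sketch; the coordinate conventions are assumptions.)
    """
    h, w = next_frame.shape[:2]
    if match_x + block_w >= w:            # best match at the right boundary
        return spliced                    # no splicing this iteration
    region = next_frame[:, match_x + block_w:]   # region to be spliced
    return np.hstack([spliced, region])
```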
According to the image scanning and identifying method provided by the invention, the text detection of the spliced image comprises the following steps:
inputting the spliced picture into a text detection model to obtain a text detection result of the spliced picture;
the text detection model is obtained based on a sample picture and a text detection result of the sample picture through training;
the text detection model is constructed and generated based on a lightweight neural network and comprises a trunk network and a head network;
the backbone network is used for extracting features of different scales of the spliced picture to obtain a plurality of first feature maps of different scales of the spliced picture;
and the head network is used for fusing and learning the first feature maps with different scales to obtain a text detection result of the spliced picture.
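One common way a head network fuses feature maps of different scales is to upsample the coarser maps to the finest resolution and concatenate along the channel axis. The toy NumPy sketch below shows that pattern; the actual model's fusion details are not specified in the text.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fuse_features(feats):
    """Head-style fusion sketch: upsample every map to the largest spatial
    size, then concatenate along channels.

    feats: list of (C_i, H_i, W_i) maps from coarse backbone stages, each
    half the resolution of the previous one (an assumed layout).
    """
    target_h = max(f.shape[1] for f in feats)
    out = []
    for f in feats:
        while f.shape[1] < target_h:
            f = upsample2x(f)
        out.append(f)
    return np.concatenate(out, axis=0)    # fused multi-scale tensor
```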
According to the image scanning and identifying method provided by the invention, the text detection model is trained and obtained based on the following steps:
iteratively training the text detection model based on the sample picture and the text detection result of the sample picture, and pruning the text detection model based on a model pruning algorithm or a model compression algorithm in the training process until a preset training termination condition is met;
the learning rate adopted by the text detection model in the training process comprises a cosine learning rate mechanism or a preheating learning rate mechanism.
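A warm-up phase followed by cosine decay is commonly combined into one schedule; a minimal sketch (the constants are illustrative defaults, not values from the patent):

```python
import math

def lr_at(step, total_steps, base_lr=0.01, warmup_steps=100):
    """Warm-up then cosine-decay learning-rate schedule, as mentioned in
    the text above (base_lr and warmup_steps are illustrative)."""
    if step < warmup_steps:               # linear warm-up phase
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```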
According to the image scanning and identifying method provided by the invention, the text identification of the spliced image comprises the following steps:
inputting the spliced picture into a text recognition model to obtain a text recognition result of the spliced picture;
the text recognition model is trained and acquired based on a sample picture and a text recognition result of the sample picture;
the text recognition model is constructed and generated based on a convolutional neural network, a cyclic neural network and a classification network;
the convolutional neural network is used for extracting the characteristics of the spliced picture to obtain a second characteristic diagram of the spliced picture;
the recurrent neural network is used for learning the second feature map to obtain the category probability distribution of the spliced picture;
and the classification network is used for converting the class probability distribution to obtain a text recognition result of the spliced picture.
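In CRNN-style recognizers, the final conversion from per-timestep class probabilities to text is typically CTC greedy decoding (collapse repeated labels, drop blanks). The sketch below assumes the model here uses a CTC-style output, which the text does not state explicitly.

```python
def ctc_greedy_decode(prob_rows, charset, blank=0):
    """Greedy CTC decoding: turn per-timestep class probability vectors
    from the recurrent network into a text string.

    prob_rows: list of per-timestep probability vectors.
    charset:   index -> character table (index `blank` is the CTC blank).
    A common decoding for CRNN-style recognizers; shown as a sketch.
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:  # collapse repeats, drop blanks
            out.append(charset[idx])
        prev = idx
    return "".join(out)
```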
According to the image scanning identification method provided by the invention, the identification result of the image scanning of the scanning pen is obtained according to the text identification result of the spliced picture obtained in each iteration process, and the method comprises the following steps:
verifying a text recognition result of the spliced picture obtained in each iteration process;
and acquiring a final recognition result of the image scanning of the scanning pen according to the inspection result.
The present invention also provides an image scanning and recognizing apparatus, comprising:
the acquisition module is used for acquiring a current frame local picture scanned by a scanning pen and acquiring a mask area of the current frame local picture;
the splicing module is used for splicing the next frame of local picture after image block matching according to the mask area of the current frame of local picture to obtain a spliced picture corresponding to the next frame of local picture, and updating the mask area matched with the next frame of local picture;
the detection and identification module is used for performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition;
the iteration module is used for taking the next frame of local picture as a new current frame of local picture, and continuously executing the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning;
and the output module is used for acquiring the identification result of the image scanning of the scanning pen according to the text identification result of the spliced picture obtained in each iteration process.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to realize the image scanning and identifying method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image scan recognition method as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image scanning identification method as described in any one of the above.
According to the image scanning identification method, device and electronic equipment of the invention, on the one hand, the mask region of the current frame local picture is obtained in real time; its size does not grow, and it can be computed from the first scanned frame onward. In each iteration, only the mask region updated in the previous frame needs to be template-matched against the image blocks of the current frame local picture, so image matching and splicing can be performed quickly and accurately and the image scanning result is output accurately in real time. This effectively reduces the computation required for image scanning and the computing-performance demands on the embedded device, so that high real-time performance and accuracy can be achieved on low-configuration embedded devices. On the other hand, text detection is performed on the spliced picture only when a preset detection condition is met, and recognition only when a preset recognition condition is met, which avoids the redundant computation of detecting and recognizing every frame of local picture and further reduces the computation during scanning, so the scanning recognition result can be output accurately in real time when scanning ends. In addition, to prevent errors in recognition-while-scanning, the spliced whole picture can be detected and recognized to check the scanning recognition result.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an image scanning and identifying method provided by the present invention;
FIG. 2 is a schematic structural diagram of a text detection network in the image scanning and recognition method provided by the present invention;
FIG. 3 is a schematic structural diagram of a backbone network in a text detection network in the image scanning recognition method provided by the present invention;
FIG. 4 is a schematic structural diagram of a bneck module of a backbone network in the image scanning identification method provided by the present invention;
FIG. 5 is a schematic structural diagram of text recognition in the image scanning recognition method provided by the present invention;
FIG. 6 is a second schematic flowchart of an image scanning and recognizing method according to the present invention;
FIG. 7 is a schematic structural diagram of an image scanning and recognizing apparatus provided by the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings and embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The terms "first," "second," and the like in this embodiment are used for distinguishing between similar elements and not necessarily for describing or implying any particular order or sequence. The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
With the development of internet technology, scanning technology has found wide application; for example, OCR (Optical Character Recognition) for printed fonts is now well developed. The mobile version of Tencent's TIM office software has an image text extraction function, and Microsoft's Office Lens tool also provides scanning functionality. These techniques can accurately recognize printed text. However, how to scan a segment of text and output the scanning result in real time on an instant scanning device remains an urgent problem in the industry, especially since the hardware configuration of instant scanning devices is generally low, such as an RK3326 chip with a main frequency of only 1.5 GHz.
In view of the above problem, the present embodiment provides an image scanning and identifying method, as shown in fig. 1, the image scanning and identifying method includes:
step 101, acquiring a current frame local picture scanned by a scanning pen, and acquiring a mask area of the current frame local picture;
the local picture may be a partial picture of a picture to be scanned containing various types of text information, for example, the picture to be scanned is a picture containing text information in other fields such as a biological field and a medical field.
The current frame local picture may be an initial frame local picture of initial scanning, or an intermediate frame local picture of intermediate scanning, and the like, which is not specifically limited in this embodiment.
Optionally, when the scanning pen starts to scan the target picture, the camera is started to acquire the local pictures scanned by the pen. A local picture may contain no text at all, or only part of a segment or of a single character (for example, 1/3 or 1/2 of it), and the like, which is not specifically limited in this embodiment.
After the current frame local picture is obtained, if the current frame local picture is an initial frame local picture of initial scanning, determining a mask region of the current frame local picture according to a text detection box in the current frame local picture; if the current frame local picture is an intermediate frame local picture scanned in the middle and the previous frame local picture has a mask region, calculating an image block which is best matched with the mask region of the previous frame local picture according to the current frame local picture, and determining the mask region of the current frame local picture; the present embodiment does not specifically limit the specific obtaining manner of the mask region.
The mask region is the region where the effective text in the local picture of the current frame is located. Therefore, the picture matching and the picture splicing can be quickly and accurately realized through the mask area of the local picture of the current frame.
And 102, according to the mask area of the current frame local picture, splicing the next frame local picture after image block matching to obtain a spliced picture corresponding to the next frame local picture, and updating the mask area matched with the next frame local picture.
Optionally, after determining the mask region of the current frame local picture, performing image block matching on the next frame local picture according to the mask region of the current frame local picture to obtain an image block which is most matched with the mask region of the current frame local picture in the next frame local picture;
the image block matching manner includes standard square error matching, minimum weighted square error matching, and the like, which is not specifically limited in this embodiment.
Then, determining whether to carry out image splicing according to the most matched image block; when the pictures need to be spliced, according to the most matched image blocks, the spliced picture corresponding to the current frame local picture and the next frame local picture are spliced to obtain the spliced picture corresponding to the next frame picture, and the mask area of the next frame local picture is synchronously updated.
In this embodiment, image block matching and splicing are performed on the next frame of local picture through the mask region. In each matching-and-splicing pass, image matching and splicing only require simple template matching between the mask region of the current frame local picture and the image blocks of the next frame local picture. This effectively avoids the problem of classic splicing algorithms, in which the cost of inverting the transformation matrix of the spliced picture grows with the size of the spliced image until a low-configuration embedded device can no longer satisfy the computation. Pictures can therefore be matched and spliced accurately in real time, and the image scanning result output accurately in real time.
103, performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition;
the preset detection condition includes whether the length of the spliced picture meets a preset value or whether the length of the spliced picture meets a preset frame interval, for example, the spliced picture is detected once every 45 pixels, which is not specifically limited in this example.
The preset identification condition includes that a newly added text exists in the spliced picture or a matching threshold appearing in the splicing process is smaller than a preset value, and the like, which is not specifically limited in this embodiment.
Optionally, after the spliced picture is obtained, determining whether the spliced picture meets a preset detection condition, performing text detection on the spliced picture under the condition that the preset detection condition is met, and performing text recognition on the spliced picture under the condition that the spliced picture meets a preset recognition condition determined according to a text detection result;
and if the preset detection condition or the preset identification condition is not met, not performing text detection on the spliced picture and further not performing text identification operation.
Because some adjacent frames have a high degree of overlap, such frames contribute essentially nothing to the overall splicing and image identification while adding redundant computation. Therefore, when the spliced picture does not meet the preset detection condition, or when the text detection result shows no newly added text (that is, the text content of the pictures before and after splicing largely coincides), text recognition is not performed on the spliced picture. This avoids recognizing every frame of local picture and can effectively improve the efficiency of image recognition; compared with matching and recognizing every frame of local picture, the speed can be improved by at least 3 to 4 times.
Step 104, taking the next frame of local picture as a new current frame of local picture, and continuing to execute the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning;
optionally, when the current iteration is completed, continuing to perform the next iteration, specifically, taking the next frame of local picture as a new current frame of local picture, and continuing to perform the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning, so as to finally obtain a completely spliced picture.
Each local picture contains only a certain segment, or a partial region of a certain character, of the target picture. For example, one local picture may include part of the region where the character 'multi-generation' is located (for example, only 1/2 of that region), and another local picture may include a different part (for example, 3/4 of that region); combining the two local pictures yields the complete region containing the character.
And 105, acquiring an image scanning recognition result of the scanning pen according to the text recognition result of the spliced image obtained in each iteration process.
Optionally, after obtaining the text recognition result of the stitched image obtained in each iteration process, the text recognition result of the stitched image obtained in each iteration process may be directly used as the image scanning result of the scan pen, or the text recognition result of the stitched image obtained in each iteration process may be verified and used as the image scanning result of the scan pen, which is not specifically limited in this embodiment.
When the scanning pen stops scanning, the scanning result of the image can be output in real time, and the scanning pen has the real-time performance of scanning and outputting.
On one hand, in this embodiment the mask region of the current frame local picture is obtained in real time. Regardless of how large the spliced picture grows, each iteration only requires template matching between the mask region updated in the previous frame and the image blocks of the current frame local picture, so image matching and splicing can be performed quickly and accurately and the image scanning result is output accurately in real time. This effectively reduces the computation of picture scanning and the computing-performance demands on the embedded device, so high real-time performance and accuracy can be achieved on low-configuration embedded devices. On the other hand, text detection is performed on the spliced picture only when the preset detection condition is met, and detection and recognition only when the preset recognition condition is met, which avoids the redundant computation of recognizing every frame of local picture and further reduces the computation during scanning, so the scanning recognition result can be output accurately in real time after scanning ends. In addition, to prevent errors in recognition-while-scanning, the spliced whole picture can be detected and recognized to check the scanning recognition result.
On the basis of the foregoing embodiment, the acquiring a mask region of the current frame local picture in this embodiment includes: under the condition that the current frame is a starting frame, performing text detection on the local picture of the current frame; under the condition that the text detection result of the current frame local picture comprises a text detection box, acquiring a mask region of the current frame local picture according to the text detection box of the current frame local picture; and under the condition that the current frame is an intermediate frame, acquiring a mask region of the current frame local picture according to a matching region of the current frame local picture and a previous frame local picture.
Optionally, the step of acquiring a mask region of the current frame local picture in step 101 specifically includes:
Since the current frame local picture may be either the starting picture of a scan or a picture from the middle of a scan, it cannot be assumed in advance that a mask region of the previous frame local picture exists.
For example, when the current frame local picture is the first frame local picture scanned by the scanning pen, the previous frame local picture does not exist, and the mask region of the previous frame local picture does not naturally exist; when the current frame local picture is the second frame local picture, if the mask region of the previous frame local picture does not exist, the current frame local picture cannot be obtained according to the mask region of the previous frame local picture.
Therefore, it must first be determined whether the current frame is a starting frame or an intermediate frame. If the current frame is a starting frame, text detection is performed on the current frame local picture to determine whether its text detection result includes a text detection box. When it does, the current frame local picture contains text information, and its mask region can be determined from the text detection box;
if the text detection result of the current frame local picture does not include a text detection box, no stitching is performed and text detection continues on the next frame local picture, until a local picture with a detection box is found, at which point the subsequent scanning recognition steps (mask region acquisition, image block matching, picture stitching, text detection, text recognition, and the like) begin.
If the current frame is an intermediate frame, it is determined whether the previous frame local picture contains a mask region. If it does not, no valid region containing text information exists in the local pictures preceding the current frame local picture;
in that case, text detection is performed on the current frame local picture to determine whether its text detection result includes a text detection box. When it does, the current frame local picture contains text information, and its mask region can be determined from the text detection box.
The method for determining the mask area of the current frame local picture according to the text detection box specifically comprises the steps of taking a partial area or a whole area where the text detection box is located as the mask area of the current frame local picture, and carrying out image block matching with the next frame local picture.
And under the condition that the text detection result of the current frame local picture does not contain a text detection box, continuing to perform text detection on the next frame local picture, and starting to perform subsequent scanning and identification steps such as mask region acquisition, image block matching, picture splicing, text detection, text identification and the like until the local picture with the detection box is found.
And under the condition that the mask area is contained in the previous frame of local picture, acquiring the mask area of the current frame of local picture according to the matching area of the current frame of local picture and the previous frame of local picture.
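The branching logic above can be summarized as a small dispatch function. The following is a minimal sketch, not the patent's implementation; `detect_text_boxes` and `match_region` are hypothetical stand-ins for the text detector and the matching-region computation described in this embodiment:

```python
def get_mask_region(frame, is_start_frame, prev_mask,
                    detect_text_boxes, match_region):
    # Start frame, or no mask carried over from the previous frame:
    # run text detection first (covers both branches in the text above).
    if is_start_frame or prev_mask is None:
        boxes = detect_text_boxes(frame)
        if not boxes:
            # No text detection box: skip this frame and keep scanning.
            return None
        # Use (part of) the text detection box as the mask region.
        return boxes[0]
    # Intermediate frame with an existing mask: derive the new mask from
    # the matching region between this frame and the previous one.
    return match_region(frame, prev_mask)
```

Callers would feed each captured frame through this function and only proceed to image-block matching and stitching when a mask region is returned.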
In this embodiment, when the previous frame local picture of the current frame does not contain a mask region, the subsequent scanning recognition steps are executed only if a text detection box is detected in the current frame picture. This avoids the redundant computation and time overhead of scanning and recognizing invalid local pictures that contain no text information, and effectively improves the efficiency of overall picture stitching: local pictures are stitched quickly while the resulting panoramic image remains highly accurate, so that scanning input can be stopped at any time and the scanning result output immediately, i.e., output while scanning.
On the basis of the foregoing embodiment, in this embodiment, when the current frame is an intermediate frame, obtaining a mask region of the current frame local picture according to a matching region between the current frame local picture and a previous frame local picture includes: determining whether the position of a target image block is located at the right boundary of the current frame local picture or not under the condition that the current frame is an intermediate frame and a mask area exists in the previous frame local picture; the target image block is an image block which is most matched with a mask area of a previous frame of local picture in the current frame of local picture; taking the mask area of the previous frame local picture as the mask area of the current frame local picture under the condition that the position of the target image block is positioned at the right boundary of the current frame local picture; and under the condition that the position of the target image block is not positioned at the right boundary of the current frame local picture, updating the mask area of the previous frame local picture, and taking the updated mask area as the mask area of the current frame local picture.
The target image block is the image block with the highest matching degree with the mask area of the previous frame local picture in the current frame local picture. Optionally, when a mask region exists in a previous frame of local picture of the current frame of local picture, a target image block that is most matched with the mask region in the previous frame of local picture in the current frame of local picture needs to be acquired;
whether the position of the target image block lies at the right boundary of the current frame local picture is then determined. If it does, the current frame local picture overlaps the previous frame local picture to a high degree, and the mask region of the previous frame local picture can be used directly as the mask region of the current frame local picture without updating;
if the target image block does not lie at the right boundary of the current frame local picture, the current frame local picture differs from the previous frame local picture to some extent. Continuing to use the mask region of the previous frame local picture for matching and stitching would then make accurate matching and stitching difficult and degrade the scanning recognition result.
For the above problem, the mask region of the previous local picture needs to be updated according to the target image block to obtain the mask region of the current local picture, and specifically, a part or all of the region where the target image block is located may be used as the mask region of the current local picture to update the mask region required by the current stitching task in real time, so as to improve the accuracy of image matching and stitching.
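The matching and mask-update rule can be sketched with a simple sum-of-squared-differences template match over a horizontal sliding window. This is an illustrative stand-in for the patent's template matching, assuming grayscale numpy arrays of equal height; the function names are not from the original:

```python
import numpy as np

def best_match_column(frame, mask_patch):
    # Slide mask_patch horizontally across the frame and return the column
    # offset with the smallest sum of squared differences, i.e. a minimal
    # stand-in for template matching against the mask region.
    h, w = mask_patch.shape
    cols = frame.shape[1] - w
    scores = [float(np.sum((frame[:h, c:c + w] - mask_patch) ** 2))
              for c in range(cols + 1)]
    return int(np.argmin(scores))

def update_mask(frame, mask_patch, border_tol=0):
    # Rule from the text: if the best match touches the right boundary,
    # keep the previous mask; otherwise the matched block becomes the new mask.
    h, w = mask_patch.shape
    col = best_match_column(frame, mask_patch)
    if col + w >= frame.shape[1] - border_tol:
        return mask_patch                 # heavy overlap: reuse the old mask
    return frame[:h, col:col + w]         # new mask taken from the matched block
```

In practice a normalized correlation measure (as in OpenCV's template matching) would be more robust to lighting changes than raw SSD; the boundary tolerance `border_tol` is an assumed parameter.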
In this embodiment, under the condition that the previous local picture of the current local picture includes a mask region, the mask region of the current local picture is obtained in real time according to the position of the target image block of the current local picture, so that the accuracy of image matching and splicing is higher.
On the basis of the foregoing embodiments, in this embodiment, the splicing after performing image block matching on the next frame of local picture according to the mask region of the current frame of local picture to obtain a spliced picture corresponding to the next frame of local picture includes: performing image block matching on the next frame of local picture according to the mask area of the current frame of local picture, and acquiring an image block which is most matched with the mask area of the current frame of local picture in the next frame of local picture; under the condition that the image block which is most matched in the next frame of local picture is positioned at the right boundary of the next frame of local picture, picture splicing is not carried out; under the condition that the most matched image block in the next frame of local picture is not positioned at the right boundary of the next frame of local picture, acquiring a region to be spliced in the next frame of local picture according to the most matched image block; and splicing the to-be-spliced area with the spliced picture corresponding to the current frame local picture to obtain the spliced picture corresponding to the next frame local picture.
Optionally, the step of matching and splicing image blocks in step 102 specifically includes, after obtaining a mask region of the current frame local picture, performing image block matching on a next frame local picture based on the mask region, and obtaining a best-matched image block in the next frame local picture;
the position of the best-matched image block in the next frame local picture is then determined. If the best-matched image block lies at the right boundary of the next frame local picture, the current frame local picture and the next frame local picture overlap to a high degree; the next frame local picture is redundant and does not affect the overall stitching and recognition results, so it can be ignored and no stitching is performed.
If the best-matched image block is not located at the right boundary of the next frame local picture, the next frame local picture differs from the current frame local picture to some extent and therefore does affect the overall stitching and recognition results. The region to be stitched in the next frame local picture is determined from the best-matched image block, specifically by taking the entire region to the right of the best-matched image block as the region to be stitched, and this region is appended to the stitched picture corresponding to the current frame local picture to obtain the stitched picture corresponding to the next frame local picture.
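The append-or-skip decision above can be sketched as follows, assuming numpy image arrays and a horizontally scanning pen; the function and parameter names are illustrative, not from the patent:

```python
import numpy as np

def stitch_next_frame(stitched, next_frame, match_col, block_w, border_tol=0):
    # If the best-matched block touches the right boundary of the next
    # frame, the frame is redundant and is skipped; otherwise everything
    # to the right of the matched block is appended to the stitched picture.
    if match_col + block_w >= next_frame.shape[1] - border_tol:
        return stitched
    region = next_frame[:, match_col + block_w:]   # region to be stitched
    return np.hstack([stitched, region])
```

A real implementation would also need vertical alignment between frames; this sketch assumes the frames are already row-aligned.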
Because scanning speed varies from person to person, consecutive input local pictures may be nearly identical across several frames, or may differ completely even though scanning is continuous. When adjacent-frame local pictures overlap heavily, they are mutually redundant: stitching every frame would produce a highly redundant stitched picture, require a large amount of redundant computation during text detection and recognition, lower computational efficiency, and make the scanning recognition result inaccurate because of the duplicated content. In this embodiment, whether to stitch is decided by the position of the best-matched image block in the next frame local picture; a local picture that contributes nothing to the overall stitched picture is simply skipped until new data arrives, after which matching and stitching resume. This avoids stitching and recognizing redundant pictures, effectively improves the efficiency and accuracy of stitching and recognition, and greatly reduces the computation during scanning.
On the basis of the foregoing embodiment, the text detection on the stitched image in this embodiment includes: inputting the spliced picture into a text detection model to obtain a text detection result of the spliced picture; the text detection model is obtained based on a sample picture and a text detection result of the sample picture through training; the text detection model is constructed and generated based on a lightweight neural network and comprises a trunk network and a head network; the backbone network is used for extracting features of different scales of the spliced picture to obtain a plurality of first feature maps of different scales of the spliced picture; and the head network is used for fusing and learning the first feature maps with different scales to obtain a text detection result of the spliced picture.
The text detection network is constructed based on a lightweight neural network, such as a Differentiable Binarization network (DBNet), which is not specifically limited in this embodiment.
As shown in fig. 2, the text detection network includes a backbone network and a Head network (Head network);
the Network structure of the backbone Network is a lightweight Network structure, such as a ResNet18 (Residual Neural Network), a MobileNet series, or a shuffle net series, which is not specifically limited in this embodiment.
The mobile terminal comprises a mobile terminal and an embedded device, wherein the mobile terminal is a lightweight CNN network concentrated in the mobile terminal or the embedded device; the ShuffleNet is designed specially for mobile equipment with very limited computing power by using two new operation methods of grouping point-by-point convolution and channel rearrangement.
The network structure of the backbone network will be described below by taking MobileNetV3 (small) as an example.
As shown in FIG. 3, the network structure of MobileNetV3-small is as follows: a convolutional layer, a plurality of bneck (bottleneck) modules, a convolutional layer, a pooling layer, and a fully connected layer.
As shown in FIG. 4, the bneck module includes a convolutional layer, a normalization layer, a nonlinear layer, a convolutional layer, a normalization layer, a Squeeze-and-Excitation (SE) layer, and the like.
The backbone network is configured to perform feature extraction on an input stitched image at different scales, that is, downsampling the stitched image, and obtain a plurality of first feature maps of different scales of the stitched image, for example, perform feature extraction on an original stitched image, and the scales of the obtained plurality of feature maps of different scales are 1/2, 1/4, 1/8, 1/16, 1/32 of the original stitched image, which is not specifically limited in this embodiment.
The Head network comprises a plurality of upsampling layers and is used for upsampling a plurality of first characteristic graphs with different scales output by the backbone network;
the up-sampling proportion of the up-sampling layer can be set according to actual requirements, such as proportion of 8 times, 4 times and 2 times of the original image;
the feature map output by each upsampling layer, together with the feature map output by the downsampling module corresponding to that upsampling layer, is input into the pyramid pooling layer and then fused, where the upsampling modules and downsampling modules are placed in one-to-one correspondence in advance.
The Head network also comprises a fusion module used for fusing the characteristic diagrams output by the plurality of upper sampling layers to obtain fused characteristic diagrams;
the specific fusion mode is that after convolution and/or up-sampling are carried out on the feature maps output by the plurality of up-sampling layers, the feature maps output by the plurality of up-sampling layers are shaped to obtain feature maps with the same scale, and then the plurality of feature maps are fused. The size of the convolution kernel can be set according to actual requirements, for example, the convolution kernel constraint is 3 × 3.
The Head network further comprises a detection layer used for carrying out probability map prediction and threshold map prediction on the fused feature map and obtaining a text detection result of the spliced picture according to the probability map prediction result and the threshold map prediction result.
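The patent does not spell out how the probability map and threshold map are combined. In the DBNet design this step is differentiable binarization, which computes an approximate binary map B = 1 / (1 + e^(−k(P − T))) with an amplification factor k (typically 50); the following sketch assumes that formulation:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    # B = 1 / (1 + exp(-k * (P - T))): pixels whose predicted probability
    # exceeds the learned per-pixel threshold are pushed toward 1, the
    # rest toward 0, while the whole operation stays differentiable.
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```

At inference time, text detection boxes are then typically extracted from the connected regions of this approximate binary map.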
It should be noted that, in order to further simplify the model, the text detection model needs to be pruned during training.
The text detection model can also be applied to other scenes needing text detection in the scanning process, such as a text detection scene of a local picture, and a text detection box of the local picture can be quickly and accurately obtained.
In summary, compared with the conventional DBNet, the text detection model in this embodiment uses a lighter-weight network structure, with a lighter backbone and Head network and fewer Squeeze-and-Excitation (SE) modules, and is pruned synchronously during training, so that the text detection result can be output quickly and accurately while the memory required by the model's computation is effectively reduced.
On the basis of the above embodiment, the text detection model in this embodiment is obtained by training based on the following steps: performing iterative training on the text detection model based on the sample picture and the text detection result of the sample picture, and pruning the text detection model based on a model pruning algorithm or a model compression algorithm in the training process until a preset training termination condition is met; the learning rate adopted by the text detection model in the training process comprises a cosine learning rate mechanism or a preheating learning rate mechanism.
Optionally, the training step of the text detection model specifically includes: firstly, acquiring a sample picture and a text detection result of the sample picture;
secondly, labels for the probability map, the binarization map, and the threshold map are constructed based on the sample picture and its text detection result;
then, forward computation is performed on the input sample picture, the loss function value and gradient of the text detection model are computed from the probability map, binarization map, and threshold map labels of the sample picture, and the text detection model is optimized with a cosine learning rate mechanism or a warm-up learning rate mechanism according to that loss function value and gradient. During the optimization training, the text detection model is pruned synchronously based on a model pruning algorithm or a model compression algorithm until a preset training termination condition is met, yielding a text detection model that can accurately detect text in an input picture.
The model pruning algorithm may be geometric median filter pruning, and the like, which is not specifically limited in this embodiment. The preset training termination condition includes reaching a maximum number of iterations, model convergence, and the like, which is not specifically limited in this embodiment.
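The cosine and warm-up ("preheating") learning rate mechanisms mentioned above are commonly combined into a single schedule. One common formulation is shown below; the patent does not specify the exact schedule, so this is an assumed example:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps=0):
    # Linear warm-up ("preheating"): ramp the learning rate up from
    # base_lr / warmup_steps to base_lr over the first warmup_steps steps.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: decay smoothly from base_lr to 0 over the rest.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

Warm-up stabilizes the early steps of training (important when pruning is applied synchronously), while the cosine decay lets the model settle into a minimum toward the end.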
Under the scene that text detection needs to be carried out on the pictures, the text detection result of the pictures can be quickly and accurately obtained only by inputting the spliced pictures.
In this embodiment, the text detection model is pruned and compressed during training, so that its scale can be effectively reduced while its detection precision is preserved, further reducing the memory needed for detection: the model actually used requires only about 1.4 MB of memory, and its forward inference time on a 3288 chip is reduced to 10 ms.
On the basis of the foregoing embodiments, the text recognition of the stitched image in this embodiment includes: inputting the spliced picture into a text recognition model to obtain a text recognition result of the spliced picture; the text recognition model is trained and acquired based on a sample picture and a text recognition result of the sample picture; the text recognition model is constructed and generated based on a convolutional neural network, a cyclic neural network and a classification network; the convolutional neural network is used for extracting the characteristics of the spliced picture to obtain a second characteristic diagram of the spliced picture; the recurrent neural network is used for learning the second feature map to obtain the category probability distribution of the spliced picture; and the classification network is used for converting the class probability distribution to obtain a text recognition result of the spliced picture.
The languages recognizable by the text recognition model include, but are not limited to, Chinese, English, Korean, and Japanese; the numbers of recognizable Chinese and English characters can be set according to actual requirements, for example 6622 Chinese characters and 63 characters comprising letters (A-Z and a-z), digits, and operators such as +, -, ×, and /.
As shown in fig. 5, the text recognition model may be constructed based on a Convolutional Recurrent Neural Network (CRNN), which mainly comprises a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a classification network;
the structures and parameters of the convolutional neural network, the cyclic neural network and the classification network, such as the type, the number of layers, the initial parameters and the like of the network can be specifically set according to actual requirements.
For example, the type of the convolutional neural network is a deep convolutional neural network, and the convolutional neural network is used for performing feature extraction on an input image to obtain a second feature map;
an exemplary structure of the convolutional neural network is composed of a convolutional layer, a maximum pooling layer, a convolutional layer, a normalization layer, a maximum pooling layer, a convolutional layer and a feature sequence mapping layer in sequence.
The type of the cyclic neural network is a deep bidirectional cyclic neural network, and the cyclic neural network is used for predicting and learning the second characteristic graph to obtain the category probability distribution of the spliced pictures;
an exemplary structure of the convolutional neural network is composed of 2 deep bidirectional cyclic neural networks.
The classification network is a Connectionist Temporal Classification (CTC) network; using the CTC loss, the class probability distribution of the stitched picture obtained from the recurrent neural network can be converted into a final label sequence, yielding the text recognition result of the stitched picture.
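At inference time the CTC conversion from per-timestep class probabilities to a label sequence is commonly done by best-path (greedy) decoding: collapse consecutive repeats, then drop the blank label. A minimal sketch of that standard decoding step (not code from the patent):

```python
def ctc_greedy_decode(per_step_labels, blank=0):
    # Collapse repeated labels, then remove blanks. A blank between two
    # identical labels keeps them distinct, e.g. "a", blank, "a" -> "aa".
    out, prev = [], None
    for label in per_step_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

The input would be the argmax over the class probability distribution at each timestep produced by the recurrent network.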
Optionally, before performing text recognition on the stitched image by using the text recognition model, the text recognition model needs to be trained, and the specific training step includes: firstly, acquiring a sample picture and a text recognition result of the sample picture;
the sample picture and the text recognition result of the sample picture may be a pre-labeled text picture, or may be directly obtained by downloading from an open-source text recognition data set, and the like, which is not specifically limited in this embodiment.
The open-source text recognition data set includes, but is not limited to, public data sets such as LSVT, RCTW-17, MTWI, 2018HE, CCPD2019, etc.
And then, carrying out forward calculation on the sample picture, calculating a loss function value and a gradient of the text recognition model by combining a real text recognition result of the sample picture, and optimizing the text recognition model according to the loss function value and the gradient of the text recognition model to obtain the text recognition model capable of accurately recognizing the input picture.
In a scene where text recognition needs to be performed on the stitched picture, the text recognition result can be obtained quickly and accurately simply by inputting the stitched picture, and the text recognition model actually used requires only about 1.6 MB of memory.
It should be noted that the text recognition model in this embodiment may also be used in other scenes that require text recognition, such as text recognition in a verification process.
In the embodiment, the text recognition model is constructed and generated based on the convolution recurrent neural network, so that the text recognition result of the spliced picture can be quickly and accurately obtained.
On the basis of the above embodiment, in this embodiment, obtaining the recognition result of the image scanning by the scanning pen according to the text recognition result of the stitched picture obtained in each iteration comprises: verifying the text recognition result of the stitched picture obtained in each iteration; and obtaining the final recognition result of the image scanning by the scanning pen according to the verification result.
Optionally, in order to further improve the accuracy of the image scanning result, the text recognition result of the stitched image obtained by the scanning pen in each iteration process needs to be corrected, so as to obtain a more accurate and reliable image scanning result.
The correction process specifically comprises performing mask region selection, image block matching, picture stitching, text detection, text recognition, text correction, text de-duplication, and the like again on the global picture formed by stitching the local pictures in each iteration, and then obtaining the final image scanning result.
Specifically, text detection is carried out on the global picture spliced in real time, and a text detection box of the global picture is obtained; it should be noted that when the local pictures are spliced in the scanning process, the local pictures need to be screened, so that a detection frame must exist in the global picture.
Then, determining a mask region of the global picture according to the detection frame; calculating the matching distance between the mask area of the global picture and each newly spliced image block for the newly spliced image blocks meeting the requirements in the global picture, and performing picture splicing and updating of the mask area according to the matching distance until all the newly spliced image blocks are traversed to obtain the final global picture;
text detection and then text recognition are performed on the final global picture; the text recognition result of the stitched picture formed from all local pictures is verified against this recognition, and is corrected and de-duplicated according to the verification result to obtain the final image scanning result.
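One simple way to de-duplicate overlapping recognition results is to merge neighbouring text segments on their longest overlap. The patent does not fix a specific de-duplication algorithm, so the following is only an illustrative rule:

```python
def merge_with_overlap(left, right):
    # Merge two recognized text segments by the longest suffix of `left`
    # that is also a prefix of `right`, removing the duplicated overlap.
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return left + right
```

Applied left-to-right over the per-segment recognition results, this removes text that was recognized twice because adjacent stitched regions overlapped.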
It should be noted that the requirement on real-time performance is not high in the verification process.
In the embodiment, the global spliced picture is detected, spliced and identified again to check the text identification result of the spliced picture formed by the local pictures, so that the real-time performance requirement can be met, and the identification precision can reach the best performance.
As shown in fig. 6, a schematic view of a complete flow of the image scanning and identifying method in this embodiment mainly includes the following steps:
step 1: acquiring a local picture sequence;
and 2, step: based on the local picture sequence, the image overall real-time splicing, the overall image text detection and the text recognition are carried out, and the method specifically comprises the following steps:
step 21, obtaining a mask area of the current frame;
step 22, based on the mask area, performing image block matching on the local picture of the next frame;
step 23, stitching pictures based on the matching result; specifically, when a region to be stitched exists to the right of the region where the best-matched image block in the next frame local picture is located, that region is appended to the end of the whole image corresponding to the current frame's stitched picture; if no such region exists, the next iteration proceeds until the scanning pen stops scanning;
step 24, detecting long texts of the spliced pictures;
step 25, identifying long texts of the spliced pictures;
and step 3: and (3) verifying the text recognition result of the spliced picture obtained in the step (2), and specifically comprising the following steps:
step 31, performing text detection on the spliced picture acquired in real time;
step 32, judging whether the newly added stitched data meets the stitching condition, and if so, performing image stitching and text recognition;
and repeating the process of the step 32, and performing deduplication and correction on the text recognition result in the step 2 to obtain a final recognition result.
In summary, the beneficial effects achievable by the image scanning recognition method in this example include: (1) real-time performance can be achieved at low cost with low-power hardware while meeting precision requirements; (2) an ultra-lightweight text detection classifier and a lightweight text recognizer are trained for the usage scene, with the text detection model only 1.4 MB and the text recognition model only 1.6 MB; (3) by combining global text recognition with real-time text recognition, the real-time requirement can be met while recognition precision reaches the current best performance.
The following describes the image scanning recognition device provided by the present invention, and the image scanning recognition device described below and the image scanning recognition method described above can be referred to correspondingly.
As shown in fig. 7, the present embodiment provides an image scanning identification apparatus, which includes an obtaining module 701, a splicing module 702, a detection and identification module 703, an iteration module 704, and an output module 705, where:
an obtaining module 701, configured to collect a current frame local picture scanned by a scanning pen, and obtain a mask region of the current frame local picture;
a splicing module 702, configured to splice the next frame local picture after performing image block matching according to the mask region of the current frame local picture, to obtain a spliced picture corresponding to the next frame local picture, and to update the mask region matching the next frame local picture;
the detection and identification module 703 is configured to perform text detection on the spliced picture when the spliced picture meets a preset detection condition, and perform text identification on the spliced picture when a text detection result of the spliced picture meets a preset identification condition;
an iteration module 704, configured to use the next frame local picture as a new current frame local picture, and continue to perform the steps of image block matching, picture splicing, mask area updating, text detection, and text identification until the scanning pen stops scanning;
the output module 705 is configured to obtain an identification result of image scanning of the scanning pen according to a text identification result of the stitched image obtained in each iteration process.
On the one hand, this embodiment obtains the mask region of the current frame local picture in real time. Regardless of whether the spliced picture grows in size, each iteration only needs to template-match the mask region updated from the previous frame against image blocks in the current frame local picture, so image matching and picture splicing can be performed quickly and accurately, and the image scanning result is output accurately in real time. This effectively reduces the computation required for picture scanning and the computing-performance demands on the embedded device, so high real-time performance and accuracy can be achieved on low-configuration embedded equipment. On the other hand, text detection is performed on the spliced picture only when the preset detection condition is met, and text recognition only when the preset recognition condition is met, which avoids the redundant computation of recognizing every frame of local picture and further reduces the computation during scanning, while the scanning recognition result can still be output accurately in real time once scanning ends. In addition, to guard against errors in recognition-while-scanning, the whole spliced picture can be detected and recognized to verify the scanning recognition result.
On the basis of the foregoing embodiment, the obtaining module in this embodiment is specifically configured to: under the condition that the current frame is a starting frame, performing text detection on the local picture of the current frame; under the condition that the text detection result of the current frame local picture comprises a text detection box, acquiring a mask region of the current frame local picture according to the text detection box of the current frame local picture; and under the condition that the current frame is an intermediate frame, acquiring a mask region of the current frame local picture according to a matching region of the current frame local picture and a previous frame local picture.
On the basis of the foregoing embodiment, the obtaining module in this embodiment is specifically configured to: determining whether the position of a target image block is located at the right boundary of the current frame local picture or not under the condition that the current frame is an intermediate frame and a mask area exists in the previous frame local picture; the target image block is an image block which is in the current frame local picture and is most matched with a mask area of a previous frame local picture; taking the mask area of the previous frame local picture as the mask area of the current frame local picture under the condition that the position of the target image block is positioned at the right boundary of the current frame local picture; and under the condition that the position of the target image block is not positioned at the right boundary of the current frame local picture, updating the mask area of the previous frame local picture, and taking the updated mask area as the mask area of the current frame local picture.
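The mask-region decision just described can be expressed as a small helper. This is a sketch under assumptions: the mask is a simple (x, y, w, h) rectangle, and the update rule "slide the mask to the rightmost strip of the frame" is one plausible reading of the patent's "updating the mask area", not its verbatim rule.

```python
def next_mask_region(prev_mask, match_x, frame_width):
    """Choose the mask region for the current frame (hypothetical helper).

    prev_mask   -- (x, y, w, h) mask rectangle of the previous frame
    match_x     -- x position of the block in the current frame that best
                   matches that mask
    frame_width -- width of the current frame in pixels
    """
    x, y, w, h = prev_mask
    if match_x + w >= frame_width:      # best match touches the right boundary:
        return prev_mask                # the pen has not advanced, reuse the mask
    return (frame_width - w, y, w, h)   # otherwise slide the mask to the right edge
```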
On the basis of the foregoing embodiments, the splicing module in this embodiment is specifically configured to: performing image block matching on the next frame of local picture according to the mask area of the current frame of local picture, and acquiring an image block which is most matched with the mask area of the current frame of local picture in the next frame of local picture; under the condition that the most matched image block in the next frame of local picture is positioned at the right boundary of the next frame of local picture, picture splicing is not carried out; under the condition that the most matched image block in the next frame of local picture is not positioned at the right boundary of the next frame of local picture, acquiring a region to be spliced in the next frame of local picture according to the most matched image block; and splicing the to-be-spliced area with the spliced picture corresponding to the current frame local picture to obtain the spliced picture corresponding to the next frame local picture.
On the basis of the foregoing embodiments, the detection and identification module in this embodiment is specifically configured to: inputting the spliced picture into a text detection model to obtain a text detection result of the spliced picture; the text detection model is obtained based on a sample picture and a text detection result of the sample picture through training; the text detection model is constructed and generated based on a lightweight neural network and comprises a backbone network and a head network; the backbone network is used for extracting features of different scales of the spliced picture to obtain a plurality of first feature maps of different scales of the spliced picture; and the head network is used for fusing and learning the first feature maps with different scales to obtain a text detection result of the spliced picture.
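The head network's fusion step, which takes first feature maps at several scales and combines them, might look like the following NumPy sketch. The real model fuses with learned convolutions, so the nearest-neighbour upsample-and-concatenate shown here is only the structural idea; square feature maps whose sizes divide the finest size are assumed.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_feature_maps(fmaps):
    """Fuse multi-scale feature maps into one tensor by upsampling every map
    to the finest resolution and concatenating along the channel axis.
    A sketch of the fusion structure only; a trained head would follow this
    with learned layers to produce the text detection result."""
    target_h = max(f.shape[1] for f in fmaps)
    fused = [upsample_nearest(f, target_h // f.shape[1]) for f in fmaps]
    return np.concatenate(fused, axis=0)
```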
On the basis of the above embodiment, the present embodiment further includes a training module, specifically configured to: performing iterative training on the text detection model based on the sample picture and the text detection result of the sample picture, and pruning the text detection model based on a model pruning algorithm or a model compression algorithm in the training process until a preset training termination condition is met; the learning rate adopted by the text detection model in the training process comprises a cosine learning rate mechanism or a preheating learning rate mechanism.
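The two learning-rate mechanisms mentioned, warm-up and cosine decay, are commonly combined into a single schedule. A sketch with illustrative constants (the patent gives no hyperparameters):

```python
import math

def learning_rate(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Warm-up then cosine-decay learning-rate schedule.

    Linearly ramps the rate up to base_lr over warmup_steps, then decays it
    to zero along a half cosine over the remaining steps. All constants here
    are illustrative, not taken from the patent."""
    if step < warmup_steps:                       # linear warm-up phase
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```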
On the basis of the foregoing embodiments, the detection and identification module in this embodiment is further specifically configured to: inputting the spliced picture into a text recognition model to obtain a text recognition result of the spliced picture; the text recognition model is trained and acquired based on a sample picture and a text recognition result of the sample picture; the text recognition model is constructed and generated based on a convolutional neural network, a recurrent neural network and a classification network; the convolutional neural network is used for extracting the characteristics of the spliced picture to obtain a second characteristic diagram of the spliced picture; the recurrent neural network is used for learning the second feature map to obtain the category probability distribution of the spliced picture; and the classification network is used for converting the class probability distribution to obtain a text recognition result of the spliced picture.
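The final conversion from the recurrent network's per-step class probability distribution to a text string is typically done with CTC-style greedy decoding in CRNN pipelines: take the arg-max class at each time step, collapse consecutive repeats, and drop blanks. The patent does not name CTC, so this is an assumption; the blank index and charset layout below are illustrative.

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Greedy CTC decoding of a (T, num_classes) probability matrix.

    Assumes class 0 is the blank and classes 1..N map to charset[0..N-1].
    Collapses consecutive repeated classes, then removes blanks."""
    best = probs.argmax(axis=1)          # most likely class per time step
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)
```

For example, the per-step arg-max sequence [a, a, blank, a, b, b, blank, blank, c] decodes to "aabc": the repeated a's and b's collapse, but the blank between the first two a's keeps them as separate characters.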
On the basis of the foregoing embodiments, the output module in this embodiment is specifically configured to: verifying the text recognition result of the spliced picture obtained in each iteration process; and acquiring a final recognition result of the image scanning of the scanning pen according to the verification result.
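The deduplication of overlapping per-iteration results (step 32's "deduplication and correction") could, for example, merge consecutive recognition strings on their longest overlap. The patent does not spell out its exact correction rule, so this helper is only a sketch:

```python
def merge_overlapping(prev_text, new_text):
    """Merge two recognition results whose scanned spans overlap by finding
    the longest suffix of prev_text that is a prefix of new_text, so the
    overlapping characters are emitted only once (hypothetical helper)."""
    for k in range(min(len(prev_text), len(new_text)), 0, -1):
        if prev_text.endswith(new_text[:k]):
            return prev_text + new_text[k:]
    return prev_text + new_text  # no overlap: simple concatenation
```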
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the image scanning and recognizing apparatus described above may refer to the corresponding process in the foregoing embodiment of the image scanning and recognizing method, and will not be described herein again.
Fig. 8 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 8: a processor (processor)801, a communication Interface (Communications Interface)802, a memory (memory)803 and a communication bus 804, wherein the processor 801, the communication Interface 802 and the memory 803 complete communication with each other through the communication bus 804. The processor 801 may invoke logic instructions in the memory 803 to perform an image scan recognition method comprising: acquiring a current frame local picture scanned by a scanning pen, and acquiring a mask area of the current frame local picture; according to the mask area of the current frame local picture, splicing the next frame local picture after image block matching to obtain a spliced picture corresponding to the next frame local picture, and updating the mask area matched with the next frame local picture; performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition; taking the next frame of local picture as a new current frame of local picture, and continuing to execute the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning; and acquiring an image scanning recognition result of the scanning pen according to the text recognition result of the spliced image obtained in each iteration process.
In addition, the logic instructions in the memory 803 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, a computer is capable of executing the image scan identification method provided by the above methods, the method including: acquiring a current frame local picture scanned by a scanning pen, and acquiring a mask area of the current frame local picture; according to the mask area of the current frame local picture, splicing the next frame local picture after image block matching to obtain a spliced picture corresponding to the next frame local picture, and updating the mask area matched with the next frame local picture; performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition; taking the next frame of local picture as a new current frame of local picture, and continuing to execute the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning; and acquiring an image scanning recognition result of the scanning pen according to the text recognition result of the spliced image obtained in each iteration process.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to execute the image scan recognition method provided by the above methods, the method comprising: acquiring a current frame local picture scanned by a scanning pen, and acquiring a mask area of the current frame local picture; according to the mask area of the current frame local picture, splicing the next frame local picture after image block matching to obtain a spliced picture corresponding to the next frame local picture, and updating the mask area matched with the next frame local picture; performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition; taking the next frame of local picture as a new current frame of local picture, and continuing to execute the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning; and acquiring an identification result of the image scanning of the scanning pen according to the text identification result of the spliced picture obtained in each iteration process.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An image scanning identification method is characterized by comprising the following steps:
acquiring a current frame local picture scanned by a scanning pen, and acquiring a mask area of the current frame local picture;
according to the mask area of the current frame local picture, splicing the next frame local picture after matching the image blocks to obtain a spliced picture corresponding to the next frame local picture, and updating the mask area matched with the next frame local picture;
performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition;
taking the next frame of local picture as a new current frame of local picture, and continuing to execute the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning;
and acquiring an image scanning recognition result of the scanning pen according to the text recognition result of the spliced image obtained in each iteration process.
2. The method according to claim 1, wherein the obtaining the mask region of the current frame local picture comprises:
under the condition that the current frame is a starting frame, performing text detection on the local picture of the current frame;
under the condition that the text detection result of the current frame local picture comprises a text detection box, acquiring a mask region of the current frame local picture according to the text detection box of the current frame local picture;
and under the condition that the current frame is an intermediate frame, acquiring a mask region of the current frame local picture according to a matching region of the current frame local picture and a previous frame local picture.
3. The method according to claim 1, wherein, in a case that the current frame is an intermediate frame, obtaining a mask region of the current frame local picture according to a matching region between the current frame local picture and a previous frame local picture comprises:
determining whether the position of a target image block is located at the right boundary of the current frame local picture or not under the condition that the current frame is an intermediate frame and a mask area exists in the previous frame local picture; the target image block is an image block which is most matched with a mask area of a previous frame of local picture in the current frame of local picture;
taking the mask area of the previous frame local picture as the mask area of the current frame local picture under the condition that the position of the target image block is positioned at the right boundary of the current frame local picture;
and under the condition that the position of the target image block is not positioned at the right boundary of the current frame local picture, updating the mask area of the previous frame local picture, and taking the updated mask area as the mask area of the current frame local picture.
4. The image scanning identification method according to any one of claims 1 to 3, wherein the obtaining the spliced picture corresponding to the next frame of local picture by splicing the next frame of local picture after performing image block matching according to the mask region of the current frame of local picture comprises:
performing image block matching on the next frame of local picture according to the mask area of the current frame of local picture, and acquiring an image block which is most matched with the mask area of the current frame of local picture in the next frame of local picture;
under the condition that the most matched image block in the next frame of local picture is positioned at the right boundary of the next frame of local picture, picture splicing is not carried out;
under the condition that the most matched image block in the next frame of local picture is not positioned at the right boundary of the next frame of local picture, acquiring a region to be spliced in the next frame of local picture according to the most matched image block;
and splicing the to-be-spliced area with the spliced picture corresponding to the current frame local picture to obtain the spliced picture corresponding to the next frame local picture.
5. The image scanning identification method according to any one of claims 1-3, wherein the text detection of the stitched image comprises:
inputting the spliced picture into a text detection model to obtain a text detection result of the spliced picture;
the text detection model is obtained based on a sample picture and a text detection result of the sample picture through training;
the text detection model is constructed and generated based on a lightweight neural network and comprises a backbone network and a head network;
the backbone network is used for extracting features of different scales of the spliced picture to obtain a plurality of first feature maps of different scales of the spliced picture;
and the head network is used for fusing and learning the first feature maps with different scales to obtain a text detection result of the spliced picture.
6. The image scanning and identifying method of claim 5, wherein the text detection model is obtained by training based on the following steps:
performing iterative training on the text detection model based on the sample picture and the text detection result of the sample picture, and pruning the text detection model based on a model pruning algorithm or a model compression algorithm in the training process until a preset training termination condition is met;
the learning rate adopted by the text detection model in the training process comprises a cosine learning rate mechanism or a preheating learning rate mechanism.
7. The image scanning identification method according to any one of claims 1-3, wherein the text identification of the stitched image comprises:
inputting the spliced picture into a text recognition model to obtain a text recognition result of the spliced picture;
the text recognition model is trained and acquired based on a sample picture and a text recognition result of the sample picture;
the text recognition model is constructed and generated based on a convolutional neural network, a cyclic neural network and a classification network;
the convolutional neural network is used for extracting the characteristics of the spliced picture to obtain a second characteristic diagram of the spliced picture;
the recurrent neural network is used for learning the second feature map to obtain the category probability distribution of the spliced picture;
and the classification network is used for converting the class probability distribution to obtain a text recognition result of the spliced picture.
8. The image scanning identification method according to any one of claims 1 to 3, wherein the obtaining of the identification result of the image scanning of the scanning pen according to the text identification result of the stitched image obtained in each iteration process comprises:
verifying a text recognition result of the spliced picture obtained in each iteration process;
and acquiring a final recognition result of the image scanning of the scanning pen according to the verification result.
9. An image scanning recognition device, comprising:
the acquisition module is used for acquiring a current frame local picture scanned by a scanning pen and acquiring a mask area of the current frame local picture;
the splicing module is used for splicing the next frame of local picture after image block matching according to the mask area of the current frame of local picture to obtain a spliced picture corresponding to the next frame of local picture, and updating the mask area matched with the next frame of local picture;
the detection and identification module is used for performing text detection on the spliced picture under the condition that the spliced picture meets a preset detection condition, and performing text identification on the spliced picture under the condition that a text detection result of the spliced picture meets a preset identification condition;
the iteration module is used for taking the next frame of local picture as a new current frame of local picture, and continuously executing the steps of image block matching, picture splicing, mask area updating, text detection and text identification until the scanning pen stops scanning;
and the output module is used for acquiring the identification result of the image scanning of the scanning pen according to the text identification result of the spliced picture obtained in each iteration process.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image scanning identification method according to any one of claims 1 to 8 when executing the program.
CN202210353468.3A 2022-04-06 2022-04-06 Image scanning identification method and device and electronic equipment Active CN114429636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210353468.3A CN114429636B (en) 2022-04-06 2022-04-06 Image scanning identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210353468.3A CN114429636B (en) 2022-04-06 2022-04-06 Image scanning identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114429636A true CN114429636A (en) 2022-05-03
CN114429636B CN114429636B (en) 2022-07-12

Family

ID=81314378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210353468.3A Active CN114429636B (en) 2022-04-06 2022-04-06 Image scanning identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114429636B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973294A (en) * 2022-07-28 2022-08-30 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN117711001A (en) * 2024-02-04 2024-03-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827264A (en) * 2009-03-06 2010-09-08 刘永 Hierarchical self-adaptive video frame sampling method
CN102222222A (en) * 2011-05-27 2011-10-19 汉王科技股份有限公司 Frame-skipping scanning and recognizing device and method
US20180205867A1 (en) * 2015-06-16 2018-07-19 Hitachi Kokusai Electric Inc. Imaging device and image processing method
CN109035145A (en) * 2018-08-02 2018-12-18 广州市鑫广飞信息科技有限公司 Video frequency image self adaption joining method and device based on video frame match information
CN111950463A (en) * 2020-08-13 2020-11-17 安徽淘云科技有限公司 Scanning method, scanning device, scanning pen and storage medium
CN112383671A (en) * 2020-11-03 2021-02-19 安徽淘云科技有限公司 Scanning method, scanning device, scanning pen and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973294A (en) * 2022-07-28 2022-08-30 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN117711001A (en) * 2024-02-04 2024-03-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and medium
CN117711001B (en) * 2024-02-04 2024-05-07 腾讯科技(深圳)有限公司 Image processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN114429636B (en) 2022-07-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant