CN110533035B

CN110533035B - Student homework page number identification method based on text matching

Info

Publication number: CN110533035B
Application number: CN201910800669.1A
Authority: CN
Inventors: 张东祥; 聂雨杨; 郭馨茹; 陈李江
Original assignee: Hainan Avanti Technology Co ltd
Current assignee: Hainan Avanti Technology Co ltd
Priority date: 2019-08-28
Filing date: 2019-08-28
Publication date: 2022-02-15
Anticipated expiration: 2039-08-28
Also published as: CN110533035A

Abstract

The invention belongs to the technical field of image matching and in particular relates to a student homework page number identification method based on text matching. It aims to solve the problems that, in the prior art, page number identification accuracy falls short of expectations and extensibility is poor. The method comprises the following steps: identifying and segmenting the character connected regions of a page to be identified to obtain a plurality of text line pictures; converting each text line picture into text through a text sequence recognition model; extracting the N-gram features of the text and querying them in the inverted list of features built for the database; and taking the database picture that shares the most features with the text, provided that the number of shared features is higher than a set threshold, as the matching picture. The invention uses a text sequence recognition model constructed based on a deep learning neural network and matches pictures from the database by N-gram feature query, which greatly improves the accuracy of page matching and page number identification and also improves the recognition accuracy for irregular characters such as mathematical formulas.

Description

Student homework page number identification method based on text matching

Technical Field

The invention belongs to the technical field of image matching, and particularly relates to a student homework page number identification method based on text matching.

Background

The key to page number identification is recognizing the printed page number digits. Digit recognition methods based on template matching have the following main problems: the amount of calculation is large; a digit cannot be recognized if its font differs greatly from the template; the dependence on the template is strong; robustness is weak; and the method is sensitive to image noise and displacement. Methods based on feature analysis recognize a digit by extracting representative features from the digit image. The digit features used in current research mainly include focusing features, closed and semi-closed features, horizontal and vertical line features, projection features, partitioned-region features and the like. However, these features are not robust enough and are affected to varying degrees by factors such as digit font and slant, which directly affects the accuracy of digit recognition in practical applications. Methods that identify the page number by matching page information generate a picture recognition template from a preset anchor region and a text region corresponding to the anchor region; the anchor region and the text region have a positional correspondence, and the text region contains text information that defines the meaning of the text. The advantage is that the text in the picture can be recognized more quickly according to the preset information, but the template must be generated in advance, so extensibility is poor.

In general, the prior art does not achieve the expected page number identification accuracy, and its extensibility is poor.

Disclosure of Invention

In order to solve the above problems in the prior art, namely that page number identification accuracy falls short of expectations and extensibility is poor, the invention provides a student homework page number identification method based on text matching, which comprises the following steps:

step S10, acquiring an image of a paper text page as a page to be identified;

step S20, identifying each character connected domain in the page to be identified, dividing character contents according to the identified character connected domain, and obtaining a plurality of text line pictures;

step S30, converting the plurality of text line pictures into corresponding texts respectively based on the text sequence recognition model, and obtaining texts corresponding to the pages to be recognized; the text sequence recognition model is constructed based on a deep learning neural network and comprises an image understanding model and a sequence generation model;

step S40, extracting bi-gram and tri-gram characteristics of the text corresponding to the page to be recognized to obtain a characteristic set;

step S50, respectively inquiring each feature in the feature set in the inverted list to obtain the picture with the most common features; the inverted list is constructed according to bi-gram and tri-gram characteristics corresponding to the database pictures;

step S60, judging whether the number of the common features contained in the picture with the most common features is larger than a set threshold value, if so, the picture is a matching picture of the page to be identified, and the page number category corresponding to the picture is the page number category of the page to be identified; otherwise, the database does not contain the matched picture of the page to be identified.

In some preferred embodiments, the image understanding model is constructed based on a dense convolutional neural network, and has the structure:

an input layer, a convolutional layer with a set convolution kernel, a max pooling layer with a set kernel, a first dense module, a first transition module, a second dense module, a second transition module, a third dense module, a third transition module, a fourth dense module, a fourth transition module and an output layer, connected in sequence.

In some preferred embodiments, the first dense module, the second dense module, the third dense module, and the fourth dense module each include a set number of dense layers.

In some preferred embodiments, the dense layer has a structure of:

a layer normalization operation layer, a convolution operation layer with a set convolution kernel, and a bottleneck layer with a set kernel, connected in sequence.

In some preferred embodiments, the first transition module, the second transition module, the third transition module and the fourth transition module respectively comprise a set number of transition layers.

In some preferred embodiments, the transition layer has a structure of:

a convolution operation layer with a set convolution kernel and an average pooling layer with a set kernel.

In another aspect, the invention provides a student homework page number identification system based on text matching, which comprises an input module, a text line dividing module, a text sequence recognition module, a feature extraction module, a feature matching module, a page matching module and an output module;

the input module is configured to acquire an image of a paper text page as a page to be identified;

the text line dividing module is configured to identify each text connected domain in the page to be identified, and divide text contents according to the identified text connected domains to obtain a plurality of text line pictures;

the text sequence identification module is configured to convert the plurality of text line pictures into corresponding texts respectively based on a text sequence identification model, and obtain texts corresponding to the pages to be identified;

the feature extraction module is configured to extract bi-gram and tri-gram features of the text corresponding to the page to be recognized to obtain a feature set;

the feature matching module is configured to query each feature in the feature set in an inverted list respectively and obtain a picture with the most common features;

the page matching module is configured to judge whether the number of common features contained in the picture with the most common features is greater than a set threshold, if so, the picture is a matching picture of the page to be identified, and the page number category corresponding to the picture is the page number category of the page to be identified; otherwise, the database does not contain the matched picture of the page to be identified;

the output module is configured to output a page matching result.

In some preferred embodiments, the text sequence recognition module comprises an image understanding module and a sequence generation module;

the image understanding module is configured to understand the content of the input text line pictures and obtain deep representations of the text line pictures;

the sequence generation module is configured to generate the corresponding text sequence from the deep representation of the text line picture by combining a long short-term memory (LSTM) network and an attention mechanism.

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned student homework page number identification method based on text matching.

In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor and a storage device; the processor is adapted to execute various programs; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above-described student homework page number identification method based on text matching.

The invention has the beneficial effects that:

The student homework page number identification method based on text matching of the invention uses a text sequence recognition model constructed based on a deep learning neural network and matches pictures from a database by N-gram feature query. With a suitable threshold set, a database picture is regarded as matching the page to be identified only if the number of features it shares with that page is greater than the threshold. This improves the page matching accuracy to more than 90 percent; the page number identification accuracy can even exceed 98 percent, and the recognition accuracy for irregular characters such as mathematical formulas can exceed 70 percent.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a flow chart of the student homework page number identification method based on text matching according to the invention.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

The invention discloses a student homework page number identification method based on text matching, which comprises the following steps:

step S10, acquiring an image of a paper text page as a page to be identified;

In order to describe the student homework page number identification method based on text matching more clearly, the steps of the method embodiment of the invention are described in detail below with reference to FIG. 1.

The student homework page number identification method based on text matching comprises steps S10 to S60, which are described in detail as follows:

Step S10, acquiring an image of a paper text page as the page to be identified.

Step S20, identifying each character connected domain in the page to be identified, and dividing the character content according to the identified character connected domains to obtain a plurality of text line pictures.

In one embodiment of the invention, the text line pictures are obtained by the method of patent application No. CN201711262766.7, "A test paper analysis method and a computing device", and the details are not repeated here.
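The segmentation method itself is defined in the referenced application and is not reproduced here. Purely to illustrate what dividing text content according to connected domains can look like, the following minimal sketch groups connected-component bounding boxes into text lines by vertical overlap. The box format `(x, y, width, height)`, the name `group_into_lines`, and the overlap rule with its `min_overlap` parameter are all assumptions made for this illustration; they are not the referenced method.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical format


def group_into_lines(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Group connected-component boxes into text lines by vertical overlap."""
    lines: List[List[Box]] = []
    spans: List[List[int]] = []  # [top, bottom] extent of each line found so far
    for box in sorted(boxes, key=lambda b: b[1]):  # scan from top to bottom
        x, y, w, h = box
        top, bottom = y, y + h
        placed = False
        for i, (line_top, line_bottom) in enumerate(spans):
            overlap = min(bottom, line_bottom) - max(top, line_top)
            # Join the line if the overlap covers enough of the shorter extent
            if overlap >= min_overlap * min(h, line_bottom - line_top):
                lines[i].append(box)
                spans[i] = [min(top, line_top), max(bottom, line_bottom)]
                placed = True
                break
        if not placed:
            lines.append([box])
            spans.append([top, bottom])
    # Order the boxes of each line from left to right
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```

Each returned group could then be cropped from the page image to form one text line picture.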

Step S30, converting the plurality of text line pictures into corresponding texts respectively based on the text sequence recognition model, and obtaining texts corresponding to the pages to be recognized; the text sequence recognition model is constructed based on a deep learning neural network and comprises an image understanding model and a sequence generation model.

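The image understanding model is constructed based on a dense convolutional neural network and has the following structure: an input layer, a convolutional layer with a set convolution kernel, a max pooling layer with a set kernel, a first dense module, a first transition module, a second dense module, a second transition module, a third dense module, a third transition module, a fourth dense module, a fourth transition module and an output layer, connected in sequence.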

In one embodiment of the invention, the model input is a grayscale image; if the page to be identified is in another format, it can first be converted to grayscale as a preprocessing step. The convolution kernel of the convolutional layer is 7 × 7 with a stride of 2, and the kernel of the max pooling layer is 3 × 3 with a stride of 2.

The first dense module, the second dense module, the third dense module and the fourth dense module respectively comprise a set number of dense layers.

In one embodiment of the invention, the first dense module comprises 6 dense layers, the second dense module comprises 12 dense layers, the third dense module comprises 64 dense layers, and the fourth dense module comprises 48 dense layers.

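The dense layer has the following structure: a layer normalization operation layer, a convolution operation layer with a set convolution kernel, and a bottleneck layer with a set kernel, connected in sequence.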

In one embodiment of the present invention, the convolution kernel of the convolution operation layer is 3 × 3 and the kernel of the bottleneck layer is 1 × 1.

The first transition module, the second transition module, the third transition module and the fourth transition module respectively comprise a set number of transition layers.

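The transition layer has the following structure: a convolution operation layer with a set convolution kernel and an average pooling layer with a set kernel.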

In one embodiment of the invention, the convolution kernel of the convolution operation layer is 1 × 1 and the kernel of the average pooling layer is 2 × 2.
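To make the effect of these kernel and stride settings concrete, the following minimal sketch traces the spatial size of a feature map through the stem and the four transition modules. Several values are assumptions not stated in this embodiment: padding 3 for the 7 × 7 convolution, padding 1 for the 3 × 3 max pooling, stride 2 for the 2 × 2 average pooling, one transition layer per transition module, size-preserving dense modules, and the 64 × 512 input size.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_size(height: int, width: int) -> tuple:
    """Trace the spatial size through the stem and the four transition modules."""
    # Stem: 7x7 convolution, stride 2 (padding 3 is an assumption)
    height, width = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    # Stem: 3x3 max pooling, stride 2 (padding 1 is an assumption)
    height, width = conv_out(height, 3, 2, 1), conv_out(width, 3, 2, 1)
    # Dense modules are assumed to keep the spatial size; each of the four
    # transition modules is assumed to apply one 2x2 average pool with stride 2
    for _ in range(4):
        height, width = conv_out(height, 2, 2, 0), conv_out(width, 2, 2, 0)
    return height, width


DENSE_LAYERS = (6, 12, 64, 48)  # dense layers per module in this embodiment

print(feature_map_size(64, 512))  # (1, 8)
print(sum(DENSE_LAYERS))          # 130
```

Under these assumptions the network downsamples by a factor of 64 in each direction, so a text line picture of height 64 collapses to a single row of feature vectors that can be read from left to right as a sequence.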

The sequence generation model is based on a long short-term memory (LSTM) network and an attention mechanism.

First, the deep representation obtained by the image understanding model is denoted v; each vector in v is input in sequence into a bidirectional LSTM network and encoded to obtain h;

then, an attention mechanism is applied to h to obtain a vector s that summarizes all the inputs;

finally, another LSTM network with an attention mechanism generates each output word in sequence based on h and s, giving the final text sequence recognition result.
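The step that condenses h into the summary vector s can be sketched as a softmax-weighted sum of the encoder states. The text does not specify the scoring function, so the dot product against a `query` vector used below is an assumption (in a trained model such a vector would be learned), and the names `softmax` and `attention_summary` are hypothetical. The LSTM encoder and decoder are omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_summary(h: List[Vector], query: Vector) -> Vector:
    """Summarize encoder states h as a weighted sum (dot-product scoring assumed)."""
    scores = [sum(a * b for a, b in zip(state, query)) for state in h]
    weights = softmax(scores)
    dim = len(h[0])
    return [sum(w * state[d] for w, state in zip(weights, h)) for d in range(dim)]


h = [[1.0, 0.0], [0.0, 1.0]]               # two toy encoder states
print(attention_summary(h, [0.0, 0.0]))    # [0.5, 0.5]
```

With a zero query every state receives the same weight, so s is the plain average; a query aligned with one state shifts the weight, and therefore s, toward that state.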

In one embodiment of the invention, the image understanding model is first pre-trained on the open-source ImageNet dataset and then trained on 400,000 labeled text sequence pictures; the sequence generation model is first pre-trained on a large Chinese corpus and text sequences and then trained on the same 400,000 labeled text sequence pictures used for the image understanding model.

Step S40, extracting bi-gram and tri-gram features of the text corresponding to the page to be identified to obtain a feature set.

In one embodiment of the invention, the 2-gram features of the text are extracted. Taking the text "我们爱中国" ("we love China") as an example, extracting its character-level 2-grams yields "我们" ("we"), "们爱", "爱中" and "中国" ("China").
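The extraction can be sketched as a sliding window over the characters of the recognized text. The sketch assumes character-level n-grams with no filtering of punctuation or whitespace, and it assumes the example sentence is the Chinese text 我们爱中国 ("we love China"); the function names `char_ngrams` and `feature_set` are hypothetical.

```python
from typing import List, Set


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def feature_set(text: str) -> Set[str]:
    """Union of the bi-gram and tri-gram features, as in step S40."""
    return set(char_ngrams(text, 2)) | set(char_ngrams(text, 3))


print(char_ngrams("我们爱中国", 2))  # ['我们', '们爱', '爱中', '中国']
```

A text shorter than n characters simply yields no n-grams of that length.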

Step S50, querying each feature in the feature set in the inverted list to obtain the picture with the most common features; the inverted list is constructed from the bi-gram and tri-gram features corresponding to the database pictures.

In one embodiment of the invention, the 2-gram features of the texts corresponding to the database pictures are extracted and an inverted list is built. Take the 2-gram features "we" (我们) and "China" (中国) as an example, and assume that "we" appears in the three database pictures "1.jpg", "2.jpg" and "10.jpg" and that "China" appears in the two database pictures "3.jpg" and "10.jpg". The key of the inverted list is a feature and the value is the list of files in which that feature appears, so the inverted list is: { "we": [ "1.jpg", "2.jpg", "10.jpg" ], "China": [ "3.jpg", "10.jpg" ] }.
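Building such an inverted list from per-picture feature sets can be sketched as follows. The function name `build_inverted_index` and the input format (a mapping from picture file name to its features) are assumptions for illustration; the feature strings are taken from the example as written.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_inverted_index(page_features: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each feature to the list of pictures whose text contains it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for picture, features in page_features.items():
        for feature in set(features):  # record each feature once per picture
            index[feature].append(picture)
    return dict(index)


pages = {
    "1.jpg": ["we"],
    "2.jpg": ["we"],
    "3.jpg": ["China"],
    "10.jpg": ["we", "China"],
}
index = build_inverted_index(pages)
print(index["we"])     # ['1.jpg', '2.jpg', '10.jpg']
print(index["China"])  # ['3.jpg', '10.jpg']
```

Because the index is built once for the database, a query only touches the pictures that share at least one feature with the page to be identified.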

During page matching, each feature of the page is looked up in the inverted list. If the inverted list contains the feature, its value is retrieved, that is, the pages that contain the feature, and these pages are added to a candidate set: if a page is not yet in the candidate set, its weight is set to 1; if it is already present, its weight is increased by 1. After all features have been traversed, the page with the largest weight in the candidate set is taken as the matching picture and returned as the result. A threshold is also set (in one embodiment of the invention, the threshold on the number of common features contained in the picture with the most common features is 10); if the weight of the page with the largest weight is lower than the threshold, no matching picture for the page exists in the database.

Suppose the page has two features, "mathematics" and "learning problems". Both are looked up in the inverted list. If the value of "mathematics" is [ "12.jpg", "16.jpg" ], the candidate set is { "12.jpg": 1, "16.jpg": 1 }. If the value of "learning problems" is [ "12.jpg", "15.jpg" ], the candidate set is updated to { "12.jpg": 2, "15.jpg": 1, "16.jpg": 1 }. The candidate with the largest weight, here "12.jpg", is then taken as the matching result.
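The weight accumulation in this example can be reproduced with a counter keyed by picture name. This is a minimal sketch; the function name `count_common_features` is hypothetical, and the features passed in are assumed to be unique, as they would be when taken from a feature set.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_common_features(features: Iterable[str],
                          index: Dict[str, List[str]]) -> Counter:
    """Tally, for each database picture, how many query features it shares."""
    candidates: Counter = Counter()
    for feature in features:
        for picture in index.get(feature, []):  # features absent from the index add nothing
            candidates[picture] += 1  # first hit gives weight 1, later hits add 1
    return candidates


index = {
    "mathematics": ["12.jpg", "16.jpg"],
    "learning problems": ["12.jpg", "15.jpg"],
}
candidates = count_common_features(["mathematics", "learning problems"], index)
print(dict(candidates))           # {'12.jpg': 2, '16.jpg': 1, '15.jpg': 1}
print(candidates.most_common(1))  # [('12.jpg', 2)]
```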

Step S60, judging whether the number of common features contained in the picture with the most common features is larger than a set threshold; if so, the picture is the matching picture of the page to be identified, and the page number category corresponding to the picture is the page number category of the page to be identified; otherwise, the database does not contain a matching picture of the page to be identified.
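The decision in step S60 can be sketched as a comparison of the best candidate's weight against the threshold, followed by a lookup of its page number category. The mapping `page_of` from picture name to page number category and the function name `decide_page` are hypothetical; the default threshold of 10 is the value given for one embodiment, and the comparison is strictly greater than, following the wording of step S60.

```python
from typing import Dict, Optional


def decide_page(candidates: Dict[str, int],
                page_of: Dict[str, str],
                threshold: int = 10) -> Optional[str]:
    """Return the page number category of the best candidate, or None if no match."""
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)  # picture with the most common features
    if candidates[best] > threshold:  # strictly greater, as worded in step S60
        return page_of[best]
    return None


page_of = {"12.jpg": "page 12", "15.jpg": "page 15"}
print(decide_page({"12.jpg": 12, "15.jpg": 3}, page_of))  # page 12
print(decide_page({"12.jpg": 2, "15.jpg": 1}, page_of))   # None
```

If two pictures tie for the largest weight, this sketch returns the one encountered first; the text does not say how ties should be resolved.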

The student homework page number identification system based on text matching comprises an input module, a text line dividing module, a text sequence recognition module, a feature extraction module, a feature matching module, a page matching module and an output module;

the output module is configured to output a page matching result.

The text sequence identification module comprises an image understanding module and a sequence generation module;

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that, the student homework page number identification system based on text matching provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the above embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

A storage device according to a third embodiment of the present invention stores therein a plurality of programs adapted to be loaded and executed by a processor to implement the above-described student homework page number recognition method based on text matching.

A processing apparatus according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute various programs; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above-described student homework page number identification method based on text matching.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. A student homework page number recognition method based on text matching is characterized by comprising the following steps:

step S10, acquiring an image of a paper text page as a page to be identified;

step S30, converting the plurality of text line pictures into corresponding texts respectively based on the text sequence recognition model, and obtaining texts corresponding to the pages to be recognized; the text sequence recognition model is constructed based on a deep learning neural network and comprises an image understanding model and a sequence generation model; the image understanding model is constructed based on a dense convolution neural network, and the structure of the image understanding model is as follows: the system comprises an input layer, a convolution layer for setting convolution kernels, a maximum pooling layer for setting kernels, a first dense module, a first transition module, a second dense module, a second transition module, a third dense module, a third transition module, a fourth dense module, a fourth transition module and an output layer which are connected in sequence;

2. The method for recognizing student homework pages based on text matching according to claim 1, wherein the first dense module, the second dense module, the third dense module and the fourth dense module respectively comprise a set number of dense layers.

3. The method for recognizing the number of pages for student's homework based on text matching as claimed in claim 2, wherein said dense layer has a structure of:

4. The method for recognizing the number of pages for student's homework based on text matching as claimed in claim 1, wherein the first transition module, the second transition module, the third transition module and the fourth transition module respectively comprise a set number of transition layers.

5. The method for recognizing the number of pages for student's homework based on text matching as claimed in claim 4, wherein said transition layer is structured as follows:

6. A student homework page number recognition system based on text matching is characterized by comprising an input module, a character line dividing module, a text sequence recognition module, a feature extraction module, a feature matching module, a page matching module and an output module;

the text sequence identification module is configured to convert the plurality of text line pictures into corresponding texts respectively based on a text sequence identification model, and obtain texts corresponding to the pages to be identified; the text sequence recognition model is constructed based on a deep learning neural network and comprises an image understanding model and a sequence generation model; the image understanding model is constructed based on a dense convolution neural network, and the structure of the image understanding model is as follows: the system comprises an input layer, a convolution layer for setting convolution kernels, a maximum pooling layer for setting kernels, a first dense module, a first transition module, a second dense module, a second transition module, a third dense module, a third transition module, a fourth dense module, a fourth transition module and an output layer which are connected in sequence;

the feature matching module is configured to query each feature in the feature set in an inverted list respectively and obtain a picture with the most common features; the inverted list is constructed according to bi-gram and tri-gram characteristics corresponding to the database pictures;

the output module is configured to output a page matching result.

7. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for student homework page number recognition based on text matching according to any one of claims 1 to 5.

8. A processing apparatus, comprising:

a processor adapted to execute various programs; and

a storage device adapted to store a plurality of programs;

wherein the programs are adapted to be loaded and executed by the processor to perform:

the student homework page number recognition method based on text matching of any one of claims 1 to 5.