CN117475453A - Document detection method and device based on OCR and electronic equipment - Google Patents


Info

Publication number
CN117475453A
CN117475453A (Application CN202311786315.9A)
Authority
CN
China
Prior art keywords
document
picture
template
feature
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311786315.9A
Other languages
Chinese (zh)
Other versions
CN117475453B (en)
Inventor
赵鹏
孟德旺
山姗
张天锋
李岩
王强
师国华
张茂杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xincheng Information Technology Co ltd
Original Assignee
Xincheng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xincheng Information Technology Co ltd filed Critical Xincheng Information Technology Co ltd
Priority to CN202311786315.9A priority Critical patent/CN117475453B/en
Publication of CN117475453A publication Critical patent/CN117475453A/en
Application granted granted Critical
Publication of CN117475453B publication Critical patent/CN117475453B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides an OCR-based document detection method and device, and an electronic device. The method comprises the following steps: performing OCR recognition on a document picture of a photographed target document to obtain first text recognition content of the document picture; determining the document template to which the document picture belongs according to the first text recognition content and a preset document template library; determining a reference word from the first text recognition content according to all proportional keywords of the document template to which the document picture belongs; calculating the length value of the reference word in the document picture and determining that length value as the threshold of the document picture; graying out all straight lines in the document picture whose length is at or above the threshold; performing OCR recognition on the grayed-out document picture to obtain second text recognition content of the document picture, and generating a first document detection result of the target document according to the second text recognition content. The method and device can improve the accuracy of OCR recognition of legal documents.

Description

Document detection method and device based on OCR and electronic equipment
Technical Field
The present disclosure relates to the field of data recognition technologies, and in particular, to a method and an apparatus for detecting documents based on OCR, and an electronic device.
Background
OCR (Optical Character Recognition) refers to a technique that determines character shapes by detecting patterns of dark and light and then translates those shapes into computer text through a character recognition method. With the development of informatization, office automation has become a major trend. In an office environment, large numbers of files and paper documents often need to be processed; OCR technology can convert paper documents into electronic files, which are convenient to store, retrieve and share.
At present, when OCR technology is used to recognize special texts such as legal documents, underlined text in the body is easily misrecognized, so recognition accuracy is low.
Disclosure of Invention
The embodiment of the application provides an OCR-based document detection method, an OCR-based document detection device and electronic equipment, so as to solve the problem of low recognition accuracy in the existing OCR technology.
In a first aspect, an embodiment of the present application provides an OCR-based document detection method, including:
OCR recognition is carried out on a document picture of a shot target document, so that first character recognition content of the document picture is obtained;
determining a document template to which the document picture belongs according to the first text recognition content and a preset document template library, wherein the preset document template library comprises at least one document template, each document template comprises at least one template matching keyword and at least one proportional keyword, and each proportional keyword contains at least two characters;
determining a reference word from the first text recognition content according to all proportional keywords of the document template to which the document picture belongs;
calculating the length value of the reference word in the document picture, and determining the length value as a threshold value of the document picture;
graying out all straight lines in the document picture whose length is at or above the threshold;
OCR recognition is carried out on the document picture subjected to the gray processing to obtain second character recognition content of the document picture, and a first document detection result of the target document is generated according to the second character recognition content.
In a second aspect, embodiments of the present application provide an OCR-based document detection device, including:
the recognition module is used for performing OCR (optical character recognition) on the document picture of the shot target document to obtain first character recognition content of the document picture;
the template determining module is used for determining a document template to which the document picture belongs according to the first text recognition content and a preset document template library, wherein the preset document template library comprises at least one document template, each document template comprises at least one template matching keyword and at least one proportional keyword, and each proportional keyword contains at least two characters;
the reference word determining module is used for determining a reference word from the first character recognition content according to all proportional keywords of the document template to which the document picture belongs;
The threshold value determining module is used for calculating the length value of the reference word in the document picture and determining the length value as the threshold value of the document picture;
the graying-out module is used for graying out all straight lines in the document picture whose length is at or above the threshold;
the first detection module is used for performing OCR (optical character recognition) on the document picture subjected to the gray processing to obtain second character recognition content of the document picture, and generating a first document detection result of the target document according to the second character recognition content.
In a third aspect, embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect, or of any possible implementation of the first aspect, when executing the computer program.
The embodiments of the present application provide an OCR-based document detection method and device, and an electronic device. When detecting a target document, OCR recognition is first performed on a document picture of the photographed target document to obtain first text recognition content of the document picture; a document template to which the document picture belongs is then determined according to the first text recognition content and a preset document template library; a reference word is determined from the first text recognition content according to all proportional keywords of that document template; and the length value of the reference word in the document picture is calculated and determined as the threshold of the document picture. All straight lines in the document picture whose length is at or above the threshold are then grayed out. Finally, OCR recognition is performed on the grayed-out document picture to obtain second text recognition content of the document picture, from which a first document detection result of the target document can be generated. In this way, the threshold of the document picture, that is, a threshold set individually for the photographed document picture, can be used to distinguish underlines from the other straight lines of the document picture, and the graying-out processing then, in effect, removes the interference of underlines with OCR character recognition; with that interference removed, the accuracy of OCR character recognition is greatly improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an implementation of an OCR-based document detection method provided in an embodiment of the present application;
FIG. 2 is a flowchart of another implementation of an OCR-based document detection method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an OCR-based document detection device according to an embodiment of the present application;
fig. 4 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the following description will be made with reference to the accompanying drawings by way of specific embodiments.
At present, OCR technology recognizes the content of ordinary text well and can accurately identify its characters, because ordinary text has a simple format: apart from font size, stroke thickness, and slant, it carries no formatting such as underlines. However, special texts such as legal documents usually contain many underlines, and these underlines are usually long and often touch the text above them. When existing OCR technology recognizes such texts, an underline is therefore easily taken as part of a character or as part of the text content; for example, a character sitting on an underline may be recognized as a different character that appears to carry an extra bottom stroke. This is why existing OCR technology has low recognition accuracy on special texts such as legal documents.
In order to solve the problems in the prior art, the embodiment of the application provides an OCR-based document detection method and device and electronic equipment. The method for detecting documents based on OCR provided by the embodiment of the application is first described below.
The execution subject of the OCR-based document detection method may be an OCR-based document detection apparatus, such as a high-speed camera, a document collection device, or any electronic device capable of executing the relevant processing of the OCR-based document detection method, which is not specifically limited in the embodiments of the present application.
Referring to fig. 1, a flowchart of an implementation of an OCR-based document detection method provided in an embodiment of the present application is shown, and described in detail below:
Step 110, performing OCR recognition on the document picture of the photographed target document to obtain the first text recognition content of the document picture.
In some embodiments, the target document may be a legal document such as a decision book or a notice sheet, and the device photographing the target document may be a photographing device such as a high-speed camera. Taking a document-input scene equipped with a high-speed camera as an example, a worker places the target document at the camera's document position and triggers the camera to photograph it; a device in data communication with the camera, such as an office computer, receives the document picture of the target document and recognizes it with OCR technology, thereby obtaining the text recognition content of the document picture, namely the first text recognition content.
And 120, determining a document template to which the document picture belongs according to the first character recognition content and a preset document template library.
In some embodiments, at least one document template is pre-stored in the preset document template library. A document template may be a blank, fixed-format document, such as a decision book or a notice book in which no case information or party information has been filled in. In addition, each document template includes at least one template matching keyword and at least one proportional keyword.
Specifically, a template matching keyword may be a word in the text content of the document template that can represent that template, preferably a word that uniquely identifies the template among the different templates, for example "investigation" or "retention". A proportional keyword is a word that exists in the text content of the document template, whose display scale is consistent with that of the text to be filled in above the underline, and which contains at least two characters. In general, the underlines in a document lie in its body text, and the characters above an underline have the same, or substantially the same, size as the body text, so a proportional keyword can be selected from the body text of the document.
An example of how a proportional keyword may be selected is given below. Taking a certain decision-book template as an example, assume that its body text contains a phrase immediately followed by an underlined blank to be filled in; the words of that phrase adjacent to the underline can then be selected as the proportional keyword of this decision-book template.
It is worth mentioning that the document templates in the preset document template library can be added and deleted according to the requirement.
Thus, after the first text recognition content of the document picture of the target document is obtained, the document template to which the document picture belongs can be determined according to the template matching keywords of each document template in the preset document template library.
In some embodiments, the document template to which the document picture belongs may be determined by searching for a template matching keyword in the first text recognition content. For example, each document template may be numbered in advance, and then template matching keywords of the corresponding document template may be searched for in the first text recognition content sequentially according to the numbers.
Specifically, if only one template matching keyword is found in the first text recognition content, the document template corresponding to that keyword can be directly determined as the document template to which the document picture belongs. If multiple template matching keywords are found, the document template that owns the most of those keywords is determined to be the document template to which the document picture belongs. For example, assume that 6 template matching keywords are found in the first text recognition content, of which 3 belong to template A, 2 belong to template B, and 1 belongs to template C; since template A hits the most of these 6 template matching keywords, the document template to which the document picture belongs is determined to be template A.
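A minimal Python sketch of this template-matching step follows. The template library here maps each template name to its template matching keywords; all names and keywords are illustrative assumptions, not values taken from the patent:

```python
def match_template(recognized_text, template_library):
    """Pick the document template whose template matching keywords
    appear most often in the OCR-recognized text (a sketch of step 120)."""
    best_template, best_hits = None, 0
    for name, keywords in template_library.items():
        hits = sum(1 for kw in keywords if kw in recognized_text)
        if hits > best_hits:
            best_template, best_hits = name, hits
    return best_template  # None when no keyword of any template is found

# Hypothetical library: template name -> its template matching keywords
library = {
    "A": ["investigation", "retention", "decision"],
    "B": ["notice", "deadline"],
    "C": ["summons"],
}
text = "decision on investigation and retention, with a notice of deadline"
# Template A hits 3 keywords, B hits 2, C hits 0, so A is selected
```

The patent leaves tie-breaking and the no-match case open; this sketch simply keeps the first best-scoring template and returns `None` when nothing matches.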
And 130, determining a reference word from the first character recognition content according to all proportional keywords of the document template to which the document picture belongs.
In some embodiments, a reference word is a word whose characters all lie on the same line in the document picture and whose display scale is consistent with that of the text to be filled in above the underline. Although many words in a document template fit these characteristics, the reference word of each template cannot be selected in advance, because the content to be filled in is uncertain and typesetting and printing vary between documents; for this reason, some words, namely the aforementioned proportional keywords, are preselected for each document template.
Therefore, after the document template to which the document picture belongs has been determined, words identical to any of that template's proportional keywords can be searched for in the first text recognition content, and the found words are marked as candidate words. Then, for each candidate word, the ordinate of the pixel position of its first character in the document picture and the ordinate of the pixel position of its last character are obtained; the first character is the initial character of the candidate word and the last character is its final character. The ordinate of a character's pixel position may be taken at the upper-left corner of the character, at the upper-right corner, or at some other representative position; what matters is that it is determined by a uniform standard, for example always the upper-right corner or always the upper-left corner. As for the coordinate directions, the abscissa in the document picture generally refers to the horizontal direction, that is, the left-to-right arrangement of characters within a line, and correspondingly the ordinate refers to the vertical direction, that is, the top-to-bottom arrangement of the different lines of the document.
In this way, coordinate data for each candidate word are obtained, and from those data it can be judged whether a target candidate word exists among the candidate words, that is, a candidate word whose characters all lie on the same line in the document picture. The judgment can be made as follows: a candidate word is a target candidate word when the absolute value of the difference between the ordinate of its first character's pixel position and the ordinate of its last character's pixel position is smaller than a preset difference.
Specifically, if a target candidate word exists among the candidate words, it may be determined as the reference word; if there are several target candidate words, one of them may be selected at random as the reference word. If no target candidate word exists among the candidate words, a preset number of consecutive characters adjacent to a candidate word, such as the characters immediately to its left or immediately to its right, may be determined as the reference word, the preset number being the number of characters of the candidate word.
It should be noted that if no target candidate word exists among the candidate words, all the candidate words are cross-line words, that is, words in which some characters lie on one line of the document picture and the remaining characters lie on another line. In general, the several characters adjacent to a cross-line word all lie on a single line, so the characters adjacent to a candidate word that crosses lines can serve as the reference word.
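The target-candidate-word test and selection described above can be sketched as follows. The per-character coordinates and the preset difference of 5 pixels are illustrative assumptions, not values from the patent:

```python
def is_single_row(char_boxes, max_y_diff=5):
    """A candidate word is a target candidate word when the vertical pixel
    coordinates of its first and last characters are (nearly) equal,
    i.e. all of its characters sit on the same text row."""
    y_first = char_boxes[0][1]   # (x, y) of the first character
    y_last = char_boxes[-1][1]   # (x, y) of the last character
    return abs(y_first - y_last) < max_y_diff

def pick_reference_word(candidates, max_y_diff=5):
    """Return the first candidate whose characters share one row; None
    signals that adjacent characters must be used instead (cross-line case)."""
    for word, boxes in candidates:
        if is_single_row(boxes, max_y_diff):
            return word
    return None

# Hypothetical OCR output: per-character top-left pixel coordinates
candidates = [
    ("word-A", [(120, 300), (160, 300), (200, 380)]),  # wraps to next line
    ("word-B", [(80, 500), (120, 501)]),               # single row
]
```

Here `pick_reference_word(candidates)` returns `"word-B"`; a `None` result corresponds to the cross-line fallback described above, where a run of adjacent characters is used as the reference word.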
And 140, calculating the length value of the reference word in the document picture, and determining the length value as a threshold value of the document picture.
Through the foregoing processing in steps 110-130, the reference word may be determined from the first text recognition content, and the purpose of determining the reference word is to determine a threshold value for a document picture taken by the target document, where the threshold value is a parameter related to a text length in the document picture.
In general, when the photographing distance differs, the display scale of the resulting pictures also differs. Furthermore, when the photographed objects differ in size, the display scale of the pictures may differ even at the same photographing distance. Take two paper documents with identical text content but different font sizes, for example one printed in size-three type and the other in size-five type: the characters in the two printed documents differ in size, so the characters in the two document pictures also differ in size even when both are shot with the same high-speed camera. In an actual scene, moreover, not only the shooting distance and the font may change; the page margins and the printer's print scale may also vary, for example with different paper specifications or different print-scale settings, and uncertain human factors add further variation, for example a staff member changing the font size of part of the text through an incorrect setting. All of this makes the character sizes in photographed document pictures extremely non-uniform.
Because the character sizes in document pictures shot under different conditions are extremely non-uniform, a parameter related to character length in the document picture cannot be set in advance; that is, the threshold of the document picture introduced above cannot be preset and can only be set individually on the basis of the photographed document picture.
A specific way of calculating the length value of the reference word in the document picture is given below. First, the abscissa of the pixel position of the first character of the reference word in the document picture and the abscissa of the pixel position of its last character are obtained. Then the absolute value of the difference between these two abscissas is calculated; this absolute value is the length value of the reference word in the document picture.
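In code, this length-value calculation is a one-liner. The coordinates below are illustrative assumptions; note that since both abscissas are taken at the same corner of each character, a production version would likely add the last character's width so the value spans the whole word:

```python
def word_length_px(char_boxes):
    """Length value of the reference word: the absolute difference between
    the x pixel coordinates of its first and last characters."""
    return abs(char_boxes[-1][0] - char_boxes[0][0])

# Hypothetical per-character (x, y) coordinates for a reference word;
# the result becomes the threshold of this document picture
threshold = word_length_px([(80, 500), (120, 501), (165, 500)])
# -> abs(165 - 80) = 85
```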
And 150, graying out all straight lines in the document picture whose length is at or above the threshold.
As described above, the underscores in the document are responsible for the poor recognition accuracy of the prior art OCR technology, and if the underscores are removed prior to OCR recognition, the recognition accuracy can be improved. Since the underline is a straight line, the interference of the underline can be removed by the processing of the straight line in the document picture.
When straight lines in a document picture are recognized, for example by means of an OpenCV image processing function, not only are underlines recognized as straight lines; long horizontal strokes within characters, such as the top stroke of "丁", are recognized as straight lines too, so many straight lines exist in the document picture. A length value could be used to single out the underlines; however, as described above, the character sizes in document pictures shot under different conditions are extremely non-uniform, and likewise the underline lengths differ between pictures, so underlines cannot be distinguished from the other straight lines simply by a fixed length value. For example, the horizontal stroke of "丁" in a picture with a large display scale may be longer than the underlines in a picture with a small display scale, so a fixed length value readily causes misjudgments in which a character stroke is classified as an underline or an underline is missed. How to set the length value is therefore the difficult problem to overcome.
It was mentioned before that the length value of the reference word in the document picture is determined as the threshold value of the document picture. The threshold is a parameter related to the length of the characters in the document picture and is independently and individually set for the shot document picture, so that the threshold is irrelevant to the display proportion of the shot picture, can reflect the length of the characters in the document picture, and can reflect the length of the underlines in the document picture because the length of the characters in the same document picture is consistent with the length of the underlines. In this way, the underscores can be distinguished from the straight lines of the document pictures by the individually set threshold values for the taken document pictures.
In some embodiments, existing graying functions, such as OpenCV image processing functions, may be used to gray out all straight lines in the document picture whose length is at or above the threshold.
Grayed content is treated as background rather than as characters during OCR recognition. Therefore, graying out all straight lines in the document picture whose length is at or above the threshold removes the interference of underlines with OCR character recognition, and with that interference removed, the accuracy of OCR character recognition is greatly improved.
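The patent points to OpenCV-style image processing for this step; the pure-Python stand-in below makes the idea concrete on a binarized image (0 = ink, 255 = white), graying out every horizontal ink run at least `threshold` pixels long while leaving shorter character strokes untouched. The gray value 200 and the run-based line detection are simplifying assumptions, not the patent's actual implementation:

```python
GRAY = 200  # value treated as background by a typical OCR binarization pass

def gray_out_long_lines(image, threshold, ink=0, gray=GRAY):
    """Set every horizontal run of ink pixels whose length is >= threshold
    to a gray value, so long straight lines (underlines) vanish for OCR.
    `image` is a list of pixel rows and is modified in place."""
    for row in image:
        run_start = None
        for x in range(len(row) + 1):
            inked = x < len(row) and row[x] == ink
            if inked and run_start is None:
                run_start = x                      # a run of ink begins
            elif not inked and run_start is not None:
                if x - run_start >= threshold:     # long enough: gray it out
                    for i in range(run_start, x):
                        row[i] = gray
                run_start = None                   # run ends either way
    return image

# A one-row toy image: a 3-px stroke, a white gap, then a 6-px underline
img = [[0, 0, 0, 255, 0, 0, 0, 0, 0, 0]]
gray_out_long_lines(img, threshold=5)
# The 3-px stroke is kept; the 6-px underline becomes gray
```

Real underlines are several pixels thick and may be slightly slanted, so an actual implementation would detect line segments, for example with a Hough transform, rather than single-pixel runs.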
And 160, performing OCR recognition on the grayed-out document picture to obtain second text recognition content of the document picture, and generating a first document detection result of the target document according to the second text recognition content.
After all straight lines in the document picture whose length is at or above the threshold have been grayed out, a grayed-out document picture is obtained. OCR recognition is then performed on this picture, yielding its text recognition content, namely the second text recognition content. From the second text recognition content, the text detection result of the target document, namely the first document detection result, can be generated.
In this embodiment of the present application, when detecting a target document, OCR recognition may be performed on a document picture of the target document to obtain first text recognition content of the document picture; a document template to which the document picture belongs is then determined according to the first text recognition content and a preset document template library; a reference word is determined from the first text recognition content according to all proportional keywords of that document template; and the length value of the reference word in the document picture is calculated and determined as the threshold of the document picture. All straight lines in the document picture whose length is at or above the threshold are then grayed out. Finally, OCR recognition is performed on the grayed-out document picture to obtain second text recognition content, from which a first document detection result of the target document can be generated. In this way, the threshold of the document picture, that is, a threshold set individually for the photographed document picture, can be used to distinguish underlines from the other straight lines of the document picture, and the graying-out processing then, in effect, removes the interference of underlines with OCR character recognition; with that interference removed, the accuracy of OCR character recognition is greatly improved.
In existing legal document detection scenarios, besides low text recognition accuracy, existing OCR technology cannot judge whether a legal document carries special features such as a handwritten signature or a fingerprint, so the document detection function is limited. A solution to this is provided below.
Referring to fig. 2, another implementation flowchart of the OCR-based document detection method provided in the embodiment of the present application is shown, which is described in detail below:
step 210, performing OCR recognition on the document picture of the shot target document to obtain a first text recognition content of the document picture.
And 220, determining a document template to which the document picture belongs according to the first text recognition content and a preset document template library.
For the specific processing of steps 210-220, reference may be made to the processing of steps 110-120 described above, and details are not repeated here.
And 230, marking words which are the same as all feature keywords of a document template to which the document picture belongs in the first text recognition content as feature words.
In some embodiments, feature keywords refer to words that characterize particular features in legal documents, such as fingerprints, signatures, and the like. In addition, the labeling process of the feature words may refer to the foregoing labeling process of the candidate words, which is not described herein.
And 240, determining a feature judgment area corresponding to the feature word in the document picture according to the pixel point position coordinates of the feature word in the document picture.
In some embodiments, after the feature word is marked, the abscissa and the ordinate of the pixel position, in the document picture, of the tail character of the feature word may be obtained. A rectangular region defined by the following formulas may then be determined as the feature judgment area corresponding to the feature word:
X_region ∈ [X_featureword, X_featureword + T];
Y_region ∈ [Y_featureword - D, Y_featureword + D];
where X_region and Y_region are the abscissa and the ordinate of pixel positions within the rectangular area, X_featureword and Y_featureword are the abscissa and the ordinate of the pixel position, in the document picture, of the tail character of the feature word, T is a preset horizontal value, and D is a preset vertical value.
It should be noted that the rectangular area defined by the above formulas, when mapped back onto the document picture, is exactly the area where a signature is handwritten or a fingerprint is pressed. Taking the rectangular area corresponding to the feature word "fingerprint" as an example, the area for pressing a fingerprint in a legal document is generally the blank area after the text "fingerprint:".
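The rectangle above can be expressed directly in code (a trivial sketch; the parameter names and the (x0, y0, x1, y1) return convention are assumptions):

```python
def feature_region(x_tail, y_tail, t, d):
    """Feature judgment area per the formulas above: x in [x_tail, x_tail + t],
    y in [y_tail - d, y_tail + d], where (x_tail, y_tail) is the pixel position
    of the tail character of the feature word, t is the preset horizontal value
    and d is the preset vertical value."""
    return (x_tail, y_tail - d, x_tail + t, y_tail + d)
```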
And 250, generating a judging result of whether the feature exists according to the size relation between the total number of the pixel points in the feature judging area and the preset number, and generating a second document detection result of the target document according to the judging result.
In some embodiments, if the total number of pixel points in the feature judgment area is greater than or equal to the preset number, a feature exists in the area, and that feature is usually the corresponding special feature; for example, the feature judgment area corresponding to the feature word "fingerprint" corresponds to the special feature of a fingerprint. A judgment result that the feature exists can therefore be generated. If the total number of pixel points in the feature judgment area is smaller than the preset number, no feature exists in the area, and a judgment result that the feature does not exist can be generated.
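The comparison can be sketched as follows, assuming that the "pixel points in the area" being counted are non-background (ink) pixels of a grayscale picture; the darkness cutoff is an illustrative value:

```python
import numpy as np

def has_feature(gray_img, region, preset_count, dark=128):
    """Count ink pixels inside the feature judgment area and compare the total
    against the preset number; region is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    patch = gray_img[y0:y1, x0:x1]
    return int((patch < dark).sum()) >= preset_count
```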
Thus, after the determination result of whether the feature exists is generated, the feature class detection result of the target document, that is, the second document detection result, can be generated according to the determination result. Therefore, the problem that whether the handwriting signature exists or not and whether special characteristics such as fingerprints exist or not in legal documents cannot be judged by the existing OCR technology is solved, and functions of document detection are enriched.
It should be noted that the foregoing text recognition processing and feature judgment processing may be performed in parallel. In addition, after the feature judgment result is generated, the pixel points in the feature judgment area may further be erased. Since a handwritten signature is often joined, cursive writing and a fingerprint pattern has no corresponding text, the erasing processing prevents OCR from misrecognizing these pixel points and further improves OCR text recognition accuracy.
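The erasing step can be sketched as filling the judged area with the background color before the second OCR pass (a minimal illustration assuming a grayscale picture with a white background):

```python
import numpy as np

def erase_region(gray_img, region, background=255):
    """Overwrite the feature judgment area so that handwriting or a fingerprint
    inside it cannot be misrecognized as printed text; region is (x0, y0, x1, y1)."""
    out = gray_img.copy()
    x0, y0, x1, y1 = region
    out[y0:y1, x0:x1] = background
    return out
```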
It should be noted that existing document detection workflows generally require recording whether a signature or a fingerprint exists; on that basis, relevant staff can be matched automatically or documents can be archived automatically. Adding the feature-presence judgment therefore makes document detection more intelligent, saves the manual judgment process, and offers high judgment accuracy.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The following are device embodiments of the present application, for details not described in detail therein, reference may be made to the corresponding method embodiments described above.
Fig. 3 is a schematic structural diagram of an OCR-based document detection device according to an embodiment of the present invention, and for convenience of explanation, only a portion related to an embodiment of the present application is shown, which is described in detail below:
as shown in fig. 3, the OCR-based document detection device includes:
the recognition module 310 is configured to perform OCR recognition on a document picture of the captured target document to obtain a first text recognition content of the document picture;
the template determining module 320 is configured to determine, according to the first text recognition content and a preset document template library, the document template to which the document picture belongs; the preset document template library includes at least one document template, the document template includes at least one template matching keyword and at least one proportional keyword, and the number of words of the proportional keyword is greater than or equal to two;
the reference word determining module 330 is configured to determine a reference word from the first text recognition content according to all proportional keywords of the document template to which the document picture belongs;
the threshold determining module 340 is configured to calculate a length value of the reference word in the document picture, and determine the length value as a threshold of the document picture;
the ash setting module 350 is configured to perform ash setting processing on all straight lines with lengths above a threshold value in the document picture;
the first detection module 360 is configured to perform OCR recognition on the grayed-out document picture to obtain the second text recognition content of the document picture, and to generate the first document detection result of the target document according to the second text recognition content.
In one possible implementation, the reference word determining module is further configured to:
marking words which are the same as all the proportion keywords in the first character recognition content as candidate words;
acquiring, for the candidate words, the ordinate of the pixel position of the first character in the document picture and the ordinate of the pixel position of the tail character in the document picture;
determining the target candidate word as a reference word in the case that the target candidate word exists; the absolute value of the difference value between the ordinate of the pixel point position of the first character and the ordinate of the pixel point position of the last character in the target candidate word is smaller than a preset difference value;
if the target candidate word does not exist, determining a preset number of continuous words adjacent to the candidate word as reference words; wherein the preset number is the word number of the candidate words.
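The selection rule described by this module can be sketched as follows (the (word, y_first, y_last) input format is an assumption; the fallback to a run of adjacent words of the same length is handled as noted in the text):

```python
def pick_reference_word(candidates, preset_diff):
    """Return the first candidate word whose first and tail characters sit on
    the same text line, i.e. whose ordinate difference is below preset_diff;
    such a word has not wrapped across lines and can serve as a length reference."""
    for word, y_first, y_last in candidates:
        if abs(y_first - y_last) < preset_diff:
            return word
    return None  # no target candidate word: fall back per the text above
```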
In one possible implementation, the threshold determination module is further configured to:
acquiring the abscissa of the pixel position of the first character of the reference word in the document picture and the abscissa of the pixel position of its tail character in the document picture;
and determining the absolute value of the difference between the abscissa of the pixel position of the first character in the document picture and the abscissa of the pixel position of the tail character in the document picture as the length value of the reference word in the document picture.
In one possible implementation, the template determination module is further configured to:
under the condition that only one template matching keyword exists in the first text recognition content, determining a document template corresponding to the template matching keyword as the document template to which the document picture belongs;
and, in the case that a plurality of template matching keywords exist in the first text recognition content, determining the document template that owns the most of those template matching keywords as the document template to which the document picture belongs.
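The two branches above amount to a majority vote over matched keywords; a minimal sketch (the keyword-to-template mapping is an assumed input format):

```python
from collections import Counter

def pick_template(found_keywords, keyword_to_template):
    """Map each template matching keyword found in the first text recognition
    content to its document template and choose the template with the most
    matches; with a single keyword this reduces to the first branch above."""
    votes = Counter(keyword_to_template[k] for k in found_keywords)
    return votes.most_common(1)[0][0] if votes else None
```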
In one possible implementation, the document template further includes at least one feature keyword;
correspondingly, the OCR-based document detection device further includes:
the feature word determining module is configured to mark words which are the same as all feature keywords of the document template to which the document picture belongs in the first text recognition content as feature words;
the region determining module is used for determining a feature judging region corresponding to the feature word in the document picture according to the pixel point position coordinates of the feature word in the document picture;
The second detection module is used for generating a judging result of whether the feature exists according to the size relation between the total number of the pixel points in the feature judging area and the preset number, and generating a second document detection result of the target document according to the judging result.
In one possible implementation, the area determining module is further configured to:
acquiring the abscissa and the ordinate of the pixel position, in the document picture, of the tail character of the feature word;
a rectangular region defined by the following formula is determined as a feature judgment region corresponding to the feature word:
X_region ∈ [X_featureword, X_featureword + T];
Y_region ∈ [Y_featureword - D, Y_featureword + D];
where X_region and Y_region are the abscissa and the ordinate of pixel positions within the rectangular area, X_featureword and Y_featureword are the abscissa and the ordinate of the pixel position, in the document picture, of the tail character of the feature word, T is a preset horizontal value, and D is a preset vertical value.
In one possible implementation, the second detection module is further configured to:
generating a judging result of the presence feature under the condition that the total number of the pixel points in the feature judging area is larger than or equal to the preset number;
And generating a judging result without the feature under the condition that the total number of the pixel points in the feature judging area is smaller than the preset number.
In one possible implementation, the feature key comprises a fingerprint or signature.
In the embodiment of the application, the threshold of the document picture, that is, a threshold set individually for the photographed document picture, can be used to distinguish underlines from other straight lines in the document picture; the graying-out processing then, in effect, removes the interference of underlines with OCR text recognition, greatly improving its accuracy. In addition, after the judgment result of whether the feature exists is generated, a feature-class detection result of the target document, that is, a second document detection result, may be generated according to that judgment result. This solves the problem that existing OCR technology cannot judge whether a legal document carries a handwritten signature, a fingerprint or other special features, and enriches the functions of document detection.
The present application also provides a computer program product having program code which, when run on a corresponding processor, controller, computing device or terminal, performs the steps of any of the above embodiments of the OCR-based document detection method, such as steps 110 to 160 shown in fig. 1. Those skilled in the art will appreciate that the methods and apparatus presented in the embodiments of the present application may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. A special purpose processor may include an application specific integrated circuit (ASIC), a reduced instruction set computer (RISC) and/or a field programmable gate array (FPGA). The proposed method and device are preferably implemented as a combination of hardware and software. The software is preferably installed as an application program on a program storage device of a machine based on a computer platform having hardware such as one or more central processing units (CPU), random access memory (RAM), and one or more input/output (I/O) interfaces. An operating system is also typically installed on the computer platform. The various processes and functions described herein may either be part of the application program or be executed via the operating system.
Fig. 4 is a schematic diagram of an electronic device 4 provided in an embodiment of the present application. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 40, a memory 41 and a computer program 42 stored in the memory 41 and executable on the processor 40. The steps of the various embodiments of OCR based document detection methods described above, such as steps 110 through 160 shown in fig. 1, are implemented when the processor 40 executes the computer program 42. Alternatively, the processor 40, when executing the computer program 42, performs the functions of the modules in the apparatus embodiments described above.
By way of example, the computer program 42 may be divided into one or more modules, one or more modules being stored in the memory 41 and executed by the processor 40 to complete the present application. One or more of the modules may be a series of computer program instruction segments capable of performing particular functions for describing the execution of the computer program 42 in the electronic device 4.
The electronic device 4 may include, but is not limited to, a processor 40, a memory 41. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not meant to be limiting of the electronic device 4, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device may further include an input-output device, a network access device, a bus, etc.
The processor 40 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 41 may be an internal storage unit of the electronic device 4, such as a hard disk or a memory of the electronic device 4. The memory 41 may also be an external storage device of the electronic device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 4. Further, the memory 41 may also include both an internal storage unit and an external storage device of the electronic device 4. The memory 41 is used to store computer programs and other programs and data required by the electronic device. The memory 41 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of each of the embodiments of the OCR-based document detection method described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, executable files or in some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
Furthermore, the features of the embodiments shown in the drawings or mentioned in the description of the present application are not necessarily to be construed as separate embodiments from each other. Rather, each feature described in one example of one embodiment may be combined with one or more other desired features from other embodiments, resulting in other embodiments not described in text or with reference to the drawings.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. An OCR-based document detection method, comprising:
performing OCR (optical character recognition) on a document picture of a shot target document to obtain first character recognition content of the document picture;
Determining a document template to which the document picture belongs according to the first text recognition content and a preset document template library; the document template library comprises at least one document template, wherein the document template comprises at least one template matching keyword and at least one proportion keyword, and the number of words of the proportion keyword is greater than or equal to two;
determining a reference word from the first character recognition content according to all proportional keywords of a document template to which the document picture belongs;
calculating the length value of the reference word in the document picture, and determining the length value as a threshold value of the document picture;
graying out all straight lines in the document picture whose length is at or above the threshold value;
performing OCR recognition on the grayed-out document picture to obtain second text recognition content of the document picture, and generating a first document detection result of the target document according to the second text recognition content.
2. The OCR-based document detection method according to claim 1, wherein determining the reference word from the first text recognition content according to all the proportional keywords of the document template to which the document picture belongs includes:
Marking words which are the same as all the proportion keywords in the first character recognition content as candidate words;
acquiring, for the candidate words, the ordinate of the pixel position of the first character in the document picture and the ordinate of the pixel position of the tail character in the document picture;
in the case that a target candidate word exists, determining the target candidate word as a reference word; the absolute value of the difference value between the ordinate of the pixel point position of the first character and the ordinate of the pixel point position of the last character in the target candidate word is smaller than a preset difference value;
if the target candidate word does not exist, determining a preset number of continuous words adjacent to the candidate word as reference words; wherein the preset number is the word number of the candidate word.
3. The OCR-based document detection method according to claim 1, wherein calculating a length value of the reference word in the document picture includes:
acquiring the horizontal coordinate of the first character in the pixel point of the document picture and the horizontal coordinate of the tail character in the pixel point of the document picture;
and determining the absolute value of the difference value between the abscissa of the pixel point position of the first character in the document picture and the abscissa of the pixel point position of the tail character in the document picture as the length value of the reference character in the document picture.
4. The OCR-based document detection method according to claim 1, wherein determining, according to the first text recognition content and a preset document template library, a document template to which the document picture belongs includes:
under the condition that only one template matching keyword exists in the first text recognition content, determining a document template corresponding to the one template matching keyword as the document template to which the document picture belongs;
and under the condition that a plurality of template matching keywords exist in the first text recognition content, determining a document template with the most template matching keywords among the plurality of template matching keywords as the document template to which the document picture belongs.
5. The OCR-based document detection method according to any one of claims 1 to 4, wherein the document template further comprises at least one feature keyword;
after determining the document template to which the document picture belongs, the method further comprises:
marking words which are the same as all characteristic keywords of a document template to which the document picture belongs in the first text recognition content as characteristic words;
determining a feature judgment area corresponding to the feature word in the document picture according to the pixel point position coordinates of the feature word in the document picture;
And generating a judging result of whether the characteristics exist or not according to the size relation between the total number of the pixel points in the characteristic judging area and the preset number, and generating a second document detection result of the target document according to the judging result.
6. The OCR-based document detection method according to claim 5, wherein determining a feature judgment area corresponding to the feature word in the document picture based on pixel point position coordinates of the feature word in the document picture comprises:
acquiring the abscissa and the ordinate of the pixel position, in the document picture, of the tail character of the feature word;
a rectangular region defined by the following formula is determined as a feature judgment region corresponding to the feature word:
X_region ∈ [X_featureword, X_featureword + T];
Y_region ∈ [Y_featureword - D, Y_featureword + D];
where X_region and Y_region are the abscissa and the ordinate of pixel positions within the rectangular area, X_featureword and Y_featureword are the abscissa and the ordinate of the pixel position, in the document picture, of the tail character of the feature word, T is a preset horizontal value, and D is a preset vertical value.
7. The OCR-based document detection method according to claim 5, wherein the generating of the determination result of whether the feature exists according to the magnitude relation between the total number of the pixel points in the feature determination area and the preset number includes:
Generating a judging result of the presence feature under the condition that the total number of the pixel points in the feature judging area is larger than or equal to the preset number;
and generating a judging result of no feature under the condition that the total number of the pixel points in the feature judging area is smaller than the preset number.
8. The OCR-based document detection method of claim 5, wherein the feature key comprises a fingerprint or a signature.
9. An OCR-based document detection device, comprising:
a recognition module, configured to perform OCR on a document picture obtained by photographing a target document, to obtain first text recognition content of the document picture;
a template determining module, configured to determine, according to the first text recognition content and a preset document template library, the document template to which the document picture belongs; the document template library comprises at least one document template, each document template comprises at least one template matching keyword and at least one proportion keyword, and each proportion keyword has two or more characters;
a reference word determining module, configured to determine a reference word from the first text recognition content according to all proportion keywords of the document template to which the document picture belongs;
a threshold determining module, configured to calculate the length value of the reference word in the document picture and set that length value as the threshold of the document picture;
a graying module, configured to gray out all straight lines in the document picture whose length is above the threshold;
a first detection module, configured to perform OCR on the grayed document picture to obtain second text recognition content of the document picture, and to generate a first document detection result of the target document according to the second text recognition content.
10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the OCR-based document detection method according to any one of claims 1 to 8.
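The template determining, reference word determining, and threshold determining modules of claim 9 can be sketched as follows. The dictionary layout of the template library and the (word, x0, x1) OCR output format are illustrative assumptions, not the patent's implementation.

```python
def determine_template(first_text, template_library):
    # Simplified stand-in for the template determining module: select
    # the template all of whose matching keywords occur in the first
    # text recognition content.
    for name, template in template_library.items():
        if all(kw in first_text for kw in template["match_keywords"]):
            return name
    return None

def line_graying_threshold(ocr_words, proportion_keywords):
    # Reference word + threshold determining modules: locate a
    # proportion keyword (two or more characters) among the recognised
    # words and take its rendered pixel width as the threshold above
    # which straight lines are grayed out before the second OCR pass.
    for word, x0, x1 in ocr_words:
        if word in proportion_keywords:
            return x1 - x0
    return None
```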
CN202311786315.9A 2023-12-25 2023-12-25 Document detection method and device based on OCR and electronic equipment Active CN117475453B (en)

Publications (2)

Publication Number Publication Date
CN117475453A true CN117475453A (en) 2024-01-30
CN117475453B CN117475453B (en) 2024-02-27

Family

ID=89627775


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271993A (en) * 2018-09-28 2019-01-25 徐常柏 Document scanning system, computer device and readable storage medium
WO2020155763A1 (en) * 2019-01-28 2020-08-06 平安科技(深圳)有限公司 Ocr recognition method and electronic device thereof
CN111782772A (en) * 2020-07-24 2020-10-16 平安银行股份有限公司 Text automatic generation method, device, equipment and medium based on OCR technology
CN111985464A (en) * 2020-08-13 2020-11-24 山东大学 Multi-scale learning character recognition method and system for court judgment documents
CN114863462A (en) * 2022-04-20 2022-08-05 河南灵锻创生生物科技有限公司 OCR recognition method and system, and storage medium
KR20230079938A (en) * 2021-11-29 2023-06-07 (주)아이씨엔아이티 Object extracting system for objects included in paper images using OCR, and method thereof



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant