CN114782973A - Document classification method, training method, device and storage medium - Google Patents


Info

Publication number
CN114782973A
CN114782973A (application CN202210471751.6A)
Authority
CN
China
Legal status
Pending
Application number
CN202210471751.6A
Other languages
Chinese (zh)
Inventor
王雷
宋祺
张睿
燕鹏举
周健
Current Assignee
Shanghai Hongji Information Technology Co Ltd
Original Assignee
Shanghai Hongji Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Hongji Information Technology Co Ltd
Priority to CN202210471751.6A
Publication of CN114782973A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G06F40/194 Calculation of difference between files


Abstract

The application provides a document classification method, a training method, a device, and a storage medium. The method comprises: acquiring a text recognition result of a document picture to be processed; and classifying the document picture to be processed according to the text recognition result and a preset template library, which contains at least one type of document template, and outputting the classification result. By automatically identifying the type of a document picture from its text recognition result and the template library, the method and device overcome the inability of conventional picture classification methods to process document pictures and improve the degree of automation in business scenarios that classify document pictures.

Description

Document classification method, training method, device and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a document classification method, a training method, a device, and a storage medium.
Background
With the increasing electronization of office work, document data originally stored on paper is gradually converted, by electronic means such as scanners, into images that are stored and managed by category. In practical scenarios, different types of document data serve different purposes, so document classification is particularly important for improving work efficiency.
There are hundreds of common document categories, and the differences between some of them are very small; for example, outpatient invoices and hospitalization invoices often differ in only a few lines of text. Because documents fall into many categories with small differences between categories, the classification task is very difficult.
Disclosure of Invention
The embodiments of the present application aim to provide a document classification method, a training method, a device, and a storage medium that automatically identify the type of a document picture based on a text recognition result and a template library, thereby overcoming the inability of conventional picture classification methods to process document pictures and improving the degree of automation in business scenarios that classify document pictures.
A first aspect of an embodiment of the present application provides a document classification method, including: acquiring a text recognition result of a document picture to be processed; classifying the document pictures to be processed according to the text recognition result and a preset template library, and outputting the classification result of the document pictures to be processed, wherein the template library comprises at least one type of document template.
In an embodiment, the obtaining a text recognition result of the to-be-processed document picture includes: adopting an optical character recognition technology to recognize the document picture to be processed to obtain an initial recognition result, wherein the initial recognition result comprises the following steps: the initial text line position and the text content in the document picture to be processed; and carrying out normalization processing on the initial text line position to obtain the final text line position of the document picture to be processed, wherein the text line position corresponds to the text content.
In an embodiment, the normalizing the initial text line position to obtain the final text line position of the document picture to be processed includes: determining the position of a base point in the document picture to be processed; calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position; determining a first proportional numerical value between the horizontal coordinate difference value and the horizontal width value of the to-be-processed document picture, determining a second proportional numerical value between the vertical coordinate difference value and the vertical height value of the to-be-processed document picture, taking the first proportional numerical value as the horizontal coordinate of the text row position, and taking the second proportional numerical value as the vertical coordinate of the text row position.
In an embodiment, the classifying the to-be-processed document picture according to the text recognition result and a preset template library, and outputting the classification result of the to-be-processed document picture includes: respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result; and selecting the document template with the matching degree larger than a preset threshold value as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In one embodiment, the method further comprises: when the number of the target document templates is multiple, selecting the document template with the maximum matching degree with the document picture to be processed from the multiple target document templates, and taking the type of the document template with the maximum matching degree as the document type of the document picture to be processed.
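The selection logic of the two preceding paragraphs — keep document templates whose matching degree exceeds a preset threshold, and when several qualify, take the one with the maximum matching degree — can be sketched in Python. This is a minimal illustration; the function name, the default threshold value, and the use of a plain dict of precomputed matching degrees are assumptions, not part of the patent:

```python
def pick_template(scores, threshold=0.6):
    """scores: mapping of document-template type -> matching degree with the
    document picture to be processed. Returns the type whose matching degree
    exceeds the threshold; when several templates qualify, the one with the
    maximum matching degree wins. Returns None when no template qualifies."""
    if not scores:
        return None
    best_type, best_degree = max(scores.items(), key=lambda kv: kv[1])
    return best_type if best_degree > threshold else None

print(pick_template({"outpatient invoice": 0.9, "hospitalization invoice": 0.7}))
# -> outpatient invoice
```

Because ties above the threshold are resolved by the maximum matching degree, the two selection rules in the patent collapse into a single argmax-then-gate step here.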
In an embodiment, the calculating, according to the text recognition result, a matching degree between the to-be-processed document picture and each document template in the template library includes: for each document template, calculating, according to the text recognition result, the overlapping degree and the text similarity between each text line in the to-be-processed document picture and each text line in the document template; selecting a target line set from the text recognition result, wherein each target line in the set has an overlapping degree greater than a first threshold and a text similarity greater than a second threshold; and taking the ratio of the number of target lines in the target line set to the total number of text lines in the document template as the matching degree between the to-be-processed document picture and the document template.
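The patent does not fix concrete measures for the "overlapping degree" and "text similarity"; the sketch below assumes intersection-over-union for overlap and Python's difflib ratio for similarity, purely for illustration, and follows the paragraph above in dividing the number of target lines by the total number of template lines:

```python
from difflib import SequenceMatcher

def overlap(a, b):
    # overlapping degree of two boxes (x1, y1, x2, y2); IoU is an assumed measure
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def matching_degree(doc_lines, template_lines,
                    first_threshold=0.5, second_threshold=0.8):
    """Ratio of target lines (document lines whose overlap with some template
    line exceeds the first threshold and whose text similarity exceeds the
    second threshold) to the total number of text lines in the template."""
    targets = 0
    for box, text in doc_lines:
        for t_box, t_text in template_lines:
            if (overlap(box, t_box) > first_threshold
                    and SequenceMatcher(None, text, t_text).ratio() > second_threshold):
                targets += 1
                break
    return targets / len(template_lines)

template = [((0.1, 0.1, 0.4, 0.2), "outpatient invoice"),
            ((0.1, 0.3, 0.3, 0.4), "amount")]
document = [((0.1, 0.1, 0.4, 0.2), "outpatient invoice"),
            ((0.6, 0.6, 0.8, 0.7), "notes")]
print(matching_degree(document, template))  # -> 0.5
```

In the usage example, only the first document line clears both thresholds against a template line, so one of the two template lines is matched and the matching degree is 0.5.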
In an embodiment, the classifying the to-be-processed document picture according to the text recognition result and a preset template library, and outputting the classification result of the to-be-processed document picture includes: respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result; and selecting the document template with the maximum matching degree as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In an embodiment, the maximum value of the matching degree is greater than a preset threshold.
A second aspect of the embodiments of the present application provides a method for training a document template, including: obtaining sample text recognition results of a plurality of sample document pictures and a current template, wherein the sample document pictures are all of the same document type and the current template is of that same document type; for each sample document picture in the plurality of sample document pictures, calculating the overlapping degree between the kth sample text line in the current sample document picture and each text line in the current template, and judging whether the maximum overlapping degree is greater than a first threshold, wherein k is a positive integer; when the maximum overlapping degree is greater than the first threshold, calculating the text similarity between the kth sample text line and the text line with the maximum overlapping degree, wherein the text line with the maximum overlapping degree is a text line in the current template; judging whether the text similarity is greater than a second threshold; and when the text similarity is greater than the second threshold, updating the current template according to the kth sample text line and the text line with the maximum overlapping degree.
In one embodiment, the method further comprises: when the maximum overlapping degree is smaller than or equal to the first threshold value, or when the text similarity is smaller than or equal to the second threshold value, performing a training process between the (k + 1) th sample text line in the current sample document picture and each text line in the current template, and sequentially traversing all sample text lines in the current sample document picture to obtain the trained document template.
In one embodiment, the updating the current template according to the kth sample text line and the maximum overlapping degree text line includes: determining the average text line position of the kth sample text line position and the maximum overlapping degree text line position, and taking the average text line position as an updated text line position; and taking the text content with higher confidence coefficient in the kth sample text line and the text line with the maximum overlapping degree as the text content of the updated text line, and generating an updated document template.
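The training traversal described in the preceding paragraphs — for each sample text line, find the template line with maximum overlap, gate on the two thresholds, and merge qualifying pairs by averaging positions and keeping the higher-confidence text — might be sketched as follows. The overlap (IoU) and similarity (difflib ratio) measures are assumptions, lines are (box, text, confidence) triples, and all names are illustrative:

```python
from difflib import SequenceMatcher

def overlap(a, b):
    # overlapping degree of two boxes (x1, y1, x2, y2); IoU is an assumed measure
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def train_template(template, sample_pictures,
                   first_threshold=0.5, second_threshold=0.8):
    """template and each sample picture are lists of (box, text, confidence).
    For every sample text line, find the template line with maximum overlap;
    if the overlap and the text similarity both clear their thresholds, update
    that template line: average the two positions and keep the text content
    with the higher OCR confidence. Otherwise move to the next sample line."""
    for sample_lines in sample_pictures:
        for s_box, s_text, s_conf in sample_lines:
            i, (t_box, t_text, t_conf) = max(
                enumerate(template), key=lambda it: overlap(s_box, it[1][0]))
            if overlap(s_box, t_box) <= first_threshold:
                continue  # next sample text line
            if SequenceMatcher(None, s_text, t_text).ratio() <= second_threshold:
                continue
            avg_box = tuple((s + t) / 2 for s, t in zip(s_box, t_box))
            template[i] = ((avg_box, s_text, s_conf) if s_conf > t_conf
                           else (avg_box, t_text, t_conf))
    return template

tpl = [((0.25, 0.125, 0.5, 0.25), "tota1 amount", 0.6)]
samples = [[((0.3125, 0.125, 0.5625, 0.25), "total amount", 0.9)]]
print(train_template(tpl, samples))
# -> [((0.28125, 0.125, 0.53125, 0.25), 'total amount', 0.9)]
```

In the usage example the sample line overlaps the single template line enough (IoU 0.6) and is textually similar (only "1" vs "l" differs), so the template position drifts toward the sample and the higher-confidence, correctly recognized text replaces the misread one.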
In an embodiment, the obtaining sample text recognition results of a plurality of sample document pictures includes: adopting an optical character recognition technology to recognize the sample document pictures to obtain an initial recognition result, wherein the initial recognition result comprises the following steps: an initial text line position and text content in the sample document picture; and normalizing the initial text line positions of the sample document pictures to obtain final text line positions of the sample document pictures, wherein the text line positions correspond to the text content.
In an embodiment, the normalizing the initial text line positions of the sample document pictures to obtain final text line positions of the sample document pictures includes: determining a base point position in each sample document picture; calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position; determining a first proportional numerical value between the abscissa difference value and the sample document picture transverse width value, determining a second proportional numerical value between the ordinate difference value and the sample document picture longitudinal height value, taking the first proportional numerical value as the abscissa of the final text line position, and taking the second proportional numerical value as the ordinate of the final text line position.
A third aspect of embodiments of the present application provides a document classification apparatus, including: the first acquisition module is used for acquiring a text recognition result of a document picture to be processed; and the classification module is used for classifying the document pictures to be processed according to the text recognition result and a preset template library and outputting the classification result of the document pictures to be processed, wherein the template library comprises at least one type of document template.
In one embodiment, the first obtaining module is configured to: adopting an optical character recognition technology to recognize the document picture to be processed to obtain an initial recognition result, wherein the initial recognition result comprises: the initial text line position and the text content in the document picture to be processed; and normalizing the initial text line position to obtain the final text line position of the document picture to be processed, wherein the text line position corresponds to the text content.
In an embodiment, the first obtaining module is further configured to: determining the base point position in the document picture to be processed; calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position; determining a first proportional numerical value between the abscissa difference value and the lateral width value of the to-be-processed document picture, determining a second proportional numerical value between the ordinate difference value and the longitudinal height value of the to-be-processed document picture, taking the first proportional numerical value as the abscissa of the text line position, and taking the second proportional numerical value as the ordinate of the text line position.
In one embodiment, the classification module is configured to: respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result; and selecting the document template with the matching degree larger than a preset threshold value as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In one embodiment, the classification module is further configured to: and when the target document templates are multiple, selecting the document template with the maximum matching degree with the document picture to be processed from the multiple target document templates, and taking the type of the document template with the maximum matching degree as the document type of the document picture to be processed.
In one embodiment, the classification module is configured to: for each document template, calculate, according to the text recognition result, the overlapping degree and the text similarity between each text line in the document picture to be processed and each text line in the document template; select a target line set from the text recognition result, wherein each target line in the set has an overlapping degree greater than a first threshold and a text similarity greater than a second threshold; and take the ratio of the number of target lines in the target line set to the total number of text lines in the document template as the matching degree between the document picture to be processed and the document template.
In one embodiment, the classification module is configured to: respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result; and selecting the document template with the maximum matching degree as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In an embodiment, the maximum value of the matching degree is greater than a preset threshold.
A fourth aspect of the embodiments of the present application provides a device for training a document template, including: the second acquisition module is used for acquiring sample text recognition results and current templates of a plurality of sample document pictures, wherein the document types of the sample document pictures are the same, and the current templates are the same as the document types of the sample document pictures; a first calculating module, configured to calculate, for each sample document picture in the multiple sample document pictures, an overlapping degree between a kth sample text line in a current sample document picture and each text line in the current template, and determine whether the maximum overlapping degree is greater than a first threshold, where k is a positive integer; a second calculating module, configured to calculate a text similarity between a kth sample line and a text line with a maximum overlap degree when the maximum overlap degree is greater than the first threshold, where the text line with the maximum overlap degree corresponds to the text line in the current template; the judging module is used for judging whether the text similarity is larger than a second threshold value or not; and the updating module is used for updating the current template according to the kth sample text line and the text line with the maximum overlapping degree when the text similarity is greater than the second threshold value.
In one embodiment, the apparatus further comprises an iteration module configured to: when the maximum overlapping degree is smaller than or equal to the first threshold value or when the text similarity is smaller than or equal to the second threshold value, performing a training process between the (k + 1) th sample line in the current sample document picture and each text line in the current template, and sequentially traversing all the sample text lines in the current sample document picture to obtain the trained document template.
In one embodiment, the update module is configured to: determining the average text line position of the kth sample text line position and the maximum overlapping degree text line position, and taking the average text line position as an updated text line position; and taking the text content with higher confidence coefficient in the kth sample text line and the text line with the maximum overlapping degree as the text content of the updated text line, and generating an updated document template.
In an embodiment, the second obtaining module is configured to: adopting an optical character recognition technology to recognize the sample document pictures to obtain an initial recognition result, wherein the initial recognition result comprises: an initial text line position and text content in the sample document picture; and normalizing the initial text line positions of the sample document pictures to obtain final text line positions of the sample document pictures, wherein the text line positions correspond to the text content.
In an embodiment, the second obtaining module is further configured to: determining a base point position in each sample document picture; calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position; determining a first proportional numerical value between the abscissa difference value and the sample document picture transverse width value, determining a second proportional numerical value between the ordinate difference value and the sample document picture longitudinal height value, taking the first proportional numerical value as the abscissa of the final text line position, and taking the second proportional numerical value as the ordinate of the final text line position.
A fifth aspect of an embodiment of the present application provides an electronic device, including: a memory to store a computer program; a processor configured to execute the computer program to implement the method of the first aspect and any embodiment thereof, or to implement the method of the second aspect and any embodiment thereof.
A sixth aspect of embodiments of the present application provides a non-transitory electronic device-readable storage medium, including: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of an embodiment of the present application and any of the embodiments thereof, or to implement the method of the second aspect of an embodiment of the present application and any of the embodiments thereof.
According to the document classification method, the training method, the device, and the storage medium of the present application, a template library containing document templates of multiple types is preset. When there is a document picture to be processed, text recognition is first performed on it, and its type is then automatically identified based on the text recognition result and the template library. This overcomes the inability of conventional picture classification methods to process document pictures, improves the degree of automation in business scenarios that classify document pictures, and improves the accuracy of document type identification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a document classification method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a document classification method according to an embodiment of the present application;
FIG. 4A is a pictorial illustration of a short purchase order in accordance with an embodiment of the present application;
FIG. 4B is a schematic illustration of a long purchase order in accordance with an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a document classification apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training apparatus for a document template according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic device 1 comprising at least one processor 11 and a memory 12 (one processor is shown in fig. 1 as an example), connected by a bus 10. The memory 12 stores instructions executable by the processor 11; when the processor 11 executes these instructions, the electronic device 1 can carry out all or part of the flow of the methods in the embodiments described below, so as to overcome the inability of conventional picture classification methods to process document pictures, improve the degree of automation in business scenarios that classify document pictures, and improve the accuracy of document type identification.
In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or the like.
Please refer to fig. 2, which shows a document classification method according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 to overcome the inability of conventional picture classification methods to process document pictures, improve the degree of automation in business scenarios that classify document pictures, and improve the accuracy of document type identification. The method comprises the following steps:
Step 201: acquire a text recognition result of the document picture to be processed.
In this step, the document picture to be processed can be acquired in real time from the documents to be classified. For example, in a practical scenario where the bill documents of a hospital need to be classified, bill pictures can be acquired in real time by scanning or photographing the bills to be processed. Alternatively, the document picture to be processed can be read from a previously prepared image database. Likewise, the text recognition result can be obtained by performing text recognition on the document picture to be processed, or read directly from a specified database; different acquisition modes can be selected according to actual requirements, which increases flexibility.
In an embodiment, step 201 may specifically include: recognizing the document picture to be processed by optical character recognition to obtain an initial recognition result, which includes the initial text line positions and the text content in the document picture to be processed; and normalizing the initial text line positions to obtain the final text line positions of the document picture to be processed, wherein each text line position corresponds to its text content.
In this step, OCR (Optical Character Recognition) can be used to recognize the document picture to be processed in real time, yielding a text recognition result that includes at least the positions of the text lines in the picture and the text content of each line; a text line position can be represented by the coordinates of its key points. For example, during OCR, the minimum bounding box of a text line may be predicted based on a contour localization mechanism, and the corner points of the bounding box used as key points. If the bounding box is rectangular, the text line position can be represented by the combination of the coordinates of its upper-left and lower-right corner points: assuming the upper-left corner of the minimum bounding box of a text line is (x1, y1) and the lower-right corner is (x2, y2), the position can be represented as (x1, y1, x2, y2). To avoid the influence of offsets and inconsistent picture scales, the initial text line positions can be normalized to improve calculation precision.
Here, a text line is a line of text as recognized by the OCR algorithm, which may or may not coincide with the actual line layout of the document to be processed. For example, where one actual line of text in the document is long, the OCR algorithm may split it into 3 text lines.
In an embodiment, normalizing the initial text line position to obtain the final text line position of the document picture to be processed includes: determining the base point position in the document picture to be processed; calculating the abscissa difference and the ordinate difference between each initial text line position and the base point position; determining a first proportional value between the abscissa difference and the width of the document picture and a second proportional value between the ordinate difference and the height of the document picture; and taking the first proportional value as the abscissa and the second proportional value as the ordinate of the text line position.
In this embodiment, a normalization operation can be performed on the OCR recognition result of the document picture to be processed. First, a base point position is selected as the reference position; the base point can be any point in the document picture, but must be unique for a given picture. Preferably, the base point is a corner point of the minimum bounding box of some text line: for example, the top-left-most text line can be selected and the upper-left corner of its minimum bounding box used as the base point. The base point coordinates are then subtracted from the key-point coordinates of each text line position to obtain abscissa and ordinate differences, which are divided by the picture width and height respectively: each abscissa difference is divided by the width of the document picture and each ordinate difference by its height.
In one embodiment, assume that for a certain document picture to be processed the OCR algorithm yields 2 text lines, whose initial positions are represented by key point coordinates: text line 1 is [10, 6, 20, 30] and text line 2 is [5, 7, 20, 40]. The width of the document picture to be processed is 100 and its height is 200.
Then the process of normalization is:
1. Take a point at the upper left of the text lines in the document picture to be processed as the base point position: [5, 6].
2. Subtract the base point coordinates from the corresponding key point coordinates of the 2 text lines: [10-5, 6-6, 20-5, 30-6] and [5-5, 7-6, 20-5, 40-6]. That is, the horizontal and vertical coordinate differences for text line 1 are [5, 0, 15, 24], and those for text line 2 are [0, 1, 15, 34].
3. And correspondingly dividing the horizontal coordinate difference value and the vertical coordinate difference value by the width and the height of the picture to obtain a text line position consisting of a first proportional numerical value and a second proportional numerical value:
wherein the text line position corresponding to text line 1 is: [(10-5)/100, (6-6)/200, (20-5)/100, (30-6)/200] = [0.05, 0, 0.15, 0.12];
the text line position corresponding to text line 2 is: [(5-5)/100, (7-6)/200, (20-5)/100, (40-6)/200] = [0, 0.005, 0.15, 0.17].
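The normalization described above can be sketched in a few lines of Python. This is a minimal illustration of the worked example, not the patent's implementation; the function name and list-based representation are assumptions.

```python
def normalize_text_lines(lines, width, height, base):
    """Normalize [x1, y1, x2, y2] text line positions against a base point
    and the picture dimensions: subtract the base point, then divide x by
    the picture width and y by the picture height."""
    bx, by = base
    result = []
    for x1, y1, x2, y2 in lines:
        result.append([(x1 - bx) / width, (y1 - by) / height,
                       (x2 - bx) / width, (y2 - by) / height])
    return result

# Values from the example: two text lines, a 100x200 picture, base point [5, 6].
normalized = normalize_text_lines([[10, 6, 20, 30], [5, 7, 20, 40]],
                                  width=100, height=200, base=(5, 6))
# normalized[0] -> [0.05, 0.0, 0.15, 0.12]
# normalized[1] -> [0.0, 0.005, 0.15, 0.17]
```

Because the result is expressed as proportions of the picture size, the same document scanned at different resolutions yields comparable text line positions.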
Step 202: classifying the document picture to be processed according to the text recognition result and a preset template library, and outputting the classification result of the document picture to be processed.
In this step, the template library contains one or more types of document templates; that is, the type of each document template is known. Each document template includes at least the text line positions and the corresponding text content. The text recognition result of the document picture to be processed is matched against the document templates in the template library to obtain the classification result of the document picture to be processed.
According to this document classification method, a template library containing document templates of multiple types is preset. When a document picture to be processed arrives, text recognition is performed on it, and its type is then identified automatically from the text recognition result and the template library. This overcomes the shortcoming that conventional picture classification methods cannot handle document pictures, raises the degree of automation in business scenarios that classify document pictures, and improves the accuracy of document type identification.
Please refer to fig. 3, which shows a document classification method according to an embodiment of the present application. The method can be executed by the electronic device 1 shown in fig. 1 to overcome the shortcoming that conventional picture classification methods cannot handle document pictures, raise the degree of automation in business scenarios that classify document pictures, and improve the accuracy of document type identification. The method also includes a training process for the document template and specifically comprises the following steps:
Step 301: acquiring the sample text recognition results of a plurality of sample document pictures and a current template.
In this step, the plurality of sample document pictures share the same document category, and the current template has that same document category. In an actual scenario, the template library may include document templates of multiple types, so in the training stage a batch of sample document pictures may be selected and labeled for each document type; for example, a sample picture of an outpatient service ticket is labeled "outpatient service ticket" and a sample picture of an inpatient service ticket is labeled "inpatient service ticket". Each type may include a plurality of sample document pictures: for example, the batch contains N types with m sample document pictures per type, where m is a positive integer, and 1 < m ≤ 10 may be set to save training resources. The plurality of sample document pictures of each type are then trained to obtain the document template for that type.
The text recognition result of one sample document picture may be chosen arbitrarily from those of the plurality of same-type sample document pictures to serve as the current template, or the text recognition result of an additional document picture of the same type may be designated as the current template. To keep the calculated data consistent, the text line positions in the current template are normalized. The current template provides the initial state for the training process; the final training result is obtained by continually updating it.
It should be noted that, in the normalization process, the results of the normalization process may be greatly different due to different base point positions. Therefore, for the same document template, the base point positions are selected to be consistent during the normalization process in the training stage and during the normalization process when the document template is used for classification.
In an embodiment, to avoid the influence of the offset and the picture size, reference may be made to the specific implementation of step 201: adopting an optical character recognition technology to recognize a plurality of sample document pictures to obtain an initial recognition result, wherein the initial recognition result comprises: the initial text line position and text content in the sample document picture. And then normalizing the initial text line positions of the sample document pictures to obtain the final text line positions of the sample document pictures, wherein the text line positions correspond to text contents. Reference may be made in detail to the description of step 201 in the above embodiments.
In an embodiment, the normalizing the initial text line positions of the plurality of sample document pictures to obtain the final text line positions of the plurality of sample document pictures may specifically include: determining a base point position in the sample document picture aiming at each sample document picture; calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position; determining a first proportional numerical value between the abscissa difference value and the sample document picture transverse width value, determining a second proportional numerical value between the ordinate difference value and the sample document picture longitudinal height value, taking the first proportional numerical value as the abscissa of the final text row position, and taking the second proportional numerical value as the ordinate of the final text row position. Reference may be made in detail to the description of step 201 in the above embodiments.
Step 302: for each sample document picture in the plurality of sample document pictures, calculating the overlapping degree between the kth sample text line in the current sample document picture and each text line in the current template. And judging whether the maximum overlapping degree is larger than a first threshold value, if so, entering step 303, and otherwise, entering step 306.
In this step, assume the batch of sample document pictures contains N types. For the i-th type (1 ≤ i ≤ N), the text recognition result of the first sample document picture is selected as the current template, and the current template is updated by traversing the remaining m-1 sample document pictures in turn.
Specifically, the j-th sample document picture (1 < j ≤ m) is taken as the current sample document, and the overlap between the k-th sample text line of the current sample document and each text line of the current template is computed; for example, the Intersection over Union (IOU) between them can be calculated to represent their overlapping degree, where 1 ≤ k ≤ G and G is the total number of sample text lines in the OCR recognition result of the j-th sample document picture (G is a positive integer). The IOU calculation takes the normalized positions of the sample text line and the template text line as input and computes the IOU between the two text lines. The maximum IOU value among the results is then selected. If the maximum IOU value is greater than the first threshold, step 303 is entered; otherwise step 306 is entered. The first threshold may be derived statistically from empirical data; for example, it may be set to 0.8 for a strict scenario and to 0.5 for a less strict one.
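A standard IOU over two axis-aligned boxes, as referenced in step 302, can be sketched as follows; the patent does not give a formula, so this is the conventional definition, assumed to be what is intended.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]
    in normalized coordinates: intersection area / union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A box overlaps itself completely; disjoint boxes not at all.
# iou([0, 0, 1, 1], [0, 0, 1, 1])         -> 1.0
# iou([0, 0, 0.5, 0.5], [0.6, 0.6, 1, 1]) -> 0.0
```

Because both boxes are normalized in step 201, the IOU is invariant to the original picture resolution.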
Step 303: and calculating the text similarity between the kth sample text line and the text line with the maximum overlapping degree, wherein the text line with the maximum overlapping degree corresponds to the text line in the current template. Step 304 is then entered.
In this step, if the maximum value of the IOU is greater than the set threshold 1, selecting a text line with the largest IOU between the kth sample text line and the current template, where the text line is the text line with the largest overlap degree, and calculating the text similarity between the kth sample text line and the text line with the largest overlap degree, where the text similarity can be calculated by the corresponding text content. The text similarity between the maximum overlap text line and the kth sample text line may further characterize the degree of match between the two text lines.
In an embodiment, the text similarity may be calculated as follows: extract the text contents of the kth sample text line and the maximum-overlap text line, and segment each text content, for example at the granularity of words, original character strings, or syntactic analysis results. Based on the segmentation results, construct a text feature vector for each of the two text lines. The text similarity between them can then be obtained by computing the distance between the two feature vectors, for example the Euclidean distance.
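One minimal realization of this idea is sketched below. The segmentation granularity (single characters), the bag-of-characters feature vector, and the distance-to-similarity mapping 1/(1 + distance) are all illustrative assumptions; the patent leaves these choices open.

```python
from collections import Counter
import math

def text_similarity(a, b):
    """Segment two strings at character granularity, build bag-of-characters
    feature vectors over the shared vocabulary, and map their Euclidean
    distance to a similarity in (0, 1]."""
    vocab = sorted(set(a) | set(b))
    ca, cb = Counter(a), Counter(b)
    va = [ca[ch] for ch in vocab]
    vb = [cb[ch] for ch in vocab]
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return 1.0 / (1.0 + dist)

# Identical strings have zero distance, hence similarity 1.0.
```

Word-level segmentation or edit-distance-based similarity would slot into the same place; only the thresholds in step 304 would need re-tuning.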
Step 304: and judging whether the text similarity is larger than a second threshold value. If so, step 305 is entered, otherwise, step 306 is entered.
In this step, the second threshold may be set based on actual demand: the greater the second threshold, the higher the accuracy. It may be derived statistically from empirical data and ranges over 0-1; for a scene with strict requirements it may be set to 0.8. If the similarity is greater than this second threshold, the kth sample text line matches the maximum-overlap text line and step 305 is entered.
Step 305: and updating the current template according to the kth sample text line and the maximum overlapping degree text line. Step 306 is then entered.
In this step, when the text similarity is greater than the second threshold, the kth sample text line matches the maximum-overlap text line; the two text lines are combined by weighted summation into a new text line, which is stored in the updated current template. This realizes online updating of the document template: when documents of a new type need to be classified, the document templates in the template library can be updated with this training procedure, raising the degree of automation in business scenarios that classify document pictures.
In an embodiment, step 305 may specifically include: and determining the average text line position of the kth sample text line position and the maximum overlapping degree text line position, and taking the average text line position as the updated text line position. And taking the text content with higher confidence coefficient in the kth sample text line and the text line with the maximum overlapping degree as the text content of the updated text line, and generating an updated document template.
The principle of the weighted summation of the kth sample text line and the maximum-overlap text line is as follows. As in the description of step 201 above, the key point coordinates of a text line may represent its position; for a text line, the top-left and bottom-right corner points of the smallest bounding box may be used. If those corner points are (x1, y1) and (x2, y2), the text line position is (x1, y1, x2, y2). Suppose the kth sample text line position is (a, b, c, d) and the maximum-overlap text line position is (e, f, g, h). The average text line position is the sum of the corresponding coordinate values divided by 2, so the updated text line position is ((a+e)/2, (b+f)/2, (c+g)/2, (d+h)/2).
For the updated text line content, the OCR algorithm may supply a confidence for each text line's content; a higher confidence indicates a more reliable recognition result. The content with the higher confidence is therefore selected as the content of the updated text line, improving the accuracy of the result.
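The update rule of step 305 (average the positions, keep the higher-confidence content) can be sketched as a single merge function. The dict layout with `box`, `text`, and `conf` keys is an illustrative assumption.

```python
def update_template_line(sample_line, template_line):
    """Merge a matched sample text line into the template: average the
    normalized positions coordinate-wise, keep the content whose OCR
    confidence is higher."""
    box = [(s + t) / 2 for s, t in zip(sample_line["box"], template_line["box"])]
    best = sample_line if sample_line["conf"] >= template_line["conf"] else template_line
    return {"box": box, "text": best["text"], "conf": best["conf"]}

merged = update_template_line(
    {"box": [0.1, 0.2, 0.3, 0.4], "text": "Total", "conf": 0.9},
    {"box": [0.2, 0.2, 0.4, 0.4], "text": "Tota1", "conf": 0.8},
)
# merged keeps the higher-confidence text "Total" and averages the two boxes.
```

Averaging positions over many samples gradually pulls the template toward the typical layout of the document type, while the confidence rule guards against propagating OCR errors into the template.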
Step 306: and performing a training process between the (k + 1) th sample line in the current sample document picture and each text line in the current template, and sequentially traversing all the sample lines in the current sample document picture to obtain the trained document template.
In this step, if it is determined in steps 302 to 305 that the maximum IOU value is less than or equal to the first threshold, the iteration proceeds to the (k+1)-th sample text line in the manner of step 302. Likewise, if the similarity in step 304 is less than or equal to the second threshold, or after the comparison of the kth sample text line is completed in step 305, the (k+1)-th sample text line is processed in the manner of steps 302 to 305. After all sample text lines of the jth sample document picture have been traversed in turn, the current template has been updated into a new document template; after all sample document pictures of the ith type have been traversed in turn, the final document template for the ith type is obtained and stored in the template library.
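Steps 302 to 306 together form one training loop over the sample pictures. A sketch of that loop, with the IOU, similarity, and merge operations injected as parameters so the control flow stays visible, might look as follows; the data layout and function names are assumptions.

```python
def train_template(sample_docs, current_template, iou_fn, sim_fn, merge_fn,
                   iou_threshold=0.8, sim_threshold=0.8):
    """For every sample text line of every remaining sample document picture,
    find the template line with the largest overlap (step 302); if both the
    overlap and the text similarity clear their thresholds (steps 302/304),
    merge the pair into the template (step 305); otherwise move on (step 306)."""
    for doc in sample_docs:                # the remaining m-1 sample pictures
        for sample_line in doc:            # the kth sample text line
            best_idx, best_iou = -1, 0.0
            for i, tmpl_line in enumerate(current_template):
                overlap = iou_fn(sample_line["box"], tmpl_line["box"])
                if overlap > best_iou:
                    best_idx, best_iou = i, overlap
            if best_iou <= iou_threshold:
                continue                   # maximum overlap too small
            tmpl_line = current_template[best_idx]
            if sim_fn(sample_line["text"], tmpl_line["text"]) <= sim_threshold:
                continue                   # positions overlap but contents differ
            current_template[best_idx] = merge_fn(sample_line, tmpl_line)
    return current_template
```

Note that lines which match nothing are simply skipped here; a variant could also append them to the template as new lines, which the patent does not specify.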
Step 307: acquiring a document picture to be processed, and recognizing it with optical character recognition technology to obtain an initial recognition result, where the initial recognition result comprises the initial text line positions and text content in the document picture to be processed. For details, refer to the description of step 201 in the above embodiments.
Step 308: and normalizing the initial text line position to obtain the final text line position of the document picture to be processed, wherein the text line position corresponds to the text content. Reference may be made in detail to the description of step 201 in the above embodiments.
Step 309: and respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result.
In this step, the matching degree represents how alike the document picture to be processed and a document template are: the greater the matching degree, the more alike they are, and the more likely they belong to the same type. Assuming the template library holds document templates of multiple types, the matching degree between the document picture to be processed and each document template in the library is calculated.
In an embodiment, step 309 may specifically include: for each document template, calculating from the text recognition result the overlapping degree and text similarity between each text line in the document picture to be processed and each text line in the document template. For example, for the p-th text line in the document picture to be processed, the overlapping degree and similarity are computed against each text line in the document template one by one, until every text line in the document picture to be processed has been handled. A target line set is then selected from the text recognition result based on the calculated overlapping degrees and text similarities: a target line is a line whose overlapping degree is greater than the first threshold and whose text similarity is greater than the second threshold. The ratio of the number of target lines in the target line set to the total number of text lines in the document template is taken as the matching degree between the document picture to be processed and the document template. One document template corresponds to one target line set, and the matching degree between each document template and the document picture to be processed is calculated in this manner.
In this embodiment, the overlapping degree may be represented by an IOU, the calculation manner and the setting of the first threshold may refer to the manner of the IOU in step 302, and the setting of the text similarity and the second threshold may refer to the manner of the similarity in step 303.
For example, assume the template library includes three types A, B, and C, where the type-A document template has 10 text lines (positions and text content), the type-B document template has 20 text lines, and the type-C document template has 15 text lines. The OCR recognition result of a document picture D to be processed is input. If 6 text lines in D match the type-A document template (i.e., for each pair the IOU of the text line positions exceeds the first threshold and the similarity of the text content exceeds the second threshold), the matching degree between D and the type-A template is 6/10. The matching degrees between D and the type-B and type-C templates are calculated in the same way. This further improves the accuracy of the calculation.
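The matching-degree computation of step 309 can be sketched as below. The overlap and similarity functions and the threshold values are stand-ins for the ones described in steps 302 to 304, and the dict layout is an assumption.

```python
def matching_degree(doc_lines, template_lines, iou_fn, sim_fn,
                    iou_threshold=0.8, sim_threshold=0.8):
    """Count the document text lines that match some template line on both
    overlap and text similarity (the 'target line set'), then divide by the
    total number of text lines in the template."""
    matched = sum(
        1 for d in doc_lines
        if any(iou_fn(d["box"], t["box"]) > iou_threshold
               and sim_fn(d["text"], t["text"]) > sim_threshold
               for t in template_lines)
    )
    return matched / len(template_lines)

# Toy check with exact-match helpers: 2 of 3 document lines hit a 4-line
# template, so the matching degree is 2/4 = 0.5.
deg = matching_degree(
    [{"box": (0, 0, 1, 1), "text": "a"},
     {"box": (0, 1, 1, 2), "text": "b"},
     {"box": (9, 9, 9, 9), "text": "zz"}],
    [{"box": (0, 0, 1, 1), "text": "a"},
     {"box": (0, 1, 1, 2), "text": "b"},
     {"box": (5, 5, 6, 6), "text": "c"},
     {"box": (7, 7, 8, 8), "text": "d"}],
    iou_fn=lambda a, b: 1.0 if a == b else 0.0,
    sim_fn=lambda a, b: 1.0 if a == b else 0.0,
)
# deg -> 0.5
```

Dividing by the template's line count (rather than the document's) means extra lines in the input picture do not inflate the score, matching the 6/10 worked example above.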
Step 310: and selecting the document template with the matching degree larger than a preset threshold value as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In this step, the preset threshold may be set based on the actual scene and derived statistically from empirical data; its value ranges over 0-1 and may, for example, be 0.8. The preset threshold represents the matching degree expected between documents of the same type: only when the matching degree exceeds it are two documents judged to belong to the same type. A target document template whose matching degree with the document picture to be processed exceeds the preset threshold is therefore selected from the template library; its type is the same as that of the document picture to be processed, and that type is output.
In one embodiment, the method further comprises: and when the target document templates are multiple, selecting the document template with the maximum matching degree with the document picture to be processed from the multiple target document templates, and taking the type of the document template with the maximum matching degree as the document type of the document picture to be processed.
In an actual scene, the template library may contain several target document templates whose matching degree with the document picture to be processed exceeds the preset threshold; in that case the type of the target document template with the highest matching degree is output, further improving classification accuracy. If no document template in the library has a matching degree greater than the preset threshold, "other" can be output to prompt the user that no matching type was found.
In an embodiment, after the matching degree between the to-be-processed document picture and each document template in the template library is obtained in step 309, the document template with the largest matching degree may also be selected from the template library as the target document template, and the type of the target document template is used as the document type of the to-be-processed document picture. And the document template with the maximum matching degree is directly selected, so that unnecessary calculation can be reduced.
In an actual scene, the maximum matching degree may also turn out to be smaller than the preset threshold. In that case no document template of the same type as the document picture to be processed exists in the template library, i.e., the current template library cannot identify the picture, and "other" may be output to prompt the user that no matching type was found. If the document template type with the maximum matching degree were output unconditionally, such pictures would be misclassified; requiring the maximum matching degree to exceed the preset threshold avoids outputting a concrete type where "other" should be output.
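The selection logic of steps 309 and 310, including the "other" fallback, reduces to a few lines; the function name, the dict-of-degrees input, and the literal "other" label follow the example output below and are otherwise assumptions.

```python
def classify(matching_degrees, preset_threshold=0.8):
    """Pick the template type with the highest matching degree (step 310);
    fall back to 'other' when even the best degree does not clear the
    preset threshold."""
    best_type, best_degree = max(matching_degrees.items(), key=lambda kv: kv[1])
    return best_type if best_degree > preset_threshold else "other"

# classify({"short_po": 0.9, "long_po": 0.3}) -> "short_po"
# classify({"short_po": 0.2, "long_po": 0.3}) -> "other"
```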
Fig. 4A is a schematic diagram of a short purchase order and fig. 4B is a schematic diagram of a long purchase order. Assuming the only type the template library supports identifying is the short purchase order, the two pictures are input and the output results are:
fig. 4A corresponds to: "class_name": short_po.
fig. 4B corresponds to: "class_name": other.
In an embodiment, if the classification accuracy of the current existing category is to be improved or a new category is to be added, a training phase may be referred to, a small amount of data is re-labeled, online training is performed, and the template library is updated.
Assuming that 3 long purchase order pictures are selected to update the template library online in real time, and after the update the template library supports classification of long purchase orders, the two pictures shown in fig. 4A and 4B are input again and the output results are:
FIG. 4A corresponds to "class_name": short_po.
FIG. 4B corresponds to "class_name": long_po.
Please refer to fig. 5, which shows a document classification apparatus 500 according to an embodiment of the present application. The apparatus can be applied to the electronic device 1 shown in fig. 1 to overcome the shortcoming that conventional picture classification methods cannot handle document pictures, raise the degree of automation in business scenarios that classify document pictures, and improve the accuracy of document type identification. The apparatus comprises a first obtaining module 501 and a classification module 502, whose principal relationship is as follows:
the first obtaining module 501 is configured to obtain a text recognition result of a document picture to be processed. The classification module 502 is configured to classify the document pictures to be processed according to the text recognition result and a preset template library, and output a classification result of the document pictures to be processed, where the template library includes at least one type of document template.
In one embodiment, the first obtaining module 501 is configured to: recognize the document picture to be processed with optical character recognition technology to obtain an initial recognition result, where the initial recognition result comprises the initial text line positions and text content in the document picture to be processed; and normalize the initial text line positions to obtain the final text line positions of the document picture to be processed, the text line positions corresponding to the text content.
In an embodiment, the first obtaining module 501 is further configured to: and determining the base point position in the document picture to be processed. And calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position. Determining a first proportional numerical value between the horizontal coordinate difference value and the horizontal width value of the to-be-processed document picture, determining a second proportional numerical value between the vertical coordinate difference value and the vertical height value of the to-be-processed document picture, taking the first proportional numerical value as the horizontal coordinate of the text row position, and taking the second proportional numerical value as the vertical coordinate of the text row position.
In one embodiment, the classification module 502 is configured to: and respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result. And selecting the document template with the matching degree larger than a preset threshold value as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In one embodiment, the classification module 502 is further configured to: and when the target document templates are multiple, selecting the document template with the maximum matching degree with the document picture to be processed from the multiple target document templates, and taking the type of the document template with the maximum matching degree as the document type of the document picture to be processed.
In one embodiment, the classification module 502 is configured to: and respectively calculating the overlapping degree and the text similarity between each text line in the to-be-processed document picture and each text line in the document template according to the text recognition result aiming at each document template. And selecting a target line set from the text recognition result, wherein the overlapping degree corresponding to a target line in the target line set is greater than a first threshold value, and the text similarity corresponding to the target line is greater than a second threshold value. And taking the ratio of the number of the target lines contained in the target line set to the total number of the text lines in the document template as the matching degree between the document picture to be processed and the document template.
In one embodiment, the classification module 502 is configured to: respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result; and selecting the document template with the maximum matching degree as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
In one embodiment, the maximum value of the matching degree is greater than a predetermined threshold.
For a detailed description of the document classifying device 500, please refer to the description of the related method steps in the above embodiment.
Please refer to fig. 6, which shows a training apparatus 600 for a document template according to an embodiment of the present application. The apparatus can be applied to the electronic device 1 shown in fig. 1 for training the template library, to overcome the shortcoming that conventional picture classification methods cannot handle document pictures, raise the degree of automation in business scenarios that classify document pictures, and improve the accuracy of document type identification. The apparatus comprises a second obtaining module 601, a first calculating module 602, a second calculating module 603, a judging module 604, and an updating module 605, whose principal relationship is as follows:
A second obtaining module 601, configured to obtain the sample text recognition results of a plurality of sample document pictures and a current template, where the plurality of sample document pictures share the same document category and the current template has that same category; a first calculating module 602, configured to calculate, for each of the plurality of sample document pictures, the overlapping degree between the kth sample text line in the current sample document picture and each text line in the current template, and to judge whether the maximum overlapping degree is greater than a first threshold, where k is a positive integer; a second calculating module 603, configured to calculate, when the maximum overlapping degree is greater than the first threshold, the text similarity between the kth sample text line and the maximum-overlap text line, the maximum-overlap text line being the text line in the current template with the largest overlapping degree; a judging module 604, configured to judge whether the text similarity is greater than a second threshold; and an updating module 605, configured to update the current template according to the kth sample text line and the maximum-overlap text line when the text similarity is greater than the second threshold.
In an embodiment, the system further includes an iteration module 606 configured to: and when the maximum overlapping degree is smaller than or equal to a first threshold value or when the text similarity is smaller than or equal to a second threshold value, performing a training process between the (k + 1) th sample text line in the current sample document picture and each text line in the current template, and sequentially traversing all the sample text lines in the current sample document picture to obtain the trained document template.
In one embodiment, the update module 605 is configured to: determine the average of the position of the kth sample text line and the position of the text line with the maximum overlapping degree, and take the average as the updated text line position; and take whichever of the kth sample text line and the text line with the maximum overlapping degree has the higher recognition confidence as the text content of the updated text line, generating an updated document template.
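The update step described above can be sketched as follows — a minimal illustration assuming each line carries a hypothetical `conf` field holding its OCR confidence, alongside `box` and `text`:

```python
def update_template_line(sample_line, template_line):
    """Merge a matched pair: average the box coordinates coordinate-by-coordinate,
    and keep the text content of whichever line has the higher OCR confidence."""
    merged_box = tuple((a + b) / 2 for a, b in zip(sample_line["box"], template_line["box"]))
    winner = sample_line if sample_line["conf"] > template_line["conf"] else template_line
    return {"box": merged_box, "text": winner["text"], "conf": winner["conf"]}
```

Averaging positions lets the template drift toward the typical layout of the category, while the confidence test keeps the more reliable OCR reading of the two.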
In an embodiment, the second obtaining module 601 is configured to: recognize the plurality of sample document pictures by an optical character recognition technology to obtain an initial recognition result, where the initial recognition result includes: the initial text line positions and the text content in the sample document pictures; and normalize the initial text line positions of the plurality of sample document pictures to obtain the final text line positions of the plurality of sample document pictures, where the text line positions correspond to the text content.
In an embodiment, the second obtaining module 601 is further configured to: determine, for each sample document picture, a base point position in the sample document picture; calculate a horizontal coordinate difference and a vertical coordinate difference between each initial text line position and the base point position; determine a first proportional value between the horizontal coordinate difference and the width of the sample document picture and a second proportional value between the vertical coordinate difference and the height of the sample document picture; and take the first proportional value as the horizontal coordinate of the final text line position and the second proportional value as the vertical coordinate of the final text line position.
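The normalization described above can be sketched as follows. This is a minimal illustration assuming the base point is the page's top-left corner and boxes are (x1, y1, x2, y2) tuples; the embodiment leaves the choice of base point open.

```python
def normalize_line_position(box, base_point, page_width, page_height):
    """Express a text-line box as proportions of the page size, measured from a
    base point, so that lines from pages of different sizes become comparable."""
    bx, by = base_point
    x1, y1, x2, y2 = box
    return ((x1 - bx) / page_width, (y1 - by) / page_height,
            (x2 - bx) / page_width, (y2 - by) / page_height)
```

After this step, text lines from scans of different resolutions of the same document category fall at roughly the same normalized coordinates, which is what makes the overlap comparison against the template meaningful.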
For a detailed description of the training apparatus 600 for document templates, please refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory electronic device readable storage medium, including: a program that, when run on an electronic device, causes the electronic device to perform all or part of the steps of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD). The storage medium may also comprise a combination of the above kinds of memory.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (15)

1. A method of classifying a document, comprising:
acquiring a text recognition result of a document picture to be processed;
classifying the document picture to be processed according to the text recognition result and a preset template library, and outputting the classification result of the document picture to be processed, wherein the template library comprises at least one type of document template.
2. The method according to claim 1, wherein the obtaining of the text recognition result of the to-be-processed document picture comprises:
adopting an optical character recognition technology to recognize the document picture to be processed to obtain an initial recognition result, wherein the initial recognition result comprises: the initial text line position and the text content in the document picture to be processed;
and carrying out normalization processing on the initial text line position to obtain the final text line position of the document picture to be processed, wherein the text line position corresponds to the text content.
3. The method according to claim 2, wherein the normalizing the initial text line position to obtain the final text line position of the document picture to be processed comprises:
determining the base point position in the document picture to be processed;
calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position;
determining a first proportional numerical value between the horizontal coordinate difference value and the horizontal width value of the to-be-processed document picture, determining a second proportional numerical value between the vertical coordinate difference value and the vertical height value of the to-be-processed document picture, taking the first proportional numerical value as the horizontal coordinate of the text row position, and taking the second proportional numerical value as the vertical coordinate of the text row position.
4. The method according to claim 1, wherein the classifying the to-be-processed document picture according to the text recognition result and a preset template library, and outputting the classification result of the to-be-processed document picture comprises:
respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result;
and selecting the document template with the matching degree larger than a preset threshold value as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
5. The method of claim 4, further comprising: and when the target document templates are multiple, selecting the document template with the maximum matching degree with the document picture to be processed from the multiple target document templates, and taking the type of the document template with the maximum matching degree as the document type of the document picture to be processed.
6. The method according to claim 4, wherein the calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result comprises:
respectively calculating the overlapping degree and the text similarity between each text line in the document picture to be processed and each text line in the document template according to the text recognition result aiming at each document template;
selecting a target line set from the text recognition result, wherein the overlapping degree corresponding to a target line in the target line set is greater than a first threshold value, and the text similarity corresponding to the target line is greater than a second threshold value;
and taking the ratio of the number of the target lines in the target line set to the total number of the text lines in the document template as the matching degree between the document picture to be processed and the document template.
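The matching-degree computation in claim 6 above can be sketched as follows — a minimal illustration assuming `box`/`text` line dictionaries, with intersection-over-union and a character-level ratio standing in for the unspecified overlap and similarity metrics:

```python
from difflib import SequenceMatcher

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def matching_degree(sample_lines, template_lines, overlap_thresh=0.5, sim_thresh=0.8):
    """Count sample lines whose best-overlapping template line also passes the
    text-similarity test, then divide by the total template line count."""
    target = 0
    for line in sample_lines:
        best = max(template_lines, key=lambda t: iou(line["box"], t["box"]))
        if (iou(line["box"], best["box"]) > overlap_thresh and
                SequenceMatcher(None, line["text"], best["text"]).ratio() > sim_thresh):
            target += 1
    return target / len(template_lines)
```

Dividing by the template's line count rather than the picture's means a picture matching only a fraction of a template's lines scores low against that template.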
7. The method according to claim 1, wherein the classifying the to-be-processed document picture according to the text recognition result and a preset template library, and outputting the classification result of the to-be-processed document picture comprises:
respectively calculating the matching degree between the document picture to be processed and each document template in the template library according to the text recognition result;
and selecting the document template with the maximum matching degree as a target document template, and taking the type of the target document template as the document type of the document picture to be processed.
8. The method of claim 7, wherein the maximum value of the degree of match is greater than a preset threshold.
9. A method for training a document template, comprising: obtaining sample text recognition results of a plurality of sample document pictures and a current template, wherein the document types of the plurality of sample document pictures are the same, and the document type of the current template is the same as that of the plurality of sample document pictures;
for each sample document picture in the plurality of sample document pictures, calculating the overlapping degree between the kth sample text line in the current sample document picture and each text line in the current template, and judging whether the maximum overlapping degree is greater than a first threshold value, wherein k is a positive integer;
when the maximum overlapping degree is larger than the first threshold value, calculating the text similarity between the kth sample text line and a text line with the maximum overlapping degree, wherein the text line with the maximum overlapping degree is the text line in the current template having the maximum overlapping degree with the kth sample text line;
judging whether the text similarity is larger than a second threshold value;
and when the text similarity is larger than the second threshold value, updating the current template according to the kth sample text line and the text line with the maximum overlapping degree.
10. The method of claim 9, further comprising:
when the maximum overlapping degree is smaller than or equal to the first threshold value, or when the text similarity is smaller than or equal to the second threshold value, performing a training process between the (k + 1) th sample text line in the current sample document picture and each text line in the current template, and sequentially traversing all sample text lines in the current sample document picture to obtain the trained document template.
11. The method of claim 9, wherein updating the current template according to the kth sample text line and the maximum overlap text line comprises:
determining the average text line position of the kth sample text line position and the maximum overlapping degree text line position, and taking the average text line position as an updated text line position;
and taking the text content with higher confidence coefficient in the kth sample text line and the text line with the maximum overlapping degree as the text content of the updated text line, and generating an updated document template.
12. The method of claim 9, wherein obtaining sample text recognition results for a plurality of sample document pictures comprises:
adopting an optical character recognition technology to recognize the sample document pictures to obtain an initial recognition result, wherein the initial recognition result comprises the following steps: an initial text line position and text content in the sample document picture;
and normalizing the initial text line positions of the sample document pictures to obtain final text line positions of the sample document pictures, wherein the text line positions correspond to the text content.
13. The method of claim 12, wherein the normalizing the initial text line positions of the sample document pictures to obtain final text line positions of the sample document pictures comprises:
determining a base point position in each sample document picture;
calculating a horizontal coordinate difference value and a vertical coordinate difference value between each initial text line position and the base point position;
determining a first proportional numerical value between the abscissa difference value and the horizontal width value of the sample document picture, determining a second proportional numerical value between the ordinate difference value and the vertical height value of the sample document picture, taking the first proportional numerical value as the abscissa of the final text line position, and taking the second proportional numerical value as the ordinate of the final text line position.
14. An electronic device, comprising:
a memory to store a computer program;
a processor to execute the computer program to implement the method of any one of claims 1 to 13.
15. A non-transitory electronic device readable storage medium, comprising: program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 13.
CN202210471751.6A 2022-04-29 2022-04-29 Document classification method, training method, device and storage medium Pending CN114782973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210471751.6A CN114782973A (en) 2022-04-29 2022-04-29 Document classification method, training method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114782973A true CN114782973A (en) 2022-07-22

Family

ID=82435153

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination