CN108734089B - Method, device, equipment and storage medium for identifying table content in picture file

Method, device, equipment and storage medium for identifying table content in picture file

Info

Publication number
CN108734089B
Authority
CN
China
Prior art keywords
character
picture file
header
target
information
Prior art date
Legal status
Active
Application number
CN201810285135.5A
Other languages
Chinese (zh)
Other versions
CN108734089A (en)
Inventor
王磊
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810285135.5A priority Critical patent/CN108734089B/en
Publication of CN108734089A publication Critical patent/CN108734089A/en
Application granted granted Critical
Publication of CN108734089B publication Critical patent/CN108734089B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/413 - Classification of content, e.g. text, photographs or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/418 - Document matching, e.g. of document images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The invention relates to a method, a device, equipment and a storage medium for identifying table contents in a picture file, and belongs to the technical field of image identification. The method comprises the following steps: acquiring a target picture file to be identified; carrying out character recognition processing on the target picture file to obtain character information in the target picture file; matching the recognized character information with a preset word bank to obtain header characters whose matching degree with the preset word bank is greater than a first threshold; and determining the table contents included in the target picture file according to the character information corresponding to the header characters. Therefore, the table included in the picture can be identified quickly and accurately; the identification accuracy is improved, the time spent on the identification operation is reduced, and the user experience is effectively improved.

Description

Method, device and equipment for identifying table content in picture file and storage medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recognizing table contents in a picture file.
Background
Optical Character Recognition (OCR) is a process of determining the shape of a character in a picture by detecting dark and light patterns and then converting the image of the character into computer text using a character recognition technique. For printed characters, an optical method converts the characters in a picture into a black-and-white dot-matrix image file, and recognition software then converts the characters in the image into a text format that can be further edited and processed by word-processing software.
With the continuous development of computer technology, there is a strong demand to enter pictures into a computer system for convenient use by users, in particular pictures that contain a table. Currently, in the related art, when performing table recognition on a picture that includes a table, the document is generally divided into a plurality of units, the table lines included in each unit are recognized, and after the table structure is obtained, the characters are extracted and recognized from the picture.
However, when the table in the picture is identified by the above method, not only is the algorithm complex, but the identification effect is also greatly influenced by the picture quality, and the detection error rate is high.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
To this end, an embodiment of an aspect of the present invention provides a method for identifying table contents in a picture file, where the method includes: acquiring a target picture file to be identified; carrying out character recognition processing on the target picture file to obtain character information in the target picture file; matching the recognized character information with a preset word bank to obtain header characters whose matching degree with the preset word bank is greater than a first threshold; and determining the table content included in the target picture file according to the character information corresponding to the header characters.
Another embodiment of the present invention provides an apparatus for identifying table contents in a picture file, where the apparatus includes: a first acquisition module, configured to acquire a target picture file to be identified; a processing module, configured to perform character recognition processing on the target picture file to obtain character information in the target picture file; a matching module, configured to match the recognized character information with a preset word bank to obtain header characters whose matching degree with the preset word bank is greater than a first threshold; and a determining module, configured to determine the table content included in the target picture file according to the character information corresponding to the header characters.
In another aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor executes the program, the method for identifying table contents in a picture file is implemented.
In another aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for identifying table contents in a picture file.
According to the method, the device, the equipment and the storage medium for identifying the table content in a picture file provided by the embodiments of the invention, the target picture file to be identified is obtained, character recognition processing is performed on the target picture file to obtain the character information in the target picture file, the recognized character information is then matched with the preset lexicon to obtain the header characters whose matching degree with the preset lexicon is greater than a first threshold, and the table content in the target picture file is determined according to the character information corresponding to the header characters. Therefore, the table included in the picture can be identified quickly and accurately; the identification accuracy is improved, the time spent on the identification operation is reduced, and the user experience is effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart illustrating a method of identifying table content in a picture file according to an exemplary embodiment of the invention;
FIG. 2 is a flowchart illustrating a method of identifying table content in a picture file according to an exemplary embodiment of the invention;
FIG. 3 (a) is a table style diagram shown in accordance with an exemplary embodiment of the present invention;
fig. 3 (b) is a schematic diagram illustrating adding a target picture according to an exemplary embodiment of the present invention;
FIG. 3 (c) is a diagram illustrating the determination of the format of a target picture and the corresponding character content according to an exemplary embodiment of the present invention;
fig. 3 (d) is a schematic diagram illustrating the screening of recognized character contents and the determination result as digital contents according to an exemplary embodiment of the present invention;
FIG. 3 (e) is a pictorial diagram illustrating a corresponding trend line graph plotted against numerical results in accordance with an exemplary embodiment of the present invention;
FIG. 4 is a flow chart illustrating selection of location information and semantics of content characters corresponding to a header character in accordance with an exemplary embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method of identifying table content in a picture file according to an exemplary embodiment of the invention;
fig. 6 is a schematic structural diagram illustrating an apparatus for recognizing table contents in a picture file according to an exemplary embodiment of the present invention;
FIG. 7 is a schematic block diagram of a computer device shown in accordance with an exemplary embodiment of the present invention;
fig. 8 is a schematic structural diagram illustrating a computer device according to an exemplary embodiment of the present invention.
With the above figures, certain embodiments of the invention have been illustrated and are described in more detail below. These drawings and the description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate it for those skilled in the art with reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The embodiments of the invention provide a table identification method aiming at the problems that the existing method for identifying the table content in the picture file has a complex algorithm, the identification effect is greatly influenced by the picture quality, and the detection error rate is high.
According to the method for identifying the table content in a picture file provided by the embodiments of the invention, the target picture file to be identified is first obtained, and character recognition processing is performed on it to obtain the character information in the target picture file. The recognized character information is then matched with a preset lexicon to obtain the header characters whose matching degree with the preset lexicon is greater than a first threshold, and the table content in the target picture file is further determined according to the character information corresponding to the obtained header characters. Therefore, the table included in the picture is identified quickly and accurately; the identification accuracy is improved, the time spent on the identification operation is reduced, and the user experience is effectively improved.
The following describes a method, an apparatus, a device and a storage medium for identifying table contents in a picture file in detail with reference to the accompanying drawings.
First, a method for identifying table contents in a picture file according to an embodiment of the present invention is described in detail with reference to fig. 1.
Fig. 1 is a flowchart illustrating a method of identifying table contents in a picture file according to an exemplary embodiment of the present invention.
As shown in fig. 1, the method for identifying table contents in a picture file may include the following steps:
step 101, obtaining a target picture file to be identified.
Optionally, the method for identifying table content in a picture file provided in the embodiment of the present invention may be executed by the computer device provided in the embodiment of the present invention. The computer equipment is provided with a device for identifying the table content in the picture file, so that the identification process of the table content in the target picture file to be identified is managed or controlled through the device for identifying the table content in the picture file. The computer device of the present embodiment may be any hardware device with data processing function, such as a computer, a personal digital assistant, and so on.
In this embodiment, the target picture file to be identified may be any picture file having table content, which is not specifically limited in this embodiment.
In an optional implementation form of the present application, any picture file with table content may be obtained from a local picture library of the device as a target picture file to be identified; alternatively, a picture file obtaining request with table content may be sent to the server, so as to obtain a target picture file to be identified from the server in real time, and the like, which is not specifically limited herein.
And 102, performing character recognition processing on the target picture file to obtain character information in the target picture file.
In this embodiment, the character information may include character shape, semantic, character position information, and the like, and is not limited herein.
The character shape is used for representing the writing and presenting mode of the character, the character semantic meaning is used for representing the meaning of the character, and the character position information is used for representing the position of the character in the target picture file.
Optionally, after the target picture file is obtained, the apparatus for identifying table contents in the picture file may use a prior-art character recognition technology, such as OCR, to perform character recognition processing on the target picture file so as to obtain the character information in the target picture file.
And 103, matching the recognized character information with a preset word bank to obtain a header character with the matching degree with the preset word bank larger than a first threshold value.
The preset word bank comprises various header characters. The word bank may be obtained by collecting a large number of words and analyzing and processing them; alternatively, it may be manually customized; alternatively, a large number of words related to different fields may be processed to obtain word banks corresponding to the different fields, and the like, which is not specifically limited in this embodiment.
For example, taking a physical examination report in the medical field as an example, the physical examination report generally includes: item, result, reference, unit, etc., and the headers used in the physical examination reports of different hospitals may differ. For instance, the header of the item class typically includes "item", "item name", "item full name", "check item", "Chinese name", etc., and the header of the result class typically includes "result", "inspection result", "detection result", "measurement result", "actual numerical value", "detection value", "quantitative result", and so on. By analyzing and processing the above contents, the preset lexicon corresponding to the medical field can be obtained.
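As an illustration of how such a word bank might be organized (the storage format is not prescribed by this embodiment, so the dictionary layout and the entries below are assumptions), a sketch in Python:

```python
# Minimal sketch of a domain word bank for the medical field: each canonical
# header maps to the header variants collected from physical examination reports.
# The entries are illustrative, not an exhaustive lexicon.
MEDICAL_HEADER_LEXICON = {
    "item": ["item", "item name", "item full name", "check item", "Chinese name"],
    "result": ["result", "inspection result", "detection result",
               "measurement result", "actual numerical value",
               "detection value", "quantitative result"],
    "reference": ["reference", "reference value"],
    "unit": ["unit"],
}

# Flat set of header words actually used for the matching step.
ALL_HEADER_WORDS = {w for variants in MEDICAL_HEADER_LEXICON.values() for w in variants}
```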
In this embodiment, the first threshold may be adaptively set according to actual needs, such as 0.90 or 0.92, which is not specifically limited herein.
In an optional implementation form of the present application, after obtaining the character information in the target picture file, the device for identifying table content in the picture file may perform a matching operation with the identified character information by using a preset lexicon, so as to obtain a header character with a matching degree greater than a first threshold.
For example, after the target picture file is recognized, the character information in the target picture file is determined to be "examination item", "albumin", "weight", and "reference value", and the first threshold is 0.90. If, when the character information is matched with the preset lexicon, the matching degrees of "examination item" and "reference value" with the lexicon are greater than 0.90, then "examination item" and "reference value" can be determined as header characters.
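A minimal sketch of this matching step (the concrete similarity measure is not specified by the embodiment; difflib's SequenceMatcher is used here purely as an illustrative stand-in, with the first threshold set to 0.90):

```python
from difflib import SequenceMatcher

FIRST_THRESHOLD = 0.90

def best_match(word, lexicon_words):
    """Highest matching degree between `word` and any entry of the word bank."""
    return max(SequenceMatcher(None, word.lower(), entry.lower()).ratio()
               for entry in lexicon_words)

def find_header_characters(recognized_words, lexicon_words, threshold=FIRST_THRESHOLD):
    """Keep recognized words whose matching degree with the word bank exceeds the threshold."""
    return [w for w in recognized_words if best_match(w, lexicon_words) > threshold]

# Example from the text: only "examination item" and "reference value" survive.
words = ["examination item", "albumin", "weight", "reference value"]
lexicon = {"examination item", "check item", "reference value", "result", "unit"}
print(find_header_characters(words, lexicon))   # ['examination item', 'reference value']
```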
And 104, determining table contents included in the target picture file according to the character information corresponding to the header characters.
Optionally, after the header character is determined, the device for identifying the table content in the image file may determine the table content included in the target image file according to the character information corresponding to the header character.
In an optional implementation form, first, the character information corresponding to the header character may be analyzed to determine the field to which the header character belongs, and then the table format and the table content in the target picture file may be obtained according to the determined field and the character information corresponding to the header character.
The method for identifying the table content in the picture file, provided by the embodiment of the invention, comprises the steps of obtaining a target picture file to be identified, carrying out character identification processing on the target picture file to obtain character information in the target picture file, then carrying out matching processing on the identified character information and a preset word bank to obtain a table head character with the matching degree with the preset word bank larger than a first threshold value, and further determining the table content in the target picture file according to the character information corresponding to the table head character. Therefore, the form included in the picture can be identified quickly and accurately, the identification accuracy is improved, the time spent in identification operation can be reduced, and the use experience of a user is effectively improved.
As can be seen from the above analysis, in the embodiment of the present invention, the character information in the target picture file is obtained, so as to obtain the header character according to the character information, and then the table content included in the target picture file is determined according to the character information corresponding to the header character. In an optional implementation form, since the obtained character information may include character semantics and character position information, in order to determine the header character more accurately, in this embodiment, a target character set may be determined according to a preset lexicon, and then a table style corresponding to the character information is determined according to the character semantics in the character information, so that the target position information is determined according to the table style, and then the header character is obtained according to the target position information and the position information corresponding to the target character set. The above process of the method for identifying table contents in a picture file according to the present invention is specifically described below with reference to fig. 2.
As shown in fig. 2, the method for identifying table contents in a picture file may include the following steps:
step 201, obtaining a target picture file to be identified.
Step 202, performing character recognition processing on the target picture file to obtain character information in the target picture file.
Wherein the character information includes: character semantics and character position information. The character position information may include a first direction coordinate and a second direction coordinate of the character in the target picture file.
In practical use, a coordinate system may be defined for the target picture file, for example, the upper left corner of the target picture file is the origin of the coordinate system, starting from the origin, the right direction is the positive direction of the X axis, and the downward direction is the positive direction of the y axis. Accordingly, the first direction coordinate may be an X-axis coordinate, and the second direction coordinate may be a Y-axis coordinate; alternatively, the first direction coordinate may be a Y-axis coordinate, and the second direction coordinate may be an X-axis coordinate, which is not particularly limited in this embodiment.
In this embodiment, the target picture file may be a picture in any format, such as BMP, TIF, JPG, PDF, etc., which is not limited herein.
In an alternative implementation, a prior-art character recognition technique, such as OCR, may be used to perform character recognition processing on the target picture file so as to obtain the character semantics and the character position information in the target picture file.
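A minimal sketch of the character information produced by this step, assuming the OCR engine reports, for each recognized word, its text together with a bounding box (the field names and the concrete values below are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class RecognizedCharacter:
    text: str   # character semantics: the recognized word itself
    x: float    # first-direction coordinate (X axis, origin at the top-left corner)
    y: float    # second-direction coordinate (Y axis, pointing downwards)
    w: float    # width in the first direction
    h: float    # width in the second direction

# Illustrative output for a physical examination report picture.
characters = [
    RecognizedCharacter("examination item", x=120, y=80,  w=90, h=18),
    RecognizedCharacter("blood pressure",   x=120, y=110, w=80, h=18),
    RecognizedCharacter("reference value",  x=400, y=80,  w=85, h=18),
]
```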
And 203, matching the recognized character information with a preset word bank to obtain a target character set with the matching degree with the preset word bank larger than a first threshold value.
Optionally, in this embodiment, after obtaining the character semantics and the character position information in the target picture file, the device for identifying the table content in the picture file may perform a matching operation with the character information by using a preset lexicon, so as to obtain the target character set with the matching degree greater than the first threshold.
In the practical application process, the target picture file may relate to any field, so that in order to improve the accuracy of identifying the target picture file, in this embodiment, before matching the character information with the preset lexicon, the semantic meaning of the identified character may be analyzed first to determine the corresponding target lexicon according to the semantic meaning of the character. That is to say, the character semantics are analyzed to determine the preset word bank corresponding to the field according to the field to which the character semantics belong, so that the identification accuracy of the target picture file is effectively improved.
For example, if the analyzed character semantics mainly relate to the medical field, the preset word bank corresponding to the medical field may be determined as the target word bank; for another example, if the analyzed character semantics mainly relate to a financial field, the preset lexicon corresponding to the financial field may be determined as the target lexicon.
Further, after the target lexicon corresponding to the character information is determined, the device for identifying the table content in the picture file can match the characters included in the character information with the target lexicon to obtain the matching degree between the characters and the target lexicon, then the matching degree is compared with a first threshold value, and the characters with the matching degree larger than the first threshold value are used as a target character set.
For example, suppose that after the target picture file is recognized, the characters included in it are determined to be "examination item", "albumin", "weight", and "reference value", and the first threshold is 0.90. After the semantics of the characters are analyzed, it is determined that the characters in the target picture file relate to the medical field, so the word bank of the medical field can be obtained; the matching degree of each character with the word bank of the medical field is then judged, and if the matching degrees of "examination item" and "reference value" with the preset word bank are greater than 0.90, "examination item" and "reference value" can be determined as the target character set.
And step 204, determining a table style corresponding to the character information according to the character semantics in the character information.
Step 205, determining the target position information according to the table style.
In an optional implementation form, because table styles in different fields are different, in order to accurately and reliably obtain a header character, the embodiment may first determine a table style corresponding to character information according to character semantics in the character information in a target picture file; and then determining the target position information according to the determined form style.
That is, when the fields to which the character semantics in the character information of the target picture file relate are different, the table styles included in the target picture file are also different. For example, if the character semantics in the character information of the target picture file relate to the medical field, the table style may be determined as shown in fig. 3 (a), so that the position relationship between the header characters can be determined, namely that their row coordinates are the same.
And step 206, acquiring the header characters from the target character set according to the target position information and the position information corresponding to the target character set.
Optionally, after the target position information is determined, the device for identifying the table content in the picture file may obtain the header character from the target character set according to the target position information and the position information corresponding to the target character set.
As an optional implementation manner, when it is determined that the second-direction coordinates of the header characters are the same, and the second direction is the Y-axis direction, the device for identifying the table content in the picture file may screen the header characters from the target character set according to the rule that the Y-axis coordinates are the same; or, when it is determined that the first-direction coordinates of the header characters are the same, that is, the X-axis coordinates are the same, the device for identifying table contents in the picture file may screen the header characters from the target character set according to the rule that the X-axis coordinates are the same.
For example, if the position information of the character and the character in the target character set are: "serial number, (X1, Y1)", "examination item, (X2, Y1)", "blood pressure, (X2, Y2)", "examination result, (X3, Y1)", "45, (X3, Y3)", "reference value, (X4, Y1)", and the determined target position information is: the Y-axis direction coordinates are the same. Then, the device for identifying table contents in the picture file can determine that the header characters are respectively as follows according to the rule that the Y-axis direction coordinates are the same: "serial number, (X1, Y1)", "inspection item, (X2, Y1)", "inspection result, (X3, Y1)" and "reference value, (X4, Y1)".
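A minimal sketch of this screening rule, assuming the target position information states that the header characters share one Y-axis coordinate; picking the largest same-Y group is an illustrative way to apply that rule (tuples of (text, x, y) stand in for the recognized characters):

```python
from collections import defaultdict

def pick_header_row(target_characters):
    """Group candidate header characters by their Y coordinate and return the
    largest group, i.e. the candidates that lie on one common row."""
    rows = defaultdict(list)
    for text, x, y in target_characters:
        rows[y].append((text, x, y))
    return max(rows.values(), key=len)

candidates = [
    ("serial number", 10, 1), ("examination item", 50, 1),
    ("blood pressure", 50, 2), ("examination result", 90, 1),
    ("45", 90, 3), ("reference value", 130, 1),
]
print(pick_header_row(candidates))
# [('serial number', 10, 1), ('examination item', 50, 1),
#  ('examination result', 90, 1), ('reference value', 130, 1)]
```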
And step 207, selecting the position information and the semantics of the content character corresponding to the header character from the character information according to the position information and the semantics of the header character.
Specifically, after the header character is determined, in order to perfect the table content, the device for identifying the table content in the picture file may select, according to the position information and the semantics of the header character, a character corresponding to the position and the semantics of the header character from the character information as the content character, and acquire the position information and the semantics of the content character.
To explain the above clearly, the process of selecting the position information and the semantics of the content characters corresponding to the header characters from the character information according to the position information and the semantics of the header characters is described in detail below with reference to fig. 4.
In this embodiment, the character position information includes a first direction coordinate and a second direction coordinate of the character.
As shown in fig. 4, selecting the position information and semantics of the content character corresponding to the header character may include the following steps:
step 401, determining a first direction coordinate range or a second direction coordinate range of the content character according to the first direction coordinate or the second direction coordinate of any header character.
Optionally, after the device for identifying table content in the picture file determines the header character, the first direction coordinate range or the second direction coordinate range of the content character may be determined according to the position information of the header character.
In practical applications, the position information of the header character determined by the device for recognizing table contents in the picture file may contain an error, and the lengths of the content characters may differ. Therefore, in order to accurately acquire the content characters corresponding to a header character, this embodiment may determine the first-direction coordinate range or the second-direction coordinate range of the content characters according to the position information of the header character. For example, a tolerance may be added around the first-direction coordinate and the second-direction coordinate of the header character; that is, when the position information of the header character is (X, Y), the first-direction coordinate range of the content characters may be determined to be (X - Δ, X + Δ), or the second-direction coordinate range may be determined to be (Y - Δ, Y + Δ), and the like, which is not particularly limited in this example.
It can be understood that, when determining the coordinate range of the content character, the position relationship between the content character and the header character may be determined according to the position relationship of each header character, and then the coordinate range corresponding to the content character may be determined.
For example, if the header characters are determined to be located in the same row according to their positions, the content characters corresponding to a header character can be determined to lie close to that header character in the X direction (i.e., in the same column), so that the X-coordinate range of each content character can be determined to be (X1 - Δ, X1 + Δ) according to the X-direction coordinate X1 of its header character.
Correspondingly, if the header characters are determined to be located in the same column according to their positions, the content characters corresponding to a header character can be determined to lie close to that header character in the Y direction (i.e., in the same row), so that the Y-coordinate range of each content character can be determined to be (Y1 - Δ, Y1 + Δ) according to the Y-direction coordinate Y1 of its header character.
Step 402, an initial character set with position information conforming to the first direction coordinate range or the second direction coordinate range is selected from the character information.
Step 403, according to the semantic meaning of any one of the header characters, selecting a character matched with the semantic meaning of any one of the header characters from the initial character set as a content character corresponding to any one of the header characters.
Specifically, in order to improve the accuracy of obtaining the content characters corresponding to the header characters, the device for identifying the table content in the picture file according to this embodiment may further analyze the initially selected character set based on the semantics of the header characters after obtaining the initially selected character set, so as to select the characters matching the semantics of the header characters from the initially selected character set as the content characters corresponding to the header characters.
It can be understood that, in this embodiment, the coordinate range of the content characters is first determined according to the position information of the header character, so that an initially selected character set whose position information lies in the first-direction coordinate range or the second-direction coordinate range is selected from the character information; then, according to the semantics of the header character, the characters matching the semantics of that header character are selected from the initially selected character set as the content characters corresponding to the header character, which effectively improves the accuracy of obtaining the content characters corresponding to the header characters.
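A minimal sketch of steps 401-403, assuming a layout where a content character shares (within a tolerance Δ) the X coordinate of its header and a caller-supplied predicate stands in for the semantic check; both the tolerance value and the predicate are illustrative assumptions:

```python
from collections import namedtuple

Char = namedtuple("Char", "text x y")   # recognized word with its position

DELTA = 5.0   # tolerance added around the header coordinate (illustrative)

def select_content_characters(header, characters, semantics_ok, delta=DELTA):
    """Pick characters whose X coordinate lies in (header.x - delta, header.x + delta)
    and whose semantics match the header."""
    x_low, x_high = header.x - delta, header.x + delta
    primary = [c for c in characters if x_low < c.x < x_high and c != header]  # step 402
    return [c for c in primary if semantics_ok(header.text, c.text)]           # step 403

# Illustrative usage: numeric strings are accepted under the "result" header.
chars = [Char("result", 90, 1), Char("45", 91, 3), Char("blood pressure", 50, 2)]
def is_numeric_result(header_text, text):
    return header_text == "result" and text.replace(".", "", 1).isdigit()

print(select_content_characters(Char("result", 90, 1), chars, is_numeric_result))
# [Char(text='45', x=91, y=3)]
```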
And step 208, generating a table included in the target picture file according to the position information and the semantics of the header characters and the position information and the semantics of the content characters.
Specifically, after the position information and the semantics of the header characters of the target picture file and the position information and the semantics of the content characters are acquired, the device for identifying the table content in the picture file can generate the table included in the target picture file according to the information.
In practical applications, because the characters in a table may be spread out for alignment and similar reasons, a single word may be split into two fragments that are far apart in the layout and is then easily recognized as two independent characters. For example, the word "item" (the Chinese 项目) may be recognized as its two component characters; likewise, the word "unit" (the Chinese 单位) may be recognized as two separate characters.
Alternatively, because the field to which the table belongs contains many words with the same semantics, in order to unify words with different expressions into relatively uniform structured data that is convenient to store and use later, the apparatus for identifying table contents in the picture file of this embodiment may normalize words with the same semantics by combining semantic analysis of the header characters, synonym merging, and the like. For example, if the physical examination report template includes the four items "item", "result", "reference value" and "unit", then words such as "item", "item name", "item full name", "check item" and "Chinese name" can all be categorized as "item", and so on.
Furthermore, error correction can be performed on the header characters and the content characters by using the preset word bank or the semantics of the merged words. For example, if "item" is misrecognized as a similar-looking but incorrect string of characters, semantic analysis can reveal the wrong character, and the recognized string can then be corrected to "item".
That is to say, before generating the table included in the target picture file according to the position information and the semantics of the header characters and the position information and the semantics of the content characters, the apparatus for identifying the table content in the picture file according to this embodiment can perform normalization and word merging processing on the header characters and the content characters by using the preset lexicon, so that the subsequent management and processing are more convenient.
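A minimal sketch of the normalization and word-merging described above; the synonym table and the fragment-merging rule are illustrative assumptions (the example fragments echo a split "unit"):

```python
# Map header variants with the same semantics onto one canonical header.
CANONICAL_HEADERS = {
    "item": "item", "item name": "item", "item full name": "item",
    "check item": "item", "Chinese name": "item",
    "result": "result", "inspection result": "result", "detection result": "result",
}

# Adjacent fragments that should be re-joined into one word (cf. a split "unit").
MERGE_PAIRS = {("uni", "t"): "unit"}

def normalize(words):
    """Merge known split fragments, then map every word to its canonical form."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in MERGE_PAIRS:
            merged.append(MERGE_PAIRS[(words[i], words[i + 1])])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return [CANONICAL_HEADERS.get(w, w) for w in merged]

print(normalize(["check item", "uni", "t", "inspection result"]))
# ['item', 'unit', 'result']
```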
The above embodiments are further explained below with reference to fig. 3 (b) -3 (e):
If the target picture file in this embodiment is a paper physical examination report collected by a user with a device in different time periods, a corresponding electronic physical examination archive can be established for the user's paper physical examination reports. The user may add the photographed paper physical examination report to an application for establishing the electronic physical examination archive, as shown in fig. 3 (b). After the application detects the physical examination report picture added by the user, a character recognition function may be used to determine the table style and the corresponding character contents of the physical examination report from the picture, as shown in fig. 3 (c). Further, to help the user understand the level and trend of each index in the physical examination report, the recognized character contents may be screened to select the contents with numerical results, as shown in fig. 3 (d). Then, according to an extracted numerical result, for example total bilirubin, a trend line graph corresponding to total bilirubin is drawn, as shown in fig. 3 (e), so that the user can clearly see from the generated trend line whether each index is at a normal level.
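A hedged sketch of the screening and plotting flow of fig. 3 (d) and fig. 3 (e); the example records, the unit label and the use of matplotlib are illustrative assumptions, not part of the embodiment:

```python
import matplotlib.pyplot as plt

# Illustrative "total bilirubin" results extracted from several report pictures:
# (examination date, recognized result string).
records = [("2017-06", "11.2"), ("2017-12", "13.8"),
           ("2018-03", "positive"), ("2018-09", "12.5")]

def numeric_results(rows):
    """Keep only rows whose result cell parses as a number (the screening in fig. 3 (d))."""
    kept = []
    for date, value in rows:
        try:
            kept.append((date, float(value)))
        except ValueError:
            pass   # non-numeric results are dropped
    return kept

dates, values = zip(*numeric_results(records))
plt.plot(dates, values, marker="o")       # trend line as in fig. 3 (e)
plt.title("total bilirubin")
plt.ylabel("result (unit assumed)")
plt.savefig("total_bilirubin_trend.png")
```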
According to the method for identifying the table content in a picture file provided by the embodiment of the invention, the target picture file to be identified is first obtained, and character recognition processing is performed on it to obtain the character information in the target picture file. The recognized character information is matched with the preset word bank to obtain the target character set whose matching degree with the preset word bank is greater than the first threshold. The table style corresponding to the character information is then determined according to the character semantics in the character information, the target position information is determined according to the table style, and the header characters are obtained from the target character set according to the target position information and the position information corresponding to the target character set. The position information and semantics of the content characters corresponding to the header characters are then selected from the character information according to the position information and semantics of the header characters, and the table included in the target picture file is generated according to the position information and semantics of the header characters and of the content characters. Therefore, the table included in the picture can be identified quickly and accurately; the identification accuracy is improved, the time spent on the identification operation is reduced, and the user experience is effectively improved.
Through the analysis, the embodiment of the invention can determine the header characters according to the character position information by acquiring the character position information of the target picture file, and then select the position information and the semantics of the content characters corresponding to the header characters from the character information according to the position information and the semantics of the header characters, so as to generate the table included in the target picture file according to the position information and the semantics of the header characters and the position information and the semantics of the content characters. In a specific implementation, because the character position information recognized by the device for recognizing the table content in the picture file may have an error in the first direction coordinate or the second direction coordinate of the included character, when determining the header character or the content character according to the first direction coordinate or the second direction coordinate of the character, a situation that the character type recognition error is caused due to an error in the character position information may occur.
In actual use, the character position information recognized by the device for recognizing the table content in the picture file also comprises the width of the character in the first direction and the width of the character in the second direction. Therefore, in this embodiment, when determining whether the character is a header character, the position information of the character may be corrected according to the coordinates of the character in the first direction and the coordinates of the character in the second direction, the width of the character in the first direction and the width of the character in the second direction, and then the header character or the content character may be determined according to the corrected position information. The above process of the method for identifying table contents in a picture file according to the present invention is specifically described below with reference to fig. 5.
Fig. 5 is a flowchart illustrating a method of identifying table contents in a picture file according to an exemplary embodiment of the present invention.
It should be noted that, in order to describe the embodiment more clearly, a coordinate system may be defined for the target picture file first, for example, taking the upper left corner of the target picture file as an origin of coordinates, starting from the origin to the right as a positive X-axis direction, and starting from the origin to the bottom as a positive Y-axis direction. Accordingly, the first direction coordinate may be defined as an X-axis coordinate (i.e., abscissa) and the second direction coordinate may be defined as a Y-axis coordinate (i.e., ordinate), so that the detailed description of the embodiment according to the above definition is realized.
As shown in fig. 5, the method for identifying table contents in a picture file may include the following steps:
step 501, obtaining a target picture file to be identified.
Step 502, performing character recognition processing on the target picture file to obtain character information in the target picture file.
The character information comprises character semantics and character position information, and the character position information comprises a first direction coordinate (namely an X-axis coordinate) and a second direction coordinate (namely a Y-axis coordinate) of a character in a target picture file, the width of the character in the first direction and the width of the character in the second direction.
Since the implementation of this step is similar to the implementation of the above example, it is not described in detail herein, specifically refer to step 102 or step 202.
Step 503, sequentially traversing the character information according to the sequence of the coordinates in the first direction from small to large, judging whether the coordinates of the jth character and the ith character in the second direction are the same, if not, executing step 504, otherwise, executing step 508.
The difference value of the first direction coordinates of adjacent characters between the jth character and the ith character is within a preset range, i and j are positive integers, and j is larger than i.
The preset range in this embodiment may be adaptively set according to the position information between actual characters, for example, the preset range may be determined according to a character width, or determined according to a common character interval, which is not specifically limited in this embodiment.
Optionally, after the character information in the target picture file is acquired, the device for identifying the table content in the picture file may sequentially traverse according to a sequence from small to large of the first-direction coordinates of the character information, so as to determine whether the coordinates of the jth character and the ith character in the second direction are the same. If the characters are the same, the characters are in the same line, and if the characters are different, the characters are in different lines.
The jth character and the ith character may be any two different characters in the recognized characters, which is not specifically limited in this embodiment.
That is, by taking the X-axis coordinates of the characters in the character information as arguments, it is determined whether the Y-axis coordinates of the characters are the same in order from small to large. When the Y-axis coordinates of the characters are the same, the characters can be determined to be in the same row; when the Y-axis coordinates of the characters are different, the characters can be determined to be in different rows, and therefore whether the position information of each coordinate is correct or not can be accurately judged.
For example, if the first direction coordinate of the character a is X1, the second direction coordinate is Y1, the first direction coordinate of the character B is X2, and the second direction coordinate is Y2, then when Y1= Y2, it indicates that the character a and the character B are in the same row; when Y1 ≠ Y2, it indicates that character A and character B are in different rows.
And step 504, determining the coincidence degree of the jth character and the ith character in the second direction according to the width of the jth character in the second direction and the width of the ith character in the second direction.
In a specific implementation, the overlap ratio of the jth character and the ith character in the second direction can be determined through formula (1):
coincidence(ri, rj) = min(yi + hi, yj + hj) - max(yi, yj)    (1)
where ri denotes an ith character, yi denotes a second direction coordinate of the ith character, hi denotes a width of the ith character in the second direction, rj denotes a jth character, yj denotes a second direction coordinate of the jth character, and hj denotes a width of the jth character in the second direction.
Step 505, it is determined whether the coincidence degree of the jth character and the ith character in the second direction is greater than a third threshold, if so, step 506 is executed, otherwise, step 508 is executed.
The third threshold may be a value obtained empirically, which is not specifically limited in this embodiment. For example, the third threshold may be set to half of the minimum value of the widths of the ith character and the jth character in the second direction.
For example, suppose the third threshold is half of the minimum of the widths of the jth character and the ith character in the second direction, the coordinate of the jth character in the second direction is 2, the width of the jth character in the second direction is 1, the coordinate of the ith character in the second direction is 2.1, and the width of the ith character in the second direction is 0.9. Then, based on the above formula (1), it can be determined that the coincidence degree of the jth character and the ith character is greater than the third threshold.
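A minimal sketch of formula (1) and the threshold check, using the numbers of this example; treating the coincidence degree as the intersection of the two second-direction spans is an assumption consistent with the definitions given above:

```python
def second_direction_coincidence(y_i, h_i, y_j, h_j):
    """Coincidence of the spans [y_i, y_i + h_i] and [y_j, y_j + h_j] in the
    second (Y) direction, per formula (1)."""
    return min(y_i + h_i, y_j + h_j) - max(y_i, y_j)

# Numbers from the example: jth character (y=2, h=1), ith character (y=2.1, h=0.9).
y_j, h_j = 2.0, 1.0
y_i, h_i = 2.1, 0.9
coincidence = second_direction_coincidence(y_i, h_i, y_j, h_j)
third_threshold = 0.5 * min(h_i, h_j)   # half of the smaller second-direction width
print(round(coincidence, 3), third_threshold, coincidence > third_threshold)  # 0.9 0.45 True
```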
Step 506, correcting the coordinates of the jth character and/or the ith character in the second direction.
Specifically, when it is determined that the coincidence degree of the jth character and the ith character in the second direction is greater than the third threshold value, it is indicated that the jth character and the ith character are actually in the same row. Therefore, in order to reduce the possibility of misjudgment of the character types subsequently, for the jth character and the ith character with the coincidence degree larger than the third threshold value, the coordinates of the jth character and/or the ith character in the second direction are corrected according to the alignment method, so that the coordinates of the jth character and the ith character in the second direction are the same.
When the coordinates of the jth character and/or the ith character in the second direction are corrected, in order to improve the correction effect, in this embodiment, a target second direction coordinate range is determined according to the second direction coordinates of the ith character and the second direction coordinates of the jth character, and then a plurality of characters in the target second direction coordinate range are selected from the recognized characters, so that the coordinates of the jth character and/or the ith character in the second direction are corrected according to the selected second direction coordinates of the plurality of characters.
Namely: determining a target second direction coordinate range according to the second direction coordinate of the ith character and the second direction coordinate of the jth character;
selecting k characters of which the second direction coordinates belong to a target second direction coordinate range;
and correcting the coordinate of the jth character and/or the ith character in the second direction according to the second direction coordinate of the k characters.
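A minimal sketch of these three steps, assuming the corrected coordinate is taken as the mean of the k characters found in the target range and that the range is widened by a small tolerance; both choices are illustrative, since the embodiment does not fix the exact statistic:

```python
from statistics import mean

def align_second_direction(chars, i, j, tolerance=1.0):
    """Align the Y coordinates of characters i and j that were found to overlap.

    `chars` is a list of [text, x, y] entries. The target Y range is built from
    the two characters' Y coordinates and widened by an illustrative tolerance."""
    y_i, y_j = chars[i][2], chars[j][2]
    low, high = min(y_i, y_j) - tolerance, max(y_i, y_j) + tolerance
    # Select the k characters whose Y coordinate falls inside the target range.
    in_range = [c[2] for c in chars if low <= c[2] <= high]
    corrected = mean(in_range)
    chars[i][2] = chars[j][2] = corrected   # both characters now share one row coordinate
    return corrected

row = [["serial number", 10, 2.0], ["examination item", 50, 2.1], ["reference value", 130, 2.05]]
print(align_second_direction(row, i=0, j=1))   # approximately 2.05
```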
It can be understood that, in this embodiment, the coordinate of the jth character and/or the ith character in the second direction may be corrected by correcting the coordinate of the jth character in the second direction; or, correcting the coordinate of the ith character in the second direction; alternatively, the coordinates of the jth character and the ith character in the second direction are corrected, which is not particularly limited in this embodiment.
In step 507, the jth character is in a different line from the ith character.
In step 508, the jth character is in the same row as the ith character.
In the actual application process, distortion may be introduced when the target picture file is processed for identification, or errors may occur in the recognition of the character position information in the character information, so that errors may occur when judging whether the coordinates of the characters in the second direction are the same.
In order to reduce inaccurate judgment results caused by the above defects, in one possible implementation scenario of the present invention, the device for identifying table contents in the picture file may select, from the character information, the ith character whose second-direction coordinate is closest to that of the jth character, and compare the two to determine whether the coordinates of the jth character and the ith character in the second direction are the same. If they are the same, it is determined that the jth character and the ith character are in the same row. If they are not the same, the coincidence degree of the jth character and the ith character in the second direction is determined according to the width of the jth character in the second direction and the width of the ith character in the second direction, and it is judged whether the coincidence degree is greater than the third threshold. If so, the jth character and the ith character are actually in the same row, and in order to avoid subsequent misjudgment in character type identification, the coordinate of the jth character and/or the ith character in the second direction is corrected so that the coordinates of the jth character and the ith character in the second direction are the same; otherwise, they are in different rows.
Step 509, matching the character information with a preset lexicon to obtain a target character set with a matching degree with the preset lexicon being greater than a first threshold.
Step 510, determining a table style corresponding to the character information according to the character semantics in the character information.
Step 511, determining the target position information according to the form style.
And step 512, acquiring the header characters from the target character set according to the target position information and the position information corresponding to the target character set.
Step 513, according to the position information and semantics of the header character, selecting the position information and semantics of the content character corresponding to the header character from the character information.
Step 514, generating a table included in the target picture file according to the position information and the semantics of the header character and the position information and the semantics of the content character.
It should be noted that, for the specific implementation process and principle of the steps 509 to 514, reference may be made to the detailed description of the foregoing embodiments, and details are not described herein again.
Similarly, when the device for identifying table contents in the picture file corrects the position information of a character according to the width of the character in the first direction, the coordinate of the character in the first direction and the coordinate of the character in the second direction, the process is similar to the process described above, except that:
when traversing operation is carried out on the character information obtained by identifying and processing the target picture file, sequentially traversing is carried out according to the sequence of the coordinates in the second direction (namely Y-axis coordinates) from small to large, and whether the coordinates (namely X-axis coordinates) of the jth character and the ith character in the first direction are the same or not is judged. If the characters are the same, the j character and the i character are in the same row, and if the characters are not the same, determining the coincidence degree of the j character and the i character in the first direction according to the width of the j character in the first direction and the width of the i character in the first direction, and determining whether the coincidence degree is larger than a third threshold value.
If the contact ratio is smaller than a third threshold value, the jth character and the ith character are in different columns; if it is determined that the degree of overlap is greater than the third threshold, it may be determined that the coordinates of the jth character and the ith character in the first direction are actually the same, and therefore, in order to reduce the probability of misjudgment of subsequent character type identification, the present embodiment may correct the coordinates of the jth character or the ith character in the first direction, so that the coordinates of the jth character or the ith character in the first direction are the same. Specifically, when the coordinate of the jth character or the ith character in the first direction is corrected, the method may include: determining a target first-direction coordinate range according to the first-direction coordinate of the jth character and the first-direction coordinate of the ith character; selecting m characters of which the first-direction coordinates belong to a target first-direction coordinate range; and correcting the coordinates of the jth character and/or the ith character in the first direction according to the first direction coordinates of the m characters.
The third threshold may be, for example, half of the minimum of the widths of the jth character and the ith character in the first direction, which is not specifically limited in this embodiment.
It can be understood that correcting the coordinate of the jth character and/or the ith character in the first direction in this embodiment may mean correcting the coordinate of the jth character in the first direction; or correcting the coordinate of the ith character in the first direction; or correcting the coordinates of both the jth character and the ith character in the first direction, which is not specifically limited in this embodiment.
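A minimal Python sketch of the overlap-based correction just described, under the assumption that each character carries its first-direction coordinate and its width in the first direction; snapping the whole group of in-range characters to their average coordinate is only one possible correction rule and is not prescribed by the embodiment.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    x: float   # first-direction (X-axis) coordinate
    y: float   # second-direction (Y-axis) coordinate
    w: float   # width in the first direction

def overlap_in_x(a: Box, b: Box) -> float:
    """Length of the overlap of the two characters' extents in the first direction."""
    return max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))

def correct_x(chars: List[Box]) -> None:
    """Traverse in ascending order of the second-direction coordinate; if two
    characters' X extents overlap by more than half the narrower width (the
    'third threshold' above), snap both to a shared X coordinate."""
    chars.sort(key=lambda c: c.y)
    for i in range(len(chars)):
        for j in range(i + 1, len(chars)):
            a, b = chars[i], chars[j]
            if a.x == b.x:
                continue                       # already in the same column
            threshold = 0.5 * min(a.w, b.w)
            if overlap_in_x(a, b) > threshold:
                lo, hi = min(a.x, b.x), max(a.x, b.x)
                group = [c for c in chars if lo <= c.x <= hi]
                a.x = b.x = sum(c.x for c in group) / len(group)
```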
In the method for identifying table content in a picture file provided by the embodiment of the present invention, character recognition is performed on the target picture file to obtain, for each character, its first-direction coordinate and second-direction coordinate in the target picture, its width in the first direction and its width in the second direction. The characters are then traversed in ascending order of the first-direction coordinate to judge whether the coordinates of the jth character and the ith character in the second direction are the same; if they are not the same, the coincidence degree of the jth character and the ith character in the second direction is determined, it is judged whether the coincidence degree is greater than a threshold, and if so, the coordinate of the jth character or the ith character in the second direction is corrected. The corrected character information is matched with the preset lexicon to obtain a target character set, the table style is determined according to the character semantics in the character information, the target position information is determined according to the table style, the header characters are obtained according to the target position information and the position information of the target character set, the content characters corresponding to the header characters are selected according to the position information and the semantics of the header characters, and the table included in the target picture file is generated from the position information and the semantics of the header characters and of the content characters. Therefore, the table included in the picture can be identified quickly and accurately, the identification accuracy is improved, the time spent on the identification operation can be reduced, the use experience of the user is effectively improved, and convenience is provided for the user's subsequent use.
In an exemplary embodiment, an apparatus for identifying table contents in a picture file is also provided.
Fig. 6 is a schematic structural diagram illustrating an apparatus for identifying table contents in a picture file according to an exemplary embodiment of the present invention.
Referring to fig. 6, the apparatus for identifying table contents in a picture file according to the present invention includes: a first obtaining module 110, a processing module 120, a matching module 130, and a determining module 140.
The first obtaining module 110 is configured to obtain a target picture file to be identified;
the processing module 120 is configured to perform character recognition processing on the target picture file to obtain character information in the target picture file;
the matching module 130 is configured to perform matching processing on the identified character information and a preset word bank to obtain a header character with a matching degree with the preset word bank being greater than a first threshold;
the determining module 140 is configured to determine table content included in the target picture file according to character information corresponding to the header character.
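Purely as an illustration of how the four modules of Fig. 6 could be wired together, the Python sketch below uses a placeholder OCR callable, a (text, x, y) tuple layout, and a difflib-based similarity as a stand-in for the matching degree; none of these choices come from the patent itself.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude stand-in for the matching degree against the preset word bank."""
    return SequenceMatcher(None, a, b).ratio()

class TableContentRecognizer:
    """Illustrative wiring of the four modules shown in Fig. 6."""

    def __init__(self, lexicon, ocr_engine):
        self.lexicon = lexicon        # preset word bank (list of header terms)
        self.ocr = ocr_engine         # any OCR callable: bytes -> [(text, x, y)]

    def acquire(self, path: str) -> bytes:                  # first obtaining module 110
        with open(path, "rb") as f:
            return f.read()

    def recognize(self, image: bytes):                      # processing module 120
        return self.ocr(image)

    def match_headers(self, chars, first_threshold=0.8):    # matching module 130
        return [c for c in chars
                if max(similarity(c[0], w) for w in self.lexicon) > first_threshold]

    def determine(self, chars, headers, x_tol=10.0):        # determining module 140
        return {h[0]: [c[0] for c in chars
                       if c not in headers and abs(c[1] - h[1]) <= x_tol]
                for h in headers}
```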
It should be noted that the foregoing explanation of the embodiment of the method for identifying table content in a picture file is also applicable to the apparatus for identifying table content in a picture file in the embodiment, and the implementation principle is similar, and is not repeated here.
The device for identifying table content in a picture file provided by the embodiment of the present invention acquires a target picture file to be identified, performs character recognition processing on the target picture file to obtain the character information in the target picture file, then matches the recognized character information with the preset lexicon to obtain header characters whose matching degree with the preset lexicon is greater than the first threshold, and further determines the table content included in the target picture file according to the character information corresponding to the header characters. Therefore, the table included in the picture can be identified quickly and accurately, the identification accuracy is improved, the time spent on the identification operation can be reduced, and the use experience of the user is effectively improved.
In an exemplary embodiment, a computer device is also provided.
FIG. 7 is a schematic block diagram of a computer device shown in accordance with an exemplary embodiment. The computer device shown in fig. 7 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
Referring to fig. 7, the computer apparatus 200 includes: a memory 210 and a processor 220, the memory 210 storing a computer program that, when executed by the processor 220, causes the processor 220 to perform the steps of: acquiring a target picture file to be identified; carrying out character recognition processing on the target picture file to obtain character information in the target picture file; the character information comprises character shape, semantics and character position information; matching the recognized character information with a preset word bank to obtain a header character with the matching degree of the preset word bank larger than a first threshold value; and determining table contents included in the target picture file according to the character information corresponding to the header characters.
In one embodiment, the character information includes character semantics and character position information; the obtaining of the header characters with the matching degree with the preset word stock larger than a first threshold value comprises: matching the recognized character information with a preset word bank to obtain a target character set with the matching degree of the preset word bank being greater than a first threshold value; determining a table style corresponding to the character information according to the character semantics in the character information; determining target position information according to the form style; and acquiring a header character from the target character set according to the target position information and the position information corresponding to the target character set.
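The chain described in this embodiment, lexicon matching, table-style inference, target position, header extraction, could look roughly like the following Python sketch; the string similarity from difflib, the single-row-or-single-column style guess, and the exact coordinate comparison are simplifying assumptions rather than the patented logic.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Char = Tuple[str, float, float]   # (text, first-direction x, second-direction y)

def lexicon_score(text: str, lexicon: List[str]) -> float:
    return max(SequenceMatcher(None, text, w).ratio() for w in lexicon)

def extract_headers(chars: List[Char], lexicon: List[str],
                    first_threshold: float = 0.8) -> List[Char]:
    # target character set: characters whose lexicon match exceeds the first threshold
    candidates = [c for c in chars if lexicon_score(c[0], lexicon) > first_threshold]
    if not candidates:
        return []
    # crude table-style guess: do the candidates spread along a row or a column?
    distinct_x = {round(c[1]) for c in candidates}
    distinct_y = {round(c[2]) for c in candidates}
    headers_in_a_row = len(distinct_y) <= len(distinct_x)
    # target position: the topmost row or the leftmost column of the candidates
    if headers_in_a_row:
        target_y = min(c[2] for c in candidates)
        return [c for c in candidates if c[2] == target_y]
    target_x = min(c[1] for c in candidates)
    return [c for c in candidates if c[1] == target_x]
```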
In one embodiment, before the determining table contents included in the target picture file, the method further includes: and utilizing the preset word stock to carry out normalization and word combination processing on the header characters and the content characters.
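A toy Python sketch of what the normalization and word-combination step could look like; the greedy left-to-right merge and the exact-match test against the word bank are assumptions for illustration.

```python
from typing import List, Tuple

def merge_into_words(chars: List[Tuple[str, float]], lexicon: List[str]) -> List[str]:
    """Sort character fragments by their first-direction coordinate and greedily
    concatenate them, emitting a word whenever the buffer exactly matches a
    lexicon entry; leftover fragments are kept unchanged."""
    words, buffer = [], ""
    for text, _x in sorted(chars, key=lambda c: c[1]):
        buffer += text
        if buffer in lexicon:
            words.append(buffer)
            buffer = ""
    if buffer:
        words.append(buffer)
    return words
```

For example, merge_into_words([("Na", 10.0), ("me", 22.0)], ["Name"]) would return ["Name"].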
In one embodiment, the determining table content included in the target picture file according to character information corresponding to the header character includes: selecting the position information and the semantics of the content character corresponding to the header character from the character information according to the position information and the semantics of the header character; and generating a table included in the target picture file according to the position information and the semantics of the header character and the position information and the semantics of the content character.
In one embodiment, the character position information comprises a first direction coordinate and a second direction coordinate of the character; the selecting the position information and the semantics of the content character corresponding to the header character from the character information comprises the following steps: determining a first direction coordinate range or a second direction coordinate range of a target content character corresponding to any header character according to the first direction coordinate or the second direction coordinate of the any header character; selecting a primary selection character set of which the position information accords with the first direction coordinate range or the second direction coordinate range from the character information; and selecting a character matched with the semantic meaning of any header character from the primary character selection set as a content character corresponding to the any header character according to the semantic meaning of the any header character.
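As a rough illustration of the primary selection followed by semantic filtering described in this embodiment, the Python sketch below uses hypothetical regular-expression rules keyed by the header's meaning; the real rules would be derived from the preset word bank and are not specified here.

```python
import re
from typing import Callable, Dict, List

# hypothetical semantic checks keyed by the header's meaning; the real rules
# would come from the preset word bank and are not specified in the patent
SEMANTIC_RULES: Dict[str, Callable[[str], bool]] = {
    "date":   lambda s: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", s)),
    "amount": lambda s: bool(re.fullmatch(r"-?\d+(\.\d+)?", s)),
}

def select_column(header: Dict, chars: List[Dict], tol: float = 8.0) -> List[str]:
    """Primary selection by the header's first-direction coordinate range,
    followed by filtering with the semantic rule attached to the header."""
    lo, hi = header["x"] - tol, header["x"] + tol            # coordinate range
    primary = [c for c in chars if lo <= c["x"] <= hi and c["y"] > header["y"]]
    rule = SEMANTIC_RULES.get(header["meaning"], lambda s: True)
    return [c["text"] for c in sorted(primary, key=lambda c: c["y"]) if rule(c["text"])]
```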
In one embodiment, the character information includes character position information, wherein the character position information includes a first direction coordinate of the character, a second direction coordinate of the character, and a width of the character in the second direction; after the character information in the target picture file is obtained, the method further comprises the following steps: sequentially traversing the character information according to the sequence of the coordinates of the first direction from small to large, and judging whether the coordinates of the jth character and the ith character in the second direction are the same or not, wherein the difference value of the first direction coordinates of each adjacent character between the jth character and the ith character is in a preset range, i and j are positive integers, and j is larger than i; if the coordinates of the jth character and the ith character in the second direction are different, determining the coincidence degree of the jth character and the ith character in the second direction according to the width of the jth character in the second direction and the width of the ith character in the second direction; judging whether the coincidence degree of the jth character and the ith character in the second direction is greater than a third threshold value or not; and if so, correcting the coordinate of the jth character and/or the ith character in the second direction.
In one embodiment, before the correcting the coordinate of the jth character and/or the ith character in the second direction, the method further includes: determining a target second direction coordinate range according to the second direction coordinate of the ith character and the second direction coordinate of the jth character; selecting k characters of which the second direction coordinates belong to the target second direction coordinate range; and correcting the coordinate of the jth character and/or the ith character in the second direction according to the second direction coordinate of the k characters.
In one embodiment, the character position information further includes a width in the first direction; after traversing the character information in sequence from small to large according to the first direction coordinate, the method further comprises the following steps: determining a target first-direction coordinate range according to the first-direction coordinate of the jth character and the first-direction coordinate of the ith character; selecting m characters of which the first direction coordinates belong to the target first direction coordinate range; and correcting the coordinate of the jth character and/or the ith character in the first direction according to the first direction coordinate of the m characters.
In one embodiment, the character information includes character semantics; before matching the recognized character information with a preset word bank, the method further comprises the following steps: determining a target word stock according to the character semantics; the matching processing of the recognized character information and the preset word bank comprises the following steps: and matching the recognized character information with the target word bank.
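One conceivable realization of choosing a target word bank from the character semantics is sketched below in Python; the lexicon names and their contents are hypothetical placeholders.

```python
from typing import Dict, List

# hypothetical domain lexicons; the actual preset word banks are not published
LEXICONS: Dict[str, List[str]] = {
    "finance":   ["date", "amount", "balance", "account"],
    "logistics": ["order no.", "consignee", "address", "weight"],
}

def pick_target_lexicon(character_semantics: List[str]) -> List[str]:
    """Choose the lexicon that shares the most terms with the recognized semantics."""
    recognized = {s.lower() for s in character_semantics}
    return max(LEXICONS.values(), key=lambda lex: len(set(lex) & recognized))
```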
In an alternative implementation form, as shown in fig. 8, the computer device 200 may further include a bus 230 connecting different components (including the memory 210 and the processor 220), wherein the memory 210 stores a computer program, and when the processor 220 executes the program, the method for identifying table contents in a picture file according to the embodiment of the present invention is implemented.
Bus 230 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer device 200 typically includes a variety of computer device readable media. Such media may be any available media that is accessible by computer device 200 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 210 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 240 and/or cache memory 250. The computer device 200 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 260 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 230 by one or more data media interfaces. Memory 210 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 280 having a set (at least one) of program modules 270 may be stored, for example, in the memory 210. Such program modules 270 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these, or some combination thereof, may include an implementation of a network environment. The program modules 270 generally perform the functions and/or methodologies of the described embodiments of the invention.
The computer device 200 may also communicate with one or more external devices 290 (e.g., keyboard, pointing device, display 291, etc.), with one or more devices that enable a user to interact with the computer device 200, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 200 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 292. Also, computer device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through network adapter 293. As shown, network adapter 293 communicates with the other modules of computer device 200 via bus 230. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that the foregoing explanation of the embodiment of the method for identifying table contents in a picture file is also applicable to the computer device of the embodiment, and the implementation principle is similar, and is not repeated here.
The computer device provided by the embodiment of the invention obtains the target picture file to be recognized, performs character recognition processing on the target picture file to obtain the character information in the target picture file, then performs matching processing on the recognized character information and the preset lexicon to obtain the header characters with the matching degree larger than the first threshold value with the preset lexicon, and further determines the table content included in the target picture file according to the character information corresponding to the header characters. Therefore, the form included in the picture can be identified quickly and accurately, the identification accuracy is improved, the time spent on identification operation can be reduced, and the use experience of a user is effectively improved.
In an exemplary embodiment, the present invention also provides a computer-readable storage medium.
The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for identifying table contents in a picture file.
In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description of the specification, reference to the description of the term "one embodiment", "some embodiments", "an example", "a specific example", or "some examples", etc., means that a particular feature or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples and the features of the various embodiments or examples described in this specification, provided that they do not contradict each other.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware that can be related to instructions of a program, which can be stored in a computer-readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented as a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that variations, modifications, substitutions and alterations thereof within the scope of the present invention may be made by those of ordinary skill in the art.

Claims (10)

1. A method for identifying table contents in a picture file is characterized by comprising the following steps:
acquiring a target picture file to be identified;
carrying out character recognition processing on the target picture file to obtain character information in the target picture file;
matching the recognized character information with a preset word bank to obtain a header character with the matching degree of the preset word bank larger than a first threshold value; the character information comprises character semantics and character position information; the character position information comprises a first direction coordinate and a second direction coordinate of the character; determining a first direction coordinate range or a second direction coordinate range of a target content character corresponding to any header character according to the position information and the semantics of the header character and according to a first direction coordinate or a second direction coordinate of any header character;
selecting a primary selection character set of which the position information accords with the first direction coordinate range or the second direction coordinate range from the character information;
according to the semantics of any one header character, selecting a character matched with the semantics of any one header character from the primary character set as a content character corresponding to any one header character;
and generating a table included in the target picture file according to the position information and the semantics of the header character and the position information and the semantics of the content character.
2. The method of claim 1,
the obtaining of the header characters with the matching degree with the preset lexicon larger than a first threshold value comprises:
matching the recognized character information with a preset word bank to obtain a target character set with the matching degree of the preset word bank being greater than a first threshold value;
determining a table style corresponding to the character information according to the character semantics in the character information;
determining target position information according to the form style;
and acquiring a header character from the target character set according to the target position information and the position information corresponding to the target character set.
3. The method of claim 2, wherein prior to determining the table content included in the target picture file, further comprising:
and utilizing the preset word stock to carry out normalization and word combination processing on the header characters and the content characters.
4. The method of claim 1, wherein the character position information further includes a width of the character in the second direction;
after the character information in the target picture file is obtained, the method further comprises the following steps:
sequentially traversing the character information according to the sequence of the coordinates of the first direction from small to large, and judging whether the coordinates of the jth character and the ith character in the second direction are the same or not, wherein the difference value of the first direction coordinates of each adjacent character between the jth character and the ith character is in a preset range, i and j are positive integers, and j is larger than i;
if the coordinates of the jth character and the ith character in the second direction are different, determining the coincidence degree of the jth character and the ith character in the second direction according to the width of the jth character in the second direction and the width of the ith character in the second direction;
judging whether the coincidence degree of the jth character and the ith character in the second direction is greater than a third threshold value or not;
and if so, correcting the coordinate of the jth character and/or the ith character in the second direction.
5. The method of claim 4, wherein before the correcting the coordinates of the jth character and/or the ith character in the second direction, further comprising:
determining a target second direction coordinate range according to the second direction coordinate of the ith character and the second direction coordinate of the jth character;
selecting k characters of which the second direction coordinates belong to the target second direction coordinate range;
and correcting the coordinate of the jth character and/or the ith character in the second direction according to the second direction coordinate of the k characters.
6. The method of claim 5, wherein the character position information further includes a width in the first direction;
after traversing the character information in sequence from small to large according to the first direction coordinate, the method further comprises the following steps:
determining a target first direction coordinate range according to the first direction coordinate of the jth character and the first direction coordinate of the ith character;
selecting m characters of which the first direction coordinates belong to the target first direction coordinate range;
and correcting the coordinate of the jth character and/or the ith character in the first direction according to the first direction coordinate of the m characters.
7. The method of any one of claims 1 to 6,
before the matching processing of the recognized character information and the preset word bank, the method further comprises the following steps:
determining a target word stock according to the character semantics;
the matching processing of the recognized character information and the preset word bank comprises the following steps:
and matching the recognized character information with the target word bank.
8. An apparatus for identifying table contents in a picture file, comprising:
the first acquisition module is used for acquiring a target picture file to be identified;
the processing module is used for carrying out character recognition processing on the target picture file to obtain character information in the target picture file;
the matching module is used for matching the recognized character information with a preset word bank so as to obtain a header character with the matching degree of the preset word bank larger than a first threshold value; the character information comprises character semantics and character position information; the character position information comprises a first direction coordinate and a second direction coordinate of the character;
a determination module to: according to the position information and the semantics of the table head characters and the first direction coordinate or the second direction coordinate of any table head character, determining a first direction coordinate range or a second direction coordinate range of a target content character corresponding to any table head character;
selecting a primary selection character set of which the position information accords with the first direction coordinate range or the second direction coordinate range from the character information;
according to the semantics of any one header character, selecting a character matched with the semantics of any one header character from the primary character set as a content character corresponding to any one header character;
and generating a table included in the target picture file according to the position information and the semantics of the header character and the position information and the semantics of the content character.
9. A computer device, comprising: a memory storing a computer program and a processor, wherein the processor, when executing the program, implements the method of identifying table contents in a picture file according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of identifying table contents in a picture file according to any one of claims 1 to 7.
CN201810285135.5A 2018-04-02 2018-04-02 Method, device, equipment and storage medium for identifying table content in picture file Active CN108734089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810285135.5A CN108734089B (en) 2018-04-02 2018-04-02 Method, device, equipment and storage medium for identifying table content in picture file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810285135.5A CN108734089B (en) 2018-04-02 2018-04-02 Method, device, equipment and storage medium for identifying table content in picture file

Publications (2)

Publication Number Publication Date
CN108734089A CN108734089A (en) 2018-11-02
CN108734089B true CN108734089B (en) 2023-04-18

Family

ID=63940603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810285135.5A Active CN108734089B (en) 2018-04-02 2018-04-02 Method, device, equipment and storage medium for identifying table content in picture file

Country Status (1)

Country Link
CN (1) CN108734089B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711396A (en) * 2018-11-12 2019-05-03 平安科技(深圳)有限公司 Generation method, device, equipment and the readable storage medium storing program for executing of OCR training sample
CN109726643B (en) * 2018-12-13 2021-08-20 北京金山数字娱乐科技有限公司 Method and device for identifying table information in image, electronic equipment and storage medium
CN109740135A (en) * 2018-12-19 2019-05-10 平安普惠企业管理有限公司 Chart generation method and device, electronic equipment and storage medium
CN109871524B (en) * 2019-02-21 2023-06-09 腾讯科技(深圳)有限公司 Chart generation method and device
CN110059688B (en) * 2019-03-19 2024-05-28 平安科技(深圳)有限公司 Picture information identification method, device, computer equipment and storage medium
CN110245570B (en) * 2019-05-20 2023-04-18 平安科技(深圳)有限公司 Scanned text segmentation method and device, computer equipment and storage medium
CN110147774B (en) * 2019-05-23 2021-06-15 阳光保险集团股份有限公司 Table format picture layout analysis method and computer storage medium
CN110287854B (en) * 2019-06-20 2022-06-10 北京百度网讯科技有限公司 Table extraction method and device, computer equipment and storage medium
CN112115111A (en) * 2019-06-20 2020-12-22 上海怀若智能科技有限公司 OCR-based document version management method and system
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document
CN110909725B (en) * 2019-10-18 2023-09-19 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing text
CN111310426A (en) * 2020-01-22 2020-06-19 平安科技(深圳)有限公司 Form format recovery method and device based on OCR and storage medium
CN113449559B (en) * 2020-03-26 2023-05-26 顺丰科技有限公司 Table identification method and device, computer equipment and storage medium
CN111507230A (en) * 2020-04-11 2020-08-07 创景未来(北京)科技有限公司 Method and system for identifying and extracting document and table data
CN111898528B (en) * 2020-07-29 2023-11-10 腾讯科技(深圳)有限公司 Data processing method, device, computer readable medium and electronic equipment
CN111683285B (en) * 2020-08-11 2021-01-26 腾讯科技(深圳)有限公司 File content identification method and device, computer equipment and storage medium
CN112507909B (en) * 2020-12-15 2024-06-14 信号旗智能科技(上海)有限公司 Method, device, equipment and medium for extracting document data based on OCR (optical character recognition)
CN112509661B (en) * 2021-02-03 2021-05-25 南京吉拉福网络科技有限公司 Methods, computing devices, and media for identifying physical examination reports
CN113504863A (en) * 2021-06-02 2021-10-15 珠海金山办公软件有限公司 Method and device for realizing picture screening, computer storage medium and terminal
CN113723301A (en) * 2021-08-31 2021-11-30 广州新丝路信息科技有限公司 Imported goods customs clearance list OCR recognition branch processing method and device
CN116127928B (en) * 2023-04-17 2023-07-07 广东粤港澳大湾区国家纳米科技创新研究院 Table data identification method and device, storage medium and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method
JP2016009223A (en) * 2014-06-23 2016-01-18 株式会社日立情報通信エンジニアリング Optical character recognition device and optical character recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160055376A1 (en) * 2014-06-21 2016-02-25 iQG DBA iQGATEWAY LLC Method and system for identification and extraction of data from structured documents

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016009223A (en) * 2014-06-23 2016-01-18 株式会社日立情報通信エンジニアリング Optical character recognition device and optical character recognition method
CN105184265A (en) * 2015-09-14 2015-12-23 哈尔滨工业大学 Self-learning-based handwritten form numeric character string rapid recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
仲小挺. "Research on a self-learning-based method for rapid recognition of handwritten numeric character strings in forms". China Masters' Theses Full-text Database (Electronic Journal), 2015, full text. *

Also Published As

Publication number Publication date
CN108734089A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN108734089B (en) Method, device, equipment and storage medium for identifying table content in picture file
US11450125B2 (en) Methods and systems for automated table detection within documents
US11514698B2 (en) Intelligent extraction of information from a document
JP6528147B2 (en) Accounting data entry support system, method and program
US10489644B2 (en) System and method for automatic detection and verification of optical character recognition data
CN109344831A (en) A kind of tables of data recognition methods, device and terminal device
US20200372248A1 (en) Certificate recognition method and apparatus, electronic device, and computer-readable storage medium
CN110826494B (en) Labeling data quality evaluation method, labeling data quality evaluation device, computer equipment and storage medium
US10325068B2 (en) Methods and apparatus to label radiology images
CN105631393A (en) Information recognition method and device
CN104008363A (en) Handwriting track detection, standardization and online-identification and abnormal radical collection
CN112989861A (en) Sample identification code reading method and device, electronic equipment and storage medium
US20070230790A1 (en) Character recognition apparatus, method and program
JP5983075B2 (en) Method and apparatus for identifying the orientation of a character in an image block
RU2597163C2 (en) Comparing documents using reliable source
US20230334889A1 (en) Systems and methods for spatial-aware information extraction from electronic source documents
CN111008635A (en) OCR-based multi-bill automatic identification method and system
CN110647826B (en) Method and device for acquiring commodity training picture, computer equipment and storage medium
CN114743209A (en) Prescription identification and verification method, system, electronic equipment and storage medium
CN113673214A (en) Information list alignment method and device, storage medium and electronic equipment
JP7435118B2 (en) Information processing device and program
CN110827261B (en) Image quality detection method and device, storage medium and electronic equipment
JP3467437B2 (en) Character recognition apparatus and method and program recording medium
CN112036465A (en) Image recognition method, device, equipment and storage medium
JP2021152696A (en) Information processor and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant