CN115346227A - Method for vectorizing electronic file based on layout file - Google Patents

Info

Publication number
CN115346227A
Authority
CN
China
Prior art keywords
distance
matching
vector
target sequence
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211266067.0A
Other languages
Chinese (zh)
Other versions
CN115346227B (en)
Inventor
黄雪琪
王宝凤
赵瑞婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingchen Technology Nantong Co ltd
Original Assignee
Jingchen Technology Nantong Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingchen Technology Nantong Co ltd
Priority to CN202211266067.0A
Publication of CN115346227A
Application granted
Publication of CN115346227B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/20 Drawing from basic elements, e.g. lines or circles
    • G06T11/203 Drawing of straight lines or curves
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G06V30/162 Quantising the image signal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244 Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
    • G06V30/245 Font recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of image recognition, and in particular to a method for vectorizing an electronic file based on a layout file. The method comprises: acquiring a scanned bitmap of a paper document; acquiring the vector characters of different fonts that correspond to each character in the scanned bitmap; acquiring binary images of the content inside the outer bounding boxes of the character and of each vector character; acquiring the distances from the centre point of each bounding box to the points of each closed edge, thereby obtaining a distance sequence for each closed edge and a distance sequence set for each binary image; sampling the distance sequence sets at a plurality of different sampling scales to obtain target sequence sets; calculating the similarity distances between the target sequences of the character and those of each vector character, and performing KM matching; obtaining a matching effect evaluation value from the similarity distances of the matched target sequences; obtaining the font matching degree between the character and each corresponding vector character; determining the replacement object of the character; and obtaining a vectorized file.

Description

Method for vectorizing electronic file based on layout file
Technical Field
The invention relates to the technical field of image recognition, in particular to a method for vectorizing an electronic file based on a layout file.
Background
With the development of informatization, electronic files are used ever more widely and are gradually replacing paper files in daily office work. In application scenarios such as archive organization, electronic filing makes it easier to store large volumes of paper material. To digitize a paper document, the document must first be scanned, that is, converted into a computer image. The scan yields a bitmap, which becomes distorted when enlarged. To avoid this, the computer image of the paper document is vectorized to obtain a vector graphic.
A vector graphic does not distort however it is enlarged, reduced, rotated or stretched. Because it is represented by the parameters of mathematical formulas, it also requires less storage, which serves the goal of converting paper documents into electronic documents for storage.
When a scanned paper document image is vectorized, OCR document content recognition is generally used to recognize the character information in the document, and the recognized characters are replaced directly with character vector graphics. For this replacement, the font and the font size of each character recognized by OCR must also be identified. The font size can be determined directly from the size of the character. Identifying the font requires comparing the character in the scanned image with the corresponding vector characters in a vector character library; once the font is determined, the corresponding vector character replaces the character and is displayed.
However, some fonts differ only slightly, and the quality of a scanned image is unstable because it is affected by many factors. As a result, a character in the scanned image may look similar to the bitmap rendered from a vector character even though the two fonts are actually different, which readily produces an incorrect vectorization result.
Disclosure of Invention
The invention provides a method for vectorizing an electronic file based on a layout file, which aims to solve the problem of inaccurate vectorization results in the prior art.
The method for vectorizing an electronic file based on a layout file according to the invention adopts the following technical scheme:
acquiring a scanned bitmap of a paper document;
acquiring an index label, a font size label and an outer bounding box for each character on the scanned bitmap;
acquiring, from a vector character library and according to the index label and the font size label, the vector characters of different fonts corresponding to each character; converting the vector characters into a vector character bitmap; acquiring the outer bounding box of each vector character in the vector character bitmap; and acquiring binary images of the image content inside the outer bounding box of the character and inside the outer bounding box of each vector character;
acquiring all closed edges in the two binary images; for each closed edge, acquiring the maximum distance from the centre point of the outer bounding box to the closed edge; taking the edge point at that maximum distance as the starting point and sequentially acquiring, in the clockwise direction, the distance from every edge point to the centre point of the corresponding outer bounding box; and thereby obtaining a distance sequence for each closed edge and a distance sequence set for each of the two binary images;
sampling each distance sequence in the distance sequence sets at a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale; calculating the similarity distance between every pair of target sequences drawn from the two target sequence sets at the same sampling scale; performing KM (Kuhn-Munkres) matching on the target sequences of the two target sequence sets according to the similarity distance; acquiring the mean similarity distance of the matched target sequences; and taking that mean as the matching effect evaluation value;
and calculating the font matching degree between the character and the vector character of each corresponding font according to the matching effect evaluation values and the similarity of the KM matching results obtained at every two different sampling scales; taking the vector character with the maximum font matching degree as the replacement object of the corresponding character in the vector graphic; and obtaining the vectorized file.
Preferably, the step of setting a plurality of different sampling scales comprises:
acquiring the length of the closed edge corresponding to the shortest distance sequence among all distance sequences in the two distance sequence sets;
taking one tenth of that length as the maximum sampling scale;
the sampling scales are then 1, 2, 3, …, up to the maximum sampling scale.
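The scale-setting rule can be illustrated with a minimal Python sketch. It assumes that the length of a closed edge equals the number of entries in its distance sequence and that "one tenth" is rounded down with a floor of 1; the text specifies neither, and the function name `sampling_scales` is hypothetical.

```python
def sampling_scales(set_a, set_b):
    """Return the sampling scales [1, 2, ..., C] for two distance sequence sets.

    Assumption: closed-edge length == number of distances in its sequence.
    """
    shortest = min(len(seq) for seq in list(set_a) + list(set_b))
    # Assumption: integer division, never below 1 (rounding is not specified).
    max_scale = max(1, shortest // 10)
    return list(range(1, max_scale + 1))
```

For example, if the shortest distance sequence in either set has 43 entries, the sketch yields the scales 1 to 4.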
Preferably, the step of sampling each distance sequence in the distance sequence set at a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale comprises:
for each distance in a distance sequence, averaging that distance with the distance located one sampling-scale interval after it, and taking the mean as a target value in the target sequence;
and by analogy, obtaining the target sequence of every distance sequence at each sampling scale, the target sequences together forming the target sequence set corresponding to the distance sequence set.
Preferably, a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences in two target sequence sets under the same sampling scale.
Preferably, the step of obtaining the similarity distance mean of the matched target sequences comprises:
after matching is completed, recording the similarity distance of any target sequence left unmatched as 1;
calculating the similarity distance of each matched pair of target sequences by using the dynamic time warping algorithm;
and calculating the mean over the similarity distances of all unmatched target sequences and the similarity distances of all matched pairs of target sequences.
Preferably, the step of calculating the similarity of the KM matching results of the target sequences in the corresponding target sequence sets at two different sampling scales comprises:
acquiring the matched target sequence pairs in the target sequence sets at each of the two sampling scales, and taking the intersection to obtain the number of target sequence pairs matched identically at both scales;
taking the union of the matched target sequence pairs at the two sampling scales to obtain the total number of matched pairs;
and taking the ratio of the number of identically matched target sequence pairs to the total number of matched pairs as the similarity of the KM matching results.
Preferably, the step of calculating the font matching degree between the text and the vector text of each corresponding font according to the similarity of KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value includes:
calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales;
calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation value and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree;
and summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the vector character of each corresponding font.
Preferably, image correction is performed on the scanned bitmap, the image correction comprising: correcting the pose of the scanned bitmap of the paper document according to the scanning angle to obtain a front view of the scanned bitmap, and using the front view as the scanned bitmap.
Preferably, an OCR algorithm is used to recognize the characters on the scanned bitmap and to obtain the outer bounding box of each character.
Preferably, the index label of a character is the label under which that character is looked up.
The method for vectorizing an electronic file based on a layout file has the following advantages:
1. The distances from the edge points on each closed edge to the centre point, in both the scanned bitmap of a character and the vector character bitmap, are used as the descriptor of the closed edge. The distance sequences are then paired at multiple sampling scales, and the font of each character is identified from the pairing results, which avoids the influence of scan image quality on matching accuracy.
2. The similarity between the matching results of the distance sequence set of the closed edges of the scanned bitmap and the distance sequence set of the closed edges of the vector character bitmap is calculated, and the matching effect is obtained. The matching degree between the character and each corresponding font is then evaluated comprehensively from the matching effects at multiple sampling scales and the similarity of the matching results, so that the vector character of the matching font is determined more accurately and the characters are vectorized accurately.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of an embodiment of a method for vectorizing an electronic file based on a layout file according to the present invention;
Fig. 2 is a schematic structural diagram of obtaining a distance sequence in an embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
An embodiment of the method for vectorizing an electronic file based on a layout file according to the present invention is shown in Fig. 1.
S1. Acquiring a scanned bitmap of a paper document.
Specifically, a scanned bitmap of the paper document is acquired with a digital camera. To prevent skew in the captured document from affecting subsequent image processing, this embodiment applies image correction to the scanned bitmap: the pose of the scanned bitmap is corrected according to the scanning angle to obtain a front view of the scanned bitmap, and the front view is used as the scanned bitmap.
S2. Acquiring an index label, a font size label and an outer bounding box for each character on the scanned bitmap; acquiring, from a vector character library and according to the index label and the font size label, the vector characters of different fonts corresponding to each character; converting the vector characters into a vector character bitmap; acquiring the outer bounding box of each vector character in the vector character bitmap; and acquiring binary images of the image content inside the outer bounding box of the character and inside the outer bounding box of each vector character.
The vector characters are data stored in the vector character library. Each character recognized in the scanned bitmap of the paper document by the OCR algorithm carries an index label and a font size label, so the vector characters of all the different fonts corresponding to that character can be retrieved from the vector character library.
Based on this, an index label, a font size label and an outer bounding box must first be obtained for each character on the scanned bitmap. The index label of a character is the label under which the character is looked up, and it is identical to the label of the same character in the character library. Taking the i-th character as an example, the character has one index label and one font size label. In the vector character library, the vector characters of all the different fonts corresponding to the i-th character can be obtained from the index label and the font size label of that character. The vector character of the j-th font corresponding to the i-th character is converted into a vector character bitmap, where j = 1, 2, 3, … up to the number of fonts available for the character.
Again taking the i-th character as an example, the outer bounding box of the character in the scanned bitmap and the outer bounding box of the vector character of each corresponding font in the vector character bitmap can be obtained, and the image inside each bounding box is binarized. Because the scanned bitmap of the character and the vector character bitmap both consist of a character part and a background part, the pixels of the character part are set to 1 and the pixels of the background part are set to 0, which completes the binarization of the bitmap inside the bounding box. This embodiment records the binary image of the content inside the bounding box of the i-th character, and likewise the binary image of the content inside the bounding box of the vector character of the j-th font corresponding to the i-th character, for j = 1, 2, 3, … up to the number of fonts.
It should be noted that, in this embodiment, the outer bounding box of a character and the outer bounding box of a vector character both refer to the minimum bounding rectangle.
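The binarization and the minimum bounding rectangle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a grayscale crop given as nested lists, the text is darker than the background, and the fixed threshold of 128 is a placeholder; the text does not specify how character pixels are separated from the background. The function names are hypothetical.

```python
def binarize(gray, threshold=128):
    """Mark character pixels as 1 and background pixels as 0.

    Assumption: dark text on a light background, fixed threshold.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def bounding_box(binary):
    """Return (top, left, bottom, right) of the minimum bounding rectangle
    of all pixels equal to 1, or None if there are no such pixels."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return None
    width = len(binary[0])
    cols = [c for c in range(width) if any(row[c] for row in binary)]
    return min(rows), min(cols), max(rows), max(cols)
```

The centre point used as the reference point in the next step would then be the midpoint of this rectangle.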
S3. Acquiring all closed edges in the two binary images; for each closed edge, acquiring the maximum distance from the centre point of the outer bounding box to the closed edge; taking the edge point at that maximum distance as the starting point and sequentially acquiring, in the clockwise direction, the distance from every edge point to the centre point of the corresponding outer bounding box; and thereby obtaining a distance sequence for each closed edge and a distance sequence set for each of the two binary images.
Specifically, edge detection is performed on each binary image with the Canny edge detection algorithm to obtain all closed edges in the binary image. For Chinese characters, for example, the character for "two" (二) yields two closed edges and the character for "three" (三) yields three. To obtain the distance sequence of a closed edge, the centre point of the outer bounding box corresponding to the closed edge is obtained first and taken as the reference point, and the distance between the reference point and each edge point on that closed edge is then computed. Computing the distance is prior art.
As shown in Fig. 2, the pixel distance from each edge point on a closed edge to the reference point is used as the descriptor of that closed edge. For the binary image within the bounding box of the i-th character, the distance sequence of each closed edge is therefore the ordered list of distances between the edge points on that closed edge and the centre point, with one entry per edge point, so its length equals the total number of edge points on the closed edge. In the same way, for the binary image within the bounding box of the vector character of the j-th font corresponding to the i-th character, the distance sequence of each closed edge is the ordered list of distances between its edge points and the centre point. By analogy, the distance sequences and the corresponding distance sequence sets can be obtained for the binary images within the bounding boxes of all characters and within the bounding boxes, in the vector character bitmap, of the vector characters of each corresponding font.
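The construction of one distance sequence can be sketched as follows. The sketch assumes that the edge points of a closed edge have already been traced in clockwise order (for instance by a contour-following routine, which is not shown) and that the first point at the maximum distance is used when several points tie; the function name `distance_sequence` is hypothetical.

```python
import math


def distance_sequence(edge_points, center):
    """Distances from each edge point to the bounding-box centre,
    rotated so that the sequence starts at the farthest edge point.

    Assumption: edge_points is non-empty and already in clockwise order.
    """
    cx, cy = center
    dists = [math.hypot(x - cx, y - cy) for x, y in edge_points]
    start = dists.index(max(dists))  # first farthest point on ties
    return dists[start:] + dists[:start]
```

Starting every sequence at the farthest edge point gives the sequences of the scanned character and of the vector character a common origin before they are compared.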
S4. Sampling each distance sequence in the distance sequence sets at a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale; calculating the similarity distance between every pair of target sequences drawn from the two target sequence sets at the same sampling scale; performing KM (Kuhn-Munkres) matching on the target sequences of the two target sequence sets according to the similarity distance; acquiring the mean similarity distance of the matched target sequences; and taking that mean as the matching effect evaluation value.
To measure the font matching degree between the binary image of a character and the binary image of a corresponding vector character, each distance sequence in the distance sequence set of the closed edges of the character's binary image must be paired, at multiple scales, with the distance sequences in the distance sequence set of the closed edges of the vector character's binary image.
Specifically, this embodiment first sets the sampling scales. Because the length of each closed edge is limited, the length of the closed edge corresponding to the shortest distance sequence among all distance sequences in the two distance sequence sets is obtained first. One tenth of that length is taken as the maximum sampling scale, denoted C, so the sampling scales are 1, 2, 3, …, C. At each sampling scale, this embodiment records the target sequence set obtained by sampling every distance sequence in the distance sequence set of the binary image within the bounding box of the i-th character, and likewise the target sequence set obtained by sampling every distance sequence in the distance sequence set of the binary image within the bounding box of the vector character of the j-th font corresponding to the i-th character.
The step of sampling each distance sequence in the distance sequence set at a sampling scale to obtain the target sequence set corresponding to each sampling scale comprises: for each distance in a distance sequence, averaging that distance with the distance located one sampling-scale interval after it, and taking the mean as a target value in the target sequence; and by analogy, obtaining the target sequence of each distance sequence at each sampling scale, the target sequences of all distance sequences in a distance sequence set together forming the target sequence set.
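One possible reading of this sampling step is sketched below. Two points are assumptions rather than statements from the text: a target value is produced for every position of the distance sequence, and the index wraps around the end of the sequence because the edge is closed. The function names are hypothetical.

```python
def sample_sequence(distances, scale):
    """Target sequence at one sampling scale: each value is the mean of a
    distance and the distance `scale` positions after it.

    Assumption: indices wrap around, since the underlying edge is closed.
    """
    n = len(distances)
    return [(distances[k] + distances[(k + scale) % n]) / 2 for k in range(n)]


def sample_set(distance_set, scale):
    """Target sequence set: one target sequence per distance sequence."""
    return [sample_sequence(seq, scale) for seq in distance_set]
```

Larger scales average points that lie farther apart on the edge, so they smooth out fine scan noise while keeping the coarse outline of the shape.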
Specifically, a similarity distance between every two target sequences in two target sequence sets under the same sampling scale is calculated, where the similarity distance is a DTW distance, that is, the similarity distance between every two target sequences in two target sequence sets under the same sampling scale is calculated by using a dynamic time warping algorithm, and the dynamic time warping algorithm is an algorithm in the prior art and is not described in detail in this embodiment.
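For reference, a textbook dynamic time warping distance can be written as the short Python sketch below. The absolute difference is used as the local cost, which is an assumption, and the result is not normalized; the text states that the similarity distance is normalized but does not say how.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences,
    using |a[i] - b[j]| as the local cost (an assumption)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because DTW allows one sequence to be stretched against the other, two target sequences of different lengths that trace the same shape still receive a small distance.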
The step of performing KM matching on the target sequences in the two target sequence sets according to the similarity distance, and acquiring the similarity distance mean value, is as follows. The KM (Kuhn-Munkres) algorithm is an algorithm that completes maximum-weight matching in bipartite graph matching. Here, the target sequences in one target sequence set are matched against the target sequences in the other target sequence set, using the similarity distance between target sequences as the matching criterion, which yields the matched pairs of target sequences and also the number of matched pairs. After the matching is completed, the similarity distance of each unmatched target sequence is recorded as 1, and the similarity distance of each matched pair of target sequences is calculated with the dynamic time warping algorithm. The mean of the similarity distances of all unmatched target sequences and all matched pairs is then calculated. Because the similarity distance in this embodiment is the normalized similarity distance, this mean lies in the range 0 to 1 and is used to evaluate the matching result: the closer the similarity distance mean is to 0, the better the matching effect, and the closer it is to 1, the worse the matching effect. Therefore, in this embodiment, at sampling scale $c$, the matching effect evaluation value of the two target sequence sets corresponding to the binary image inside the bounding box of the $i$-th character and the binary image inside the bounding box of the vector character of the $j$-th font corresponding to the $i$-th character is recorded as $E_c$.
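The matching and evaluation step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the normalized similarity distances are already available as a matrix, it replaces the KM algorithm with a brute-force search over assignments (practical only for small sets), it assumes the matching minimizes the total distance, and it assumes each leftover sequence of the larger set contributes a distance of 1 to the mean. The function name `match_and_evaluate` is hypothetical.

```python
from itertools import permutations


def match_and_evaluate(dist):
    """Match rows to columns of a normalized distance matrix and score the result.

    dist[r][c] is the normalized similarity distance (0..1) between target
    sequence r of the first set and target sequence c of the second set.
    Returns (sorted list of matched (row, col) pairs, evaluation value).
    """
    n_rows, n_cols = len(dist), len(dist[0])
    transposed = n_rows > n_cols
    if transposed:  # make rows the smaller side so every row gets a partner
        dist = [list(col) for col in zip(*dist)]
        n_rows, n_cols = n_cols, n_rows

    best_cols, best_total = None, float("inf")
    # Brute force over assignments: a stand-in for the KM algorithm (assumption).
    for cols in permutations(range(n_cols), n_rows):
        total = sum(dist[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_cols, best_total = cols, total

    # Report pairs in the orientation of the original matrix.
    pairs = [(c, r) if transposed else (r, c) for r, c in enumerate(best_cols)]
    unmatched = n_cols - n_rows  # assumption: each leftover sequence counts as 1
    evaluation = (best_total + unmatched * 1.0) / (n_rows + unmatched)
    return sorted(pairs), evaluation
```

For a 2-by-3 distance matrix, two pairs are matched and the one leftover sequence adds a distance of 1 before the mean is taken, so the evaluation value stays within 0 to 1.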
S5, calculating the font matching degree between each character and the vector characters according to the matching effect evaluation values and the similarity of the KM matching results of the target sequences in the corresponding target sequence sets under every two different sampling scales, taking the vector character corresponding to the maximum font matching degree as the replacement object of the corresponding character in the vector diagram, and obtaining the vectorized file.
Specifically, the step of obtaining the similarity of KM matching results of target sequences in a target sequence set corresponding to each two different sampling scales includes: acquiring corresponding matched target sequence pairs in corresponding target sequence sets under every two different sampling scales, and solving an intersection to obtain the number of the matched same target sequence pairs; obtaining a total matching number by summing corresponding matched target sequence pairs in corresponding target sequence sets under every two different sampling scales; and taking the ratio of the number of the matched same target sequence pairs to the total number of matched target sequence pairs as the similarity of the KM matching results.
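As a hedged illustration of this ratio, the sketch below compares the matched pairs obtained at two sampling scales. It assumes each KM matching result is given as a collection of (index in first set, index in second set) pairs, and it reads the "total matching number" as the number of distinct pairs appearing in either result (a set union); if a plain sum of the two pair counts is intended instead, the denominator would be `len(pairs_a) + len(pairs_b)`. The function name `matching_similarity` is hypothetical.

```python
def matching_similarity(pairs_a, pairs_b):
    """Ratio of pairs shared by both matchings to all distinct pairs in either.

    pairs_a, pairs_b: iterables of (index_in_first_set, index_in_second_set)
    produced by the matching at two different sampling scales.
    """
    set_a, set_b = set(pairs_a), set(pairs_b)
    total = len(set_a | set_b)  # assumption: total is the union of both results
    if total == 0:
        return 0.0  # no pairs at either scale, nothing to compare
    return len(set_a & set_b) / total
```

Two identical matching results give a similarity of 1, and results with no pair in common give 0.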
The step of calculating the font matching degree of the characters and the vector characters according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value comprises the following steps: calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales; calculating the product of the similarity of the KM matching results of the target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation values and the weights of the two corresponding different sampling scales, and taking the product as the initial matching degree; summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the corresponding vector character, wherein the font matching degree of each character and the vector character of the corresponding font is calculated according to the following formula in the embodiment:
$$P_{i,j} = \sum_{1 \le a < b \le N} w_{a,b} \, S_{a,b} \, \frac{E_a + E_b}{2}$$

In the formula, $P_{i,j}$ denotes the font matching degree between the $i$-th character and the vector character of the $j$-th font corresponding to the $i$-th character; $E_a$ denotes, at sampling scale $a$, the matching effect evaluation value of the two target sequence sets corresponding to the binary image inside the bounding box of the $i$-th character and the binary image inside the bounding box of the vector character of the $j$-th font corresponding to the $i$-th character; $E_b$ denotes the same matching effect evaluation value at sampling scale $b$; $S_{a,b}$ denotes the similarity between the KM matching result of the target sequences in the corresponding target sequence sets at sampling scale $a$ and the KM matching result at sampling scale $b$; $w_{a,b}$ denotes the weight of the pair of sampling scales $a$ and $b$. It should be noted that $a$ and $b$ are two different sampling scales, and $N$ represents the total number of all sampling scales.

It should be noted that the matching effect evaluation value $E_a$ corresponding to sampling scale $a$ and the matching effect evaluation value $E_b$ corresponding to sampling scale $b$ are both means of normalized similarity distances, so the closer a matching effect evaluation value is to 0, the better the matching effect, and the closer it is to 1, the worse the matching effect. $w_{a,b}$ is an attention weight that reflects the difference between sampling scale $a$ and sampling scale $b$: the larger the difference between the two sampling scales while the matching results remain similar, the more it shows that the pairing results obtained for the binary image inside the bounding box of the $i$-th character and the binary image inside the bounding box of the vector character of the $j$-th font corresponding to the $i$-th character are still similar at two widely different sampling scales. At the same time, the closer the font matching degree $P_{i,j}$ is to 0, the smaller the difference between the $i$-th character and the vector character of the $j$-th font corresponding to the $i$-th character, and the better the two match. In short, the more similar the matching results corresponding to the sampling scales are, and the better the matching effect of those results is, the more the vector character of the $j$-th font corresponding to the $i$-th character should be taken as the replacement result for the $i$-th character.
Therefore, in this embodiment, the vector character corresponding to the maximum font matching degree in the font matching degrees of the characters and the vector characters of each corresponding font is used as a replacement object of the corresponding character in the vector diagram, and the vectorized file is obtained.
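The aggregation over pairs of sampling scales and the final selection can be sketched in Python as follows. This is only an illustration that follows the verbal description literally (weight times similarity times mean evaluation value, summed over every pair of different sampling scales, then choosing the maximum). The exact form of the weight is not recoverable from this text, so `scale_weight` is a hypothetical choice: the gap between the two scales divided by the sum of all pairwise gaps. The font names in the usage note are placeholders.

```python
from itertools import combinations


def scale_weight(a, b, scales):
    """Hypothetical weight: scale gap divided by the sum of all pairwise gaps."""
    total_gap = sum(abs(x - y) for x, y in combinations(scales, 2))
    return abs(a - b) / total_gap


def font_matching_degree(evaluations, similarities):
    """Sum of weight * similarity * mean evaluation value over all scale pairs.

    evaluations: {scale: matching effect evaluation value}
    similarities: {(a, b): similarity of KM matching results}, with a < b
    """
    scales = sorted(evaluations)
    degree = 0.0
    for a, b in combinations(scales, 2):
        mean_eval = (evaluations[a] + evaluations[b]) / 2
        degree += scale_weight(a, b, scales) * similarities[(a, b)] * mean_eval
    return degree


def pick_font(degrees):
    """Return the font whose font matching degree is the maximum."""
    return max(degrees, key=degrees.get)
```

Calling `pick_font({"FontA": 0.20, "FontB": 0.35})` would select `"FontB"` as the replacement font for the character, mirroring the selection of the maximum font matching degree described in this embodiment.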
The invention provides a method for vectorizing an electronic file based on a layout file. The distances from the edge points on each closed edge to the center point, in both the scanned bitmap of a character and the vector character bitmap, are used as the descriptive feature values of the closed edge; the distance sequences are then paired at multiple sampling scales, and the font of each character is identified from the pairing results, which avoids the influence of the image quality of the scanned document on the matching accuracy.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for vectorizing an electronic file based on a layout file is characterized by comprising the following steps:
acquiring a scanning bitmap of a paper document;
acquiring an index label and a font size label of each character on the scanning bitmap, and a bounding box of each character;
acquiring, from a vector character library, vector characters of different fonts corresponding to each character according to the index label and the font size label, converting the vector characters into a vector character bitmap, acquiring a bounding box of each vector character in the vector character bitmap, and acquiring the binary images inside the bounding box of the character and inside the bounding box of the corresponding vector character;
acquiring all closed edges in the two binary images, acquiring the maximum distance from the center point of the bounding box to the corresponding closed edge, and, taking the target edge point corresponding to the maximum distance as the starting point, sequentially acquiring the distances from all edge points to the center point of the corresponding bounding box in the clockwise direction, to obtain the distance sequence of each closed edge and the distance sequence sets corresponding to the two binary images;
sampling each distance sequence in the distance sequence sets using a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale, calculating the similarity distance between every two target sequences in the two target sequence sets under the same sampling scale, performing KM (Kuhn-Munkres) matching on the target sequences in the two target sequence sets according to the similarity distance, acquiring the similarity distance mean value of the matched target sequences, and taking the similarity distance mean value as a matching effect evaluation value;
and calculating the font matching degree of the characters and the vector characters of each corresponding font according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value, taking the vector character corresponding to the maximum font matching degree as a replacement object of the corresponding character in the vector diagram, and obtaining the vectorized file.
2. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of setting a plurality of different sampling scales comprises:
acquiring the length of a closed edge corresponding to a distance sequence with the shortest length in all distance sequences in the two distance sequence sets;
taking one tenth of the length of the closed edge corresponding to the distance sequence with the shortest length as the maximum sampling scale;
the sampling scales are 1, 2, 3, ..., in sequence up to the maximum sampling scale.
3. The method for vectorizing the electronic file based on the layout file according to claim 1, wherein the step of sampling each distance sequence in the distance sequence set by using a plurality of different sampling scales to obtain a target sequence set corresponding to each sampling scale comprises:
for each distance in a distance sequence of the distance sequence set, adding the distance and the corresponding distance one sampling-scale interval later and calculating their mean value; taking the mean value as a target value in the target sequence;
and by analogy, obtaining a target sequence corresponding to each sampling scale, and obtaining a target sequence set corresponding to the distance sequence set.
4. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein a dynamic time warping algorithm is used to calculate the similarity distance between every two target sequences in the two target sequence sets under the same sampling scale.
5. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of obtaining the mean value of the similarity distance between the matched target sequences comprises:
after matching is completed, recording the similarity distance of two unmatched target sequences as 1;
calculating the similarity distance between every two well-matched target sequences by using a dynamic time warping algorithm;
and after the matching is completed, calculating the mean value of the similarity distances of all unmatched target sequences together with the similarity distances of all matched pairs of target sequences.
6. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the step of calculating the similarity of the KM matching results of the target sequences in the corresponding target sequence sets at two different sampling scales comprises:
acquiring corresponding matched target sequence pairs in corresponding target sequence sets under two different sampling scales, and solving an intersection to obtain the number of the matched same target sequence pairs;
obtaining a total matching number by summing corresponding matched target sequence pairs in corresponding target sequence sets under two different sampling scales;
and taking the ratio of the number of the matched same target sequence pairs to the total number of matched target sequence pairs as the similarity of the KM matching results.
7. The method for vectorizing an electronic document based on a layout file according to claim 1, wherein the step of calculating the font matching degree between the text and the vector text of each corresponding font according to the similarity of the KM matching results of the target sequences in the target sequence set corresponding to each two different sampling scales and the matching effect evaluation value comprises:
calculating the mean value of the corresponding matching effect evaluation values under every two different sampling scales;
calculating the product of the similarity of KM matching results of target sequences in the corresponding target sequence set under every two different sampling scales, the mean value of the matching effect evaluation value and the weights of the corresponding two different sampling scales, and taking the product as the initial matching degree;
and summing all the initial matching degrees corresponding to every two different sampling scales in all the sampling scales to obtain the font matching degree of each character and the vector character of each corresponding font.
8. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the scanning bitmap is subjected to image correction, and the image correction comprises: and correcting the pose of the scanning bitmap of the paper document according to the scanning angle to obtain a front view of the scanning bitmap of the paper document, and taking the front view of the scanning bitmap as the scanning bitmap.
9. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein an OCR algorithm is used to obtain the text on the scanned bitmap and obtain the outer bounding box of each text.
10. The method for vectorizing an electronic file based on a layout file according to claim 1, wherein the index tag of the text is a tag of the text to be searched.
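
For illustration only, the distance sequence construction of claim 1 and the sampling scale range of claim 2 can be sketched in Python. The sketch assumes the edge points of a closed edge are already supplied in clockwise order, that the length of a closed edge equals the number of its edge points, and that one tenth of the shortest length is rounded down with a minimum of 1; the function names are hypothetical.

```python
import math


def distance_sequence(edge_points, center):
    """Distances from ordered closed-edge points to the bounding-box center.

    edge_points: list of (x, y) pixels, assumed already ordered clockwise.
    The sequence is rotated so it starts at the point farthest from the center.
    """
    distances = [math.hypot(x - center[0], y - center[1]) for x, y in edge_points]
    start = distances.index(max(distances))  # farthest edge point is the start
    return distances[start:] + distances[:start]


def sampling_scales(distance_sequences):
    """Scales 1..max, where max is one tenth of the shortest sequence length."""
    shortest = min(len(seq) for seq in distance_sequences)
    max_scale = max(1, shortest // 10)  # assumption: round down, at least 1
    return list(range(1, max_scale + 1))
```

Starting every sequence at the farthest edge point gives the two binary images a common reference position before their sequences are sampled and compared.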
CN202211266067.0A 2022-10-17 2022-10-17 Method for vectorizing electronic file based on layout file Active CN115346227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211266067.0A CN115346227B (en) 2022-10-17 2022-10-17 Method for vectorizing electronic file based on layout file

Publications (2)

Publication Number Publication Date
CN115346227A (en) 2022-11-15
CN115346227B (en) 2023-08-08

Family

ID=83957216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211266067.0A Active CN115346227B (en) 2022-10-17 2022-10-17 Method for vectorizing electronic file based on layout file

Country Status (1)

Country Link
CN (1) CN115346227B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037184A (en) * 2023-10-10 2023-11-10 深圳牛图科技有限公司 OCR fuzzy recognition system and method based on cloud matching
CN117236291A (en) * 2023-11-16 2023-12-15 北京点聚信息技术有限公司 Method and system for rapidly converting scanned file into vector layout file
CN117475438A (en) * 2023-10-23 2024-01-30 北京点聚信息技术有限公司 OCR technology-based scan file vectorization conversion method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232724A (en) * 2019-06-13 2019-09-13 大连民族大学 A kind of Chinese character style image vector representation method
CN114419635A (en) * 2022-03-30 2022-04-29 北京点聚信息技术有限公司 Electronic seal vector diagram identification method based on graphic identification
CN114926839A (en) * 2022-07-22 2022-08-19 富璟科技(深圳)有限公司 Image identification method based on RPA and AI and electronic equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037184A (en) * 2023-10-10 2023-11-10 深圳牛图科技有限公司 OCR fuzzy recognition system and method based on cloud matching
CN117475438A (en) * 2023-10-23 2024-01-30 北京点聚信息技术有限公司 OCR technology-based scan file vectorization conversion method
CN117475438B (en) * 2023-10-23 2024-05-24 北京点聚信息技术有限公司 OCR technology-based scan file vectorization conversion method
CN117236291A (en) * 2023-11-16 2023-12-15 北京点聚信息技术有限公司 Method and system for rapidly converting scanned file into vector layout file
CN117236291B (en) * 2023-11-16 2024-01-12 北京点聚信息技术有限公司 Method and system for rapidly converting scanned file into vector layout file

Also Published As

Publication number Publication date
CN115346227B (en) 2023-08-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant