CN115640788A - Method and device for structuring non-editable document - Google Patents

Method and device for structuring non-editable document

Info

Publication number
CN115640788A
Authority
CN
China
Prior art keywords
attribute
row
character
editable document
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211660341.2A
Other languages
Chinese (zh)
Other versions
CN115640788B (en)
Inventor
刘大海
王惠婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zuoyi Technology Co ltd
Original Assignee
Beijing Zuoyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zuoyi Technology Co ltd filed Critical Beijing Zuoyi Technology Co ltd
Priority to CN202211660341.2A priority Critical patent/CN115640788B/en
Publication of CN115640788A publication Critical patent/CN115640788A/en
Application granted granted Critical
Publication of CN115640788B publication Critical patent/CN115640788B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the technical field of information processing and provides a method and a device for structuring a non-editable document. The method comprises: obtaining a non-editable document to be processed; transcoding and row-aligning the document; determining, through a pre-constructed row classification model, the probabilities of a plurality of attribute tags for each row in the document, and taking the attribute tag with the highest probability as the category attribute of the row; performing row classification calibration and table division on the document according to the category attribute of each row; and performing column alignment on the document and outputting a structured result. The method outputs the structured result of a non-editable document automatically and efficiently, which facilitates subsequent analysis.

Description

Method and device for structuring non-editable document
Technical Field
The present invention relates generally to the field of information processing technologies, and in particular, to a method and an apparatus for structuring a non-editable document.
Background
A physical examination report (medical examination report) is a document in a non-editable format generated from the data obtained by examining the human body. At present, non-editable documents in the related art are mainly in PDF format, which a computer cannot directly parse; this is a limitation.
For example, Chinese patent application publication No. CN115391516A proposes an unstructured document extraction method that receives input target document information, screens, from a plurality of document cell matrices in a document cell matrix model, a plurality of target document cell matrices matching the target document type information, and extracts a target document based on the document extraction score corresponding to each target document cell matrix. The matrix model is relatively complex. Therefore, a method that quickly and effectively structures a non-editable document and converts it into an editable, structured result is urgently needed.
Disclosure of Invention
In view of the above-mentioned defects or shortcomings in the related art, it is desirable to provide a method and an apparatus for structuring a non-editable document, which can perform a structuring process on the non-editable document such as a physical examination report for facilitating a subsequent analysis.
In a first aspect, the present invention provides a method for structuring a non-editable document, including:
acquiring a to-be-processed non-editable document;
transcoding and line aligning the to-be-processed non-editable document;
determining, through a pre-constructed row classification model, the probabilities of a plurality of attribute tags corresponding to each row in the to-be-processed non-editable document, and taking the attribute tag with the highest probability as the category attribute of the row;
performing row classification calibration and table division on the to-be-processed non-editable document according to the category attribute of each row;
and performing column alignment on the to-be-processed non-editable document, and outputting a structured result of the to-be-processed non-editable document.
Optionally, in some embodiments of the present invention, transcoding and line aligning a to-be-processed non-editable document includes:
converting the to-be-processed non-editable document into a picture;
identifying each character and/or character block in the picture and the coordinates of each character and/or character block;
and merging the character blocks located in the same row into one row according to each character and/or character block and its coordinates, wherein the character blocks in the same row are joined with interval feature characters.
Optionally, in some embodiments of the present invention, merging the character blocks located in the same row into one row according to each character and/or character block and its coordinates further includes:
detecting the distance between characters;
and when the distance is smaller than a preset threshold, combining the characters into a new character block and calculating the coordinates of the new character block.
Optionally, in some embodiments of the present invention, the step of performing row classification calibration on the to-be-processed non-editable document according to the category attribute of each row includes:
if two adjacent rows both have a first category attribute and the first category attribute should not appear in two adjacent rows at the same time, comparing the probabilities of the first category attribute tag for the two rows, determining the row with the higher probability as the correctly classified row, and taking the attribute tag with the second-highest probability for the row with the lower probability as the new category attribute of that row;
if two adjacent rows have a second category attribute and a third category attribute respectively, and the second and third category attributes should not appear in two adjacent rows, comparing the probabilities of the category attribute tags of the two rows, determining the row whose category attribute tag has the higher probability as the correctly classified row, and taking the attribute tag with the second-highest probability for the other row as the new category attribute of that row;
and if the category attributes of two adjacent rows are a header and a footer, identifying the correctly classified row and the wrongly classified row according to the coordinate positions of the character strings of the two rows within the full text, and taking the attribute tag with the second-highest probability for the wrongly classified row as its new category attribute.
Optionally, in some embodiments of the present invention, performing table division on the to-be-processed non-editable document according to the category attribute of each row includes:
locating the table title rows in the to-be-processed non-editable document;
if the row preceding a table title row is a table name, taking the first character block in that preceding row as the table name;
if the row following the table title row is table content, adding the table content to the table content attribute of the table object, and stopping the addition when an end mark is detected.
Optionally, in some embodiments of the present invention, the method further includes:
merging all characters in the structured result into a first long character string according to a preset sequence, and merging all characters in the transcoded non-editable document to be processed into a second long character string;
and calculating the similarity of the first long character string and the second long character string, and outputting a similarity result.
Optionally, in some embodiments of the present invention, the similarity is calculated based on at least one of a longest common substring value, a longest common subsequence value, and a word error rate.
In a second aspect, the present invention provides an apparatus for structuring a non-editable document, the apparatus comprising:
an acquisition module configured to acquire a to-be-processed non-editable document;
a row alignment module configured to transcode and row-align the to-be-processed non-editable document;
a determining module configured to determine, through a pre-constructed row classification model, the probabilities of a plurality of attribute tags corresponding to each row in the to-be-processed non-editable document, and to take the attribute tag with the highest probability as the category attribute of the row;
a table dividing module configured to perform row classification calibration and table division on the to-be-processed non-editable document according to the category attribute of each row;
and an output module configured to perform column alignment on the to-be-processed non-editable document and output a structured result of the to-be-processed non-editable document.
In a third aspect, the present invention provides an electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, set of codes, or set of instructions is stored in the memory, and the instruction, program, set of codes, or set of instructions is loaded and executed by the processor to implement the steps of the structured method of non-editable documents described in any one of the first aspects.
In a fourth aspect, the present invention provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the method for structuring a non-editable document as described in any one of the first aspects.
According to the technical scheme, the embodiment of the invention has the following advantages:
The embodiments of the invention provide a method, an apparatus, an electronic device, and a storage medium for structuring a non-editable document. The to-be-processed non-editable document is transcoded to facilitate row alignment, and row classification calibration is then performed using the category attribute of each row, as determined by the row classification model, as a reference, so that classification errors do not degrade the accuracy of subsequent operations. Table division and column alignment are then performed, and the structured result of the document is output automatically, which is efficient, fast, and convenient for analysis.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of a method for structuring a non-editable document according to an embodiment;
FIG. 2 is a partial schematic diagram of a non-editable document according to an embodiment;
FIG. 3 is another flowchart of a method for structuring a non-editable document according to an embodiment;
FIG. 4 is a structural diagram of an apparatus for structuring a non-editable document according to an embodiment;
FIG. 5 is a second structural diagram of an apparatus for structuring a non-editable document according to an embodiment;
FIG. 6 is a third structural diagram of an apparatus for structuring a non-editable document according to an embodiment.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described are capable of operation in sequences other than those illustrated or otherwise described herein.
Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
For convenience of understanding and explanation, the following describes in detail a method, an apparatus, an electronic device, and a storage medium for structuring a non-editable document according to the embodiments of the present invention with reference to fig. 1 to 6.
Referring to FIG. 1, which is a flowchart of a method for structuring a non-editable document according to an embodiment of the present invention, the method includes the following steps:
S101, obtaining a to-be-processed non-editable document.
As shown in fig. 2, which is a partial schematic diagram of a physical examination report according to an embodiment of the present invention, the format of the physical examination report includes, but is not limited to, PDF, and specific contents include, but are not limited to, a table name, an operator, a table title, table contents, a summary, and the like.
S102, transcoding and line aligning the to-be-processed non-editable document.
For example, the embodiment of the present invention first converts the to-be-processed non-editable document into a picture. Second, Optical Character Recognition (OCR) is used to identify each character and/or character block in the picture together with its coordinates; OCR refers to a process in which an electronic device determines character shapes by detecting dark and light patterns and then translates those shapes into computer text using a character recognition method. Third, according to each character and/or character block and its coordinates, the character blocks located in the same row are merged into one row. The character blocks in the same row are joined with interval feature characters, which preserves the connection between the blocks while keeping them distinguishable. An interval feature character is a character that does not appear in the table content, and the interval feature characters can be represented by the fourth separator to avoid confusion; for example, they can be capital-form Chinese numerals (one, two, three, four, five) or special symbols such as #, ##, and ###. In addition, the embodiment of the present invention may detect the distance between characters, combine the characters into a new character block when the distance is smaller than a preset threshold, and calculate the coordinates of the new character block; the preset threshold is, for example, 10 pixels.
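The row alignment in S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Block` type, the separator `SEP`, and the vertical tolerance `ROW_TOL` are hypothetical names and values chosen for the example, and the OCR step is assumed to have already produced text blocks with coordinates. Only the 10-pixel character gap comes from the text above.

```python
from dataclasses import dataclass

SEP = "#"       # hypothetical interval feature character (must not occur in table content)
CHAR_GAP = 10   # preset threshold in pixels, as in the example above
ROW_TOL = 8     # hypothetical vertical tolerance for treating blocks as one row


@dataclass
class Block:
    text: str
    x: float  # left edge
    y: float  # vertical centre
    w: float  # width


def merge_close(blocks):
    """Merge horizontally adjacent blocks whose gap is below CHAR_GAP."""
    merged = []
    for b in sorted(blocks, key=lambda b: b.x):
        if merged and b.x - (merged[-1].x + merged[-1].w) < CHAR_GAP:
            last = merged[-1]
            # The new block spans from the left edge of `last` to the right edge of `b`.
            merged[-1] = Block(last.text + b.text, last.x, last.y, b.x + b.w - last.x)
        else:
            merged.append(b)
    return merged


def align_rows(blocks):
    """Group blocks into rows by y coordinate, then join each row with SEP."""
    rows = []
    for b in sorted(blocks, key=lambda b: b.y):
        if rows and abs(b.y - rows[-1][0].y) <= ROW_TOL:
            rows[-1].append(b)
        else:
            rows.append([b])
    return [SEP.join(m.text for m in merge_close(row)) for row in rows]
```

For instance, blocks "W" and "BC" that sit 2 pixels apart are merged into "WBC", while a block 60 pixels further right stays separate and is joined with the separator.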
S103, determining the probability of a plurality of attribute tags corresponding to each row in the non-editable document to be processed through a pre-constructed row classification model, and taking the attribute tag with the highest probability as the category attribute of the row.
It should be noted that, in the process of constructing the row classification model, the embodiment of the present invention first converts a plurality of sample non-editable documents into pictures and uses OCR to recognize each character and/or character block in the pictures together with its coordinates. Then, according to each character and/or character block and its coordinates, the character blocks located in the same row are merged into one row, with the character blocks in the same row joined by interval feature characters. In addition, if character coordinates are recognized, characters whose spacing is smaller than the preset threshold are combined into a new character block, and the coordinates of the new character block are calculated.
Third, the meaning of each row in the row alignment result is labeled. The attribute tags include at least table name, table title, table content, summary, operator, and other, and may also include header, footer, abnormal index summary, guidance suggestion, and the like. If a row carries several meanings, one of them must be assigned manually as the label; for example, if a row contains both a table name and an operator, it is labeled as operator.
Finally, a row classification model is trained on the labeled rows using a machine learning model or a deep learning model, and is used to identify the category attribute of each row, that is, whether the row is a table title, table content, a summary, and so on. The training features are the text features of a row, including the added interval feature characters, and the model is a multi-class model whose categories include, but are not limited to, table name, table title, table content, summary, operator, and other. Taking machine learning as an example, the model includes, but is not limited to, a Bayesian model, a decision tree model, a support vector machine model, a multi-layer perceptron model, and the like. Furthermore, in the embodiment of the present invention, the characters in each row may be converted into numerical values through a bag-of-words model or TF-IDF (Term Frequency-Inverse Document Frequency), models such as naive Bayes, Bernoulli naive Bayes, multinomial naive Bayes, and Gaussian naive Bayes are trained separately, and the best-performing model is taken as the row classification model. For example, the conversion of characters into values uses CountVectorizer (term-frequency feature vectors) and/or TfidfVectorizer (TF-IDF feature vectors) in sklearn. Both feature transformers are configured with min_df=5, meaning that a term whose document frequency is lower than min_df is not treated as a keyword; with analyzer='char', meaning that the text is analyzed character by character; and with ngram_range=(1, 4), meaning that the n-grams range from 1 to 4 characters in length.
For example, assuming that the probabilities of the attribute tags for a certain row are 60% table name, 20% table title, 10% table content, 5% summary, 3% operator, and 2% other, the attribute tag with the highest probability, namely table name, is taken as the category attribute of the row.
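To make the feature settings concrete, the following plain-Python sketch shows what character n-gram counting with a document-frequency cutoff amounts to. It is an illustration only and is not scikit-learn's implementation; the function names `char_ngrams`, `build_vocabulary`, and `count_vector` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """All character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def build_vocabulary(rows, min_df=5):
    """Keep n-grams that occur in at least min_df rows (document frequency)."""
    df = Counter()
    for row in rows:
        df.update(set(char_ngrams(row)))  # count each n-gram once per row
    return sorted(g for g, c in df.items() if c >= min_df)


def count_vector(row, vocabulary):
    """Term-frequency vector of one row over the vocabulary."""
    counts = Counter(char_ngrams(row))
    return [counts[g] for g in vocabulary]
```

Because the interval feature characters are part of each row's text, n-grams that span a separator (for example "b#") become features too, which is how the row layout can influence the classifier.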
S104, performing row classification calibration and table division on the to-be-processed non-editable document according to the category attribute of each row.
For example, in the process of row classification calibration, if two adjacent rows both have a first category attribute and the first category attribute should not appear in two adjacent rows at the same time, the probabilities of the first category attribute tag for the two rows are compared, the row with the higher probability is determined to be the correctly classified row, and the attribute tag with the second-highest probability for the row with the lower probability is taken as the new category attribute of that row.
For example, suppose the first category attribute is table name. Two adjacent rows cannot both be table names, so such a result indicates a recognition error. Assume that the probabilities of the attribute tags for the upper row are 60% table name, 20% table title, 10% table content, 5% summary, 3% operator, and 2% other, and that those for the lower row are 50% table name, 30% table title, 10% table content, 5% summary, 3% operator, and 2% other. Since the 60% table name probability of the upper row is higher than the 50% table name probability of the lower row, the upper row is the correctly classified row and its category attribute is determined to be table name; the lower row is the wrongly classified row and its category attribute is determined to be table title, the tag with the second-highest probability.
As another example, if two adjacent rows have a second category attribute and a third category attribute respectively, and the second and third category attributes should not appear in two adjacent rows, the probabilities of the category attribute tags of the two rows are compared, the row whose category attribute tag has the higher probability is determined to be the correctly classified row, and the attribute tag with the second-highest probability for the other row is taken as the new category attribute of that row.
For example, the classification model determines the category attributes of two adjacent rows to be table name and summary respectively, which is evidently a classification error because such rows cannot be adjacent. Assume that the probabilities of the attribute tags for the upper row are 60% table name, 20% table title, 10% table content, 5% summary, 3% operator, and 2% other, and that those for the lower row are 5% table name, 30% table title, 10% table content, 50% summary, 3% operator, and 2% other. Since the 60% table name probability of the upper row is higher than the 50% summary probability of the lower row, the upper row is the correctly classified row and its category attribute is determined to be table name; the lower row is the wrongly classified row and its category attribute is determined to be table title, the tag with the second-highest probability.
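The two probability-based calibration cases can be sketched as below, using the numbers from the examples above. The set `FORBIDDEN` of label pairs that may not be adjacent is a hypothetical placeholder (the source does not enumerate the pairs), and resolving ties in favour of the upper row is likewise an assumption. The header/footer case, which relies on coordinates rather than probabilities, is not covered here.

```python
# Hypothetical set of (upper, lower) label pairs that should not be adjacent.
FORBIDDEN = {("table name", "table name"), ("table name", "summary")}


def top(probs, rank=0):
    """Label with the rank-th highest probability (0 = highest)."""
    return sorted(probs, key=probs.get, reverse=True)[rank]


def calibrate(upper, lower):
    """Return the calibrated labels for two adjacent rows."""
    up, low = top(upper), top(lower)
    if (up, low) not in FORBIDDEN:
        return up, low
    # Keep the more confident row; demote the other to its second choice.
    # Ties go to the upper row (an assumption, not stated in the source).
    if upper[up] >= lower[low]:
        return up, top(lower, 1)
    return top(upper, 1), low
```

With the upper row at 60% table name and the lower row at 50% table name or 50% summary, both calls return table name for the upper row and table title for the lower row, matching the worked examples.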
Illustratively, if the category attributes of two adjacent rows are a header and a footer, the correctly classified row and the wrongly classified row are identified according to the coordinate positions of the character strings of the two rows within the full text, and the attribute tag with the second-highest probability for the wrongly classified row is taken as its new category attribute.
For example, a header and a footer cannot be adjacent rows, so the embodiment of the present invention may use the coordinates of the characters and/or character blocks to detect where the rows classified as header and footer lie within the full text. If the coordinates are very close to the start, the category attribute of the row is header; if the coordinates are very close to the end, the row is unlikely to be a header. The position of a footer is checked in the same way. For the row whose position contradicts its label, the attribute tag with the second-highest probability is taken as its category attribute.
In the process of table division, since the main body of the non-editable document consists essentially of tables, this step identifies which rows together form a table. A complete table can comprise a table name, a table title, table content, a summary, and an operator, so each table can be regarded as a class whose attributes are the table name, table title, table content, summary, and operator. Specifically, the embodiment of the present invention first locates the table title rows in the to-be-processed non-editable document. If the row preceding a table title row is a table name, the first character block in that row is used as the table name. If the row following the table title row is table content, the table content is added to the table content attribute of the table object, and the addition stops when an end mark is detected, where the end mark may be a summary or an operator. In addition, since a table does not always have a summary and an operator, a summary or operator is treated as the summary or operator attribute of the table object only when it is recognized before the next table title.
Optionally, in some embodiments of the present invention, rows that contribute little to the analysis of the non-editable document, such as headers and footers, may be deleted as invalid rows, which avoids interference with the subsequent table division and improves processing efficiency.
S105, performing column alignment on the to-be-processed non-editable document, and outputting a structured result of the to-be-processed non-editable document.
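The table division described above can be sketched as follows. The `Table` class and `split_tables` function are hypothetical names; rows are assumed to arrive as (label, text) pairs after calibration, with character blocks joined by a separator. In this sketch, content collection stops at the first row that is not table content, which covers the summary and operator end marks.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str = ""
    title: str = ""
    content: list = field(default_factory=list)
    summary: str = ""
    operator: str = ""


def split_tables(rows, sep="#"):
    """Build one Table per table title row from (label, text) pairs."""
    tables = []
    for i, (label, text) in enumerate(rows):
        if label != "table title":
            continue
        table = Table(title=text)
        if i > 0 and rows[i - 1][0] == "table name":
            table.name = rows[i - 1][1].split(sep)[0]  # first character block
        j = i + 1
        while j < len(rows) and rows[j][0] == "table content":
            table.content.append(rows[j][1])
            j += 1
        # A summary or operator row belongs to this table only if it is
        # found before the next table title.
        while j < len(rows) and rows[j][0] in ("summary", "operator"):
            setattr(table, rows[j][0], rows[j][1])
            j += 1
        tables.append(table)
    return tables
```

Treating each table as an object with optional `summary` and `operator` fields mirrors the observation that not every table has them.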
For example, the embodiment of the present invention performs column alignment with a clustering algorithm. The coordinates of the characters and/or character blocks are the clustering features, and the data assigned to the same cluster represent one column of the table. The clustering algorithm includes, but is not limited to, any one of DBSCAN (Density-Based Spatial Clustering of Applications with Noise), spectral clustering, k-means, and hierarchical clustering. In this way, the embodiment of the present invention can restore the tables in the original PDF and output them to a txt file in structured form.
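As a rough illustration of column alignment, the sketch below groups the x coordinates of character blocks with a simple gap-based rule. This is a stand-in for the clustering algorithms named above, not any one of them specifically; the function name `cluster_columns` and the `eps` value of 15 pixels are assumptions made for the example.

```python
def cluster_columns(xs, eps=15):
    """Group x coordinates: a gap larger than eps starts a new column.

    Returns one column index per input coordinate, numbered left to right.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    col = 0
    for prev, cur in zip(order, order[1:]):
        if xs[cur] - xs[prev] > eps:
            col += 1  # large horizontal gap: next column begins here
        labels[cur] = col
    return labels
```

Blocks that receive the same index are then written into the same column of the output table.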
Optionally, as shown in FIG. 3, after the structured result is obtained, some embodiments of the present invention further include the following steps:
S106, merging the characters in the structured result into a first long character string according to a preset order, and merging the characters in the transcoded to-be-processed non-editable document into a second long character string. For example, the characters in the structured result are merged, from left to right and from top to bottom, into a first long character string A, and the characters in the transcoded to-be-processed non-editable document are merged, from left to right and from top to bottom, into a second long character string B.
S107, calculating the similarity between the first long character string and the second long character string, and outputting a similarity result. For example, the similarity may be calculated based on at least one of a longest common substring value, a longest common subsequence value, and a Word Error Rate (WER). The longest common substring value is the number of characters in the longest common substring of A and B divided by the number of characters in B. The longest common subsequence value is the number of characters in the longest common subsequence of A and B divided by the number of characters in B. The word error rate is obtained by comparing A with B, counting the characters that must be inserted (Insertion), deleted (Deletion), or replaced (Substitution) in A to make it identical to B, and dividing that count by the number of characters in B.
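The three measures in S107 can be sketched with standard dynamic programming, computed at the character level. The function names are hypothetical, and the second string B is assumed to be non-empty because every measure divides by its length.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous run shared by a and b."""
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence(a, b):
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a, b):
    """Minimum insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def similarity_report(a, b):
    """All three measures, each normalised by the length of b (non-empty)."""
    n = len(b)
    return {"substring": longest_common_substring(a, b) / n,
            "subsequence": longest_common_subsequence(a, b) / n,
            "error_rate": edit_distance(a, b) / n}
```

Note that the first two values grow as the strings agree more, whereas for the error rate a lower value indicates a closer match.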
It should be noted that higher similarity indicates less information loss after structuring, and the structuring effect is better. In addition, the embodiment of the present invention may return the similarity result and the structured result to the user who uses the structured non-editable document together, and the user may determine whether the structured result is used according to the similarity result, or may determine according to a similarity threshold set by the user, so as to improve the degree of automation, for example, if the similarity is lower than the similarity threshold of 0.5, the structured result is not used.
The embodiment of the invention provides a structuring method of a non-editable document, which is characterized in that the non-editable document to be processed is transcoded to facilitate line alignment, and then line classification calibration is carried out on the non-editable document to be processed by taking each line type attribute in the non-editable document to be processed, which is determined by a line classification model, as a reference, so that the accuracy of subsequent operation is prevented from being influenced by errors; furthermore, the to-be-processed non-editable document is subjected to list and column alignment, and the structured result of the to-be-processed non-editable document is automatically output, so that the method is efficient, quick and convenient to analyze.
Based on the foregoing embodiments, the present invention provides an apparatus for structuring a non-editable document. The apparatus 100 for structuring a non-editable document can be applied to the method for structuring a non-editable document according to the embodiments corresponding to fig. 1 to fig. 3. Referring to fig. 4, the apparatus 100 for structuring a non-editable document includes:
an obtaining module 101 configured to obtain a to-be-processed non-editable document;
the line alignment module 102 is configured to transcode and line align the to-be-processed non-editable document;
the determining module 103 is configured to determine, through a pre-constructed line classification model, probabilities of a plurality of attribute tags corresponding to each line in the non-editable document to be processed, and use an attribute tag with the highest probability as a category attribute of the line;
the table dividing module 104 is configured to perform row classification calibration and table division on the to-be-processed non-editable document according to the category attribute of each row;
the output module 105 is configured to perform column alignment on the to-be-processed non-editable document and output a structured result of the to-be-processed non-editable document.
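The chaining of modules 101 to 105 can be sketched as follows. This is a structural illustration only: the class name, field names, and callable signatures are assumptions, and each module is represented by a pluggable function rather than by any concrete OCR or classification model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StructuringApparatus:
    # Each field stands in for one module; the signatures are hypothetical.
    obtain: Callable[[str], bytes]                        # obtaining module 101
    align_rows: Callable[[bytes], List[str]]              # row alignment module 102
    classify_row: Callable[[str], Dict[str, float]]       # determining module 103
    divide_tables: Callable[[List[str], List[str]], list]  # table dividing module 104
    align_columns: Callable[[list], list]                 # output module 105

    def run(self, source: str) -> list:
        document = self.obtain(source)
        rows = self.align_rows(document)
        # Take the attribute tag with the highest probability as the
        # category attribute of each row.
        labels = []
        for row in rows:
            probabilities = self.classify_row(row)
            labels.append(max(probabilities, key=probabilities.get))
        tables = self.divide_tables(rows, labels)
        return self.align_columns(tables)
```

Keeping the modules as injected callables mirrors the remark in the description that the module division is logical and may be regrouped in an actual implementation.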
Optionally, as shown in fig. 5, in some embodiments of the present invention, the row alignment module 102 includes:
a conversion unit 1021 configured to convert a non-editable document to be processed into a picture;
a recognition unit 1022 configured to recognize each character and/or character block in the picture and coordinates of each character and/or character block;
and a merging unit 1023 configured to merge the character blocks in the same row into one row according to each character and/or character block and the coordinates of each character and/or character block, wherein the character blocks in the same row are spliced using a separator character.
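The merging step performed by the merging unit can be sketched as below, assuming each recognized block carries a left edge `x` and a vertical centre `y`. The `Block` type, the `y_tolerance` parameter, and the tab separator are illustrative assumptions; the patent only states that blocks in the same row are joined with a separator character.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    x: float  # left edge of the block
    y: float  # vertical centre of the block


def merge_rows(blocks: List[Block], y_tolerance: float = 5.0,
               separator: str = "\t") -> List[str]:
    """Group blocks whose vertical centres are close, then join each group."""
    rows: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        # A block joins the current row if it is vertically close to the
        # first block of that row; otherwise it starts a new row.
        if rows and abs(block.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    # Within a row, order blocks left to right before splicing.
    return [separator.join(b.text for b in sorted(row, key=lambda b: b.x))
            for row in rows]
```

With blocks "Name" at (10, 10), "Qty" at (100, 11), "apple" at (10, 30) and "3" at (100, 31), the sketch returns two rows, "Name\tQty" and "apple\t3".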
Optionally, in some embodiments of the present invention, the merging unit 1023 is further configured to detect a distance between characters;
and when the distance is smaller than a preset threshold value, combining the characters to form a new character block, and calculating the coordinates of the new character block.
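A minimal sketch of the distance-based merge follows, assuming the characters already lie on the same row and that each one has an axis-aligned bounding box. The `Box` type and the `max_gap` threshold value are hypothetical; the coordinates of the new character block are taken here as the union of the merged boxes, which is one plausible reading of "calculating the coordinates of the new character block".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def merge_close_characters(chars: List[Box], max_gap: float = 3.0) -> List[Box]:
    """Merge horizontally adjacent characters whose gap is below max_gap."""
    merged: List[Box] = []
    for ch in sorted(chars, key=lambda c: c.x0):
        if merged and ch.x0 - merged[-1].x1 < max_gap:
            last = merged[-1]
            # The new block spans both boxes.
            merged[-1] = Box(last.text + ch.text,
                             min(last.x0, ch.x0), min(last.y0, ch.y0),
                             max(last.x1, ch.x1), max(last.y1, ch.y1))
        else:
            merged.append(ch)
    return merged
```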
Optionally, in some embodiments of the present invention, the table dividing module 104 is further configured to: if the upper row and the lower row of two adjacent rows both have the first category attribute, compare the probability that the upper row corresponds to the first category attribute with the probability that the lower row corresponds to the first category attribute, keep the first category attribute for the row with the higher probability, and take the attribute tag with the second-highest probability in the row with the lower probability as the category attribute of that row;
and/or, if the upper row of two adjacent rows has a second category attribute and the lower row has a third category attribute, detect the positions of the upper row and the lower row respectively, and take the attribute tag with the second-highest probability in the row whose position is wrong as the category attribute of that row.
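The first calibration rule, in which two adjacent rows both carry a category attribute that should not repeat, can be sketched as follows. The label names and the function names are hypothetical, and the tie-breaking choice (the lower row is relabelled when both probabilities are equal) is an assumption that the patent does not address. The position-based rule is not covered by this sketch.

```python
from typing import Dict, List


def second_best(probabilities: Dict[str, float]) -> str:
    """Attribute tag with the second-highest probability (needs >= 2 tags)."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[1]


def calibrate_adjacent_duplicates(rows: List[Dict[str, float]],
                                  exclusive_label: str) -> List[str]:
    """Relabel the less confident of two adjacent rows sharing exclusive_label."""
    labels = [max(p, key=p.get) for p in rows]
    for i in range(len(rows) - 1):
        if labels[i] == labels[i + 1] == exclusive_label:
            upper = rows[i][exclusive_label]
            lower = rows[i + 1][exclusive_label]
            # Assumption: on a tie, the lower row is treated as misclassified.
            loser = i if upper < lower else i + 1
            labels[loser] = second_best(rows[loser])
    return labels
```

For example, if two adjacent rows are both classified as a table title with probabilities 0.6 and 0.9, the row at 0.6 falls back to its second-ranked tag while the row at 0.9 keeps the title attribute.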
Optionally, in some embodiments of the present invention, the table dividing module 104 is further configured to locate a table title row in the to-be-processed non-editable document;
if the row preceding the table title row is a table name row, take the first character block in the preceding row as the table name;
if the row following the table title row is table content, add the table content to the table content attribute of the table object, and stop adding when an end flag is detected.
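The table division logic can be sketched as below. The label strings, the dictionary layout of the table object, and the tab separator are illustrative assumptions. The patent does not define the end flag in this passage, so the sketch treats the first row that is not labelled as table content as the end flag.

```python
from typing import List, Optional


def divide_tables(rows: List[str], labels: List[str],
                  separator: str = "\t") -> List[dict]:
    """Build one table object per table title row found in the document."""
    tables: List[dict] = []
    i = 0
    while i < len(rows):
        if labels[i] != "table_title":
            i += 1
            continue
        name: Optional[str] = None
        # If the preceding row is a table name row, its first block is the name.
        if i > 0 and labels[i - 1] == "table_name":
            name = rows[i - 1].split(separator)[0]
        table = {"name": name, "title": rows[i].split(separator), "content": []}
        i += 1
        # Assumed end flag: the first row not labelled as table content.
        while i < len(rows) and labels[i] == "table_content":
            table["content"].append(rows[i].split(separator))
            i += 1
        tables.append(table)
    return tables
```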
Optionally, as shown in fig. 6, in some embodiments of the present invention, the apparatus 100 for structuring a non-editable document further includes:
the merging module 106 is configured to merge characters in the structured result into a first long character string according to a preset sequence, and merge characters in the transcoded non-editable document to be processed into a second long character string;
the calculating module 107 is configured to calculate a similarity between the first long character string and the second long character string, and output a similarity result.
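How the two long character strings might be assembled is sketched below. The "preset sequence" is not specified in this passage, so the sketch assumes a simple reading order (table name, then title cells, then content cells row by row); the table dictionary layout and the function names are likewise hypothetical.

```python
from typing import List


def flatten_structured(tables: List[dict]) -> str:
    """First long string: characters of the structured result in reading order."""
    parts: List[str] = []
    for table in tables:
        if table.get("name"):
            parts.append(table["name"])
        parts.extend(table["title"])
        for row in table["content"]:
            parts.extend(row)
    return "".join(parts)


def flatten_transcoded(rows: List[str], separator: str = "\t") -> str:
    """Second long string: characters of the transcoded rows, separators removed."""
    return "".join(cell for row in rows for cell in row.split(separator))
```

If nothing is lost during structuring, both functions yield the same string, and any similarity measure computed on them reaches its best value.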
Optionally, in some embodiments of the present invention, the calculating module 107 is further configured to calculate the similarity based on at least one of the longest common substring value, the longest common subsequence value, and the word error rate.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
The embodiment of the invention provides an apparatus for structuring a non-editable document. The obtaining module obtains the to-be-processed non-editable document; the row alignment module transcodes and row-aligns it; the determining module determines, through a pre-constructed row classification model, the probabilities of a plurality of attribute tags corresponding to each row in the to-be-processed non-editable document, and takes the attribute tag with the highest probability as the category attribute of the row; the table dividing module then performs row classification calibration and table division according to the category attribute of each row; and the output module performs column alignment and automatically outputs the structured result of the to-be-processed non-editable document. The apparatus is therefore efficient and fast, and facilitates subsequent analysis.
Based on the foregoing embodiments, an embodiment of the present invention provides an electronic device, which includes a processor and a memory. The memory stores at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded and executed by the processor to implement the steps of the method for structuring a non-editable document according to the embodiments corresponding to fig. 1 to fig. 3.
As another aspect, an embodiment of the present invention provides a computer-readable storage medium for storing program code for implementing any one of the foregoing methods for structuring a non-editable document according to the corresponding embodiments in fig. 1 to 3.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form. Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more units are integrated into one module. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit. The integrated unit, if implemented as a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium.
Based on such understanding, the technical solution of the present invention, which essentially or partly contributes to the prior art, or all or part of the technical solution may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method for structuring a non-editable document according to various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of structuring a non-editable document, said method comprising:
acquiring a to-be-processed non-editable document;
transcoding and line aligning the to-be-processed non-editable document;
determining the probability of a plurality of attribute tags corresponding to each line in the non-editable document to be processed through a pre-constructed line classification model, and taking the attribute tag with the highest probability as the category attribute of the line;
performing row classification calibration and tabulation on the to-be-processed non-editable document according to the category attribute of each row;
and aligning the columns of the to-be-processed non-editable document, and outputting a structured result of the to-be-processed non-editable document.
2. The method for structuring a non-editable document according to claim 1, wherein the transcoding and line aligning the to-be-processed non-editable document comprises:
converting the to-be-processed non-editable document into a picture;
identifying each character and/or character block in the picture and coordinates of each character and/or character block;
and respectively merging the character blocks in the same line into a line according to the coordinates of each character and/or character block and each character and/or character block, wherein the character blocks in the same line are spliced by using separator characters.
3. The method according to claim 2, wherein said combining the character blocks in the same row into one row according to the coordinates of each character and/or character block and each character and/or character block further comprises:
detecting a distance between each of the characters;
and when the distance is smaller than a preset threshold value, combining the characters to form a new character block, and calculating the coordinate of the new character block.
4. The method of claim 1, wherein the step of calibrating the line classification of the non-editable document to be processed according to the category attribute of each line comprises:
if two adjacent rows are both the first category attribute and the first category attribute should not simultaneously appear in two adjacent rows, comparing the probabilities of the attribute tags of the first category attribute of the two rows, determining the row with the high probability of the attribute tag of the first category attribute as a row with correct classification, and taking the attribute tag with the second high probability corresponding to the row with the low probability of the attribute tag of the first category attribute as a new category attribute of the row;
if two adjacent rows are respectively a second class attribute and a third class attribute, and the second class attribute and the third class attribute should not appear in the two adjacent rows, comparing the probabilities of the attribute tags of the first class attributes of the two rows, determining the row with the high probability of the attribute tag of the first class attribute as a row with correct classification, and taking the attribute tag with the second high probability corresponding to the row with the low probability of the attribute tag of the first class attribute as a new class attribute of the row;
and if the category attributes of the two adjacent lines are a header and a footer, identifying the classified correct line and the classified wrong line of the two adjacent lines according to the coordinate positions of the character strings in the two lines in the full text, and taking the attribute label with the second highest probability in the classified wrong line as a new category attribute.
5. The method according to claim 1, wherein the tabulating the to-be-processed non-editable document according to the category attribute of each row comprises:
positioning table title lines in the to-be-processed non-editable document;
if the preceding row of the table title row is the table name, taking the first character block in the preceding row as the table name;
and if the next row of the table title row is the table content, adding the table content to the table content attribute of the table object, and stopping adding when an end mark is detected.
6. A method for structuring a non-editable document according to any one of claims 1 to 5, further comprising:
merging the characters in the structured result into a first long character string according to a preset sequence, and merging the transcoded characters in the to-be-processed non-editable document into a second long character string;
and calculating the similarity of the first long character string and the second long character string, and outputting a similarity result.
7. The method of claim 6, wherein the similarity is calculated based on at least one of a longest common substring value, a longest common subsequence value, and a word error rate.
8. An apparatus for structuring a non-editable document, said apparatus comprising:
the acquisition module is configured to acquire a to-be-processed non-editable document;
the line alignment module is configured to transcode and line align the to-be-processed non-editable document;
the determining module is configured to determine the probabilities of a plurality of attribute tags corresponding to each row in the to-be-processed non-editable document through a pre-constructed row classification model, and take the attribute tag with the highest probability as the category attribute of the row;
the table dividing module is configured to perform row classification calibration and table division on the to-be-processed non-editable document according to the category attribute of each row;
and the output module is configured to align the columns of the to-be-processed non-editable document and output a structured result of the to-be-processed non-editable document.
9. The apparatus of claim 8, wherein:
the row alignment module is further configured to:
converting the to-be-processed non-editable document into a picture;
identifying each character and/or character block in the picture and coordinates of each character and/or character block;
and respectively merging the character blocks in the same line into a line according to the coordinates of each character and/or character block and each character and/or character block, wherein the character blocks in the same line are spliced by using separator characters.
10. The apparatus of claim 8, wherein:
the table dividing module is further configured to:
if two adjacent rows are both the first class attribute and the first class attribute should not simultaneously appear in two adjacent rows, comparing the probabilities of the attribute tags of the first class attribute of the two rows, determining the row with the high probability of the attribute tag of the first class attribute as a row with correct classification, and taking the attribute tag with the second high probability corresponding to the row with the low probability of the attribute tag of the first class attribute as the new class attribute of the row;
if two adjacent rows are respectively a second class attribute and a third class attribute, and the second class attribute and the third class attribute should not appear in the two adjacent rows, comparing the probabilities of the attribute tags of the first class attributes of the two rows, determining the row with the high probability of the attribute tag of the first class attribute as a row with correct classification, and taking the attribute tag with the second high probability corresponding to the row with the low probability of the attribute tag of the first class attribute as a new class attribute of the row;
and if the category attributes of the two adjacent lines are a header and a footer, identifying the classified correct line and the classified wrong line of the two adjacent lines according to the coordinate positions of the character strings in the two lines in the full text, and taking the attribute label with the second highest probability in the classified wrong line as a new category attribute.
CN202211660341.2A 2022-12-23 2022-12-23 Method and device for structuring non-editable document Active CN115640788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211660341.2A CN115640788B (en) 2022-12-23 2022-12-23 Method and device for structuring non-editable document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211660341.2A CN115640788B (en) 2022-12-23 2022-12-23 Method and device for structuring non-editable document

Publications (2)

Publication Number Publication Date
CN115640788A true CN115640788A (en) 2023-01-24
CN115640788B CN115640788B (en) 2023-03-21

Family

ID=84949987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211660341.2A Active CN115640788B (en) 2022-12-23 2022-12-23 Method and device for structuring non-editable document

Country Status (1)

Country Link
CN (1) CN115640788B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9292876B1 (en) * 2014-12-16 2016-03-22 Docusign, Inc. Systems and methods for employing document snapshots in transaction rooms for digital transactions
CN105898448A (en) * 2015-12-14 2016-08-24 乐视云计算有限公司 Submission method and device of transcoding attribute information
CN114510547A (en) * 2021-09-23 2022-05-17 成都四方伟业软件股份有限公司 Method and device for extracting structured information of PDF (Portable document Format) file
CN115292246A (en) * 2022-08-02 2022-11-04 上海百家云科技有限公司 Document transcoding method and device and electronic equipment
CN115392188A (en) * 2022-08-23 2022-11-25 杭州未名信科科技有限公司 Method and device for generating editable document based on non-editable image-text images

Also Published As

Publication number Publication date
CN115640788B (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant