CN117033642A - Document analysis method and device - Google Patents

Info

Publication number
CN117033642A
CN117033642A CN202311293136.1A CN202311293136A CN117033642A CN 117033642 A CN117033642 A CN 117033642A CN 202311293136 A CN202311293136 A CN 202311293136A CN 117033642 A CN117033642 A CN 117033642A
Authority
CN
China
Prior art keywords
text
block
blocks
character
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311293136.1A
Other languages
Chinese (zh)
Inventor
罗华刚
付淳川
张�杰
于皓
李犇
崔明飞
王展
贾敬伍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongguancun Kejin Technology Co Ltd
Original Assignee
Beijing Zhongguancun Kejin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongguancun Kejin Technology Co Ltd
Priority to CN202311293136.1A
Publication of CN117033642A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/109Font handling; Temporal or kinetic typography
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19107Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a document analysis method and device, and relates to the technical field of artificial intelligence. The method comprises the following steps: extracting characters from a document to be parsed to obtain character information in the document, wherein the character information comprises the content of the characters and the position information of the characters; clustering the characters according to their position information to obtain a plurality of text blocks; determining the information of each text block according to the information of the characters in the text block, wherein the information of the text block comprises the content of the text block and the position information of the text block; sorting the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result; and generating a document parsing result according to the content of the text blocks and the sorting result. With this method, the characters can be ordered accurately even for content laid out in columns or blocks, which improves the quality of the parsed content and the accuracy of question-answering and summarization results obtained from it.

Description

Document analysis method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a document analysis method and device.
Background
With the development of technology, enterprises and individuals generate a large amount of data in daily work, and most of this data exists as unstructured text, that is, as documents. People often need to obtain answers to questions, or summaries, from these documents, which requires parsing them. Parsing a document means extracting its text content and arranging that content, in the correct order, into a complete and ordered whole.
In the related art, only the content of the characters in a document is parsed, and the characters are ordered row by row. For content laid out in columns or blocks, row-by-row ordering interleaves the content of different columns or blocks that lie on the same horizontal line, so the characters are ordered inaccurately, the parsed content is of poor quality, and the accuracy of question-answering and summarization results suffers.
Disclosure of Invention
The embodiment of the application aims to provide a document analysis method and a document analysis device, which are used for solving the problems that the ordering of characters is inaccurate in the related technology, the quality of analyzed contents is poor, and the accuracy of question-answering and abstract results is affected.
In order to realize the technical scheme, the embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a document parsing method, including: extracting characters from a document to be parsed to obtain character information in the document, wherein the character information comprises the content of the characters and the position information of the characters; clustering the characters according to the position information of the characters to obtain a plurality of text blocks; determining the information of each text block according to the information of the characters in the text block, wherein the information of the text block comprises the content of the text block and the position information of the text block; sorting the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result; and generating a document parsing result according to the content of the text blocks and the sorting result.
In a second aspect, an embodiment of the present application provides a document parsing apparatus, including: the extraction module is used for extracting characters from the document to be analyzed to obtain character information in the document, wherein the character information comprises the content of the characters and the position information of the characters; the clustering module is used for clustering the characters according to the position information of the characters to obtain a plurality of character blocks; the determining module is used for determining the information of the character blocks according to the information of the characters in the character blocks, wherein the information of the character blocks comprises the content of the character blocks and the position information of the character blocks; the ordering module is used for ordering the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain an ordering result; and the generation module is used for generating a document analysis result according to the content of the text block and the sequencing result.
In a third aspect, an embodiment of the present application provides a document parsing apparatus, including: a processor; and a memory arranged to store computer executable instructions configured to be executed by the processor, the executable instructions comprising steps for performing the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a storage medium storing computer-executable instructions for causing a computer to perform the steps of the method as described in the first aspect.
It can be seen that, in the embodiment of the present application, when a document is parsed, characters are first extracted from the document to be parsed to obtain character information, which includes the content and the position information of the characters. The characters are then clustered according to their position information to obtain a plurality of text blocks, and the information of each text block, which includes the content and the position information of the text block, is determined from the information of the characters in it. The text blocks are sorted according to their content and/or position information to obtain a sorting result, and a document parsing result is generated according to the content of the text blocks and the sorting result. Because the characters are clustered into text blocks according to the position information of the characters, and the text blocks are then sorted according to their content and/or position information, rather than the characters being ordered row by row, the characters can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the parsed content and the accuracy of question-answering and summarization results obtained from it.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a method for parsing a document according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for parsing a document according to another embodiment of the present application;
FIG. 3 is a flowchart of a document parsing method according to another embodiment of the present application;
FIG. 4 is a flowchart of a document parsing method according to another embodiment of the present application;
FIG. 5 is a flowchart of a document parsing method according to another embodiment of the present application;
FIG. 6 is a flowchart of a document parsing method according to another embodiment of the present application;
FIG. 7 is a flowchart of a document parsing method according to another embodiment of the present application;
FIG. 8 is a schematic overall flow chart of a document parsing method according to another embodiment of the present application;
FIG. 9 is a schematic diagram of a document parsing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a document parsing apparatus according to another embodiment of the present application;
FIG. 11 is a schematic structural view of a document parsing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, shall fall within the scope of the application.
It should be noted that, without conflict, embodiments of the present application and features of the embodiments may be combined with each other. Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
The embodiments of the application provide a document analysis method and device. The main inventive concept is as follows. In the related art, parsing a document only parses the content of the characters in it, and the characters are ordered row by row. For content laid out in columns or blocks, however, row-by-row ordering interleaves the content of different columns or blocks that lie on the same horizontal line, so the characters are ordered inaccurately, the parsed content is of poor quality, and the accuracy of question-answering and summarization results suffers. Therefore, in the method and device provided by the embodiments of the application, characters are first extracted from the document to be parsed to obtain character information, which includes the content and the position information of the characters. The characters are then clustered according to their position information to obtain a plurality of text blocks, and the information of each text block, which includes the content and the position information of the text block, is determined from the information of the characters in it. The text blocks are sorted according to their content and/or position information to obtain a sorting result, and a document parsing result is generated according to the content of the text blocks and the sorting result.
Because the characters are clustered into text blocks according to the position information of the characters, and the text blocks are then sorted according to their content and/or position information, rather than the characters being ordered row by row, the characters can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the parsed content and the accuracy of question-answering and summarization results obtained from it.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flow chart of a document parsing method according to an embodiment of the present application. As shown in FIG. 1, the document parsing method according to the embodiment of the application specifically includes the following steps:
S101, extracting characters from a document to be analyzed to obtain character information in the document, wherein the character information comprises character content and character position information.
In the embodiment of the application, the document parsing method is executed by a document parsing apparatus, which can be arranged in a document parsing device. The document parsing device may be a terminal device or a server. The terminal device can be a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, a vehicle-mounted device and the like; the server may be an independent server or a server cluster composed of a plurality of servers.
The document to be parsed, namely the document to be subjected to content parsing. The document is an unstructured text and may include, but is not limited to, enterprise regulations, technical documents, promotional material, etc., and the format of the document may include, but is not limited to, pdf, word, ppt, etc. Text, pictures, tables and the like can be specifically included in the document.
The text information may include, but is not limited to, content of a single text in a document, location information of the single text, and the like.
The text content is text content corresponding to the characters. The position information of the text may specifically include position information of a rectangular area (denoted as a second rectangular area) corresponding to the text. The position information of the second rectangular area may be expressed as (l, r, u, b), where l is a left boundary of the second rectangular area, i.e., an abscissa minimum value of the second rectangular area, r is a right boundary of the second rectangular area, i.e., an abscissa maximum value of the second rectangular area, u is an upper boundary of the second rectangular area, i.e., an ordinate minimum value of the second rectangular area, and b is a lower boundary of the second rectangular area, i.e., an ordinate maximum value of the second rectangular area.
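The (l, r, u, b) representation described above can be sketched as a small record type. This is a minimal illustration rather than the application's implementation: the class name `CharBox` and its field names are hypothetical, and the y-axis is assumed to grow downward, so that the upper boundary is the minimum ordinate as the text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharBox:
    """One extracted character and its second rectangular area (hypothetical type)."""
    text: str   # content of the character
    l: float    # left boundary: minimum abscissa
    r: float    # right boundary: maximum abscissa
    u: float    # upper boundary: minimum ordinate (y assumed to grow downward)
    b: float    # lower boundary: maximum ordinate

    def __post_init__(self):
        # A well-formed rectangle has l <= r and u <= b.
        if self.l > self.r or self.u > self.b:
            raise ValueError("expected l <= r and u <= b")


sample = CharBox("A", l=10.0, r=18.0, u=40.0, b=52.0)
```

Whichever extraction route is used (file encoding or OCR), normalizing its output into one such record per character keeps the later clustering and sorting steps independent of the extraction method.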
As a possible implementation manner, characters can be extracted from the document according to the text encoding information of the document to obtain the content and position information of the characters. The text encoding information is the binary string corresponding to the non-image text in the document. Each file type has a corresponding standard format, so character extraction can be performed on the basis of this binary string to obtain the content and position information of the characters.
As another possible implementation manner, optical character recognition (Optical Character Recognition, abbreviated as OCR) may be performed on the document to implement text extraction on the document, so as to obtain text content and location information. Where OCR refers to a process in which an electronic device (e.g., a scanner or digital camera) checks characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method.
It will be appreciated by those skilled in the art that for documents that include both pictorial and non-pictorial text, the text content and location information may be obtained in two ways, respectively, and then fused to obtain the final text content and location information.
S102, clustering the characters according to the position information of the characters to obtain a plurality of character blocks.
In the embodiment of the application, the characters of the document are clustered on the basis of the position information of the individual characters acquired in step S101, and each resulting cluster is taken as one text block, so that a plurality of text blocks is obtained. The purpose of clustering by position information is to divide the characters into text blocks according to how densely they are positioned: characters in the same text block are close to each other, and characters in different text blocks are farther apart.
S103, determining the information of the character blocks according to the information of the characters in the character blocks, wherein the information of the character blocks comprises the content of the character blocks and the position information of the character blocks.
In the embodiment of the application, the content of each text block can be obtained from the plurality of text blocks clustered in step S102 and the content and position information of the characters in each text block. Because the characters within a single text block are arranged in a relatively regular way, their contents can be arranged from left to right and from top to bottom to obtain the content of the text block; that is, of two characters at different vertical positions the upper one comes first, and of two characters at the same vertical position the left one comes first.
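The left-to-right, top-to-bottom rule can be sketched as follows. This is a minimal sketch under stated assumptions: characters are given as hypothetical `(text, l, r, u, b)` tuples, and the `line_tol` parameter, which decides when two characters count as being at "the same vertical position", is an addition that the source does not specify.

```python
def block_content(chars, line_tol=0.0):
    """Return the content of one text block.

    chars: iterable of (text, l, r, u, b) tuples for the characters in the block.
    line_tol: assumed tolerance; a character whose upper boundary is within
              line_tol of the first character of a line belongs to that line.
    """
    # Top to bottom: sort by upper boundary u.
    ordered = sorted(chars, key=lambda c: c[3])
    lines, current = [], []
    for c in ordered:
        # Start a new line when the character sits lower than the current line.
        if current and c[3] - current[0][3] > line_tol:
            lines.append(current)
            current = []
        current.append(c)
    if current:
        lines.append(current)
    # Left to right within each line: sort by left boundary l.
    return "".join(ch[0] for line in lines for ch in sorted(line, key=lambda c: c[1]))


chars = [("b", 10, 20, 0, 10), ("a", 0, 10, 0, 10),
         ("d", 10, 20, 12, 22), ("c", 0, 10, 12, 22)]
content = block_content(chars)
```

With real extraction output, characters on one line rarely share exactly the same upper boundary, so a non-zero `line_tol` (for example, a fraction of the character height) would likely be needed.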
The position information of the text block may specifically include position information of a rectangular area (denoted as a first rectangular area) corresponding to the text block. The position information of the first rectangular area may be expressed as (L, R, U, B), where L is a left boundary of the first rectangular area, i.e., an abscissa minimum value of the first rectangular area, R is a right boundary of the first rectangular area, i.e., an abscissa maximum value of the first rectangular area, U is an upper boundary of the first rectangular area, i.e., an ordinate minimum value of the first rectangular area, and B is a lower boundary of the first rectangular area, i.e., an ordinate maximum value of the first rectangular area.
The position information of each text block can be obtained from the plurality of text blocks clustered in step S102 and the position information of the characters in each text block. For example, the minimum of the left boundaries l of the characters in the text block may be determined as L, i.e., L = min(li); the maximum of the right boundaries r may be determined as R, i.e., R = max(ri); the minimum of the upper boundaries u may be determined as U, i.e., U = min(ui); and the maximum of the lower boundaries b may be determined as B, i.e., B = max(bi), where i ranges over all individual characters in the text block. That is, the leftmost of the left boundaries of the characters in the text block is determined as the left boundary of the text block, the rightmost of the right boundaries as its right boundary, the uppermost of the upper boundaries as its upper boundary, and the lowermost of the lower boundaries as its lower boundary.
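The min/max rule for the first rectangular area can be written directly. The function name `block_bbox` and the tuple layout `(l, r, u, b)` are assumptions made for this sketch.

```python
def block_bbox(char_boxes):
    """Bounding rectangle (L, R, U, B) of a text block.

    char_boxes: non-empty list of (l, r, u, b) character rectangles.
    """
    ls, rs, us, bs = zip(*char_boxes)
    # L = min(l_i), R = max(r_i), U = min(u_i), B = max(b_i)
    return (min(ls), max(rs), min(us), max(bs))


bbox = block_bbox([(0, 10, 5, 15), (12, 20, 4, 14)])
```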
S104, sorting the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result.
In the embodiment of the application, the sequencing result is the sequence of a plurality of text blocks.
As a possible implementation manner, a preset sorting manner may be adopted, and the plurality of text blocks obtained by clustering may be sorted only according to the content of the text blocks determined in step S103, so as to obtain a sorting result.
As another possible implementation manner, a preset sorting manner may be adopted, and the plurality of text blocks obtained by clustering are sorted only according to the position information of the text blocks determined in step S103, so as to obtain a sorting result.
As another possible implementation manner, a preset sorting manner may be adopted, and the plurality of text blocks obtained by clustering are sorted according to the content of the text blocks and the position information of the text blocks determined in step S103, so as to obtain a sorting result.
The preset sorting manner may specifically include, but is not limited to, at least one of the following: a preset sorting model and a preset sorting rule.
S105, generating a document analysis result according to the content of the text blocks and the sequencing result.
In the embodiment of the application, the document analysis result is the document content obtained by analyzing the document. And according to the sequencing result of the plurality of text blocks determined in the step S104, the contents of the single text block determined in the step S103 are sequenced to obtain a document analysis result corresponding to the document.
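Step S105 amounts to concatenating the block contents in the order given by the sorting result. The sketch below assumes that the sorting result is a list of block indices and that blocks are joined with a newline; the source specifies neither detail, and the names are hypothetical.

```python
def generate_result(block_contents, order, sep="\n"):
    """Concatenate text-block contents following the sorting result.

    block_contents: list of strings, one per text block (from S103).
    order: list of block indices in reading order (from S104).
    sep: assumed separator placed between consecutive blocks.
    """
    return sep.join(block_contents[i] for i in order)


contents = ["right column text", "left column text", "title"]
order = [2, 1, 0]  # hypothetical sorting result
result = generate_result(contents, order)
```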
It should be noted that, the execution subjects of each step of the method provided by the embodiment of the present application may be the same device, or the method may also be executed by different devices. For example, the execution subject of step S101 and step S102 may be the device 1, and the execution subject of step S103 may be the device 2; for another example, the execution subject of step S101 may be the device 1, and the execution subjects of step S102 and step S103 may be the device 2; etc.
In summary, when a document is parsed, characters are first extracted from the document to be parsed to obtain character information, which includes the content and the position information of the characters. The characters are then clustered according to their position information to obtain a plurality of text blocks, and the information of each text block, which includes the content and the position information of the text block, is determined from the information of the characters in it. The text blocks are sorted according to their content and/or position information to obtain a sorting result, and a document parsing result is generated according to the content of the text blocks and the sorting result. Because the characters are clustered into text blocks according to the position information of the characters, and the text blocks are then sorted according to their content and/or position information, rather than the characters being ordered row by row, the characters can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the parsed content and the accuracy of question-answering and summarization results obtained from it.
Fig. 2 is a flow chart of a document parsing method according to another embodiment of the present application. As shown in fig. 2, on the basis of the embodiment shown in fig. 1, the document parsing method according to the embodiment of the present application specifically includes the following steps:
S201, extracting characters from the document to be analyzed to obtain character information in the document, wherein the character information comprises character content and character position information.
In the embodiment of the present application, the step S201 is the same as the step S101 in the above embodiment, and will not be repeated here.
Step S102 "clustering the text according to the text position information to obtain a plurality of text blocks" in the above embodiment may specifically include the following steps S202 and S203.
S202, calculating the center coordinates of the second rectangular area according to the position information of the second rectangular area corresponding to the characters.
In the embodiment of the application, the center coordinates of each character, i.e., the center coordinates ((l+r)/2, (u+b)/2) of the second rectangular area, are calculated from the position information (l, r, u, b) of the second rectangular area corresponding to the character.
S203, clustering the characters according to the center coordinates of the second rectangular area to obtain a plurality of character blocks.
In the embodiment of the application, the characters are clustered on the basis of the center coordinates ((l+r)/2, (u+b)/2) determined in step S202 to obtain a plurality of character blocks.
The characters may be clustered using any commonly used clustering algorithm, such as the k-means clustering algorithm. K-means is an iterative clustering algorithm. The center coordinates of the characters are first calculated to obtain a series of coordinate points (x1, y1), (x2, y2), ..., (xn, yn), and the algorithm then proceeds as follows: 1) randomly determine k initial points as centroids (a centroid is the center of a cluster and need not be one of the coordinate points); 2) for each coordinate point, find the nearest centroid by distance and assign the point to that centroid's class, then recalculate the centroid of each class as the average of the coordinates of all points in the class; 3) repeat step 2) until the centroid coordinates are no longer updated.
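The three k-means steps can be sketched in plain Python as follows. This is an illustrative sketch, not the application's implementation. Several details are assumptions: initial centroids are sampled from the coordinate points themselves (one common choice), the optional `init` argument is an addition that makes a run reproducible, and an empty cluster simply keeps its previous centroid. The source does not say how k is chosen, so it is passed in by the caller.

```python
import random


def center(box):
    """Center of a character rectangle given as (l, r, u, b)."""
    l, r, u, b = box
    return ((l + r) / 2, (u + b) / 2)


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Cluster 2-D points; returns (labels, centroids)."""
    # Step 1: choose k initial centroids (sampled from the points unless given).
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2a: assign each point to its nearest centroid (squared distance).
        labels = [
            min(range(k), key=lambda j: (x - centroids[j][0]) ** 2 + (y - centroids[j][1]) ** 2)
            for x, y in points
        ]
        # Step 2b: recompute each centroid as the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
            else:
                updated.append(centroids[j])  # an empty cluster keeps its centroid
        # Step 3: stop once the centroids are no longer updated.
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Two columns of characters, each character 10 units wide, given as (l, r, u, b).
boxes = [(0, 10, 0, 10), (10, 20, 0, 10), (0, 10, 12, 22),
         (200, 210, 0, 10), (210, 220, 0, 10), (200, 210, 12, 22)]
points = [center(box) for box in boxes]
labels, centroids = kmeans(points, k=2, init=[points[0], points[3]])
```

K-means is sensitive to initialization: with two initial centroids drawn from the same column, a run can settle on a split by row instead of by column. In practice this is usually mitigated by running several random restarts and keeping the tightest clustering.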
S204, determining the information of the text blocks according to the information of the characters in the text blocks, wherein the information of a text block comprises the content of the text block and the position information of the text block.
S205, sorting the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result.
S206, generating a document analysis result according to the content of the text blocks and the sorting result.
In the embodiment of the present application, steps S204 to S206 are the same as steps S103 to S105 in the above embodiment, and will not be described here again.
In step S205, "sorting the plurality of text blocks according to the content of the text blocks to obtain a sorting result" may specifically include the following step: sorting the plurality of text blocks with a sorting model according to the content of the text blocks to obtain a sorting result.
Taking the case where the sorting model is an "upper and lower sentence judgment model" as an example, as shown in fig. 3, the step of sorting the plurality of text blocks with the sorting model according to the content of the text blocks to obtain a sorting result may specifically include the following steps:
S301, if no sorted text block exists among the plurality of text blocks, determining the uppermost text block as the text block in the current first order.
In the embodiment of the application, the text block in the current first order is determined repeatedly until the order of all text blocks is obtained. The specific process is as follows:
If no sorted text block exists among the plurality of text blocks, that is, none of the text blocks has been sorted yet, the uppermost text block among the plurality of text blocks is determined as the text block in the current first order, that is, the uppermost text block is placed first.
S302, inputting the contents of two text blocks into the upper and lower sentence judgment model to obtain a judgment result of whether the contents of the two text blocks are continuous and a score corresponding to the judgment result; of the two text blocks, the one serving as the upper sentence is the last sorted text block, and the one serving as the lower sentence is any unsorted text block.
In the embodiment of the application, the input and output of the upper and lower sentence judgment model are as follows: to judge whether a given text segment A and text segment B are continuous content, the input is "text segment A [sep] text segment B [sep]", and the output is a continuity score in the interval (0, 1) together with a "yes/no" judgment result.
The last sorted text block is the text block most recently placed in the order, that is, the last text block among the sorted text blocks. The score corresponding to the judgment result, that is, the continuity score of the contents of the two text blocks, lies in the interval (0, 1). The judgment result is obtained by comparing the score with a threshold, for example 0.5: if the score is higher than 0.5, the judgment result is yes, and if the score is equal to or lower than 0.5, the judgment result is no.
The content of the last sorted text block is taken as the upper sentence and the content of any unsorted text block is taken as the lower sentence, and both are input into the upper and lower sentence judgment model. The model judges whether the input upper and lower sentences are continuous, and outputs their continuity score and the corresponding judgment result. The embedding layer in the upper and lower sentence judgment model generates an input vector according to the contents of the two text blocks, the neural network layer in the model generates an output vector according to the input vector, and the output layer in the model generates the judgment result and the score according to the output vector.
S303, determining the unsorted text block with the highest score as the text block in the current first order.
In the embodiment of the present application, among the unsorted text blocks scored in step S302, the text block with the highest score is determined as the text block in the current first order.
That is, after the content of the uppermost text block is selected as the first sentence (serving as the upper sentence), its relation to each of the remaining sentences (that is, the contents of the remaining text blocks) is calculated in turn, and the sentence with the highest score is selected as its lower sentence (that is, the text block with the highest score is selected as the text block in the current first order). This is repeated until all sentences (that is, the contents of all text blocks) have been ordered.
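Steps S301 to S303 amount to a greedy loop around a pairwise scoring function, which can be sketched as below. This is an illustrative sketch only: the judgment model itself is not implemented, and `score_fn` is a hypothetical stand-in that takes the upper and lower sentences and returns a continuity score in (0, 1). The dictionary keys `content` and `U` and the helper names are likewise assumptions made for the example.

```python
def pair_input(upper, lower):
    """Model input in the form "text segment A [sep] text segment B [sep]"."""
    return f"{upper}[sep]{lower}[sep]"

def judge(score, threshold=0.5):
    """Yes/no judgment from a continuity score, using the example threshold 0.5."""
    return "yes" if score > threshold else "no"

def order_by_continuity(blocks, score_fn):
    """Greedy ordering of text blocks.

    blocks:   list of dicts with keys "content" and "U" (top ordinate).
    score_fn: hypothetical callable (upper, lower) -> continuity score.
    """
    if not blocks:
        return []
    remaining = list(blocks)
    # S301: nothing is sorted yet, so the uppermost block comes first.
    first = min(remaining, key=lambda b: b["U"])
    ordered = [first]
    remaining.remove(first)
    while remaining:
        last = ordered[-1]
        # S302: score every unsorted block as the lower sentence of `last`.
        # S303: the highest-scoring block becomes the next block in the order.
        best = max(remaining,
                   key=lambda b: score_fn(last["content"], b["content"]))
        ordered.append(best)
        remaining.remove(best)
    return ordered
```

Each round scores all remaining blocks against the last sorted one, so n text blocks require on the order of n² model calls.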
Taking the case where the sorting model is a text sorting model as an example, the step of sorting the plurality of text blocks with the sorting model according to the content of the text blocks to obtain a sorting result may specifically include the following step: inputting the contents of the plurality of text blocks into the text sorting model to obtain the sorting result.
In the embodiment of the application, the input and output of the text sorting model are as follows: for a given plurality of text segments A1, A2, A3, the order is output by the text sorting model. The input is a sorting task description sentence, "please sort the following text segments and output them: # text segment A1 # text segment A2 # text segment A3 ...", and the output is the text segments in the correct order, "# text segment A2 # text segment A1 # text segment A3".
The contents of the plurality of text blocks are input into the text sorting model; the model analyzes and sorts the contents of the plurality of text blocks, that is, sorts the plurality of text blocks, and outputs the sorting result of the plurality of text blocks. The embedding layer in the text sorting model generates an input vector according to the contents of the plurality of text blocks, the neural network layer in the text sorting model generates an output vector according to the input vector, and the output layer in the text sorting model generates the sorting result according to the output vector.
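The input and output format of the text sorting model can be illustrated with two small helpers, one that builds the task description sentence and one that maps the model output back to block indices. This is a sketch under stated assumptions: the exact wording of the instruction, the helper names `build_prompt` and `parse_order`, and the requirement that no segment contains the "#" separator are all introduced here for illustration. The model call itself is not shown.

```python
def build_prompt(segments):
    """Task description sentence followed by '#'-prefixed segments."""
    # The instruction wording is an assumed English rendering.
    return ("Please sort the following text segments and output them: "
            + "".join(f"#{s}" for s in segments))

def parse_order(output, segments):
    """Map a model output such as '#A2#A1#A3' to indices into `segments`.

    Assumes segments are distinct and contain no '#'; raises ValueError
    if the output contains a segment that was not in the input.
    """
    parts = [p.strip() for p in output.split("#") if p.strip()]
    return [segments.index(p) for p in parts]
```

For the example above, an output of "#A2#A1#A3" for input segments A1, A2, A3 corresponds to the index order 1, 0, 2.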
Further, in step S205, "sorting the plurality of text blocks according to the position information of the text blocks to obtain a sorting result" may specifically include the following step: sorting the plurality of text blocks according to a preset sorting rule and the position information of the text blocks to obtain a sorting result.
The preset sorting rule is as follows: for text blocks at different vertical positions, the text block positioned higher is ranked earlier; for text blocks at the same vertical position, the text block positioned further left is ranked earlier. That is, the text blocks are sorted according to the rule that blocks further up and further left come first.
In some embodiments, as shown in fig. 4, the step of "sorting a plurality of text blocks according to a preset sorting rule to obtain a sorting result according to the location information of the text blocks" may specifically include the following steps:
S401, determining the text block with the smallest L among the unsorted text blocks as the leftmost text block.
In the embodiment of the present application, the text block with the smallest L (that is, min(Lj)) among the unsorted text blocks is determined as the leftmost text block min(Lj), where j ranges over all unsorted text blocks.
Here, if there are a plurality of leftmost text blocks min(Lj), the text block with the smallest U among them is selected as the final leftmost text block, which ensures that the leftmost text block is unique.
S402, determining the text block with the smallest U among the unsorted text blocks as the uppermost text block.
In the embodiment of the present application, the text block with the smallest U (that is, min(Uj)) among the unsorted text blocks is determined as the uppermost text block min(Uj).
Here, if there are a plurality of uppermost text blocks min(Uj), the text block with the smallest L among them is selected as the final uppermost text block, which ensures that the uppermost text block is unique.
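The candidate selection of steps S401 and S402, including the tie-breaks, reduces to two minimum searches with compound keys. The sketch below is illustrative only; the `Block` type, its field names and the sample coordinates are assumptions made for the example.

```python
from typing import NamedTuple

class Block(NamedTuple):
    name: str
    L: float  # minimum abscissa
    R: float  # maximum abscissa
    U: float  # minimum ordinate (top edge)
    B: float  # maximum ordinate (bottom edge)

def leftmost(unsorted_blocks):
    """S401: smallest L; ties are broken by the smallest U."""
    return min(unsorted_blocks, key=lambda b: (b.L, b.U))

def uppermost(unsorted_blocks):
    """S402: smallest U; ties are broken by the smallest L."""
    return min(unsorted_blocks, key=lambda b: (b.U, b.L))

# Hypothetical blocks: "a" and "b" share the smallest L, "c" is highest.
blocks = [
    Block("a", 10, 50, 40, 60),
    Block("b", 10, 50, 70, 90),
    Block("c", 30, 70, 5, 25),
]
k = leftmost(blocks)   # block "a": L ties with "b", but "a" has the smaller U
t = uppermost(blocks)  # block "c"
```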
S403, determining the text block in the current first order from among the leftmost text block and the uppermost text block.
In the embodiment of the application, one of the leftmost text block min(Lj) and the uppermost text block min(Uj) is determined as the text block in the current first order.
Specifically, as shown in fig. 5, the step S403 "determining the text block in the current first order from among the leftmost text block and the uppermost text block" may specifically include the following steps:
S501, if the leftmost text block and the uppermost text block are the same text block, determining that text block as the text block in the current first order.
In the embodiment of the application, if the leftmost text block min(Lj) and the uppermost text block min(Uj) are the same text block k, the text block k is determined as the text block in the current first order.
S502, if the leftmost text block and the uppermost text block are different text blocks, determining the text block in the current first order from among the leftmost text block and the uppermost text block based on the position information of the last sorted text block.
In the embodiment of the application, the last sorted text block is the final text block among all sorted text blocks, that is, the text block most recently determined as the text block in the current first order. The position information of the last sorted text block may be expressed as (Lold, Rold, Uold, Bold). Each time a text block in the current first order is determined, the position information of the last sorted text block is updated to the position information of that text block. For example, if the text block in the current first order is text block k, then (Lold, Rold, Uold, Bold) = (Lk, Rk, Uk, Bk).
If the leftmost text block min(Lj) and the uppermost text block min(Uj) are different text blocks, one of them is determined as the text block in the current first order according to the position information of the last sorted text block. The specific process is shown in fig. 6 and comprises the following steps:
S601, if the U of the uppermost text block is equal to or greater than the B of the last sorted text block, determining the uppermost text block as the text block in the current first order.
In the embodiment of the application, assume that the leftmost text block min(Lj) is text block k and the uppermost text block min(Uj) is text block t. If the U of the uppermost text block t is equal to or greater than the B of the last sorted text block, that is, Ut ≥ Bold, the uppermost text block t is determined as the text block in the current first order.
S602, if the U of the uppermost text block is smaller than the B of the last sorted text block and the U of the leftmost text block is smaller than the B of the uppermost text block, determining the leftmost text block as the text block in the current first order.
In the embodiment of the application, if the U of the uppermost text block t is smaller than the B of the last sorted text block and the U of the leftmost text block k is smaller than the B of the uppermost text block t, that is, Ut < Bold and Uk < Bt, the leftmost text block k is determined as the text block in the current first order.
S603, if the U of the uppermost text block is smaller than the B of the last sorted text block and the U of the leftmost text block is equal to or greater than the B of the uppermost text block, determining the uppermost text block as the text block in the current first order.
In the embodiment of the application, if the U of the uppermost text block t is smaller than the B of the last sorted text block and the U of the leftmost text block k is equal to or greater than the B of the uppermost text block t, that is, Ut < Bold and Uk ≥ Bt, the uppermost text block t is determined as the text block in the current first order.
S604, if no sorted text block exists among the plurality of text blocks, determining the uppermost text block as the text block in the current first order.
In the embodiment of the application, if no sorted text block exists among the plurality of text blocks, that is, there is no last sorted text block, the uppermost text block t is determined as the text block in the current first order.
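The conditions of steps S501 and S601 to S604 can be transcribed literally into a single decision function, as sketched below. The sketch only restates the conditions as worded in those steps; the `Block` type, the function name `pick_next` and the use of `None` for "no sorted text block exists" are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class Block(NamedTuple):
    name: str
    L: float
    R: float
    U: float
    B: float

def pick_next(k: Block, t: Block, last: Optional[Block]) -> Block:
    """Choose the text block in the current first order.

    k:    leftmost unsorted block (S401)
    t:    uppermost unsorted block (S402)
    last: last sorted block, or None if nothing has been sorted yet
    """
    if k == t:
        return t  # S501: both candidates are the same block
    if last is None:
        return t  # S604: no sorted block exists
    if t.U >= last.B:
        return t  # S601: Ut >= Bold
    if k.U < t.B:
        return k  # S602: Ut < Bold and Uk < Bt
    return t      # S603: Ut < Bold and Uk >= Bt
```

After each call, the returned block is removed from the unsorted set and becomes the new last sorted block, so that its (L, R, U, B) plays the role of (Lold, Rold, Uold, Bold) in the next round.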
In order to clearly illustrate the above process of "sorting a plurality of text blocks according to a preset sorting rule according to the location information of the text blocks", the following is exemplified:
Assume that the position information of the plurality of text blocks, in the form [L, R, U, B], is as follows: text block 1 - [20,60,50,110], text block 2 - [40,80,10,30], text block 3 - [70,110,40,60], text block 4 - [70,110,70,90], text block 5 - [70,110,100,130].
a. min(Lj) = 20 corresponds to text block 1 with position information [20,60,50,110], and min(Uj) = 10 corresponds to text block 2 with position information [40,80,10,30]. These are two different text blocks and no sorted text block exists at this time, so text block 2, corresponding to min(Uj), is selected as the text block in the current first order.
b. min(Lj) = 20 corresponds to text block 1 with position information [20,60,50,110], and min(Uj) = 40 corresponds to text block 3 with position information [70,110,40,60]. These are two different text blocks. At this time, Ut = 40 < 30 = Bold is not satisfied, and Uk = 50 < 60 = Bt, so text block k, that is, text block 1 corresponding to min(Lj), is selected as the text block in the current first order.
c. min(Lj) = 70 corresponds to three text blocks: text block 3 with position information [70,110,40,60], text block 4 with position information [70,110,70,90] and text block 5 with position information [70,110,100,130]. Text block 3, which has the smallest U, is selected as the final text block k. min(Uj) = 40 corresponds to text block 3 with position information [70,110,40,60]. The two are the same text block, so text block 3 is determined as the text block in the current first order.
d. The order of the remaining two text blocks is judged in the same way as in step c, giving text block 4 followed by text block 5.
e. The final sorting result is: text block 2 - [40,80,10,30], text block 1 - [20,60,50,110], text block 3 - [70,110,40,60], text block 4 - [70,110,70,90], text block 5 - [70,110,100,130].
In step S205, the plurality of text blocks may be sorted in a plurality of ways to obtain a plurality of sorting results. When there are a plurality of sorting results, as shown in fig. 7, the step S206 "generating a document analysis result according to the content of the text blocks and the sorting result" may specifically include the following steps:
S701, selecting one of the plurality of sorting results as a target sorting result in a preset selection manner.
In the embodiment of the application, the selection manner used to choose among a plurality of sorting results may be preset. For a plurality of sorting results, the preset selection manner is used to select one of them as the final sorting result, that is, the target sorting result.
As a possible implementation, one of the plurality of sorting results may be selected as the target sorting result based on a majority voting mechanism, in which the minority obeys the majority.
Specifically, under the voting mechanism, for example, if two sorting results place A before B and one sorting result places B before A, A is determined to be ordered before B according to the principle that the minority obeys the majority. In this way, the parts on which the sorting results disagree are resolved one by one, and the target sorting result is thereby determined.
As a possible implementation, one of the plurality of sorting results may be selected as the target sorting result based on a confidence mechanism, in which a higher confidence indicates a more accurate sorting result.
Specifically, under the confidence mechanism, for example, if three sorting results have confidences of 0.3, 0.2 and 0.4 respectively, the sorting result with the confidence of 0.4 is determined as the target sorting result according to the principle that a higher confidence indicates a more accurate sorting result. The confidence may be set manually, or the confidence of each sorting result may be determined from the accuracy of the corresponding sorting method on an evaluation set.
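The two selection manners can be sketched as follows. The confidence-based selection follows the description directly. The voting function is a deliberate simplification: it votes on whole sorting results and returns the one produced most often, rather than resolving each disagreement pairwise as described above. The function names and the representation of a sorting result as a list of block identifiers are assumptions made for illustration.

```python
from collections import Counter

def select_by_confidence(results, confidences):
    """Return the sorting result whose confidence is highest."""
    best = max(range(len(results)), key=lambda i: confidences[i])
    return results[best]

def select_by_majority(results):
    """Simplified vote: return the sorting result that occurs most often.

    If several results are equally frequent, the one seen first wins.
    """
    counts = Counter(tuple(r) for r in results)
    winner, _ = counts.most_common(1)[0]
    return list(winner)

# Three hypothetical sorting results over text blocks 1 to 3.
candidates = [[2, 1, 3], [2, 3, 1], [2, 1, 3]]
by_vote = select_by_majority(candidates)
by_conf = select_by_confidence(candidates, [0.3, 0.2, 0.4])
```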
S702, generating a document analysis result according to the content of the text block and the target ordering result.
In the embodiment of the application, the contents of the plurality of text blocks are arranged in the order of the text blocks given by the target sorting result to obtain the document analysis result of the document to be analyzed.
In summary, when a document is analyzed in the embodiment of the application, text extraction is first performed on the document to be analyzed to obtain text information in the document, the text information including the content and the position information of the text; the text is then clustered according to its position information to obtain a plurality of text blocks; the information of each text block, including the content and the position information of the text block, is determined according to the information of the text in the text block; the plurality of text blocks are sorted according to the content and/or the position information of the text blocks to obtain a sorting result; and a document analysis result is generated according to the content of the text blocks and the sorting result. Because the text in the document is clustered into a plurality of text blocks according to the position information of the text, and the text blocks are sorted according to their content and/or position information, rather than being clustered and sorted row by row, the text can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the analyzed content and, in turn, the accuracy of question-answering and summarization results obtained from it. In addition, a plurality of sorting results are obtained with a plurality of sorting methods, and a target sorting result is selected from them in a preset selection manner, which reduces the bias and error of any single sorting method and improves the stability and accuracy of the final sorting result.
Sorting the plurality of text blocks with a sorting model, such as the upper and lower sentence judgment model or the text sorting model, allows the text blocks to be sorted accurately by relying on the learning capability of the model. By determining the leftmost text block and the uppermost text block among the unsorted text blocks and applying the multi-condition judgment logic, the text blocks are sorted according to the rule that blocks further up and further left come first, so that the text blocks can be sorted more accurately.
In order to clearly illustrate the specific process of the document parsing method according to the embodiment of the present application, a possible implementation of the method is described in detail below with reference to fig. 8. Fig. 8 is a schematic overall flow chart of the document parsing method according to an embodiment of the present application. As shown in fig. 8, the method may specifically include the following steps: obtaining a document to be parsed; obtaining the content and position information of each individual character according to the character coding information and the optical character recognition result corresponding to the document; clustering the characters according to their position information to obtain a plurality of text blocks; determining the content and position information of the text blocks; sorting the plurality of text blocks according to the content and/or position information of the text blocks with a plurality of sorting methods in a sorting module to obtain a plurality of sorting results; selecting one of the plurality of sorting results as a target sorting result with any one of the selection methods in a selection module; and generating a document parsing result based on the content of the text blocks and the target sorting result. The sorting module comprises a plurality of sorting methods, such as the preset sorting rule and the sorting models. The selection module comprises a plurality of selection methods, such as the majority voting mechanism and the confidence mechanism in which a higher confidence indicates a more accurate sorting result.
Fig. 9 is a schematic structural diagram of a document parsing apparatus according to an embodiment of the present application. As shown in fig. 9, the document parsing apparatus 900 according to the embodiment of the present application may specifically include: an extraction module 901, a clustering module 902, a determination module 903, a ranking module 904, and a generation module 905. Wherein:
The extraction module 901 is configured to perform text extraction on a document to be parsed to obtain text information in the document, where the text information includes text content and text position information.
The clustering module 902 is configured to cluster the text according to the position information of the text to obtain a plurality of text blocks.
The determining module 903 is configured to determine information of a text block according to information of text in the text block, where the information of the text block includes content of the text block and location information of the text block.
The sorting module 904 is configured to sort the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks, so as to obtain a sorting result.
The generating module 905 is configured to generate a document parsing result according to the content and the sorting result of the text blocks.
For the specific process by which each module and unit of the document parsing apparatus in the embodiment of the present application implements its function, reference may be made to the related description in the embodiment of the document parsing method, which is not repeated here.
In summary, when parsing a document, the document parsing apparatus according to the embodiment of the application first performs text extraction on the document to be parsed to obtain text information in the document, the text information including the content and the position information of the text; then clusters the text according to its position information to obtain a plurality of text blocks; determines the information of each text block, including the content and the position information of the text block, according to the information of the text in the text block; sorts the plurality of text blocks according to the content and/or the position information of the text blocks to obtain a sorting result; and generates a document parsing result according to the content of the text blocks and the sorting result. Because the text in the document is clustered into a plurality of text blocks according to the position information of the text, and the text blocks are sorted according to their content and/or position information, rather than being clustered and sorted row by row, the text can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the parsed content and, in turn, the accuracy of question-answering and summarization results obtained from it.
Fig. 10 is a schematic structural diagram of a document parsing apparatus according to another embodiment of the present application. As shown in fig. 10, on the basis of the embodiment shown in fig. 9, the sorting module 904 of the document parsing apparatus 900 may further include a first sorting unit 1001.
The first sorting unit 1001 is configured to sort the plurality of text blocks by using a sorting model according to the content of the text blocks, so as to obtain a sorting result.
In a possible implementation of the embodiment of the present application, the sorting model is an upper and lower sentence judgment model, and the first sorting unit 1001 is further configured to: input the contents of two text blocks into the upper and lower sentence judgment model to obtain a judgment result of whether the contents of the two text blocks are continuous and a score corresponding to the judgment result, where, of the two text blocks, the one serving as the upper sentence is the last sorted text block and the one serving as the lower sentence is any unsorted text block, the embedding layer in the upper and lower sentence judgment model generates an input vector according to the contents of the two text blocks, the neural network layer in the model generates an output vector according to the input vector, and the output layer in the model generates the judgment result and the score according to the output vector; determine the unsorted text block with the highest score as the text block in the current first order; and, if no sorted text block exists among the plurality of text blocks, determine the uppermost text block as the text block in the current first order.
In a possible implementation of the embodiment of the present application, the sorting model is a text sorting model, and the first sorting unit 1001 is further configured to: input the contents of the plurality of text blocks into the text sorting model to obtain a sorting result, where the embedding layer in the text sorting model generates an input vector according to the contents of the plurality of text blocks, the neural network layer in the text sorting model generates an output vector according to the input vector, and the output layer in the text sorting model generates the sorting result according to the output vector.
In one possible implementation of the embodiment of the present application, the sorting module 904 may further include: a second sorting unit 1002, configured to sort the plurality of text blocks according to a preset sorting rule according to the position information of the text blocks, so as to obtain a sorting result; the preset ordering rule comprises the following steps: for the text blocks with different longitudinal positions, the text blocks with upper positions are ranked forward, and for the text blocks with the same longitudinal positions, the text blocks with left positions are ranked forward.
In a possible implementation of the embodiment of the present application, the position information of a text block includes position information of a first rectangular area corresponding to the text block, where the position information of the first rectangular area includes an abscissa minimum value L, an abscissa maximum value R, an ordinate minimum value U, and an ordinate maximum value B of the first rectangular area; the second sorting unit 1002 is further configured to: determine the text block with the smallest L among the unsorted text blocks as the leftmost text block; determine the text block with the smallest U among the unsorted text blocks as the uppermost text block; and determine the text block in the current first order from among the leftmost text block and the uppermost text block.
In a possible implementation of the embodiment of the present application, the second sorting unit 1002 is further configured to: if the leftmost text block and the uppermost text block are the same text block, determine that text block as the text block in the current first order; and if the leftmost text block and the uppermost text block are different text blocks, determine the text block in the current first order from among the leftmost text block and the uppermost text block based on the position information of the last sorted text block.
In a possible implementation of the embodiment of the present application, the second sorting unit 1002 is further configured to: if the U of the uppermost text block is equal to or greater than the B of the last sorted text block, determine the uppermost text block as the text block in the current first order; if the U of the uppermost text block is smaller than the B of the last sorted text block and the U of the leftmost text block is smaller than the B of the uppermost text block, determine the leftmost text block as the text block in the current first order; and if the U of the uppermost text block is smaller than the B of the last sorted text block and the U of the leftmost text block is equal to or greater than the B of the uppermost text block, determine the uppermost text block as the text block in the current first order.
In a possible implementation of the embodiment of the present application, the second sorting unit 1002 is further configured to: if no sorted text block exists among the plurality of text blocks, determine the uppermost text block as the text block in the current first order.
In one possible implementation of the embodiment of the present application, there are a plurality of sorting results, and the generating module 905 is further configured to: select one of the plurality of sorting results as a target sorting result in a preset selection manner; and generate a document analysis result according to the content of the text blocks and the target sorting result.
In one possible implementation of the embodiment of the present application, the generating module 905 is further configured to: select one of the plurality of sorting results as the target sorting result based on a majority-rule voting mechanism in which the minority defers to the majority; or select one of the plurality of sorting results as the target sorting result based on a confidence mechanism in which a higher confidence indicates a more accurate sorting result.
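The two selection manners can be illustrated with a short sketch. It assumes, purely for illustration, that each sorting result is a sequence of text-block indices and that each sorting method reports one confidence value; the function names `select_by_vote` and `select_by_confidence` are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def select_by_vote(results: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Majority vote: return the ordering produced by the most methods."""
    counts = Counter(tuple(r) for r in results)
    return counts.most_common(1)[0][0]


def select_by_confidence(
    results: Sequence[Sequence[int]], confidences: Sequence[float]
) -> Tuple[int, ...]:
    """Return the ordering whose method reported the highest confidence."""
    best = max(range(len(results)), key=lambda i: confidences[i])
    return tuple(results[best])
```

How ties are broken is not stated in this section; in this sketch a tie falls to the ordering that appears first in `results`.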
In a possible implementation manner of the embodiment of the present application, the location information of the text includes location information of a second rectangular area corresponding to the text, and the clustering module 902 is further configured to: calculating the center coordinates of the second rectangular area according to the position information of the second rectangular area corresponding to the characters; and clustering the characters according to the center coordinates of the second rectangular area to obtain a plurality of character blocks.
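A minimal sketch of clustering by center coordinates is given below. This passage does not name a clustering algorithm, so the sketch substitutes a simple distance-threshold grouping (connected components via union-find); the threshold `max_dist` and the function names are assumptions made only for illustration.

```python
from math import hypot
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (l, r, u, b) of one character


def center(box: Box) -> Tuple[float, float]:
    """Center of the second rectangular area of a character."""
    l, r, u, b = box
    return ((l + r) / 2.0, (u + b) / 2.0)


def cluster_by_center(boxes: List[Box], max_dist: float) -> List[List[int]]:
    """Group character indices whose centers are chained within max_dist."""
    centers = [center(b) for b in boxes]
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            if hypot(dx, dy) <= max_dist:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=lambda g: g[0])
```

The pairwise comparison is quadratic in the number of characters, which is acceptable for a sketch but would likely need a spatial index for large pages.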
In one possible implementation of the embodiment of the present application, the extraction module 901 is further configured to: acquiring character information according to character coding information corresponding to the document; and/or acquiring text information according to the optical character recognition result of the document.
In a possible implementation manner of the embodiment of the present application, the position information of the text block includes position information of a first rectangular area corresponding to the text block, where the position information of the first rectangular area includes an abscissa minimum value L, an abscissa maximum value R, an ordinate minimum value U, and an ordinate maximum value B of the first rectangular area; the position information of the text includes position information of a second rectangular area corresponding to the text, where the position information of the second rectangular area includes an abscissa minimum value l, an abscissa maximum value r, an ordinate minimum value u, and an ordinate maximum value b of the second rectangular area; the determining module 903 is further configured to: arrange the contents of the text in the text block from left to right and from top to bottom according to the position information of the text in the text block, to obtain the content of the text block; and determine the minimum l among the text in the text block as the L of the text block, the maximum r among the text in the text block as the R of the text block, the minimum u among the text in the text block as the U of the text block, and the maximum b among the text in the text block as the B of the text block.
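The derivation of a text block's content and rectangle from its characters can be sketched as follows. The tuple layout and the name `block_info` are hypothetical, and the sketch assumes that characters on the same line share the same u value, so that sorting by (u, l) yields a top-to-bottom, left-to-right arrangement.

```python
from typing import List, Tuple

Char = Tuple[str, float, float, float, float]  # (content, l, r, u, b)
Rect = Tuple[float, float, float, float]       # (L, R, U, B)


def block_info(chars: List[Char]) -> Tuple[str, Rect]:
    """Return the content and the first rectangular area of a text block."""
    # Arrange characters top to bottom, then left to right within a line
    # (assumes characters on one line have an identical u value).
    ordered = sorted(chars, key=lambda c: (c[3], c[1]))
    content = "".join(c[0] for c in ordered)
    L = min(c[1] for c in chars)  # smallest l among the characters
    R = max(c[2] for c in chars)  # largest r among the characters
    U = min(c[3] for c in chars)  # smallest u among the characters
    B = max(c[4] for c in chars)  # largest b among the characters
    return content, (L, R, U, B)
```

Real extraction output often has small vertical jitter within a line, in which case the characters would first need to be grouped into lines with a tolerance before sorting.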
In the embodiment of the present application, specific processes of implementing functions of each module and unit in the document parsing apparatus in the embodiment of the present application may be referred to the related description in the embodiment of the document parsing method, which is not repeated herein.
In summary, when analyzing a document, the document analysis device according to the embodiment of the application first performs text extraction on the document to be analyzed to obtain information of the text in the document, the information of the text including the content of the text and the position information of the text. It then clusters the text according to the position information of the text to obtain a plurality of text blocks, and determines information of each text block according to the information of the text in that text block, the information of the text block including the content of the text block and the position information of the text block. Finally, it sorts the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result, and generates a document analysis result according to the content of the text blocks and the sorting result. Because the text is clustered according to its position information and the text blocks are sorted according to their content and/or position information, rather than being clustered and sorted row by row, the text can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the analyzed content and, in turn, the accuracy of question-answering and summarization results obtained from it. In addition, a plurality of sorting results are obtained using a plurality of sorting methods, and a target sorting result is selected from them in a preset selection manner, which reduces the bias and error of any single sorting method and improves the stability and accuracy of the final sorting result.
The plurality of text blocks may be sorted by sorting models such as an upper-and-lower sentence judgment model and a text sorting model, so that the sorting can be performed accurately based on the learning capability of the model. By determining the leftmost text block and the uppermost text block among the unordered text blocks and applying multiple judgment conditions, the text blocks are sorted according to the rule that an upper position takes priority over a left position, so that the text blocks can be sorted more accurately.
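The model-based ordering with an upper-and-lower sentence judgment model can be sketched as a greedy loop. The model itself is replaced here by a hypothetical callable `score(prev_text, candidate_text)` that returns a continuity score; the function name `order_with_model` and the `(content, U)` tuple layout are likewise assumptions, and a smaller U is assumed to mean a higher position.

```python
from typing import Callable, List, Tuple


def order_with_model(
    blocks: List[Tuple[str, float]],      # (content, U) of each text block
    score: Callable[[str, str], float],   # stand-in for the judgment model
) -> List[int]:
    """Greedily order text blocks by continuity with the last-ordered block."""
    if not blocks:
        return []
    remaining = list(range(len(blocks)))
    # No block is ordered yet: start from the block with the highest position.
    first = min(remaining, key=lambda i: blocks[i][1])
    order = [first]
    remaining.remove(first)
    while remaining:
        prev_text = blocks[order[-1]][0]
        # The unordered block with the highest score becomes the next block.
        nxt = max(remaining, key=lambda i: score(prev_text, blocks[i][0]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Because the scorer is passed in as a parameter, the same loop can be exercised with a toy function before a trained model is available.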
Based on the same technical concept, the embodiment of the application also provides a document parsing apparatus for executing the above document parsing method, as shown in fig. 11.
The document parsing apparatus may vary considerably depending on its configuration or performance, and may include one or more processors 1101 and a memory 1102, in which one or more application programs or data may be stored. The memory 1102 may be transient storage or persistent storage. An application program stored in the memory 1102 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the document parsing apparatus. Further, the processor 1101 may be configured to communicate with the memory 1102 and to execute, on the document parsing apparatus, the series of computer-executable instructions in the memory 1102. The document parsing apparatus may also include one or more power supplies 1103, one or more wired or wireless network interfaces 1104, one or more input/output interfaces 1105, one or more keyboards 1106, and the like.
In a particular embodiment, a document parsing apparatus includes a memory, and one or more programs, where the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the document parsing apparatus, and configured to be executed by one or more processors, the one or more programs including computer-executable instructions for:
extracting characters from a document to be analyzed to obtain character information in the document, wherein the character information comprises character content and character position information;
clustering the characters according to the position information of the characters to obtain a plurality of character blocks;
determining information of a character block according to the information of characters in the character block, wherein the information of the character block comprises the content of the character block and the position information of the character block;
according to the content of the text blocks and/or the position information of the text blocks, sequencing a plurality of text blocks to obtain sequencing results;
and generating a document analysis result according to the content of the text blocks and the sequencing result.
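The five steps listed above can be read as a simple pipeline. The sketch below is one hedged way to wire them together: the stage functions `cluster`, `describe` and `order` are hypothetical parameters supplied by the caller, the extraction step is assumed to have already produced `chars`, and joining block contents with newlines is only one assumed form of the document analysis result.

```python
from typing import Callable, List, Sequence, Tuple

Char = Tuple[str, float, float, float, float]            # (content, l, r, u, b)
BlockInfo = Tuple[str, Tuple[float, float, float, float]]  # (content, (L, R, U, B))


def parse_document(
    chars: Sequence[Char],
    cluster: Callable[[Sequence[Char]], List[List[Char]]],
    describe: Callable[[List[Char]], BlockInfo],
    order: Callable[[List[BlockInfo]], List[int]],
) -> str:
    """Run clustering, block description, sorting and result generation."""
    groups = cluster(chars)                  # characters -> text blocks
    blocks = [describe(g) for g in groups]   # content and position per block
    ranking = order(blocks)                  # sorting result as block indices
    # Generate the result from block contents in the sorted order.
    return "\n".join(blocks[i][0] for i in ranking)
```

Passing the stages in as callables keeps the flow independent of any particular clustering or sorting method.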
It should be noted that the embodiment of the document parsing apparatus in the present application and the embodiment of the document parsing method in the present application are based on the same inventive concept, so for the specific implementation of this embodiment, reference may be made to the implementation of the corresponding document parsing method; repeated details are omitted here.
With the document parsing apparatus according to the embodiment of the application, when a document is analyzed, text extraction is first performed on the document to be analyzed to obtain information of the text in the document, the information of the text including the content of the text and the position information of the text. The text is then clustered according to the position information of the text to obtain a plurality of text blocks, and information of each text block is determined according to the information of the text in that text block, the information of the text block including the content of the text block and the position information of the text block. The plurality of text blocks are sorted according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result, and a document analysis result is generated according to the content of the text blocks and the sorting result. Because the text is clustered according to its position information and the text blocks are sorted according to their content and/or position information, rather than being clustered and sorted row by row, the text can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the analyzed content and, in turn, the accuracy of question-answering and summarization results obtained from it.
Based on the same technical concept, the embodiment of the present application further provides a storage medium for storing computer-executable instructions. In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer-executable instructions stored in the storage medium, when executed by a processor, can implement the following flow:
extracting characters from a document to be analyzed to obtain character information in the document, wherein the character information comprises character content and character position information;
clustering the characters according to the position information of the characters to obtain a plurality of character blocks;
determining information of a character block according to the information of characters in the character block, wherein the information of the character block comprises the content of the character block and the position information of the character block;
according to the content of the text blocks and/or the position information of the text blocks, sequencing a plurality of text blocks to obtain sequencing results;
and generating a document analysis result according to the content of the text blocks and the sequencing result.
It should be noted that the embodiment of the storage medium in the present application and the embodiment of the document parsing method in the present application are based on the same inventive concept, so for the specific implementation of this embodiment, reference may be made to the implementation of the corresponding document parsing method; repeated details are omitted here.
When the computer-executable instructions stored in the storage medium in the embodiment of the application are executed by a processor to analyze a document, text extraction is first performed on the document to be analyzed to obtain information of the text in the document, the information of the text including the content of the text and the position information of the text. The text is then clustered according to the position information of the text to obtain a plurality of text blocks, and information of each text block is determined according to the information of the text in that text block, the information of the text block including the content of the text block and the position information of the text block. The plurality of text blocks are sorted according to the content of the text blocks and/or the position information of the text blocks to obtain a sorting result, and a document analysis result is generated according to the content of the text blocks and the sorting result. Because the text is clustered according to its position information and the text blocks are sorted according to their content and/or position information, rather than being clustered and sorted row by row, the text can be ordered accurately even for content laid out in columns or blocks. This improves the quality of the analyzed content and, in turn, the accuracy of question-answering and summarization results obtained from it.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained merely by performing a little logic programming of the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of the units may be implemented in the same software and/or hardware when implementing the application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on a computer-usable storage medium (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes a processor (CPU), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present application are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the corresponding description of the method embodiments.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (17)

1. A document parsing method, comprising:
extracting characters from a document to be analyzed to obtain character information in the document, wherein the character information comprises the content of the characters and the position information of the characters;
clustering the characters according to the position information of the characters to obtain a plurality of character blocks;
determining the information of the text block according to the information of the text in the text block, wherein the information of the text block comprises the content of the text block and the position information of the text block;
according to the content of the text blocks and/or the position information of the text blocks, sorting the plurality of text blocks to obtain a sorting result;
and generating a document analysis result according to the content of the text block and the sequencing result.
2. The document parsing method according to claim 1, wherein sorting the plurality of text blocks according to the content of the text blocks to obtain the sorting result includes:
according to the content of the text blocks, sorting the plurality of text blocks by adopting a sorting model to obtain the sorting result.
3. The document parsing method according to claim 2, wherein the sorting model is an upper-and-lower sentence judgment model, and sorting the plurality of text blocks by adopting the sorting model according to the content of the text blocks to obtain the sorting result includes:
inputting the contents of two text blocks into the upper-and-lower sentence judgment model to obtain a judgment result of whether the contents of the two text blocks are continuous and a score corresponding to the judgment result; wherein the text block serving as the upper sentence is the last-ordered text block, the text block serving as the lower sentence is any unordered text block, an embedding layer in the upper-and-lower sentence judgment model generates an input vector according to the contents of the two text blocks, a neural network layer in the upper-and-lower sentence judgment model generates an output vector according to the input vector, and an output layer in the upper-and-lower sentence judgment model generates the judgment result and the score according to the output vector;
determining the unordered text block with the highest score as the current first-order text block;
and if no text block among the plurality of text blocks has been ordered yet, determining the text block with the highest position as the current first-order text block.
4. The document parsing method according to claim 2, wherein the sorting model is a text sorting model, and sorting the plurality of text blocks by adopting the sorting model according to the content of the text blocks to obtain the sorting result includes:
inputting the contents of the plurality of text blocks into the text sorting model to obtain the sorting result; wherein an embedding layer in the text sorting model generates an input vector according to the contents of the plurality of text blocks, a neural network layer in the text sorting model generates an output vector according to the input vector, and an output layer in the text sorting model generates the sorting result according to the output vector.
5. The document parsing method according to claim 1, wherein sorting the plurality of text blocks according to the position information of the text blocks, to obtain the sorting result, comprises:
according to the position information of the text blocks, sorting the plurality of text blocks according to a preset sorting rule to obtain the sorting result; the preset sorting rule comprises: for text blocks at different vertical positions, the text block positioned higher is ranked earlier, and for text blocks at the same vertical position, the text block positioned further to the left is ranked earlier.
6. The document parsing method according to claim 5, wherein the position information of the text block includes position information of a first rectangular area corresponding to the text block, the position information of the first rectangular area including an abscissa minimum value L, an abscissa maximum value R, an ordinate minimum value U, and an ordinate maximum value B of the first rectangular area;
the step of sorting the plurality of text blocks according to the position information of the text blocks and a preset sorting rule to obtain the sorting result comprises the following steps:
determining the text block with the smallest L among the unordered text blocks as the leftmost text block;
determining the text block with the smallest U among the unordered text blocks as the uppermost text block;
and determining the current first-order text block from among the leftmost text block and the uppermost text block.
7. The document parsing method according to claim 6, wherein determining a current first-order text block among the leftmost text block and the uppermost text block includes:
if the leftmost text block and the uppermost text block are the same text block, determining the uppermost text block as the current first-order text block;
and if the leftmost text block and the uppermost text block are different text blocks, determining the current first-order text block from among the leftmost text block and the uppermost text block based on the position information of the last-ordered text block.
8. The document parsing method according to claim 7, wherein determining a current first-order text block among the leftmost text block and the uppermost text block based on the position information of the last-ordered text block includes:
if the U of the uppermost text block is equal to or greater than the B of the last-ordered text block, determining the uppermost text block as the current first-order text block;
if the U of the uppermost text block is smaller than the B of the last-ordered text block and the U of the leftmost text block is smaller than the B of the uppermost text block, determining the leftmost text block as the current first-order text block;
and if the U of the uppermost text block is smaller than the B of the last-ordered text block and the U of the leftmost text block is equal to or greater than the B of the uppermost text block, determining the uppermost text block as the current first-order text block.
9. The document parsing method according to claim 7, wherein determining, from the leftmost text block and the uppermost text block, the text block that is currently first in order further comprises:
if no ordered text block exists among the plurality of text blocks, determining the uppermost text block as the text block that is currently first in order.
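The selection rule of claims 6 to 9 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Block` class, the function names `pick_next` and `order_blocks`, the repeated-selection loop, and the tie-breaking behaviour (the first block found with the minimum value wins) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical text block with its first rectangular area (L, R, U, B)."""
    name: str
    L: float  # minimum abscissa
    R: float  # maximum abscissa
    U: float  # minimum ordinate (top edge)
    B: float  # maximum ordinate (bottom edge)


def pick_next(unordered: List[Block], last: Optional[Block]) -> Block:
    """Pick the block that is currently first in order (claims 6 to 9)."""
    leftmost = min(unordered, key=lambda blk: blk.L)   # smallest L
    uppermost = min(unordered, key=lambda blk: blk.U)  # smallest U
    if leftmost is uppermost:
        return uppermost          # claim 7: same block
    if last is None:
        return uppermost          # claim 9: nothing ordered yet
    if uppermost.U >= last.B:
        return uppermost          # claim 8, first condition
    if leftmost.U < uppermost.B:
        return leftmost           # claim 8, second condition
    return uppermost              # claim 8, third condition


def order_blocks(blocks: List[Block]) -> List[Block]:
    """Assumed outer loop: repeatedly move the picked block to the ordered list."""
    remaining = list(blocks)
    ordered: List[Block] = []
    while remaining:
        nxt = pick_next(remaining, ordered[-1] if ordered else None)
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered
```

For a two-column page with blocks A and B stacked in the left column and block C spanning the right column, this sketch reads the left column first (A, B) and then C, because C starts above the bottom of A while B starts above the bottom of C.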
10. The document parsing method according to claim 1, wherein there are a plurality of ordering results, and generating the document parsing result according to the content of the text blocks and the ordering result comprises:
selecting one ordering result from the plurality of ordering results as a target ordering result using a preset selection mode;
and generating the document parsing result according to the content of the text blocks and the target ordering result.
11. The document parsing method according to claim 10, wherein selecting one ordering result from the plurality of ordering results as the target ordering result using a preset selection mode comprises:
selecting one ordering result from the plurality of ordering results as the target ordering result based on a majority-rule voting mechanism in which the minority yields to the majority; or,
selecting one ordering result from the plurality of ordering results as the target ordering result based on a confidence mechanism in which a higher confidence indicates a more accurate ordering result.
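The two selection modes of claim 11 can be illustrated with a short sketch. The function names, the representation of an ordering result as a sequence of block identifiers, and the idea that each result comes with a numeric confidence are assumptions made for this example; the claims do not say how confidence is computed.

```python
from collections import Counter
from typing import Sequence, Tuple


def select_by_majority(results: Sequence[Sequence[str]]) -> Tuple[str, ...]:
    """Return the ordering that occurs most often among the candidates."""
    counts = Counter(tuple(r) for r in results)
    # most_common(1) yields the (ordering, count) pair with the highest count.
    return counts.most_common(1)[0][0]


def select_by_confidence(
    results: Sequence[Sequence[str]], confidences: Sequence[float]
) -> Tuple[str, ...]:
    """Return the ordering whose (assumed) confidence score is highest."""
    best = max(range(len(results)), key=lambda i: confidences[i])
    return tuple(results[best])
```

With candidates `["A", "B", "C"]`, `["A", "C", "B"]`, and `["A", "B", "C"]`, the majority vote returns `("A", "B", "C")`; with confidences `0.2`, `0.9`, and `0.5`, the confidence mode returns `("A", "C", "B")`.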
12. The document parsing method according to claim 1, wherein the position information of the characters comprises position information of a second rectangular area corresponding to each character, and clustering the characters according to the position information of the characters to obtain a plurality of text blocks comprises:
calculating the center coordinates of the second rectangular area according to the position information of the second rectangular area corresponding to the character;
and clustering the characters according to the center coordinates of the second rectangular areas to obtain the plurality of text blocks.
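A minimal sketch of claim 12 follows. The claim does not name a clustering algorithm in this passage, so the sketch substitutes a simple distance-threshold (single-linkage) grouping; the `(l, r, u, b)` tuple layout, the function names, and the `max_dist` parameter are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (l, r, u, b)


def center(box: Box) -> Tuple[float, float]:
    """Center coordinates of a character's second rectangular area."""
    l, r, u, b = box
    return ((l + r) / 2, (u + b) / 2)


def cluster_by_center(boxes: List[Box], max_dist: float) -> List[List[int]]:
    """Group character indices whose centers are chained within max_dist."""
    centers = [center(bx) for bx in boxes]
    parent = list(range(len(boxes)))  # union-find forest

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            if dx * dx + dy * dy <= max_dist * max_dist:
                parent[find(i)] = find(j)  # merge the two groups

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned list holds the indices of the characters that would form one text block. A density-based method could be swapped in for the threshold grouping without changing the center computation.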
13. The document parsing method according to claim 1, wherein extracting characters from the document to be parsed to obtain the information of the characters in the document comprises:
acquiring the information of the characters according to character encoding information corresponding to the document; and/or,
acquiring the information of the characters according to an optical character recognition result of the document.
14. The document parsing method according to claim 1, wherein the position information of the text block comprises position information of a first rectangular area corresponding to the text block, the position information of the first rectangular area comprising an abscissa minimum value L, an abscissa maximum value R, an ordinate minimum value U, and an ordinate maximum value B of the first rectangular area; the position information of the characters comprises position information of a second rectangular area corresponding to each character, the position information of the second rectangular area comprising an abscissa minimum value l, an abscissa maximum value r, an ordinate minimum value u, and an ordinate maximum value b of the second rectangular area;
and determining the information of the text block according to the information of the characters in the text block comprises:
arranging the contents of the characters in the text block from left to right and from top to bottom according to the position information of the characters in the text block, to obtain the content of the text block;
and determining the smallest l among the characters in the text block as L of the text block, the largest r among the characters in the text block as R of the text block, the smallest u among the characters in the text block as U of the text block, and the largest b among the characters in the text block as B of the text block.
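The two steps of claim 14 can be sketched as below. The `Char` class and the `block_info` function are hypothetical names, and sorting by `(u, l)` assumes that characters on the same line share the same `u`; a real document would likely need a tolerance when grouping characters into lines.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    """Hypothetical character with its second rectangular area (l, r, u, b)."""
    text: str
    l: float
    r: float
    u: float
    b: float


def block_info(chars: List[Char]) -> Tuple[str, Tuple[float, float, float, float]]:
    """Return (content, (L, R, U, B)) for the characters of one text block."""
    # Top to bottom first (u), then left to right within a line (l).
    ordered = sorted(chars, key=lambda c: (c.u, c.l))
    content = "".join(c.text for c in ordered)
    box = (
        min(c.l for c in chars),  # L: smallest l
        max(c.r for c in chars),  # R: largest r
        min(c.u for c in chars),  # U: smallest u
        max(c.b for c in chars),  # B: largest b
    )
    return content, box
```

The resulting `(L, R, U, B)` tuple is the bounding rectangle of all characters in the block, which is the position information the ordering step consumes.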
15. A document parsing apparatus, comprising:
an extraction module, configured to extract characters from a document to be parsed to obtain information of the characters in the document, wherein the information of the characters comprises the content of the characters and the position information of the characters;
a clustering module, configured to cluster the characters according to the position information of the characters to obtain a plurality of text blocks;
a determining module, configured to determine the information of the text blocks according to the information of the characters in the text blocks, wherein the information of the text blocks comprises the content of the text blocks and the position information of the text blocks;
an ordering module, configured to order the plurality of text blocks according to the content of the text blocks and/or the position information of the text blocks to obtain an ordering result;
and a generation module, configured to generate a document parsing result according to the content of the text blocks and the ordering result.
16. A document parsing apparatus, the apparatus comprising:
a processor; and
a memory arranged to store computer executable instructions configured to be executed by the processor, the executable instructions comprising steps for performing the method of any of claims 1-14.
17. A storage medium storing computer executable instructions for causing a computer to perform the method of any one of claims 1-14.
CN202311293136.1A 2023-10-08 2023-10-08 Document analysis method and device Pending CN117033642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311293136.1A CN117033642A (en) 2023-10-08 2023-10-08 Document analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311293136.1A CN117033642A (en) 2023-10-08 2023-10-08 Document analysis method and device

Publications (1)

Publication Number Publication Date
CN117033642A true CN117033642A (en) 2023-11-10

Family

ID=88623077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311293136.1A Pending CN117033642A (en) 2023-10-08 2023-10-08 Document analysis method and device

Country Status (1)

Country Link
CN (1) CN117033642A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070043760A1 (en) * 2005-08-22 2007-02-22 Microsoft Corporation Embedding expression in XML literals
CN101206639A (en) * 2007-12-20 2008-06-25 北大方正集团有限公司 Method for indexing complex impression based on PDF
CN112434686A (en) * 2020-11-16 2021-03-02 浙江大学 End-to-end error-containing text classification recognition instrument for OCR (optical character recognition) picture
CN113887235A (en) * 2021-09-24 2022-01-04 北京三快在线科技有限公司 Information recommendation method and device
CN114359943A (en) * 2022-01-13 2022-04-15 北京华宇信息技术有限公司 OFD format document paragraph identification method and device
CN116738988A (en) * 2023-05-24 2023-09-12 腾讯音乐娱乐科技(深圳)有限公司 Text detection method, computer device, and storage medium

Similar Documents

Publication Publication Date Title
US10915564B2 (en) Leveraging corporal data for data parsing and predicting
CN106776673B (en) Multimedia document summarization
CN110705214B (en) Automatic coding method and device
CN116227474B (en) Method and device for generating countermeasure text, storage medium and electronic equipment
CN106959946B (en) Text semantic feature generation optimization method based on deep learning
CN111401062B (en) Text risk identification method, device and equipment
CN112672184A (en) Video auditing and publishing method
CN117331561B (en) Intelligent low-code page development system and method
CN117076650B (en) Intelligent dialogue method, device, medium and equipment based on large language model
CN111931041A (en) Label recommendation method and device, electronic equipment and storage medium
CN108921190A (en) A kind of image classification method, device and electronic equipment
CN111104572A (en) Feature selection method and device for model training and electronic equipment
CN116402113B (en) Task execution method and device, storage medium and electronic equipment
CN110457430A (en) A kind of Traceability detection method of text, device and equipment
CN111339910B (en) Text processing and text classification model training method and device
CN117252183A (en) Semantic-based multi-source table automatic matching method, device and storage medium
CN117113174A (en) Model training method and device, storage medium and electronic equipment
CN117033642A (en) Document analysis method and device
CN114330303A (en) Text error correction method and related equipment
CN115221523A (en) Data processing method, device and equipment
CN111325195B (en) Text recognition method and device and electronic equipment
CN110321433B (en) Method and device for determining text category
CN117743558B (en) Knowledge processing and knowledge question-answering method, device and medium based on large model
US20240161529A1 (en) Extracting document hierarchy using a multimodal, layer-wise link prediction neural network
CN113204664B (en) Image clustering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 228, 2nd Floor, No. 5 Guanghua Road, Zhangjiawan Town, Tongzhou District, Beijing, 101113

Applicant after: BEIJING ZHONGGUANCUN KEJIN TECHNOLOGY Co.,Ltd.

Address before: 130, 1st Floor, Building 5, Courtyard 1, Shangdi Fourth Street, Haidian District, Beijing, 100085

Applicant before: BEIJING ZHONGGUANCUN KEJIN TECHNOLOGY Co.,Ltd.

Country or region before: China