CN117764044A - Document dividing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN117764044A (application CN202311866289.0A)
- Authority
- CN
- China
- Prior art keywords
- paragraph
- input document
- blocks
- current
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a document dividing method and apparatus, an electronic device, and a storage medium. The method comprises: determining a plurality of specified units in an input document based on set symbols included in the input document; performing multi-modal feature extraction on the specified units to obtain unit features representing the layout features and semantic features of the specified units; and dividing the plurality of specified units based on the unit features, so as to divide the input document into a plurality of paragraph blocks; wherein the process of dividing the plurality of specified units includes performing paragraph block clustering based on paragraph features of the existing paragraph blocks of the input document, so as to update the current paragraph block of the input document.
Description
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method and apparatus for dividing a document, an electronic device, and a storage medium.
Background
Natural language processing techniques are widely used. Their input includes speech information and text information, for example manually written documents. A manually written document can cover various topics and carry rich semantic information, but it is prone to lengthy content, unreasonable structure, and low readability. These problems strongly affect downstream natural language processing tasks such as document-based retrieval, document-based question answering, and document summary extraction. Dividing a long document splits the original document and groups related content together, so that each part is reasonably complete and easier to understand in a focused way.
Disclosure of Invention
The application provides a document dividing method, a document dividing device, electronic equipment and a storage medium.
An aspect of an embodiment of the present application provides a document dividing method, including:
determining a plurality of specified units in an input document based on set symbols included in the input document;
performing multi-modal feature extraction on the specified units to obtain unit features representing the layout features and semantic features of the specified units;
dividing the plurality of specified units based on the unit features to divide the input document into a plurality of paragraph blocks;
wherein the dividing of the plurality of specified units includes performing paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update a current paragraph block of the input document.
Wherein the dividing of the plurality of specified units based on the unit features includes:
generating an initial paragraph block based on the first specified unit;
generating a current paragraph block based on the unit features of the second specified unit and the paragraph features of the initial paragraph block;
traversing the other specified units, and generating an updated current paragraph block based on the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block, wherein the updated current paragraph block includes at least two paragraph blocks.
Wherein the generating of the updated current paragraph block based on the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block includes:
performing a difference comparison between the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block to obtain a comparison result;
if the comparison result is smaller than a first threshold, adding the currently traversed specified unit to the current paragraph block to generate an updated current paragraph block;
and if the comparison result is greater than or equal to the first threshold, generating a new paragraph block based on the currently traversed specified unit to generate an updated current paragraph block.
Wherein the performing of paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update the current paragraph block of the input document includes:
if the current paragraph block of the input document changes, clustering the existing paragraph blocks;
and merging the paragraph blocks whose similarity in the clustering result is greater than or equal to a second threshold, to update the current paragraph block of the input document.
Wherein the performing of paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update the current paragraph block of the input document further includes:
if the number of current paragraph blocks of the input document reaches a set number, clustering the existing paragraph blocks;
and merging the paragraph blocks whose similarity in the clustering result is greater than or equal to the second threshold, to update the current paragraph block of the input document.
Wherein the performing of paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update the current paragraph block of the input document further includes:
clustering the existing paragraph blocks of the input document at each set time interval;
and merging the paragraph blocks whose similarity in the clustering result is greater than or equal to the second threshold, to update the current paragraph block of the input document.
Wherein the input document includes image information representing document layout information and text information representing document semantic information, and the performing of multi-modal feature extraction on the specified units includes:
performing feature extraction on the input document according to the image information and the text information to obtain unit features representing the layout features and semantic features of the specified units.
Another aspect of the embodiments of the present application provides a document dividing apparatus, including:
a first dividing module, configured to determine a plurality of specified units in an input document based on set symbols included in the input document;
an extraction module, configured to perform multi-modal feature extraction on the specified units to obtain unit features representing the layout features and semantic features of the specified units;
a second dividing module, configured to divide the plurality of specified units based on the unit features to divide the input document into a plurality of paragraph blocks;
wherein the second dividing module is further configured to perform paragraph block clustering based on paragraph features of existing paragraph blocks of the input document, so as to update a current paragraph block of the input document.
Still another aspect of the present application provides an electronic device, including:
a processor, and a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute them to implement the document dividing method.
A further aspect of the present application provides a computer-readable storage medium storing a computer program for executing the document dividing method.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 illustrates a flow chart of a method of document partitioning according to one embodiment of the present application;
FIG. 2A shows a schematic diagram of an input document according to one embodiment of the present application;
FIG. 2B illustrates a schematic diagram of a plurality of designated units of an input document according to one embodiment of the present application;
FIG. 2C illustrates a schematic diagram of an initial paragraph block of an input document, according to one embodiment of the present application;
FIG. 2D illustrates a schematic diagram of a current paragraph block of an input document, according to one embodiment of the present application;
FIG. 2E shows a schematic diagram of a current paragraph block of an input document according to another embodiment of the present application;
FIG. 2F illustrates a schematic diagram of an updated current paragraph block of an input document according to one embodiment of the present application;
FIG. 2G illustrates a schematic diagram of an updated current paragraph block of an input document according to another embodiment of the present application;
FIG. 2H illustrates a schematic diagram of an updated current paragraph block of an input document according to another embodiment of the present application;
FIG. 3 illustrates a flow chart of a method of document partitioning according to another embodiment of the present application;
FIG. 4 illustrates a flow chart of a document partitioning method according to another embodiment of the present application;
FIG. 5 illustrates a flow chart of a method of document partitioning according to another embodiment of the present application;
FIG. 6 illustrates a flow chart of a method of document partitioning according to another embodiment of the present application;
FIG. 7 illustrates a flow chart of a method of document partitioning according to another embodiment of the present application;
FIG. 8 shows a schematic diagram of a document dividing apparatus according to one embodiment of the present application;
FIG. 9 shows a schematic diagram of the structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, features, and advantages of the present application clearer and easier to understand, the technical solutions of the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments herein without inventive effort fall within the scope of protection of the present application.
Existing document dividing methods do not consider the layout information of a document, such as fonts, font sizes, and colors. This information reflects the structure of the document and can therefore provide additional cues for dividing it.
Therefore, in order to reasonably divide a document without destroying the original logical structure of the document, an embodiment of the present application provides a document division method, as shown in fig. 1, including:
Step 101, determining a plurality of specified units in an input document based on set symbols included in the input document.
In this embodiment, the set symbol may be a comma, a period, and/or a semicolon, among others; in other embodiments, the set symbols may be chosen based on specific requirements.
For example, the input document shown in fig. 2A contains text content delimited by periods. The input document is divided into a plurality of specified units based on these periods: each period, together with the text content preceding it, forms one specified unit, so that the input document shown in fig. 2A is divided into the 7 specified units shown in fig. 2B.
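As an illustration of step 101, the following minimal sketch splits text into units that each end with one of the set symbols. The function name `split_into_units` and the default symbol set are assumptions made for this example, not part of the application; a plain period is also treated as a boundary inside numbers such as "3.14", which a practical implementation would need to handle.

```python
def split_into_units(text: str, set_symbols: frozenset = frozenset(".。")) -> list[str]:
    """Split text into units; each set symbol closes the unit that precedes it."""
    units: list[str] = []
    current: list[str] = []
    for ch in text:
        current.append(ch)
        if ch in set_symbols:
            # The symbol and the text before it form one unit.
            unit = "".join(current).strip()
            if unit:
                units.append(unit)
            current = []
    # Keep any trailing text that is not closed by a set symbol.
    tail = "".join(current).strip()
    if tail:
        units.append(tail)
    return units


if __name__ == "__main__":
    for unit in split_into_units("First sentence. Second sentence. Trailing text"):
        print(repr(unit))
```

Passing a different `set_symbols` value (for example `frozenset(";.")`) corresponds to choosing other set symbols based on specific requirements.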
Step 102, performing multi-modal feature extraction on the specified units to obtain unit features representing the layout features and semantic features of the specified units.
Multi-modal feature extraction may be performed on a specified unit using, for example, a CLIP model, a ViLBERT model, a UNITER model, a DALL-E model, or an M3E model, to obtain the unit features of the specified unit. The unit features contain the layout features and semantic features of the specified unit.
Step 103, dividing a plurality of the specified units based on the unit features to divide the input document into a plurality of paragraph blocks; wherein the dividing of the plurality of the specified units includes performing paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update a current paragraph block of the input document.
The plurality of specified units are divided based on the unit features to obtain a plurality of paragraph blocks corresponding to the input document, each paragraph block including at least one specified unit. During the division, when a certain condition is met, the existing paragraph blocks of the input document need to be re-examined. The re-examination checks whether the division of the existing paragraph blocks is reasonable: multi-modal feature extraction is performed on the existing paragraph blocks to obtain the paragraph features of each existing paragraph block, and paragraph block clustering is performed based on these paragraph features to update the current paragraph block of the input document.
It should be noted that after new paragraph blocks are generated by paragraph block clustering, the old paragraph blocks used for the clustering may be retained or deleted according to a preset setting.
In the above scheme, the input document is first divided into a plurality of specified units based on the set symbols in the input document. Multi-modal feature extraction is then performed on the specified units to obtain unit features representing their layout features and semantic features. Finally, the specified units of the input document are divided into a plurality of paragraph blocks based on the unit features. The layout information and semantic information of the input document are thus both taken into account, so that the resulting paragraph blocks are more reasonable and meet the user's division requirements. In addition, re-examining the existing paragraph blocks of the input document during the division can further improve the accuracy of the paragraph blocks finally obtained.
In an example of the present application, there is further provided a document dividing method, as shown in fig. 3, wherein dividing a plurality of the specified units based on the unit features includes:
Step 201, generating an initial paragraph block based on the first specified unit.
A blank paragraph block is created, and the first specified unit in a preset order is added to the blank paragraph block to generate the initial paragraph block.
In this embodiment, the preset order is determined by the positions of the specified units in the input document; in other embodiments, the preset order may be set based on specific requirements.
For example, as shown in fig. 2C, the first specified unit is determined from the plurality of specified units shown in fig. 2A according to position in the input document, a blank paragraph block is created, and the first specified unit is added to the blank paragraph block to generate the initial paragraph block.
Step 202, generating a current paragraph block based on the unit features of the second specified unit and the paragraph features of the initial paragraph block.
Based on the unit features of the second specified unit in the preset order and the paragraph features of the initial paragraph block, it is determined whether to add the second specified unit to the initial paragraph block or, instead, to generate a new paragraph block and add the second specified unit to that new paragraph block. In the latter case, the generated current paragraph block consists of the initial paragraph block and the new paragraph block.
For example, the second specified unit is determined from the plurality of specified units shown in fig. 2A according to position in the input document, and it is determined whether the second specified unit needs to be added to the initial paragraph block based on the unit features of the second specified unit and the paragraph features of the initial paragraph block. If so, the second specified unit is added to the initial paragraph block to generate the current paragraph block, as shown in fig. 2D. If not, a new paragraph block is created and the second specified unit is added to the new paragraph block to generate the current paragraph block, as shown in fig. 2E.
Step 203, traversing the other specified units, and generating an updated current paragraph block based on the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block, wherein the updated current paragraph block includes at least two paragraph blocks.
Similar to step 202, based on the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block, it is determined whether to add the currently traversed specified unit to an existing paragraph block or, instead, to generate a new paragraph block and add the currently traversed specified unit to that new paragraph block. The updated current paragraph block includes at least two paragraph blocks.
For example, the third specified unit is determined from the plurality of specified units shown in fig. 2A according to position in the input document, and the current paragraph block is as shown in fig. 2E. The unit features of the third specified unit are therefore compared with the paragraph features of each of the two paragraph blocks in fig. 2E to determine whether the third specified unit should join both paragraph blocks, join only one of them, or join neither. If it is determined that the third specified unit is to join both paragraph blocks, the third specified unit is added to each of the two paragraph blocks in the current paragraph block to generate updated paragraph blocks, as shown in fig. 2F. If it is determined that the third specified unit is to join the first paragraph block but not the second, the third specified unit is added to the first paragraph block in the current paragraph block to generate updated paragraph blocks, as shown in fig. 2G. If it is determined that the third specified unit joins none of the current paragraph blocks, a new paragraph block is generated and the third specified unit is added to the new paragraph block to generate updated paragraph blocks, as shown in fig. 2H.
After the traversal is finished, the division of the plurality of specified units of the input document is complete, and the plurality of paragraph blocks of the input document are obtained.
In the above scheme, an initial paragraph block is generated based on the first specified unit. Then, based on the unit features of the second specified unit and the paragraph features of the initial paragraph block, it is determined whether to add the second specified unit to the initial paragraph block or to generate a new paragraph block and add the second specified unit to it, which yields the current paragraph block. The other specified units are then traversed: based on the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block, it is determined whether to add that unit to one of the paragraph blocks in the current paragraph block or to generate a new paragraph block and add the unit to it, which updates the current paragraph block. The input document is thus divided reasonably based on its layout features and semantic features, and the accuracy of the paragraph blocks finally obtained is improved.
In an example of the present application, as shown in fig. 4, the generating of the updated current paragraph block based on the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block includes:
Step 301, performing a difference comparison between the unit features of the currently traversed specified unit and the paragraph features of the current paragraph block to obtain a comparison result.
The unit features of the currently traversed specified unit are compared with the paragraph features of each paragraph block in the current paragraph block to obtain a comparison result for each paragraph block.
In this example, the difference comparison can be made in three ways:
Mode one: calculating a correlation coefficient (e.g., the Pearson correlation coefficient) between the unit features of the currently traversed specified unit and the paragraph features of each paragraph block in the current paragraph block.
Mode two: calculating a distance (e.g., Euclidean distance, Manhattan distance, or cosine similarity) between the unit features of the currently traversed specified unit and the paragraph features of each paragraph block in the current paragraph block.
Mode three: calculating a kernel function (e.g., a linear kernel, polynomial kernel, or Gaussian kernel) between the unit features of the currently traversed specified unit and the paragraph features of each paragraph block in the current paragraph block.
In other embodiments, any other way of performing the difference comparison may be used.
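The three comparison modes can be sketched with plain Python as follows, taking one representative measure per mode (Pearson correlation, Euclidean distance, Gaussian kernel). The function names and the `sigma` parameter are assumptions for illustration. Note that a correlation coefficient or a kernel value grows as two feature vectors become more alike, whereas a distance shrinks; the text does not say how such scores are oriented before being compared with the first threshold, so the conversion mentioned in the comments is only one possible choice.

```python
import math


def pearson(u: list[float], v: list[float]) -> float:
    """Mode one: Pearson correlation coefficient (undefined for constant vectors)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v)


def euclidean(u: list[float], v: list[float]) -> float:
    """Mode two: Euclidean distance (smaller means more alike)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def gaussian_kernel(u: list[float], v: list[float], sigma: float = 1.0) -> float:
    """Mode three: Gaussian (RBF) kernel, 1.0 for identical vectors."""
    return math.exp(-euclidean(u, v) ** 2 / (2 * sigma ** 2))


if __name__ == "__main__":
    unit_feature = [0.2, 0.8, 0.5]
    paragraph_feature = [0.1, 0.9, 0.4]
    print("pearson:", pearson(unit_feature, paragraph_feature))
    print("euclidean:", euclidean(unit_feature, paragraph_feature))
    # Assumption: a similarity score s could be turned into a difference as 1 - s.
    print("1 - kernel:", 1 - gaussian_kernel(unit_feature, paragraph_feature))
```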
Step 302, if the comparison result is smaller than a first threshold, adding the currently traversed specified unit to the current paragraph block to generate an updated current paragraph block.
If there is at least one paragraph block whose comparison result with the currently traversed specified unit is smaller than the first threshold, the currently traversed specified unit is added to that at least one paragraph block to generate an updated current paragraph block.
Step 303, if the comparison result is greater than or equal to the first threshold, generating a new paragraph block based on the currently traversed specified unit to generate an updated current paragraph block.
If the comparison results between all paragraph blocks and the currently traversed specified unit are greater than or equal to the first threshold, a new paragraph block is generated and the currently traversed specified unit is added to the new paragraph block to generate an updated current paragraph block.
In the above scheme, the other specified units are traversed, and the unit features of the currently traversed specified unit are compared with the paragraph features of each paragraph block in the current paragraph block. When at least one paragraph block has a comparison result smaller than the first threshold, the currently traversed specified unit is added to that at least one paragraph block to generate an updated current paragraph block. When all comparison results are greater than or equal to the first threshold, a new paragraph block is generated and the currently traversed specified unit is added to it to generate an updated current paragraph block. Dividing the specified units by difference comparison further improves the accuracy of the paragraph blocks finally obtained.
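The traversal described above (first unit opens the initial block; each later unit joins every block whose comparison result is below the first threshold, or opens a new block if none qualifies) might look like the following sketch. Two choices here are assumptions made only for illustration: the comparison is a Euclidean distance, and the paragraph feature of a block is the mean of its member unit features, whereas the application obtains paragraph features by multi-modal feature extraction on the block. The names `divide_units` and `mean_vector` are hypothetical.

```python
import math


def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Assumed stand-in for a paragraph feature: mean of the member unit features."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def divide_units(unit_features: list[list[float]], first_threshold: float) -> list[list[int]]:
    """Return paragraph blocks as lists of unit indices, in traversal order."""
    blocks: list[list[int]] = []
    for index, feature in enumerate(unit_features):
        # Compare the unit with every existing block before modifying any block.
        matches = [
            block for block in blocks
            if euclidean(feature, mean_vector([unit_features[i] for i in block])) < first_threshold
        ]
        if matches:
            # A unit may join more than one block (compare fig. 2F).
            for block in matches:
                block.append(index)
        else:
            # No block is close enough: open a new block (compare fig. 2H).
            # For the first unit this creates the initial paragraph block.
            blocks.append([index])
    return blocks


if __name__ == "__main__":
    features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]
    print(divide_units(features, first_threshold=1.0))
```

In the usage example, units 0, 1, and 4 lie close together and units 2 and 3 lie close together, so two paragraph blocks result.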
In an example of the present application, as shown in fig. 5, the performing of paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update the current paragraph block of the input document includes:
Step 401, if the current paragraph block of the input document changes, clustering the existing paragraph blocks.
If the current paragraph block of the input document changes (for example, each time a specified unit is assigned, the current paragraph block of the input document is updated), the changed paragraph blocks are re-examined. The re-examination mainly consists of clustering the changed paragraph blocks (for example, by K-means clustering, hierarchical clustering, DBSCAN clustering, LDA clustering, or spectral clustering) to determine whether the division of the changed paragraph blocks is reasonable.
Step 402, merging the paragraph blocks whose similarity in the clustering result is greater than or equal to a second threshold, to update the current paragraph block of the input document.
If the clustering result contains paragraph blocks whose similarity is greater than or equal to the second threshold, the re-examination indicates that the division of the changed paragraph blocks is unreasonable, and those paragraph blocks are merged to generate an updated current paragraph block.
It should be noted that after paragraph blocks are merged, the original paragraph blocks that were merged may be set to be retained or to be deleted. For example, suppose there are paragraph blocks A, B, and C, and after clustering, paragraph blocks A and C need to be merged. If the original paragraph blocks are set to be retained, the updated current paragraph blocks are A, B, C, and AC. If the original paragraph blocks are set to be deleted, the updated current paragraph blocks are B and AC.
In the above scheme, when the current paragraph block of the input document changes, the existing paragraph blocks are re-examined to determine whether the division of the changed paragraph blocks is reasonable. When the division is unreasonable, mutually related paragraph blocks are grouped together, so that the paragraph blocks after the re-examination are more reasonable.
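The A, B, C example can be reproduced with the sketch below. It is a simplification under stated assumptions: pairwise cosine similarity over all blocks stands in for the clustering step (the application names K-means, hierarchical clustering, DBSCAN, and others), block names are strings, and the names `merge_similar_blocks` and `keep_originals` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def merge_similar_blocks(blocks, features, second_threshold, keep_originals=False):
    """blocks: name -> list of unit ids; features: name -> paragraph feature vector."""
    names = list(blocks)
    group = {name: {name} for name in names}  # every block starts in its own group
    for a, b in combinations(names, 2):
        if cosine_similarity(features[a], features[b]) >= second_threshold:
            merged = group[a] | group[b]
            for name in merged:  # all members now share the enlarged group
                group[name] = merged

    result, seen = {}, set()
    for name in names:
        members = tuple(sorted(group[name], key=names.index))
        if members in seen:
            continue
        seen.add(members)
        if len(members) == 1:
            result[name] = list(blocks[name])
            continue
        if keep_originals:
            for member in members:  # retain the blocks that were merged
                result[member] = list(blocks[member])
        # Merged block: concatenate units, dropping duplicates but keeping order.
        units = [unit for member in members for unit in blocks[member]]
        result["".join(members)] = list(dict.fromkeys(units))
    return result


if __name__ == "__main__":
    blocks = {"A": [0, 1], "B": [2], "C": [3]}
    features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.9, 0.1]}
    print(merge_similar_blocks(blocks, features, 0.9))
    print(merge_similar_blocks(blocks, features, 0.9, keep_originals=True))
```

With `keep_originals=False` the result contains B and AC; with `keep_originals=True` it contains A, B, C, and AC, matching the two settings described in the text.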
In an example of the present application, as shown in fig. 6, the performing of paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update the current paragraph block of the input document further includes:
Step 501, if the number of current paragraph blocks of the input document reaches a set number, clustering the existing paragraph blocks.
If the number of current paragraph blocks of the input document reaches the set number (for example, 5; the set number may be chosen based on specific requirements), the existing paragraph blocks are re-examined. The re-examination mainly consists of clustering the existing paragraph blocks (for example, by K-means clustering, hierarchical clustering, DBSCAN clustering, LDA clustering, or spectral clustering) to determine whether the division of the existing paragraph blocks is reasonable.
Step 502, merging the paragraph blocks whose similarity in the clustering result is greater than or equal to a second threshold, to update the current paragraph block of the input document.
If the clustering result contains paragraph blocks whose similarity is greater than or equal to the second threshold, the re-examination indicates that the division of the existing paragraph blocks is unreasonable, and those paragraph blocks are merged to generate an updated current paragraph block.
It should be noted that, in this embodiment, the existing paragraph blocks are re-examined when the number of current paragraph blocks reaches the set number, so that, in addition to detecting an unreasonable division and handling it accordingly, the re-examination also serves to reduce the number of paragraph blocks. Therefore, the original paragraph blocks are generally set to be deleted after they are merged.
In the above scheme, when the number of current paragraph blocks of the input document reaches the set number, the existing paragraph blocks are re-examined to determine whether their division is reasonable. When the division is unreasonable, mutually related paragraph blocks are grouped together, so that the clustered paragraph blocks are more reasonable, the number of existing paragraph blocks is reduced, and the document is easier for the user to read.
In an example of the present application, as shown in fig. 7, performing paragraph block clustering based on the paragraph features of the existing paragraph blocks of the input document to update the current paragraph block of the input document further includes:
Step 601, clustering the existing paragraph blocks of the input document at each set time interval.
At each set time interval (for example, every 2 seconds; the interval can be set based on specific requirements), the existing paragraph blocks are reviewed. The review mainly clusters the existing paragraph blocks (for example, by K-means clustering, hierarchical clustering, DBSCAN clustering, LDA clustering, spectral clustering, or the like) to determine whether the division of the existing paragraph blocks is reasonable.
Step 602, merging paragraph blocks with similarity greater than or equal to a second threshold in the clustering result to update the current paragraph block of the input document.
If the clustering result contains paragraph blocks whose similarity is greater than or equal to the second threshold value, the review indicates that the division of the existing paragraph blocks is unreasonable; those paragraph blocks are then merged to generate the updated current paragraph blocks.
It should be noted that, after the paragraph blocks are merged, the original paragraph blocks that were merged may be set to be either retained or deleted.
In the above scheme, reviewing the existing paragraph blocks at each set time interval makes it possible to determine whether their division is reasonable. When the division is unreasonable, mutually related paragraph blocks are grouped together, so that the clustered paragraph blocks are more reasonable.
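A time-triggered review of this kind might be wired up as in the sketch below. It is a minimal illustration under assumptions: the class name `PeriodicReviewer`, the `review` callback, and the injectable `clock` are all hypothetical and do not come from the application; the callback stands for whatever clustering-and-merging step is used.

```python
import time


class PeriodicReviewer:
    """Runs a review callback on the paragraph blocks once per set time interval."""

    def __init__(self, review, interval_seconds=2.0, clock=time.monotonic):
        self.review = review              # hypothetical callback: blocks -> updated blocks
        self.interval = interval_seconds  # the "set time" from the text, e.g. 2 seconds
        self.clock = clock                # injectable so the trigger can be tested
        self.last_run = clock()

    def maybe_review(self, blocks):
        """Return reviewed blocks if the interval has elapsed, else the blocks unchanged."""
        now = self.clock()
        if now - self.last_run >= self.interval:
            self.last_run = now
            return self.review(blocks)
        return blocks
```

Calling `maybe_review` after each newly processed unit keeps the trigger independent of how many blocks currently exist, which is the main difference from the count-triggered variant.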
In an example of the present application, there is further provided a document dividing method in which the input document includes image information characterizing document layout information and text information characterizing document semantic information, and performing multi-modal feature extraction on the specified unit includes:
extracting features of the input document according to the image information and the text information to obtain unit features characterizing the layout features and semantic features of the specified unit.
When features are extracted from text information alone, usually only the semantic information of the text is captured. In this scheme, multi-modal feature extraction is therefore performed on both the image information containing the document layout information and the text information containing the document semantic information, so that unit features containing both the layout features and the semantic features of the specified unit are obtained. This improves the accuracy of the subsequent division of the specified units and of the paragraph block clustering based on the unit features.
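The idea of a unit feature that carries both layout and semantic information can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the application does not specify the feature extractor, so a hand-supplied layout vector (for example, position and size values) stands in for features derived from the image information, a hashed bag-of-words stands in for a semantic text encoder, and the two parts are simply normalized and concatenated. The names `unit_feature` and `toy_text_embedding` are hypothetical.

```python
import math
import zlib


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def toy_text_embedding(text, dim=8):
    """Stand-in for a semantic encoder: hashed bag-of-words counts (assumption)."""
    vector = [0.0] * dim
    for token in text.lower().split():
        vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vector


def unit_feature(layout, text, dim=8):
    """Concatenate normalized layout and semantic parts into one unit feature."""
    return l2_normalize(layout) + l2_normalize(toy_text_embedding(text, dim))
```

With this representation, two units with identical text but different layout values (for instance, a heading and a body line) produce different unit features, which is the property the text attributes to multi-modal extraction.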
In order to implement the above-described document dividing method, as shown in fig. 8, an example of the present application provides a document dividing apparatus including:
a first dividing module 10 for determining a plurality of specified units in an input document based on setting symbols included in the input document;
an extracting module 20 for performing multi-modal feature extraction on the specified unit to obtain unit features characterizing layout features and semantic features of the specified unit;
a second dividing module 30 for dividing a plurality of the specified units based on the unit features to divide the input document into a plurality of paragraph blocks;
the second dividing module 30 is further configured to perform paragraph block clustering based on paragraph features of existing paragraph blocks of the input document, so as to update a current paragraph block of the input document.
The second dividing module 30 is further configured to generate an initial paragraph block based on the first specified unit;
the second dividing module 30 is further configured to generate a current paragraph block based on the unit feature of the second specified unit and the paragraph feature of the initial paragraph block;
the second dividing module 30 is further configured to traverse other specified units, generate an updated current paragraph block based on the unit features of the current other specified units and the paragraph features of the current paragraph block, where the updated current paragraph block includes at least two paragraph blocks.
The second dividing module 30 is further configured to perform a difference comparison between the unit features of the current other specified units and the paragraph features of the current paragraph block to obtain a comparison result;
the second dividing module 30 is further configured to add the current other specified units to the current paragraph block if the comparison result is less than the first threshold value, and generate an updated current paragraph block;
the second dividing module 30 is further configured to generate a new paragraph block based on the current other specified units and generate an updated current paragraph block if the comparison result is greater than or equal to the first threshold value.
The second dividing module 30 is further configured to cluster the existing paragraph blocks if the current paragraph block of the input document is changed;
the second dividing module 30 is further configured to combine the paragraph blocks with the similarity greater than or equal to the second threshold in the clustering result, so as to update the current paragraph block of the input document.
The second dividing module 30 is further configured to cluster the existing paragraph blocks if the number of the current paragraph blocks of the input document reaches a set number;
the second dividing module 30 is further configured to combine the paragraph blocks with the similarity greater than or equal to the second threshold in the clustering result, so as to update the current paragraph block of the input document.
The second dividing module 30 is further configured to cluster existing paragraph blocks of the input document at each set time interval;
the second dividing module 30 is further configured to combine the paragraph blocks with the similarity greater than or equal to the second threshold in the clustering result, so as to update the current paragraph block of the input document.
The extracting module 20 is further configured to perform feature extraction on the input document according to the image information and the text information, so as to obtain unit features that are used for characterizing layout features and semantic features of the specified unit.
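The traversal performed by the second dividing module (an initial block from the first unit, then a difference comparison against the first threshold for every later unit) can be sketched as follows. This is a hedged illustration rather than the implementation of the apparatus: the difference is assumed to be Euclidean distance, the paragraph feature is assumed to be the mean of the unit features already in the block, and the names `divide_units` and `first_threshold` are hypothetical.

```python
import math


def distance(a, b):
    """Assumed difference measure: Euclidean distance between feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paragraph_feature(block):
    """Assumed paragraph feature: element-wise mean of the block's unit features."""
    return [sum(column) / len(block) for column in zip(*block)]


def divide_units(unit_features, first_threshold=0.5):
    """Group consecutive units into paragraph blocks.

    The first unit starts the initial block. Each later unit joins the current
    (last) block if its difference from that block's paragraph feature is below
    first_threshold; otherwise it starts a new block.
    """
    blocks = []
    for feature in unit_features:
        if blocks and distance(feature, paragraph_feature(blocks[-1])) < first_threshold:
            blocks[-1].append(feature)
        else:
            blocks.append([feature])
    return blocks
```

The blocks produced this way are the input that a later review step would cluster and merge against the second threshold.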
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
Fig. 9 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the document dividing method. For example, in some embodiments, the document dividing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the document dividing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the document dividing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive within the technical scope of the present disclosure shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (10)
1. A document partitioning method, the method comprising:
determining a plurality of specified units in an input document based on a setting symbol included in the input document;
performing multi-modal feature extraction on the specified unit to obtain unit features for representing layout features and semantic features of the specified unit;
dividing a plurality of the specified units based on the unit features to divide the input document into a plurality of paragraph blocks;
wherein the dividing of the plurality of the specified units includes performing paragraph block clustering based on paragraph features of existing paragraph blocks of the input document to update a current paragraph block of the input document.
2. The method of claim 1, the dividing a plurality of the specified units based on the unit features, comprising:
generating an initial paragraph block based on the first specified unit;
generating a current paragraph block based on the unit features of the second specified unit and the paragraph features of the initial paragraph block;
traversing other specified units, and generating an updated current paragraph block based on the unit features of the current other specified units and the paragraph features of the current paragraph block, wherein the updated current paragraph block comprises at least two paragraph blocks.
3. The method of claim 2, the generating the updated current paragraph block based on the unit features of the current other specified units and the paragraph features of the current paragraph block, comprising:
performing a difference comparison between the unit features of the current other specified units and the paragraph features of the current paragraph block to obtain a comparison result;
if the comparison result is smaller than a first threshold value, adding the current other specified units into the current paragraph block to generate an updated current paragraph block;
and if the comparison result is greater than or equal to the first threshold value, generating a new paragraph block based on the current other specified units, and generating an updated current paragraph block.
4. A method according to claim 3, said clustering of paragraph blocks based on paragraph features of existing paragraph blocks of said input document to update a current paragraph block of said input document, comprising:
if the current paragraph block of the input document is changed, clustering the existing paragraph blocks;
and merging the paragraph blocks with the similarity larger than or equal to a second threshold value in the clustering result to update the current paragraph block of the input document.
5. The method of claim 1, wherein the clustering of paragraph blocks based on paragraph features of existing paragraph blocks of the input document to update current paragraph blocks of the input document, further comprises:
if the number of the current paragraph blocks of the input document reaches the set number, clustering the existing paragraph blocks;
and merging the paragraph blocks with the similarity larger than or equal to a second threshold value in the clustering result to update the current paragraph block of the input document.
6. The method of claim 1, wherein the clustering of paragraph blocks based on paragraph features of existing paragraph blocks of the input document to update current paragraph blocks of the input document, further comprises:
clustering existing paragraph blocks of the input document at each interval of set time;
and merging the paragraph blocks with the similarity larger than or equal to a second threshold value in the clustering result to update the current paragraph block of the input document.
7. The method of claim 1, the input document including image information for characterizing document layout information and text information for characterizing semantic information of the document, the multi-modal feature extraction on the specified unit comprising:
and extracting features of the input document according to the image information and the text information to obtain unit features for representing layout features and semantic features of the designated unit.
8. A document dividing apparatus, the apparatus comprising:
a first dividing module for determining a plurality of specified units in an input document based on setting symbols included in the input document;
an extraction module for performing multi-modal feature extraction on the specified unit to obtain unit features for representing layout features and semantic features of the specified unit;
a second dividing module for dividing a plurality of the specified units based on the unit features to divide the input document into a plurality of paragraph blocks;
the second dividing module is further configured to perform paragraph block clustering based on paragraph features of existing paragraph blocks of the input document, so as to update a current paragraph block of the input document.
9. An apparatus comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1-7.
10. A computer readable storage medium comprising a set of computer executable instructions for performing the method of any of claims 1-7 when the instructions are executed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311866289.0A CN117764044A (en) | 2023-12-29 | 2023-12-29 | Document dividing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311866289.0A CN117764044A (en) | 2023-12-29 | 2023-12-29 | Document dividing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117764044A true CN117764044A (en) | 2024-03-26 |
Family
ID=90325742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311866289.0A Pending CN117764044A (en) | 2023-12-29 | 2023-12-29 | Document dividing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117764044A (en) |
- 2023-12-29 CN CN202311866289.0A patent/CN117764044A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113807098B (en) | Model training method and device, electronic equipment and storage medium | |
EP3923185A2 (en) | Image classification method and apparatus, electronic device and storage medium | |
CN114970522A (en) | Language model pre-training method, device, equipment and storage medium | |
CN112560481B (en) | Statement processing method, device and storage medium | |
CN113408660B (en) | Book clustering method, device, equipment and storage medium | |
US20230073994A1 (en) | Method for extracting text information, electronic device and storage medium | |
CN112989235A (en) | Knowledge base-based internal link construction method, device, equipment and storage medium | |
CN113850080A (en) | Rhyme word recommendation method, device, equipment and storage medium | |
CN112906368B (en) | Industry text increment method, related device and computer program product | |
CN113191145B (en) | Keyword processing method and device, electronic equipment and medium | |
CN117992569A (en) | Method, device, equipment and medium for generating document based on generation type large model | |
CN115248890A (en) | User interest portrait generation method and device, electronic equipment and storage medium | |
CN114880498B (en) | Event information display method and device, equipment and medium | |
US20230004715A1 (en) | Method and apparatus for constructing object relationship network, and electronic device | |
CN115952258A (en) | Generation method of government affair label library, and label determination method and device of government affair text | |
CN115510247A (en) | Method, device, equipment and storage medium for constructing electric carbon policy knowledge graph | |
CN114417862A (en) | Text matching method, and training method and device of text matching model | |
CN115600592A (en) | Method, device, equipment and medium for extracting key information of text content | |
CN114116914A (en) | Entity retrieval method and device based on semantic tag and electronic equipment | |
CN117764044A (en) | Document dividing method and device, electronic equipment and storage medium | |
CN113641724A (en) | Knowledge tag mining method and device, electronic equipment and storage medium | |
CN112926297A (en) | Method, apparatus, device and storage medium for processing information | |
CN113360602B (en) | Method, apparatus, device and storage medium for outputting information | |
CN116069914B (en) | Training data generation method, model training method and device | |
CN116484870B (en) | Method, device, equipment and medium for extracting text information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||