CN117669501A - Document information extraction method, device and storage medium - Google Patents
- Publication number
- CN117669501A (application number CN202311760500.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/154—Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The disclosure provides a document information extraction method, a document information extraction device and a storage medium, and relates to the technical field of computers. The document information extraction method comprises the following steps: first analyzing a document to be processed to obtain a node sequence in the document to be processed; performing second analysis on the document to be processed to obtain information contained in nodes in the document to be processed; constructing a document tree according to the sequence of the nodes in the document to be processed and the information contained in the nodes in the document to be processed; determining the category of software quality indexes to which the nodes in the document tree belong based on the trained machine learning model; and extracting the software quality index information from the nodes in the document tree according to the extraction rules corresponding to the software quality index categories. By the method, the accuracy and the extraction efficiency of the software quality index information extraction can be improved.
Description
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a method, a device and a storage medium for extracting document information.
Background
Effectively measuring software quality throughout the software life cycle helps with managing and improving the software, and is important to both developers and users. In software development, the information required to evaluate software quality is often spread across multiple documents, such as requirements documents, design documents, test documents, maintenance documents, and user manuals.
In the related art, information in a document is extracted based on an extraction rule or information in a document is extracted based on a machine learning model.
Disclosure of Invention
The present disclosure provides a document information extraction method, apparatus and storage medium.
According to a first aspect of the present disclosure, a document information extraction method is provided, including: first analyzing a document to be processed to obtain a node sequence in the document to be processed; performing second analysis on the document to be processed to obtain information contained in nodes in the document to be processed; constructing a document tree according to the sequence of the nodes in the document to be processed and the information contained in the nodes in the document to be processed; determining the category of software quality indexes to which the nodes in the document tree belong based on the trained machine learning model; and extracting the software quality index information from the nodes in the document tree according to the extraction rules corresponding to the software quality index categories.
In some embodiments, the determining, based on the trained machine learning model, a software quality indicator class to which a node in the document tree belongs includes: constructing an input vector according to the sequence of the nodes in the document to be processed and the information contained in the nodes in the document to be processed; and processing the input vector by using the trained machine learning model to obtain the software quality index category to which the node in the document tree belongs.
In some embodiments, the trained machine learning model is a Transformer-based bidirectional encoder representation (BERT) model.
In some embodiments, the document to be processed is a Word document, and the second parsing the document to be processed to obtain information contained in nodes in the document to be processed includes: creating a document object instance based on the document to be processed; and extracting information contained in the nodes from the document object instance according to the types of the nodes in the document to be processed.
In some embodiments, the type of the node includes at least one of a paragraph, a table, and a directory, and the type of the paragraph includes at least one of a picture and text.
In some embodiments, the information of the nodes includes hierarchical representation information of the nodes, and constructing the document tree according to the order of the nodes in the document to be processed and the information contained in the nodes in the document to be processed includes: traversing the nodes in the document to be processed according to the sequence of the nodes in the document to be processed; determining the hierarchical structure of the node in the document to be processed according to the hierarchical representation information of the node in the document to be processed; and constructing a document tree according to the hierarchical structure of the nodes in the document to be processed and the information contained in the nodes in the document to be processed.
In some embodiments, the hierarchical representation information of the node includes at least one of prefix information of the node and a text style of the node, and whether the node is title text.
In some embodiments, the determining, according to the hierarchical representation information of the nodes in the document to be processed, a hierarchical structure to which the nodes in the document to be processed belong includes: judging whether traversed nodes matched with at least one of prefix information and text style of an ith node exist or not under the condition that the ith node in the document to be processed is a title text, wherein i is an integer greater than 1 and less than N; if so, the ith node and the matched traversed node are positioned at the same level; if not, the ith node is the next layer of the lowest level in the traversed nodes.
In some embodiments, the determining, according to the hierarchical representation information of the nodes in the document to be processed, a hierarchical structure to which the nodes in the document to be processed belong further includes: and in the case that the ith node in the document to be processed is a non-title text and the last node of the ith node is a title text, the ith node is the next layer of the last node.
In some embodiments, the determining, according to the hierarchical representation information of the nodes in the document to be processed, a hierarchical structure to which the nodes in the document to be processed belong further includes: in the case that the ith node in the document to be processed is a non-title text and the previous node of the ith node is also a non-title text, judging whether the ith node and its previous node match in at least one of prefix information and text style; if they match, the ith node and its previous node are located at the same level; and if they do not match, the ith node is the next layer of its previous node.
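The three level-assignment rules above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Node` class, `prefix_shape`, `matches`, and `assign_levels` are hypothetical names. Three assumptions are made: the first node sits at level 1, two prefixes "match" when they have the same shape after digits are masked, and for a title node the "traversed nodes" are taken to be the previously traversed title nodes.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    text: str
    style: str                  # e.g. "Heading 1", "Normal", "List Paragraph"
    is_title: bool
    prefix: Optional[str] = None


def prefix_shape(prefix: Optional[str]) -> Optional[str]:
    """Assumed notion of prefix format: mask digits, so "1.2" and "3.4" agree."""
    return re.sub(r"\d+", "N", prefix) if prefix else None


def matches(a: Node, b: Node) -> bool:
    """True if the nodes agree in text style or in prefix shape (at least one)."""
    shape_a, shape_b = prefix_shape(a.prefix), prefix_shape(b.prefix)
    return a.style == b.style or (shape_a is not None and shape_a == shape_b)


def assign_levels(nodes: List[Node]) -> List[int]:
    levels: List[int] = []
    for i, node in enumerate(nodes):
        if i == 0:
            levels.append(1)    # assumption: the first node is at the top level
            continue
        prev = nodes[i - 1]
        if node.is_title:
            # Assumption: only previously traversed title nodes are compared.
            seen = [(n, lvl) for n, lvl in zip(nodes[:i], levels) if n.is_title]
            matched = next((lvl for n, lvl in reversed(seen) if matches(node, n)), None)
            if matched is not None:
                level = matched                                      # same level as the match
            else:
                level = max((lvl for _, lvl in seen), default=0) + 1  # one layer below the deepest
        elif prev.is_title or not matches(node, prev):
            level = levels[-1] + 1                                   # child of the previous node
        else:
            level = levels[-1]                                       # sibling of the previous node
        levels.append(level)
    return levels


nodes = [
    Node("1 Overview", "Heading 1", True, "1"),
    Node("This chapter ...", "Normal", False),
    Node("1.1 Scope", "Heading 2", True, "1.1"),
    Node("(1) first item", "List Paragraph", False, "(1)"),
    Node("(2) second item", "List Paragraph", False, "(2)"),
    Node("2 Requirements", "Heading 1", True, "2"),
]
print(assign_levels(nodes))
```

A document tree can then be built from the resulting levels by attaching each node to the most recent node whose level is one smaller.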
In some embodiments, the document to be processed is a word document, and the first parsing the document to be processed to obtain the node sequence in the document to be processed includes: modifying the suffix of the document to be processed into a compressed file format to obtain a compressed package; decompressing the compressed package to obtain a target folder, wherein the target folder comprises a document file in an XML format; and analyzing the document file in the XML format to obtain the node sequence in the document to be processed.
In some embodiments, the extraction rules include regular matching rules.
According to a second aspect of the present disclosure, there is provided a document information extraction apparatus including: the first analysis module is configured to perform first analysis on the document to be processed so as to obtain the node sequence in the document to be processed; the second analysis module is configured to perform second analysis on the document to be processed to obtain information contained in nodes in the document to be processed; a construction module configured to construct a document tree according to the order of nodes in the document to be processed and information contained in the nodes in the document to be processed; the classification module is configured to determine the software quality index classification of the nodes in the document tree based on the trained machine learning model; and the extraction module is configured to extract the software quality index information from the nodes in the document tree according to the extraction rules corresponding to the software quality index classification.
According to a third aspect of the present disclosure, there is provided a document information extraction apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the document information extraction method as described above based on instructions stored in the memory.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is presented, on which computer program instructions are stored, which instructions, when executed by a processor, implement a document information extraction method as described before.
Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure will be more clearly understood from the following detailed description with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a document information extraction method according to some embodiments of the present disclosure.
FIG. 2 is a schematic flow diagram of building a document tree according to some embodiments of the present disclosure.
FIG. 3 is a schematic diagram of a document tree constructed in accordance with some embodiments of the present disclosure.
Fig. 4 is a flow diagram of determining the hierarchical structure to which a node belongs according to some embodiments of the present disclosure.
Fig. 5 is a schematic block diagram of a document information extraction apparatus according to some embodiments of the present disclosure.
Fig. 6 is a schematic structural view of a document information extraction apparatus according to other embodiments of the present disclosure.
Fig. 7 is a schematic diagram of a computer system according to some embodiments of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
The inventors of the present disclosure found that: the document information extraction method in the related technology has the problems of low information extraction accuracy, poor extraction efficiency and the like. How to efficiently and accurately extract the required software quality assessment information from the document remains a challenging problem.
In view of this, the present disclosure proposes a document information extraction method, apparatus, electronic device, and storage medium to improve efficiency and accuracy of software quality index extraction.
Fig. 1 is a flow diagram of a document information extraction method according to some embodiments of the present disclosure. As shown in fig. 1, the document information extraction method includes steps S101 to S105.
In step S101, a first parsing is performed on a document to be processed to obtain a node order in the document to be processed.
In some embodiments, the document information extraction method is performed by a document information extraction apparatus.
In some embodiments, the document to be processed is a Word document. For example, the document to be processed is a Word document whose file extension is docx.
Nodes in the document to be processed may also be referred to as elements. The type of a node includes at least one of paragraphs, tables, and directories (tables of contents). Paragraphs may in turn be divided into pictures and text.
In some embodiments, the document to be processed is a word document, and the first parsing step performed on the document to be processed includes: modifying the suffix of the document to be processed into a compressed file format to obtain a compressed package; decompressing the compressed package to obtain a target folder, wherein the target folder comprises a document file in an XML format; and analyzing the document file in the XML format to obtain the node sequence in the document to be processed.
In some examples, the file extension of the Word document to be processed is changed from "docx" to "zip" to obtain a compressed package, which is then decompressed to obtain the target folder. The target folder contains a plurality of document files in Extensible Markup Language (XML) format, such as document.xml. Next, the document.xml file is parsed with the etree.parse method of the lxml library to obtain an element tree object (ElementTree object); the order of the nodes is then obtained by traversing all nodes in the element tree object, and this order is stored. The lxml library provides a parser for XML files; its main function is parsing XML files and extracting data from them.
In some examples, the order of the nodes is stored in a list. The order of the nodes comprises an overall number and a local number, where the overall number numbers nodes of all types in document order, and the local number numbers nodes of the same type in document order. For example, consider overall numbers 001, 002, 003, 004, 005, 006: 001 to 003 are the overall numbers of paragraph-type nodes, whose local numbers are (p, 0), (p, 1), (p, 2); 004 is the overall number of a table-type node, whose local number is (t, 0); 005 to 006 are the overall numbers of paragraph-type nodes, whose local numbers are (p, 3), (p, 4).
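The first parsing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it uses Python's standard-library `zipfile` and `xml.etree.ElementTree` in place of renaming the file and calling lxml (a .docx file is already a ZIP archive, so it can be opened directly), and the function name `node_order` and the type code "d" for directory nodes are hypothetical. The type codes "p" and "t" follow the example above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Body-level tags mapped to short type codes; "d" for directories is an assumption.
TYPE_CODES = {f"{{{W_NS}}}p": "p", f"{{{W_NS}}}tbl": "t", f"{{{W_NS}}}sdt": "d"}


def node_order(docx_file):
    """Return [(overall_number, (type_code, local_number)), ...] in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{{{W_NS}}}body")
    counters = {}
    order = []
    for child in body:
        code = TYPE_CODES.get(child.tag)
        if code is None:        # e.g. w:sectPr, which is not a content node
            continue
        local = counters.get(code, 0)
        counters[code] = local + 1
        order.append((f"{len(order) + 1:03d}", (code, local)))
    return order


# Build a tiny in-memory stand-in for a .docx file: three paragraphs, a table, two paragraphs.
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p/><w:p/><w:p/><w:tbl/><w:p/><w:p/><w:sectPr/>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

order = node_order(buffer)
print(order)
```

`node_order` also accepts a file path, since `zipfile.ZipFile` takes either a path or a file-like object.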
In step S102, the document to be processed is subjected to a second parsing to obtain information contained in the nodes in the document to be processed.
In some embodiments, the document to be processed is a word document, and the second parsing of the document to be processed includes: creating a document object instance based on the document to be processed; and extracting information contained in the nodes from the document object examples according to the types of the nodes in the documents to be processed.
In some embodiments, the type of node includes at least one of a paragraph, a table, and a directory, and the type of paragraph includes at least one of a picture and text. In some examples, the picture includes a Visio diagram.
In some examples, a document object instance (i.e., a Document object) of the document to be processed is created using the python-docx library. The python-docx library is a third-party library for processing Word documents. It operates on Word documents directly, supports reading, querying, and modifying files in docx and similar formats, and makes it convenient to read node information such as paragraphs and tables efficiently.
In some embodiments, according to the node sequence determined in step S101, the information of the nodes is extracted from the document to be processed according to the extraction manner corresponding to the node type.
In some examples, when the node type is a paragraph, the information contained in the paragraph is extracted as follows: judging whether the paragraph is a picture or text; if the paragraph is a picture (such as a Visio diagram), querying the information contained in the picture based on the picture storage path; if the paragraph is text, querying the information contained in the text based on the text storage path.
In some examples, if the paragraph is a picture, such as a Visio diagram, a Visio object instance is created first, and the information of the Visio diagram is then obtained from its storage path based on the Visio object instance and stored.
For example, XPath, the XML query language, is used to check whether the paragraph object contains a <w:object> tag; if it does, the getElementsByTagName method is used to check whether the <w:object> tag contains an <o:OLEObject> tag; if it does, the relationship ID (rId) attribute is obtained with the getAttribute method, and the picture storage path is found from that attribute; the picture information is then extracted from the picture storage path.
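The lookup described above can be sketched with Python's standard-library `xml.etree.ElementTree`, used here in place of the DOM-style getElementsByTagName and getAttribute calls named in the text. The helper names, the sample paragraph XML, and the sample relationship target are all hypothetical. The namespace URIs are the standard ones for WordprocessingML, legacy Office objects, and package relationships.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "o": "urn:schemas-microsoft-com:office:office",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def embedded_object_rid(paragraph_xml):
    """Return the relationship ID of an embedded OLE object in a paragraph, or None."""
    paragraph = ET.fromstring(paragraph_xml)
    ole = paragraph.find(".//w:object/o:OLEObject", NS)
    if ole is None:
        return None
    return ole.get(f"{{{NS['r']}}}id")


def resolve_target(rels_xml, rid):
    """Look up the storage path for a relationship ID in a .rels part."""
    rels = ET.fromstring(rels_xml)
    for rel in rels.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Id") == rid:
            return rel.get("Target")
    return None


# Hypothetical sample data standing in for word/document.xml and
# word/_rels/document.xml.rels.
paragraph_xml = (
    f'<w:p xmlns:w="{NS["w"]}" xmlns:o="{NS["o"]}" xmlns:r="{NS["r"]}">'
    '<w:r><w:object><o:OLEObject Type="Embed" ProgID="Visio.Drawing.15" r:id="rId8"/>'
    "</w:object></w:r></w:p>"
)
rels_xml = (
    f'<Relationships xmlns="{RELS_NS}">'
    '<Relationship Id="rId8" Type="oleObject" Target="embeddings/drawing1.vsdx"/>'
    "</Relationships>"
)

rid = embedded_object_rid(paragraph_xml)
print(rid, resolve_target(rels_xml, rid))
```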
In some examples, if the paragraph is text, a storage path for the text is found, and the paragraph style, font size, paragraph type (e.g., whether it is a title), text information in the paragraph, prefix information for the paragraph, etc. are extracted from the storage path for the text.
In some examples, when extracting the prefix of the paragraph, if there is no auto-numbered prefix, determining whether there is a prefix of a conventional type; intercepting the prefix if the prefix exists, and marking the prefix as null if the prefix does not exist; if an auto-numbered prefix exists, the prefix is extracted.
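The prefix-extraction decision above can be sketched as follows. The patterns for "conventional" prefixes are assumptions chosen for illustration (the text does not list them), and the function name `extract_prefix` and its `auto_number` parameter are hypothetical. The auto-numbered prefix is assumed to have been resolved elsewhere and passed in.

```python
import re

# Assumed shapes of conventional (manually typed) prefixes.
CONVENTIONAL_PREFIX = re.compile(
    r"^\s*("
    r"\d+(?:\.\d+)*[.)]?"   # 1   1.2   1.2.3.   1)
    r"|\(\d+\)"             # (1)
    r"|[a-zA-Z][.)]"        # a.  b)
    r")\s+"
)


def extract_prefix(text, auto_number=None):
    """Return the paragraph prefix, or None (null) when there is none."""
    if auto_number:                      # an auto-numbered prefix exists: use it
        return auto_number
    match = CONVENTIONAL_PREFIX.match(text)
    return match.group(1) if match else None   # intercept it, or mark as null


print(extract_prefix("1.2 Performance requirements"))
print(extract_prefix("(3) Response time"))
print(extract_prefix("Response time shall be short"))
print(extract_prefix("Performance requirements", auto_number="2.1"))
```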
In some examples, when the node type is a table, information in the table is queried according to a storage path of the table. For example, the information of each cell in the table is extracted row by row and column by column from under the <w:tbl> tag in the document object instance, and the extracted information is stored.
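Row-by-row, column-by-column extraction of table cells can be sketched as follows. This illustration reads the raw XML with Python's standard-library `xml.etree.ElementTree` rather than going through a document object instance. The function name `table_cells` and the sample table content are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def table_cells(tbl_xml):
    """Return the text of a <w:tbl> element as a list of rows of cell strings."""
    tbl = ET.fromstring(tbl_xml)
    rows = []
    for tr in tbl.findall(f"{{{W}}}tr"):            # rows, in order
        row = []
        for tc in tr.findall(f"{{{W}}}tc"):         # cells (columns), in order
            # A cell may hold several runs; join the text of all <w:t> elements.
            row.append("".join(t.text or "" for t in tc.iter(f"{{{W}}}t")))
        rows.append(row)
    return rows


def _cell(text):
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Hypothetical sample table.
tbl_xml = (
    f'<w:tbl xmlns:w="{W}">'
    f"<w:tr>{_cell('Indicator')}{_cell('Value')}</w:tr>"
    f"<w:tr>{_cell('Average response time')}{_cell('200 ms')}</w:tr>"
    "</w:tbl>"
)
print(table_cells(tbl_xml))
```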
In some examples, when the node type is a directory, the information in the directory is queried according to the storage path of the directory. For example, the text information of each line in the directory is extracted from under the <w:sdt> tag in the document object instance, and the extracted information is stored.
In some embodiments, the first parsing step and the second parsing step are performed based on different parsing tools, e.g., the first parsing step is performed based on the lxml library and the second parsing step is performed based on the python-docx library. Combining the two parsing tools improves the efficiency of extracting the document information.
In step S103, a document tree is constructed according to the order of nodes in the document to be processed and the information contained in the nodes in the document to be processed.
In some embodiments, the document tree is constructed according to the flow shown in FIG. 2.
In step S104, a software quality index category to which the node in the document tree belongs is determined based on the trained machine learning model.
In some embodiments, in step S104, an input vector is constructed from information contained by nodes in the document to be processed; and processing the input vector by using the trained machine learning model to obtain the software quality index category of the node in the document tree.
In some embodiments, in step S104, an input vector is constructed from the order of nodes in the document to be processed and information contained by the nodes in the document to be processed; and processing the input vector by using the trained machine learning model to obtain the software quality index category of the node in the document tree.
In some examples, the software quality index category to which a node belongs includes one or more primary indexes such as functional adaptability, performance efficiency, compatibility, ease of use, reliability, information security, maintainability, and portability. With the trained machine learning model, the software quality index category to which each node belongs can be determined, which facilitates the subsequent rapid and accurate extraction of finer-grained index information.
In some examples, the trained machine learning model is a Transformer-based bidirectional encoder representation model (Bidirectional Encoder Representations from Transformers, BERT for short). The BERT model uses a bidirectional Transformer with a self-attention mechanism to learn the semantic relations between the preceding and following context of a sentence, which makes it easier to obtain the semantic representation of the text.
In some examples, the information contained by the nodes is preprocessed before using the trained BERT model. The preprocessing comprises word segmentation (tokenization) of the text contained in the nodes. For example, the text content in the nodes is tokenized using BertTokenizer from the transformers package.
In some examples, the loss function of the BERT model is optimally solved based on a stochastic gradient descent method before using the trained BERT model.
In the embodiment of the disclosure, when the category of the node is determined based on the machine learning model, the node sequence and the information contained by the node are comprehensively considered, so that the accuracy of node classification is improved.
In step S105, the software quality index information is extracted from the nodes in the document tree according to the extraction rule corresponding to the software quality index category.
In some embodiments, the extraction rules include regular matching rules. The regular matching rule is a matching rule corresponding to a specific index contained in the software quality index category.
For example, when the software quality index class to which the node belongs is functional adaptability, the specific index included in the class is one or more of functional coverage rate, functional correctness, functional suitability of a use target, and functional suitability of a system, and the regular matching rule corresponding to the specific index includes: one or more of a regular matching rule corresponding to the functional coverage rate, a regular matching rule corresponding to the functional correctness, a regular matching rule corresponding to the functional suitability of the use target and a regular matching rule corresponding to the functional suitability of the system.
For example, when the software quality index class to which the node belongs is performance efficiency, the specific index included in the class is one or more of average response time, sufficiency of response time, average turn-around time, turn-around time sufficiency, average throughput, average processor occupancy, average memory occupancy, average input/output device occupancy, bandwidth occupancy, transaction capacity, user access volume, and sufficiency of user access growth, and the regular matching rule corresponding to the specific index includes: one or more of a regular matching rule corresponding to average response time, a regular matching rule corresponding to sufficiency of response time, a regular matching rule corresponding to average turn-around time, a regular matching rule corresponding to turn-around time sufficiency, a regular matching rule corresponding to average throughput, a regular matching rule corresponding to average occupancy of a processor, a regular matching rule corresponding to average occupancy of a memory, a regular matching rule corresponding to average occupancy of an input/output device, a regular matching rule corresponding to bandwidth occupancy, a regular matching rule corresponding to transaction capacity, a regular matching rule corresponding to user access volume, and a regular matching rule corresponding to increased sufficiency of user access.
In some examples, the regular matching rules are represented by regular expressions. For example, a regular matching rule for the average response time designed for a test document may be expressed as ".?([1-9]\d\d+|0\d+|0|[1-9]\d)(ms|s.).".
In some examples, different regular matching rules are designed for different types of documents. For example, the document types include one or more of a requirements document, a design document, a test document, a maintenance document, and a user manual. Designing different regular matching rules for different types of documents meets the information extraction requirements of the various documents, broadens the applicability of the document information extraction method, and improves the accuracy of information extraction.
For example, a regular matching rule designed for the "sufficiency of response time" index in a requirements document is ".?([1-9]\d\d+|0\d+|0|[1-9]\d)(ms|s.).".
For example, a regular matching rule designed for the "average turn-around time" index in a test document is ".?([1-9]\d\d+|0\d+|0|[1-9]\d)(ms|s.).".
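Category-specific rule matching can be sketched as follows. This is a minimal illustration under stated assumptions: the `RULES` table, the function `extract_indicators`, the keyword anchors, and the sample sentences are hypothetical, and the number pattern is an assumed form covering integers and decimals rather than the rule quoted in the text. Only the rules of the category predicted for a node are applied to that node's text.

```python
import re

# Assumed pattern for a non-negative integer or decimal.
NUMBER = r"(?:[1-9]\d*\.\d+|0\.\d+|0|[1-9]\d*)"

# Hypothetical rule table: category -> specific index -> regular expression.
RULES = {
    "performance efficiency": {
        "average response time": re.compile(
            rf"average response time.*?({NUMBER})\s*(ms|s)\b", re.IGNORECASE
        ),
        "average turn-around time": re.compile(
            rf"average turn-around time.*?({NUMBER})\s*(ms|s)\b", re.IGNORECASE
        ),
    },
}


def extract_indicators(category, text):
    """Apply the rules of one category and return {index_name: (value, unit)}."""
    results = {}
    for name, pattern in RULES.get(category, {}).items():
        match = pattern.search(text)
        if match:
            results[name] = (float(match.group(1)), match.group(2))
    return results


text = "The average response time shall not exceed 200 ms under normal load."
print(extract_indicators("performance efficiency", text))
```

Keeping one rule table per document type (requirements document, test document, and so on) would be a natural extension of the same structure.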
In the embodiment of the disclosure, after the nodes in the document are classified by the machine learning model, the software quality index information is extracted pertinently based on the extraction rules corresponding to the classification, so that the accuracy and the efficiency of information extraction can be improved.
In the embodiment of the disclosure, on one hand, not only the information contained in each node is extracted, but also the sequence of each node is extracted, and a document tree is constructed based on the information of the two aspects, so that the efficiency and the extraction accuracy of the subsequent document information extraction are improved; on the other hand, by combining the machine learning model with the extraction rule to extract the software quality index information, compared with the related art which extracts the information only based on the machine learning model or the extraction rule, the efficiency and the accuracy of the software quality index extraction can be improved, and the automation level and the evaluation accuracy of the software quality evaluation can be improved.
FIG. 2 is a schematic flow diagram of building a document tree according to some embodiments of the present disclosure. As shown in fig. 2, the flow of constructing the document tree includes steps S201 to S203.
In step S201, the nodes in the document to be processed are traversed in the order of the nodes in the document to be processed.
For example, the order of nodes in the document to be processed is 001, 002, 003, 004, 005, 006, where 001 to 003 are the overall numbers of nodes of paragraph type, 004 is the overall number of nodes of table type, 005 to 006 are the overall numbers of nodes of paragraph type. The nodes are traversed in the order described above.
In step S202, the hierarchical structure to which the nodes in the document to be processed belong is determined based on the hierarchical representation information of the nodes in the document to be processed.
In some embodiments, the hierarchical representation information of a node in the document to be processed includes whether the node is title text, and at least one of the prefix information of the node and the text style of the node.
In some examples, title text includes nodes whose style type is Heading, Title, Subtitle, or the like; non-title text includes nodes whose style type is List Paragraph, Table, a template style (such as Normal), or another non-title type.
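The title/non-title distinction above can be sketched as a simple check on the style name. The function name and the prefix tuple are assumptions made for illustration; the style names themselves are the ones listed in the text.

```python
# Style-name prefixes treated as title text (taken from the examples above).
TITLE_STYLE_PREFIXES = ("heading", "title", "subtitle")


def is_title_text(style_name):
    """Return True when a node's style name denotes title text."""
    return style_name.strip().lower().startswith(TITLE_STYLE_PREFIXES)
```

For example, `is_title_text("Heading 2")` is true, while `is_title_text("List Paragraph")` and `is_title_text("Normal")` are false.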
In some embodiments, the hierarchy to which the node belongs is determined as follows: in the case that the ith node in the document to be processed is title text, determining whether there is a traversed node that matches at least one of the prefix information and the text style of the ith node, where i is an integer greater than 1 and less than N; if such a node exists, the ith node is at the same level as the matched traversed node; if not, the ith node is one level below the lowest level among the traversed nodes.
In some embodiments, the hierarchy to which the node belongs is determined as follows: in the case that the ith node in the document to be processed is non-title text and the previous node of the ith node is title text, the ith node is one level below the previous node.
In some embodiments, the hierarchy to which the node belongs is determined as follows: in the case that the ith node in the document to be processed is non-title text and the previous node of the ith node is also non-title text, determining whether at least one of the prefix information and the text style of the ith node matches that of the previous node; if they match, the ith node is at the same level as the previous node; if they do not match, the ith node is one level below the previous node.
In some embodiments, the hierarchy to which the node belongs is determined according to the embodiment shown in FIG. 4.
In step S203, a document tree is constructed according to the order of the nodes in the document to be processed and the information contained in the nodes in the document to be processed.
According to the embodiment of the disclosure, the hierarchical structure in the document to be processed is determined according to the various hierarchical representation information, and the document tree is constructed according to the hierarchical structure and the node information, so that the subsequent rapid and accurate information extraction is facilitated.
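Once each node has a level, step S203 amounts to attaching every node to the nearest preceding node of a higher level. The following is a minimal sketch of that idea using a stack; the function name `build_tree` and the `(order, level)` input format are assumptions, and the text does not prescribe a particular data structure.

```python
def build_tree(levels):
    """Build a parent -> children mapping from (order, level) pairs.

    `levels` lists the nodes in document order; level 1 is the root.
    """
    children = {order: [] for order, _ in levels}
    stack = []  # path from the root down to the most recent node
    for order, level in levels:
        # Drop nodes that are at the same level or deeper than this one.
        while stack and stack[-1][1] >= level:
            stack.pop()
        if stack:
            children[stack[-1][0]].append(order)
        stack.append((order, level))
    return children
```

Because the nodes are visited in document order, the children of every node stay in their original order.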
FIG. 3 is a schematic diagram of a document tree constructed in accordance with some embodiments of the present disclosure. As shown in FIG. 3, the document tree includes a plurality of nodes. In fig. 3, circles represent nodes, numbers in the circles represent the order of the nodes, and relationships between the nodes represent the hierarchical structure of the nodes.
In some examples, when the software quality index "number of functions considered" is to be extracted from the document to be processed, the node numbered 59 (which corresponds to chapter 3 of the document to be processed) is found first. The child nodes of node 59 are then queried, and the node numbered 65 (which corresponds to section 3.2 of the document to be processed) is found among them. Finally, the grandchild nodes of node 65 (which correspond to the four-level headings "3.2.x.x" in the document to be processed) are queried, and the number of these grandchild nodes is the value of "number of functions considered".
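The lookup described above reduces to counting grandchildren in the document tree. A minimal sketch, assuming the tree is stored as a parent-to-children mapping; apart from nodes 59 and 65, the node numbers in the usage example are made up for illustration.

```python
def count_grandchildren(children, node):
    """Count the nodes two levels below `node` in a parent -> children map."""
    return sum(len(children.get(child, [])) for child in children.get(node, []))


# Hypothetical fragment of a document tree: node 59 is chapter 3,
# node 65 is section 3.2, and nodes 67, 68, 71 stand for "3.2.x.x" headings.
example_tree = {59: [60, 65], 65: [66, 70], 66: [67, 68], 70: [71]}
function_count = count_grandchildren(example_tree, 65)
```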
Fig. 4 is a flow diagram of determining the hierarchical structure of a node according to some embodiments of the present disclosure. As shown in fig. 4, the flow of determining the hierarchical structure of the nodes includes steps S401 to S411.
In step S401, the i-th node in the document to be processed is acquired.
In the disclosed embodiment, i is an integer greater than 1 and less than N. In addition, the node where i is equal to 1 is the root node, i.e., the node of the highest hierarchy.
In some embodiments, before step S401 is performed, the method further includes normalizing the format of the nodes in the document to be processed. For example, the normalization includes normalizing the text styles, text prefixes, and the like in the document.
In step S402, it is determined whether the i-th node is a title text.
In the case where the i-th node is the title text, step S403 is executed; in the case where the i-th node is not the title text, step S407 is performed.
In step S403, it is determined whether there is a node matching the text prefix of the i-th node among the traversed nodes.
If the determination in step S403 is yes, step S405 is executed; if the determination result in step S403 is no, step S404 is executed.
For example, the text prefix of the i-th node is "3.2", which matches a node with the text prefix "3.1" among the traversed nodes. In this case, step S405 is performed.
For example, the text prefix of the i-th node is "3.1", and no node with a matching text prefix exists among the traversed nodes. In this case, step S404 is performed.
In step S404, it is determined whether there is a node matching the text style of the i-th node among the traversed nodes.
If the determination in step S404 is yes, step S405 is executed; otherwise, step S406 is performed.
For example, the text style of the i-th node is No. 3 font size, bold, with an indentation of 2 characters. If there is a node matching this text style among the traversed nodes, step S405 is executed; otherwise, step S406 is performed.
In step S405, it is determined that the i-th node coincides with the hierarchy of the matched traversed nodes.
In step S406, it is determined that the i-th node is the next layer of the lowest hierarchy of traversed nodes.
For example, when the lowest level of the traversed nodes is the third level, the i-th node is taken as the fourth level.
In step S407, it is determined whether or not the node immediately preceding the i-th node is the title text.
If the determination result in step S407 is yes, step S408 is executed; otherwise, step S409 is performed.
In step S408, it is determined that the i-th node is the next level of the previous node.
In step S409, it is determined whether the text prefix of the i-th node and the previous node match.
In the case that the determination result of step S409 is yes, step S410 is executed; otherwise, step S411 is performed.
In step S410, it is determined that the i-th node is at the same level as the previous node.
In step S411, it is determined whether the text style of the i-th node and the previous node match.
If the determination result in step S411 is yes, step S410 is performed, i.e. it is determined that the i-th node and the previous node are at the same level; otherwise, step S408 is performed, i.e. the i-th node is determined to be the next level of the previous node.
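Steps S401 to S411 can be sketched as one pass over the nodes. This is an illustrative reading of the flow, not a definitive implementation. The `Node` fields, the function names, and in particular the matching criteria are assumptions: here two prefixes are taken to match when they have the same numbering depth (so "3.2" matches "3.1"), styles match when their keys are equal, and the most recently traversed match is used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    order: int              # position in the document (1-based)
    is_title: bool          # True for title text
    prefix: Optional[str]   # numbering prefix such as "3.1", or None
    style: str              # hypothetical style key, e.g. "h2" or "body"
    level: int = 0          # filled in by assign_levels


def prefix_matches(a: Optional[str], b: Optional[str]) -> bool:
    # Assumption: prefixes match when both exist and have the same depth.
    if not a or not b:
        return False
    return len(a.split(".")) == len(b.split("."))


def assign_levels(nodes: List[Node]) -> List[int]:
    nodes[0].level = 1  # the first node is the root (highest level)
    for i in range(1, len(nodes)):
        node, prev, seen = nodes[i], nodes[i - 1], nodes[:i]
        if node.is_title:
            # S403, then S404: look back for a prefix match, then a style match.
            match = next((n for n in reversed(seen)
                          if prefix_matches(n.prefix, node.prefix)), None)
            if match is None:
                match = next((n for n in reversed(seen)
                              if n.style == node.style), None)
            if match is not None:
                node.level = match.level                     # S405
            else:
                node.level = max(n.level for n in seen) + 1  # S406
        elif prev.is_title:
            node.level = prev.level + 1                      # S408
        elif prefix_matches(node.prefix, prev.prefix) or node.style == prev.style:
            node.level = prev.level                          # S409/S411 -> S410
        else:
            node.level = prev.level + 1                      # S408
    return [n.level for n in nodes]
```

With a title node, headings "3", "3.1" and "3.2", and two body paragraphs under "3.1", this assigns the two paragraphs to the same level below "3.1" and places "3.2" at the level of "3.1".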
In the embodiment of the disclosure, the document tree corresponding to the document to be processed can be quickly constructed through the above flow, so that the subsequent information extraction based on the document tree is facilitated.
Fig. 5 is a schematic block diagram of a document information extraction apparatus according to some embodiments of the present disclosure. As shown in fig. 5, the document information extraction apparatus 500 includes a first parsing module 501, a second parsing module 502, a construction module 503, a classification module 504, and an extraction module 505.
The first parsing module 501 is configured to perform a first parsing on the document to be processed, so as to obtain a node sequence in the document to be processed.
In some embodiments, the documents to be processed include one or more of a requirements document, a design document, a test document, a maintenance document, and a user manual.
In some embodiments, the first parsing module 501 obtains the order of nodes in the document to be processed according to the following: modifying the suffix of the document to be processed into a compressed file format to obtain a compressed package; decompressing the compressed package to obtain a target folder, wherein the target folder comprises a document file in an XML format; and analyzing the document file in the XML format to obtain the node sequence in the document to be processed.
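The first parsing can be sketched with the Python standard library, assuming the document to be processed is a .docx file, which is a ZIP archive whose main part is `word/document.xml`. The function names and the in-memory sample below are hypothetical; `zipfile` opens the archive directly, so the suffix does not need to be changed in this sketch.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def node_order(docx_bytes):
    """Return [(number, kind)] for top-level paragraphs and tables, in order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{{{W}}}body")
    kinds = {f"{{{W}}}p": "paragraph", f"{{{W}}}tbl": "table"}
    order = []
    for child in body:
        kind = kinds.get(child.tag)
        if kind:  # skip elements such as section properties
            order.append((len(order) + 1, kind))
    return order


def make_sample():
    """Build a tiny in-memory archive that mimics a .docx for demonstration."""
    xml = (f'<w:document xmlns:w="{W}"><w:body>'
           '<w:p/><w:p/><w:tbl/><w:p/><w:sectPr/>'
           '</w:body></w:document>')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()
```

Calling `node_order(make_sample())` yields the paragraphs and the table numbered in the order in which they appear in the body.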
The second parsing module 502 is configured to perform second parsing on the document to be processed, so as to obtain information contained in the nodes in the document to be processed.
In some embodiments, the second parsing module 502 obtains the information contained in the nodes in the document to be processed as follows: creating a document object instance based on the document to be processed; and extracting the information contained in the nodes from the document object instance according to the types of the nodes in the document to be processed.
A construction module 503 is configured to construct a document tree according to the order of the nodes in the document to be processed and the information contained in the nodes in the document to be processed.
In some embodiments, the construction module 503 constructs the document tree according to the following: traversing the nodes in the document to be processed according to the sequence of the nodes in the document to be processed; determining the hierarchical structure of the node in the document to be processed according to the hierarchical representation information of the node in the document to be processed; and constructing a document tree according to the hierarchical structure of the nodes in the document to be processed and the information contained in the nodes in the document to be processed.
A classification module 504 is configured to determine, based on the trained machine learning model, a software quality indicator classification to which the nodes in the document tree belong.
The extraction module 505 is configured to extract software quality index information from nodes in the document tree according to an extraction rule corresponding to the software quality index classification.
In the embodiment of the disclosure, the efficiency and the accuracy of extracting the software quality index can be improved by the device, so that the automation level and the evaluation accuracy of software quality evaluation can be improved.
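How the classification module and the extraction module cooperate can be sketched as a classify-then-dispatch loop. Everything named here is an assumption for illustration: `classify` is a keyword lookup that merely stands in for the trained machine learning model, and the two categories and their patterns are hypothetical, not rules from the disclosure.

```python
import re

# Hypothetical extraction rules, one per software quality index category.
RULES = {
    "function_count": re.compile(r"(\d+)\s+functions?\b"),
    "concurrent_users": re.compile(r"(\d+)\s+concurrent users\b"),
}


def classify(text):
    """Stand-in for the trained model: a keyword lookup, not a real classifier."""
    lowered = text.lower()
    if "concurrent" in lowered:
        return "concurrent_users"
    if "function" in lowered:
        return "function_count"
    return None


def extract_all(node_texts):
    """Classify each node, then apply only the rule for its category."""
    results = []
    for text in node_texts:
        category = classify(text)
        rule = RULES.get(category)
        match = rule.search(text) if rule else None
        if match:
            results.append((category, int(match.group(1))))
    return results
```

Nodes that the classifier does not assign to any category are skipped, so each rule is only ever applied to text of its own category.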
Fig. 6 is a schematic structural view of a document information extraction apparatus according to other embodiments of the present disclosure.
As shown in fig. 6, the document information extraction apparatus 600 includes a memory 601; and a processor 602 coupled to the memory 601. The memory 601 is used to store instructions for performing corresponding embodiments of the document information extraction method. The processor 602 is configured to perform the document information extraction method in any of the embodiments of the present disclosure based on instructions stored in the memory 601.
Fig. 7 is a schematic diagram of a computer system according to some embodiments of the present disclosure.
As shown in FIG. 7, computer system 700 may be in the form of a general purpose computing device. The computer system 700 includes a memory 701, a processor 702, and a bus 703 that connects the various system components.
The memory 701 may include, for example, system memory, nonvolatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs. The system memory may include volatile storage media, such as Random Access Memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing a corresponding embodiment of at least one of the document information extraction methods. Non-volatile storage media include, but are not limited to, disk storage, optical storage, flash memory, and the like.
The processor 702 may be implemented as discrete hardware components such as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gates or transistors, or the like. Accordingly, each module, such as the first parsing module, the second parsing module, the construction module, the classification module, the extraction module, etc., may be implemented by a Central Processing Unit (CPU) running instructions in a memory to perform the corresponding steps, or may be implemented by dedicated circuitry to perform the corresponding steps.
Bus 703 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, and a Peripheral Component Interconnect (PCI) bus.
The interfaces 704, 705, and 706 of the computer system 700, the memory 701, and the processor 702 may be connected by the bus 703. The input/output interface 704 provides a connection interface for input/output devices such as a display, a mouse, and a keyboard. The network interface 705 provides a connection interface for various networking devices. The storage interface 706 provides a connection interface for external storage devices such as a floppy disk, a USB flash disk, and an SD card.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor, create means for implementing the functions specified in the flowchart and/or block diagram block or blocks.
These computer readable program instructions may also be stored in a computer readable memory that can direct a computer to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instructions which implement the function specified in the flowchart and/or block diagram block or blocks.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
By the document information extraction method, the device and the storage medium in the embodiment, the accuracy and the extraction efficiency of the software quality index information extraction can be improved.
Heretofore, the document information extraction method, apparatus, and storage medium according to the present disclosure have been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
Claims (15)
1. A document information extraction method, comprising:
performing first parsing on a document to be processed to obtain the order of nodes in the document to be processed;
performing second parsing on the document to be processed to obtain information contained in the nodes in the document to be processed;
constructing a document tree according to the sequence of the nodes in the document to be processed and the information contained in the nodes in the document to be processed;
determining the category of software quality indexes to which the nodes in the document tree belong based on the trained machine learning model;
and extracting the software quality index information from the nodes in the document tree according to the extraction rules corresponding to the software quality index categories.
2. The document information extraction method of claim 1, wherein the determining, based on the trained machine learning model, a software quality indicator class to which a node in the document tree belongs comprises:
constructing an input vector according to the sequence of the nodes in the document to be processed and the information contained in the nodes in the document to be processed;
and processing the input vector by using the trained machine learning model to obtain the software quality index category to which the node in the document tree belongs.
3. The document information extraction method of claim 2, wherein the trained machine learning model is a Bidirectional Encoder Representations from Transformers (BERT) model.
4. The document information extraction method according to claim 1, wherein the document to be processed is a Word document, and the second parsing of the document to be processed to obtain the information contained in the nodes in the document to be processed includes:
creating a document object instance based on the document to be processed;
and extracting the information contained in the nodes from the document object instance according to the types of the nodes in the document to be processed.
5. The document information extraction method according to claim 4, wherein the type of the node includes at least one of a paragraph, a table, and a directory, the type of the paragraph including at least one of a picture and a text.
6. The document information extraction method according to claim 5, wherein the information of the nodes includes hierarchical representation information of the nodes, and the constructing a document tree from the order of the nodes in the document to be processed and the information contained in the nodes in the document to be processed includes:
traversing the nodes in the document to be processed according to the sequence of the nodes in the document to be processed;
determining the hierarchical structure of the node in the document to be processed according to the hierarchical representation information of the node in the document to be processed;
and constructing a document tree according to the hierarchical structure of the nodes in the document to be processed and the information contained in the nodes in the document to be processed.
7. The document information extraction method according to claim 6, wherein the hierarchical representation information of the node includes whether the node is title text, and at least one of prefix information of the node and a text style of the node.
8. The document information extraction method according to claim 7, wherein the determining, based on the hierarchical representation information of the nodes in the document to be processed, a hierarchical structure to which the nodes in the document to be processed belong includes:
judging whether traversed nodes matched with at least one of prefix information and text style of an ith node exist or not under the condition that the ith node in the document to be processed is a title text, wherein i is an integer greater than 1 and less than N;
if so, the ith node and the matched traversed node are positioned at the same level;
if not, the ith node is the next layer of the lowest level in the traversed nodes.
9. The document information extraction method according to claim 8, wherein the determining, according to the hierarchical representation information of the nodes in the document to be processed, a hierarchical structure to which the nodes in the document to be processed belong further includes:
in the case that the ith node in the document to be processed is non-title text and the previous node of the ith node is title text, the ith node is one level below the previous node.
10. The document information extraction method according to claim 8, wherein the determining, according to the hierarchical representation information of the nodes in the document to be processed, a hierarchical structure to which the nodes in the document to be processed belong further includes:
in the case that the ith node in the document to be processed is non-title text and the previous node of the ith node is non-title text, determining whether at least one of the prefix information and the text style of the ith node matches that of the previous node;
if they match, the ith node is at the same level as the previous node;
and if they do not match, the ith node is one level below the previous node.
11. The document information extraction method according to claim 1, wherein the document to be processed is a Word document, and the first parsing of the document to be processed to obtain the order of nodes in the document to be processed includes:
modifying the suffix of the document to be processed into a compressed file format to obtain a compressed package;
decompressing the compressed package to obtain a target folder, wherein the target folder comprises a document file in an XML format;
and analyzing the document file in the XML format to obtain the node sequence in the document to be processed.
12. The document information extraction method according to claim 1, wherein the extraction rule includes a regular matching rule.
13. A document information extraction apparatus comprising:
a first parsing module configured to perform first parsing on the document to be processed to obtain the order of nodes in the document to be processed;
a second parsing module configured to perform second parsing on the document to be processed to obtain information contained in the nodes in the document to be processed;
a construction module configured to construct a document tree according to the order of nodes in the document to be processed and information contained in the nodes in the document to be processed;
the classification module is configured to determine the software quality index classification of the nodes in the document tree based on the trained machine learning model;
and the extraction module is configured to extract the software quality index information from the nodes in the document tree according to the extraction rules corresponding to the software quality index classification.
14. A document information extraction apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the document information extraction method of any one of claims 1 to 12 based on instructions stored in the memory.
15. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the document information extraction method of any of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311760500.0A CN117669501A (en) | 2023-12-20 | 2023-12-20 | Document information extraction method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117669501A (en) | 2024-03-08 |
Family
ID=90066537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311760500.0A Pending CN117669501A (en) | 2023-12-20 | 2023-12-20 | Document information extraction method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117669501A (en) |
-
2023
- 2023-12-20 CN CN202311760500.0A patent/CN117669501A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||