CN112686012A - Document feature extraction method, device, equipment and medium - Google Patents

Document feature extraction method, device, equipment and medium

Info

Publication number
CN112686012A
Authority
CN
China
Prior art keywords
document
extracting
extracted
sentence
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011253863.1A
Other languages
Chinese (zh)
Other versions
CN112686012B (en)
Inventor
黄敬林
庄莉
梁懿
林振天
池少宁
翁明东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Fujian Yirong Information Technology Co Ltd, and Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority to CN202011253863.1A
Publication of CN112686012A
Application granted
Publication of CN112686012B
Legal status: Active


Abstract

The invention discloses an official document feature extraction method comprising an official document extraction template definition process and an official document feature extraction process. An official document extraction template is customized in extensible markup language; the template comprises an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag. According to the attachment extraction identification tag in the template, either the official document text alone or the text together with its attachments is acquired as the official document to be extracted and converted into content in extensible markup language format; the official document to be extracted is split into sentences according to the sentence splitting rule tag; and the feature fields are extracted sentence by sentence according to the feature field extraction tags and output. With the method, device, equipment and medium provided by the invention, defining an official document feature extraction template allows feature extraction from unstructured official documents to be realized in a building-block fashion, greatly reducing the difficulty of official document feature extraction.

Description

Document feature extraction method, device, equipment and medium
Technical Field
The invention relates to the technical field of official document management, in particular to an official document feature extraction method, device, equipment and medium.
Background
Official documents are written materials with a prescribed format that are formed and used by statutory authorities and organizations in the course of official business and through certain processing procedures. In both professional work and administrative affairs, official documents are used to convey policies and handle business, so that relationships are coordinated and matters are decided and carried out correctly and efficiently. Official document feature extraction is a main means of analyzing official document content in depth, and with the continuing development of artificial intelligence, natural language processing, text mining and related technologies, the available extraction methods have become increasingly varied.
At present, official document feature extraction systems are mainly limited to sorting out existing metadata to form a complete metadata standard, and there is no system for extracting features from the unstructured files of official documents. Existing methods form official document feature information mainly by analyzing existing metadata and extracting the content of unstructured paragraphs, relying on techniques such as keyword extraction and word segmentation.
Existing official document feature extraction systems suffer from complex code and low reusability. Separate code has to be written to extract features from each kind of official document, which consumes a large amount of manpower. The extracted feature information is difficult to present visually, and the code is difficult to debug and modify, which creates great difficulties and hard-to-resolve business pain points for intelligent applications of official documents.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an official document feature extraction method, device, equipment and medium which, by defining an official document feature extraction template, realize feature extraction from unstructured official documents in a building-block fashion and greatly reduce the difficulty of official document feature extraction.
In a first aspect, the present invention provides an official document feature extraction method, including: an official document extraction template definition process and an official document feature extraction process;
the official document extraction template definition process comprises the following step:
customizing an official document extraction template through an extensible markup language, wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag; the attachment extraction identification tag is used for defining whether attachments of the official document are included when document features are extracted; the sentence splitting rule tag is used for defining the sentence splitting rule of the official document; the feature field extraction tag is used for defining which official document feature fields are extracted and how they are extracted;
the official document feature extraction process comprises the following steps:
acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document text alone or the official document text together with its attachments as the official document to be extracted, and converting the official document to be extracted into content in extensible markup language format;
splitting the official document to be extracted into sentences according to the sentence splitting rule tag;
and extracting the feature fields sentence by sentence according to the feature field extraction tags and outputting them.
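The sentence splitting step can be illustrated with a minimal Python sketch. It assumes that the sentence splitting rule is a regular expression matching sentence delimiters; the function name `split_sentences` and the sample rule are hypothetical and do not come from the patent.

```python
import re


def split_sentences(text: str, rule: str) -> list[str]:
    """Split text at every match of `rule`, keeping each delimiter with its sentence."""
    sentences = []
    start = 0
    for match in re.finditer(rule, text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that is not terminated by a delimiter.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Hypothetical splitting rule: a full stop or a semicolon ends a sentence.
SPLIT_RULE = r"[.;]"
text = "Notice on safety inspection. All units shall report by May 1; late reports are rejected"
sentences = split_sentences(text, SPLIT_RULE)
```

Keeping the rule in the template rather than in the code means a different document type can use different delimiters without any change to the splitting function.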
Further, the feature field extraction tag includes: a sentence-level feature field extraction tag, a grouped feature field extraction tag and a paragraph-split extraction tag;
the sentence-level feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;
the grouped feature field extraction tag is used for extracting official document feature field information in groups, and may nest sentence-level feature field extraction rules;
the paragraph-split extraction tag is used for extracting official document feature field information after splitting by paragraph, and may nest sentence-level feature field extraction rules.
Further, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that tag; and then sorting and outputting the extraction results.
In a second aspect, the present invention provides an official document feature extraction apparatus, including: an official document extraction template definition module and an official document feature extraction module;
the official document extraction template definition module is used for customizing an official document extraction template through an extensible markup language, wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag; the attachment extraction identification tag is used for defining whether attachments of the official document are included when document features are extracted; the sentence splitting rule tag is used for defining the sentence splitting rule of the official document; the feature field extraction tag is used for defining which official document feature fields are extracted and how they are extracted;
the official document feature extraction module is used for acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document text alone or the official document text together with its attachments as the official document to be extracted, and converting the official document to be extracted into content in extensible markup language format;
splitting the official document to be extracted into sentences according to the sentence splitting rule tag;
and extracting the feature fields sentence by sentence according to the feature field extraction tags and outputting them.
Further, the feature field extraction tag includes: a sentence-level feature field extraction tag, a grouped feature field extraction tag and a paragraph-split extraction tag;
the sentence-level feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;
the grouped feature field extraction tag is used for extracting official document feature field information in groups, and may nest sentence-level feature field extraction rules;
the paragraph-split extraction tag is used for extracting official document feature field information after splitting by paragraph, and may nest sentence-level feature field extraction rules.
Further, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that tag; and then sorting and outputting the extraction results.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
1. by introducing an official document feature extraction template customized in extensible markup language (XML), the expression of official document feature extraction is separated from the code; templating the extraction method makes official document feature extraction intuitive and easy to operate;
2. by providing an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag, feature extraction from unstructured official documents is realized in a building-block fashion, which greatly reduces the difficulty of official document feature extraction;
3. machine feature extraction replaces manual feature extraction, which improves extraction quality, greatly lowers the threshold for feature extraction in the system, effectively improves the efficiency of official document feature extraction, and lays a solid foundation for intelligent applications of official documents.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The invention will be further described with reference to the following examples with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating an official document extraction template definition process according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a document feature extraction process according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus according to a second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a medium according to a fourth embodiment of the present invention.
Detailed Description
Example one
This embodiment provides an official document feature extraction method, as shown in fig. 1, including: an official document extraction template definition process and an official document feature extraction process;
the official document extraction template definition process comprises the following step:
customizing an official document extraction template through an extensible markup language, wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag; the attachment extraction identification tag is used for defining whether attachments of the official document are included when document features are extracted; the sentence splitting rule tag is used for defining the sentence splitting rule of the official document; the feature field extraction tag is used for defining which official document feature fields are extracted and how they are extracted;
the official document feature extraction process comprises the following steps:
acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document text alone or the official document text together with its attachments as the official document to be extracted, and converting the official document to be extracted into content in extensible markup language format;
splitting the official document to be extracted into sentences according to the sentence splitting rule tag;
and extracting the feature fields sentence by sentence according to the feature field extraction tags and outputting them.
By introducing an official document feature extraction template customized in extensible markup language (XML), the expression of official document feature extraction is separated from the code; templating the extraction method makes official document feature extraction intuitive and easy to operate; by providing an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag, feature extraction from unstructured official documents is realized in a building-block fashion, which greatly reduces the difficulty of official document feature extraction.
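As an illustration of the first extraction step, the following Python sketch assembles the official document to be extracted as XML and includes the attachments only when requested. The element names `document`, `body` and `attachment`, the `index` attribute and the boolean `include_attachments` are assumptions made for this sketch; the patent does not specify the layout of the converted XML or how the attachment tag's attributes are interpreted.

```python
import xml.etree.ElementTree as ET


def build_document_xml(body: str, attachments: list[str],
                       include_attachments: bool) -> ET.Element:
    """Wrap the body text, and optionally the attachments, in one XML tree."""
    root = ET.Element("document")  # hypothetical element name
    ET.SubElement(root, "body").text = body
    if include_attachments:
        for index, content in enumerate(attachments, start=1):
            node = ET.SubElement(root, "attachment", {"index": str(index)})
            node.text = content
    return root


def document_text(root: ET.Element) -> str:
    """Concatenate the text of all parts, ready for sentence splitting."""
    return "\n".join(node.text or "" for node in root)


body_only = build_document_xml("Body text.", ["Attachment text."], include_attachments=False)
with_attachments = build_document_xml("Body text.", ["Attachment text."], include_attachments=True)
serialized = ET.tostring(with_attachments, encoding="unicode")
```

Converting both cases to the same XML shape lets the later sentence splitting and field extraction steps run unchanged whether or not attachments are included.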
In one possible implementation, as shown in fig. 2, the feature field extraction tag includes: a sentence-level feature field extraction tag, a grouped feature field extraction tag and a paragraph-split extraction tag;
the sentence-level feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;
the grouped feature field extraction tag is used for extracting official document feature field information in groups, and may nest sentence-level feature field extraction rules;
the paragraph-split extraction tag is used for extracting official document feature field information after splitting by paragraph, and may nest sentence-level feature field extraction rules.
The attachment extraction identification tag may be defined in the format <attachment num="0" interval="10"/>;
the sentence splitting rule tag may be defined in the format <sensens regular="regular expression"/>;
the sentence-level feature field extraction tag may be defined in the format <regular name="x1, x2, ..." regular top-subject="x"/>;
the grouped feature field extraction tag may be defined by a regular rg-group type, a regular name, a regular no-checker-regular, a regular next-regular-id, a regular top-paragraph, an x-unit/type format;
the paragraph-split extraction tag may be defined in the format <regular flag-regular end-regular="regular" name="sOrgs" regular="all" next-regular-id="c-fs-unit" merge="true" include-start="true" include-end="true"/>.
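To make the template idea concrete, the following Python sketch parses a small template with the standard `xml.etree.ElementTree` module. It is an illustration under stated assumptions, not the patented implementation: the element names `template`, `sentences` and `group`, the field names, and the regular expressions are all hypothetical, because the tag formats in the published text are only partly legible. Only the `attachment` attributes `num` and `interval` and the attribute names `name` and `regular` are taken from the formats listed above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical template; element names other than "attachment" and "regular" are assumed.
TEMPLATE = r"""
<template>
  <attachment num="0" interval="10"/>
  <sentences regular="[.;]"/>
  <regular name="deadline" regular="by (\w+ \d+)"/>
  <group name="issuer">
    <regular name="org" regular="issued by (.+?)[.;]"/>
  </group>
</template>
"""


@dataclass
class Rule:
    kind: str                      # XML tag name of the rule
    name: str                      # name of the extracted field or group
    pattern: Optional[str]         # regular expression, None for pure containers
    children: List["Rule"] = field(default_factory=list)  # nested sentence-level rules


def parse_rule(node: ET.Element) -> Rule:
    """Turn one rule element, with any nested rules, into a Rule object."""
    return Rule(node.tag, node.get("name", ""), node.get("regular"),
                [parse_rule(child) for child in node])


def load_template(xml_text: str) -> dict:
    """Read the attachment settings, the splitting rule and the extraction rules."""
    root = ET.fromstring(xml_text)
    rules = [parse_rule(node) for node in root
             if node.tag not in ("attachment", "sentences")]
    return {
        "attachment": dict(root.find("attachment").attrib),
        "split_rule": root.find("sentences").get("regular"),
        "rules": rules,
    }


template = load_template(TEMPLATE)
```

Because the rules live in the XML file, adding a new feature field in this sketch only requires a new element in the template, which is the separation of extraction logic from code that the embodiment describes.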
In a possible implementation, as shown in fig. 3, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that tag; and then sorting and outputting the extraction results.
When the feature field extraction tag is a grouped feature field extraction tag or a paragraph-split extraction tag in which sentence-level feature field extraction rules are nested, the nested sentence-level rules are invoked in sequence to extract the features, and the results are sorted and output.
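The matching procedure described above can be sketched in Python as follows. The sketch assumes that every sentence-level rule is a regular expression whose first capture group, if present, holds the field value, and that results are sorted by field name and then by sentence position; the `Rule` class, the field names and the sort order are assumptions, since the patent does not state how results are sorted.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Rule:
    name: str
    pattern: Optional[str] = None                          # regex of a sentence-level rule
    children: List["Rule"] = field(default_factory=list)   # nested sentence-level rules


def extract(sentences: List[str], rules: List[Rule]) -> List[Tuple[str, str, int]]:
    """Match every sentence against every rule and return sorted (field, value, position)."""
    results = []
    for position, sentence in enumerate(sentences):
        for rule in rules:
            # A rule with nested rules delegates to them in order; otherwise it matches itself.
            leaf_rules = rule.children if rule.children else [rule]
            for leaf in leaf_rules:
                match = re.search(leaf.pattern, sentence)
                if match:
                    value = match.group(1) if match.groups() else match.group(0)
                    name = f"{rule.name}.{leaf.name}" if rule.children else leaf.name
                    results.append((name, value, position))
    return sorted(results, key=lambda item: (item[0], item[2]))


sentences = ["Notice issued by the Safety Office.", "All units shall report by May 1;"]
rules = [
    Rule("deadline", r"by (\w+ \d+)"),
    Rule("issuer", children=[Rule("org", r"issued by (.+?)[.;]")]),
]
fields = extract(sentences, rules)
```

With the sample input, the nested `org` rule fires on the first sentence and the sentence-level `deadline` rule fires on the second, so both kinds of rule pass through the same matching loop.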
Based on the same inventive concept, the application also provides a device corresponding to the method in the first embodiment, which is detailed in the second embodiment.
Example two
This embodiment provides an official document feature extraction apparatus, as shown in fig. 4, including: an official document extraction template definition module and an official document feature extraction module;
the official document extraction template definition module is used for customizing an official document extraction template through an extensible markup language, wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag; the attachment extraction identification tag is used for defining whether attachments of the official document are included when document features are extracted; the sentence splitting rule tag is used for defining the sentence splitting rule of the official document; the feature field extraction tag is used for defining which official document feature fields are extracted and how they are extracted;
the official document feature extraction module is used for acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document text alone or the official document text together with its attachments as the official document to be extracted, and converting the official document to be extracted into content in extensible markup language format;
splitting the official document to be extracted into sentences according to the sentence splitting rule tag;
and extracting the feature fields sentence by sentence according to the feature field extraction tags and outputting them.
In one possible implementation, the feature field extraction tag includes: a sentence-level feature field extraction tag, a grouped feature field extraction tag and a paragraph-split extraction tag;
the sentence-level feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;
the grouped feature field extraction tag is used for extracting official document feature field information in groups, and may nest sentence-level feature field extraction rules;
the paragraph-split extraction tag is used for extracting official document feature field information after splitting by paragraph, and may nest sentence-level feature field extraction rules.
In a possible implementation, extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically includes: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that tag; and then sorting and outputting the extraction results.
Since the apparatus described in the second embodiment of the present invention is the apparatus used to implement the method of the first embodiment, a person skilled in the art can, based on the method described in the first embodiment, understand the specific structure of the apparatus and its variations, so the details are not repeated here. Every apparatus used to implement the method of the first embodiment of the present invention falls within the protection scope of the present invention.
Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, which is detailed in the third embodiment.
Example three
This embodiment provides an electronic device, as shown in fig. 5, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, any implementation of the first embodiment can be realized.
Since the electronic device described in this embodiment is a device used for implementing the method in the first embodiment of the present application, based on the method described in the first embodiment of the present application, a specific implementation of the electronic device in this embodiment and various variations thereof can be understood by those skilled in the art, and therefore, how to implement the method in the first embodiment of the present application by the electronic device is not described in detail herein. The equipment used by those skilled in the art to implement the methods in the embodiments of the present application is within the scope of the present application.
Based on the same inventive concept, the application provides a storage medium corresponding to the first embodiment, which is described in detail in the fourth embodiment.
Example four
This embodiment provides a computer-readable storage medium, as shown in fig. 6, on which a computer program is stored; when the computer program is executed by a processor, any implementation of the first embodiment can be realized.
According to the technical solutions provided by the embodiments of the application, by introducing an official document feature extraction template customized in extensible markup language (XML), the expression of official document feature extraction is separated from the code; templating the extraction method makes official document feature extraction intuitive and easy to operate; by providing an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag, feature extraction from unstructured official documents is realized in a building-block fashion, which greatly reduces the difficulty of official document feature extraction; machine feature extraction replaces manual feature extraction, which improves extraction quality, greatly lowers the threshold for feature extraction in the system, effectively improves the efficiency of official document feature extraction, and lays a solid foundation for intelligent applications of official documents.
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (8)

1. An official document feature extraction method, characterized by comprising: an official document extraction template definition process and an official document feature extraction process;
the official document extraction template definition process comprises the following step:
customizing an official document extraction template through an extensible markup language, wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag; the attachment extraction identification tag is used for defining whether attachments of the official document are included when document features are extracted; the sentence splitting rule tag is used for defining the sentence splitting rule of the official document; the feature field extraction tag is used for defining which official document feature fields are extracted and how they are extracted;
the official document feature extraction process comprises the following steps:
acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document text alone or the official document text together with its attachments as the official document to be extracted, and converting the official document to be extracted into content in extensible markup language format;
splitting the official document to be extracted into sentences according to the sentence splitting rule tag;
and extracting the feature fields sentence by sentence according to the feature field extraction tags and outputting them.
2. The method of claim 1, wherein the feature field extraction tag includes: a sentence-level feature field extraction tag, a grouped feature field extraction tag and a paragraph-split extraction tag;
the sentence-level feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;
the grouped feature field extraction tag is used for extracting official document feature field information in groups, and may nest sentence-level feature field extraction rules;
the paragraph-split extraction tag is used for extracting official document feature field information after splitting by paragraph, and may nest sentence-level feature field extraction rules.
3. The method according to claim 1 or 2, wherein extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically comprises: matching each sentence of the official document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that tag; and then sorting and outputting the extraction results.
4. An official document feature extraction apparatus, characterized by comprising: an official document extraction template definition module and an official document feature extraction module;
the official document extraction template definition module is used for customizing an official document extraction template through an extensible markup language, wherein the official document extraction template comprises: an attachment extraction identification tag, a sentence splitting rule tag and at least one feature field extraction tag; the attachment extraction identification tag is used for defining whether attachments of the official document are included when document features are extracted; the sentence splitting rule tag is used for defining the sentence splitting rule of the official document; the feature field extraction tag is used for defining which official document feature fields are extracted and how they are extracted;
the official document feature extraction module is used for acquiring, according to the attachment extraction identification tag in the official document extraction template, either the official document text alone or the official document text together with its attachments as the official document to be extracted, and converting the official document to be extracted into content in extensible markup language format;
splitting the official document to be extracted into sentences according to the sentence splitting rule tag;
and extracting the feature fields sentence by sentence according to the feature field extraction tags and outputting them.
5. The apparatus of claim 4, wherein: the feature field extraction tag includes: a sentence-based feature field extraction tag, a group-based feature field extraction tag, and a paragraph-splitting extraction tag;
the sentence-based feature field extraction tag is used for extracting official document feature field information sentence by sentence, and is defined by a regular expression;
the group-based feature field extraction tag is used for extracting official document feature field information in groups, and can nest rules for extracting feature fields by sentence;
the paragraph-splitting extraction tag is used for extracting official document feature field information by splitting the document into paragraphs, and can nest rules for extracting feature fields by sentence.
6. The apparatus of claim 4 or 5, wherein: extracting and outputting the feature fields sentence by sentence according to the feature field extraction tags specifically comprises: matching each sentence of the document to be extracted against each feature field extraction tag; when a match succeeds, extracting the feature field according to that tag; and then collating and outputting the extraction results.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 3 when executing the program.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 3.
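The template-driven flow recited in the claims (read the attachment flag, split into sentences, match each sentence against each regular-expression field rule, collect the results) can be illustrated with a minimal Python sketch. This is not the patented implementation: the element and attribute names (`template`, `includeAttachment`, `sentenceSplit`, `field`, `name`, `pattern`), the function names, and the sample text are all hypothetical, and the step of converting the document to XML-format content is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction template. Tag and attribute names are illustrative
# assumptions; the claims do not specify a concrete schema.
TEMPLATE = r"""
<template>
  <includeAttachment>false</includeAttachment>
  <sentenceSplit pattern="[;\n]"/>
  <field name="doc_number" pattern="Document No:\s*(\S+)"/>
  <field name="issuer" pattern="Issued by:\s*(.+)"/>
</template>
"""


def load_template(xml_text):
    """Parse the template into (attachment flag, split regex, field rules)."""
    root = ET.fromstring(xml_text)
    flag = root.findtext("includeAttachment", "false").strip().lower() == "true"
    split_pattern = re.compile(root.find("sentenceSplit").get("pattern"))
    fields = [(f.get("name"), re.compile(f.get("pattern")))
              for f in root.findall("field")]
    return flag, split_pattern, fields


def extract_features(body, attachment, xml_text):
    """Return (field name, value) pairs extracted sentence by sentence."""
    include_attachment, split_pattern, fields = load_template(xml_text)
    # The attachment flag decides whether the attachment text is part of
    # the document to be extracted.
    text = body + ("\n" + attachment if include_attachment else "")
    # Split according to the sentence splitting rule and drop empty pieces.
    sentences = [s.strip() for s in split_pattern.split(text) if s.strip()]
    results = []
    for sentence in sentences:
        for name, pattern in fields:
            match = pattern.search(sentence)
            if match:  # a successful match yields one extracted field
                results.append((name, match.group(1).strip()))
    return results


body = ("Document No: SG-2020-15; Issued by: Information Department\n"
        "Subject: Annual inspection plan")
attachment = "Document No: ATT-1"
print(extract_features(body, attachment, TEMPLATE))
```

Keeping the rules in the template rather than in code means a new document type only needs a new template, which appears to be the motivation for defining the template in XML. Group-based and paragraph-splitting tags, which may nest sentence-level rules, are not covered by this sketch.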
CN202011253863.1A 2020-11-11 2020-11-11 Document feature extraction method, device, equipment and medium Active CN112686012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011253863.1A CN112686012B (en) 2020-11-11 2020-11-11 Document feature extraction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011253863.1A CN112686012B (en) 2020-11-11 2020-11-11 Document feature extraction method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112686012A (en) 2021-04-20
CN112686012B (en) 2023-03-31

Family

ID=75446632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011253863.1A Active CN112686012B (en) 2020-11-11 2020-11-11 Document feature extraction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112686012B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567303A (en) * 2010-12-24 2012-07-11 北京大学 Typesetting method and device for variable official document data
CN109614622A (en) * 2018-12-11 2019-04-12 北京锐安科技有限公司 Valid data extracting method, device, storage medium and terminal
US20190243841A1 (en) * 2018-02-06 2019-08-08 Thomson Reuters (Professional) UK Ltd. Systems and method for generating a structured report from unstructured data
CN110399477A (en) * 2019-06-20 2019-11-01 全球能源互联网研究院有限公司 A kind of literature summary extracting method, equipment and can storage medium
CN110609998A (en) * 2019-08-07 2019-12-24 中通服建设有限公司 Data extraction method of electronic document information, electronic equipment and storage medium
CN110765889A (en) * 2019-09-29 2020-02-07 平安直通咨询有限公司上海分公司 Legal document feature extraction method, related device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567303A (en) * 2010-12-24 2012-07-11 北京大学 Typesetting method and device for variable official document data
US20190243841A1 (en) * 2018-02-06 2019-08-08 Thomson Reuters (Professional) UK Ltd. Systems and method for generating a structured report from unstructured data
CN109614622A (en) * 2018-12-11 2019-04-12 北京锐安科技有限公司 Valid data extracting method, device, storage medium and terminal
CN110399477A (en) * 2019-06-20 2019-11-01 全球能源互联网研究院有限公司 A kind of literature summary extracting method, equipment and can storage medium
CN110609998A (en) * 2019-08-07 2019-12-24 中通服建设有限公司 Data extraction method of electronic document information, electronic equipment and storage medium
CN110765889A (en) * 2019-09-29 2020-02-07 平安直通咨询有限公司上海分公司 Legal document feature extraction method, related device and storage medium

Also Published As

Publication number Publication date
CN112686012B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
US20040194035A1 (en) Systems and methods for automatic form segmentation for raster-based passive electronic documents
CN108664474B (en) Resume analysis method based on deep learning
CN110598203A (en) Military imagination document entity information extraction method and device combined with dictionary
KR101224673B1 (en) System and method for storing a document in a serial binary format
US11599727B2 (en) Intelligent text cleaning method and apparatus, and computer-readable storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN107590291A (en) A kind of searching method of picture, terminal device and storage medium
CN106777336A (en) A kind of exabyte composition extraction system and method based on deep learning
CN110245349A (en) A kind of syntax dependency parsing method, apparatus and a kind of electronic equipment
CN111209831A (en) Document table content identification method and device based on classification algorithm
CN102663108A (en) Medicine corporation finding method based on parallelization label propagation algorithm for complex network model
CN113850056A (en) Document key information extraction method and system based on keyword splitting technology
CN112686012B (en) Document feature extraction method, device, equipment and medium
CN111368532B (en) Topic word embedding disambiguation method and system based on LDA
CN109902299B (en) Text processing method and device
CN114579796B (en) Machine reading understanding method and device
CN113312486B (en) Signal portrait construction method and device, electronic equipment and storage medium
CN115496830A (en) Method and device for generating product demand flow chart
CN116306506A (en) Intelligent mail template method based on content identification
CN113051869B (en) Method and system for realizing identification of text difference content by combining semantic recognition
CN115713085A (en) Document theme content analysis method and device
CN113343140B (en) Method for automatically extracting webpage text content based on neo4j graphic database
CN112613315A (en) Text knowledge automatic extraction method, device, equipment and storage medium
CN112990091A (en) Research and report analysis method, device, equipment and storage medium based on target detection
CN102486767B (en) Method and device for labeling content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant