WO2020233332A1 - Text structured information extraction method, server and storage medium - Google Patents

Text structured information extraction method, server and storage medium

Info

Publication number
WO2020233332A1
Authority
WO
WIPO (PCT)
Prior art keywords
level
preset
text content
label
training
Prior art date
Application number
PCT/CN2020/086292
Other languages
English (en)
French (fr)
Inventor
韦峰
徐国强
邱寒
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020233332A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2291: User-Defined Types; Storage management thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40: Document-oriented image-based pattern recognition
    • G06V 30/41: Analysis of document content
    • G06V 30/416: Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method for extracting text structured information, a server and a storage medium.
  • PDF Portable Document Format
  • OCR Optical Character Recognition
  • This application provides a text structured information extraction method, server and storage medium, the purpose of which is to solve the problem that, when extracting document information, the format and text position are highly arbitrary and structured information cannot be easily obtained.
  • this application provides a method for extracting text structured information, which includes:
  • Each first-level label, second-level label, first-level text content, and second-level text content is stored as a logical page in a preset database, and a file in a preset format is generated from the logical page and fed back to the client.
  • the present application also provides a server.
  • the server includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor.
  • The processor implements a text structured information extraction method when executing the program; the method includes:
  • Each first-level label, second-level label, first-level text content, and second-level text content is stored as a logical page in a preset database, and a file in a preset format is generated from the logical page and fed back to the client.
  • this application also provides a computer-readable storage medium on which a computer program is stored.
  • a method for extracting text structured information is implemented.
  • Each first-level label, second-level label, first-level text content, and second-level text content is stored as a logical page in a preset database, and a file in a preset format is generated from the logical page and fed back to the client.
  • The text structured information extraction method, server and storage medium proposed in this application solve the problem that, when extracting document information, the format and text position are highly arbitrary and structured information cannot be easily obtained.
  • Each first-level label and second-level label in the original document is determined by using segmentation models, and structured information is then extracted based on the label content. Extraction of document structured information is thus realized automatically, avoiding manual processing; it is convenient and efficient.
  • Figure 1 is an application environment diagram of a preferred embodiment of the text structured information extraction method of this application;
  • Figure 2 is a schematic diagram of a preferred embodiment of the server of this application;
  • Figure 3 is a schematic diagram of modules of a preferred embodiment of the text structured information extraction program in Figure 2;
  • Figure 4 is a flowchart of a preferred embodiment of the text structured information extraction method of this application.
  • Referring to FIG. 1, it is an application environment diagram of a preferred embodiment of the text structured information extraction method of the present application.
  • the server 1 is installed with a text structured information extraction program 10.
  • Multiple clients 3 connect to the server 1 through the network 2.
  • The network 2 may be the Internet, a cloud network, a wireless fidelity (Wi-Fi) network, a personal area network (PAN), a local area network (LAN), and/or a metropolitan area network (MAN).
  • Various devices in the network environment can be configured to connect to the communication network according to various wired and wireless communication protocols.
  • The wired and wireless communication protocols can include, but are not limited to, at least one of the following: Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, Light Fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device-to-device communication, cellular communication protocols, and/or the Bluetooth communication protocol, or a combination thereof.
  • the client 3 can be a desktop computer, a notebook, a tablet computer, a mobile phone, or another terminal device that is installed with application software and can communicate with the server 1 through the network 2.
  • the database 4 is used to store data such as tags of each level and text content corresponding to each level of tags.
  • FIG. 2 is a schematic diagram of a preferred embodiment of the server 1 of this application.
  • the server 1 includes but is not limited to: a memory 11, a processor 12, a display 13, and a network interface 14.
  • the server 1 is connected to the network through the network interface 14 to obtain original data.
  • The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, a telephone network, or another wireless or wired network.
  • the memory 11 includes at least one type of readable storage medium, and the computer readable storage medium may be non-volatile or volatile.
  • The readable storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, etc.
  • The memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1.
  • The memory 11 may also be an external storage device of the server 1, for example, a plug-in hard disk equipped on the server 1, a smart media card (SMC), a Secure Digital (SD) card, a flash card, etc.
  • the memory 11 may also include both the internal storage unit of the server 1 and its external storage device.
  • the memory 11 is generally used to store the operating system and various application software installed in the server 1, such as the program code of the text structured information extraction program 10.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • the processor 12 is generally used to control the overall operation of the server 1, such as performing data interaction or communication-related control and processing.
  • the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the program code of the text structured information extraction program 10.
  • the display 13 may be called a display screen or a display unit.
  • the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, and an organic light-emitting diode (OLED) touch device.
  • the display 13 is used for displaying the information processed in the server 1 and for displaying a visualized work interface, for example, displaying the results of data statistics.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the network interface 14 is usually used to establish a communication connection between the server 1 and other electronic devices.
  • Fig. 2 only shows the server 1 with the components 11-14 and the text structured information extraction program 10. However, it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • the server 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used to display the information processed in the server 1 and to display a visualized user interface.
  • the server 1 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which are not described here.
  • the processor 12 can implement the following steps when executing the text structured information extraction program 10 stored in the memory 11:
  • Receiving step: receive a request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted;
  • First obtaining step: input the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, from the original document according to a first preset rule, the first-level text content corresponding to each first-level label;
  • Second obtaining step: input each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtain, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label; and
  • Feedback step: store each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generate a file in a preset format from the logical page to feed back to the client.
  • The text structured information extraction program 10 may be divided into multiple modules, and the multiple modules are stored in the memory 11 and executed by the processor 12 to complete this application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions.
  • Referring to FIG. 3, it is a program module diagram of an embodiment of the text structured information extraction program 10 in FIG. 2.
  • the text structured information extraction program 10 can be divided into: a receiving module 110, a first acquiring module 120, a second acquiring module 130, and a feedback module 140.
  • the receiving module 110 is configured to receive a request for extracting text structured information sent by the client, and obtain the original document of the structured information to be extracted.
  • the request may include the original document to be structured, and may also include the storage path and unique identifier of the original document to be structured.
  • the original document can be entered when the user submits the text structured request, or it can be obtained from the address specified by the request after the user submits the text structured request.
  • the original document can be corporate documents such as official documents and bidding documents, and its format is PDF.
  • The receiving module 110 also performs identity authentication on the client user who initiated the text structured information extraction request. If the identity authentication passes, the subsequent steps are executed; if it fails, the text structured information extraction request is rejected and warning information is generated. For example, the receiving module 110 obtains the user's identity information and judges, according to the identity information, whether the user has the authority to extract text structured information; if so, the subsequent steps continue to be performed; if not, the text structured information extraction request is rejected and warning information is generated.
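The gatekeeping logic described above can be sketched as follows; the permission store, user IDs, and return values are illustrative assumptions, not part of this application:

```python
# Minimal sketch of the authority check performed by the receiving module.
# The permission table and the "extract" permission name are hypothetical.
PERMISSIONS = {"alice": {"extract"}, "bob": set()}

def handle_request(user_id: str) -> str:
    """Proceed only if the user holds the extraction permission;
    otherwise reject the request and generate warning information."""
    if "extract" in PERMISSIONS.get(user_id, set()):
        return "proceed"
    return "rejected: warning generated"
```

An unknown user (absent from the table) is treated the same as a user without the permission.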
  • The first obtaining module 120 is configured to input the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, from the original document according to a first preset rule, the first-level text content corresponding to each first-level label.
  • the first segmentation model is obtained by training a Conditional Random Field (CRF) model.
  • the specific training steps include:
  • The computer can read the information of the original document (for example, the position coordinates and font of the text in the document) to facilitate the subsequent step of obtaining the first-level labels.
  • A unique first preset label is assigned to each part of the converted XML-format document, where the labels include but are not limited to: cover, title, index, body, footnotes, page headers, references, appendices, etc. Taking the cover as an example, the cover of the document is marked "Cover". Then the preset feature vector of each label is extracted according to a predetermined feature vector extraction algorithm. Specifically, the extraction step includes:
  • Input each label into the pre-trained word vector model (word2vec model) to generate a word-level vector r_wrd; input the characters that make up each label into a pre-trained convolutional neural network (CNN) model to generate a character-level vector r_wch corresponding to the label; and concatenate the word-level vector and the character-level vector to obtain a new vector u_n = [r_wrd, r_wch] as the feature vector of each label.
  • r_wrd represents the vector obtained by training with the word2vec model, and its processing is consistent with the existing word2vec model; r_wch represents the vector obtained through convolutional neural network training. The training process is available in the prior art and is not repeated here.
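The construction of u_n = [r_wrd, r_wch] above can be sketched as follows. The lookup tables stand in for trained word2vec and CNN models, and the vector sizes are toy assumptions for illustration only:

```python
# Sketch of the feature-vector concatenation: a word-level vector for the
# whole label is concatenated with a character-level vector derived from
# the label's characters. All values below are made-up stand-ins.
WORD_VECS = {"Cover": [0.2, 0.7], "Body": [0.9, 0.1]}   # stand-in for word2vec
CHAR_VECS = {"C": [0.5], "o": [0.1], "v": [0.3], "e": [0.2], "r": [0.4],
             "B": [0.6], "d": [0.8], "y": [0.9]}

def char_level_vector(label: str) -> list[float]:
    """Stand-in for the CNN: average the per-character vectors."""
    vecs = [CHAR_VECS[c] for c in label]
    return [sum(v[0] for v in vecs) / len(vecs)]

def feature_vector(label: str) -> list[float]:
    """u_n = [r_wrd, r_wch]: concatenate word- and character-level vectors."""
    return WORD_VECS[label] + char_level_vector(label)
```

A real implementation would replace the averaging with CNN convolutions over character embeddings; only the concatenation step mirrors the text.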
  • Each preset feature vector is used as a variable X and each preset label as a dependent variable Y, and a sample set is generated. The sample set is divided into a first training set and a first verification set according to a first preset ratio (for example, 4:1), where the number of samples in the first training set is greater than the number of samples in the first verification set.
  • The conditional random field model is trained with each variable X and each dependent variable Y in the first training set, and every preset period (for example, every 1000 iterations) the first verification set is used to verify the model: each variable X and each dependent variable Y in the first verification set is used to verify the accuracy of the first segmentation model.
  • If the verification accuracy is greater than the first preset threshold (for example, 95%), the training is ended and the first segmentation model is obtained; if the verification accuracy is not greater than the first preset threshold (for example, 95%), the number of document samples is increased and the above training step is re-executed based on the increased document samples.
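The 4:1 division of the sample set into a first training set and a first verification set can be sketched as follows; the sample data, shuffling, and function name are illustrative assumptions rather than requirements of this application:

```python
import random

# Sketch of the first preset ratio (4:1) split described above.
def split_samples(samples, ratio=(4, 1)):
    """Shuffle and split samples so the training set holds
    ratio[0]/(ratio[0]+ratio[1]) of the data."""
    shuffled = samples[:]
    random.shuffle(shuffled)
    cut = len(shuffled) * ratio[0] // (ratio[0] + ratio[1])
    return shuffled[:cut], shuffled[cut:]

# 100 (feature vector X, label Y) stand-ins split 80/20
samples = [(f"x{i}", f"y{i}") for i in range(100)]
train, verify = split_samples(samples)
```

The second segmentation model's 3:1 split can reuse the same helper with `ratio=(3, 1)`.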
  • The original document from which structured information is to be extracted is input into the trained first segmentation model, and after multiple first-level labels of the original document are obtained, the first-level text content corresponding to each first-level label is obtained according to the first preset rule.
  • The first preset rule includes: determining the levels corresponding to the obtained multiple first-level labels according to a preset mapping relationship between first-level labels and levels. The level of each first-level label is predetermined; for example, cover, title, body, references, appendices, etc. are the first level, while index, footnotes, and page headers are the second level, and the first level takes precedence over the second level.
  • For each first-level label of the first level, the text content between the current first-level label and the next first-level label of the first level is extracted as the text content corresponding to the current first-level label; if the current first-level label is the last one, the text content after the current first-level label is extracted as the text content corresponding to the current first-level label. After the first-level label classification is completed, the entire document is divided into multiple parts, and each part belongs to a first-level label category.
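The first preset rule above can be sketched as follows. The token representation (label markers interleaved with text runs) and the level-1 label set are illustrative assumptions:

```python
# Sketch of the first preset rule: content between consecutive level-1
# labels is attributed to the earlier label; content after the last
# level-1 label is attributed to that label.
LEVEL_1 = {"Cover", "Title", "Body", "References"}

def extract_sections(tokens):
    """tokens: list of ("LABEL", name) or ("TEXT", content), in order.
    Returns {label: concatenated text} for each level-1 label seen."""
    sections, current = {}, None
    for kind, value in tokens:
        if kind == "LABEL" and value in LEVEL_1:
            current = value
            sections.setdefault(current, "")
        elif kind == "TEXT" and current is not None:
            sections[current] += value
    return sections

doc = [("LABEL", "Cover"), ("TEXT", "Annual Report"),
       ("LABEL", "Body"), ("TEXT", "Revenue grew. "), ("TEXT", "Costs fell.")]
```

Second-level labels encountered inside a part would simply fall through as text here; the second preset rule handles them separately.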
  • The second acquisition module 130 is configured to input each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtain, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label.
  • a second preset number of first-level text content samples are obtained, and a unique second preset label is assigned to each of the first-level text content samples.
  • The second preset labels include but are not limited to: headline, subtitle, author, etc.
  • The first-level text content samples are divided into a second training set and a second verification set according to a second preset ratio (for example, 3:1), where the number of samples in the second training set is greater than the number of samples in the second verification set.
  • If the verification accuracy is greater than the second preset threshold (for example, 97%), the training is ended to obtain the second segmentation model; if the verification accuracy is less than the second preset threshold (for example, 97%), the number of first-level text content samples is increased, and the training step is re-executed based on the increased samples.
  • Each first-level text content is input into the trained second segmentation model to obtain multiple second-level tags corresponding to the first-level text content, and then each second-level text content corresponding to each second-level tag is obtained according to a second preset rule.
  • the step of determining the content of each secondary text corresponding to each secondary label includes:
  • The levels corresponding to the obtained multiple second-level labels are determined; the level of each second-level label is predetermined. For example, the first-level subtitle is the first level, the second-level subtitle is the second level, and the third-level subtitle is the third level. The subtitles may include first-level, second-level, third-level, and fourth-level subtitles, in which case the text content is divided into the content corresponding to the corresponding subtitles at each of the four levels. In general, the subtitles run from the first level to the M-th level, where M is a positive integer greater than or equal to 1.
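The subtitle-level grouping described above can be sketched as follows; the input representation (subtitle markers with levels, interleaved with text runs) is an illustrative assumption:

```python
# Sketch of the second preset rule: each subtitle (level 1..M) owns the
# text that follows it until the next subtitle appears.
def group_by_subtitles(items):
    """items: list of ("SUB", level, title) or ("TEXT", content).
    Returns list of (title, level, text) sections in document order."""
    sections = []
    for item in items:
        if item[0] == "SUB":
            _, level, title = item
            sections.append([title, level, ""])
        elif sections:
            sections[-1][2] += item[1]
    return [tuple(s) for s in sections]

outline = [("SUB", 1, "1 Overview"), ("TEXT", "intro"),
           ("SUB", 2, "1.1 Scope"), ("TEXT", "details")]
```

Nesting (attaching a level-2 section under its level-1 parent) would be a straightforward extension using the recorded levels.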
  • The feedback module 140 is used to store each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and to generate a file in a preset format from the logical page to feed back to the client.
  • Specifically, the first-level labels and the second-level labels corresponding to each first-level label are stored in a structured manner, and each first-level label, each second-level label, and the text content belonging to them are stored as one logical page.
  • the text content of each label is regarded as the content corresponding to the label.
  • A key is established in advance for the generated file, and the file is encrypted with the key and pushed during the process of sending it to the client.
  • the generated corresponding file can be viewed.
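The logical-page storage above can be sketched as a serializable record; the schema, field names, and JSON format are assumptions for illustration, since the patent does not fix a concrete storage layout:

```python
import json

# Sketch of one logical page: a first-level label, its text, and the
# second-level labels with their text contents, serialized as JSON.
def build_logical_page(l1_label, l1_text, sub_sections):
    """sub_sections: list of (second_level_label, second_level_text)."""
    return {
        "first_level_label": l1_label,
        "first_level_text": l1_text,
        "second_level": [
            {"label": lbl, "text": txt} for lbl, txt in sub_sections
        ],
    }

page = build_logical_page("Body", "full body text",
                          [("headline", "Q1 results"),
                           ("author", "Finance dept.")])
serialized = json.dumps(page, ensure_ascii=False)
```

A preset database would store one such record per logical page, and the preset-format file fed back to the client could be rendered from the same structure.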
  • FIG. 4 is a flowchart of a preferred embodiment of the text structured information extraction method according to the present application.
  • Step S10: Receive the request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted.
  • the request may include the original document to be structured, and may also include the storage path and unique identifier of the original document to be structured.
  • the original document can be entered when the user submits the text structured request, or it can be obtained from the address specified by the request after the user submits the text structured request.
  • the original document can be corporate documents such as official documents and bidding documents, and its format is PDF.
  • The receiving module 110 also performs identity authentication on the client user who initiated the text structured information extraction request. If the identity authentication passes, the subsequent steps are executed; if it fails, the text structured information extraction request is rejected and warning information is generated. For example, the receiving module 110 obtains the user's identity information and judges, according to the identity information, whether the user has the authority to extract text structured information; if so, the subsequent steps continue to be performed; if not, the text structured information extraction request is rejected and warning information is generated.
  • Step S20: Input the original document into the pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, from the original document according to the first preset rule, the first-level text content corresponding to each first-level label.
  • the first segmentation model is obtained by training a Conditional Random Field (CRF) model.
  • the specific training steps include:
  • The computer can read the information of the original document (for example, the position coordinates and font of the text in the document) to facilitate the subsequent step of obtaining the first-level labels.
  • A unique first preset label is assigned to each part of the converted XML-format document, where the labels include but are not limited to: cover, title, index, body, footnotes, page headers, references, appendices, etc. Taking the cover as an example, the cover of the document is marked "Cover". Then the preset feature vector of each label is extracted according to a predetermined feature vector extraction algorithm. Specifically, the extraction step includes:
  • Input each label into the pre-trained word vector model (word2vec model) to generate a word-level vector r_wrd; input the characters that make up each label into a pre-trained convolutional neural network (CNN) model to generate a character-level vector r_wch corresponding to the label; and concatenate the word-level vector and the character-level vector to obtain a new vector u_n = [r_wrd, r_wch] as the feature vector of each label.
  • r_wrd represents the vector obtained by training with the word2vec model, and its processing is consistent with the existing word2vec model; r_wch represents the vector obtained through convolutional neural network training. The training process is available in the prior art and is not repeated here.
  • Each preset feature vector is used as a variable X and each preset label as a dependent variable Y, and a sample set is generated. The sample set is divided into a first training set and a first verification set according to a first preset ratio (for example, 4:1), where the number of samples in the first training set is greater than the number of samples in the first verification set.
  • The conditional random field model is trained with each variable X and each dependent variable Y in the first training set, and every preset period (for example, every 1000 iterations) the first verification set is used to verify the model: each variable X and each dependent variable Y in the first verification set is used to verify the accuracy of the first segmentation model.
  • If the verification accuracy is greater than the first preset threshold (for example, 95%), the training is ended and the first segmentation model is obtained; if the verification accuracy is not greater than the first preset threshold (for example, 95%), the number of document samples is increased and the above training step is re-executed based on the increased document samples.
  • The original document from which structured information is to be extracted is input into the trained first segmentation model, and after multiple first-level labels of the original document are obtained, the first-level text content corresponding to each first-level label is obtained according to the first preset rule.
  • The first preset rule includes: determining the levels corresponding to the obtained multiple first-level labels according to a preset mapping relationship between first-level labels and levels. The level of each first-level label is predetermined; for example, cover, title, body, references, appendices, etc. are the first level, while index, footnotes, and page headers are the second level, and the first level has priority over the second level.
  • For each first-level label of the first level, the text content between the current first-level label and the next first-level label of the first level is extracted as the text content corresponding to the current first-level label; if the current first-level label is the last one, the text content after the current first-level label is extracted as the text content corresponding to the current first-level label. After the first-level label classification is completed, the entire document is divided into multiple parts, and each part belongs to a first-level label category.
  • Step S30: Input each first-level text content into the pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtain, from the original document according to the second preset rule, the second-level text content corresponding to each second-level label.
  • a second preset number of first-level text content samples are obtained, and a unique second preset label is assigned to each of the first-level text content samples.
  • The second preset labels include but are not limited to: headline, subtitle, author, etc.
  • The first-level text content samples are divided into a second training set and a second verification set according to a second preset ratio (for example, 3:1), where the number of samples in the second training set is greater than the number of samples in the second verification set.
  • If the verification accuracy is greater than the second preset threshold (for example, 97%), the training is ended to obtain the second segmentation model; if the verification accuracy is less than the second preset threshold (for example, 97%), the number of first-level text content samples is increased, and the training step is re-executed based on the increased samples.
  • Each first-level text content is input into the trained second segmentation model to obtain multiple second-level tags corresponding to the first-level text content, and then each second-level text content corresponding to each second-level tag is obtained according to a second preset rule.
  • the step of determining the content of each secondary text corresponding to each secondary label includes:
  • The levels corresponding to the obtained multiple second-level labels are determined; the level of each second-level label is predetermined. For example, the first-level subtitle is the first level, the second-level subtitle is the second level, and the third-level subtitle is the third level. The subtitles may include first-level, second-level, third-level, and fourth-level subtitles, in which case the text content is divided into the content corresponding to the corresponding subtitles at each of the four levels. In general, the subtitles run from the first level to the M-th level, where M is a positive integer greater than or equal to 1.
  • Step S40: Store each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generate a file in a preset format from the logical page to feed back to the client.
  • Specifically, the first-level labels and the second-level labels corresponding to each first-level label are stored in a structured manner, and each first-level label, each second-level label, and the text content belonging to them are stored as one logical page.
  • the text content of each label is regarded as the content corresponding to the label.
  • A key is established in advance for the generated file, and the file is encrypted with the key and pushed during the process of sending it to the client.
  • the generated corresponding file can be viewed.
  • the embodiment of the present application also proposes a computer-readable storage medium.
  • The computer-readable storage medium may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a CD-ROM, a USB memory, etc.
  • the computer-readable storage medium includes a text structured information extraction program 10, which implements the following operations when executed by a processor:
  • Receiving step: receive a request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted;
  • First obtaining step: input the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, from the original document according to a first preset rule, the first-level text content corresponding to each first-level label;
  • Second obtaining step: input each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtain, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label; and
  • Feedback step: store each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generate a file in a preset format from the logical page to feed back to the client.
  • the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform. Of course, it can also be implemented by hardware, but in many cases the former is better. ⁇ Based on this understanding, the technical solution of this application essentially or the part that contributes to the prior art can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium as described above, and the computer-readable storage
  • the medium can be non-volatile or volatile (such as ROM/RAM, magnetic disk, optical disk), and includes several instructions to enable a terminal device (can be a mobile phone, computer, server, or network device, etc.) Perform the method described in each embodiment of this application.


Abstract

This application relates to data-processing technology and provides a text structured information extraction method, a server and a storage medium. The method first obtains the original document from which structured information is to be extracted, inputs the original document into a trained first segmentation model to obtain multiple first-level labels of the original document, and obtains the first-level text content corresponding to each first-level label according to a first preset rule. Each first-level text content is then input into a trained second segmentation model to obtain multiple second-level labels, and the second-level text content corresponding to each second-level label is obtained according to a second preset rule; the resulting labels and text contents are stored in a preset database as logical pages, and a corresponding file is generated and fed back to the client. With this application, the first-level and second-level labels in the original document are determined by segmentation models and the structural information is then extracted according to the labeled content, so that extraction of a document's structured information is automated, convenient and efficient.

Description

Text structured information extraction method, server and storage medium
This application claims priority to the Chinese patent application with application No. 201910419888.5, entitled "Text structured information extraction method, server and storage medium", filed with the Chinese Patent Office on May 20, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a text structured information extraction method, a server and a storage medium.
Background
The Portable Document Format (PDF) is used for file exchange in a manner independent of applications, operating systems and hardware. It is a fixed-layout document format that faithfully reproduces every character, color and image of the original. However, PDF stores data in a non-structured format and does not record logical elements such as the document's logical structure or its tables.
The inventors realized that information is currently extracted from PDF documents mainly with Optical Character Recognition (OCR) technology. The information extracted by OCR, however, is rendered as vectors with no logical relationship between individual characters: the text formed by the extracted characters is merely a matrix rendered from the x, y and z coordinates plus a rotation amount. Such text suffers from highly arbitrary formatting and positioning, and structured information cannot be obtained from it conveniently. This is a problem that those skilled in the art urgently need to solve.
Summary
In view of the above, this application provides a text structured information extraction method, a server and a storage medium, aiming to solve the problem that, when document information is extracted, formatting and character positions are highly arbitrary and structured information cannot be obtained conveniently.
To achieve the above objective, this application provides a text structured information extraction method, the method comprising:
receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtaining, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
storing each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generating a corresponding file from the logical page in a preset format to feed back to the client.
In a second aspect, this application further provides a server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements a text structured information extraction method, the method comprising:
receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtaining, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
storing each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generating a corresponding file from the logical page in a preset format to feed back to the client.
In a third aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a text structured information extraction method, the method comprising:
receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtaining, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
storing each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generating a corresponding file from the logical page in a preset format to feed back to the client.
The text structured information extraction method, server and storage medium proposed by this application solve the problem that, when document information is extracted, formatting and character positions are highly arbitrary and structured information cannot be obtained conveniently. The first-level and second-level labels in the original document are determined by segmentation models, and the structural information is then extracted according to the labeled content. Extraction of a document's structured information is thus automated, avoiding manual processing and making the process convenient and efficient.
Brief Description of the Drawings
Fig. 1 is an application-environment diagram of a preferred embodiment of the text structured information extraction method of this application;
Fig. 2 is a schematic diagram of a preferred embodiment of the server of this application;
Fig. 3 is a module diagram of a preferred embodiment of the text structured information extraction program in Fig. 2;
Fig. 4 is a flowchart of a preferred embodiment of the text structured information extraction method of this application.
Detailed Description
To make the objectives, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application, not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative work fall within the scope protected by this application.
Referring to Fig. 1, which is an application-environment diagram of a preferred embodiment of the text structured information extraction method of this application. The server 1 is installed with a text structured information extraction program 10. Multiple clients 3 connect to the server 1 through a network 2. The network 2 may be the Internet, a cloud network, a wireless fidelity (Wi-Fi) network, a personal area network (PAN), a local area network (LAN) and/or a metropolitan area network (MAN). The various devices in the network environment can be configured to connect to the communication network according to various wired and wireless communication protocols. Examples of such protocols include, but are not limited to, at least one of: Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, Light Fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access points (AP), device-to-device communication, cellular communication protocols and/or Bluetooth communication protocols, or combinations thereof. The client 3 may be a desktop computer, a notebook, a tablet, a mobile phone, or another terminal device installed with application software that can communicate with the server 1 through the network 2. The database 4 is used to store data such as the labels of each level and the text content corresponding to each label.
Referring to Fig. 2, which is a schematic diagram of a preferred embodiment of the server 1 of this application.
The server 1 includes, but is not limited to, a memory 11, a processor 12, a display 13 and a network interface 14. The server 1 connects to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or a telephone network.
The memory 11 includes at least one type of readable storage medium; the computer-readable storage medium may be non-volatile or volatile. The readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disc, and so on. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as the hard disk or internal memory of the server 1. In other embodiments, the memory 11 may be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card with which the server 1 is equipped. Of course, the memory 11 may also include both the internal storage unit of the server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the server 1, such as the program code of the text structured information extraction program 10, and may also be used to temporarily store various data that have been output or are to be output.
The processor 12 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data-processing chip. The processor 12 is generally used to control the overall operation of the server 1, for example performing control and processing related to data exchange or communication. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example running the program code of the text structured information extraction program 10.
The display 13 may be called a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid-crystal display, a touch liquid-crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display 13 is used to display the information processed in the server 1 and to display a visual work interface, for example the results of data statistics.
The network interface 14 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface), and is generally used to establish a communication connection between the server 1 and other electronic devices.
Fig. 2 only shows the server 1 with components 11-14 and the text structured information extraction program 10, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the server 1 may further include a user interface, which may include a display and an input unit such as a keyboard; the optional user interface may also include standard wired and wireless interfaces. Optionally, in some embodiments, the display may be an LED display, a liquid-crystal display, a touch liquid-crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display may also be appropriately called a display screen or display unit, and is used to display the information processed in the server 1 and a visual user interface.
The server 1 may further include a radio-frequency (RF) circuit, sensors, an audio circuit and the like, which are not described in detail here.
In the above embodiment, the processor 12, when executing the text structured information extraction program 10 stored in the memory 11, may implement the following steps:
Receiving step: receive a request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted;
First obtaining step: input the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
Second obtaining step: input each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtain, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
Feedback step: store each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generate a corresponding file from the logical page in a preset format to feed back to the client.
For a detailed description of the above steps, please refer to the description of Fig. 3, the program-module diagram of an embodiment of the text structured information extraction program 10, and Fig. 4, the flowchart of an embodiment of the text structured information extraction method, below.
In other embodiments, the text structured information extraction program 10 may be divided into multiple modules, which are stored in the memory 11 and executed by the processor 12 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of performing a specific function.
Referring to Fig. 3, which is a program-module diagram of an embodiment of the text structured information extraction program 10 in Fig. 2. In this embodiment, the text structured information extraction program 10 may be divided into: a receiving module 110, a first obtaining module 120, a second obtaining module 130 and a feedback module 140.
The receiving module 110 is used to receive the request for extracting text structured information sent by the client and to obtain the original document from which structured information is to be extracted.
In this embodiment, the request may include the original document to be structured, or it may include the storage path and unique identifier of the original document. That is, the original document may be entered by the user together with the text structuring request, or it may be fetched from the address specified in the request after the request is submitted. For example, the original document may be an enterprise document such as an official document or an invitation to tender, in PDF format.
In one embodiment, the receiving module 110 also verifies the identity of the client user who initiates the text structured information extraction request. If the identity verification passes, the subsequent steps are executed; if it fails, the request is rejected and warning information is generated. For example, the receiving module 110 obtains the user's identity information and determines from it whether the user has permission to request text structured information extraction; if so, the subsequent steps continue; if not, the request is rejected and warning information is generated.
The first obtaining module 120 is used to input the original document into the pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then to obtain, according to the first preset rule, the first-level text content corresponding to each first-level label from the original document.
In this embodiment, the first segmentation model is obtained by training a Conditional Random Field (CRF) model. The training steps specifically include:
obtaining a first preset number (e.g., 100,000) of PDF document samples and converting the format of each PDF document sample, for example into an Extensible Markup Language (XML) document. Through the format conversion, a computer can read the information of the original document (e.g., the position coordinates and fonts of the characters in the document), which facilitates the subsequent step of obtaining the first-level labels.
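The XML schema produced by such a PDF-to-XML conversion is not fixed by the description. As an illustration only, assuming each character run is stored as a `<text>` element with hypothetical `x`, `y` and `font` attributes, the positional and font information a computer reads at this step could be accessed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML for one converted PDF page; the tag and attribute
# names are illustrative assumptions, not a schema fixed by the text.
sample = """
<page number="1">
  <text x="72.0" y="96.5" font="SimSun-14">Annual Report</text>
  <text x="72.0" y="120.0" font="SimSun-10">Chapter 1 Overview</text>
</page>
"""

def read_runs(xml_string):
    """Return (x, y, font, text) tuples for each text run on the page."""
    root = ET.fromstring(xml_string)
    return [
        (float(el.get("x")), float(el.get("y")), el.get("font"), el.text)
        for el in root.iter("text")
    ]

runs = read_runs(sample)
print(runs[0])  # (72.0, 96.5, 'SimSun-14', 'Annual Report')
```

Tuples like these carry exactly the position-coordinate and font information the labeling step relies on.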
Each converted XML document is assigned a unique first preset label, where the labels include, but are not limited to: cover, title, index, body, footnote, margin note, references, appendix, and so on. Taking the cover as an example, the cover of a document is annotated "cover". The preset feature vector of each label is then extracted according to a predetermined feature-vector extraction algorithm; specifically, the extraction steps include:
inputting each label into a pre-trained word-vector model (a word2vec model) to generate a word-level vector r_wrd; inputting the characters that make up each label into a pre-trained Convolutional Neural Network (CNN) to generate the character-level vector r_wch corresponding to that label; and combining the word-level vector and the character-level vector into a new vector u_n = [r_wrd, r_wch], which serves as the feature vector of each label. Here, r_wrd denotes the vector obtained by training the word2vec model, processed in the same way as in existing word2vec models, and r_wch denotes the vector obtained by training the convolutional neural network; the training process is available in the prior art and is not repeated here.
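The combination step itself is plain concatenation. A minimal sketch with stand-in vectors in place of real word2vec and char-CNN outputs (the vector values and sizes are purely illustrative):

```python
def combine_features(r_wrd, r_wch):
    """Concatenate the word-level and character-level vectors into the
    label feature vector u_n = [r_wrd, r_wch]."""
    return list(r_wrd) + list(r_wch)

# Stand-ins for the outputs of a trained word2vec model and a trained
# character-level CNN; real embeddings would be much longer.
r_wrd = [0.12, -0.35, 0.08]   # word-level embedding of the label
r_wch = [0.91, 0.04]          # character-level embedding of the label

u_n = combine_features(r_wrd, r_wch)
print(u_n)        # [0.12, -0.35, 0.08, 0.91, 0.05 - 0.01]
print(len(u_n))   # 5 = len(r_wrd) + len(r_wch)
```

The concatenated vector keeps both word-level semantics and character-level shape, which is why it is used as the label's feature vector X in the training step below.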
Afterwards, each preset feature vector is taken as a variable X and each preset label as a dependent variable Y to generate a sample set, which is divided into a first training set and a first validation set according to a first preset ratio (e.g., 4:1), the number of samples in the first training set being larger than that in the first validation set.
The conditional random field model is trained with the variables X and the dependent variables Y in the first training set, and validated with the first validation set every preset period (e.g., every 1,000 iterations): the accuracy of the first segmentation model is verified with the variables X and the dependent variables Y in the first validation set. When the verified accuracy is greater than a first preset threshold (e.g., 95%), training ends and the first segmentation model is obtained; if the verified accuracy is below the first preset threshold (e.g., 95%), the number of samples is increased and the above training steps are executed again on the enlarged set of document samples.
After the original document from which structured information is to be extracted is input into the trained first segmentation model and its multiple first-level labels are obtained, the first-level text content corresponding to each first-level label is obtained according to the first preset rule. The first preset rule includes determining the levels of the obtained first-level labels according to a preset mapping between first-level labels and levels. The level of each first-level label is predetermined, for example: cover, title, body, references and appendix belong to the first level; index, footnote and margin note belong to the second level; and the first level takes precedence over the second level. Each first-level label of the first level is extracted; the text content between the current first-level label and the next first-level label of the first level is extracted as the text content corresponding to the current label; if the current label is the last first-level label, the text content after it is extracted as its corresponding content. After the first-level classification ends, the whole document has been divided into multiple parts, each belonging to the category of one first-level label.
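The between-labels slicing rule above can be sketched in a few lines, assuming the model's output is represented as a list of (label, character offset) pairs over the document text (that representation is an illustrative assumption, not fixed by the description):

```python
def slice_by_labels(text, label_positions):
    """label_positions: (label, start_offset) pairs sorted by offset.
    Returns {label: text between this label and the next one}; the last
    label takes everything to the end of the document."""
    sections = {}
    for i, (label, start) in enumerate(label_positions):
        end = label_positions[i + 1][1] if i + 1 < len(label_positions) else len(text)
        sections[label] = text[start:end]
    return sections

doc = "COVER page text TITLE main title BODY the body text"
positions = [("cover", 0), ("title", 16), ("body", 33)]
print(slice_by_labels(doc, positions)["body"])  # BODY the body text
```

The same rule reappears in claim 3 as "the text content between the Nth first-level label and the (N+1)th first-level label".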
The second obtaining module 130 is used to input each first-level text content into the pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then to obtain, according to the second preset rule, the second-level text content corresponding to each second-level label from the original document.
In this embodiment, a second preset number of first-level text content samples are obtained, and each first-level text content sample is assigned a unique second preset label. Taking text whose first-level label is "title" as an example, the second preset labels include, but are not limited to: main title, subtitle, author, and so on.
The first-level text content samples are divided into a second training set and a second validation set according to a second preset ratio (e.g., 3:1), the number of samples in the second training set being larger than that in the second validation set.
The first-level text content samples in the second training set are input into the conditional random field model for training, and the model is validated with the second validation set every preset period: the accuracy of the second segmentation model is verified with the first-level text contents and the second preset labels in the second validation set. When the verified accuracy is greater than a second preset threshold (e.g., 97%), training ends and the second segmentation model is obtained; if the verified accuracy is below the second preset threshold (e.g., 97%), the number of first-level text content samples is increased and the training steps are executed again on the enlarged sample set.
After each first-level text content is input into the trained second segmentation model and the second-level labels corresponding to the first-level text contents are obtained, the second-level text content corresponding to each second-level label is obtained according to the second preset rule. The step of determining the second-level text content corresponding to each second-level label includes:
determining the levels of the obtained second-level labels according to the predetermined mapping between second-level labels and levels. The level of each second-level label is predetermined, for example: a first-level subheading belongs to the first level, a second-level subheading to the second level, a third-level subheading to the third level, and so on. To keep the extracted subheadings complete, the subheadings may include first-level, second-level, third-level and fourth-level subheadings; if fifth-level subheadings also exist, each fifth-level subheading is merged into the content of its corresponding fourth-level subheading.
The text content between the current level-M second-level label and the next level-M second-level label is extracted as the text content corresponding to the current level-M label; if the current level-M label is the last level-M label, the text content after it is extracted as its corresponding content. The level-(M+1) second-level labels are then extracted from the text content of the current level-M label, and the above steps are repeated to extract the text content corresponding to the level-(M+1) labels under each level-M label, until the text content corresponding to all next-level labels under the level-M labels has been determined, where M is a positive integer greater than or equal to 1.
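This level-by-level descent amounts to applying the same between-labels slicing recursively. A minimal sketch, where the label detector and the `HM:` heading markers are illustrative stand-ins for the trained model's output:

```python
def find_labels(level, span):
    """Toy label detector: a level-M label is a line 'HM: <name>'.
    Returns (label, offset) pairs in increasing offset order."""
    tag = f"H{level}: "
    return [(line[len(tag):], span.find(line))
            for line in span.splitlines() if line.startswith(tag)]

def extract_tree(text, labels_of, level=1, max_level=4):
    """labels_of(level, span) yields sorted (label, offset) pairs.
    Returns a nested {label: {"content": ..., "children": {...}}} tree,
    descending until max_level (fifth-level headings would simply stay
    inside their fourth-level parent's content)."""
    if level > max_level:
        return {}
    positions = labels_of(level, text)
    tree = {}
    for i, (label, start) in enumerate(positions):
        end = positions[i + 1][1] if i + 1 < len(positions) else len(text)
        span = text[start:end]
        tree[label] = {
            "content": span,
            "children": extract_tree(span, labels_of, level + 1, max_level),
        }
    return tree

doc = (
    "H1: Intro\n"
    "some text\n"
    "H2: Scope\n"
    "details\n"
    "H2: Terms\n"
    "more\n"
    "H1: Body\n"
    "body text\n"
)
tree = extract_tree(doc, find_labels)
print(sorted(tree["Intro"]["children"]))  # ['Scope', 'Terms']
```

Each level-M span is re-scanned only for level-(M+1) labels, which is exactly the repetition described above.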
The feedback module 140 is used to store each first-level label, second-level label, first-level text content and second-level text content as a logical page in the preset database, and to generate a corresponding file from the logical page in a preset format to feed back to the client.
In this embodiment, the first-level labels and the second-level labels under each first-level label are stored in a structured manner; each first-level label, each second-level label and the text content belonging to them are stored together as one logical page, the text content of each label being the content corresponding to that label. A corresponding file is generated from the resulting logical page in a preset format, for example a Word document. The output is a well-typeset, fully processed file, and the document format can also be converted according to the user's needs.
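One plausible shape for such a logical page, serialized as a record for the preset database (all field names here are illustrative assumptions, not part of the described method):

```python
import json

# A logical page groups one first-level label with its second-level
# labels and their text contents.
logical_page = {
    "document_id": "doc-001",
    "level1_label": "title",
    "level1_content": "Annual Report 2019",
    "level2": [
        {"label": "main title", "content": "Annual Report 2019"},
        {"label": "author", "content": "Finance Department"},
    ],
}

record = json.dumps(logical_page, ensure_ascii=False)   # store this row
restored = json.loads(record)                           # read it back
print(restored["level2"][1]["content"])  # Finance Department
```

Grouping the labels with their contents in one record is what lets the preset-format file (e.g., a Word document) be regenerated from the database alone.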
In one embodiment, a key is established in advance for the generated file, and the file is pushed to the client in encrypted form; the generated file can be viewed only after the entered key is successfully verified.
Referring to Fig. 4, which is a flowchart of a preferred embodiment of the text structured information extraction method of this application.
Step S10: receive the request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted.
In this embodiment, the request may include the original document to be structured, or it may include the storage path and unique identifier of the original document. That is, the original document may be entered by the user together with the text structuring request, or it may be fetched from the address specified in the request after the request is submitted. For example, the original document may be an enterprise document such as an official document or an invitation to tender, in PDF format.
In one embodiment, the receiving module 110 also verifies the identity of the client user who initiates the text structured information extraction request. If the identity verification passes, the subsequent steps are executed; if it fails, the request is rejected and warning information is generated. For example, the receiving module 110 obtains the user's identity information and determines from it whether the user has permission to request text structured information extraction; if so, the subsequent steps continue; if not, the request is rejected and warning information is generated.
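The identity gate in step S10 reduces to a permission check before the pipeline runs. A minimal sketch, where the permission store and the warning sink are illustrative stand-ins for a real identity service and alerting channel:

```python
PERMITTED_USERS = {"alice", "bob"}   # stand-in for a real identity store
warnings = []                        # stand-in for a real alerting channel

def handle_request(user, run_pipeline):
    """Run the extraction pipeline only for users with permission;
    otherwise reject the request and record warning information."""
    if user not in PERMITTED_USERS:
        warnings.append(f"rejected extraction request from {user!r}")
        return None
    return run_pipeline()

result = handle_request("alice", lambda: "extraction started")
print(result)              # extraction started
handle_request("mallory", lambda: "extraction started")
print(warnings[-1])        # rejected extraction request from 'mallory'
```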
Step S20: input the original document into the pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, according to the first preset rule, the first-level text content corresponding to each first-level label from the original document.
In this embodiment, the first segmentation model is obtained by training a Conditional Random Field (CRF) model. The training steps specifically include:
obtaining a first preset number (e.g., 100,000) of PDF document samples and converting the format of each PDF document sample, for example into an Extensible Markup Language (XML) document. Through the format conversion, a computer can read the information of the original document (e.g., the position coordinates and fonts of the characters in the document), which facilitates the subsequent step of obtaining the first-level labels.
Each converted XML document is assigned a unique first preset label, where the labels include, but are not limited to: cover, title, index, body, footnote, margin note, references, appendix, and so on. Taking the cover as an example, the cover of a document is annotated "cover". The preset feature vector of each label is then extracted according to a predetermined feature-vector extraction algorithm; specifically, the extraction steps include:
inputting each label into a pre-trained word-vector model (a word2vec model) to generate a word-level vector r_wrd; inputting the characters that make up each label into a pre-trained Convolutional Neural Network (CNN) to generate the character-level vector r_wch corresponding to that label; and combining the word-level vector and the character-level vector into a new vector u_n = [r_wrd, r_wch], which serves as the feature vector of each label. Here, r_wrd denotes the vector obtained by training the word2vec model, processed in the same way as in existing word2vec models, and r_wch denotes the vector obtained by training the convolutional neural network; the training process is available in the prior art and is not repeated here.
Afterwards, each preset feature vector is taken as a variable X and each preset label as a dependent variable Y to generate a sample set, which is divided into a first training set and a first validation set according to a first preset ratio (e.g., 4:1), the number of samples in the first training set being larger than that in the first validation set.
The conditional random field model is trained with the variables X and the dependent variables Y in the first training set, and validated with the first validation set every preset period (e.g., every 1,000 iterations): the accuracy of the first segmentation model is verified with the variables X and the dependent variables Y in the first validation set. When the verified accuracy is greater than a first preset threshold (e.g., 95%), training ends and the first segmentation model is obtained; if the verified accuracy is below the first preset threshold (e.g., 95%), the number of samples is increased and the above training steps are executed again on the enlarged set of document samples.
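The cadence described above — validate every fixed number of iterations and stop once validation accuracy clears the threshold — can be sketched generically. The CRF itself is replaced by a stub whose accuracy grows with training steps, since the description does not name a particular library:

```python
def train_with_periodic_validation(model, train_step, validate,
                                   period=1000, threshold=0.95,
                                   max_iterations=100_000):
    """Run training iterations, checking validation accuracy every
    `period` iterations; stop as soon as it exceeds `threshold`.
    Returns (model, accuracy, iterations_run). If the loop falls
    through, the caller is expected to add samples and retrain."""
    for iteration in range(1, max_iterations + 1):
        train_step(model)
        if iteration % period == 0:
            accuracy = validate(model)
            if accuracy > threshold:
                return model, accuracy, iteration
    return model, validate(model), max_iterations

# Stub model: validation accuracy improves with the number of steps.
model = {"steps": 0}
train_step = lambda m: m.__setitem__("steps", m["steps"] + 1)
validate = lambda m: 0.901 + m["steps"] / 100_000

_, accuracy, iterations = train_with_periodic_validation(model, train_step, validate)
print(round(accuracy, 3), iterations)  # 0.951 5000
```

With these stubs the first validation check to clear 95% is the fifth one, so training stops at iteration 5,000 rather than running to `max_iterations`.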
After the original document from which structured information is to be extracted is input into the trained first segmentation model and its multiple first-level labels are obtained, the first-level text content corresponding to each first-level label is obtained according to the first preset rule. The first preset rule includes determining the levels of the obtained first-level labels according to a preset mapping between first-level labels and levels. The level of each first-level label is predetermined, for example: cover, title, body, references and appendix belong to the first level; index, footnote and margin note belong to the second level; and the first level takes precedence over the second level. Each first-level label of the first level is extracted; the text content between the current first-level label and the next first-level label of the first level is extracted as the text content corresponding to the current label; if the current label is the last first-level label, the text content after it is extracted as its corresponding content. After the first-level classification ends, the whole document has been divided into multiple parts, each belonging to the category of one first-level label.
Step S30: input each first-level text content into the pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtain, according to the second preset rule, the second-level text content corresponding to each second-level label from the original document.
In this embodiment, a second preset number of first-level text content samples are obtained, and each first-level text content sample is assigned a unique second preset label. Taking text whose first-level label is "title" as an example, the second preset labels include, but are not limited to: main title, subtitle, author, and so on.
The first-level text content samples are divided into a second training set and a second validation set according to a second preset ratio (e.g., 3:1), the number of samples in the second training set being larger than that in the second validation set.
The first-level text content samples in the second training set are input into the conditional random field model for training, and the model is validated with the second validation set every preset period: the accuracy of the second segmentation model is verified with the first-level text contents and the second preset labels in the second validation set. When the verified accuracy is greater than a second preset threshold (e.g., 97%), training ends and the second segmentation model is obtained; if the verified accuracy is below the second preset threshold (e.g., 97%), the number of first-level text content samples is increased and the training steps are executed again on the enlarged sample set.
After each first-level text content is input into the trained second segmentation model and the second-level labels corresponding to the first-level text contents are obtained, the second-level text content corresponding to each second-level label is obtained according to the second preset rule. The step of determining the second-level text content corresponding to each second-level label includes:
determining the levels of the obtained second-level labels according to the predetermined mapping between second-level labels and levels. The level of each second-level label is predetermined, for example: a first-level subheading belongs to the first level, a second-level subheading to the second level, a third-level subheading to the third level, and so on. To keep the extracted subheadings complete, the subheadings may include first-level, second-level, third-level and fourth-level subheadings; if fifth-level subheadings also exist, each fifth-level subheading is merged into the content of its corresponding fourth-level subheading.
The text content between the current level-M second-level label and the next level-M second-level label is extracted as the text content corresponding to the current level-M label; if the current level-M label is the last level-M label, the text content after it is extracted as its corresponding content. The level-(M+1) second-level labels are then extracted from the text content of the current level-M label, and the above steps are repeated to extract the text content corresponding to the level-(M+1) labels under each level-M label, until the text content corresponding to all next-level labels under the level-M labels has been determined, where M is a positive integer greater than or equal to 1.
Step S40: store each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generate a corresponding file from the logical page in a preset format to feed back to the client.
In this embodiment, the first-level labels and the second-level labels under each first-level label are stored in a structured manner; each first-level label, each second-level label and the text content belonging to them are stored together as one logical page, the text content of each label being the content corresponding to that label. A corresponding file is generated from the resulting logical page in a preset format, for example a Word document. The output is a well-typeset, fully processed file, and the document format can also be converted according to the user's needs.
In one embodiment, a key is established in advance for the generated file, and the file is pushed to the client in encrypted form; the generated file can be viewed only after the entered key is successfully verified.
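The description does not specify an encryption scheme for the key-protected push. As one standard-library-only illustration of the "view only after the key verifies" behavior, a keyed tag (HMAC) can be attached on push and checked before the file is shown; a real deployment would use an authenticated-encryption library rather than this sketch:

```python
import hashlib
import hmac

def tag_file(key: bytes, data: bytes) -> bytes:
    """Compute a keyed tag for the generated file before pushing it."""
    return hmac.new(key, data, hashlib.sha256).digest()

def verify_and_view(key: bytes, data: bytes, tag: bytes):
    """Return the file content only if the presented key verifies."""
    if not hmac.compare_digest(tag_file(key, data), tag):
        return None          # wrong key: the file may not be viewed
    return data

key = b"pre-established key"     # established in advance for the file
document = b"structured output file"
tag = tag_file(key, document)

print(verify_and_view(key, document, tag))          # b'structured output file'
print(verify_and_view(b"wrong key", document, tag)) # None
```

`hmac.compare_digest` is used instead of `==` so the key check does not leak timing information.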
In addition, an embodiment of this application further provides a computer-readable storage medium, which may be any one, or any combination, of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact-disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes the text structured information extraction program 10 which, when executed by a processor, implements the following operations:
Receiving step: receive the request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted;
First obtaining step: input the original document into the pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtain, according to the first preset rule, the first-level text content corresponding to each first-level label from the original document;
Second obtaining step: input each first-level text content into the pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtain, according to the second preset rule, the second-level text content corresponding to each second-level label from the original document; and
Feedback step: store each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generate a corresponding file from the logical page in a preset format to feed back to the client.
The specific implementation of the computer-readable storage medium of this application is substantially the same as that of the text structured information extraction method described above, and is not repeated here.
It should be noted that the serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. Moreover, the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes that element.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (the computer-readable storage medium may be non-volatile or volatile, such as a ROM/RAM, magnetic disk or optical disc) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the method described in each embodiment of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (15)

  1. A text structured information extraction method applied to a server, the server being communicatively connected to one or more clients, the method comprising:
    receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
    inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
    inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtaining, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
    storing each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generating a corresponding file from the logical page in a preset format to feed back to the client.
  2. The text structured information extraction method of claim 1, wherein the first segmentation model is obtained by training a conditional random field model, the training process comprising the following steps:
    obtaining a first preset number of original document samples, and preprocessing the original document samples;
    assigning a unique first preset label to each preprocessed document, extracting the preset feature vector of each first preset label according to a predetermined feature-vector extraction algorithm, and generating a sample set with each preset feature vector as a variable X and each first preset label as a dependent variable Y;
    dividing the sample set into a first training set and a first validation set according to a first preset ratio;
    training the conditional random field model with the variables X and the dependent variables Y in the first training set, validating the conditional random field model with the first validation set every preset period, and verifying the first accuracy of the first segmentation model with the variables X and the dependent variables Y in the first validation set; and
    when the first accuracy is greater than a first preset threshold, ending the training to obtain the first segmentation model.
  3. The text structured information extraction method of claim 1 or 2, wherein the first preset rule comprises:
    extracting the text content between the Nth first-level label and the (N+1)th first-level label as the text content corresponding to the Nth first-level label, where N is a positive integer greater than or equal to 1.
  4. The text structured information extraction method of claim 3, wherein the second segmentation model is obtained by training a conditional random field model, the training process comprising the following steps:
    obtaining a second preset number of first-level text content samples, and assigning a unique second preset label to each first-level text content sample;
    dividing the first-level text content samples into a second training set and a second validation set according to a second preset ratio;
    inputting the first-level text content samples in the second training set into the conditional random field model for training, validating the conditional random field model with the second validation set every preset period, and verifying the second accuracy of the second segmentation model with the first-level text contents and the second preset labels in the second validation set; and
    when the second accuracy is greater than a second preset threshold, ending the training to obtain the second segmentation model.
  5. The text structured information extraction method of claim 1, wherein the second preset rule comprises:
    extracting the text content between the Mth second-level label and the (M+1)th second-level label as the text content corresponding to the Mth second-level label, where M is a positive integer greater than or equal to 1.
  6. A server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements a text structured information extraction method, the method comprising:
    receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
    inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
    inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtaining, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
    storing each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generating a corresponding file from the logical page in a preset format to feed back to the client.
  7. The server of claim 6, wherein the first segmentation model is obtained by training a conditional random field model, the training process comprising the following steps:
    obtaining a first preset number of original document samples, and preprocessing the original document samples;
    assigning a unique first preset label to each preprocessed document, extracting the preset feature vector of each first preset label according to a predetermined feature-vector extraction algorithm, and generating a sample set with each preset feature vector as a variable X and each first preset label as a dependent variable Y;
    dividing the sample set into a first training set and a first validation set according to a first preset ratio;
    training the conditional random field model with the variables X and the dependent variables Y in the first training set, validating the conditional random field model with the first validation set every preset period, and verifying the first accuracy of the first segmentation model with the variables X and the dependent variables Y in the first validation set; and
    when the first accuracy is greater than a first preset threshold, ending the training to obtain the first segmentation model.
  8. The server of claim 6 or 7, wherein the first preset rule comprises:
    extracting the text content between the Nth first-level label and the (N+1)th first-level label as the text content corresponding to the Nth first-level label, where N is a positive integer greater than or equal to 1.
  9. The server of claim 8, wherein the second segmentation model is obtained by training a conditional random field model, the training process comprising the following steps:
    obtaining a second preset number of first-level text content samples, and assigning a unique second preset label to each first-level text content sample;
    dividing the first-level text content samples into a second training set and a second validation set according to a second preset ratio;
    inputting the first-level text content samples in the second training set into the conditional random field model for training, validating the conditional random field model with the second validation set every preset period, and verifying the second accuracy of the second segmentation model with the first-level text contents and the second preset labels in the second validation set; and
    when the second accuracy is greater than a second preset threshold, ending the training to obtain the second segmentation model.
  10. The server of claim 6, wherein the second preset rule comprises:
    extracting the text content between the Mth second-level label and the (M+1)th second-level label as the text content corresponding to the Mth second-level label, where M is a positive integer greater than or equal to 1.
  11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements a text structured information extraction method, the method comprising:
    receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
    inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, according to a first preset rule, the first-level text content corresponding to each first-level label from the original document;
    inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level label corresponding to each first-level text content, and then obtaining, according to a second preset rule, the second-level text content corresponding to each second-level label from the original document; and
    storing each first-level label, second-level label, first-level text content and second-level text content as a logical page in a preset database, and generating a corresponding file from the logical page in a preset format to feed back to the client.
  12. The computer-readable storage medium of claim 11, wherein the first segmentation model is obtained by training a conditional random field model, the training process comprising the following steps:
    obtaining a first preset number of original document samples, and preprocessing the original document samples;
    assigning a unique first preset label to each preprocessed document, extracting the preset feature vector of each first preset label according to a predetermined feature-vector extraction algorithm, and generating a sample set with each preset feature vector as a variable X and each first preset label as a dependent variable Y;
    dividing the sample set into a first training set and a first validation set according to a first preset ratio;
    training the conditional random field model with the variables X and the dependent variables Y in the first training set, validating the conditional random field model with the first validation set every preset period, and verifying the first accuracy of the first segmentation model with the variables X and the dependent variables Y in the first validation set; and
    when the first accuracy is greater than a first preset threshold, ending the training to obtain the first segmentation model.
  13. The computer-readable storage medium of claim 11 or 12, wherein the first preset rule comprises:
    extracting the text content between the Nth first-level label and the (N+1)th first-level label as the text content corresponding to the Nth first-level label, where N is a positive integer greater than or equal to 1.
  14. The computer-readable storage medium of claim 13, wherein the second segmentation model is obtained by training a conditional random field model, the training process comprising the following steps:
    obtaining a second preset number of first-level text content samples, and assigning a unique second preset label to each first-level text content sample;
    dividing the first-level text content samples into a second training set and a second validation set according to a second preset ratio;
    inputting the first-level text content samples in the second training set into the conditional random field model for training, validating the conditional random field model with the second validation set every preset period, and verifying the second accuracy of the second segmentation model with the first-level text contents and the second preset labels in the second validation set; and
    when the second accuracy is greater than a second preset threshold, ending the training to obtain the second segmentation model.
  15. The computer-readable storage medium of claim 11, wherein the second preset rule comprises:
    extracting the text content between the Mth second-level label and the (M+1)th second-level label as the text content corresponding to the Mth second-level label, where M is a positive integer greater than or equal to 1.
PCT/CN2020/086292 2019-05-20 2020-04-23 Text structured information extraction method, server and storage medium WO2020233332A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910419888.5A CN110287785A (zh) 2019-05-20 2019-05-20 文本结构化信息提取方法、服务器及存储介质
CN201910419888.5 2019-05-20

Publications (1)

Publication Number Publication Date
WO2020233332A1 true WO2020233332A1 (zh) 2020-11-26

Family

ID=68002204

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/086292 WO2020233332A1 (zh) 2019-05-20 2020-04-23 文本结构化信息提取方法、服务器及存储介质

Country Status (2)

Country Link
CN (1) CN110287785A (zh)
WO (1) WO2020233332A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN112597353A (zh) * 2020-12-18 2021-04-02 Wuhan University Automatic text information extraction method
  • CN112835922A (zh) * 2021-01-29 2021-05-25 Shanghai Xunmeng Information Technology Co., Ltd. Address zone classification method, system, device and storage medium
  • CN113591454A (zh) * 2021-07-30 2021-11-02 Bank of China Limited Text parsing method and apparatus
  • CN114091427A (zh) * 2021-11-19 2022-02-25 Hisense Electronic Technology (Wuhan) Co., Ltd. Image-text similarity model training method and display device
  • CN112784033B (zh) * 2021-01-29 2023-11-03 Beijing Baidu Netcom Science and Technology Co., Ltd. Method for training and applying a timeliness grade recognition model, and electronic device

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN110287785A (zh) * 2019-05-20 2019-09-27 Shenzhen OneConnect Smart Technology Co., Ltd. Text structured information extraction method, server and storage medium
  • CN111598550A (zh) * 2020-05-22 2020-08-28 Shenzhen Xiaoman Technology Co., Ltd. Email signature information extraction method and apparatus, electronic device and medium
  • CN112035449B (zh) * 2020-07-22 2024-06-14 Dazhen (Hangzhou) Technology Co., Ltd. Data processing method and apparatus, computer device and storage medium
  • CN113255303B (zh) * 2020-09-14 2022-03-25 Suzhou Qixingtian Patent Operation Management Co., Ltd. Method and system for assisted document editing
  • CN112270224A (zh) * 2020-10-14 2021-01-26 China Merchants Bank Co., Ltd. Insurance liability parsing method and apparatus, and computer-readable storage medium
  • CN112270604B (zh) * 2020-10-14 2024-08-20 China Merchants Bank Co., Ltd. Information structuring method and apparatus, and computer-readable storage medium
  • CN112733505B (zh) * 2020-12-30 2024-04-26 University of Science and Technology of China Document generation method and apparatus, electronic device and storage medium
  • CN113158946A (zh) * 2021-04-29 2021-07-23 Southern Power Grid Shenzhen Digital Grid Research Institute Co., Ltd. Bid document structuring method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN101206639A (zh) * 2007-12-20 2008-06-25 Peking University Founder Group Co., Ltd. PDF-based indexing method for complex layouts
  • CN101976232A (zh) * 2010-09-19 2011-02-16 Shenzhen Wondershare Software Co., Ltd. Method and apparatus for recognizing data tables in a document
US20150254213A1 (en) * 2014-02-12 2015-09-10 Kevin D. McGushion System and Method for Distilling Articles and Associating Images
  • CN107358208A (zh) * 2017-07-14 2017-11-17 Beijing Shenzhou Taiyue Software Co., Ltd. PDF document structured information extraction method and apparatus
  • CN107992597A (zh) * 2017-12-13 2018-05-04 Electric Power Research Institute of State Grid Shandong Electric Power Company Text structuring method for power-grid fault cases
  • CN108874928A (zh) * 2018-05-31 2018-11-23 Ping An Technology (Shenzhen) Co., Ltd. Resume data parsing and processing method, apparatus, device and storage medium
  • CN110287785A (zh) * 2019-05-20 2019-09-27 Shenzhen OneConnect Smart Technology Co., Ltd. Text structured information extraction method, server and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
CN108875059B (zh) * 2018-06-29 2021-02-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Method, apparatus, electronic device and storage medium for generating document labels


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LU WEI; HUANG YONG; CHENG QIKAI: "The Structure Function of Academic Text and Its Classification", JOURNAL OF THE CHINA SOCIETY FOR SCIENTIFIC AND TECHNICAL INFORMATION, vol. 33, no. 9, 30 September 2014 (2014-09-30), pages 979 - 985, XP055755087, ISSN: 1000-0135, DOI: 10.3772/j.issn.1000-0135.2014.09.010 *
YU HONG-TAO;YU HAI-MING;ZHANG FU-ZHI: "Metadata Extraction Based on Third-order Conditional Random Fields", CHINA MASTER'S THESES FULL-TEXT DATABASE, vol. 35, no. 3, 15 February 2014 (2014-02-15), pages 606 - 609, XP055755092, ISSN: 1674-0246 *
YU, LIANG: "Research and Applications on Text Features Extraction from Science and Technical Literatures", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 15 March 2010 (2010-03-15), pages 1 - 58, XP055754894, ISSN: 1674-0246 *
ZHANG YU-FANG, MO LING-LIN, XIONG ZHONG-YANG, GENG XIAO-FEI: "Hierarchical information extraction from research papers based on conditional random fields", APPLICATION RESEARCH OF COMPUTERS, vol. 26, no. 10, 31 October 2009 (2009-10-31), pages 3690 - 3693, XP055755082, ISSN: 1001-3695, DOI: 10.3969/j.issn.1001-3695.2009.10.025 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597353A (zh) * 2020-12-18 2021-04-02 Wuhan University Automatic text information extraction method
CN112597353B (zh) * 2020-12-18 2024-03-08 Wuhan University Automatic text information extraction method
CN112835922A (zh) * 2021-01-29 2021-05-25 Shanghai Xunmeng Information Technology Co., Ltd. Address zone classification method, system, device and storage medium
CN112784033B (zh) * 2021-01-29 2023-11-03 Beijing Baidu Netcom Science and Technology Co., Ltd. Method for training and applying a timeliness grade recognition model, and electronic device
CN113591454A (zh) * 2021-07-30 2021-11-02 Bank of China Limited Text parsing method and apparatus
CN114091427A (zh) * 2021-11-19 2022-02-25 Hisense Electronic Technology (Wuhan) Co., Ltd. Image-text similarity model training method and display device

Also Published As

Publication number Publication date
CN110287785A (zh) 2019-09-27

Similar Documents

Publication Publication Date Title
WO2020233332A1 (zh) Text structured information extraction method, server and storage medium
WO2021151270A1 (zh) Image structured data extraction method, apparatus, device and storage medium
US10200336B2 (en) Generating a conversation in a social network based on mixed media object context
CN108932294B (zh) Index-based resume data processing method, apparatus, device and storage medium
US10204082B2 (en) Generating digital document content from a digital image
US10489435B2 (en) Method, device and equipment for acquiring answer information
US10977259B2 (en) Electronic template generation, data extraction and response detection
WO2019041527A1 (zh) Document chart extraction method, electronic device and computer-readable storage medium
US20210192129A1 (en) Method, system and cloud server for auto filing an electronic form
CN113032580B (zh) Associated-file recommendation method and system, and electronic device
CN112016274B (zh) Medical text structuring method and apparatus, computer device and storage medium
US11138426B2 (en) Template matching, rules building and token extraction
CN110347984B (zh) Policy page change method and apparatus, computer device and storage medium
CN110166522B (zh) Server identification method and apparatus, readable storage medium and computer device
US9710769B2 (en) Methods and systems for crowdsourcing a task
CN113837113B (zh) Artificial-intelligence-based document verification method, apparatus, device and medium
CN112016290A (zh) Automatic document typesetting method, apparatus, device and storage medium
CN116132527B (zh) System and method for managing signboards, and data processing server
US10643022B2 (en) PDF extraction with text-based key
US9842307B2 (en) Methods and systems for creating tasks
WO2019149065A1 (zh) Emoji-compatible display method, apparatus, terminal and computer-readable storage medium
US8867838B2 (en) Method and system for a text data entry from an electronic document
WO2019000697A1 (zh) Information retrieval method and system, server and readable storage medium
CN117133006A (zh) Voucher verification method and apparatus, computer device and storage medium
CN107784328B (zh) Old German typeface recognition method, apparatus and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20809293

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20809293

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.03.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20809293

Country of ref document: EP

Kind code of ref document: A1