WO2020233332A1 - Text structured information extraction method, server, and storage medium
Text structured information extraction method, server, and storage medium
- Publication number
- WO2020233332A1 (PCT/CN2020/086292)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- level
- preset
- text content
- label
- training
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2291—User-Defined Types; Storage management thereof
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
Definitions
- This application relates to the field of artificial intelligence, and in particular to a method for extracting text structured information, a server and a storage medium.
- PDF: Portable Document Format
- OCR: Optical Character Recognition
- this application provides a text structured information extraction method, server, and storage medium, the purpose of which is to solve the problem that, when extracting document information, the format and text position are highly arbitrary, so structured information cannot be obtained easily.
- this application provides a method for extracting text structured information, which includes:
- Each first-level label, second-level label, first-level text content, and second-level text content is stored as a logical page in a preset database, and a file in a preset format is generated from the logical page and fed back to the client.
- the present application also provides a server.
- the server includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor.
- the processor implements a method for extracting text structured information when executing the program; the method includes:
- Each first-level label, second-level label, first-level text content, and second-level text content is stored as a logical page in a preset database, and a file in a preset format is generated from the logical page and fed back to the client.
- this application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, a method for extracting text structured information is implemented, the method including:
- Each first-level label, second-level label, first-level text content, and second-level text content is stored as a logical page in a preset database, and a file in a preset format is generated from the logical page and fed back to the client.
- the text structured information extraction method, server, and storage medium proposed in this application solve the problem that, when extracting document information, the format and text position are highly arbitrary, so structured information cannot be obtained easily.
- A segmentation model is used to determine each first-level tag and second-level tag in the original document, and structured information is then extracted based on the tag content. The extraction of structured document information is thereby automated, avoiding manual processing; it is convenient and efficient.
- Figure 1 is an application environment diagram of a preferred embodiment of a method for extracting text structured information of this application
- Figure 2 is a schematic diagram of a preferred embodiment of the application server
- FIG. 3 is a schematic diagram of modules of a preferred embodiment of the text structured information extraction program in FIG. 2;
- FIG. 4 is a flowchart of a preferred embodiment of the method for extracting text structured information of this application;
- Referring to FIG. 1, it is an application environment diagram of a preferred embodiment of the method for extracting text structured information of the present application.
- the server 1 is installed with a text structured information extraction program 10.
- Multiple clients 3 connect to the server 1 through the network 2.
- the network 2 may be the Internet, a cloud network, a wireless fidelity (Wi-Fi) network, a personal area network (PAN), a local area network (LAN), and/or a metropolitan area network (MAN).
- Various devices in the network environment can be configured to connect to the communication network according to various wired and wireless communication protocols.
- wired and wireless communication protocols can include, but are not limited to, at least one of the following: Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, Light Fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device-to-device communication, cellular communication protocols, and/or the Bluetooth communication protocol, or a combination thereof.
- the client 3 can be a desktop computer, a notebook, a tablet computer, a mobile phone, or another terminal device that is installed with application software and can communicate with the server 1 through the network 2.
- the database 4 is used to store data such as tags of each level and text content corresponding to each level of tags.
- FIG. 2 is a schematic diagram of a preferred embodiment of the server 1 of this application.
- the server 1 includes but is not limited to: a memory 11, a processor 12, a display 13, and a network interface 14.
- the server 1 is connected to the network through the network interface 14 to obtain original data.
- the network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, a telephone network, or other wireless or wired networks.
- the memory 11 includes at least one type of readable storage medium, and the computer readable storage medium may be non-volatile or volatile.
- the readable storage medium includes flash memory, hard disk, multimedia card, card type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
- the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. The memory 11 may also be an external storage device of the server 1, for example, a plug-in hard disk equipped on the server 1, a smart media card (SMC), a secure digital (SD) card, a flash card (Flash Card), etc.
- the memory 11 may also include both the internal storage unit of the server 1 and its external storage device.
- the memory 11 is generally used to store the operating system and various application software installed in the server 1, such as the program code of the text structured information extraction program 10.
- the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
- the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
- the processor 12 is generally used to control the overall operation of the server 1, such as performing data interaction or communication-related control and processing.
- the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the program code of the text structured information extraction program 10.
- the display 13 may be called a display screen or a display unit.
- the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an organic light-emitting diode (OLED) touch device.
- the display 13 is used for displaying the information processed in the server 1 and for displaying a visualized work interface, for example, displaying the results of data statistics.
- the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
- the network interface 14 is usually used to establish a communication connection between the server 1 and other electronic devices.
- Fig. 2 only shows the server 1 with the components 11-14 and the text structured information extraction program 10. However, it should be understood that it is not required to implement all of the components shown; more or fewer components may be implemented instead.
- the server 1 may also include a user interface.
- the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
- the optional user interface may also include a standard wired interface and a wireless interface.
- the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
- the display may also be appropriately called a display screen or a display unit, which is used to display the information processed in the server 1 and to display a visualized user interface.
- the server 1 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which are not described here.
- the processor 12 can implement the following steps when executing the text structured information extraction program 10 stored in the memory 11:
- Receiving step: receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
- First obtaining step: inputting the original document into a pre-trained first segmentation model to obtain multiple first-level tags of the original document, and then obtaining, from the original document according to a first preset rule, the first-level text content corresponding to each first-level tag;
- Second obtaining step: inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level tags corresponding to each first-level text content, and then obtaining, from the original document according to a second preset rule, the second-level text content corresponding to each second-level tag; and
- Feedback step: storing each first-level tag, second-level tag, first-level text content, and second-level text content as a logical page in a preset database, and generating a file in a preset format from the logical page to feed back to the client.
- the text structured information extraction program 10 may be divided into multiple modules, and the multiple modules are stored in the memory 11 and executed by the processor 12 to complete this application.
- the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions.
- Referring to FIG. 3, it is a program module diagram of an embodiment of the text structured information extraction program 10 in FIG. 2.
- the text structured information extraction program 10 can be divided into: a receiving module 110, a first acquiring module 120, a second acquiring module 130, and a feedback module 140.
- the receiving module 110 is configured to receive a request for extracting text structured information sent by the client, and obtain the original document of the structured information to be extracted.
- the request may include the original document to be structured, and may also include the storage path and unique identifier of the original document to be structured.
- the original document can be entered when the user submits the text structured request, or it can be obtained from the address specified by the request after the user submits the text structured request.
- the original document can be corporate documents such as official documents and bidding documents, and its format is PDF.
- the receiving module 110 also performs user identity authentication on the user of the client that initiated the text structured information extraction request. If the user identity authentication passes, the subsequent steps are executed; if it fails, the text structured information extraction request is rejected and warning information is generated. For example, the receiving module 110 obtains the user's identity information and judges, according to the user identity information, whether the user has the authority to extract text structured information; if so, the subsequent steps continue; if not, the text structured information extraction request is rejected and warning information is generated.
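- As an illustration of this permission check, the sketch below uses hypothetical names (User, ALLOWED_ROLES, handle_extraction_request) that are not part of the patent:

```python
# Illustrative sketch of the identity check described above; all names here
# are hypothetical, not from the patent.
from dataclasses import dataclass

ALLOWED_ROLES = {"admin", "editor"}  # roles permitted to extract structured info

@dataclass
class User:
    user_id: str
    role: str

def handle_extraction_request(user: User, document: bytes) -> str:
    """Reject the request and emit a warning if the user lacks permission."""
    if user.role not in ALLOWED_ROLES:
        warning = f"warning: user {user.user_id} may not extract structured info"
        print(warning)  # stand-in for the patent's 'generate warning information'
        raise PermissionError(warning)
    return "proceed to segmentation"  # subsequent steps run only after this check
```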
- the first obtaining module 120 is configured to input the original document into a pre-trained first segmentation model to obtain multiple first-level tags of the original document, and then obtain, from the original document according to a first preset rule, the first-level text content corresponding to each first-level tag.
- the first segmentation model is obtained by training a Conditional Random Field (CRF) model.
- the specific training steps include:
- First, a first preset number of original document samples are obtained and preprocessed, for example converted into an XML format document, so that the computer can read the information of the original document (for example, the position coordinates of the text in the document, the font, and other information), which facilitates the subsequent step of obtaining the first-level labels.
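- As a concrete illustration of this preprocessing, the following sketch reads per-line position coordinates and font information from a PDF; pdfminer.six is an assumed tool choice, since the patent does not name a library:

```python
# Minimal sketch, assuming pdfminer.six; the patent only requires that text
# position coordinates and font information become machine-readable.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def read_layout_info(pdf_path: str):
    """Yield (text, bounding box, font name, font size) for each text line."""
    for page in extract_pages(pdf_path):
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if not isinstance(line, LTTextLine):
                        continue
                    chars = [c for c in line if isinstance(c, LTChar)]
                    if chars:  # first character stands in for the line's font
                        yield (line.get_text().strip(), line.bbox,
                               chars[0].fontname, chars[0].size)
```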
- Next, a unique first preset tag is assigned to each converted XML format document, where the tags include, but are not limited to: cover, title, table of contents, body, footnotes, page headers, references, appendices, etc. Taking the cover as an example, the cover of the document is marked "Cover". Then the preset feature vector of each label is extracted according to a predetermined feature vector extraction algorithm. Specifically, the extraction step includes:
- Input each label into the pre-trained word vector model (word2vec model) to generate a word-level vector r_wrd; input the characters that make up each label into a pre-trained convolutional neural network (CNN) model to generate a character-level vector r_wch corresponding to the label; and concatenate the word-level vector and the character-level vector to obtain a new vector u_n = [r_wrd, r_wch] as the feature vector of each label.
- Here, r_wrd denotes the vector obtained by training with the word2vec model, and its processing method is consistent with the existing word2vec model; r_wch denotes the vector obtained through convolutional neural network training. Both training processes are known from the prior art and are not repeated here.
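- A minimal sketch of this feature construction follows, using gensim for the word-level vector and a small PyTorch character CNN; the dimensions, character vocabulary, and toy label corpus are illustrative assumptions, not values from the patent:

```python
# Sketch of u_n = [r_wrd, r_wch]; dimensions and vocabulary are illustrative.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

labels = [["cover"], ["title"], ["body"], ["references"]]  # toy label corpus
w2v = Word2Vec(sentences=labels, vector_size=50, min_count=1)  # word-level model

CHARS = "abcdefghijklmnopqrstuvwxyz"
char_to_id = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is padding

class CharCNN(nn.Module):
    """Character-level encoder producing r_wch for one label."""
    def __init__(self, n_chars=len(CHARS) + 1, emb=16, out=30, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb, padding_idx=0)
        self.conv = nn.Conv1d(emb, out, kernel_size=kernel, padding=1)

    def forward(self, char_ids):                 # (1, seq_len)
        x = self.emb(char_ids).transpose(1, 2)   # (1, emb, seq_len)
        x = torch.relu(self.conv(x))             # (1, out, seq_len)
        return x.max(dim=2).values               # max-pool over characters

char_cnn = CharCNN()  # in practice pre-trained, as the text above states

def label_feature(label: str) -> np.ndarray:
    r_wrd = w2v.wv[label]                        # word-level vector
    ids = torch.tensor([[char_to_id.get(c, 0) for c in label]])
    r_wch = char_cnn(ids).detach().numpy()[0]    # character-level vector
    return np.concatenate([r_wrd, r_wch])        # u_n = [r_wrd, r_wch]

print(label_feature("cover").shape)  # (80,) = 50 word dims + 30 char dims
```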
- Each preset feature vector is then used as a variable X and each first preset label as a dependent variable Y to generate a sample set.
- The sample set is divided into a first training set and a first verification set according to a first preset ratio (for example, 4:1), wherein the number of samples in the first training set is greater than the number of samples in the first verification set.
- The conditional random field model is trained with each variable X and each dependent variable Y in the first training set, and the model is verified with the first verification set every preset period (for example, every 1000 iterations); the accuracy of the first segmentation model is verified with each variable X and each dependent variable Y in the first verification set.
- If the verification accuracy is greater than the first preset threshold (for example, 95%), the training ends and the first segmentation model is obtained; if the verification accuracy is less than or equal to the first preset threshold, the number of document samples is increased and the above training steps are re-executed based on the increased document samples.
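- The training-and-verification step above might look like the following sketch. sklearn-crfsuite is an assumed stand-in CRF implementation; because it does not expose mid-training callbacks, the periodic verification is approximated by verifying after training, and the feature encoding and toy data are assumptions:

```python
# Sketch of the CRF training step; sklearn-crfsuite is an assumed stand-in,
# and the toy documents below replace the patent's labeled samples.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def to_features(vec):
    """Encode a dense feature vector u_n as the dict features CRFsuite expects."""
    return {f"f{i}": float(v) for i, v in enumerate(vec)}

# one "document" = one sequence of block feature vectors and block labels
train_docs = [[[0.9, 0.1], [0.2, 0.8]], [[0.8, 0.2], [0.1, 0.9]]]
train_tags = [["cover", "body"], ["cover", "body"]]
val_docs, val_tags = [[[0.85, 0.15], [0.15, 0.85]]], [["cover", "body"]]

X_train = [[to_features(v) for v in doc] for doc in train_docs]
X_val = [[to_features(v) for v in doc] for doc in val_docs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=1000)
crf.fit(X_train, train_tags)

accuracy = metrics.flat_accuracy_score(val_tags, crf.predict(X_val))
if accuracy > 0.95:                   # first preset threshold from the text above
    first_segmentation_model = crf    # training ends
else:
    print("below threshold: add document samples and re-run the training steps")
```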
- the original document from which the structured information is to be extracted is input into the trained first segmentation model, and after multiple first-level tags of the original document are obtained, the first-level text content corresponding to each first-level tag is obtained according to the first preset rule.
- the first preset rule includes determining the levels corresponding to the obtained multiple first-level tags according to the preset mapping relationship between first-level tags and levels. The level of each first-level label is predetermined; for example, cover, title, body, references, appendices, etc. are the first level; table of contents, footnotes, and page headers are the second level; and the first level takes precedence over the second level.
- For each first-level label of the first level, the text content between the current first-level label and the next first-level label of the first level is extracted as the text content corresponding to the current first-level label; if the current first-level label is the last first-level label, the text content after the current first-level label is extracted as the text content corresponding to the current first-level label. After the first-level label classification is completed, the entire document is divided into multiple parts, and each part belongs to one first-level label category.
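- This first preset rule amounts to slicing the document at consecutive first-level labels of the first level; a minimal sketch with hypothetical label positions:

```python
# Sketch of the first preset rule: the text between the current first-level
# label and the next one belongs to the current label; the last label takes
# everything after it. Label offsets stand in for the model's output.
def split_by_labels(document: str, tag_positions: list) -> dict:
    """tag_positions: (label, character offset) pairs, sorted by offset."""
    sections = {}
    for n, (label, start) in enumerate(tag_positions):
        end = (tag_positions[n + 1][1] if n + 1 < len(tag_positions)
               else len(document))          # last label: rest of the document
        sections[label] = document[start:end].strip()
    return sections

doc = "Cover ACME Annual Report Body Revenue grew. References [1] Smith 2018"
tags = [("cover", doc.find("Cover")), ("body", doc.find("Body")),
        ("references", doc.find("References"))]
print(split_by_labels(doc, tags))
```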
- the second acquisition module 130 is configured to input each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtain, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label.
- First, a second preset number of first-level text content samples are obtained, and a unique second preset label is assigned to each first-level text content sample, where the second preset labels include, but are not limited to: headline, subtitle, author, etc.
- The first-level text content samples are divided into a second training set and a second verification set according to a second preset ratio (for example, 3:1), wherein the number of samples in the second training set is greater than the number of samples in the second verification set.
- The first-level text content samples in the second training set are input into the conditional random field model for training, and the model is verified with the second verification set every preset period. If the verification accuracy is greater than the second preset threshold (for example, 97%), the training ends and the second segmentation model is obtained; if the verification accuracy is less than or equal to the second preset threshold, the number of first-level text content samples is increased, and the training steps are re-executed based on the increased samples.
- Each first-level text content is input into the trained second segmentation model to obtain multiple second-level tags corresponding to the first-level text content, and then each second-level text content corresponding to each second-level tag is obtained according to a second preset rule.
- the step of determining the content of each secondary text corresponding to each secondary label includes:
- The levels corresponding to the obtained multiple second-level labels are determined according to the preset mapping relationship between second-level labels and levels, where the level of each second-level label is predetermined; for example, the first-level subtitle is the first level, the second-level subtitle is the second level, and the third-level subtitle is the third level. The subtitles can include first-level, second-level, third-level, and fourth-level subtitles, and the text is divided into the content corresponding to each of the four levels of subtitles. The text content between the M-th second-level label and the (M+1)-th second-level label is extracted as the text content corresponding to the M-th second-level label, where M is a positive integer greater than or equal to 1.
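- Applying the same between-labels rule inside each first-level section yields the two-level structure; a small sketch under the same assumptions (toy data, hypothetical labels):

```python
# Sketch: within one first-level section, the M-th second-level label owns
# the text up to the (M+1)-th second-level label (toy data, hypothetical labels).
def split_by_labels(text: str, tags: list) -> dict:
    out = {}
    for m, (label, start) in enumerate(tags):
        end = tags[m + 1][1] if m + 1 < len(tags) else len(text)
        out[label] = text[start:end].strip()
    return out

body = "1 Overview ... 1.1 Scope ... 2 Results ..."
level2_tags = [("first-level subtitle 1", body.find("1 Overview")),
               ("second-level subtitle 1.1", body.find("1.1")),
               ("first-level subtitle 2", body.find("2 Results"))]

# one first-level section ("body") split into its second-level sections
structured = {"body": split_by_labels(body, level2_tags)}
print(structured)
```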
- the feedback module 140 is used to store each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and to generate a file in a preset format from the logical page to feed back to the client. Specifically, the first-level tags and the second-level tags corresponding to each first-level tag are stored in a structured manner, and each first-level tag, its second-level tags, and the text content belonging to them are stored as one logical page.
- the text content of each label is regarded as the content corresponding to the label.
- A key is established in advance for the generated file, and the file is encrypted and pushed during the process of sending it to the client; after decryption with the key, the generated file can be viewed.
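- A sketch of this feedback step follows: each logical page is stored as a database row and the generated file is encrypted with a pre-established key before being pushed; sqlite3 and Fernet are illustrative choices not named by the patent:

```python
# Sketch of the feedback step; sqlite3 + cryptography.Fernet are assumptions.
import json
import sqlite3
from cryptography.fernet import Fernet

conn = sqlite3.connect("pages.db")  # stand-in for the preset database
conn.execute("CREATE TABLE IF NOT EXISTS logical_pages (l1_label TEXT, page TEXT)")

def store_logical_page(l1_label: str, l2_sections: dict) -> None:
    """One logical page = a first-level label plus its second-level labels/content."""
    conn.execute("INSERT INTO logical_pages VALUES (?, ?)",
                 (l1_label, json.dumps(l2_sections, ensure_ascii=False)))
    conn.commit()

key = Fernet.generate_key()  # 'a key is established in advance' per the text

def generate_and_encrypt(l1_label: str) -> bytes:
    """Generate the preset-format file (JSON here) and encrypt it before pushing."""
    row = conn.execute("SELECT page FROM logical_pages WHERE l1_label = ?",
                       (l1_label,)).fetchone()
    return Fernet(key).encrypt(row[0].encode("utf-8"))  # push this to the client
```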
- FIG. 4 is a flowchart of a preferred embodiment of the method for extracting text structured information according to the present application.
- Step S10: receive the request for extracting text structured information sent by the client, and obtain the original document from which structured information is to be extracted.
- the request may include the original document to be structured, and may also include the storage path and unique identifier of the original document to be structured.
- the original document can be entered when the user submits the text structured request, or it can be obtained from the address specified by the request after the user submits the text structured request.
- the original document can be corporate documents such as official documents and bidding documents, and its format is PDF.
- the receiving module 110 also performs user identity authentication on the user of the client that initiated the text structured information extraction request. If the user identity authentication passes, the subsequent steps are executed; if it fails, the text structured information extraction request is rejected and warning information is generated. For example, the receiving module 110 obtains the user's identity information and judges, according to the user identity information, whether the user has the authority to extract text structured information; if so, the subsequent steps continue; if not, the text structured information extraction request is rejected and warning information is generated.
- Step S20: input the original document into the pre-trained first segmentation model to obtain multiple first-level tags of the original document, and then obtain, from the original document according to the first preset rule, the first-level text content corresponding to each first-level tag.
- the first segmentation model is obtained by training a Conditional Random Field (CRF) model.
- the specific training steps include:
- First, a first preset number of original document samples are obtained and preprocessed, for example converted into an XML format document, so that the computer can read the information of the original document (for example, the position coordinates of the text in the document, the font, and other information), which facilitates the subsequent step of obtaining the first-level labels.
- Next, a unique first preset tag is assigned to each converted XML format document, where the tags include, but are not limited to: cover, title, table of contents, body, footnotes, page headers, references, appendices, etc. Taking the cover as an example, the cover of the document is marked "Cover". Then the preset feature vector of each label is extracted according to a predetermined feature vector extraction algorithm. Specifically, the extraction step includes:
- Input each label into the pre-trained word vector model (word2vec model) to generate a word-level vector r_wrd; input the characters that make up each label into a pre-trained convolutional neural network (CNN) model to generate a character-level vector r_wch corresponding to the label; and concatenate the word-level vector and the character-level vector to obtain a new vector u_n = [r_wrd, r_wch] as the feature vector of each label.
- Here, r_wrd denotes the vector obtained by training with the word2vec model, and its processing method is consistent with the existing word2vec model; r_wch denotes the vector obtained through convolutional neural network training. Both training processes are known from the prior art and are not repeated here.
- Each preset feature vector is then used as a variable X and each first preset label as a dependent variable Y to generate a sample set.
- The sample set is divided into a first training set and a first verification set according to a first preset ratio (for example, 4:1), wherein the number of samples in the first training set is greater than the number of samples in the first verification set.
- The conditional random field model is trained with each variable X and each dependent variable Y in the first training set, and the model is verified with the first verification set every preset period (for example, every 1000 iterations); the accuracy of the first segmentation model is verified with each variable X and each dependent variable Y in the first verification set.
- If the verification accuracy is greater than the first preset threshold (for example, 95%), the training ends and the first segmentation model is obtained; if the verification accuracy is less than or equal to the first preset threshold, the number of document samples is increased and the above training steps are re-executed based on the increased document samples.
- the original document from which the structured information is to be extracted is input into the trained first segmentation model, and after multiple first-level tags of the original document are obtained, the first-level text content corresponding to each first-level tag is obtained according to the first preset rule.
- the first preset rule includes determining the levels corresponding to the obtained multiple first-level tags according to the preset mapping relationship between first-level tags and levels. The level of each first-level label is predetermined; for example, cover, title, body, references, appendices, etc. are the first level; table of contents, footnotes, and page headers are the second level; and the first level takes precedence over the second level.
- For each first-level label of the first level, the text content between the current first-level label and the next first-level label of the first level is extracted as the text content corresponding to the current first-level label; if the current first-level label is the last first-level label, the text content after the current first-level label is extracted as the text content corresponding to the current first-level label. After the first-level label classification is completed, the entire document is divided into multiple parts, and each part belongs to one first-level label category.
- Step S30: input each first-level text content into the pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtain, from the original document according to the second preset rule, the second-level text content corresponding to each second-level label.
- First, a second preset number of first-level text content samples are obtained, and a unique second preset label is assigned to each first-level text content sample, where the second preset labels include, but are not limited to: headline, subtitle, author, etc.
- The first-level text content samples are divided into a second training set and a second verification set according to a second preset ratio (for example, 3:1), wherein the number of samples in the second training set is greater than the number of samples in the second verification set.
- The first-level text content samples in the second training set are input into the conditional random field model for training, and the model is verified with the second verification set every preset period. If the verification accuracy is greater than the second preset threshold (for example, 97%), the training ends and the second segmentation model is obtained; if the verification accuracy is less than or equal to the second preset threshold, the number of first-level text content samples is increased, and the training steps are re-executed based on the increased samples.
- Each first-level text content is input into the trained second segmentation model to obtain multiple second-level tags corresponding to the first-level text content, and then each second-level text content corresponding to each second-level tag is obtained according to a second preset rule.
- the step of determining the content of each secondary text corresponding to each secondary label includes:
- The levels corresponding to the obtained multiple second-level labels are determined according to the preset mapping relationship between second-level labels and levels, where the level of each second-level label is predetermined; for example, the first-level subtitle is the first level, the second-level subtitle is the second level, and the third-level subtitle is the third level. The subtitles can include first-level, second-level, third-level, and fourth-level subtitles, and the text is divided into the content corresponding to each of the four levels of subtitles. The text content between the M-th second-level label and the (M+1)-th second-level label is extracted as the text content corresponding to the M-th second-level label, where M is a positive integer greater than or equal to 1.
- Step S40: store each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generate a file in a preset format from the logical page to feed back to the client.
- Specifically, the first-level tags and the second-level tags corresponding to each first-level tag are stored in a structured manner, and each first-level tag, its second-level tags, and the text content belonging to them are stored as one logical page.
- the text content of each label is regarded as the content corresponding to the label.
- A key is established in advance for the generated file, and the file is encrypted and pushed during the process of sending it to the client; after decryption with the key, the generated file can be viewed.
- the embodiment of the present application also proposes a computer-readable storage medium.
- the computer-readable storage medium may be a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a CD-ROM, a USB memory, etc., or any one or any combination thereof.
- the computer-readable storage medium includes a text structured information extraction program 10, which implements the following operations when executed by a processor:
- Receiving step: receiving a request for extracting text structured information sent by the client, and obtaining the original document from which structured information is to be extracted;
- First obtaining step: inputting the original document into a pre-trained first segmentation model to obtain multiple first-level tags of the original document, and then obtaining, from the original document according to a first preset rule, the first-level text content corresponding to each first-level tag;
- Second obtaining step: inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level tags corresponding to each first-level text content, and then obtaining, from the original document according to a second preset rule, the second-level text content corresponding to each second-level tag; and
- Feedback step: storing each first-level tag, second-level tag, first-level text content, and second-level text content as a logical page in a preset database, and generating a file in a preset format from the logical page to feed back to the client.
- The methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
- The computer software product is stored in a storage medium as described above; the computer-readable storage medium can be non-volatile or volatile (such as a ROM/RAM, magnetic disk, or optical disk), and includes several instructions to enable a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in each embodiment of this application.
Claims (15)
- 1. A method for extracting text structured information, applied to a server, the server being communicatively connected to one or more clients, the method comprising: receiving a request for extracting text structured information sent by the client, and obtaining an original document from which structured information is to be extracted; inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, from the original document according to a first preset rule, the first-level text content corresponding to each first-level label; inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtaining, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label; and storing each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generating a file in a preset format from the logical page to feed back to the client.
- 2. The method for extracting text structured information according to claim 1, wherein the first segmentation model is obtained by training a conditional random field model, and the training process comprises the following steps: obtaining a first preset number of original document samples, and preprocessing the original document samples; assigning a unique first preset label to each preprocessed document, extracting the preset feature vector of each first preset label according to a predetermined feature vector extraction algorithm, and generating a sample set with each preset feature vector as a variable X and each first preset label as a dependent variable Y; dividing the sample set into a first training set and a first verification set according to a first preset ratio; training the conditional random field model with each variable X and each dependent variable Y in the first training set, verifying the conditional random field model with the first verification set every preset period, and verifying a first accuracy of the first segmentation model with each variable X and each dependent variable Y in the first verification set; and when the first accuracy is greater than a first preset threshold, ending the training to obtain the first segmentation model.
- 3. The method for extracting text structured information according to claim 1 or 2, wherein the first preset rule comprises: extracting the text content between the N-th first-level label and the (N+1)-th first-level label as the text content corresponding to the N-th first-level label, where N is a positive integer greater than or equal to 1.
- 4. The method for extracting text structured information according to claim 3, wherein the second segmentation model is obtained by training a conditional random field model, and the training process comprises the following steps: obtaining a second preset number of first-level text content samples, and assigning a unique second preset label to each first-level text content sample; dividing the first-level text content samples into a second training set and a second verification set according to a second preset ratio; inputting the first-level text content samples in the second training set into the conditional random field model for training, verifying the conditional random field model with the second verification set every preset period, and verifying a second accuracy of the second segmentation model with each first-level text content and each second preset label in the second verification set; and when the second accuracy is greater than a second preset threshold, ending the training to obtain the second segmentation model.
- 5. The method for extracting text structured information according to claim 1, wherein the second preset rule comprises: extracting the text content between the M-th second-level label and the (M+1)-th second-level label as the text content corresponding to the M-th second-level label, where M is a positive integer greater than or equal to 1.
- 6. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements a method for extracting text structured information, the method comprising: receiving a request for extracting text structured information sent by a client, and obtaining an original document from which structured information is to be extracted; inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, from the original document according to a first preset rule, the first-level text content corresponding to each first-level label; inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtaining, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label; and storing each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generating a file in a preset format from the logical page to feed back to the client.
- 7. The server according to claim 6, wherein the first segmentation model is obtained by training a conditional random field model, and the training process comprises the following steps: obtaining a first preset number of original document samples, and preprocessing the original document samples; assigning a unique first preset label to each preprocessed document, extracting the preset feature vector of each first preset label according to a predetermined feature vector extraction algorithm, and generating a sample set with each preset feature vector as a variable X and each first preset label as a dependent variable Y; dividing the sample set into a first training set and a first verification set according to a first preset ratio; training the conditional random field model with each variable X and each dependent variable Y in the first training set, verifying the conditional random field model with the first verification set every preset period, and verifying a first accuracy of the first segmentation model with each variable X and each dependent variable Y in the first verification set; and when the first accuracy is greater than a first preset threshold, ending the training to obtain the first segmentation model.
- 8. The server according to claim 6 or 7, wherein the first preset rule comprises: extracting the text content between the N-th first-level label and the (N+1)-th first-level label as the text content corresponding to the N-th first-level label, where N is a positive integer greater than or equal to 1.
- 9. The server according to claim 8, wherein the second segmentation model is obtained by training a conditional random field model, and the training process comprises the following steps: obtaining a second preset number of first-level text content samples, and assigning a unique second preset label to each first-level text content sample; dividing the first-level text content samples into a second training set and a second verification set according to a second preset ratio; inputting the first-level text content samples in the second training set into the conditional random field model for training, verifying the conditional random field model with the second verification set every preset period, and verifying a second accuracy of the second segmentation model with each first-level text content and each second preset label in the second verification set; and when the second accuracy is greater than a second preset threshold, ending the training to obtain the second segmentation model.
- 10. The server according to claim 6, wherein the second preset rule comprises: extracting the text content between the M-th second-level label and the (M+1)-th second-level label as the text content corresponding to the M-th second-level label, where M is a positive integer greater than or equal to 1.
- 11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements a method for extracting text structured information, the method comprising: receiving a request for extracting text structured information sent by a client, and obtaining an original document from which structured information is to be extracted; inputting the original document into a pre-trained first segmentation model to obtain multiple first-level labels of the original document, and then obtaining, from the original document according to a first preset rule, the first-level text content corresponding to each first-level label; inputting each first-level text content into a pre-trained second segmentation model to obtain the second-level labels corresponding to each first-level text content, and then obtaining, from the original document according to a second preset rule, the second-level text content corresponding to each second-level label; and storing each first-level label, second-level label, first-level text content, and second-level text content as a logical page in a preset database, and generating a file in a preset format from the logical page to feed back to the client.
- 12. The computer-readable storage medium according to claim 11, wherein the first segmentation model is obtained by training a conditional random field model, and the training process comprises the following steps: obtaining a first preset number of original document samples, and preprocessing the original document samples; assigning a unique first preset label to each preprocessed document, extracting the preset feature vector of each first preset label according to a predetermined feature vector extraction algorithm, and generating a sample set with each preset feature vector as a variable X and each first preset label as a dependent variable Y; dividing the sample set into a first training set and a first verification set according to a first preset ratio; training the conditional random field model with each variable X and each dependent variable Y in the first training set, verifying the conditional random field model with the first verification set every preset period, and verifying a first accuracy of the first segmentation model with each variable X and each dependent variable Y in the first verification set; and when the first accuracy is greater than a first preset threshold, ending the training to obtain the first segmentation model.
- 13. The computer-readable storage medium according to claim 11 or 12, wherein the first preset rule comprises: extracting the text content between the N-th first-level label and the (N+1)-th first-level label as the text content corresponding to the N-th first-level label, where N is a positive integer greater than or equal to 1.
- 14. The computer-readable storage medium according to claim 13, wherein the second segmentation model is obtained by training a conditional random field model, and the training process comprises the following steps: obtaining a second preset number of first-level text content samples, and assigning a unique second preset label to each first-level text content sample; dividing the first-level text content samples into a second training set and a second verification set according to a second preset ratio; inputting the first-level text content samples in the second training set into the conditional random field model for training, verifying the conditional random field model with the second verification set every preset period, and verifying a second accuracy of the second segmentation model with each first-level text content and each second preset label in the second verification set; and when the second accuracy is greater than a second preset threshold, ending the training to obtain the second segmentation model.
- 15. The computer-readable storage medium according to claim 11, wherein the second preset rule comprises: extracting the text content between the M-th second-level label and the (M+1)-th second-level label as the text content corresponding to the M-th second-level label, where M is a positive integer greater than or equal to 1.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910419888.5A CN110287785A (zh) | 2019-05-20 | 2019-05-20 | Text structured information extraction method, server, and storage medium
CN201910419888.5 | 2019-05-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020233332A1 true WO2020233332A1 (zh) | 2020-11-26 |
Family
ID=68002204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/086292 WO2020233332A1 (zh) | 2019-05-20 | 2020-04-23 | Text structured information extraction method, server, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110287785A (zh) |
WO (1) | WO2020233332A1 (zh) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287785A (zh) * | 2019-05-20 | 2019-09-27 | 深圳壹账通智能科技有限公司 | 文本结构化信息提取方法、服务器及存储介质 |
CN111598550A (zh) * | 2020-05-22 | 2020-08-28 | 深圳市小满科技有限公司 | 邮件签名信息提取方法、装置、电子设备及介质 |
CN112035449B (zh) * | 2020-07-22 | 2024-06-14 | 大箴(杭州)科技有限公司 | 数据处理方法及装置、计算机设备、存储介质 |
CN113255303B (zh) * | 2020-09-14 | 2022-03-25 | 苏州七星天专利运营管理有限责任公司 | 一种文档辅助编辑的方法和系统 |
CN112270224A (zh) * | 2020-10-14 | 2021-01-26 | 招商银行股份有限公司 | 保险责任解析方法、装置及计算机可读存储介质 |
CN112270604B (zh) * | 2020-10-14 | 2024-08-20 | 招商银行股份有限公司 | 信息结构化处理方法、装置及计算机可读存储介质 |
CN112733505B (zh) * | 2020-12-30 | 2024-04-26 | 中国科学技术大学 | 文档生成方法和装置、电子设备及存储介质 |
CN113158946A (zh) * | 2021-04-29 | 2021-07-23 | 南方电网深圳数字电网研究院有限公司 | 一种标书结构化处理方法及系统 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5594809A (en) * | 1995-04-28 | 1997-01-14 | Xerox Corporation | Automatic training of character templates using a text line image, a text line transcription and a line image source model |
CN108875059B (zh) * | 2018-06-29 | 2021-02-12 | 北京百度网讯科技有限公司 | Method, apparatus, electronic device, and storage medium for generating document labels |
- 2019-05-20: CN application CN201910419888.5A filed; published as CN110287785A (status: active, pending)
- 2020-04-23: PCT application PCT/CN2020/086292 filed; published as WO2020233332A1 (status: active, application filing)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101206639A (zh) * | 2007-12-20 | 2008-06-25 | 北大方正集团有限公司 | PDF-based indexing method for complex layouts |
CN101976232A (zh) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method and apparatus for recognizing data tables in a document |
US20150254213A1 (en) * | 2014-02-12 | 2015-09-10 | Kevin D. McGushion | System and Method for Distilling Articles and Associating Images |
CN107358208A (zh) * | 2017-07-14 | 2017-11-17 | 北京神州泰岳软件股份有限公司 | Method and apparatus for extracting structured information from PDF documents |
CN107992597A (zh) * | 2017-12-13 | 2018-05-04 | 国网山东省电力公司电力科学研究院 | Text structuring method for power grid fault cases |
CN108874928A (zh) * | 2018-05-31 | 2018-11-23 | 平安科技(深圳)有限公司 | Resume data parsing and processing method, apparatus, device, and storage medium |
CN110287785A (zh) * | 2019-05-20 | 2019-09-27 | 深圳壹账通智能科技有限公司 | Text structured information extraction method, server, and storage medium |
Non-Patent Citations (4)
Title |
---|
LU WEI; HUANG YONG; CHENG QIKAI: "The Structure Function of Academic Text and Its Classification", JOURNAL OF THE CHINA SOCIETY FOR SCIENTIFIC AND TECHNICAL INFORMATION, vol. 33, no. 9, 30 September 2014 (2014-09-30), pages 979 - 985, XP055755087, ISSN: 1000-0135, DOI: 10.3772/j.issn.1000-0135.2014.09.010 * |
YU HONG-TAO;YU HAI-MING;ZHANG FU-ZHI: "Metadata Extraction Based on Third-order Conditional Random Fields", CHINA MASTER'S THESES FULL-TEXT DATABASE, vol. 35, no. 3, 15 February 2014 (2014-02-15), pages 606 - 609, XP055755092, ISSN: 1674-0246 * |
YU, LIANG: "Research and Applications on Text Features Extraction from Science and Technical Literatures", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 15 March 2010 (2010-03-15), pages 1 - 58, XP055754894, ISSN: 1674-0246 * |
ZHANG YU-FANG, MO LING-LIN, XIONG ZHONG-YANG, GENG XIAO-FEI: "Hierarchical information extraction from research papers based on conditional random fields", APPLICATION RESEARCH OF COMPUTERS, vol. 26, no. 10, 31 October 2009 (2009-10-31), pages 3690 - 3693, XP055755082, ISSN: 1001-3695, DOI: 10.3969/j.issn.1001-3695.2009.10.025 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597353A (zh) * | 2020-12-18 | 2021-04-02 | 武汉大学 | Automatic text information extraction method |
CN112597353B (zh) * | 2020-12-18 | 2024-03-08 | 武汉大学 | Automatic text information extraction method |
CN112835922A (zh) * | 2021-01-29 | 2021-05-25 | 上海寻梦信息技术有限公司 | Address region classification method, system, device, and storage medium |
CN112784033B (zh) * | 2021-01-29 | 2023-11-03 | 北京百度网讯科技有限公司 | Method for training and applying a timeliness level recognition model, and electronic device |
CN113591454A (zh) * | 2021-07-30 | 2021-11-02 | 中国银行股份有限公司 | Text parsing method and apparatus |
CN114091427A (zh) * | 2021-11-19 | 2022-02-25 | 海信电子科技(武汉)有限公司 | Image-text similarity model training method and display device |
Also Published As
Publication number | Publication date |
---|---|
CN110287785A (zh) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020233332A1 (zh) | Text structured information extraction method, server, and storage medium | |
WO2021151270A1 (zh) | Image structured data extraction method, apparatus, device, and storage medium | |
US10200336B2 (en) | Generating a conversation in a social network based on mixed media object context | |
CN108932294B (zh) | Index-based resume data processing method, apparatus, device, and storage medium | |
US10204082B2 (en) | Generating digital document content from a digital image | |
US10489435B2 (en) | Method, device and equipment for acquiring answer information | |
US10977259B2 (en) | Electronic template generation, data extraction and response detection | |
WO2019041527A1 (zh) | Document chart extraction method, electronic device, and computer-readable storage medium | |
US20210192129A1 (en) | Method, system and cloud server for auto filing an electronic form | |
CN113032580B (zh) | Associated file recommendation method and system, and electronic device | |
CN112016274B (zh) | Medical text structuring method and apparatus, computer device, and storage medium | |
US11138426B2 (en) | Template matching, rules building and token extraction | |
CN110347984B (zh) | Policy page change method and apparatus, computer device, and storage medium | |
CN110166522B (zh) | Server identification method and apparatus, readable storage medium, and computer device | |
US9710769B2 (en) | Methods and systems for crowdsourcing a task | |
CN113837113B (zh) | Artificial intelligence-based document verification method, apparatus, device, and medium | |
CN112016290A (zh) | Automatic document typesetting method, apparatus, device, and storage medium | |
CN116132527B (zh) | System and method for managing signage, and data processing server | |
US10643022B2 (en) | PDF extraction with text-based key | |
US9842307B2 (en) | Methods and systems for creating tasks | |
WO2019149065A1 (zh) | Emoji-compatible display method, apparatus, terminal, and computer-readable storage medium | |
US8867838B2 (en) | Method and system for a text data entry from an electronic document | |
WO2019000697A1 (zh) | Information retrieval method, system, server, and readable storage medium | |
CN117133006A (zh) | Document verification method and apparatus, computer device, and storage medium | |
CN107784328B (zh) | Method and apparatus for recognizing old German typefaces, and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20809293; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 20809293; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.03.2022) |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 20809293; Country of ref document: EP; Kind code of ref document: A1 |