CN111813849A - Data extraction method, apparatus, device, and storage medium - Google Patents

Publication number
CN111813849A
Authority
CN
China
Prior art keywords
form template
target
template
data
cell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010957895.3A
Other languages
Chinese (zh)
Inventor
周鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202010957895.3A
Publication of CN111813849A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2282: Tablespace storage structures; Management thereof
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention provides a data extraction method, apparatus, device, and storage medium. The method includes: obtaining, from configured form templates, a target form template that matches a form to be processed, where each form template has a corresponding database table and contains field information of at least one field in the form header of that database table; determining, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data; and extracting the target data into the database table corresponding to the target form template. No extraction method needs to be configured for each individual form, which reduces the configuration workload.

Description

Data extraction method, apparatus, device, and storage medium

Technical Field

The present invention relates to the technical field of data processing, and in particular to a data extraction method, apparatus, device, and storage medium.

Background

With the continuous advancement and deepening of informatization in China, the business systems of government agencies, enterprise groups, and various industries have reached a certain level of maturity. Data is stored in many ways: in databases of different types, or in files. Some files, such as Excel files, contain at least one form, and the data in a form is usually filled in manually. Within a single business system, form formats may differ from one region or department to another, which produces a massive number of Excel files with different structures.

To meet certain business requirements, the data in these differently structured Excel files must be stored by category in the corresponding database tables. For example, meteorological data extracted from Excel files of various structures is stored in one database table, and vegetable data extracted from Excel files of various structures is stored in another. ETL technology can be used for this. ETL is a practical technology in the field of data integration. Unlike traditional data exchange, ETL completes the basic exchange (extraction, transmission, loading) and additionally provides easier-to-use and more powerful support for data transformation (that is, on-demand processing of data), so that as data flows between businesses, the data each business obtains is accurate, timely, and meets its needs.

In existing ETL technology, most schemes for extracting from massive numbers of differently structured Excel files require a person to identify the format of every form in each Excel file, configure an extraction method for each form according to its format, and only then extract the data from the form, which is time-consuming and labor-intensive.

Summary of the Invention

In view of this, the present invention provides a data extraction method, apparatus, device, and storage medium that do not require an extraction method to be configured for each form, which reduces the configuration workload.

A first aspect of the present invention provides a data extraction method, comprising:

obtaining, from configured form templates, a target form template matching a form to be processed, wherein each form template has a corresponding database table and contains field information of at least one field in the form header of the corresponding database table;

determining, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data;

extracting the target data into the database table corresponding to the target form template.

According to an embodiment of the present invention, obtaining, from the configured form templates, the target form template matching the form to be processed comprises:

traversing each row of cells in the form:

searching the configured form templates for a form template whose field information matches the content of the cells in the row;

if one is found, determining that the row of cells is the form header of the form, and determining the found form template as the target form template.

According to an embodiment of the present invention, the form has one form header, and the method further comprises:

ending the traversal of the form when the row of cells is determined to be the form header of the form.

According to an embodiment of the present invention, the method further comprises:

if none is found, checking whether the number of traversals of the form has reached the maximum number of traversals; if so, ending the traversal of the form; otherwise, continuing the traversal of the form.

According to an embodiment of the present invention, searching the configured form templates for a form template whose field information matches the content of the cells in the row is further:

when the data type of every cell in the row is a text type, searching the configured form templates for a form template whose field information matches the content of the cells in the row.
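The traversal described in the embodiments above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name `find_header_row`, the `row_matches` callback, and the treatment of empty cells as non-text are all hypothetical choices made for this example.

```python
from typing import Callable, Optional, Sequence

Row = Sequence[object]


def find_header_row(
    rows: Sequence[Row],
    row_matches: Callable[[Row], bool],
    max_traverse_times: int,
) -> Optional[int]:
    """Return the index of the first row accepted as the form header, or None."""
    for traversed, row in enumerate(rows, start=1):
        # Only rows whose cells are all text are compared against templates.
        # Assumption: an empty cell (None) counts as non-text in this sketch.
        if row and all(isinstance(cell, str) for cell in row):
            if row_matches(row):
                # A single form header is assumed, so the traversal ends here.
                return traversed - 1
        # No match on this row: stop once the traversal limit is reached.
        if traversed >= max_traverse_times:
            break
    return None
```

With a limit of 1, a form whose header sits on the second row yields `None`, which is the "give up after the maximum number of traversals" branch.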

According to an embodiment of the present invention,

the field information includes at least a field name;

searching the configured form templates for a form template whose field information matches the content of the cells in the row comprises:

for each of the configured form templates:

determining reference cells from the row of cells, the content of each reference cell matching any field name in the form template;

determining, according to the number of reference cells, whether the form template is a form template whose field information matches the content of the cells in the row.

According to an embodiment of the present invention, the content of a reference cell matching any field name in the form template means that:

the content of the reference cell is identical to a field name in the form template;

or the content of the reference cell and a field name in the form template are near-synonyms or synonyms, as determined by a set matching algorithm.
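As a rough sketch of how reference cells might be picked out, the Python below treats a cell as matching when it equals a field name or appears in a synonym table. The text only says that synonyms are determined by "a set matching algorithm", so the `SYNONYMS` dictionary, the case-insensitive `normalize` step, and all function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical synonym table standing in for the unspecified matching algorithm.
SYNONYMS: Dict[str, Set[str]] = {
    "price": {"cost", "unit price"},
    "name": {"title"},
}


def normalize(text: str) -> str:
    # Assumption: comparison ignores surrounding whitespace and letter case.
    return text.strip().lower()


def cell_matches_field(cell: str, field_name: str) -> bool:
    """True if the cell equals the field name or is listed as its synonym."""
    c, f = normalize(cell), normalize(field_name)
    if c == f:
        return True
    return c in SYNONYMS.get(f, set()) or f in SYNONYMS.get(c, set())


def reference_cells(row: Iterable[str], field_names: Iterable[str]) -> List[str]:
    """Return the cells of a row that match any field name of one template."""
    names = list(field_names)
    return [cell for cell in row if any(cell_matches_field(cell, n) for n in names)]
```

For a row `["Name", "Cost", "Remark"]` and template field names `["name", "price", "origin"]`, the first two cells qualify as reference cells and the third does not.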

According to an embodiment of the present invention,

the form template further includes a set matching degree;

determining, according to the number of reference cells, whether the form template is a form template whose field information matches the content of the cells in the row comprises:

calculating the ratio of the number of reference cells to the number of fields corresponding to the field information in the form template;

if the ratio is greater than the set matching degree in the form template, determining that the form template is a form template whose field information matches the content of the cells in the row.
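The ratio test can be written in a few lines. This sketch assumes the set matching degree is a fraction between 0 and 1; the function and parameter names are illustrative only.

```python
def template_matches(reference_cell_count: int, template_field_count: int, proportion: float) -> bool:
    """Compare the share of matched fields with the template's set matching degree."""
    if template_field_count <= 0:
        return False  # a template without fields can never match
    ratio = reference_cell_count / template_field_count
    # The text requires the ratio to be strictly greater than the set matching degree.
    return ratio > proportion
```

For example, 4 reference cells against a template with 5 fields gives a ratio of 0.8, which passes a set matching degree of 0.7. A ratio exactly equal to the set matching degree does not pass, because the comparison is strict.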

According to an embodiment of the present invention,

the field information includes at least a field name;

the form template further includes a field order corresponding to the pieces of field information and a conversion rule corresponding to at least one piece of field information, the field order being determined according to the form header of the database table corresponding to the target form template;

determining, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data, comprises:

sorting the columns in which the target cells of the form are located according to the field order in the target form template, wherein each target cell belongs to the form header of the form and its content matches any field name in the target form template;

for each row of the form located after the form header, determining the cell contents of that row in the columns in which the target cells are located, as one piece of target data.
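A minimal sketch of the column reordering and row extraction follows. To stay short it matches header cells to field names by exact equality only, and the name `extract_records` is hypothetical.

```python
from typing import List, Sequence


def extract_records(
    header: Sequence[str],
    data_rows: Sequence[Sequence[object]],
    template_fields: Sequence[str],
) -> List[List[object]]:
    """Pick the target columns in template field order and read each data row."""
    # Map each matching header cell (a "target cell") to its column index.
    column_of = {cell: idx for idx, cell in enumerate(header) if cell in template_fields}
    # Order the target columns by the field order stored in the template.
    ordered_columns = [column_of[name] for name in template_fields if name in column_of]
    # Each row after the header yields one piece of target data.
    return [[row[idx] for idx in ordered_columns] for row in data_rows]
```

A form with header `["price", "remark", "name"]` and a template ordered as `["name", "price"]` yields records with the name first and the price second, and the unmatched `remark` column is dropped.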

According to an embodiment of the present invention,

before determining the cell contents of the row in the columns in which the target cells are located, the method further comprises: if the row contains a merged cell, splitting the merged cell and filling the content of the merged cell into all of the cells produced by the split;

after determining the cell contents of the row in the columns in which the target cells are located, the method further comprises: converting at least one of the determined cell contents according to the conversion rule in the target form template, and taking the resulting cell contents as one piece of target data.
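The two steps above can be sketched as follows. The representation is an assumption: a merged cell is given as an inclusive `(first_row, last_row, first_col, last_col)` range whose value sits in the top-left cell, and a conversion rule is a plain Python callable keyed by column position. Neither format comes from the patent text.

```python
from datetime import datetime
from typing import Callable, Dict, List, Sequence, Tuple

MergedRange = Tuple[int, int, int, int]  # first_row, last_row, first_col, last_col


def fill_merged(grid: Sequence[Sequence[object]], merged: Sequence[MergedRange]) -> List[List[object]]:
    """Split each merged range and copy its value into every cell it covered."""
    filled = [list(row) for row in grid]
    for r0, r1, c0, c1 in merged:
        value = filled[r0][c0]  # assumption: the value is stored in the top-left cell
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                filled[r][c] = value
    return filled


def apply_rules(record: Sequence[object], rules: Dict[int, Callable[[object], object]]) -> List[object]:
    """Apply a conversion rule to each position that has one; leave the rest as-is."""
    return [rules[i](value) if i in rules else value for i, value in enumerate(record)]


def to_iso_date(text: object) -> str:
    # Example rule: reformat a "YYYY/MM/DD" string as "YYYY-MM-DD".
    return datetime.strptime(str(text), "%Y/%m/%d").date().isoformat()
```

Filling first and converting second means a row that inherited its value from a merged cell above it is converted like any other row.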

According to an embodiment of the present invention, the method further comprises:

when a form template needs to be configured for any database table, checking whether a correspondence between the database table and the form template to be configured already exists in a specified database; if not, proceeding with the configuration of the form template, and storing the correspondence between the database table and the form template in the specified database.
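One way to picture this check is with an in-memory SQLite database standing in for the "specified database". The `table_template` table, its columns, and the function name are hypothetical; the patent does not describe how the correspondence is stored.

```python
import sqlite3


def register_template(conn: sqlite3.Connection, table_name: str, template_id: str) -> bool:
    """Store the table-to-template correspondence unless it already exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_template ("
        "table_name TEXT, template_id TEXT, PRIMARY KEY (table_name, template_id))"
    )
    exists = conn.execute(
        "SELECT 1 FROM table_template WHERE table_name = ? AND template_id = ?",
        (table_name, template_id),
    ).fetchone()
    if exists:
        return False  # correspondence already configured, nothing to do
    # Not configured yet: the template would be configured here, then recorded.
    conn.execute("INSERT INTO table_template VALUES (?, ?)", (table_name, template_id))
    conn.commit()
    return True
```

Calling the function twice with the same pair returns `True` the first time and `False` the second, so a template is never configured twice for the same table.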

According to an embodiment of the present invention, obtaining, from the configured form templates, the target form template matching the form to be processed is further:

when the category of data stored in the form is known, obtaining the target form template matching the form to be processed from the form templates configured for the database table that stores that category of data.
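Narrowing the candidates by category might look like the sketch below, which assumes each template is a dictionary with a hypothetical `category` key naming the kind of data its database table stores.

```python
from typing import Dict, List, Optional, Sequence


def candidate_templates(templates: Sequence[Dict[str, object]], category: Optional[str] = None) -> List[Dict[str, object]]:
    """Return only the templates configured for tables storing the given category."""
    if category is None:
        # Category unknown: every configured template remains a candidate.
        return list(templates)
    return [t for t in templates if t.get("category") == category]
```

Restricting the search this way reduces the number of templates each row has to be compared against.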

According to an embodiment of the present invention, the form is any one of the forms in a file to be processed, the file to be processed is in a specified file format, and the file to be processed contains at least one form.

A second aspect of the present invention provides a data extraction apparatus, comprising:

a target form template determination module, configured to obtain, from configured form templates, a target form template matching a form to be processed, wherein each form template has a corresponding database table and contains field information of at least one field in the form header of the corresponding database table;

a target data determination module, configured to determine, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data;

a data extraction module, configured to extract the target data into the database table corresponding to the target form template.

According to an embodiment of the present invention, when obtaining, from the configured form templates, the target form template matching the form to be processed, the target form template determination module is specifically configured to:

traverse each row of cells in the form:

search the configured form templates for a form template whose field information matches the content of the cells in the row;

if one is found, determine that the row of cells is the form header of the form, and determine the found form template as the target form template.

According to an embodiment of the present invention, the form has one form header, and the target form template determination module is further configured to:

end the traversal of the form when the row of cells is determined to be the form header of the form.

According to an embodiment of the present invention, the target form template determination module is further configured to:

if none is found, check whether the number of traversals of the form has reached the maximum number of traversals; if so, end the traversal of the form; otherwise, continue the traversal of the form.

According to an embodiment of the present invention, when searching the configured form templates for a form template whose field information matches the content of the cells in the row, the target form template determination module is further configured to:

when the data type of every cell in the row is a text type, search the configured form templates for a form template whose field information matches the content of the cells in the row.

According to an embodiment of the present invention,

the field information includes at least a field name;

when searching the configured form templates for a form template whose field information matches the content of the cells in the row, the target form template determination module is specifically configured to:

for each of the configured form templates:

determine reference cells from the row of cells, the content of each reference cell matching any field name in the form template;

determine, according to the number of reference cells, whether the form template is a form template whose field information matches the content of the cells in the row.

According to an embodiment of the present invention, the content of a reference cell matching any field name in the form template means that:

the content of the reference cell is identical to a field name in the form template;

or the content of the reference cell and a field name in the form template are near-synonyms or synonyms, as determined by a set matching algorithm.

According to an embodiment of the present invention,

the form template further includes a set matching degree;

when determining, according to the number of reference cells, whether the form template is a form template whose field information matches the content of the cells in the row, the target form template determination module is specifically configured to:

calculate the ratio of the number of reference cells to the number of fields corresponding to the field information in the form template;

if the ratio is greater than the set matching degree in the form template, determine that the form template is a form template whose field information matches the content of the cells in the row.

According to an embodiment of the present invention,

the field information includes at least a field name;

the form template further includes a field order corresponding to the pieces of field information and a conversion rule corresponding to at least one piece of field information, the field order being determined according to the form header of the database table corresponding to the target form template;

when determining, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data, the target data determination module is specifically configured to:

sort the columns in which the target cells of the form are located according to the field order in the target form template, wherein each target cell belongs to the form header of the form and its content matches any field name in the target form template;

for each row of the form located after the form header, determine the cell contents of that row in the columns in which the target cells are located, as one piece of target data.

According to an embodiment of the present invention,

before determining the cell contents of the row in the columns in which the target cells are located, the target data determination module is further configured to: if the row contains a merged cell, split the merged cell and fill the content of the merged cell into all of the cells produced by the split;

after determining the cell contents of the row in the columns in which the target cells are located, the target data determination module is further configured to: convert at least one of the determined cell contents according to the conversion rule in the target form template, and take the resulting cell contents as one piece of target data.

According to an embodiment of the present invention, the apparatus further comprises:

a configuration module, configured to, when a form template needs to be configured for any database table, check whether a correspondence between the database table and the form template to be configured already exists in a specified database; if not, proceed with the configuration of the form template, and store the correspondence between the database table and the form template in the specified database.

According to an embodiment of the present invention, when obtaining, from the configured form templates, the target form template matching the form to be processed, the target form template determination module is further configured to:

when the category of data stored in the form is known, obtain the target form template matching the form to be processed from the form templates configured for the database table that stores that category of data.

According to an embodiment of the present invention, the form is any one of the forms in a file to be processed, the file to be processed is in a specified file format, and the file to be processed contains at least one form.

A third aspect of the present invention provides an electronic device, comprising a processor and a memory; the memory stores a program that can be called by the processor; when the processor executes the program, the data extraction method described in the foregoing embodiments is implemented.

A fourth aspect of the present invention provides a machine-readable storage medium on which a program is stored; when the program is executed by a processor, the data extraction method described in the foregoing embodiments is implemented.

The embodiments of the present invention have the following beneficial effects:

In the embodiments of the present invention, form templates corresponding to multiple database tables are configured in advance, and each form template contains field information of at least one field in the form header of the corresponding database table. When data needs to be extracted from a form, a target form template matching the form is obtained from the configured form templates, the data corresponding to the field information is determined from the form based on the field information in the target form template to obtain target data, and the target data is extracted into the database table corresponding to the target form template. In this way, forms with different structures that match the same form template can all be extracted using that one template. There is no need to identify form formats manually or to configure an extraction method for each form, which can greatly reduce the workload and time required for user configuration.

Description of the Drawings

FIG. 1 is a schematic flowchart of a data extraction method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a data extraction method according to another embodiment of the present invention;

FIG. 3 is a structural block diagram of a data extraction apparatus according to an embodiment of the present invention;

FIG. 4 is a structural block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.

The terminology used in the present invention is for the purpose of describing particular embodiments only and is not intended to limit the present invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.

It should be understood that although the terms first, second, third, and so on may be used in the present invention to describe various objects, the objects should not be limited by these terms. The terms are only used to distinguish objects of the same type from one another. For example, without departing from the scope of the present invention, a first object may also be called a second object, and similarly a second object may also be called a first object. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".

The data extraction method of the embodiments of the present invention is described in more detail below, but the invention is not limited to this description. In one embodiment, referring to FIG. 1, a data extraction method applied to an electronic device may include the following steps:

S100: obtain, from configured form templates, a target form template matching a form to be processed, where each form template has a corresponding database table and contains field information of at least one field in the form header of the corresponding database table;

S200: determine, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data;

S300: extract the target data into the database table corresponding to the target form template.
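Steps S100 to S300 can be strung together in a compact Python sketch. Everything concrete here is an assumption made for illustration: templates are plain dictionaries, header cells are matched by exact equality, SQLite stands in for the target database, and the table and field names are invented.

```python
import sqlite3

# Hypothetical configured templates, each tied to one database table.
TEMPLATES = [
    {"table": "vegetables", "fields": ["name", "price"], "proportion": 0.5},
    {"table": "weather", "fields": ["city", "temperature"], "proportion": 0.5},
]


def match_template(rows, templates):
    """S100: find the first row that matches a template; return (template, row index)."""
    for index, row in enumerate(rows):
        for template in templates:
            hits = sum(1 for cell in row if cell in template["fields"])
            if hits / len(template["fields"]) > template["proportion"]:
                return template, index
    return None, None


def select_data(rows, template, header_index):
    """S200: read the matched columns of each row after the header, in template order."""
    header = rows[header_index]
    return [
        # A field missing from the form is filled with None in this sketch.
        tuple(row[header.index(n)] if n in header else None for n in template["fields"])
        for row in rows[header_index + 1:]
    ]


def load(conn, template, records):
    """S300: insert the target data into the table that belongs to the template."""
    placeholders = ", ".join("?" for _ in template["fields"])
    # The table name comes from trusted configuration, never from form content.
    conn.executemany(f'INSERT INTO {template["table"]} VALUES ({placeholders})', records)
    conn.commit()
```

A form with a title row, a header row `["price", "remark", "name"]`, and one data row would be matched to the `vegetables` template at row index 1 and loaded as a single `(name, price)` record.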

本发明实施例中,数据抽取方法的执行主体为电子设备。电子设备比如可以是计算机设备或由多台计算机设备组成的服务器,当然,电子设备的具体类型不限于此,具有一定的数据处理能力即可。In the embodiment of the present invention, the execution body of the data extraction method is an electronic device. For example, the electronic device may be a computer device or a server composed of multiple computer devices. Of course, the specific type of the electronic device is not limited to this, as long as it has a certain data processing capability.

本发明实施例中,可以改进以由ETL应用程序,或者开发新的ETL应用程序,使得电子设备在运行该ETL应用程序时,可以实现上述的数据抽取方法。In this embodiment of the present invention, an ETL application program can be improved, or a new ETL application program can be developed, so that the above-mentioned data extraction method can be implemented when the electronic device runs the ETL application program.

In step S100, a target form template matching the form to be processed is obtained from the configured form templates. Each form template has a corresponding database table, and the form template contains field information for at least one field in the header of the corresponding database table.

Different database tables can store different categories of data. For example, one database table may store meteorological data and another may store vegetable data; further database tables may store other categories, such as personnel data. Each category of data may be spread across a large number of files with different structures, such as Excel files, so the relevant data must first be extracted from those differently structured Excel files.

A corresponding form template can be configured in advance for each database table. The form template contains field information for at least one field in the header of the corresponding database table, and the header of the database table indicates which fields the data to be stored in that table contains and in what order. Each database table may correspond to one or more form templates. As an example of the multiple-template case, suppose the header of a database table contains 5 fields and the table can store both 5-field data and 4-field data; the table can then be configured with one form template holding field information for 5 fields and another holding field information for 4 fields.

Based on the data storage requirements (for example, the fields of the data to be stored), the required field information can be obtained automatically from a file that stores a given category of data, such as an Excel file storing vegetable data, and the corresponding form template can be generated from that field information. Alternatively, a user can create the required form template through manual configuration.

A form template mainly includes several key attributes, such as the set matching degree Proportion, the maximum number of traversals maxTraverseTimes, and the field information Columns. The field information Columns may in turn include the field type Type, field name Name, time format Format, field length Length, field precision Precision, and so on; see Table (1) below. The actual attributes are not limited to these.

[Table (1): form template attributes; the table image is not reproduced in this text.]

In one example, a user works out the fields of the data to be stored in a given database table, configures the corresponding field information, and enters it on the ETL program page, which then automatically generates a form template in JSON format as the form template for that table. Taking the field information Columns in Table (1) as an example, within the form template Columns serves as the parent parameter of the field type Type, field name Name, time format Format, field length Length, and field precision Precision.
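As an illustration of what such a JSON form template might look like, the following minimal sketch builds one in Python. The attribute names (Proportion, maxTraverseTimes, Columns, Type, Name, Format, Length, Precision) come from the description above; every value, and the choice of a weather table, is an assumption made for illustration only.

```python
import json

# Hypothetical template for a weather table. Attribute names follow the
# description; all values (0.8, 10, the types and lengths) are assumptions.
weather_template = {
    "Proportion": 0.8,        # set matching degree
    "maxTraverseTimes": 10,   # maximum number of traversals
    "Columns": [              # parent parameter of the per-field attributes
        {"Name": "statistical time", "Type": "date", "Format": "yyyy-MM-dd"},
        {"Name": "weather", "Type": "string", "Length": 16},
        {"Name": "temperature", "Type": "decimal", "Length": 5, "Precision": 1},
    ],
}

# Serialize to JSON, as the ETL page is described as doing, and read it back.
template_json = json.dumps(weather_template, ensure_ascii=False, indent=2)
restored = json.loads(template_json)
field_names = [column["Name"] for column in restored["Columns"]]
```

The order of the entries under Columns is kept by the JSON round trip, which matters later when the template's field order is used to arrange extracted values.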

When obtaining the target form template that matches the form to be processed from the configured form templates, the form header of the form can first be determined (it may be specified in advance or determined with the help of the form templates), and the matching target form template is then selected from the configured form templates according to the field information of the fields in the form header. For example, the form header contains fields corresponding to all or most of the field information in the target form template. The approach is not limited to this and is described in more detail below.

In one embodiment, the method further includes: when a form template needs to be configured for any database table, checking whether a designated database already holds a correspondence between that database table and the form template to be configured; if not, proceeding with the configuration of the form template and storing the correspondence between the database table and the form template in the designated database.

The designated database may be any database that has been designated; no specific limitation is imposed.

The correspondences between configured database tables and form templates are stored in the designated database. Later, when a form template needs to be configured for a database table, it is enough to check whether the designated database holds the correspondence between that table and that template. If it does not, no such form template has been configured for the table before; the form template is then configured, and the correspondence is stored in the designated database.

Optionally, if the designated database already holds the correspondence between the database table and the form template to be configured, the configuration need not be repeated and the form template can be used directly. This avoids configuring the same form template twice.
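The check-before-configure logic can be sketched as follows. This is only an illustration: a Python dict stands in for the designated database, and the function name `configure_template`, the key layout, and the `build_template` callback are all hypothetical.

```python
def configure_template(mapping_store, table_name, template_id, build_template):
    """Configure a template only if the (table, template) pair is not stored.

    mapping_store is a dict standing in for the designated database
    (an assumption); build_template is a hypothetical callback that
    performs the actual configuration.
    Returns (template, created), where created is False on reuse.
    """
    key = (table_name, template_id)
    if key in mapping_store:
        # Correspondence already recorded: reuse it, skip reconfiguration.
        return mapping_store[key], False
    template = build_template()
    mapping_store[key] = template  # record the new correspondence
    return template, True
```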

In one embodiment, obtaining the target form template that matches the form to be processed from the configured form templates may further be: when the category of data stored in the form is known, obtaining the target form template from the form templates configured for the database tables that store that category of data. In other words, the "configured form templates" referred to below are the form templates configured for the database tables that store that category of data.

For example, if the form stores vegetable data, the target form template matching the form to be processed is obtained from the form templates configured for the database tables that store vegetable data.

In one embodiment, the form is any one of the forms in a file to be processed, the file to be processed is in a specified file format, and the file to be processed contains at least one form.

The specified file format may be, for example, xls or xlsx, in which case the file to be processed is an Excel file. An Excel file may contain one or more sheets, and the form may be any sheet in the Excel file.

Optionally, the electronic device searches a folder holding files of various types, using a specified regular expression, for Excel files whose file names match the regular expression. It reads one or more of these Excel files at a time (the number read can be specified, and the reading order is not limited) and traverses the sheets in the files it has read, treating each sheet it reaches as the form to be processed. In this example the regular expression may be ".*xlsx", which selects all Excel files with the xlsx suffix; this is only an example and the expression is not limited to it.
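The file selection step might look like the sketch below, which uses only the Python standard library. The pattern ".*xlsx" is the one given in the text; the function names and the batching helper are assumptions, and opening the sheets themselves is left out because the description does not name a spreadsheet library.

```python
import os
import re


def match_excel_names(names, pattern=r".*xlsx"):
    """Return the file names that match the regular expression in full."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


def find_excel_files(folder, pattern=r".*xlsx"):
    """List matching file names in a folder (sorted here only for repeatability)."""
    return match_excel_names(sorted(os.listdir(folder)), pattern)


def batches(items, batch_size):
    """Yield the items in groups of batch_size, the configurable read count."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Note that ".*xlsx" also accepts a name such as "notesxlsx" that lacks the dot; a stricter pattern such as r".*\.xlsx" would rule that out.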

In step S200, the data corresponding to the field information in the target form template is determined from the form, to obtain the target data.

The field information may include field names, and the data corresponding to the field names can be determined from the form to obtain the target data. For example, each piece of target data determined in this way contains a value for every field named in the target form template.

For example, the target form template may include the field names of three fields: weather, temperature, and statistical time. The form header of the form also contains these three fields, plus two irrelevant fields a and b, and every data row contains all of these fields. The form is shown in Table (2) below:

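[Table (2): example form containing the fields weather, temperature, and statistical time plus irrelevant fields a and b; the table image is not reproduced in this text.]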

For example, the following three pieces of target data can be determined:

Sunny, 21.5, 2020-05-23;

Rain, 18, 2020-05-24;

Sunny, 28, 2020-05-25.

In step S300, the target data is extracted into the database table corresponding to the target form template.

All target data may be extracted into the database table corresponding to the target form template after all of it has been determined; alternatively, each piece of target data may be extracted into that database table as soon as it is determined. No specific limitation is imposed.

After the data has been determined, it may also be transformed according to set conversion rules before being stored in the database table corresponding to the target form template. For example, the fields in the data can be sorted into the field order of the database table header, and the sorted data extracted into the database table. This is only an example, and the conversion rules are not limited to it. Conversion rules can be preset in the form template.

In this embodiment of the present invention, form templates corresponding to multiple database tables are configured in advance, and each form template contains field information for at least one field in the header of its corresponding database table. When data needs to be extracted from a form, the target form template matching the form is obtained from the configured form templates, the data corresponding to the field information in the target form template is determined from the form to obtain the target data, and the target data is extracted into the database table corresponding to the target form template. With this approach, forms with different structures that match the same form template can have their data extracted on the basis of that one template. There is no need to identify the form format manually or to configure a separate extraction method for each form, which greatly reduces the workload and time required for user configuration.

For example, another form is shown in Table (3) below:

[Table (3): example form in which the weather column and irrelevant field a are swapped relative to Table (2); the table image is not reproduced in this text.]

Comparing Table (2) with Table (3), their structures differ: the positions of "weather" and "irrelevant field a" are swapped. However, the fields to be extracted ("weather", "temperature", and "statistical time") are the same in both, so both forms match the same form template and their data is extracted on the basis of that template. The matching is automatic; there is no need to identify the formats of Table (2) and Table (3) separately, nor to set up a separate extraction method for each.

In one embodiment, obtaining the target form template matching the form to be processed from the configured form templates in step S100 may include the following steps:

Traverse each row of cells in the form:

Search the configured form templates for a form template whose field information matches the content of that row of cells;

If one is found, determine that the row of cells is the form header of the form, and determine the form template found to be the target form template.

In most cases the first row of cells in a form is the form header, but there are other cases; for example, the first row of cells may be the title of the form.

In this embodiment, whether the field information contained in a form template matches the content of a row of cells is judged row by row. When they match, the form template is identified as the target form template and, at the same time, the row of cells is identified as the form header, so no separate step is needed to determine the form header.

In this embodiment, the rows of cells in the form can be read by traversal to implement the above steps. That is, for each form template, the rows of cells in the form are traversed, and the field information in the form template is checked against the content of the row reached. The matching rule is not limited; for example, it may require that the field names contained in all the field information of the form template appear in that row of cells. If they match, the row reached is determined to be the form header of the form, and the form template found is determined to be the target form template.

Optionally, the form has exactly one form header, and the method further includes: ending the traversal of the form when the row reached is determined to be the form header. Since the target form template has been found and the form header determined, subsequent rows need not be traversed, so the traversal can end, which improves processing efficiency.

If the form has more than one form header, then when the row reached is determined to be a form header, traversal of the form can continue; alternatively, it can be checked whether the number of traversals of the form has reached the maximum number of traversals, ending the traversal if so and continuing otherwise.

Optionally, the method further includes: if no form template whose field information matches the content of the row of cells is found among the configured form templates, checking whether the number of traversals of the form has reached the maximum number of traversals, and if so, ending the traversal of the form. Generally, if no match has been found after many traversals, the form most likely has no match. Defining a maximum number of traversals limits how many rows of the form are traversed and ends the traversal in good time, avoiding the waste of resources caused by excessive traversal.

The maximum number of traversals can be defined in the form template, and different form templates may define different values. In that case, checking whether the number of traversals of the form has reached the maximum means checking against the maximum number of traversals in the form template concerned. The maximum number of traversals can also be defined in the electronic device; no limitation is imposed here.
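The traversal described above can be sketched as follows for the single-header case. This is a minimal illustration under several assumptions: rows are plain Python lists of cell values, templates are dicts shaped like the JSON template, the matching rule is the example rule from the text (every template field name appears in the row), and the fallback limit of 10 is invented.

```python
def find_header(rows, templates, default_max_traverse=10):
    """Return (row_index, template) for the first matching row, else None.

    For each template, rows are traversed from the top; traversal stops
    at the first match or once the template's maxTraverseTimes is reached.
    """
    for template in templates:
        max_rows = template.get("maxTraverseTimes", default_max_traverse)
        names = {column["Name"] for column in template["Columns"]}
        for index, row in enumerate(rows):
            if index >= max_rows:
                break  # traversal limit reached without a match
            if names.issubset(set(row)):
                return index, template  # this row is the form header
    return None
```

For a sheet whose first row is a title and whose second row is the header, the function returns index 1 together with the matching template; with maxTraverseTimes set to 1 it gives up after the title row.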

In one embodiment, searching the configured form templates for a form template whose field information matches the content of the row of cells is further:

When the data type of every cell in the row is the text type, searching the configured form templates for a form template whose field information matches the content of the row of cells.

In general, every cell in the form header of a form is of the text type. Based on this characteristic, the search for a matching form template is carried out only when the data type of every cell in the row is text. When at least one cell in the row is of a non-text type, the search for a matching form template can be skipped for that row, which speeds up processing.
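A minimal sketch of this pre-filter is shown below. It assumes cell values have already been read into Python objects, so that text cells are `str` and numeric or date cells are other types; the function name is hypothetical, and how empty cells should be treated is not stated in the description.

```python
def is_all_text(row):
    """Return True only if the row is non-empty and every cell is text."""
    return bool(row) and all(isinstance(cell, str) for cell in row)
```

A caller would run the template search only on rows for which `is_all_text` returns True and skip the rest.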

In one embodiment, the field information includes at least a field name;

Searching the configured form templates for a form template whose field information matches the content of the row of cells includes:

For each form template among the configured form templates:

Determining reference cells from the row of cells, where the content of a reference cell matches any field name in the form template;

Determining, according to the number of reference cells, whether the form template is a form template whose field information matches the content of the row of cells.

A row of a form may contain multiple cells. The more cells in a row whose content matches a field name in the form template (that is, the more reference cells), the more likely the row is the form header. The number of reference cells can therefore be used to decide whether the form template is one whose field information matches the content of the row of cells.

For example, when the number of reference cells reaches a set number, the form template is determined to be a form template whose field information matches the content of the row of cells. The set number can be defined in the form template or preset in the electronic device, and its value can be chosen as needed; for example, it may be greater than 1.

Optionally, for each row of cells, the content and position index of each reference cell can be recorded in a cache as the reference cell is determined. Optionally, the position index indicates the column in which the reference cell is located. Taking Table (2) as an example, the cache may hold the following information:

"weather": C2

"temperature": C4

"statistical time": C5

Here C2 indicates that the cell containing "weather" is in column 2, C4 indicates that the cell containing "temperature" is in column 4, and C5 indicates that the cell containing "statistical time" is in column 5. C2, C4, and C5 can be represented by numerical values; the specific values are not limited as long as they identify the corresponding columns.

Further, if it is determined from the number of reference cells that the form template is not a form template whose field information matches the content of the row of cells, the cache can be cleared.
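The cache of reference cells can be sketched as a mapping from cell content to a 1-based column index. In this illustration the function name and parameters are hypothetical, exact string equality is used as the matching rule, and the column layout in the usage note is inferred from the cached indexes C2, C4, and C5 rather than taken from the table image.

```python
def collect_reference_cells(row, field_names, required_count):
    """Record content -> column index for cells matching a template field.

    Columns are numbered from 1. If fewer than required_count reference
    cells are found, the cache is cleared, as for a non-matching row.
    """
    cache = {}
    for column, content in enumerate(row, start=1):
        if content in field_names:
            cache[content] = column
    if len(cache) < required_count:
        cache.clear()
    return cache
```

With the header row `["a", "weather", "b", "temperature", "statistical time"]` and the three template field names, the cache holds weather at 2, temperature at 4, and statistical time at 5.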

In one embodiment, the content of the reference cell matching any field name in the form template means:

The content of the reference cell is identical to a field name in the form template;

Or, the content of the reference cell is a near-synonym or synonym of a field name in the form template, where near-synonyms and synonyms are determined by a set matching algorithm.

In other words, the rule for matching the content of a reference cell against a field name in the form template may be an exact string match (the two are identical) or a near-synonym or synonym match determined by the set matching algorithm. The matching rule is not limited to these; for example, it may also be based on text similarity.

The set matching algorithm may include, for example, near-synonym matching based on NLP (Natural Language Processing); no specific limitation is imposed.

In this embodiment, near-synonym or synonym matching is not an exact matching method. It allows the same form template to be used for extracting data from more forms holding the same category of data, which further reduces the number of templates that need to be configured and also helps locate the form header more accurately.
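The two matching rules can be sketched as follows. The description leaves the synonym algorithm open (mentioning NLP-based matching only as one option), so this illustration substitutes a hand-written lookup table; the table contents and the function name are assumptions, not the patent's method.

```python
# Hypothetical synonym table standing in for the "set matching algorithm".
SYNONYMS = {
    "temperature": {"temp", "air temperature"},
    "statistical time": {"date", "record time"},
}


def names_match(cell_content, field_name, synonyms=SYNONYMS):
    """Return True on an exact match or a listed synonym of the field name."""
    if cell_content == field_name:
        return True  # exact string match
    return cell_content in synonyms.get(field_name, set())
```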

In one embodiment, the form template further includes a set matching degree;

Determining, according to the number of reference cells, whether the form template is a form template whose field information matches the content of the row of cells includes:

Calculating the ratio of the number of reference cells to the number of fields corresponding to the field information in the form template;

If the ratio is greater than the set matching degree in the form template, determining that the form template is a form template whose field information matches the content of the row of cells.

Take Table (2) as the form and assume that the fields corresponding to the field information in the form template are weather, temperature, and statistical time, so the number of fields is 3. When the first row of cells is traversed, 3 reference cells are found (the contents of 3 reference cells are cached). Their contents are weather, temperature, and statistical time, which correspond to the 3 pieces of field information in the form template, so the number of reference cells is also 3. The ratio of the number of reference cells to the number of fields in the form template is therefore 3/3 = 1. Assuming the set matching degree in the form template is 0.8, since 1 is greater than 0.8, the first row of cells is determined to be the form header, and the form template is a form template whose field information matches the content of the row of cells.

In this embodiment, the ratio only needs to exceed the set matching degree; a complete match is not required. One form template can therefore match more forms, which further reduces the number of templates that need to be configured.
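The ratio test can be written in a few lines. The sketch below assumes a template dict with the Proportion and Columns attributes named in the description; the function name is hypothetical.

```python
def template_matches(reference_count, template):
    """Compare (reference cells / template fields) with the set matching degree."""
    ratio = reference_count / len(template["Columns"])
    # The text requires the ratio to be strictly greater than Proportion.
    return ratio > template["Proportion"]
```

With 3 template fields and a matching degree of 0.8, three reference cells give a ratio of 1 and the template matches, while two reference cells give 2/3, which is below 0.8, so it does not.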

In one embodiment, when no target form template matching the form exists among the configured form templates, information about the form can be output to a specified file (for example, but not limited to, a txt file) or a specified database table for later auditing.

The information may include, for example, the name of the file containing the form and the reason the target form template could not be obtained. Possible reasons include an error in the field information of a form template (for example, "price" written as "value") or an incorrect file format (for example, a file that should be in html format but is presented as xlsx).

Through manual auditing, auditors can use the error cause recorded in the specified file or specified database table to decide whether the content of the form should be synchronized to a problem-data database table, whether the form template should be modified, whether the file format should be changed, and so on.

In one embodiment, the field information includes at least a field name;

The form template further includes a field order for the field information and a conversion rule corresponding to at least one piece of field information, where the field order is determined according to the header of the database table corresponding to the target form template;

In step S200, determining from the form the data corresponding to the field information in the target form template, to obtain the target data, may include the following steps:

S201: Sort the columns containing the target cells in the form according to the field order in the target form template, where a target cell belongs to the form header of the form and its content matches a field name in the target form template;

S202: For each row in the form located after the form header, determine the content of the cells in that row that lie in the columns of the target cells, to obtain one piece of target data.

Once the form header has been determined, the cell contents recorded in the cache are the contents of the target cells. The rule for matching the content of a target cell against a field name in the target form template is the same as the rule described above for reference cells and is not repeated here.

The column of each target cell in the form can be determined from the cell contents and position indexes recorded in the cache. The field order in the target form template can be the same as the order of the corresponding fields in the header of the corresponding database table, and thus reflects the mapping between the fields in the target form template and the fields in that database table. After the columns containing the target cells are sorted by this field order, the corresponding fields in the form are arranged in the field order of the database table.

Continuing with Table (2) as the form, assume that the fields in the target form template are ordered as statistical time, weather, temperature, and that the fields in the corresponding database table are in the same order. After the columns containing the target cells in the form are sorted by this field order, Table (4) below is obtained:

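[Table (4): the form of Table (2) after its target columns are sorted into the order statistical time, weather, temperature; the table image is not reproduced in this text.]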

After sorting, for each row in the form located after the form header, the content of the cells in that row that lie in the columns of the target cells is determined, to obtain one piece of target data.

For example, in Table (4) above, the form header is the first row, and the rows after it are the second to fourth rows, which hold the data to be extracted. Taking the second row as an example, the contents of the cells in the columns of the target cells are 2020-05-23, Sunny, and 21.5, which together form one piece of target data.

This sorting ensures that the determined cell contents are recorded at the correct field positions in the database table and that the fields of every record in the database table appear in a consistent order.
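Steps S201 and S202 can be sketched together: look up the column of each template field in the header row, in the template's field order, then read those columns from every later row. The function name is hypothetical, exact matching is assumed, and the values shown for the irrelevant fields a and b in the usage note are invented because the table image is not available.

```python
def extract_rows(rows, header_index, field_order):
    """Return one list of values per data row, arranged in field_order.

    rows is a list of lists; rows[header_index] is the form header.
    Every name in field_order is assumed to be present in the header.
    """
    header = rows[header_index]
    columns = [header.index(name) for name in field_order]
    return [[row[c] for c in columns] for row in rows[header_index + 1:]]
```

For a header `["a", "weather", "b", "temperature", "statistical time"]` and a data row `["a1", "sunny", "b1", 21.5, "2020-05-23"]`, the field order statistical time, weather, temperature produces `["2020-05-23", "sunny", 21.5]`.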

Optionally, if the name of a field in the target form template (the missing field, for short) does not appear in the form header, so that fewer cell contents are extracted than the number of fields required, the position corresponding to the missing field can be filled with a null value; that is, the position in the target data corresponding to the missing field may be null. For example, if the field order in the target form template is statistical time, weather, temperature, PM2.5 value, then after 2020-05-23, Sunny, and 21.5 have been determined, a null value (for example 0) is appended after 21.5, giving the target data 2020-05-23, Sunny, 21.5, 0.
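Filling missing fields can be sketched as a small variation on the extraction step. In this illustration the function name and the `fill_value` parameter are hypothetical; the text gives 0 as one possible null value, and Python's `None` is used as the default here.

```python
def extract_with_defaults(rows, header_index, field_order, fill_value=None):
    """Extract rows in field_order, filling fields absent from the header."""
    header = rows[header_index]
    # None marks a template field that the form header does not contain.
    positions = [header.index(n) if n in header else None for n in field_order]
    return [
        [row[p] if p is not None else fill_value for p in positions]
        for row in rows[header_index + 1:]
    ]
```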

In one embodiment, before the contents of the cells lying in the columns of the target cells are determined for the row, the method further includes:

If a merged cell exists in the row, splitting the merged cell and filling its content into all of the cells produced by the split.

Each cell has corresponding attribute information, which can indicate whether the cell is a merged cell. If a merged cell exists in the row, it is split, for example into cells of the minimum size, and its content is filled into all of the resulting cells.

For example, consider the form shown in Table (5):

Figure DEST_PATH_IMAGE010

The first row is the form header. If merged cells are not split, the following two records are determined:

1. Class 8 of the Seventh Company, Zhang San, 202001

2. (empty), Li Si, 202002

Obviously, the team information is missing from Li Si's record, so the data is incomplete.

In this embodiment, by contrast, when the traversal reaches the second row, the cell containing "Class 8 of the Seventh Company" is detected to be a merged cell. The merged cell is split, and "Class 8 of the Seventh Company" is filled into all of the resulting cells, giving Table (6):

Figure DEST_PATH_IMAGE012

After the split, the following two records are determined:

1. Class 8 of the Seventh Company, Zhang San, 202001

2. Class 8 of the Seventh Company, Li Si, 202002

That is, the problem of incomplete data is solved.
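The splitting of merged cells can be illustrated with a small sketch. It assumes the form has already been read into a list of rows and that the merged ranges are known as zero-based, inclusive `(first_row, first_col, last_row, last_col)` tuples; the function name `unmerge` and this representation are hypothetical, and how the ranges are read from the file depends on the spreadsheet library in use.

```python
def unmerge(grid, merged_ranges):
    """Return a copy of grid in which every merged range is split and
    each resulting cell carries the content of the merged cell."""
    result = [list(row) for row in grid]      # do not modify the caller's data
    for first_row, first_col, last_row, last_col in merged_ranges:
        value = result[first_row][first_col]  # content is stored in the top-left cell
        for r in range(first_row, last_row + 1):
            for c in range(first_col, last_col + 1):
                result[r][c] = value
    return result


table = [
    ["team", "name", "number"],
    ["Class 8 of the Seventh Company", "Zhang San", "202001"],
    [None, "Li Si", "202002"],                # covered by the merged cell above
]
split = unmerge(table, [(1, 0, 2, 0)])
print(split[2])
# prints ['Class 8 of the Seventh Company', 'Li Si', '202002']
```

The column names in the header row are invented for the example, since Tables (5) and (6) are only available as figures.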

In one embodiment, after the contents of the cells lying in the columns of the target cells are determined for the row, the method further includes:

Converting at least one of the determined cell contents according to the conversion rules in the target form template, and taking the resulting cell contents as one piece of target data.

The conversion may be based on the field length, field precision, field type, and field format defined in the target form template. For example, for a cell whose field type is a time type, the cell content is converted using the time format predefined in the target form template, yyyy (year)-MM (month)-dd (day). This is only an example; other conversions are possible in practice.

With this conversion, the data output to the database table has a more uniform format and is easier to process further.
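As an illustration of a time-type conversion rule, the sketch below normalizes a date string to the yyyy-MM-dd format mentioned above (written `%Y-%m-%d` in Python). The list of accepted input formats and the function name `to_template_date` are assumptions; the text does not specify which input formats a conversion rule must accept.

```python
from datetime import datetime

# Input formats the sketch tries in order (an assumption for illustration).
INPUT_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")


def to_template_date(text, out_format="%Y-%m-%d"):
    """Convert a date string to the format predefined in the template."""
    cleaned = text.strip()
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime(out_format)
        except ValueError:
            continue                      # try the next known input format
    raise ValueError(f"unrecognized date: {text!r}")


print(to_template_date("2020/05/23"))
# prints 2020-05-23
```

Rules for field length or precision would follow the same pattern: one small function per field type, selected by the field information in the target form template.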

The data extraction method of the present invention is described below through a more specific embodiment, with reference to FIG. 2.

1) Obtain an Excel file. A specified regular expression may be used to find, among files of various kinds, the Excel files whose file names match the regular expression. Then perform step 2);

2) Obtain a sheet (form) from the Excel file. All sheets in the Excel file may be traversed, the obtained sheet being the one currently traversed. Then perform step 3);

3) Traverse the configured form templates and, for the form template currently traversed, traverse the rows of cells in the sheet. Then perform step 4);

4) For each cell in the row currently traversed, check whether the form template contains field information matching the content of the cell. If so, record the cell content and the position index of the cell in the cache and then perform step 5); if not, perform step 5) directly;

5) Check whether the ratio of the number of cell contents in the cache to the number of fields corresponding to the field information in the form template is greater than the set matching degree. If so, perform step 7); if not, perform step 6);

6) Check whether the row has a next cell. If so, continue with the next cell of the row. If not, continue traversing the form. When the traversal of the form ends, if there is a form template that has not yet been traversed, continue with the next form template; otherwise, output an error result indicating that the sheet matches none of the form templates (the error result may include the file name, the sheet name, and the cause of the error), and return to step 2) to process the next form;

7) Determine that the current row is the form header and that the form template is the target form template. Then perform step 8);

8) Sort the columns containing the target cell contents in the form according to the field order in the target form template. The target cell contents are the cell contents recorded in the cache, and the position of each column can be determined from the corresponding position index. Then perform step 9);

9) Traverse the rows of the form located after the form header, extract the target data from each traversed row according to the field information in the target form template, and output the target data to the corresponding database table.
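Steps 3) to 7) above can be sketched as a header search. The sketch makes several assumptions that are not in the text: a form template is represented as a dictionary with the hypothetical keys `name`, `fields`, and `min_match`; the cache is reset for each row; and the ratio is checked once per row after all of its cells have been scanned, so that every matching column is recorded. The optional `max_rows` parameter is likewise hypothetical and limits how many rows are examined.

```python
def find_header(rows, templates, max_rows=None):
    """Return (row_index, template_name, {field_name: column_index})
    for the first row that matches a template, or None if none does."""
    for template in templates:                    # step 3: each configured template
        fields = set(template["fields"])
        for row_index, row in enumerate(rows):    # step 3: each row of the sheet
            if max_rows is not None and row_index >= max_rows:
                break
            cache = {}                            # step 4: matched content -> position
            for col_index, cell in enumerate(row):
                if isinstance(cell, str):
                    name = cell.strip()
                    if name in fields and name not in cache:
                        cache[name] = col_index
            # step 5: compare the match ratio with the set matching degree
            if fields and len(cache) / len(fields) > template["min_match"]:
                return row_index, template["name"], cache   # step 7
    return None                                   # step 6: no template matched


templates = [{
    "name": "weather",
    "fields": ["statistical time", "weather", "temperature", "PM2.5 value"],
    "min_match": 0.6,
}]
rows = [
    ["Daily report", None, None],
    ["weather", "statistical time", "temperature"],
    ["sunny", "2020-05-23", 21.5],
]
print(find_header(rows, templates))
# prints (1, 'weather', {'weather': 0, 'statistical time': 1, 'temperature': 2})
```

Here three of the four template fields appear in the second row, so the ratio 0.75 exceeds the set matching degree of 0.6 and that row is taken as the form header. A caller that receives `None` would report the error result described in step 6).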

The present invention also provides a data extraction apparatus. Referring to FIG. 3, the data extraction apparatus 100 includes:

A target form template determination module 101, configured to obtain, from the configured form templates, a target form template matching the form to be processed, where each form template has a corresponding database table and contains field information of at least one field of the header of the corresponding database table;

A target data determination module 102, configured to determine, from the form and according to the field information in the target form template, the data corresponding to the field information, to obtain target data;

A data extraction module 103, configured to extract the target data into the database table corresponding to the target form template.

In one embodiment, when obtaining the target form template matching the form to be processed from the configured form templates, the target form template determination module is specifically configured to:

Traverse each row of cells in the form:

Search the configured form templates for a form template whose field information matches the contents of the row of cells;

If one is found, determine that the row of cells is the form header of the form, and determine the found form template to be the target form template.

In one embodiment, the form has exactly one form header, and the target form template determination module is further configured to:

End the traversal of the form when the row of cells is determined to be the form header of the form.

In one embodiment, the target form template determination module is further configured to:

If none is found, check whether the current number of traversals of the form has reached the maximum number of traversals. If so, end the traversal of the form; otherwise, continue traversing the form.

In one embodiment, when searching the configured form templates for a form template whose field information matches the contents of the row of cells, the target form template determination module is further configured to:

Perform the search only when the data type of every cell in the row is a text type.

In one embodiment, the field information includes at least a field name;

When searching the configured form templates for a form template whose field information matches the contents of the row of cells, the target form template determination module is specifically configured to:

For each of the configured form templates:

Determine the reference cells in the row of cells, a reference cell being a cell whose content matches any field name in the form template;

Determine, according to the number of reference cells, whether the form template is a form template whose field information matches the contents of the row of cells.

In one embodiment, the content of a reference cell matching any field name in the form template means that:

The content of the reference cell is identical to a field name in the form template;

Or, the content of the reference cell is a near-synonym or synonym of a field name in the form template, where near-synonyms and synonyms are determined by a set matching algorithm.
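The two matching conditions can be sketched as follows. The text does not specify the "set matching algorithm", so the sketch substitutes a hand-written synonym table as a stand-in; the table contents, the case-insensitive comparison, and the function name `matches_field` are all assumptions.

```python
# Hypothetical synonym table: field name -> accepted alternative spellings.
SYNONYMS = {
    "temperature": {"temp", "air temperature"},
    "statistical time": {"date", "statistics date"},
}


def matches_field(cell_text, field_name, synonyms=SYNONYMS):
    """True if the cell content equals the field name or is a listed synonym."""
    cell = cell_text.strip().lower()
    field = field_name.strip().lower()
    if cell == field:                       # condition 1: identical content
        return True
    return cell in synonyms.get(field, ())  # condition 2: synonym or near-synonym


print(matches_field("Air Temperature", "temperature"))
# prints True
```

A production system could replace the table lookup with any similarity measure it trusts, as long as it returns a yes-or-no decision for each pair of cell content and field name.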

In one embodiment,

The form template further includes a set matching degree;

When determining, according to the number of reference cells, whether the form template is a form template whose field information matches the contents of the row of cells, the target form template determination module is specifically configured to:

Calculate the ratio of the number of reference cells to the number of fields corresponding to the field information in the form template;

If the ratio is greater than the set matching degree in the form template, determine that the form template is a form template whose field information matches the contents of the row of cells.

In one embodiment,

The field information includes at least a field name;

The form template further includes a field order for the field information and a conversion rule corresponding to at least one piece of field information, the field order being determined according to the header of the database table corresponding to the target form template;

When determining, from the form and according to the field information in the target form template, the data corresponding to the field information to obtain the target data, the target data determination module is specifically configured to:

Sort the columns of the target cells in the form according to the field order in the target form template, where a target cell belongs to the form header of the form and its content matches a field name in the target form template;

For each row of the form located after the form header, determine the contents of the cells in that row that lie in the columns of the target cells, as one piece of target data.

In one embodiment,

Before determining the contents of the cells lying in the columns of the target cells for the row, the target data determination module is further configured to: if a merged cell exists in the row, split the merged cell and fill its content into all of the cells produced by the split;

After determining the contents of the cells lying in the columns of the target cells for the row, the target data determination module is further configured to: convert at least one of the determined cell contents according to the conversion rules in the target form template, and take the resulting cell contents as one piece of target data.

In one embodiment, the apparatus further includes:

A configuration module, configured to check, when a form template needs to be configured for any database table, whether a correspondence between that database table and the form template to be configured already exists in a specified database; if it does not exist, to proceed with the configuration of the form template and store the correspondence between the database table and the form template in the specified database.
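The check performed by the configuration module can be sketched with an in-memory registry. The class name `TemplateRegistry`, the use of a dictionary in place of the specified database, and the choice to return `False` when a correspondence already exists are assumptions made for illustration; the text does not say what happens in that case.

```python
class TemplateRegistry:
    """Stand-in for the specified database that stores the correspondence
    between database tables and form templates."""

    def __init__(self):
        self._template_of = {}   # database table name -> form template

    def configure(self, table_name, template):
        """Store the correspondence unless one already exists for the table."""
        if table_name in self._template_of:
            return False                      # correspondence exists: do not reconfigure
        self._template_of[table_name] = template
        return True

    def template_for(self, table_name):
        return self._template_of.get(table_name)


registry = TemplateRegistry()
print(registry.configure("weather_daily", {"fields": ["statistical time", "weather"]}))
# prints True
print(registry.configure("weather_daily", {"fields": ["other"]}))
# prints False
```

The table name `weather_daily` is invented for the example.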

In one embodiment, when obtaining the target form template matching the form to be processed from the configured form templates, the target form template determination module is further configured to:

When the category of the data stored in the form has been obtained, obtain the target form template matching the form to be processed from the form templates configured for the database table that stores data of that category.

In one embodiment, the form is any form in a file to be processed, the file to be processed is in a specified file format, and the file to be processed contains at least one form.

For the functions and roles of the units in the above apparatus, see the implementation of the corresponding steps in the above method; they are not repeated here.

Since the apparatus embodiments basically correspond to the method embodiments, refer to the description of the method embodiments for the relevant details. The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units.

The present invention also provides an electronic device including a processor and a memory. The memory stores a program that can be called by the processor, and when the processor executes the program, the data extraction method described in the foregoing embodiments is implemented.

Embodiments of the data extraction apparatus of the present invention may be applied to an electronic device. Taking a software implementation as an example, the apparatus in the logical sense is formed when the processor of the electronic device in which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them. At the hardware level, FIG. 4 is a hardware structure diagram of the electronic device in which the data extraction apparatus 100 resides, according to an exemplary embodiment of the present invention. In addition to the processor 510, memory 530, network interface 520, and non-volatile memory 540 shown in FIG. 4, the electronic device may include other hardware depending on its actual functions, which is not described further here.

The present invention also provides a machine-readable storage medium on which a program is stored. When the program is executed by a processor, the data extraction method described in the foregoing embodiments is implemented.

The present invention may take the form of a computer program product implemented on one or more storage media containing program code, including but not limited to disk storage, CD-ROM, and optical storage. Machine-readable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of machine-readable storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A data extraction method, comprising:
obtaining, from configured form templates, a target form template matching a form to be processed, wherein each form template has a corresponding database table and contains field information of at least one field of the header of the corresponding database table;
determining, from the form and according to the field information in the target form template, data corresponding to the field information, to obtain target data;
and extracting the target data into the database table corresponding to the target form template.
2. The data extraction method of claim 1, wherein obtaining the target form template matching the form to be processed from the configured form templates comprises:
traversing each row of cells in the form:
searching the configured form templates for a form template whose field information matches the contents of the row of cells;
if one is found, determining that the row of cells is the form header of the form, and determining the found form template to be the target form template.
3. The data extraction method of claim 2, wherein the form has exactly one form header; the method further comprising:
ending the traversal of the form when the row of cells is determined to be the form header of the form.
4. The data extraction method of claim 2, further comprising:
if none is found, checking whether the current number of traversals of the form has reached the maximum number of traversals; if so, ending the traversal of the form; otherwise, continuing the traversal of the form.
5. The data extraction method of claim 2, wherein searching the configured form templates for a form template whose field information matches the contents of the row of cells further comprises:
performing the search when the data type of every cell in the row is a text type.
6. The data extraction method of claim 2, wherein
the field information comprises at least a field name;
searching the configured form templates for a form template whose field information matches the contents of the row of cells comprises:
for each of the configured form templates:
determining reference cells in the row of cells, wherein the content of a reference cell matches any field name in the form template;
and determining, according to the number of reference cells, whether the form template is a form template whose field information matches the contents of the row of cells.
7. The data extraction method of claim 6, wherein the content of a reference cell matching any field name in the form template means that:
the content of the reference cell is identical to a field name in the form template;
or the content of the reference cell is a near-synonym or synonym of a field name in the form template, the near-synonym or synonym being determined by a set matching algorithm.
8. The data extraction method of claim 6, wherein
the form template further comprises a set matching degree;
determining, according to the number of reference cells, whether the form template is a form template whose field information matches the contents of the row of cells comprises:
calculating the ratio of the number of reference cells to the number of fields corresponding to the field information in the form template;
and if the ratio is greater than the set matching degree in the form template, determining that the form template is a form template whose field information matches the contents of the row of cells.
9. The data extraction method of claim 2, wherein
the field information comprises at least a field name;
the form template further comprises a field order for the field information and a conversion rule corresponding to at least one piece of field information, the field order being determined according to the header of the database table corresponding to the target form template;
determining, from the form and according to the field information in the target form template, the data corresponding to the field information to obtain the target data comprises:
sorting the columns of the target cells in the form according to the field order in the target form template, wherein a target cell belongs to the form header of the form and its content matches a field name in the target form template;
and for each row of the form located after the form header, determining the contents of the cells in that row that lie in the columns of the target cells, as one piece of target data.
10. The data extraction method of claim 9, wherein
before determining the contents of the cells lying in the columns of the target cells for the row, the method further comprises: if a merged cell exists in the row, splitting the merged cell and filling its content into all of the cells produced by the split;
after determining the contents of the cells lying in the columns of the target cells for the row, the method further comprises: converting at least one of the determined cell contents according to the conversion rules in the target form template, and taking the resulting cell contents as one piece of target data.
11. The data extraction method of claim 1, wherein the method further comprises:
when a form template needs to be configured for any database table, checking whether a correspondence between that database table and the form template to be configured exists in a specified database; if it does not exist, proceeding with the configuration of the form template and storing the correspondence between the database table and the form template in the specified database.
12. The data extraction method of any one of claims 1-11, wherein obtaining the target form template matching the form to be processed from the configured form templates further comprises:
when the category of the data stored in the form has been obtained, obtaining the target form template matching the form to be processed from the form templates configured for the database table that stores data of that category.
13. The data extraction method of claim 1, wherein the form is any form in a file to be processed, the file to be processed is in a specified file format, and the file to be processed contains at least one form.
14. A data extraction apparatus, comprising:
a target form template determination module, configured to obtain, from configured form templates, a target form template matching a form to be processed, wherein each form template has a corresponding database table and contains field information of at least one field of the header of the corresponding database table;
a target data determination module, configured to determine, from the form and according to the field information in the target form template, data corresponding to the field information, to obtain target data;
and a data extraction module, configured to extract the target data into the database table corresponding to the target form template.
15. An electronic device, comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the data extraction method of any one of claims 1 to 13.
16. A machine-readable storage medium having stored thereon a program which, when executed by a processor, implements the data extraction method of any one of claims 1 to 13.
CN202010957895.3A 2020-09-14 2020-09-14 Data extraction method, device and device, and storage medium Pending CN111813849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010957895.3A CN111813849A (en) 2020-09-14 2020-09-14 Data extraction method, device and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010957895.3A CN111813849A (en) 2020-09-14 2020-09-14 Data extraction method, device and device, and storage medium

Publications (1)

Publication Number Publication Date
CN111813849A true CN111813849A (en) 2020-10-23

Family

ID=72859305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010957895.3A Pending CN111813849A (en) 2020-09-14 2020-09-14 Data extraction method, device and device, and storage medium

Country Status (1)

Country Link
CN (1) CN111813849A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112596851A (en) * 2020-12-02 2021-04-02 中国人民解放军63921部队 Multi-source heterogeneous data batch extraction method and analysis method of simulation platform
CN113127359A (en) * 2021-04-23 2021-07-16 中国工商银行股份有限公司 Method and device for obtaining test data
CN113610396A (en) * 2021-08-06 2021-11-05 三峡高科信息技术有限责任公司 Method and system for structuring matrix designer based on construction quality acceptance table
CN114155928A (en) * 2021-12-14 2022-03-08 浙江太美医疗科技股份有限公司 Form generation method, apparatus, computer equipment and storage medium
CN114385158A (en) * 2021-12-30 2022-04-22 杭州数梦工场科技有限公司 A method, device and device for constructing a data interaction system
CN115249006A (en) * 2022-08-02 2022-10-28 中国银行股份有限公司 A text processing method, device and electronic device
CN115344571A (en) * 2022-05-20 2022-11-15 药渡经纬信息科技(北京)有限公司 Universal data acquisition and analysis method, system and storage medium
CN116187288A (en) * 2022-12-20 2023-05-30 长城计算机软件与系统有限公司 Statistical report processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108345682A (en) * 2018-03-02 2018-07-31 弘成科技发展有限公司 Platform and method are imported and exported based on what multi-tenant can configure
CN109933765A (en) * 2019-03-12 2019-06-25 中冶焦耐(大连)工程技术有限公司 Method for extracting Excel table content to CAD table
CN110321410A (en) * 2019-06-21 2019-10-11 东软集团股份有限公司 Method, apparatus, storage medium and the electronic equipment that log is extracted
CN110399420A (en) * 2019-07-30 2019-11-01 广州吉信网络科技开发有限公司 A kind of deriving method, electronic equipment and the medium of configurableization Excel format
CN111125221A (en) * 2019-12-19 2020-05-08 上海三稻智能科技有限公司 Excel format-based data extraction system and configuration method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112596851A (en) * 2020-12-02 2021-04-02 中国人民解放军63921部队 Multi-source heterogeneous data batch extraction method and analysis method of simulation platform
CN113127359A (en) * 2021-04-23 2021-07-16 中国工商银行股份有限公司 Method and device for obtaining test data
CN113127359B (en) * 2021-04-23 2025-03-07 中国工商银行股份有限公司 A method and device for obtaining test data
CN113610396A (en) * 2021-08-06 2021-11-05 三峡高科信息技术有限责任公司 Method and system for structuring matrix designer based on construction quality acceptance table
CN113610396B (en) * 2021-08-06 2022-02-11 三峡高科信息技术有限责任公司 Method and system for structuring matrix designer based on construction quality acceptance table
CN114155928A (en) * 2021-12-14 2022-03-08 浙江太美医疗科技股份有限公司 Form generation method, apparatus, computer equipment and storage medium
CN114155928B (en) * 2021-12-14 2025-05-16 浙江太美医疗科技股份有限公司 Form generation method, device, computer equipment and storage medium
CN114385158A (en) * 2021-12-30 2022-04-22 杭州数梦工场科技有限公司 A method, apparatus and device for constructing a data interaction system
CN115344571A (en) * 2022-05-20 2022-11-15 药渡经纬信息科技(北京)有限公司 Universal data acquisition and analysis method, system and storage medium
CN115249006A (en) * 2022-08-02 2022-10-28 中国银行股份有限公司 A text processing method, device and electronic device
CN116187288A (en) * 2022-12-20 2023-05-30 长城计算机软件与系统有限公司 Statistical report processing method and device

Similar Documents

Publication Publication Date Title
CN111813849A (en) Data extraction method, device and device, and storage medium
CN107491487B (en) A full-text database architecture and bitmap index creation, data query method, server and medium
US11899641B2 (en) Trie-based indices for databases
CN107122443B (en) Distributed full-text search system and method based on Spark SQL
US8862566B2 (en) Systems and methods for intelligent parallel searching
CN112162977B (en) MES-oriented mass data redundancy removing method and system
CN110019218A (en) Data storage and querying method and equipment
CN104881424A (en) Regular expression-based acquisition, storage and analysis method of power big data
CN104331446A (en) Memory map-based mass data preprocessing method
CN114328601B (en) Data downsampling and data query method, system and storage medium
CN106528877A (en) Modular method and system for word document
CN114281989A (en) Data deduplication method and device based on text similarity, storage medium and server
US8548980B2 (en) Accelerating queries based on exact knowledge of specific rows satisfying local conditions
CN115712757A (en) Enterprise name matching method and device based on index tree
CN115658680A (en) Data storage method, data query method and related device
CN110825744A (en) A partitioned storage method for air quality monitoring big data based on cluster environment
CN118520152A (en) Log keyword extraction method and device, storage medium and computer equipment
US20180144060A1 (en) Processing deleted edges in graph databases
CN116090416B (en) Standard writing method, system, equipment and medium based on standard knowledge graph
CN112765960A (en) Text matching method and device and computer equipment
CN117112877A (en) Medical document processing method and device applied to inquiry medicine
JP2005018751A (en) System and method for expressing and calculating relationship between measures
CN114780515B (en) Elasticsearch database data migration method, device, equipment and storage medium
US11321340B1 (en) Metadata extraction from big data sources
CN114218347A (en) Method for quickly searching index of multiple file contents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2020-10-23