WO2020057021A1 - Data table processing method and device, computer device and storage medium - Google Patents
Data table processing method and device, computer device and storage medium
- Publication number
- WO2020057021A1 (PCT/CN2019/071126)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data table
- search
- field
- labeling
- structure information
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data table processing method, comprising: acquiring a data table uploaded by a user; parsing the data table to obtain table structure information of the data table; recognizing the table structure information by means of a trained labeling model, and outputting a labeling result for each field name in the table structure information, each labeling result being one of: search range only, search dimension only, or both search range and search dimension; and storing the labeling results in correspondence with the data table.
Description
This application claims priority to Chinese patent application No. 201811090036.8, filed with the Chinese Patent Office on September 18, 2018 and entitled "Data table processing method, device, computer device and storage medium", the entire contents of which are incorporated herein by reference.
The present application relates to the field of computer technology, and in particular to a data table processing method and device, a computer device, and a storage medium.
At present, big data platforms are available for virtually every industry. These platforms can obtain data and compute statistics based on user input, and can present the statistical results to users visually in the form of reports, meeting users' data analysis needs.
To obtain data that matches the user's input, the data in the data source database usually has to be preprocessed. However, the inventors have realized that existing data platforms can usually perform only simple processing on that data, such as normalizing field names. When field names need to be labeled as usable or not usable as a dimension or a range, the work usually relies on manual processing, which involves a large amount of repetitive manual effort and results in very low processing efficiency.
Summary of the Invention
According to various embodiments disclosed in the present application, a data table processing method and device, a computer device, and a storage medium are provided.
A data table processing method includes:
acquiring a data table uploaded by a user;
parsing the data table to obtain table structure information of the data table;
recognizing the table structure information by means of a trained labeling model, and outputting a labeling result for each field name in the table structure information, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and
storing the labeling results in correspondence with the data table.
A data table processing device includes:
an acquisition module, configured to acquire a data table uploaded by a user;
a parsing module, configured to parse the data table to obtain table structure information of the data table;
a labeling module, configured to recognize the table structure information by means of a trained labeling model and output a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and
a storage module, configured to store the labeling results in correspondence with the data table.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
acquiring a data table uploaded by a user;
parsing the data table to obtain table structure information of the data table;
recognizing the table structure information by means of a trained labeling model, and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and
storing the labeling results in correspondence with the data table.
One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring a data table uploaded by a user;
parsing the data table to obtain table structure information of the data table;
recognizing the table structure information by means of a trained labeling model, and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and
storing the labeling results in correspondence with the data table.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.
To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application scenario of the data table processing method according to one or more embodiments.
FIG. 2 is a schematic flowchart of the data table processing method according to one or more embodiments.
FIG. 3 is a schematic flowchart of the steps of filtering report data according to the labeling results in one or more embodiments.
FIG. 4 is a schematic flowchart of the data table processing method according to one or more specific embodiments.
FIG. 5 is a block diagram of the data table processing device according to one or more embodiments.
FIG. 6 is a block diagram of the computer device according to one or more embodiments.
To make the technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it.
The data table processing method provided in this application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 over a network. The terminal 102 can acquire a data table uploaded by a user and parse it to obtain the table structure information of the data table. The terminal 102 can also recognize the table structure information by means of a trained labeling model and output a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension. The terminal 102 can further store the labeling results in correspondence with the data table, with the stored data held on the server. The terminal 102 may be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device. The server 104 may be implemented as a standalone server or as a cluster of multiple servers, and may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, and cloud storage.
It should be noted that the above application environment is only an example. In some embodiments, the terminal 102 may instead send the acquired table structure information to the server 104. The server 104 then recognizes the table structure information by means of the trained labeling model, obtains the labeling result for each field name in the data table, and stores the labeling results in correspondence with the data table; the terminal 102 can obtain the labeling results for that data table from the server 104.
In some embodiments, as shown in FIG. 2, a data table processing method is provided. Taking its application to the terminal 102 in FIG. 1 as an example, the method includes the following steps:
Step 202: acquire a data table uploaded by a user.
A data table is a structured table of data, for example a table in CSV (Comma-Separated Values) format. A CSV data table stores tabular data as plain text, and the stored data includes numeric and character values. Specifically, a web interface may be provided through which the user uploads the data table, and the terminal then acquires the uploaded data table. In one embodiment, each user must generate the data table containing the report data according to a preset file format or table template, so that the terminal can parse the table structure information out of the uploaded data table.
Table 1 below is a schematic diagram of a CSV-format data table uploaded in one embodiment.
Table 1
As Table 1 shows, the elements of each row of the data table are separated by commas. Each element of the first row is the name of its column, also called a header or field name of the data table. The remaining elements of that column are the field values of that field name, so one field name corresponds to multiple field values.
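The header-and-values layout described above can be sketched in a few lines of Python. This is only an illustration: the column names follow the fields mentioned in the description, the rows are invented and are not the contents of Table 1, and `read_columns` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical CSV content. The column names follow the fields mentioned in the
# description; the rows are invented and are NOT the contents of Table 1.
CSV_TEXT = """name,gender,education,region,age,loan_amount
Zhang San,male,bachelor,Shanghai,31,50000
Li Si,female,master,Beijing,28,120000
Wang Wu,male,college,Shanghai,45,30000
"""


def read_columns(csv_text):
    """Map each field name (first row) to the list of its field values."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return {name: [row[i] for row in body] for i, name in enumerate(header)}


columns = read_columns(CSV_TEXT)
print(list(columns))       # field names taken from the first row
print(columns["gender"])   # all field values of the "gender" field
```

Each key of `columns` is one field name, and its list holds the multiple field values that the field name corresponds to.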
Step 204: parse the data table to obtain the table structure information of the data table.
Table structure information is information that describes what the data table contains. Specifically, the terminal may constrain the format of the data tables that users upload, so that every uploaded data table conforms to a fixed format. The terminal can then parse the data table according to that preset format and obtain its table structure information.
In one embodiment, the table structure information includes field names and field value types. Step 204, parsing the data table to obtain its table structure information, specifically includes: extracting the field names contained in the header of the data table; collecting the enumeration values of each field name; taking the character type of the field values of each field name as the field value type of that field name; and determining the table structure information of the data table from the field names and their enumeration values and field value types.
Specifically, the terminal may extract every field name contained in the header of the data table uploaded by the current user. The terminal may also collect the field values of each field name and group them into categories; the number of categories is the number of enumeration values of that field name. For example, if the field name "gender" in Table 1 has only two categories of field values, "male" and "female", then the enumeration values of "gender" are "male, female". Similarly, in Table 1 the enumeration values of the field name "education" are "college, bachelor, master".
It can be understood that for a field name whose field values cannot be enumerated exhaustively, or whose number of distinct field values exceeds a preset number (for example, 20), the field values are so scattered that the terminal can decide the field name has no enumeration values. For example, collecting enumeration values for the field name "name" in Table 1 is meaningless, because every row is different. Likewise, for the field name "loan amount", the amounts differ across regions, genders, and education levels, so this field name has no enumeration values either. The terminal may record the enumeration value of such a field name as "none".
The field value type is the character type of the field values, either character or numeric. For example, in Table 1 the field value type of name, gender, education, and region is character, while age, loan amount, loan time, and ID number consist purely of digits, so their field value type is numeric.
By parsing the data table uploaded by the user, the terminal obtains every field name, the enumeration values of those field names that have them, and the field value type of each field name. The terminal thus obtains table structure information that describes the content of the entire data table.
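The parsing in step 204 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function and key names are hypothetical, the sample rows are invented, and "cannot be enumerated" is approximated by two assumed rules (more distinct values than the preset limit, or every row holding a different value).

```python
import csv
import io

MAX_ENUM_VALUES = 20  # the preset number given as an example in the description

# Invented sample rows; only the column names come from the description.
CSV_TEXT = """name,gender,education,region,age,loan_amount
Zhang San,male,bachelor,Shanghai,31,50000
Li Si,female,master,Beijing,28,120000
Wang Wu,male,college,Shanghai,45,30000
"""


def is_numeric(value):
    """True if the value is made of digits, with at most one decimal point."""
    return value.replace(".", "", 1).isdigit()


def extract_table_structure(csv_text, max_enum=MAX_ENUM_VALUES):
    """Return one record per field name: enumeration values and field value type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    structure = []
    for i, field_name in enumerate(header):
        values = [row[i] for row in body]
        distinct = sorted(set(values))
        # Assumption: a field is treated as non-enumerable when it exceeds the
        # preset limit or when every row has a different value.
        scattered = len(distinct) > max_enum or len(distinct) == len(values)
        structure.append({
            "field_name": field_name,
            "enum_values": None if scattered else distinct,  # None stands for "none"
            "value_type": "numeric" if all(map(is_numeric, values)) else "character",
        })
    return structure


structure = {s["field_name"]: s for s in extract_table_structure(CSV_TEXT)}
print(structure["gender"])
print(structure["loan_amount"])
```

The "every row differs" rule is crude on a three-row sample; on a realistic table the preset limit would normally be the deciding criterion.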
Step 206: recognize the table structure information by means of the trained labeling model, and output a labeling result for each field name in the table structure information, the labeling result being one of: search range only, search dimension only, or both search range and search dimension.
Specifically, the terminal may use a trained machine learning model to recognize the input table structure information and output a labeling result for each field name in the data table. The labeling result is obtained by the labeling model automatically labeling each field name in the data table, and states that the field name can serve only as a search range, only as a search dimension, or as both a search range and a search dimension.
The search range can serve as the condition for filtering data out of the large amount of data stored in the data source database: the terminal can use the search range to select the report data needed to generate a report. The search dimension can serve as the dimension along which the selected data is displayed. If a field name can serve as a search range, then when filtering report data from the data source database, the terminal can filter among the field values of the field names usable as a search range. Similarly, if a field name can serve as a search dimension, the terminal can filter among the field values of the field names usable as a search dimension. This improves the match between the filtered data and the search term entered by the user.
In one embodiment, step 206, recognizing the table structure information by means of the trained labeling model and outputting the labeling result for each field name in the data table, includes: obtaining the business scenario category selected by the user; inputting the table structure information into the trained labeling model corresponding to that business scenario category, and obtaining, through the labeling model, a feature vector for each field name in the data table from the table structure information; and transforming the feature vector of each field name to output the labeling result for each field name in the data table.
The business scenario category distinguishes different business scenarios. Different business scenarios correspond to different data source databases and to different labeling models. Business scenarios include loans, insurance, wealth management, banking, and so on. The data in the data source databases for these scenarios differs, and so does the training corpus used to train each labeling model. In other words, different business scenarios require different labeling models to recognize the table structure information.
Specifically, when the user enters the search platform, the terminal may offer business scenario categories for the user to choose from. After selecting a category, the user uploads the data table, and the terminal parses it to obtain the table structure information. The terminal then retrieves the labeling model corresponding to that business scenario category, derives a feature vector for each field name from the parsed table structure information, transforms the feature vectors using the model parameters of the labeling model's hidden layers, and outputs a labeling result stating whether each field name can serve as a search range or a search dimension.
In one embodiment, after obtaining the table structure information of the data table, the terminal may determine the feature vector of each field name from the field name, its enumeration values, and its field value type. Specifically, each field name in the data table may be vectorized to obtain a vector representation of the field name, and word vectors may be obtained for its enumeration values; the feature vector of the field name is then generated from features such as the field name, whether enumeration values exist, the number of enumeration values, and the field value type. That is, the information expressed by the feature vector of a field name covers multiple table structure features associated with that field name.
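A feature vector of the kind described above might be assembled as in the sketch below. The description does not say how field names are vectorized, so a hashed character-bigram count is used here purely as a stand-in; word vectors for the enumeration values are omitted, and all function and parameter names are hypothetical.

```python
import hashlib

NAME_DIMS = 8  # assumed size of the field-name part of the vector


def hash_bucket(token, dims):
    """Deterministically map a token to one of `dims` buckets."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % dims


def field_feature_vector(field_name, enum_values, value_type, name_dims=NAME_DIMS):
    """Concatenate a field-name vector with three table structure features."""
    # Stand-in vectorization (an assumption): counts of hashed character bigrams.
    name_vec = [0.0] * name_dims
    padded = f"^{field_name}$"
    for first, second in zip(padded, padded[1:]):
        name_vec[hash_bucket(first + second, name_dims)] += 1.0
    has_enum = 1.0 if enum_values else 0.0
    enum_count = float(len(enum_values)) if enum_values else 0.0
    is_numeric = 1.0 if value_type == "numeric" else 0.0
    return name_vec + [has_enum, enum_count, is_numeric]


vec = field_feature_vector("gender", ["female", "male"], "character")
print(len(vec), vec[-3:])
```

In a real system the name part would more plausibly come from trained word embeddings; the last three entries show how the enumeration and type features ride along with it.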
Step 208: store the labeling results in correspondence with the data table.
Specifically, the terminal may store the recognized labeling results in correspondence with the data table, so that the report data needed to generate a report can later be filtered, according to the search term entered by the user, from the large number of data tables stored in the data source database.
In one embodiment, the terminal may generate a data table identifier for the data table uploaded by the user. The data table identifier uniquely identifies one data table and may include at least any one of characters, digits, and symbols. The terminal may obtain each field name included in the table structure information extracted in step 204, obtain the labeling result of each field name produced in step 206, and store the data table identifier in correspondence with the labeling results of the field names in that data table.
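Storing labeling results under a generated data table identifier can be sketched as below. The in-memory dictionary, the `LabelStore` class, the use of a UUID as the identifier, and the three label strings are all assumptions made for illustration; the description only requires that the identifier be unique.

```python
import uuid

# Hypothetical names for the three labeling results in the description.
LABELS = {"range_only", "dimension_only", "both"}


class LabelStore:
    """In-memory stand-in for the storage that keeps labels per data table."""

    def __init__(self):
        self._labels = {}

    def save(self, field_labels):
        """Store {field_name: label} under a new table identifier and return it."""
        unknown = set(field_labels.values()) - LABELS
        if unknown:
            raise ValueError(f"unknown labeling results: {sorted(unknown)}")
        table_id = uuid.uuid4().hex  # assumed identifier scheme
        self._labels[table_id] = dict(field_labels)
        return table_id

    def get(self, table_id):
        """Return the labeling results stored for a table identifier."""
        return self._labels[table_id]


store = LabelStore()
table_id = store.save({"gender": "both", "region": "both", "loan_amount": "range_only"})
print(table_id, store.get(table_id))
```

Looking the labels up by `table_id` is what later lets the terminal walk the data source database table by table.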
In this way, when the terminal obtains a search term, it can traverse the data tables corresponding to each data table identifier in the data source database, obtain the labeling results corresponding to each identifier, and use those labeling results to filter out of the data tables the report data that matches the search term and is needed to generate a report. As shown in FIG. 3, in one embodiment the data table processing method further includes the steps of filtering report data according to the labeling results:
Step 302: obtain the search term entered by the user.
Specifically, the search platform includes a search box in which the user can enter a search term. When the terminal detects a report search event, it takes the content the user entered in the search box as the search term. For example, the user enters "How is the education level of male borrowers in Shanghai distributed?", and clicking the search icon triggers the terminal to obtain that search term.
Step 304: identify the search range and search dimension corresponding to the search term.
Specifically, the terminal may recognize the search term by means of a trained intent recognition model and obtain the search range and search dimension corresponding to the search term. The terminal may vectorize the obtained search term to get a search term vector and input that vector into the intent recognition model, whose hidden layers encode and transform the vector and then output the search range and search dimension corresponding to the search term.
Step 306: obtain the labeling results corresponding to each data table in the data source database.
The data source database is a database that stores a large number of data tables. The data it stores may correspond to a business scenario category, with different business scenarios having data source databases that store different types of data. When report data needs to be filtered from the data source database, the terminal may first obtain the stored labeling results of each data table.
Step 308: according to the labeling results, filter from the data source database the report data that matches the search range and search dimension.
Specifically, after identifying the search range and search dimension corresponding to the search term and obtaining the labeling results of each data table, the terminal can filter the report data needed to generate a report out of the data tables in the data source database according to the labeling results, the search range, and the search dimension.
In one embodiment, step 308, filtering from the data source database the report data that matches the search range and search dimension according to the labeling results, specifically includes: matching the search range against the field names that the labeling results mark as usable as a search range; matching the search dimension against the field names that the labeling results mark as usable as a search dimension; and filtering the report data from the data source database according to the matched field names.
Specifically, the terminal may obtain the labeling results of each data table in the data source database, match the search range identified from the search term against the field names marked as usable as a search range, and match the search dimension identified from the search term against the field names marked as usable as a search dimension. If a match is found, the report data is filtered from the data source database according to the matched field names; if no match is found, the data source database does not store report data matching the search term. In one embodiment, the terminal may also identify a search intent from the search term, aggregate the filtered report data into the statistics used to generate a report, and draw the report according to the search dimension and search intent to display those statistics.
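The matching in step 308 can be sketched with plain dictionaries. The sketch assumes that step 304 has already turned the example query about male borrowers in Shanghai into a search range of `{"region": "Shanghai", "gender": "male"}` and a search dimension of `"education"`; the data rows, label strings, and function names are hypothetical.

```python
from collections import Counter


def can_be_range(label):
    return label in ("range_only", "both")


def can_be_dimension(label):
    return label in ("dimension_only", "both")


def filter_report_data(tables, labels, search_range, search_dimension):
    """Collect rows from tables whose labeled fields match the range and dimension.

    tables: {table_id: [row dict, ...]}; labels: {table_id: {field_name: label}}.
    """
    matched = []
    for table_id, rows in tables.items():
        field_labels = labels.get(table_id, {})
        range_ok = all(can_be_range(field_labels.get(f, "")) for f in search_range)
        dimension_ok = can_be_dimension(field_labels.get(search_dimension, ""))
        if not (range_ok and dimension_ok):
            continue  # this table cannot answer the query
        matched.extend(
            row for row in rows
            if all(row[f] == value for f, value in search_range.items())
        )
    return matched


def summarize(rows, dimension):
    """Count the filtered rows along the search dimension."""
    return Counter(row[dimension] for row in rows)


# Invented data for illustration only.
tables = {"t1": [
    {"gender": "male", "region": "Shanghai", "education": "bachelor"},
    {"gender": "female", "region": "Shanghai", "education": "master"},
    {"gender": "male", "region": "Shanghai", "education": "master"},
    {"gender": "male", "region": "Beijing", "education": "college"},
]}
labels = {"t1": {"gender": "both", "region": "both", "education": "dimension_only"}}

rows = filter_report_data(tables, labels, {"region": "Shanghai", "gender": "male"}, "education")
print(summarize(rows, "education"))
```

An empty result corresponds to the case in which the data source database stores no report data matching the search term.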
With the above data table processing method, as soon as a data table uploaded by a user is acquired, the data table is parsed to obtain its table structure information, which reflects the content and field names of the data table. The trained labeling model then recognizes the table structure information and automatically outputs a labeling result for each field name in the data table, and the labeling result determines whether the field name can serve as a search range or a search dimension. The field names of the uploaded data table are thus labeled automatically, which greatly improves labeling efficiency compared with manual labeling. In addition, storing the labeling results in correspondence with the data table makes it easy to obtain, from the data table, the data that matches the user's search term.
In one embodiment, the training of the labeling model includes: obtaining a training sample corpus and a test sample corpus; obtaining the labeling results corresponding to each training sample in the training sample corpus and each test sample in the test sample corpus; repeating, until the training sample corpus has been fully used for training, the steps of inputting the labeled current training sample into a machine learning model, outputting the prediction result for the current training sample, comparing that prediction result with the corresponding labeling result, adjusting the model parameters of the machine learning model when the difference does not meet a preset condition, and accepting the most recently adjusted model parameters when the difference meets the preset condition; inputting each test sample in the test sample corpus into the trained machine learning model and outputting the prediction result for each test sample; computing the accuracy of the machine learning model from the differences between the prediction results for the test samples and the corresponding labeling results; and obtaining the trained labeling model when the computed accuracy meets a training stop condition.
The training sample corpus is the corpus used to train the model, and the test sample corpus is the corpus used to test it. In one embodiment, business scenario categories are distinguished when training the machine learning model: for each business scenario category, the training sample corpus and test sample corpus corresponding to that category are obtained and used to train the machine learning model, yielding a labeling model for that category. After a user uploads a data table and selects a business scenario category, the uploaded data table can then be labeled automatically by the labeling model corresponding to the selected category.
Specifically, the labeling results of the training sample corpus and test sample corpus obtained for training may be produced manually; accurate labels help produce a labeling model with high labeling accuracy. During training, the labeled training samples are input into the machine learning model one at a time. For each current training sample, the model outputs a prediction result, which is compared with the corresponding labeling result. When the difference between them does not meet the preset condition, the model parameters of the machine learning model are adjusted; when the difference meets the preset condition, the most recently adjusted model parameters are accepted. This process is repeated with the next training sample until all training samples in the training sample corpus have been used.
Next, the test samples in the test sample corpus are input into the trained model, and the accuracy of its predictions on those test samples is computed. When the computed accuracy meets the training stop condition, the trained labeling model is obtained.
When the computed accuracy does not meet the training stop condition, a new round of training of the machine learning model may be carried out with the same training sample corpus and test sample corpus, until the computed accuracy meets the training stop condition and the trained labeling model is obtained.
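The train, test, and retrain cycle described above can be sketched as follows. The text does not name a model type, so this example swaps in a simple perceptron with a binary label purely for illustration; the function names, the learning rate, the 0.9 accuracy threshold, and the round limit are all assumptions.

```python
def predict(weights, bias, features):
    """Binary prediction of a linear model: 1 if the score is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_labeling_model(train, test, target_accuracy=0.9, max_rounds=100, lr=1.0):
    """Train round by round until test accuracy meets the stop condition.

    train and test are lists of (features, label) pairs with labels 0 or 1.
    Returns (weights, bias, accuracy) for the last completed round.
    """
    weights, bias = [0.0] * len(train[0][0]), 0.0
    accuracy = 0.0
    for _ in range(max_rounds):
        # One pass over the training sample corpus.
        for features, label in train:
            error = label - predict(weights, bias, features)
            if error != 0:
                # Difference does not meet the preset condition: adjust parameters.
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
            # Otherwise the most recently adjusted parameters are kept as they are.
        # Evaluate on the test sample corpus.
        correct = sum(predict(weights, bias, f) == y for f, y in test)
        accuracy = correct / len(test)
        if accuracy >= target_accuracy:
            break  # training stop condition met
    return weights, bias, accuracy
```

The outer loop corresponds to starting a new training round whenever the accuracy measured on the test sample corpus falls short of the stop condition; `max_rounds` is only a safeguard added for the example.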
In this embodiment, the machine learning model is trained repeatedly on manually labeled sample corpora until a labeling model whose labeling accuracy meets the preset condition is obtained; only then can the data tables uploaded by users be labeled automatically.
In one embodiment, the above data table processing method further includes the following steps: displaying each field name and its labeling result; obtaining at least two field names selected by the user from the displayed field names; obtaining an intermediate field name, input by the user, that is associated with the at least two field names; and storing the intermediate field name in correspondence with the data table, where the labeling result of the intermediate field name is the same as that of the at least two selected field names.
The terminal may also display the recognized labeling results to the user who uploaded the data table, so that the user can define intermediate field names based on them. Specifically, the terminal may obtain at least two field names that the user selects from the displayed field names. These selected field names are original field names that appear in the user-uploaded data table, whereas an intermediate field name is not an original field name of the data table but is defined in terms of original field names. The user may input an intermediate field name and associate the at least two selected original field names with it; the terminal thereby obtains the relationship between the intermediate field name and those field names and stores the intermediate field name in correspondence with the data table. The labeling result of the intermediate field name is the same as that of the at least two selected field names.
For example, suppose the data table uploaded by the user includes the original field names "overdue amount" and "overdue principal", both labeled as usable as both a search range and a search dimension. The user may then define the intermediate field name "overdue rate", where "overdue rate = overdue amount / overdue principal". The terminal stores the intermediate field name "overdue rate" in correspondence with the data table, with the same labeling result: usable as both a search range and a search dimension. When the user later inputs the search term "difference between the overdue rates of Shanghai and Beijing", the intent recognition model identifies the search dimension as "overdue rate". Report data can then be filtered from the data source database using the fields "overdue amount" and "overdue principal" from the definition "overdue rate = overdue amount / overdue principal", and the "overdue rate" is computed to obtain the statistical data required to generate the report.
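The overdue rate example can be worked through in a few lines. This is a sketch under stated assumptions: the `city` field, the snake_case field names, the dictionary layout of the intermediate field definition, and the choice to sum both fields per city before dividing are all hypothetical and not taken from the text.

```python
from collections import defaultdict

# Hypothetical stored definition of the intermediate field: its name and the
# two original field names it is derived from.
OVERDUE_RATE = {
    "name": "overdue_rate",
    "numerator": "overdue_amount",
    "denominator": "overdue_principal",
}


def intermediate_by_group(rows, definition, group_field):
    """Aggregate the two original fields per group, then compute the ratio."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        totals[row[group_field]][0] += row[definition["numerator"]]
        totals[row[group_field]][1] += row[definition["denominator"]]
    # Groups with a zero denominator are skipped rather than dividing by zero.
    return {group: num / den for group, (num, den) in totals.items() if den}


rows = [
    {"city": "Shanghai", "overdue_amount": 10.0, "overdue_principal": 100.0},
    {"city": "Shanghai", "overdue_amount": 30.0, "overdue_principal": 100.0},
    {"city": "Beijing", "overdue_amount": 5.0, "overdue_principal": 50.0},
]
rates = intermediate_by_group(rows, OVERDUE_RATE, "city")
difference = rates["Shanghai"] - rates["Beijing"]
```

With this sample data the Shanghai rate is 40 / 200 and the Beijing rate is 5 / 50, so the difference asked for by the search term comes out at about 0.1.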
In this embodiment, displaying the labeling results lets the user define intermediate dimensions from the original field names and store them, so that the field names available for matching the user's search terms are richer and more varied. Even when the search range or search dimension identified from a search term is not among the original field names of the data table, report data matching the search term can still be filtered out, which improves matching accuracy.
As shown in FIG. 4, in a specific embodiment, the data table processing method includes the following steps:
Step 402: obtain the data table uploaded by the user;
Step 404: extract the field names included in the header of the data table;
Step 406: collect the enumerated values corresponding to each field name;
Step 408: take the character type of the field values corresponding to each field name as the field value type of that field name;
Step 410: determine the table structure information of the data table from the field names and the corresponding enumerated values and field value types;
Step 412: obtain the business scenario category selected by the user;
Step 414: input the table structure information into the trained labeling model corresponding to the business scenario category, and obtain, through the labeling model, the feature vector corresponding to each field name in the data table from the table structure information;
Step 416: transform the feature vector corresponding to each field name and output the labeling result corresponding to each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension;
Step 418: display each field name and its labeling result;
Step 420: obtain at least two field names selected by the user from the displayed field names;
Step 422: obtain an intermediate field name, input by the user, that is associated with the at least two field names;
Step 424: store the intermediate field name in correspondence with the data table, the labeling result of the intermediate field name being the same as that of the at least two selected field names;
Step 426: store the labeling results in correspondence with the data table;
Step 428: obtain the search term input by the user;
Step 430: identify the search range and search dimension corresponding to the search term;
Step 432: obtain the labeling result corresponding to each data table in the data source database;
Step 434: match the search range against the field names that the labeling results mark as usable as a search range;
Step 436: match the search dimension against the field names that the labeling results mark as usable as a search dimension;
Step 438: filter report data from the data source database according to the matched field names.
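Steps 404 to 410 of the flow above, which build the table structure information, can be sketched as follows. This is an illustration only: the function names, the row layout (lists of strings in header order), and the two value types "numeric" and "text" are assumptions, since the text does not specify how character types are classified.

```python
def is_number(value):
    """True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def value_type(values):
    """Classify a column by the character type of its values (assumed scheme)."""
    if values and all(is_number(v) for v in values):
        return "numeric"
    return "text"


def parse_table_structure(header, rows):
    """Build table structure information from a header and string-valued rows.

    For each field name in the header (step 404), collect its distinct
    enumerated values (step 406) and its field value type (step 408), and
    combine them into one structure (step 410).
    """
    structure = {}
    for index, name in enumerate(header):
        values = [row[index] for row in rows]
        structure[name] = {
            "enum_values": sorted(set(values)),
            "value_type": value_type(values),
        }
    return structure


header = ["city", "overdue_amount"]
rows = [["Shanghai", "10"], ["Beijing", "5.5"], ["Shanghai", "7"]]
structure = parse_table_structure(header, rows)
```

The resulting dictionary is what a labeling model would take as input in step 414; how the model turns it into feature vectors is not covered by this sketch.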
With the above data table processing method, a data table is parsed as soon as it is obtained from the user's upload, yielding the table structure information of the data table, which reflects the content and field names the table contains. A trained labeling model then recognizes the table structure information and automatically outputs a labeling result for each field name in the data table; the labeling result determines whether a field name can serve as a search range or a search dimension. Field names in user-uploaded data tables are thus labeled automatically, which greatly improves labeling efficiency compared with manual labeling. In addition, storing the labeling results in correspondence with the data table makes it easier to retrieve data matching the user's search term from the data table.
It should be understood that although the steps in the flowcharts of FIG. 2 to FIG. 4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be performed in other orders. Moreover, at least some of the steps in FIG. 2 to FIG. 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times; nor are they necessarily performed one after another, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, a data table processing apparatus 500 is provided, including an obtaining module 502, a parsing module 504, a labeling module 506, and a storage module 508, where:
The obtaining module 502 is configured to obtain the data table uploaded by the user.
The parsing module 504 is configured to parse the data table to obtain the table structure information of the data table.
The labeling module 506 is configured to recognize the table structure information through the trained labeling model and output the labeling result of each field name in the data table; the labeling result is one of: search range only, search dimension only, or both search range and search dimension.
The storage module 508 is configured to store the labeling results in correspondence with the data table.
In one embodiment, the data table processing apparatus 500 further includes a search term obtaining module, a recognition module, a labeling result obtaining module, and a report data filtering module. The search term obtaining module is configured to obtain the search term input by the user; the recognition module is configured to identify the search range and search dimension corresponding to the search term; the labeling result obtaining module is configured to obtain the labeling result corresponding to each data table in the data source database; and the report data filtering module is configured to filter, according to the labeling results, report data matching the search range and the search dimension from the data source database.
In one embodiment, the report data filtering module is further configured to match the search range against the field names that the labeling results mark as usable as a search range; match the search dimension against the field names that the labeling results mark as usable as a search dimension; and filter report data from the data source database according to the matched field names.
In one embodiment, the table structure information includes field names and field value types. The parsing module is further configured to extract the field names included in the header of the data table; collect the enumerated values corresponding to each field name; take the character type of the field values corresponding to each field name as the field value type of that field name; and determine the table structure information of the data table from the field names and the corresponding enumerated values and field value types.
In one embodiment, the labeling module is further configured to obtain the business scenario category selected by the user; input the table structure information into the trained labeling model corresponding to the business scenario category and obtain, through the labeling model, the feature vector corresponding to each field name in the data table from the table structure information; and transform the feature vector corresponding to each field name to output the labeling result corresponding to each field name in the data table.
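The labeling module's behavior can be sketched as follows. The text says only that the feature vectors are "transformed" into labeling results, so the linear scoring followed by picking the highest score used here is an assumption, as are the label strings, the function names, and the representation of a per-category model as a list of weight rows.

```python
# Hypothetical names for the three labeling results, in score order.
LABELS = ("search range only", "search dimension only", "search range and search dimension")


def label_fields(feature_vectors, weights):
    """Map each field's feature vector to one of the three labeling results.

    feature_vectors maps field name to a list of floats; weights holds one
    row of coefficients per label. The label with the highest score wins,
    and ties go to the earlier label.
    """
    results = {}
    for field, vector in feature_vectors.items():
        scores = [sum(w * x for w, x in zip(row, vector)) for row in weights]
        results[field] = LABELS[scores.index(max(scores))]
    return results


def label_for_scenario(models, scenario, feature_vectors):
    """Select the model for the user's business scenario category, then label."""
    return label_fields(feature_vectors, models[scenario])
```

Keeping one set of weights per business scenario category mirrors the idea of a separate trained labeling model for each category; a lookup with an unknown category raises `KeyError` in this sketch.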
In one embodiment, the data table processing apparatus 500 further includes a training module configured to: obtain a training sample corpus and a test sample corpus; obtain the labeling results corresponding to each training sample in the training sample corpus and each test sample in the test sample corpus; repeat, until the training sample corpus has been fully used for training, the steps of inputting the labeled current training sample into a machine learning model, outputting the prediction result for the current training sample, comparing that prediction result with the corresponding labeling result, adjusting the model parameters of the machine learning model when the difference does not meet a preset condition, and accepting the most recently adjusted model parameters when the difference meets the preset condition; input each test sample in the test sample corpus into the trained machine learning model and output the prediction result for each test sample; compute the accuracy of the machine learning model from the differences between the prediction results for the test samples and the corresponding labeling results; and obtain the trained labeling model when the computed accuracy meets a training stop condition.
In one embodiment, the data table processing apparatus 500 further includes a labeling result display module, a field name obtaining module, an intermediate field name definition module, and an intermediate field name storage module. The labeling result display module is configured to display each field name and its labeling result; the field name obtaining module is configured to obtain at least two field names selected by the user from the displayed field names; the intermediate field name definition module is configured to obtain an intermediate field name, input by the user, that is associated with the at least two field names; and the intermediate field name storage module is configured to store the intermediate field name in correspondence with the data table, where the labeling result of the intermediate field name is the same as that of the at least two selected field names.
With the above data table processing apparatus 500, a data table is parsed as soon as it is obtained from the user's upload, yielding the table structure information of the data table, which reflects the content and field names the table contains. A trained labeling model then recognizes the table structure information and automatically outputs a labeling result for each field name in the data table; the labeling result determines whether a field name can serve as a search range or a search dimension. Field names in user-uploaded data tables are thus labeled automatically, which greatly improves labeling efficiency compared with manual labeling. In addition, storing the labeling results in correspondence with the data table makes it easier to retrieve data matching the user's search term from the data table.
For specific limitations on the data table processing apparatus 500, refer to the limitations on the data table processing method above; they are not repeated here. Each module of the data table processing apparatus 500 may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile computer-readable storage medium and an internal memory. The non-volatile computer-readable storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile computer-readable storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a data table processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen; a button, trackball, or touchpad provided on the housing of the computer device; or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the data table processing apparatus 500 provided in the present application may be implemented in the form of computer-readable instructions that can run on a computer device as shown in FIG. 6. The memory of the computer device may store the program modules that make up the data table processing apparatus 500, for example the obtaining module 502, the parsing module 504, the labeling module 506, and the storage module 508 shown in FIG. 5. The computer-readable instructions formed by these program modules cause the processor to perform the steps of the data table processing method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 6 may perform step S202 through the obtaining module 502 of the data table processing apparatus 500 shown in FIG. 5, step S204 through the parsing module 504, step S206 through the labeling module 506, and step S208 through the storage module 508.
In one embodiment, a computer device is provided, including a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processor, implement the steps of the data table processing method provided in any embodiment of the present application. The steps of the data table processing method here may be the steps of the data table processing method of any of the above embodiments.
In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the steps of the data table processing method provided in any embodiment of the present application. The steps of the data table processing method here may be the steps of the data table processing method of any of the above embodiments.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments may be carried out by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments represent only several implementations of the present application. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. The protection scope of this patent application is therefore defined by the appended claims.
Claims (20)
- A data table processing method, comprising: obtaining a data table uploaded by a user; parsing the data table to obtain table structure information of the data table; recognizing the table structure information through a trained labeling model and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and storing the labeling results in correspondence with the data table.
- The method according to claim 1, further comprising: obtaining a search term input by a user; identifying a search range and a search dimension corresponding to the search term; obtaining the labeling result corresponding to each data table in a data source database; and filtering, according to the labeling results, report data matching the search range and the search dimension from the data source database.
- The method according to claim 2, wherein filtering, according to the labeling results, report data matching the search range and the search dimension from the data source database comprises: matching the search range against the field names that the labeling results mark as usable as a search range; matching the search dimension against the field names that the labeling results mark as usable as a search dimension; and filtering report data from the data source database according to the matched field names.
- The method according to claim 1, wherein the table structure information includes field names and field value types, and parsing the data table to obtain the table structure information of the data table comprises: extracting the field names included in the header of the data table; collecting the enumeration values corresponding to each field name; taking the character type of the field values corresponding to each field name as the field value type of that field name; and determining the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.
- The method according to claim 1, wherein identifying the table structure information by means of the trained labeling model and outputting the labeling result for each field name in the data table comprises: obtaining a business scenario category selected by the user; inputting the table structure information into a trained labeling model corresponding to the business scenario category, and obtaining, through the labeling model and according to the table structure information, a feature vector corresponding to each field name in the data table; and transforming the feature vector corresponding to each field name, and outputting the labeling result corresponding to each field name in the data table.
- The method according to claim 1, wherein training the labeling model comprises: obtaining a training sample corpus and a test sample corpus; obtaining the labeling results corresponding to each training sample in the training sample corpus and each test sample in the test sample corpus; repeatedly performing the steps of inputting the current labeled training sample into a machine learning model, outputting a prediction result for the current training sample, comparing the prediction result with the corresponding labeling result, adjusting the model parameters of the machine learning model when the difference does not meet a preset condition, and accepting the previously adjusted model parameters when the difference meets the preset condition, until training on the training sample corpus is complete; inputting each test sample in the test sample corpus into the trained machine learning model, and outputting a prediction result for each test sample; computing the accuracy of the machine learning model based on the differences between the prediction results for the test samples and the corresponding labeling results; and obtaining the trained labeling model when the computed accuracy meets a training stop condition.
- The method according to any one of claims 1 to 6, further comprising: displaying each field name and its corresponding labeling result; obtaining at least two field names selected by the user from the displayed field names; obtaining an intermediate field name, input by the user, that is associated with the at least two field names; and storing the intermediate field name in correspondence with the data table, wherein the labeling result of the intermediate field name is the same as that of the at least two selected field names.
- A data table processing device, comprising: an acquisition module, configured to obtain a data table uploaded by a user; a parsing module, configured to parse the data table to obtain table structure information of the data table; a labeling module, configured to identify the table structure information by means of a trained labeling model and output a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and a storage module, configured to store the labeling result in correspondence with the data table.
- The device according to claim 8, further comprising: a search term acquisition module, configured to obtain a search term input by a user; an identification module, configured to identify a search range and a search dimension corresponding to the search term; a labeling result acquisition module, configured to obtain the labeling results corresponding to each data table in a data source database; and a report data filtering module, configured to filter, according to the labeling results, report data matching the search range and the search dimension from the data source database.
- The device according to claim 8, wherein the report data filtering module is further configured to: match the search range against the field names that the labeling results mark as usable as a search range; match the search dimension against the field names that the labeling results mark as usable as a search dimension; and filter report data from the database source according to the matched field names.
- The device according to claim 8, wherein the table structure information includes field names and field value types, and the parsing module is further configured to: extract the field names included in the header of the data table; collect the enumeration values corresponding to each field name; take the character type of the field values corresponding to each field name as the field value type of that field name; and determine the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.
- The device according to claim 8, wherein the labeling module is further configured to: obtain a business scenario category selected by the user; input the table structure information into a trained labeling model corresponding to the business scenario category, and obtain, through the labeling model and according to the table structure information, a feature vector corresponding to each field name in the data table; and transform the feature vector corresponding to each field name, and output the labeling result corresponding to each field name in the data table.
- A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps: obtaining a data table uploaded by a user; parsing the data table to obtain table structure information of the data table; identifying the table structure information by means of a trained labeling model, and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and storing the labeling result in correspondence with the data table.
- The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps: obtaining a search term input by a user; identifying a search range and a search dimension corresponding to the search term; obtaining the labeling results corresponding to each data table in a data source database; and filtering, according to the labeling results, report data matching the search range and the search dimension from the data source database.
- The computer device according to claim 13, wherein the table structure information includes field names and field value types, and the processor, when executing the computer-readable instructions, further performs the following steps: extracting the field names included in the header of the data table; collecting the enumeration values corresponding to each field name; taking the character type of the field values corresponding to each field name as the field value type of that field name; and determining the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.
- The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps: obtaining a business scenario category selected by the user; inputting the table structure information into a trained labeling model corresponding to the business scenario category, and obtaining, through the labeling model and according to the table structure information, a feature vector corresponding to each field name in the data table; and transforming the feature vector corresponding to each field name, and outputting the labeling result corresponding to each field name in the data table.
- One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining a data table uploaded by a user; parsing the data table to obtain table structure information of the data table; identifying the table structure information by means of a trained labeling model, and outputting a labeling result for each field name in the data table, the labeling result being one of: search range only, search dimension only, or both search range and search dimension; and storing the labeling result in correspondence with the data table.
- The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: obtaining a search term input by a user; identifying a search range and a search dimension corresponding to the search term; obtaining the labeling results corresponding to each data table in a data source database; and filtering, according to the labeling results, report data matching the search range and the search dimension from the data source database.
- The storage medium according to claim 16, wherein the table structure information includes field names and field value types, and the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: extracting the field names included in the header of the data table; collecting the enumeration values corresponding to each field name; taking the character type of the field values corresponding to each field name as the field value type of that field name; and determining the table structure information of the data table according to the field names and the corresponding enumeration values and field value types.
- The storage medium according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: obtaining a business scenario category selected by the user; inputting the table structure information into a trained labeling model corresponding to the business scenario category, and obtaining, through the labeling model and according to the table structure information, a feature vector corresponding to each field name in the data table; and transforming the feature vector corresponding to each field name, and outputting the labeling result corresponding to each field name in the data table.
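
The flow recited in the claims above (parse the table structure, label each field name, then filter tables by search range and search dimension) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. All names here (`parse_table_structure`, `placeholder_model`, `filter_tables`, the label constants) are hypothetical, and `placeholder_model` is a simple hand-written rule standing in for the trained labeling model, whose architecture the claims do not specify.

```python
# Hypothetical label values for the three labeling results named in the claims.
SEARCH_RANGE = "search_range"
SEARCH_DIMENSION = "search_dimension"
BOTH = "both"


def _is_number(value):
    """Return True if the value can be read as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def parse_table_structure(header, rows):
    """Collect, per field name, its enumeration values and a coarse value type."""
    structure = {}
    for idx, name in enumerate(header):
        values = [row[idx] for row in rows]
        numeric = bool(values) and all(_is_number(v) for v in values)
        structure[name] = {
            "enum_values": sorted(set(values)),
            "value_type": "numeric" if numeric else "text",
        }
    return structure


def placeholder_model(structure):
    """Stand-in for the trained labeling model (assumption, not from the source).

    Numeric fields are labeled as search dimensions, all others as search ranges.
    A real model could also output BOTH.
    """
    return {
        name: SEARCH_DIMENSION if info["value_type"] == "numeric" else SEARCH_RANGE
        for name, info in structure.items()
    }


def filter_tables(labels_by_table, range_field, dimension_field):
    """Return ids of tables whose labels allow the requested range and dimension."""
    matches = []
    for table_id, labels in labels_by_table.items():
        range_ok = labels.get(range_field) in (SEARCH_RANGE, BOTH)
        dimension_ok = labels.get(dimension_field) in (SEARCH_DIMENSION, BOTH)
        if range_ok and dimension_ok:
            matches.append(table_id)
    return matches


# Example data (invented for illustration).
header = ["region", "month", "sales"]
rows = [["east", "2018-01", "120"], ["west", "2018-01", "95.5"]]

structure = parse_table_structure(header, rows)
labels = placeholder_model(structure)
labels_by_table = {"t1": labels}  # labeling result stored against the table id
```

With this example data, calling `filter_tables(labels_by_table, "region", "sales")` selects table `t1`, while swapping the two arguments selects nothing, because `sales` is not labeled as a search range. Storing the labels per table id mirrors the claimed step of storing the labeling result in correspondence with the data table.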
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811090036.8 | 2018-09-18 | ||
CN201811090036.8A CN109299094A (en) | 2018-09-18 | 2018-09-18 | Data table processing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020057021A1 (en) | 2020-03-26 |
Family
ID=65163673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/071126 WO2020057021A1 (en) | 2018-09-18 | 2019-01-10 | Data table processing method and device, computer device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109299094A (en) |
WO (1) | WO2020057021A1 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110378378B (en) * | 2019-06-17 | 2022-10-28 | 北京百度网讯科技有限公司 | Event retrieval method and device, computer equipment and storage medium |
CN110427992A (en) * | 2019-07-23 | 2019-11-08 | 杭州城市大数据运营有限公司 | Data matching method, device, computer equipment and storage medium |
CN110727743A (en) * | 2019-10-12 | 2020-01-24 | 杭州城市大数据运营有限公司 | Data identification method and device, computer equipment and storage medium |
CN112667855B (en) * | 2019-10-15 | 2022-07-05 | 北京新唐思创教育科技有限公司 | Block chain data management method, electronic device and computer storage medium |
CN110795482B (en) * | 2019-10-16 | 2022-11-22 | 浙江大华技术股份有限公司 | Data benchmarking method, device and storage device |
CN111079174A (en) * | 2019-11-21 | 2020-04-28 | 中国电力科学研究院有限公司 | Power consumption data desensitization method and system based on anonymization and differential privacy technology |
CN111143433B (en) * | 2019-12-10 | 2024-07-09 | 中国平安财产保险股份有限公司 | Method and device for counting data in data bin |
CN111258993A (en) * | 2020-01-09 | 2020-06-09 | 佛山科学技术学院 | Method and device for filtering abnormal data of industrial big data |
CN113095064A (en) * | 2021-03-18 | 2021-07-09 | 杭州数梦工场科技有限公司 | Code field identification method and device, electronic equipment and storage medium |
CN113157788B (en) * | 2021-04-13 | 2024-02-13 | 福州外语外贸学院 | Big data mining method and system |
CN113918577B (en) * | 2021-12-15 | 2022-03-11 | 北京新唐思创教育科技有限公司 | Data table identification method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105045769A (en) * | 2015-06-01 | 2015-11-11 | 中国人民解放军装备学院 | Structure recognition based Web table information extraction method |
CN106055584A (en) * | 2010-01-15 | 2016-10-26 | 起元技术有限责任公司 | Managing data queries |
CN107527070A (en) * | 2017-08-25 | 2017-12-29 | 江苏赛睿信息科技股份有限公司 | Recognition methods, storage medium and the server of dimension data and achievement data |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107102993B (en) * | 2016-02-19 | 2021-01-29 | 创新先进技术有限公司 | User appeal analysis method and device |
CN106407407B (en) * | 2016-09-22 | 2019-10-15 | 江苏通付盾科技有限公司 | A kind of file labeling system and method |
CN107274291B (en) * | 2017-06-21 | 2020-08-04 | 况客科技(北京)有限公司 | Cross-platform valuation table analysis method, storage medium and application server |
2018
- 2018-09-18 CN CN201811090036.8A patent/CN109299094A/en active Pending
2019
- 2019-01-10 WO PCT/CN2019/071126 patent/WO2020057021A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN109299094A (en) | 2019-02-01 | Data table processing method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020057021A1 (en) | Data table processing method and device, computer device and storage medium | |
WO2021184571A1 (en) | Dynamic form generation method, apparatus, computer device, and storage medium | |
CN111444723B (en) | Information extraction method, computer device, and storage medium | |
CN111666401B (en) | Document recommendation method, device, computer equipment and medium based on graph structure | |
CN110377558B (en) | Document query method, device, computer equipment and storage medium | |
WO2020057022A1 (en) | Associative recommendation method and apparatus, computer device, and storage medium | |
WO2021000555A1 (en) | Knowledge graph-based question answering method and apparatus, computer device and storage medium | |
US11914968B2 (en) | Official document processing method, device, computer equipment and storage medium | |
WO2017215370A1 (en) | Method and apparatus for constructing decision model, computer device and storage device | |
WO2020147395A1 (en) | Emotion-based text classification method and device, and computer apparatus | |
WO2020177365A1 (en) | Data mining-based social insurance data processing method and apparatus, and computer device | |
WO2021164205A1 (en) | Identity identification-based data auditing method and apparatus, and computer device | |
WO2018184518A1 (en) | Microblog data processing method and device, computer device and storage medium | |
CN110472114B (en) | Abnormal data early warning method and device, computer equipment and storage medium | |
CN110674131A (en) | Financial statement data processing method and device, computer equipment and storage medium | |
WO2020237872A1 (en) | Method and apparatus for testing accuracy of semantic analysis model, storage medium, and device | |
CN110362798B (en) | Method, apparatus, computer device and storage medium for judging information retrieval analysis | |
WO2023272850A1 (en) | Decision tree-based product matching method, apparatus and device, and storage medium | |
US11574491B2 (en) | Automated classification and interpretation of life science documents | |
CN111651552A (en) | Structured information determination method and device and electronic equipment | |
CN112085087A (en) | Method and device for generating business rules, computer equipment and storage medium | |
CN113705198B (en) | Scene graph generation method and device, electronic equipment and storage medium | |
US9594757B2 (en) | Document management system, document management method, and document management program | |
CN117095422B (en) | Document information analysis method, device, computer equipment and storage medium | |
CN117725182A (en) | Data retrieval method, device, equipment and storage medium based on large language model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19861655; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 08/07/2021) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19861655; Country of ref document: EP; Kind code of ref document: A1 |