CN117539981A - Method, equipment and medium for constructing theme data set

Info

Publication number: CN117539981A
Application number: CN202311560826.9A
Authority: CN
Inventors: 郑敏; 单震
Original assignee: Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Current assignee: Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date: 2023-11-21
Filing date: 2023-11-21
Publication date: 2024-02-09

Abstract

The application discloses a method, a device and a medium for constructing a theme data set. The method comprises the following steps: receiving a theme data set construction request from a user, and acquiring a data screening requirement; parsing the data screening requirement to obtain parsed content of different field types; creating a parsed content configuration table based on the parsed content; generating a target database statement based on the parsed content configuration table and a preset database statement template; and constructing the theme data set through the target database statement. The screening conditions of the data requirement are parsed around dimension fields and quantization fields, the parsed content is stored in a configuration table according to rules, and the theme data set is constructed in combination with templated SQL. In this way, data requirements can be responded to quickly. Because configuration is separated from code, the method is reusable, reduces the amount of data development work, is easy to maintain, and has good value for popularization and application.

Description

Method, equipment and medium for constructing theme data set

Technical Field

The present invention relates to the field of data matching, and in particular, to a method, an apparatus, and a medium for constructing a theme data set.

Background

A theme data set is a data set built for a specific theme or task, and is widely used in machine learning, data mining and analysis, decision support and other fields. Taking e-commerce data analysis as an example, a data team often needs to produce an analysis report quickly in response to a valuable social hot topic, so historical data on the corresponding theme must be retrieved from the massive data in a data warehouse to support the views in the report.

Existing data processing methods generally convert the filtering conditions in a requirement document into corresponding SQL statements one by one, and then execute those SQL statements manually to acquire the required data. However, this approach tends to be inefficient when a fast response to data demands is required, and it makes later maintenance difficult.

Disclosure of Invention

In order to solve the above problems, the present application proposes a method, an apparatus, and a medium for constructing a theme data set, where the method includes:

receiving a subject data set construction request from a user, and acquiring a data screening requirement; analyzing the data screening requirement to obtain analysis contents of different field types; creating an analytic content configuration table based on the analytic content; generating a target database statement based on the analysis content configuration table and a preset database statement template; and constructing a theme data set through the target database statement.

In one example, the field types include a dimension field and a quantization field; the dimension field is used for describing the environment attribute of the environment where the quantized data is located and providing context for the quantized data; the quantization field is used to represent a performance index of the business process.

In one example, the parsing the data filtering requirement to obtain parsed contents with different field types specifically includes: analyzing the data screening requirement to determine an analysis field in the data screening requirement; performing semantic analysis on the analysis field to determine a field type corresponding to the analysis field; determining a preset category classification dictionary corresponding to the analysis field based on the field type; and determining the analysis content corresponding to the analysis field based on the preset category classification dictionary.

In one example, the creating the parsing content configuration table based on the parsing content specifically includes: determining the configuration table content corresponding to the data screening requirement based on the analysis content; the configuration table content comprises a requirement batch identifier, a dimension screening field, a fuzzy matching characteristic word, a fuzzy matching interference word, a standardized field and a de-duplication sequence; and creating an analytic content configuration table based on the configuration table content.

In one example, after determining the configuration table content corresponding to the data filtering requirement based on the parsing content, the method further includes: determining field types corresponding to the configuration table contents respectively, wherein the field types at least comprise character types and numerical types; the field types corresponding to the requirement batch identifier, the dimension screening field, the fuzzy matching characteristic word, the fuzzy matching interference word and the standardized field are character types; the field type corresponding to the de-duplication sequence is numerical.

In one example, the generating the target database statement based on the parsing content configuration table and a preset database statement template specifically includes: determining a preset function template based on the analysis content configuration table and the data screening requirement, wherein the preset function template comprises a plurality of preset database statement templates; determining blank fields in the plurality of preset database statement templates in the preset function template; and replacing the blank fields based on the configuration table content corresponding to the analysis content configuration table to generate the target database statement.

In one example, the constructing a subject data set by the target database statement specifically includes: determining the data quantity of a source table, and splitting the source table to obtain a plurality of sub-source tables if the data quantity of the source table is higher than a preset data quantity threshold; and circularly executing the target database statement on the plurality of sub-source tables until the source table data volume is traversed.

In one example, the method further comprises: determining target original data which accords with a plurality of matching rules in the data screening requirement; and performing de-duplication on the target original data in the matching result based on the de-duplication sequence so as to ensure the uniqueness of the target original data.

The application also provides a device for constructing a subject data set, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform: receiving a subject data set construction request from a user, and acquiring a data screening requirement; analyzing the data screening requirement to obtain analysis contents of different field types; creating an analytic content configuration table based on the analytic content; generating a target database statement based on the analysis content configuration table and a preset database statement template; and constructing a theme data set through the target database statement.

The present application also provides a non-volatile computer storage medium storing computer-executable instructions configured to: receiving a subject data set construction request from a user, and acquiring a data screening requirement; analyzing the data screening requirement to obtain analysis contents of different field types; creating an analytic content configuration table based on the analytic content; generating a target database statement based on the analysis content configuration table and a preset database statement template; and constructing a theme data set through the target database statement.

The method provided by the application has the following beneficial effects: the screening conditions of the data requirement are parsed around dimension fields and quantization fields, the parsed content is stored in a configuration table according to rules, and the theme data set is constructed in combination with templated SQL. In this way, data requirements can be responded to quickly. Because configuration is separated from code, the method is reusable, reduces the amount of data development work, is easy to maintain, and has good value for popularization and application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a flow chart of a method for constructing a topic data set according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an apparatus for constructing a theme data set according to an embodiment of the present application.

Detailed Description

To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and the corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

FIG. 1 is a flow chart illustrating a method for constructing a theme data set according to one or more embodiments of the present disclosure. The method can be applied to the construction of different types of theme data sets, such as an internet financial theme data set, a business theme data set, an instant messaging theme data set, a game theme data set, a public business theme data set and the like. The process may be performed by a computing device in the corresponding domain (e.g., a risk control server or an intelligent mobile terminal corresponding to a payment service), and certain input parameters or intermediate results in the process may be adjusted manually to help improve accuracy.

The method according to the embodiments of the present application may be executed by a terminal device or a server, which is not particularly limited in this application. For ease of understanding and description, the following embodiments are described in detail with a server as the example.

It should be noted that the server may be a single device, or may be a system composed of a plurality of devices, that is, a distributed server, which is not specifically limited in this application.

As shown in fig. 1, an embodiment of the present application provides a method for constructing a theme data set, including:

S101: Receive a theme data set construction request from a user, and acquire a data screening requirement.

First, the data screening requirement and the source table data corresponding to the current construction task are determined based on the theme data set construction request sent by the user. It should be noted that the construction task refers to matching original data in the source table data according to the data screening requirement, so as to obtain the theme data set.

S102: Parse the data screening requirement to obtain parsed content of different field types.

The data screening requirement provided by the user is parsed to obtain the parsed content of the different field types that it involves. It should be noted that the data screening requirement may be in text form or a voice command; before parsing, it is converted into text and the user confirms the text content, which improves parsing accuracy.

In one embodiment, taking the most common commodity data in the internet collected data as an example, the field types of the data can be mainly divided into two types: a dimension field and a quantization field. The dimension field is used for describing the attribute of the background and environment where the quantized data are located, providing context for the quantized data, and commonly comprises month, category, region and the like; the quantization field is a performance index for representing the business process, such as commodity name, sales volume, price, etc. When the requirements are analyzed, the screening conditions in the requirements are analyzed and classified around the two fields.

Further, when analyzing the data screening requirement, the data screening requirement is analyzed firstly to determine an analysis field in the data screening requirement, then semantic analysis is performed on the analysis field to determine a field type corresponding to the analysis field, a preset category classification dictionary corresponding to the analysis field is determined based on the field type, and finally analysis content corresponding to the analysis field is determined based on the preset category classification dictionary.

The following description takes as the data screening requirement: "acquire data related to hot pot food materials in Chongqing on the Tmall platform, exclude self-heating food, and standardize such data into the category 'hot pot food materials'". Parsing this requirement yields the dimension fields e-commerce platform (limited to "Tmall") and region (limited to "Chongqing"). Because no hot pot category can be obtained from the categories in the "network retail monitoring index specification", "data related to hot pot food materials" is parsed into the dimension field primary category (limited to "grain, oil and food") together with a condition on the quantization field commodity name, which must contain "hot pot" and must not contain "self-heating".
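The classification step in this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary `FIELD_TYPE_DICT`, the function `parse_requirement`, and the field names are hypothetical, and the semantic analysis that turns free text into field/condition pairs is assumed to have already happened.

```python
# Hypothetical category dictionary: maps a field found in the requirement
# to its field type. The entries are illustrative only.
FIELD_TYPE_DICT = {
    "platform": "dimension",
    "region": "dimension",
    "primary_category": "dimension",
    "commodity_name": "quantization",
}


def parse_requirement(conditions):
    """Group screening conditions by field type.

    `conditions` maps a field name to its condition; the result maps each
    field type to the conditions that fall under it.
    """
    parsed = {"dimension": {}, "quantization": {}}
    for field, condition in conditions.items():
        field_type = FIELD_TYPE_DICT.get(field)
        if field_type is None:
            raise KeyError(f"no dictionary entry for field: {field}")
        parsed[field_type][field] = condition
    return parsed


# The hot pot requirement, already reduced to field/condition pairs.
hot_pot = {
    "platform": "Tmall",
    "region": "Chongqing",
    "primary_category": "grain, oil and food",
    "commodity_name": {"include": ["hot pot"], "exclude": ["self-heating"]},
}
parsed = parse_requirement(hot_pot)
```

Here `parsed["dimension"]` holds the three dimension conditions and `parsed["quantization"]` holds the keyword condition on the commodity name.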

S103: Create a parsed content configuration table based on the parsed content.

After the parsed content is obtained, it can be stored in the configuration table according to a preset rule.

In one embodiment, when creating the parsed content configuration table, the configuration table content corresponding to the data screening requirement is first determined based on the parsed content. The configuration table content includes a requirement batch identifier, a dimension screening field, a fuzzy matching feature word, a fuzzy matching interference word, a standardized field and a de-duplication sequence. The configuration table content is then filled into a parsed content configuration table template according to a preset rule to obtain the parsed content configuration table.

In one embodiment, when the configuration table contents are filled into the parsing content configuration table template according to a preset rule, first, the field types corresponding to the configuration table contents are determined, where the field types at least include a character type and a numerical type. The field types corresponding to the requirement batch identification, the dimension screening field, the fuzzy matching characteristic word, the fuzzy matching interference word and the standardized field are character types, and the field types corresponding to the deduplication sequence are numerical types. The structure of the analytic content configuration table is abstracted as follows:

Field name	Field type	Field annotation
REQ_BATCH	Character type	Requirement batch identifier
DIM_COL_1	Character type	Dimension filtering field 1
DIM_COL_N	Character type	Dimension filtering field N
MATCH_WORDS1	Character type	Fuzzy matching feature words 1
MATCH_WORDS2	Character type	Fuzzy matching feature words 2
NOT_MATCH_WORDS	Character type	Fuzzy matching interference words
STD_COL_1	Character type	Standardized field 1
STD_COL_N	Character type	Standardized field N
STD_ORDER	Numerical type	De-duplication sequence

Here REQ_BATCH represents the batch identifier of the current requirement. DIM_COL_1 ... DIM_COL_N represent the dimension filtering fields; the specific number of fields depends on the structure of the corresponding source table in the data warehouse and on the requirement.
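As a rough sketch, the configuration table could be created and filled as below. The assumptions are: an in-memory SQLite database stands in for the data warehouse, the table name `parse_config` is hypothetical, N is fixed at three dimension columns and one standardized column, and the row values are taken from the hot pot example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE parse_config (
        REQ_BATCH       TEXT,     -- requirement batch identifier
        DIM_COL_1       TEXT,     -- dimension filtering field 1 (platform here)
        DIM_COL_2       TEXT,     -- dimension filtering field 2 (region here)
        DIM_COL_3       TEXT,     -- dimension filtering field 3 (primary category)
        MATCH_WORDS1    TEXT,     -- fuzzy matching feature words 1
        MATCH_WORDS2    TEXT,     -- fuzzy matching feature words 2
        NOT_MATCH_WORDS TEXT,     -- fuzzy matching interference words
        STD_COL_1       TEXT,     -- standardized field 1
        STD_ORDER       INTEGER   -- de-duplication sequence
    )
    """
)

# One configuration row; the batch identifier is made up for illustration.
row = (
    "BATCH_001", "Tmall", "Chongqing", "grain, oil and food",
    "(hot pot)", None, "(self-heating)", "hot pot food materials", 1,
)
conn.execute("INSERT INTO parse_config VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", row)

stored = conn.execute(
    "SELECT REQ_BATCH, MATCH_WORDS1, STD_ORDER FROM parse_config"
).fetchone()
```

All columns except STD_ORDER are character columns, matching the field types listed in the table.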

MATCH_WORDS1, MATCH_WORDS2 and NOT_MATCH_WORDS each store a regular expression of the form (keyword 1|keyword 2|keyword N). Each quantization field that needs to be screened stores its screening configuration in this group of fields, and the three fields together implement the logical operations AND, OR and NOT. For example, suppose the screening condition on a quantization field in the requirement is: keyword 1 must be included, while either keyword 2 or keyword 3 may be included, but keyword 4 and keyword 5 cannot be present. The group of fields is then stored as: MATCH_WORDS1: (keyword 1); MATCH_WORDS2: (keyword 2|keyword 3); NOT_MATCH_WORDS: (keyword 4|keyword 5).
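One possible reading of this three-field logic is sketched below. It assumes that the two feature-word patterns are combined with AND (each non-empty pattern must match), that the alternatives inside one pattern supply the OR, and that the interference pattern supplies the NOT. The function name `matches` and the sample keywords are hypothetical.

```python
import re


def matches(name, match_words1, match_words2, not_match_words):
    """Return True when `name` satisfies the three-field rule.

    An empty pattern (None or "") means that condition is not used.
    """
    if match_words1 and not re.search(match_words1, name):
        return False  # AND: first feature pattern is required
    if match_words2 and not re.search(match_words2, name):
        return False  # AND: second feature pattern is required
    if not_match_words and re.search(not_match_words, name):
        return False  # NOT: any interference word rejects the record
    return True


# Illustrative rule: must contain "hot pot", must contain "base" or
# "seasoning", must not contain "self-heating" or "instant".
rule = ("(hot pot)", "(base|seasoning)", "(self-heating|instant)")

accepted = matches("spicy hot pot base", *rule)
rejected_by_not = matches("self-heating hot pot base", *rule)
rejected_by_and = matches("hot pot", *rule)
```

With this rule, only the first commodity name passes; the second is rejected by the interference words and the third lacks a word from the second feature pattern.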

STD_COL_1 ... STD_COL_N represent the standardized fields.

STD_ORDER is used for the de-duplication logic applied to matched original data. For example, one piece of original data may match several configuration rules in the requirement, so the matches must be prioritized according to the configured STD_ORDER value and the business meaning of the requirement to ensure the uniqueness of the basic data. During de-duplication, the target original data that satisfies a plurality of matching rules in the data screening requirement is determined first; the target original data in the matching result is then de-duplicated based on the de-duplication sequence, so as to ensure the uniqueness of the target original data.
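A minimal sketch of this de-duplication follows. It assumes that a smaller STD_ORDER value means higher priority, which the text does not state explicitly; the function name `deduplicate` and the sample records are hypothetical.

```python
def deduplicate(matched):
    """Keep one match per source record: the one with the smallest STD_ORDER.

    `matched` is an iterable of (record_id, std_value, std_order) tuples,
    one tuple per configuration rule that the record satisfied.
    """
    best = {}
    for record_id, std_value, std_order in matched:
        current = best.get(record_id)
        # Assumption: a lower STD_ORDER wins when several rules match.
        if current is None or std_order < current[1]:
            best[record_id] = (std_value, std_order)
    return {record_id: value for record_id, (value, _) in best.items()}


# Record 101 matched two rules; record 102 matched one.
matched = [
    (101, "condiments", 2),
    (101, "hot pot food materials", 1),
    (102, "condiments", 2),
]
unique = deduplicate(matched)
```

Each source record then appears exactly once in `unique`, carrying the standardized value of its highest-priority rule.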

S104: Generate a target database statement based on the parsed content configuration table and a preset database statement template.

The target database statement is generated from the parsed content configuration table through a preset SQL statement template, so that executing the target database statement carries out the data screening requirement.

In one embodiment, when generating the target database statement, a preset function template is first determined based on the parsed content configuration table and the data screening requirement, wherein the preset function template comprises a plurality of preset database statement templates. The blank fields in the plurality of preset database statement templates in the preset function template are then determined, and the blank fields are replaced based on the configuration table content corresponding to the parsed content configuration table to generate the target database statement.

Wherein the database statement template may be abstracted into the following form:

The above pseudocode is written in Oracle syntax as an example; most other mainstream database types can implement the same functions by substituting the corresponding functions.
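The blank-field replacement can be illustrated with the following sketch. It is not the template from the application: `SQL_TEMPLATE`, the column names `PLATFORM` and `COMMODITY_NAME`, the table names, and the helper functions are all hypothetical, and only one dimension field and two keyword fields are filled. `REGEXP_LIKE` is used in the Oracle style mentioned in the text; the statement is only built as a string here, not executed.

```python
from string import Template

# Hypothetical statement template; $-prefixed names are the blank fields.
SQL_TEMPLATE = Template(
    "INSERT INTO $target_table "
    "SELECT s.*, '$std_value' AS STD_COL_1 "
    "FROM $source_table s "
    "WHERE s.PLATFORM = '$platform' "
    "AND REGEXP_LIKE(s.COMMODITY_NAME, '$match_words1') "
    "AND NOT REGEXP_LIKE(s.COMMODITY_NAME, '$not_match_words')"
)


def quote(value):
    """Escape single quotes so the value can sit inside a SQL string literal."""
    return value.replace("'", "''")


def build_statement(config, source_table, target_table):
    """Fill the blank fields of SQL_TEMPLATE from one configuration row."""
    return SQL_TEMPLATE.substitute(
        target_table=target_table,
        source_table=source_table,
        std_value=quote(config["STD_COL_1"]),
        platform=quote(config["DIM_COL_1"]),
        match_words1=quote(config["MATCH_WORDS1"]),
        not_match_words=quote(config["NOT_MATCH_WORDS"]),
    )


config = {
    "DIM_COL_1": "Tmall",
    "MATCH_WORDS1": "(hot pot)",
    "NOT_MATCH_WORDS": "(self-heating)",
    "STD_COL_1": "hot pot food materials",
}
statement = build_statement(config, "dw_goods", "theme_hot_pot")
```

Because the configuration values are spliced into the statement text, they should come only from a trusted configuration table; the `quote` helper covers single quotes and nothing more.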

S105: Construct the theme data set through the target database statement.

The target database statement is executed on the source table data to match the original data against the data screening requirement, so as to obtain the original data that makes up the theme data set and thereby construct the theme data set.

In one embodiment, when the theme data set is constructed and the data volume of the source table is large, the target database statement may be executed in a loop after the data is split. In this case, the data volume of the source table is determined first; if it is higher than a preset data volume threshold, the source table is split to obtain a plurality of sub-source tables. The target database statement is then executed on the sub-source tables in turn until all of the source table data has been traversed.
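The split-and-loop execution can be sketched as follows. The assumptions are: an in-memory SQLite database stands in for the warehouse, the "sub-source tables" are modeled as contiguous id ranges of one table rather than physically separate tables, and `THRESHOLD`, `CHUNK_SIZE`, `STATEMENT`, and `build_theme` are hypothetical. A simple LIKE filter stands in for the generated target statement.

```python
import sqlite3

THRESHOLD = 4   # hypothetical row-count threshold
CHUNK_SIZE = 3  # hypothetical number of ids per sub-source range

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE theme (id INTEGER, name TEXT)")
names = ["hot pot base", "rice", "hot pot dip", "tea",
         "noodles", "hot pot ball", "salt"]
conn.executemany("INSERT INTO source (name) VALUES (?)", [(n,) for n in names])

# Stand-in for the generated target statement, restricted to an id range.
STATEMENT = (
    "INSERT INTO theme SELECT id, name FROM source "
    "WHERE id BETWEEN ? AND ? AND name LIKE '%hot pot%'"
)


def build_theme(conn):
    """Run STATEMENT once, or once per id range when the table is large."""
    total, low, high = conn.execute(
        "SELECT COUNT(*), MIN(id), MAX(id) FROM source"
    ).fetchone()
    if total == 0:
        return 0
    # Split only when the row count exceeds the threshold.
    step = CHUNK_SIZE if total > THRESHOLD else high - low + 1
    runs = 0
    for start in range(low, high + 1, step):
        conn.execute(STATEMENT, (start, start + step - 1))
        runs += 1
    return runs


runs = build_theme(conn)
rows = conn.execute("SELECT name FROM theme ORDER BY id").fetchall()
```

With seven source rows and a threshold of four, the statement runs once per id range until the whole source table has been covered.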

As shown in fig. 2, an embodiment of the present application further provides a device for constructing a theme data set, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

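the memory stores instructions executable by the at least one processor to enable the at least one processor to perform: receiving a subject data set construction request from a user, and acquiring a data screening requirement; analyzing the data screening requirement to obtain analysis contents of different field types; creating an analytic content configuration table based on the analytic content; generating a target database statement based on the analysis content configuration table and a preset database statement template; and constructing a theme data set through the target database statement.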

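The embodiments also provide a non-volatile computer storage medium storing computer-executable instructions configured to: receiving a subject data set construction request from a user, and acquiring a data screening requirement; analyzing the data screening requirement to obtain analysis contents of different field types; creating an analytic content configuration table based on the analytic content; generating a target database statement based on the analysis content configuration table and a preset database statement template; and constructing a theme data set through the target database statement.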

All embodiments in the application are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, the device and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant details, refer to the corresponding parts of the method embodiments.

The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not described in detail herein.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims

1. A method of constructing a subject data set, comprising:

receiving a subject data set construction request from a user, and acquiring a data screening requirement;

analyzing the data screening requirement to obtain analysis contents of different field types;

creating an analytic content configuration table based on the analytic content;

generating a target database statement based on the analysis content configuration table and a preset database statement template;

and constructing a theme data set through the target database statement.

2. The method of claim 1, wherein the field types include a dimension field and a quantization field;

the dimension field is used for describing the environment attribute of the environment where the quantized data is located and providing context for the quantized data;

the quantization field is used to represent a performance index of the business process.

3. The method of claim 2, wherein the parsing the data filtering requirement to obtain parsed contents with different field types specifically includes:

analyzing the data screening requirement to determine an analysis field in the data screening requirement;

performing semantic analysis on the analysis field to determine a field type corresponding to the analysis field;

determining a preset category classification dictionary corresponding to the analysis field based on the field type;

and determining the analysis content corresponding to the analysis field based on the preset category classification dictionary.

4. The method according to claim 1, wherein creating the parsed content configuration table based on the parsed content specifically comprises:

determining the configuration table content corresponding to the data screening requirement based on the analysis content;

the configuration table content comprises a requirement batch identifier, a dimension screening field, a fuzzy matching characteristic word, a fuzzy matching interference word, a standardized field and a duplication eliminating sequence;

and creating an analytic content configuration table based on the configuration table content.

5. The method of claim 4, wherein after determining the configuration table content corresponding to the data filtering requirement based on the parsing content, the method further comprises:

determining field types corresponding to the configuration table contents respectively, wherein the field types at least comprise character types and numerical types;

the field types corresponding to the requirement batch identifier, the dimension screening field, the fuzzy matching characteristic word, the fuzzy matching interference word and the standardized field are character types;

the field type corresponding to the de-duplication sequence is numerical.

6. The method according to claim 1, wherein the generating the target database statement based on the parsed content configuration table and a preset database statement template specifically includes:

determining a preset function template based on the analysis content configuration table and the data screening requirement, wherein the preset function template comprises a plurality of preset database statement templates;

determining blank fields in the plurality of preset database statement templates in the preset function template;

and replacing the blank field based on the configuration table content corresponding to the analysis content configuration table to generate the target database statement.

7. The method according to claim 1, wherein said constructing a subject data set from said target database statement, in particular comprises:

determining the data quantity of a source table, and splitting the source table to obtain a plurality of sub-source tables if the data quantity of the source table is higher than a preset data quantity threshold;

and circularly executing the target database statement on the plurality of sub-source tables until the source table data volume is traversed.

8. The method according to claim 4, wherein the method further comprises:

determining target original data which accords with a plurality of matching rules in the data screening requirement;

and performing de-duplication on the target original data in the matching result based on the de-duplication sequence so as to ensure the uniqueness of the target original data.

9. A device for constructing a subject data set, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform:

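receiving a subject data set construction request from a user, and acquiring a data screening requirement;

analyzing the data screening requirement to obtain analysis contents of different field types;

creating an analytic content configuration table based on the analytic content;

generating a target database statement based on the analysis content configuration table and a preset database statement template;

and constructing a theme data set through the target database statement.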

10. A non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions configured to:

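receiving a subject data set construction request from a user, and acquiring a data screening requirement;

analyzing the data screening requirement to obtain analysis contents of different field types;

creating an analytic content configuration table based on the analytic content;

generating a target database statement based on the analysis content configuration table and a preset database statement template;

and constructing a theme data set through the target database statement.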