CN117828006A - Programmable data extraction method, device, equipment and medium - Google Patents

Programmable data extraction method, device, equipment and medium

Info

Publication number
CN117828006A
Authority
CN
China
Prior art keywords
component
extraction
data
task
components
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311863590.6A
Other languages
Chinese (zh)
Inventor
李振华
牟宣理
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202311863590.6A
Publication of CN117828006A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides an orchestratable data extraction method, device, equipment and medium. The method comprises the following steps: in response to a data extraction task orchestration request, outputting an extraction task orchestration interface to a user, where the orchestration interface is used for orchestrating an extraction task composed of components, the orchestration interface includes a component toolset including at least one component, and the at least one component includes at least an extraction component for extracting structured data from unstructured data; in response to the user selecting at least one component in the task orchestration interface, orchestrating the at least one component based on the connection relationships among the at least one component to obtain an extraction task; and running the extraction task to process unstructured data into structured data.

Description

Programmable data extraction method, device, equipment and medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to an orchestratable data extraction method, apparatus, device, and medium.
Background
Database tools can typically manage and process only structured data efficiently, and have difficulty processing unstructured data directly. Unstructured data therefore often needs to be converted into structured data. Structured data extraction generally refers to the process of extracting structured data from unstructured data, such as long text files, and storing it in a table.
Existing structured data extraction mainly relies on simple manual extraction, in which repetitive manual work and a large amount of text labeling are needed to complete the extraction; it is highly complex and inefficient.
Disclosure of Invention
In view of this, the present specification provides the following methods, apparatus, devices, and media.
In a first aspect of the present application, there is provided an orchestratable data extraction method, the method comprising:
in response to a data extraction task orchestration request, outputting an extraction task orchestration interface to a user; the orchestration interface is used for orchestrating an extraction task composed of components; the orchestration interface includes a component toolset including at least one component; the at least one component includes at least an extraction component for extracting structured data from unstructured data;
in response to the user selecting at least one component in the task orchestration interface, orchestrating the at least one component based on the connection relationships among the at least one component to obtain an extraction task;
and running the extraction task to process unstructured data into structured data.
In a second aspect of the present application, there is provided an orchestratable data extraction device, the device comprising:
an output unit, configured to output an extraction task orchestration interface to a user in response to a data extraction task orchestration request; the orchestration interface is used for orchestrating an extraction task composed of components; the orchestration interface includes a component toolset including at least one component; the at least one component includes at least an extraction component for extracting structured data from unstructured data;
an orchestration unit, configured to, in response to the user selecting at least one component in the task orchestration interface, orchestrate the at least one component based on the connection relationships among the at least one component to obtain an extraction task;
and a running unit, configured to run the extraction task to process unstructured data into structured data.
In a third aspect of the present application, there is provided an electronic device, including a communication interface, a processor, a memory, and a bus, where the communication interface, the processor, and the memory are connected to each other by the bus;
the memory stores machine-readable instructions, and the processor performs the above method by invoking the machine-readable instructions.
In a fourth aspect of the present application, there is provided a machine-readable storage medium storing machine-readable instructions that, when invoked and executed by a processor, implement the above method.
The above embodiments of the present specification have at least the following advantageous effects:
According to the above technical solution, a component toolset and a simple, easy-to-use orchestration interface are provided, so that a user can orchestrate data extraction tasks conveniently, quickly, and visually; this enables customized structured data extraction, improves extraction efficiency, and saves labor.
Drawings
FIG. 1 is a flow chart of an orchestratable data extraction method shown in an illustrative embodiment;
FIG. 2 is a schematic diagram of an orchestration interface of an orchestratable data extraction method, shown in an illustrative embodiment;
FIG. 3 is a schematic diagram of a component parameter setting interface of an orchestratable data extraction method, shown in an illustrative embodiment;
FIG. 4 is a schematic diagram of resource scheduling in an orchestratable data extraction method, shown in an illustrative embodiment;
FIG. 5 is a hardware block diagram of an electronic device in which an orchestratable data extraction device is located, as shown in an illustrative embodiment;
FIG. 6 is a block diagram of an orchestratable data extraction device, according to one illustrative embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present disclosure, a brief description of the related art involved in the embodiments of the present disclosure is provided below.
Structured data: data organized in the form of tables, rows, and columns, with a clear structure and a fixed pattern. It is typically stored in a relational database and can be represented using a predefined schema (the structure of a table). It is easy to query, analyze, and process, and supports complex relations and join operations.
Unstructured data: data with no fixed pattern or predefined data model. Such data typically exists in free form, without being organized into tables or rows. It is not easy to represent in tabular form, and it typically has to be parsed and processed before useful information can be extracted.
Structured data extraction: the process of extracting structured data from unstructured data, such as long text files, by various means according to the structural information of a defined structured object, and storing it in a warehouse table.
In the related art, structured data extraction is mainly performed by simple manual extraction, which relies on repetitive manual work and a large amount of text labeling to complete the extraction, and is therefore highly complex and inefficient.
In view of this, the present specification proposes an orchestratable data extraction method that provides a component toolset and a simple, easy-to-use orchestration interface, so that a user can orchestrate data extraction tasks conveniently, quickly, and visually, and obtain a customized structured data extraction capability.
The following describes the present application through specific embodiments and in connection with specific application scenarios.
Referring to fig. 1, fig. 1 is a flow chart illustrating an orchestratable data extraction method according to an exemplary embodiment.
The above method may perform the steps of:
Step 102: in response to a data extraction task orchestration request, outputting an extraction task orchestration interface to a user; the orchestration interface is used for orchestrating an extraction task composed of components; the orchestration interface includes a component toolset including at least one component; the at least one component includes at least an extraction component for extracting structured data from unstructured data.
The process of extracting structured data from unstructured data is generally designed according to the characteristics of the unstructured data to be processed and the schema of the structured data to be extracted. In the related art, extraction can generally only be done by hand, based on manual analysis.
However, in general, the extraction of structured data can be broken down into a number of core steps, such as data input, data splitting, data filtering, data extraction, and data output; these core steps can be implemented by general-purpose components.
Therefore, according to specific structured data extraction requirements, components can be selected from a component toolset made up of general-purpose components and orchestrated, so that they are combined conveniently and quickly into a structured data extraction task.
Further, a general task orchestration interface may be provided for the user to orchestrate extraction tasks visually, where the orchestration interface contains the component toolset described above. The component toolset may provide the components that various extraction tasks may require, including at least an extraction component for extracting structured data from unstructured data. In addition, other components may be included, such as components for data input, data splitting, data filtering, data connection, data output, and other functions. The specification does not limit the specific types of components.
The component toolset may be extensible, and new components may be added to it as needed.
When a user needs to create and orchestrate a structured data extraction task, a data extraction task orchestration request may be sent. When the request is received, a visual orchestration interface may be provided to the user for orchestrating the structured data extraction task.
Step 104: in response to the user selecting at least one component in the task orchestration interface, orchestrating the at least one component based on the connection relationships among the at least one component to obtain an extraction task.
The user can freely select components from the component toolset in the task orchestration interface and orchestrate them.
Specifically, the components to be used can be placed on the canvas of the task orchestration interface by clicking and dragging, and the connection relationships among the components are orchestrated on the canvas.
By connecting the components in a certain order, a structured data extraction task that executes in the order of connection can be orchestrated.
Typically, the interconnected components execute in succession, with the output of a preceding component serving as the input to the subsequent component, and the interconnected components constitute the complete structured data extraction task.
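The connection-ordered execution described for step 104 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of this application: the function `run_task`, the component names, and the representation of connecting lines as `(head, tail)` pairs are all hypothetical. Components are modeled as plain functions, and a component runs once every component upstream of it has produced its output.

```python
from collections import defaultdict, deque


def run_task(components, wires, source_input):
    """Run components in an order consistent with the wires (head -> tail).

    components: dict mapping a (hypothetical) component name to a function.
    wires: list of (head, tail) pairs; the tail runs after the head.
    """
    downstream = defaultdict(list)
    indegree = {name: 0 for name in components}
    for head, tail in wires:
        downstream[head].append(tail)
        indegree[tail] += 1

    # Components with no incoming wire receive the task input directly.
    inputs = {name: [source_input] for name, deg in indegree.items() if deg == 0}
    ready = deque(inputs)
    outputs, order = {}, []
    while ready:
        name = ready.popleft()
        order.append(name)
        args = inputs[name]
        # One upstream: pass its output as-is; several upstreams: pass a list.
        outputs[name] = components[name](args[0] if len(args) == 1 else args)
        for nxt in downstream[name]:
            inputs.setdefault(nxt, []).append(outputs[name])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(components):
        raise ValueError("the wires form a cycle")
    return order, outputs


# Hypothetical three-component task: input -> extraction -> output.
components = {
    "file_input": lambda text: text.strip(),
    "extract": lambda text: {"first_word": text.split()[0]},
    "table_output": lambda row: [row],
}
wires = [("file_input", "extract"), ("extract", "table_output")]
order, outputs = run_task(components, wires, "  flood report 2023 ")
```

Here `order` lists the components in execution order and `outputs["table_output"]` holds the single structured row produced by the chain.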
Step 106: running the extraction task to process unstructured data into structured data.
When orchestration of the extraction task is completed, the extraction task may be used to perform data extraction on unstructured data to obtain structured data. The extraction task is reusable: the same extraction task can be applied to multiple pieces of unstructured data.
The extraction task is also modifiable: an existing extraction task can be partially modified in the task orchestration interface and then used for a different extraction need.
According to this embodiment, a component toolset and a simple, easy-to-use orchestration interface are provided, so that a user can orchestrate data extraction tasks conveniently, quickly, and visually; this enables customized structured data extraction, improves extraction efficiency, and saves labor.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating an orchestration interface of an orchestratable data extraction method according to an exemplary embodiment.
As shown on the left side of FIG. 2, the orchestration interface contains a component toolset, which may contain multiple component groups, each of which may contain one or more different components.
Taking the embodiment of fig. 2 as an example, the components in the component tool set may include:
an input component, used for receiving the unstructured data from which structured data is to be extracted;
when the unstructured data to be extracted is a text file, the input component may be a text input component.
An output component, used for outputting the data obtained by structured extraction to a preset destination data carrier;
when the preset destination data carrier is a destination table, the output component may be a table output component.
An extraction component, used for extracting from the unstructured data according to a preset rule to obtain structured data;
specifically, the extraction component may include a model extraction component that performs data extraction based on a preset machine learning model, and a regular expression component that performs data extraction based on a preset regular expression.
A conversion component, used for performing preset intermediate processing on the data;
the conversion component may specifically include:
an add-constant component, used for adding a custom constant to the data as a tag for the processed content, so that content of multiple categories remains identifiable when it is output to the same place;
a column-split-to-rows component, used for splitting the content of a target field into multiple rows according to the needs of the scenario, for further processing or further splitting;
a field selection component, used for renaming certain fields output by the upstream component or configuring them not to be output;
a filter-records component, used for filtering the output of the upstream node by condition and processing it in branches;
a custom code component, used for custom data processing by supporting manually written code; for example, the custom code component may be a JavaScript code component that supports code written in JavaScript.
A connection component, used for joining the data of multiple inputs so that the merged data is output for downstream processing as a whole;
when the connection component joins data by columns, it may be a data column connection component.
At the bottom layer, all of the above components may be implemented on top of a mainstream open-source ETL tool through plug-in extension development.
The user may select a component from the component tool set and place it in a canvas in the orchestration interface for specific orchestration.
The user can change the position of components in the canvas by dragging and can draw connecting lines between components. According to the direction of a connecting line, of the two components it joins, the second component at the tail end of the line is placed after the first component at the head end of the line in the execution order. Typically, the output of the first component is the input of the second component.
Different components can support different numbers and modes of connection: some components can serve as the first component at the head end of a line, some can serve as the second component at the tail end, and some can serve as both.
For example, in fig. 2, two connecting lines start from the text file input component "file input" and point to the regular expression component "extract natural disaster section" and the regular expression component "extract accident report section" respectively.
These connecting lines indicate that the regular expression components "extract natural disaster section" and "extract accident report section" are executed after the text file input component "file input", and that the output of "file input" serves as the input of both regular expression components.
The user can also set specific parameters for each component, so that each general component can be customized according to the requirements of the task.
Referring to FIG. 3, FIG. 3 is a schematic diagram illustrating a component parameter setting interface of an orchestratable data extraction method according to an exemplary embodiment.
FIG. 3 shows the parameter setting interface for the regular expression node "extract natural disaster section" in the embodiment of fig. 2.
It can be seen that in the parameter setting interface of the regular expression node, a user can input a regular expression and set the field name of the result field obtained by matching the regular expression.
In addition, the parameter setting interface of the regular expression node can provide a library of regular expression pattern examples and an online regular expression testing tool to assist in writing the regular expression.
Based on the parameters set by the user, the regular expression node executes the regular expression on its input data and outputs the match result as the designated field.
For example, as shown in FIG. 3, the regular expression parameter of the regular expression node is set to:
(
The regular expression matches all characters after "natural disaster" and before "three, information posting".
The field name of the match result is set to "ZirRanZaiHai" and its type to "String", which means that the result matched by the regular expression becomes the content of the field ZirRanZaiHai and that the data type of the field is string.
Through the regular expression, the regular expression node extracts the information related to natural disasters from the input unstructured text and outputs it in the form of structured data.
As shown in fig. 2, the content of the field ZirRanZaiHai is then used as the input to the next component, the regular expression component "extract sub-chapter_flood".
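The behavior of a regular expression node can be sketched in Python as follows. This is only a sketch under assumptions: the function `regex_component`, the English pattern, and the sample text are hypothetical stand-ins, and the pattern is not the expression shown in FIG. 3. Only the field name ZirRanZaiHai is taken from the description. The sketch captures everything between two marker headings and returns it under the configured field name.

```python
import re


def regex_component(text, pattern, field_name):
    """Apply `pattern` to `text` and return the first capture group as a field.

    Returns {field_name: None} when the pattern does not match.
    """
    match = re.search(pattern, text, flags=re.DOTALL)
    return {field_name: match.group(1).strip() if match else None}


# Hypothetical pattern: capture everything between the two section headings.
PATTERN = r"Natural disasters(.*?)3\. Information release"
sample = (
    "2. Natural disasters\n"
    "Heavy rain caused flooding.\n"
    "3. Information release\n"
    "..."
)
row = regex_component(sample, PATTERN, "ZirRanZaiHai")
```

The resulting `row` is a one-field record that a downstream component, such as a further regular expression node, could take as its input.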
The present description does not specifically limit the types and number of parameter options for the components, and different components may include different parameter settings as desired.
For example, the settable parameters of the column-split-to-rows component may include the node name, the field to split, the separator type of the field to split, the separator, the output field after splitting, and so on.
As another example, the settable parameters of the add-constant component may include the node name, the constant type, the field value, and so on.
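To make these parameters concrete, the following Python sketch shows one possible reading of the two conversion components. The function names, parameter names, and sample data are assumptions made for illustration; the description does not specify how the components are implemented.

```python
def split_column_to_rows(rows, field, separator, output_field):
    """Emit one output row per separated piece of `field`.

    `field` is the field to split, `separator` the separator, and
    `output_field` the name of the field that holds each piece.
    """
    result = []
    for row in rows:
        for piece in row[field].split(separator):
            piece = piece.strip()
            if piece:
                new_row = {k: v for k, v in row.items() if k != field}
                new_row[output_field] = piece
                result.append(new_row)
    return result


def add_constant(rows, field_name, value):
    """Tag every row with a constant field so mixed outputs stay identifiable."""
    return [{**row, field_name: value} for row in rows]


# Hypothetical upstream output: one row whose field lists several events.
rows = [{"doc": "report-1", "events": "flood; landslide"}]
split = split_column_to_rows(rows, "events", ";", "event")
tagged = add_constant(split, "category", "natural_disaster")
```

After these two steps each event sits in its own row, and every row carries the same `category` tag.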
When the user has finished setting the parameters and connections of each component required by the extraction task, the extraction task is obtained from the parameters and connection relationships of the components.
The extraction task can then process specific unstructured data to obtain structured data.
In one exemplary embodiment shown in this description, components of the same type that extract structured data for different topics may be orchestrated into the same task.
Each of these same-type components is given the parameter settings corresponding to its topic, so that structured information on the different topics can be extracted.
The task may include at least an input component, extraction components, and output components.
The input component can be used for acquiring the unstructured data from which structured data is to be extracted and passing it to the extraction components;
the components of the same type described above may be extraction components, including at least a first extraction component and a second extraction component. The different topics may include at least a first topic and a second topic.
The first extraction component is used for extracting first data related to the first topic from the unstructured data passed in by the input component;
the second extraction component is used for extracting second data related to the second topic from the unstructured data passed in by the input component.
An output component can be used to output its input data to a structured destination table based on a preset field mapping rule.
Thus, the output of the first extraction component, i.e. the first data, may be input into the first output component, so that the first output component outputs the first data to the structured first destination table based on the preset field mapping rule.
Similarly, the output of the second extraction component, i.e. the second data, may be input into the second output component, so that the second output component outputs the second data to the structured second destination table based on the preset field mapping rule.
Through this embodiment, structured data on different topics is extracted from the input unstructured data and output to different destination tables.
Taking the extraction task in fig. 2 as an example, the regular expression node "extract natural disaster section" and the regular expression node "extract accident report section" in fig. 2 are two components of the same type that extract structured data on different topics; based on different parameter settings, the two components extract the natural disaster section and the accident report section, respectively, from the unstructured data input by the text file input node "file input".
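The two-topic arrangement can be sketched end to end in Python, using an in-memory SQLite database as a stand-in for the destination tables. Everything named here is an assumption for illustration: the helper functions, the bracketed markers in the sample text, the table names, and the field mappings are hypothetical and not taken from the application.

```python
import re
import sqlite3


def extract(text, pattern, field):
    """Extraction component: return {field: captured text} or None if no match."""
    m = re.search(pattern, text, flags=re.DOTALL)
    return {field: m.group(1).strip()} if m else None


def table_output(conn, table, row, field_mapping):
    """Output component: write `row` to `table`, renaming fields per the mapping.

    `table` and the mapped column names are trusted constants in this sketch.
    """
    columns = [field_mapping[k] for k in row]
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        list(row.values()),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disaster (content TEXT)")
conn.execute("CREATE TABLE accident (content TEXT)")

# Hypothetical input with one passage per topic.
text = "[disaster] river flooding [/disaster] [accident] warehouse fire [/accident]"

# Same component type, different parameters per topic.
first = extract(text, r"\[disaster\](.*?)\[/disaster\]", "disaster_text")
second = extract(text, r"\[accident\](.*?)\[/accident\]", "accident_text")

# Each topic goes to its own destination table via its own field mapping.
table_output(conn, "disaster", first, {"disaster_text": "content"})
table_output(conn, "accident", second, {"accident_text": "content"})

disaster_rows = conn.execute("SELECT content FROM disaster").fetchall()
accident_rows = conn.execute("SELECT content FROM accident").fetchall()
```

The same `extract` function serves both topics; only the pattern and field name differ, which mirrors giving each same-type component its own parameter settings.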
In practical applications, structured data may need to be extracted from large amounts of unstructured data. The structured data extraction method above covers the extraction itself, but when many tasks with large data volumes must be run, a single machine cannot meet the demand; resource allocation, management and scheduling support, and distributed capability are needed to provide sufficient task throughput and speed.
In one exemplary embodiment shown in this description, multiple instances may be generated based on the extraction task and submitted to a cluster of a resource scheduling platform for resource allocation and scheduling execution, so as to extract structured data from multiple pieces of unstructured data in batches.
Each instance may execute the whole extraction task by itself, or may execute only the functions of some of the components in the extraction task.
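As a local stand-in for running several instances of one extraction task, the sketch below uses a thread pool from the Python standard library. It is an analogy only: in the described scheme the instances are submitted to a resource scheduling platform, whereas here each worker thread plays the role of one instance. The function names and the "key: value" document format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def extraction_task(document):
    """Stand-in extraction task: turn 'key: value' lines into one structured row."""
    row = {}
    for line in document.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            row[key.strip()] = value.strip()
    return row


def run_batch(documents, max_instances=4):
    """Run the same task over many documents, one worker per concurrent instance."""
    # A real deployment would submit instances to a cluster instead of local threads.
    with ThreadPoolExecutor(max_workers=max_instances) as pool:
        return list(pool.map(extraction_task, documents))


docs = ["type: flood\nregion: north", "type: fire\nregion: south"]
results = run_batch(docs)
```

`pool.map` returns results in the order of the input documents, so each structured row can be matched back to its source.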
The instances may be submitted to a cluster of a resource scheduling platform, which performs resource allocation and scheduling execution. Parameters can be configured at submission time, including the number of allocated CPU cores, the memory size, the startup wait time, the maximum running time, and other custom parameters; these parameters control the running resources of an instance.
In one illustrative embodiment, each instance may execute the functions of some of the components in the extraction task. The resource scheduling platform can schedule resources in real time, controlling the number of instances executing each function and the resources allocated to them.
Specifically, the running time of each instance can be monitored, and, based on the historical average running time of a single instance of each component in the extraction task, the number of instances and the resources are allocated to each component in real time, so that the difference between the times required by the functions of the components in the extraction task does not exceed a preset threshold.
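One way to read this balancing rule is sketched below in Python. The greedy strategy, the function names, and the sample runtimes are assumptions chosen for illustration; the description does not state which allocation algorithm is used. The sketch gives each spare instance to the component whose per-instance load is currently highest, then measures the remaining gap between the slowest and fastest component.

```python
def allocate_instances(avg_runtime, total_instances):
    """Split instances across components roughly in proportion to average runtime.

    avg_runtime: dict of component name -> historical average runtime of a
    single instance (hypothetical units).
    """
    if total_instances < len(avg_runtime):
        raise ValueError("need at least one instance per component")
    # Start with one instance each, then hand out the spare instances greedily.
    allocation = {name: 1 for name in avg_runtime}
    for _ in range(total_instances - len(avg_runtime)):
        slowest = max(allocation, key=lambda n: avg_runtime[n] / allocation[n])
        allocation[slowest] += 1
    return allocation


def stage_time_gap(avg_runtime, allocation):
    """Difference between the slowest and fastest component after allocation."""
    times = [avg_runtime[n] / allocation[n] for n in allocation]
    return max(times) - min(times)


# Hypothetical historical averages for three components.
avg_runtime = {"input": 1.0, "extract": 6.0, "output": 2.0}
allocation = allocate_instances(avg_runtime, 9)
gap = stage_time_gap(avg_runtime, allocation)
```

A scheduler following this idea could recompute the allocation whenever the monitored averages change and compare `gap` against the preset threshold.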
The resource scheduling platform may include a YARN-based resource scheduling platform, a Kubernetes-based resource scheduling platform, and the like, which is not specifically limited in this specification.
The following describes a procedure of resource allocation and scheduling execution, taking a YARN-based resource scheduling platform as an example. YARN (Yet Another Resource Negotiator) is a key component in the Apache Hadoop ecosystem for resource management and job scheduling.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating resource scheduling in an orchestratable data extraction method according to an exemplary embodiment.
Each structured data extraction task is submitted to YARN as a job (or Application) for resource allocation and scheduling execution. The detailed steps are as follows:
1. The business system submits a task job through the gateway, and a Client is created.
2. The Client submits an Application, i.e. a structured extraction task job, to YARN.
3. The ResourceManager of YARN communicates with a NodeManager to allocate the first container for this Application, and the ApplicationMaster corresponding to the Application runs in this container. The ApplicationMaster is responsible for negotiating resources with the ResourceManager, distributing tasks, monitoring task execution progress, handling task failures, and so on.
4. After the ApplicationMaster starts, it splits the job into tasks, which can run in one or more containers. It then applies to the ResourceManager for containers in which to run the program and periodically sends heartbeats to the ResourceManager.
5. After the containers have been granted, the ApplicationMaster communicates with the NodeManager corresponding to each container and distributes the job to the containers on that NodeManager to run.
6. The tasks running in the containers send heartbeats to the ApplicationMaster to report their status. When the program has finished running, the ApplicationMaster deregisters from the ResourceManager and releases the container resources.
Referring to fig. 5, fig. 5 is a hardware structure diagram of an electronic device in which an orchestratable data extraction device is located, shown in an exemplary embodiment. At the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile storage 510, and may of course also include other hardware required by the service. One or more embodiments of the present description may be implemented in software, for example by the processor 502 reading the corresponding computer program from the non-volatile storage 510 into the memory 508 and then running it. Of course, in addition to a software implementation, one or more embodiments of the present disclosure do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units and may also be hardware or a logic device.
Referring to fig. 6, fig. 6 is a block diagram illustrating an exemplary embodiment of an orchestratable data extraction device.
The orchestratable data extraction device may include:
an output unit 610, configured to output an extraction task orchestration interface to a user in response to a data extraction task orchestration request; the orchestration interface is used for orchestrating an extraction task composed of components; the orchestration interface includes a component toolset including at least one component; the at least one component includes at least an extraction component for extracting structured data from unstructured data;
an orchestration unit 620, configured to, in response to the user selecting at least one component in the task orchestration interface, orchestrate the at least one component based on the connection relationships among the at least one component to obtain an extraction task;
and a running unit 630, configured to run the extraction task to process unstructured data into structured data.
In one embodiment, the output unit 610 is specifically configured to:
in response to the user selecting any component in the task orchestration interface, outputting a configuration page for that component to the user, for the user to set the parameters of that component;
in response to a connection operation by the user on any two components, placing, according to the direction of the connecting line, the second component at the tail end of the line after the first component at the head end of the line in the execution order.
In one embodiment, the components further include at least one of:
an input component, used for receiving unstructured data from which structured data is to be extracted;
an output component, used for outputting the data obtained through structured extraction to a preset destination table;
a conversion component, used for performing preset intermediate processing on the data;
a connection component, used for combining a plurality of inputs;
the extraction component includes at least one of:
a model extraction component, used for extracting structured data from text based on a preset machine learning model;
a regular expression component, used for extracting structured data from text based on a preset regular expression.
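As an illustration of the regular expression component, the following sketch turns free text into structured records using named capture groups. The class name `RegexExtractionComponent`, the pattern, and the sample text are hypothetical; the application does not specify how the component is implemented.

```python
import re


class RegexExtractionComponent:
    """Hypothetical extraction component configured with a preset pattern."""

    def __init__(self, pattern):
        # Named groups in the pattern become the field names of each record.
        self.pattern = re.compile(pattern)

    def extract(self, text):
        """Return one dict per match, keyed by the pattern's group names."""
        return [match.groupdict() for match in self.pattern.finditer(text)]


component = RegexExtractionComponent(
    r"order (?P<order_id>\d+) shipped on (?P<date>\d{4}-\d{2}-\d{2})"
)
rows = component.extract(
    "order 1001 shipped on 2023-12-29; order 1002 shipped on 2024-01-03"
)
```

Each element of `rows` is a flat record that an output component could then map to columns of a destination table.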
In one embodiment, components of the same type that extract structured data for different topics are orchestrated into the same task, and each of these components is given the parameter settings corresponding to its topic;
the extracting of unstructured data to obtain structured data includes:
obtaining, through an input component, unstructured data from which structured data is to be extracted;
extracting, by a first extraction component, first data related to a first topic from the unstructured data;
extracting, by a second extraction component, second data related to a second topic from the unstructured data;
outputting, through a first output component, the first data to a structured first destination table based on a preset field mapping rule;
outputting, through a second output component, the second data to a structured second destination table based on a preset field mapping rule.
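The multi-topic flow above can be sketched as a single task in which two extraction components of the same type, each configured for its own topic, feed two output components. Everything here is an assumption made for illustration: the helper names, the regular expressions, the field mapping format (extracted field name to destination column name), and the use of in-memory lists as stand-ins for destination tables.

```python
import re


def make_extractor(pattern):
    """Build a hypothetical extraction component for one topic."""
    compiled = re.compile(pattern)
    return lambda text: [m.groupdict() for m in compiled.finditer(text)]


def make_output(table, field_mapping):
    """Build a hypothetical output component that writes to `table`.

    `field_mapping` maps extracted field names to destination column names.
    """
    def write(records):
        for record in records:
            table.append(
                {column: record[field] for field, column in field_mapping.items()}
            )
    return write


# Input component: a single piece of unstructured text.
text = "Contact: alice@example.com. Invoice total: 120.50 USD."

# In-memory stand-ins for the first and second destination tables.
contact_table, invoice_table = [], []

# Two extraction components of the same type, one per topic.
extract_contacts = make_extractor(r"(?P<email>\w+@\w+\.\w+)")
extract_invoices = make_extractor(
    r"total: (?P<amount>\d+\.\d+) (?P<currency>[A-Z]{3})"
)

# Two output components, each with its own field mapping rule.
write_contacts = make_output(contact_table, {"email": "contact_email"})
write_invoices = make_output(
    invoice_table, {"amount": "total_amount", "currency": "currency_code"}
)

write_contacts(extract_contacts(text))
write_invoices(extract_invoices(text))
```

Keeping both topics in one task means the input is read once and each topic only differs in its parameters, which is the point the embodiment makes.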
In one embodiment, the running unit 630 is specifically configured to:
generate a plurality of instances based on the extraction task, and submit the instances to a cluster of a resource scheduling platform for resource allocation and scheduled execution, so as to extract structured data from a plurality of pieces of unstructured data in batches.
In one embodiment, the manner of generating the plurality of instances based on the extraction task includes at least one of:
generating a plurality of instances for implementing the extraction task, wherein each instance performs data extraction on at least one piece of unstructured data;
generating a plurality of instances, wherein each instance is used for implementing at least one component in the extraction task.
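The two ways of generating instances can be contrasted with a small sketch. The function names and the dictionary layout of an instance are hypothetical, and the round-robin split of documents is only one possible way to give each instance at least one piece of unstructured data.

```python
def instances_per_data_share(documents, instance_count):
    """Mode 1: every instance runs the whole task on its own share of the data."""
    # Never create more instances than documents, so no instance is empty.
    count = min(instance_count, len(documents))
    return [
        {"runs": "whole task", "documents": documents[i::count]}
        for i in range(count)
    ]


def instances_per_component(components):
    """Mode 2: every instance runs one component of the task."""
    return [{"runs": name} for name in components]


by_data = instances_per_data_share(["a.txt", "b.txt", "c.txt"], 2)
by_component = instances_per_component(["input", "extract", "output"])
```

In the first mode the instances are independent of each other; in the second they form a pipeline, which is why the relative speed of the components matters.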
In one embodiment, the generating of a plurality of instances, wherein each instance is used for implementing at least one component in the extraction task, includes:
acquiring the historical average running duration of a single instance corresponding to the at least one component;
allocating, according to the historical average running duration of a single instance corresponding to the at least one component, a number of instances to each component, so that the difference between the durations required to run the components in the extraction task does not exceed a preset threshold.
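One possible reading of this allocation step is sketched here as a greedy loop. It rests on assumptions that the application does not state: that a component running on n instances takes its historical single-instance average divided by n, that instances are added one at a time to the currently slowest component, and that a hypothetical `max_total` budget bounds the search. The function and parameter names are illustrative only.

```python
def allocate_instances(avg_seconds, threshold, max_total):
    """Assign an instance count to each component to balance stage durations.

    avg_seconds: historical average running duration of a single instance,
                 per component (assumed input format).
    threshold:   largest acceptable gap between the slowest and fastest stage.
    max_total:   hypothetical cap on the total number of instances.
    """
    counts = {name: 1 for name in avg_seconds}

    while sum(counts.values()) < max_total:
        # Assumed model: duration scales inversely with the instance count.
        durations = {name: avg_seconds[name] / counts[name] for name in counts}
        if max(durations.values()) - min(durations.values()) <= threshold:
            break
        slowest = max(durations, key=durations.get)
        counts[slowest] += 1

    return counts


counts = allocate_instances(
    {"input": 10, "extract": 60, "output": 20}, threshold=5, max_total=12
)
```

The greedy loop does not guarantee that the threshold is reached for every input; when the budget runs out it simply returns the most balanced allocation found so far.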
In one embodiment, the resource scheduling platform comprises:
a YARN-based resource scheduling platform, or a Kubernetes-based resource scheduling platform.
For the implementation of the functions and roles of each unit in the above device, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
Since the device embodiments essentially correspond to the method embodiments, refer to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the present specification. Those of ordinary skill in the art can understand and implement them without creative effort.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The foregoing describes only preferred embodiments of one or more embodiments of the present specification and is not intended to limit the one or more embodiments of the present specification to the particular embodiments described.

Claims (10)

1. A method of programmable data extraction, the method comprising:
in response to a data extraction task orchestration request, outputting an extraction task orchestration interface to a user; the orchestration interface is used for orchestrating extraction tasks composed of components; the orchestration interface includes a component toolset including at least one component; the at least one component includes at least an extraction component for extracting structured data from unstructured data;
in response to at least one component selected by the user in the task orchestration interface, orchestrating an extraction task based on the at least one component and the connection relationships among the at least one component;
running the extraction task, and processing unstructured data to obtain structured data.
2. The method of claim 1, wherein
the orchestrating of an extraction task, in response to at least one component selected by the user in the task orchestration interface, based on the at least one component and the connection relationships among the at least one component comprises:
in response to the user selecting any component in the task orchestration interface, outputting a configuration page for that component to the user, so that the user can set the parameters of that component;
in response to a connection operation by the user on any two components, arranging, according to the direction of the connection, the second component at the tail end of the connection to execute after the first component at the head end of the connection.
3. The method of claim 1, wherein
the components further include at least one of:
an input component, used for receiving unstructured data from which structured data is to be extracted;
an output component, used for outputting the data obtained through structured extraction to a preset destination table;
a conversion component, used for performing preset intermediate processing on the data;
a connection component, used for combining a plurality of inputs;
the extraction component includes at least one of:
a model extraction component, used for extracting structured data from text based on a preset machine learning model;
a regular expression component, used for extracting structured data from text based on a preset regular expression.
4. The method of claim 3, wherein
components of the same type that extract structured data for different topics are orchestrated into the same task, and each of these components is given the parameter settings corresponding to its topic;
the extracting of unstructured data to obtain structured data comprises:
obtaining, through an input component, unstructured data from which structured data is to be extracted;
extracting, by a first extraction component, first data related to a first topic from the unstructured data;
extracting, by a second extraction component, second data related to a second topic from the unstructured data;
outputting, through a first output component, the first data to a structured first destination table based on a preset field mapping rule;
outputting, through a second output component, the second data to a structured second destination table based on a preset field mapping rule.
5. The method of claim 1, wherein
the running of the extraction task and processing of unstructured data to obtain structured data comprises:
generating a plurality of instances based on the extraction task, and submitting the instances to a cluster of a resource scheduling platform for resource allocation and scheduled execution, so as to extract structured data from a plurality of pieces of unstructured data in batches.
6. The method of claim 5, wherein the manner in which the plurality of instances are generated based on the extraction task comprises at least one of:
generating a plurality of instances for implementing the extraction task, wherein each instance performs data extraction on at least one piece of unstructured data;
generating a plurality of instances, wherein each instance is used for implementing at least one component in the extraction task.
7. The method of claim 6, wherein
the generating of a plurality of instances, wherein each instance is used for implementing at least one component in the extraction task, comprises:
acquiring the historical average running duration of a single instance corresponding to the at least one component;
allocating, according to the historical average running duration of a single instance corresponding to the at least one component, a number of instances to each component, so that the difference between the durations required to run the components in the extraction task does not exceed a preset threshold.
8. A structured data extraction apparatus, the apparatus comprising:
an output unit, configured to output an extraction task orchestration interface to a user in response to a data extraction task orchestration request; the orchestration interface is used for orchestrating extraction tasks composed of components; the orchestration interface includes a component toolset including at least one component; the at least one component includes at least an extraction component for extracting structured data from unstructured data;
an orchestration unit, configured to, in response to at least one component selected by the user in the task orchestration interface, orchestrate an extraction task based on the at least one component and the connection relationships among the at least one component;
a running unit, configured to run the extraction task and process unstructured data to obtain structured data.
9. A storage medium having stored thereon a computer program which, when executed, implements the steps of the method according to any of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-7 when executing the program.
CN202311863590.6A 2023-12-29 2023-12-29 Programmable data extraction method, device, equipment and medium Pending CN117828006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311863590.6A CN117828006A (en) 2023-12-29 2023-12-29 Programmable data extraction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311863590.6A CN117828006A (en) 2023-12-29 2023-12-29 Programmable data extraction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117828006A true CN117828006A (en) 2024-04-05

Family

ID=90503835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311863590.6A Pending CN117828006A (en) 2023-12-29 2023-12-29 Programmable data extraction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117828006A (en)

Similar Documents

Publication Publication Date Title
EP3353672B1 (en) Method and apparatus for transferring data between databases
US10409558B2 (en) Workflow development system with ease-of-use features
CN107292186B (en) Model training method and device based on random forest
US10423445B2 (en) Composing and executing workflows made up of functional pluggable building blocks
JP7488006B2 (en) Method, system, and program for identifying tabular data using machine learning
CN108984652B (en) Configurable data cleaning system and method
CN110046303B (en) Information recommendation method and device based on demand matching platform
US20160274874A1 (en) Method and apparatus for processing request
CN111949856A (en) Object storage query method and device based on web
CN112860777A (en) Data processing method, device and equipment
CN115392501A (en) Data acquisition method and device, electronic equipment and storage medium
CN103678360A (en) Data storing method and device for distributed file system
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN108241620B (en) Query script generation method and device
CN116932147A (en) Streaming job processing method and device, electronic equipment and medium
CN117828006A (en) Programmable data extraction method, device, equipment and medium
CN112749308A (en) Data labeling method and device and electronic equipment
CN110019357B (en) Database query script generation method and device
KR20200103133A (en) Method and apparatus for performing extract-transfrom-load procedures in a hadoop-based big data processing system
CN112507725B (en) Static publishing method, device, equipment and storage medium of financial information
CN113905037A (en) File transmission management method, device, equipment and storage medium
CN109857838B (en) Method and apparatus for generating information
US20200034767A1 (en) System and method for visualizing an order allocation process
CN109492195A (en) A kind of font loading method, device, terminal and storage medium
US10606939B2 (en) Applying matching data transformation information based on a user's editing of data within a document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination