CN113204706B - Data screening and extracting method and system based on MapReduce - Google Patents
- Publication number
- CN113204706B (application CN202110563545.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a data screening and extracting method and system based on MapReduce. The method comprises the following steps: inputting screening extraction information; judging whether the screening extraction information is legal; under a MapReduce framework, initializing and parsing the screening extraction information, reading each row of data records in the data to be screened, and entering a map processing stage; separating the data records according to the input separator; judging whether the data to be screened contains screening condition information, and if so, screening the data and acquiring the corresponding data fields, and if not, performing no screening; and judging whether the data to be screened contains extraction field information, and if so, extracting the corresponding data field and outputting the extracted data field to an output path. Through a unified input format interface, the invention efficiently meets users' data screening requirements across complex and diverse screening needs, so that developers no longer have to frequently write different screening extraction programs, which improves the efficiency of data processing tasks and saves development cost.
Description
Technical Field
The application relates to the technical field of data screening, in particular to a data screening and extracting method and system based on MapReduce.
Background
With the rapid development of the Internet and the rapid growth in Internet users, information is increasing explosively, and enterprises facing massive information data encounter serious difficulties in data storage, cleaning, analysis, and mining. At the same time, wide-ranging data sources, varied data types, and complex processing environments all make big data processing very challenging. Hadoop is a distributed system infrastructure. One of its core designs, the Hadoop Distributed File System (HDFS), offers high fault tolerance and high throughput and provides storage for massive data; its other core design, MapReduce, provides computation for massive data.
As business scale keeps expanding, a common business scenario is to screen and extract, from a given batch of massive stored text files, the data that meets actual business requirements according to specified filtering conditions and extraction information, for statistics and analysis. Users who do not understand HDFS operations and MapReduce computation can only describe their detailed requirements to dedicated developers, who then implement the operations. Those developers, facing a growing number and variety of data processing tasks, must frequently write different screening extraction programs and then compile, package, and run them. As a result, users' data processing requirements cannot be met quickly, the efficiency of the growing volume of data processing tasks cannot be guaranteed, and considerable development cost is wasted.
The Chinese patent document with grant number CN103150400B discloses a data screening and extracting method based on MapReduce, which comprises the following steps. First, the screening requirements are input: the data input and output paths, the screening field sequence numbers, the screening keywords, and the upper and lower screening limits. Then, data screening is performed: the total number of fields in the screening requirements is counted, the screening fields are traversed in a loop, and range screening or keyword screening is performed on the data to be screened according to the extraction range or keywords in the screening requirements. Finally, the screened data is output to an output path.
Because the scheme performs data screening by screening field sequence numbers and keywords based on a MapReduce framework, the following defects still exist:
1. the data types with complex and various structures cannot be processed;
2. complex screening involving multiple screening conditions cannot be satisfied, such as numerical and date range screening, array operation screening, and combined-condition screening;
3. the extraction function of the data information field cannot be realized.
At present, no effective solution has been proposed for the problem that data types with complex and diverse structures cannot be screened.
Disclosure of Invention
The embodiment of the application provides a data screening and extracting method and system based on MapReduce, which can efficiently realize user data screening requirements through a unified input format so as to at least solve the problem that the screening of data types with complex and various structures cannot be performed in the related technology.
In a first aspect, an embodiment of the present application provides a data screening and extracting method based on MapReduce, including the following steps:
screening extraction information input step, inputting screening extraction information, wherein the screening extraction information comprises: the data input path, screening requirement information and data output path, wherein the screening requirement information comprises input separator, screening condition information, extraction field information and output separator;
a validity judging step of judging whether the screening requirement information is empty; if it is not empty, judging whether the screening requirement information is in JSON format; if so, judging whether the screening requirement information conforms to a preset specification; if so, judging whether the input path exists; if the input path exists, judging whether the output path exists; and if the output path does not exist, determining that the screening extraction information is legal;
a data screening step, under the MapReduce framework, initializing and analyzing screening extraction information, reading each row of data record in the data to be screened, entering a map processing stage, separating the data records according to an input separator, judging whether the data to be screened contains screening condition information, if so, carrying out data screening and obtaining corresponding data fields, and if not, not screening;
and a data extraction step of judging whether the data to be screened contains extraction field information, and if so, extracting the corresponding data field and outputting the extracted data to an output path.
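As an illustration of the unified input format, the screening requirement information could be expressed as a JSON document like the one below. This is only a sketch: the patent names the four parts (input separator, screening condition information, extraction field information, output separator) but does not give concrete field names, so every key used here (`input_separator`, `conditions`, `relation`, `items`, `extract_fields`, and so on) is a hypothetical choice, as is the use of column indexes as keys.

```python
import json

# Hypothetical screening requirement information in JSON format.
# All key names below are illustrative assumptions, not taken from the patent.
REQUIREMENT_JSON = """
{
  "input_separator": "|",
  "conditions": {
    "relation": "and",
    "items": [
      {"key": 1, "type": "number", "action": "gt", "value": 18},
      {"key": 2, "type": "string", "action": "eq", "value": "NY"}
    ]
  },
  "extract_fields": [
    {"data_key": 0, "data_type": "string"},
    {"data_key": 3, "data_type": "json", "field_key": "city"}
  ],
  "output_separator": ","
}
"""

# Parsing succeeds only if the text is well-formed JSON.
requirement = json.loads(REQUIREMENT_JSON)
```

Keeping all four parts in one document is what lets a single generic job serve different screening tasks without recompiling a program.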
In some of these embodiments, the data screening step specifically includes:
a screening condition relation acquisition step of initializing a screening-condition pass flag to true and acquiring the logical relation among the screening condition items in the screening condition information;
a screening loop traversal step of judging whether every screening condition item has been traversed; if so, judging whether the screening-condition pass flag is true, retaining the data record if it is true and ignoring the data record if it is not; and if the traversal is not complete, performing the next step;
a screening judging step of judging whether a row of data records meets the current screening condition item;
if the data record meets the screening condition item, judging whether the logical relation is logical AND; if so, returning to the screening loop traversal step, and if not, retaining the data record;
if the data record does not meet the screening condition item, judging whether the logical relation is logical AND; if so, ignoring the data record, and if not, setting the screening-condition pass flag to false and returning to the screening loop traversal step.
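The traversal above can be sketched in a few lines. The sketch assumes one reading of the translated text: that "logical sum" means logical AND, that the only other relation is logical OR, and that the last branch applies when a record fails a condition item. The function name `record_passes` and the representation of condition items as callables are hypothetical.

```python
from typing import Callable, Dict, Iterable

Record = Dict[str, object]
ConditionItem = Callable[[Record], bool]  # hypothetical: one callable per item


def record_passes(record: Record,
                  items: Iterable[ConditionItem],
                  relation: str = "and") -> bool:
    """Return True if the record should be retained, False if ignored."""
    passed = True  # pass flag, initialized to true
    for item in items:
        if item(record):
            if relation == "and":
                continue      # AND: go back and check the next item
            return True       # OR: one match is enough, retain the record
        if relation == "and":
            return False      # AND: one miss is enough, ignore the record
        passed = False        # OR: remember the miss, check the next item
    # Traversal complete: retain only if the pass flag is still true.
    return passed
```

Under this reading the flag only matters once the loop finishes: with AND it is still true because every item matched, and with OR it is false because every item missed.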
In some embodiments, the screening condition item includes a screening data Key, a screening data type, a filtering action, and a filtering condition value, and the specific steps of determining whether the data record meets the screening condition item are as follows:
a data content acquisition step, namely acquiring corresponding data content in the data record according to the screening data Key;
a type conversion step, namely performing data parsing and type conversion on the data content according to the screening data type;
and a screening judgment step, namely screening and judging the data content with the corresponding screening mode and logical relation, according to the filtering action and the filtering condition value.
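A minimal sketch of judging one screening condition item follows. It assumes, for illustration only, that the screening data Key is a column index into the separated record, that the item is a dict with the hypothetical keys `key`, `type`, `action`, and `value`, and that only three filtering actions exist; the patent does not fix any of these details.

```python
from typing import Dict, List

# Hypothetical type names mapped to Python converters.
CONVERTERS = {"int": int, "float": float, "string": str}


def matches_item(fields: List[str], item: Dict[str, object]) -> bool:
    """Judge whether one separated record meets one screening condition item."""
    raw = fields[item["key"]]              # data content acquisition by Key
    convert = CONVERTERS[item["type"]]     # type conversion by data type
    value = convert(raw)
    target = convert(item["value"])
    action = item["action"]                # filtering action selects the test
    if action == "eq":
        return value == target
    if action == "gt":
        return value > target
    if action == "lt":
        return value < target
    raise ValueError(f"unsupported filtering action: {action}")
```

Converting both the record content and the condition value with the same converter keeps the comparison type-safe, for example comparing `30` with `18` as integers rather than as strings.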
In some of these embodiments, the extraction field information includes an extraction data Key, an extraction data type, and an extraction field Key, the data extraction step further comprising:
a corresponding data acquisition step, namely acquiring a corresponding data value from the screened data field according to the extracted data Key;
a data conversion step, namely carrying out corresponding data analysis conversion on the data value according to the extracted data type and obtaining converted data;
a data acquisition step, namely acquiring corresponding extraction data from the conversion data according to the extraction field Key;
and a data output step, namely splicing the extracted data according to the output separator until each extracted information item in the extracted field information is circularly traversed, and outputting the spliced extracted data to an output path.
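The extraction steps can be sketched as below. The sketch assumes that the extraction data Key is a column index, that the extraction data type is either a plain string or JSON, and that the extraction field Key names a key inside the parsed JSON; the dict keys `data_key`, `data_type`, and `field_key` are hypothetical names.

```python
import json
from typing import Dict, List


def extract_fields(fields: List[str],
                   extract_items: List[Dict[str, object]],
                   out_sep: str) -> str:
    """Extract the requested values and splice them with the output separator."""
    parts = []
    for item in extract_items:                     # traverse each extraction item
        raw = fields[item["data_key"]]             # get the value by extraction data Key
        if item["data_type"] == "json":
            converted = json.loads(raw)            # parse according to the data type
            value = converted[item["field_key"]]   # pick the extraction field Key
        else:
            value = raw                            # plain column: use the value directly
        parts.append(str(value))
    return out_sep.join(parts)                     # splice with the output separator
```

For example, a record whose second column is a JSON string can contribute a single inner field to the output line while the first column is copied as is.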
In a second aspect, an embodiment of the present application provides a data filtering and extracting system based on MapReduce, where the data filtering and extracting method based on MapReduce of the first aspect is applied, including:
screening extraction information input module, input screening extraction information, screening extraction information includes: the data input path, screening requirement information and data output path, wherein the screening requirement information comprises input separator, screening condition information, extraction field information and output separator;
the validity judging module judges whether the screening requirement information is empty; if it is not empty, judges whether the screening requirement information is in JSON format; if so, judges whether the screening requirement information conforms to a preset specification; if so, judges whether the input path exists; if the input path exists, judges whether the output path exists; and if the output path does not exist, determines that the screening extraction information is legal;
the data screening module is used for initializing and analyzing screening extraction information under a MapReduce framework, reading each row of data records in the data to be screened, entering a map processing stage, separating the data records according to an input separator, judging whether the data to be screened contains screening condition information, screening the data and obtaining corresponding data fields if the data to be screened contains the screening condition information, and not screening if the data to be screened does not contain the screening condition information;
and the data extraction module is used for judging whether the data to be screened contains extraction field information, and if so, extracting the corresponding data field and outputting the extracted data to the output path.
In some of these embodiments, the data screening module specifically includes:
the screening condition relation acquisition unit is used for initializing a screening-condition pass flag to true and acquiring the logical relation among the screening condition items in the screening condition information;
the screening loop traversal unit is used for judging whether every screening condition item has been traversed; if so, judging whether the screening-condition pass flag is true, retaining the data record if it is true and ignoring the data record if it is not; and if the traversal is not complete, performing the next step;
a screening judging unit for judging whether a row of data records meets the current screening condition item;
if the data record meets the screening condition item, judging whether the logical relation is logical AND; if so, returning to the screening loop traversal, and if not, retaining the data record;
if the data record does not meet the screening condition item, judging whether the logical relation is logical AND; if so, ignoring the data record, and if not, setting the screening-condition pass flag to false and returning to the screening loop traversal.
In some embodiments, the filtering condition item includes a filtering data Key, a filtering data type, a filtering action, and a filtering condition value, and the filtering judgment unit includes:
the data content acquisition subunit acquires corresponding data content in the data record according to the screening data Key;
the type conversion subunit is used for performing data parsing and type conversion on the data content according to the screening data type;
and the screening judging subunit is used for screening and judging the data content with the corresponding screening mode and logical relation, according to the filtering action and the filtering condition value.
In some of these embodiments, the extraction field information includes an extraction data Key, an extraction data type, and an extraction field Key, the data extraction module further comprising:
the corresponding data acquisition unit acquires corresponding data values from the data fields obtained through screening according to the extracted data Key;
the data conversion unit is used for carrying out corresponding data analysis conversion on the data value according to the extracted data type and obtaining conversion data;
the data acquisition unit acquires corresponding extraction data from the conversion data according to the extraction field Key;
and the data output unit splices the extracted data according to the output separator until each extracted information item in the extracted field information is circularly traversed, and outputs the spliced extracted data to the output path.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor, when executing the computer program, implements the MapReduce-based data screening and extracting method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor implements a MapReduce-based data screening and extraction method as described in the first aspect above.
The embodiment of the application provides a data screening and extracting method and system based on MapReduce, which can be applied to the technical field of data capacity and the technical field of data service.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a MapReduce-based data screening extraction method according to an embodiment of the present application;
FIG. 2 is a flow chart of data screening steps according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating specific steps for determining whether a data record meets a screening criteria according to an embodiment of the present application;
FIG. 4 is a flow chart of data extraction steps according to an embodiment of the present application;
FIG. 5 is a flowchart of a MapReduce-based data screening extraction method in a preferred embodiment of the present application;
FIG. 6 is a schematic flow chart of the data screening and extraction process according to the preferred embodiment of the present application;
FIG. 7 is a flow chart of a data screening process in a preferred embodiment of the present application;
FIG. 8 is a flow chart of data extraction in a preferred embodiment of the present application;
FIG. 9 is a block diagram of a MapReduce-based data screening extraction system in accordance with an embodiment of the present application;
FIG. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Description of the drawings:
a screening and extracting information input module 1; a validity judgment module 2;
a data screening module 3; a data extraction module 4;
a screening condition relationship acquisition unit 31; a screening cycle traversing unit 32;
a screening judgment unit 33; a data content acquisition subunit 331;
a type conversion subunit 332; a screening determination subunit 333;
a corresponding data acquisition unit 41; a data conversion unit 42;
a data acquisition unit 43; a data output unit 44;
a processor 81; a memory 82; a communication interface 83; a bus 80.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden on the person of ordinary skill in the art based on the embodiments provided herein, are intended to be within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art can apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as an insufficient disclosure of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein does not denote a limitation of quantity, but rather denotes the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "And/or" describes an association relationship between associated objects, meaning that there may be three relationships; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it. The terms "first," "second," "third," and the like, as used herein, merely distinguish between similar objects and do not represent a particular ordering of objects.
Therefore, the embodiment provides a data screening and extracting method based on MapReduce. Fig. 1 is a flowchart of a MapReduce-based data screening and extraction method according to an embodiment of the present application, as shown in fig. 1, the flowchart includes the following steps:
screening extraction information input step S1, in which screening extraction information is input, the screening extraction information including: the data input path, screening requirement information and data output path, wherein the screening requirement information comprises input separator, screening condition information, extraction field information and output separator;
a validity judging step S2, judging whether the screening requirement information is empty; if it is not empty, judging whether the screening requirement information is in JSON format; if so, judging whether the screening requirement information conforms to a preset specification; if so, judging whether the input path exists; if the input path exists, judging whether the output path exists; and if the output path does not exist, determining that the screening extraction information is legal;
a data screening step S3, under a MapReduce framework, initializing and analyzing screening extraction information, reading each row of data records in the data to be screened, entering a map processing stage, separating the data records according to the input separator, judging whether the data to be screened contains screening condition information, screening the data and obtaining corresponding data fields if the data to be screened contains the screening condition information, and not screening if the data to be screened does not contain the screening condition information;
and a data extraction step S4, judging whether the data to be screened contains extraction field information, and if so, extracting the corresponding data field and outputting the extracted data to an output path.
Through the above steps, a method is provided that reduces development cost and improves the efficiency of screening and extracting massive data. By setting a unified input format, users' data screening requirements are met efficiently, and data types with complex and diverse structures can be processed, such as numbers, character strings, Booleans, arrays, JSON-format data, and mixed data of character strings and JSON. In addition, across complex and diverse data screening requirements, developers no longer need to frequently write different screening and extracting programs, which improves the efficiency of data processing tasks and saves development cost.
It should be noted that the preset specification is whether the preset fields are present, namely the input separator, the screening condition information, the extraction field information, and the output separator.
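The validity judgment of step S2 can be sketched as a chain of early-exit checks. This is an illustrative sketch only: it checks paths on the local file system with `os.path.exists` as a stand-in for HDFS path checks, and the four required field names in `REQUIRED_FIELDS` are hypothetical, since the patent lists the fields but not their JSON keys.

```python
import json
import os

# Hypothetical JSON keys for the four preset fields.
REQUIRED_FIELDS = ("input_separator", "conditions",
                   "extract_fields", "output_separator")


def is_legal(input_path: str, requirement_text: str, output_path: str) -> bool:
    """Return True only if every validity check passes, in the order of S2."""
    if not requirement_text:                       # must not be empty
        return False
    try:
        requirement = json.loads(requirement_text)  # must be JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(requirement, dict):
        return False
    if any(name not in requirement for name in REQUIRED_FIELDS):
        return False                               # preset specification
    if not os.path.exists(input_path):             # input path must exist
        return False
    return not os.path.exists(output_path)         # output path must not exist
```

Requiring that the output path does not yet exist mirrors the usual behavior of Hadoop jobs, which refuse to overwrite an existing output directory.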
Fig. 2 is a flowchart of a data screening step according to an embodiment of the present application, as shown in fig. 2, in some embodiments, the data screening step S3 specifically includes:
a screening condition relation obtaining step S31, initializing a screening-condition pass flag to true and obtaining the logical relation among the screening condition items in the screening condition information;
a screening loop traversal step S32, judging whether every screening condition item has been traversed; if so, judging whether the screening-condition pass flag is true, retaining the data record if it is true and ignoring the data record if it is not; and if the traversal is not complete, performing the next step;
a screening judging step S33, judging whether a row of data records meets the current screening condition item;
if the data record meets the screening condition item, judging whether the logical relation is logical AND; if so, returning to the screening loop traversal step, and if not, retaining the data record;
if the data record does not meet the screening condition item, judging whether the logical relation is logical AND; if so, ignoring the data record, and if not, setting the screening-condition pass flag to false and returning to the screening loop traversal step.
Through the steps described above, complex screening involving multiple screening conditions can be satisfied, such as numerical and date range screening, array operation screening, and combined-condition screening.
It should be noted that the method can support different filtering logic operations for different data types, including equal to, greater than, less than, between, contains, does not contain, regular-expression matching, logical AND, and logical OR.
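One way to organize the per-item filtering operations listed above is a dispatch table keyed by the filtering action. The sketch below is an assumption about structure, not the patent's implementation: the action names are hypothetical, "between" is treated as inclusive on both ends, and logical AND and OR are left out because they combine items rather than test a single value.

```python
import re
from datetime import date

# Hypothetical action names mapped to (value, condition) -> bool tests.
FILTER_ACTIONS = {
    "eq": lambda v, c: v == c,
    "gt": lambda v, c: v > c,
    "lt": lambda v, c: v < c,
    "between": lambda v, c: c[0] <= v <= c[1],   # inclusive range (assumed)
    "contains": lambda v, c: c in v,             # works for strings and arrays
    "not_contains": lambda v, c: c not in v,
    "regex": lambda v, c: re.search(c, v) is not None,
}


def apply_action(action, value, condition):
    """Apply one filtering action to an already type-converted value."""
    return FILTER_ACTIONS[action](value, condition)


# Date range screening reuses "between" because date objects are comparable.
in_2021 = apply_action("between", date(2021, 5, 24),
                       (date(2021, 1, 1), date(2021, 12, 31)))
```

Because the table works on converted values, the same "between" entry covers both numeric and date ranges, and "contains" covers both substring and array membership tests.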
Fig. 3 is a flowchart illustrating specific steps for determining whether a data record meets a filtering condition item according to an embodiment of the present application, as shown in fig. 3, where in some embodiments, the filtering condition item includes a filtering data Key, a filtering data type, a filtering action, and a filtering condition value, and the specific steps for determining whether the data record meets the filtering condition item are as follows:
a data content obtaining step S331, namely obtaining corresponding data content in the data record according to the screening data Key;
a type conversion step S332, performing data parsing and type conversion on the data content according to the screening data type;
and a screening and judging step S333, screening and judging the data content with the corresponding screening mode and logical relation, according to the filtering action and the filtering condition value.
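The type conversion of step S332 is what lets one job handle numbers, Booleans, dates, arrays, and JSON in the same file. A minimal sketch, assuming hypothetical type names (`number`, `boolean`, `date`, `array`, `json`, `string`), ISO-formatted dates, and arrays serialized as JSON:

```python
import json
from datetime import date


def convert(raw: str, data_type: str):
    """Parse the raw column text according to the screening data type."""
    if data_type == "number":
        return float(raw)
    if data_type == "boolean":
        return raw.strip().lower() == "true"
    if data_type == "date":
        return date.fromisoformat(raw.strip())   # expects YYYY-MM-DD
    if data_type in ("array", "json"):
        return json.loads(raw)                   # arrays assumed JSON-encoded
    return raw                                   # "string" and anything else
```

After conversion, the value can be compared with the filtering condition value using ordinary operators, for example a date against a date range or an array against a membership test.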
Fig. 4 is a flowchart of a data extraction step according to an embodiment of the present application, as shown in fig. 4, in some embodiments, the extraction field information includes an extraction data Key, an extraction data type, and an extraction field Key, and the data extraction step S4 further includes:
a corresponding data acquisition step S41, namely acquiring a corresponding data value from the data field obtained by screening according to the extracted data Key;
a data conversion step S42, which is to perform corresponding data analysis conversion on the data value according to the extracted data type and obtain converted data;
A data acquisition step S43, namely acquiring corresponding extraction data from the conversion data according to the extraction field Key;
and a data output step S44, wherein the extracted data are spliced according to the output separator until each extracted information item in the extracted field information is circularly traversed, and the spliced extracted data are output to an output path.
The following is a data screening and extracting method based on MapReduce in the preferred embodiment of the present application, and fig. 5 is a flowchart of the data screening and extracting method based on MapReduce in the preferred embodiment of the present application, as shown in fig. 5.
S501, inputting screening extraction information, comprising: the data input path, the data screening requirement information (JSON format) and the data output path, wherein the screening requirement information comprises an input separator, screening condition information, extraction field information and an output separator.
S502, judging the validity of the screening extraction information input in S501. If the input screening requirement information is legal, continue; otherwise, prompt the user that the input information is illegal. Then judge whether the input path exists: if it does not, prompt that the input path does not exist and that the information must be input again. If the input path exists, judge whether the output path exists: if it does, prompt that the output path already exists and that the information must be input again; if it does not, proceed to the next step.
S503, according to the screening and extracting information input in S501, carrying out data screening and extracting processes in the Map stage of the MapReduce framework.
S504, outputting the screening and extraction result data to the data output path.
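To make the input of S501 and the checks of S502 concrete, the sketch below shows one possible shape for the JSON screening requirement information together with a validity check. Every field name in the JSON is an assumption, `REQUIRED_KEYS` merely stands in for the unspecified "preset standard", and local filesystem checks are used in place of whatever distributed file system the job would actually query.

```python
import json
import os

# Hypothetical screening requirement information; every field name is an assumption.
REQUIREMENT_TEXT = """
{
  "input_separator": ",",
  "output_separator": "|",
  "screening": {
    "relation": "and",
    "items": [{"key": 1, "type": "int", "action": "gt", "value": "30"}]
  },
  "extraction": [{"data_key": 0, "data_type": "string", "field_key": null}]
}
"""

# Stand-in for the "preset standard" that the requirement information must meet.
REQUIRED_KEYS = {"input_separator", "output_separator"}


def validate(input_path, requirement_text, output_path):
    """Return (ok, message) for the screening extraction information."""
    if not requirement_text or not requirement_text.strip():
        return False, "screening requirement information is empty"
    try:
        requirement = json.loads(requirement_text)
    except json.JSONDecodeError:
        return False, "screening requirement information is not valid JSON"
    if not isinstance(requirement, dict) or not REQUIRED_KEYS.issubset(requirement):
        return False, "screening requirement information does not meet the preset standard"
    if not os.path.exists(input_path):
        return False, "input path does not exist"
    if os.path.exists(output_path):
        # MapReduce jobs conventionally refuse to overwrite an existing output directory.
        return False, "output path already exists"
    return True, "ok"


print(validate("no_such_input.txt", REQUIREMENT_TEXT, "out_dir"))
```

The checks run in the same order as the text: requirement information first, then the input path, then the output path, so the first failing check determines the message shown to the user.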
Fig. 6 is a schematic flow chart of the data screening and extraction process in the preferred embodiment of the present application. As shown in Fig. 6, the data screening and extraction process in S503 further includes the following steps:
S5031, performing setup initialization and parsing the input screening extraction information;
S5032, reading each row of data records in the file in a loop and performing map processing on it:
S50321, taking each row of data in the data file to be screened as the value in the map;
S50322, judging whether data screening is to be performed; if so, separating the data according to the input separator, screening it according to the input screening condition information, and judging whether the data meets the screening conditions: if it does not, ignoring the data, and if it does, continuing to step S50323; if data screening is not to be performed, proceeding directly to step S50323;
S50323, judging whether data extraction is to be performed; if so, extracting data according to the extraction field information, taking the result data as the output of the Map stage, and continuing to step S50324; if not, judging whether the data is to be output as is: if it is, taking the data as the output of the Map stage and continuing to step S50324, and if it is not, ignoring the data and continuing to step S50324;
S50324, judging whether the map process is continued, if yes, returning to the step S50321, and if not, ending the cycle.
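The per-record control flow of S50321 to S50324 can be sketched as a plain Python generator. This is not Hadoop code; it only mirrors the branching described above. The parameter names are hypothetical, the screening and extraction logic are passed in as callables, and, as a simplification, every line is split even when no screening is configured.

```python
def map_records(lines, in_sep, keep=None, extract=None, output_as_is=True):
    """Yield Map-stage outputs for an iterable of text lines.

    keep:    optional predicate over the split fields (the screening of S50322)
    extract: optional function turning split fields into an output line (S50323)
    """
    for line in lines:  # S50321: each row is the map value
        record = line.rstrip("\n")
        fields = record.split(in_sep)
        if keep is not None and not keep(fields):
            continue  # S50322: the record fails screening and is ignored
        if extract is not None:
            yield extract(fields)  # S50323: emit the extracted data
        elif output_as_is:
            yield record  # S50323: emit the record unchanged
        # otherwise the record is dropped; advancing the loop is S50324


rows = ["alice,34,Berlin", "bob,25,Paris"]
result = list(
    map_records(
        rows,
        ",",
        keep=lambda f: int(f[1]) > 30,
        extract=lambda f: f[0] + "|" + f[2],
    )
)
print(result)  # ['alice|Berlin']
```

Because both screening and extraction happen inside the map call, a job built this way would need no Reduce stage for pure filtering and projection.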
Fig. 7 is a schematic flow chart of a data screening process in a preferred embodiment of the present application, as shown in fig. 7, S50322 further includes the following steps:
S503221, initializing the screening condition pass flag to true (flag = true), obtaining the logical relationship between the screening condition items in the input screening condition information, and traversing the screening condition items in a loop;
S503222, judging whether every screening condition item has been traversed; if not, obtaining the information of the next screening condition item, which includes a screening data Key, a screening data type, a filtering action, and a filtering condition value, obtaining the corresponding data value according to the screening data Key, parsing and converting that data according to the screening data type, executing the filtering mode that corresponds to the filtering action, and performing the filtering judgment according to the filtering condition value;
S503223, judging whether the data record meets the screening condition item. If it does, further judging whether the logical relationship is an AND relationship: if so, continuing with the next screening condition item, and if not, retaining the data record. If the data record does not meet the screening condition item, further judging whether the logical relationship is an AND relationship: if so, ignoring the data record, and if not, setting the pass flag to false and continuing with the next screening condition item;
S503224, once every screening condition item has been traversed, judging whether the pass flag is true; if it is, retaining the data record, and if it is false, ignoring the data record.
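The flag-based traversal of S503221 to S503224 amounts to a short-circuiting AND/OR evaluation. The sketch below assumes the logical relation is either AND (all items must hold) or OR (any item suffices); the function and parameter names are hypothetical.

```python
def record_passes(fields, items, relation, matches):
    """Decide whether a record is kept, following S503221 to S503224.

    relation is "and" or "or"; matches(fields, item) tests one condition item.
    """
    flag = True  # S503221: the pass flag starts as true
    for item in items:  # S503222: traverse the condition items
        if matches(fields, item):  # S503223
            if relation != "and":
                return True  # OR: one satisfied item keeps the record
        else:
            if relation == "and":
                return False  # AND: one failed item drops the record
            flag = False  # OR: remember the failure and keep looking
    return flag  # S503224: the flag decides once traversal is complete


def is_adult(f):
    return int(f[1]) >= 18


def in_berlin(f):
    return f[2] == "Berlin"


def check(fields, item):
    return item(fields)


record = ["bob", "25", "Paris"]
print(record_passes(record, [is_adult, in_berlin], "and", check))  # False
print(record_passes(record, [is_adult, in_berlin], "or", check))   # True
```

The flag only matters in the OR case: it is still true after the loop when every item held under AND, and it is false when every item failed under OR.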
Fig. 8 is a schematic flow chart of data extraction in the preferred embodiment of the present application. As shown in Fig. 8, S50323 further includes the following steps:
S503231, judging whether every extraction information item has been traversed; if so, ending the loop, and if not, continuing to S503232;
S503232, obtaining an extraction information item of the extraction field information, which includes an extraction data Key, an extraction data type, and an extraction field Key; acquiring the corresponding data according to the extraction data Key; parsing and converting that data according to the extraction data type; acquiring the corresponding data from the parsed data according to the extraction field Key; splicing the data according to the output separator; outputting it to the output path; and returning to step S503231.
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
This embodiment also provides a MapReduce-based data screening and extraction system, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated here. As used below, the terms "module," "unit," "sub-unit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 9 is a block diagram of a data filtering and extracting system based on MapReduce according to an embodiment of the present application, as shown in fig. 9, the system includes:
screening extraction information input module 1 inputs screening extraction information including: the data input path, screening requirement information and data output path, wherein the screening requirement information comprises input separator, screening condition information, extraction field information and output separator;
the validity judging module 2 judges whether the screening requirement information is empty; if not, judges whether the screening requirement information is in JSON format; if so, judges whether the screening requirement information conforms to a preset standard; if so, further judges whether the input path exists; if so, judges whether the output path exists; and if the output path does not exist, the screening extraction information is legal;
The data screening module 3 is used for initializing and analyzing screening extraction information under a MapReduce framework, reading each row of data records in the data to be screened, entering a map processing stage, separating the data records according to the input separator, judging whether the data to be screened contains screening condition information, screening the data and obtaining corresponding data fields if the data to be screened contains the screening condition information, and not screening if the data to be screened does not contain the screening condition information;
and the data extraction module 4 is used for judging whether the data to be screened contains extraction field information, and extracting the corresponding data field and outputting the extracted data to the output path if the data to be screened contains the extraction field information.
In some of these embodiments, the data screening module 3 specifically includes:
a screening condition relation acquiring unit 31 for initializing a screening condition pass identifier to true and acquiring the logical relation among the screening condition items in the screening condition information;
the screening cycle traversing unit 32 judges whether the cycle traversal of the screening condition items is completed; if it is completed, judges whether the screening condition pass identifier is true, retaining the data record if so and ignoring the data record if not; and if it is not completed, proceeds to the next step;
a screening judging unit 33 for judging whether a row of data records meets the screening condition item,
if the row meets the screening condition item, judging whether the logical relation is a logical AND; if so, returning to the screening cycle traversing step, and if not, retaining the data record;
if the row does not meet the screening condition item, judging whether the logical relation is a logical AND; if so, ignoring the data record, and if not, setting the screening condition pass identifier to false and returning to the screening cycle traversing step.
In some embodiments, the screening condition item includes a screening data Key, a screening data type, a filtering action, and a filtering condition value, and the screening judging unit 33 includes:
the data content obtaining subunit 331 obtains the corresponding data content in the data record according to the screening data Key;
a type conversion subunit 332 for parsing the data content and converting its type according to the screening data type;
the screening judgment subunit 333 performs screening judgment on the data content, using the screening mode and logical relation that correspond to the filtering action and the filtering condition value.
In some of these embodiments, the extraction field information includes an extraction data Key, an extraction data type, and an extraction field Key, and the data extraction module 4 further includes:
a corresponding data acquisition unit 41 for acquiring a corresponding data value from the data field obtained by screening according to the extracted data Key;
the data conversion unit 42 performs corresponding data analysis conversion on the data value according to the extracted data type and obtains converted data;
a data acquisition unit 43 that acquires corresponding extraction data from the conversion data according to the extraction field Key;
the data output unit 44 concatenates the extracted data according to the output separator until each extracted information item in the extracted field information is traversed circularly, and outputs the concatenated extracted data to the output path.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
In addition, the data screening and extracting method based on MapReduce in the embodiment of the application described in connection with fig. 1 may be implemented by an electronic device. Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
In particular, the processor 81 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 82 may include mass storage for data or instructions. By way of example, and not limitation, memory 82 may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. The memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In a particular embodiment, the memory 82 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be static random access memory (SRAM) or dynamic random access memory (DRAM), where the DRAM may be fast page mode DRAM (FPM DRAM), extended data out DRAM (EDO DRAM), synchronous DRAM (SDRAM), or the like.
Memory 82 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any of the MapReduce-based data screening extraction methods of the above embodiments.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 10, the processor 81, the memory 82, and the communication interface 83 are connected to each other via the bus 80 and perform communication with each other.
The communication interface 83 is used to implement communication between the modules, devices, and/or units in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
Bus 80 includes hardware, software, or both that couple the components of the electronic device to one another. Bus 80 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a front side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable bus, or a combination of two or more of the foregoing. Bus 80 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
In addition, in combination with the MapReduce-based data screening and extraction method in the above embodiment, the embodiments of the present application may provide a computer readable storage medium for implementation. The computer readable storage medium has stored thereon computer program instructions; the computer program instructions, when executed by the processor, implement any of the MapReduce-based data screening extraction methods of the above embodiments.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above examples represent only a few embodiments of the present application. They are described in relative detail, but they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and these would fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (8)
1. The data screening and extracting method based on MapReduce is characterized by comprising the following steps of:
a screening extraction information input step of inputting screening extraction information, the screening extraction information including: a data input path, screening requirement information, and a data output path, wherein the screening requirement information comprises an input separator, screening condition information, extraction field information, and an output separator;
a validity judging step of judging whether the screening requirement information is empty; if not, judging whether the screening requirement information is in JSON format; if so, judging whether the screening requirement information conforms to a preset standard; if so, further judging whether the input path exists; if so, judging whether the output path exists; and if the output path does not exist, judging that the screening extraction information is legal;
a data screening step of initializing and parsing the screening extraction information under a MapReduce framework, reading each row of data records in the data to be screened, entering a map processing stage, separating the data records according to the input separator, and judging whether the data to be screened contains the screening condition information: if it does, screening the data and acquiring the corresponding data fields, and if it does not, performing no screening;
a data extraction step of judging whether the data to be screened contains the extraction field information, and if so, extracting the corresponding data field and outputting the extracted data to the output path;
the data screening step specifically comprises the following steps:
a screening condition relation acquisition step of initializing a screening condition pass identifier to true and acquiring the logical relation among the screening condition items in the screening condition information;
a screening cycle traversing step of judging whether the cycle traversal of the screening condition items is completed; if it is completed, judging whether the screening condition pass identifier is true, retaining the data record if so and ignoring the data record if not; and if it is not completed, carrying out the next step;
a screening judging step of judging whether a row of the data records meets the screening condition item,
if the row meets the screening condition item, judging whether the logical relation is a logical AND; if so, returning to the screening cycle traversing step, and if not, retaining the row of the data records;
if the row does not meet the screening condition item, judging whether the logical relation is a logical AND; if so, ignoring the row of the data records, and if not, setting the screening condition pass identifier to false and returning to the screening cycle traversing step.
2. The MapReduce-based data screening and extraction method according to claim 1, wherein the screening condition item includes a screening data Key, a screening data type, a filtering action, and a filtering condition value, and the specific step of determining whether the data record meets the screening condition item is:
a data content acquisition step, namely acquiring corresponding data content in the data record according to the screening data Key;
a type conversion step of parsing the data content and converting its type according to the screening data type;
and a screening and judging step of performing screening judgment on the data content, using the screening mode and logical relation that correspond to the filtering action and the filtering condition value.
3. The MapReduce-based data screening and extraction method of claim 1, wherein the extraction field information comprises an extraction data Key, an extraction data type, and an extraction field Key, the data extraction step further comprising:
a corresponding data acquisition step, namely acquiring a corresponding data value from the data field obtained by screening according to the extracted data Key;
a data conversion step, namely carrying out corresponding data analysis conversion on the data value according to the extracted data type to obtain conversion data;
a data acquisition step of acquiring corresponding extraction data from the conversion data according to the extraction field Key;
and a data output step, namely splicing the extracted data according to the output separator until each extracted information item in the extracted field information is circularly traversed, and outputting the spliced extracted data to the output path.
4. A MapReduce-based data screening and extraction system, applying the MapReduce-based data screening and extraction method of any one of claims 1-3, comprising:
screening extraction information input module, input screening extraction information, screening extraction information includes: the data input path, screening requirement information and the data output path, wherein the screening requirement information comprises an input separator, screening condition information, extraction field information and an output separator;
the validity judging module judges whether the screening requirement information is empty; if not, judges whether the screening requirement information is in JSON format; if so, judges whether the screening requirement information conforms to a preset standard; if so, further judges whether the input path exists; if so, judges whether the output path exists; and if the output path does not exist, the screening extraction information is legal;
the data screening module is used for initializing and parsing the screening extraction information under a MapReduce framework, reading each row of data records in the data to be screened, entering a map processing stage, separating the data records according to the input separator, and judging whether the data to be screened contains the screening condition information: if it does, screening the data and acquiring the corresponding data fields, and if it does not, performing no screening;
the data extraction module is used for judging whether the data to be screened contains the extraction field information, and if so, extracting the corresponding data field and outputting the extracted data to the output path;
the data screening module specifically comprises:
the screening condition relation acquisition unit is used for initializing a screening condition pass identifier to true and acquiring the logical relation among the screening condition items in the screening condition information;
the screening cycle traversing unit is used for judging whether the cycle traversal of the screening condition items is completed; if it is completed, judging whether the screening condition pass identifier is true, retaining the data record if so and ignoring the data record if not; and if it is not completed, carrying out the next step;
a screening judging unit for judging whether a row of the data records meets the screening condition item,
if the row meets the screening condition item, judging whether the logical relation is a logical AND; if so, returning to the screening cycle traversing step, and if not, retaining the row of the data records;
if the row does not meet the screening condition item, judging whether the logical relation is a logical AND; if so, ignoring the row of the data records, and if not, setting the screening condition pass identifier to false and returning to the screening cycle traversing step.
5. The MapReduce-based data screening and extraction system of claim 4, wherein the screening condition item comprises a screening data Key, a screening data type, a filtering action, and a filtering condition value, and wherein the screening determination unit comprises:
a data content acquisition subunit, for acquiring corresponding data content in the data record according to the screening data Key;
a type conversion subunit, for performing data analysis and type conversion on the data content according to the screening data type;
and the screening and judging subunit is used for performing screening judgment on the data content, using the screening mode and logical relation that correspond to the filtering action and the filtering condition value.
6. The MapReduce-based data screening and extraction system of claim 4, wherein the extraction field information comprises an extraction data Key, an extraction data type, and an extraction field Key, the data extraction module further comprising:
the corresponding data acquisition unit acquires a corresponding data value from the data field obtained by screening according to the extracted data Key;
the data conversion unit is used for carrying out corresponding data analysis conversion on the data value according to the extracted data type and obtaining conversion data;
the data acquisition unit acquires corresponding extraction data from the conversion data according to the extraction field Key;
and the data output unit splices the extracted data according to the output separator until each extracted information item in the extracted field information is traversed circularly, and outputs the spliced extracted data to the output path.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a MapReduce-based data screening extraction method as claimed in any one of claims 1 to 3 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a MapReduce-based data screening extraction method as claimed in any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110563545.3A CN113204706B (en) | 2021-05-24 | 2021-05-24 | Data screening and extracting method and system based on MapReduce |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113204706A (en) | 2021-08-03 |
CN113204706B (en) | 2024-01-12 |
Family
ID=77022993
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110563545.3A Active CN113204706B (en) | 2021-05-24 | 2021-05-24 | Data screening and extracting method and system based on MapReduce |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113204706B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115168714B (en) * | 2022-07-07 | 2023-11-10 | 中国测绘科学研究院 | Web API data extraction method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150400A (en) * | 2013-03-27 | 2013-06-12 | 领航动力信息系统有限公司 | MapReduce-framework-based data screening method |
CN103200262A (en) * | 2013-04-02 | 2013-07-10 | 亿赞普(北京)科技有限公司 | Method, device and system for advertisement scheduling based on mobile network |
CN104484616A (en) * | 2014-12-03 | 2015-04-01 | 浪潮电子信息产业股份有限公司 | Privacy protection method under MapReduce data processing framework |
CN108491526A (en) * | 2018-03-28 | 2018-09-04 | 腾讯科技(深圳)有限公司 | Daily record data processing method, device, electronic equipment and storage medium |
CN110019308A (en) * | 2017-12-28 | 2019-07-16 | 中国移动通信集团海南有限公司 | Data query method, apparatus, equipment and storage medium |
CN111104390A (en) * | 2019-11-08 | 2020-05-05 | 珠海金山网络游戏科技有限公司 | Method and system for merging and checking multiple CSV files |
CN112597145A (en) * | 2020-12-29 | 2021-04-02 | 恩亿科(北京)数据科技有限公司 | Real-time data cleaning method, system, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015134193A1 (en) * | 2014-03-07 | 2015-09-11 | Ab Initio Technology Llc | Managing data profiling operations related to data type |
Non-Patent Citations (1)
Title |
---|
利用大数据开源项目实现医疗临床大数据筛选;陈军晓;汤其宇;刘逸敏;;中国数字医学(10);全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||