CN109947429B - Data processing method and device - Google Patents


Info

Publication number: CN109947429B
Application number: CN201910190563.4A
Authority: CN (China)
Prior art keywords: data, separator, processed, target, preset
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN109947429A
Inventor: 吴庆双
Current assignee: MIGU Culture Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: MIGU Culture Technology Co Ltd
Application filed by MIGU Culture Technology Co Ltd
Priority to CN201910190563.4A
Publication of CN109947429A
Application granted
Publication of CN109947429B


Abstract

The embodiment of the invention provides a data processing method and device. The method is applied to the computing engine Spark and comprises the following steps: detecting that the default separator used by Spark cannot split the data to be processed, and acquiring a preset separator library of Spark; traversing the separators in the preset separator library to acquire a target separator that matches the data to be processed, namely a separator that successfully splits part of the data to be processed and satisfies a preset check rule; and setting the target separator as the default separator of Spark. The embodiment of the invention solves the problem in the prior art that each Spark version usually has a preset default separator and cannot process data that uses a non-default separator.

Description

Data processing method and device
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a data processing method and device.
Background
Spark is a general-purpose computing engine in the field of large-scale data processing. Spark is open source, can be applied to algorithms that need iteration, such as data mining and machine learning, and, in addition to providing interactive queries, can optimize iterative workloads.
However, in the prior art, each Spark version usually has a preset default delimiter and cannot process data that uses a non-default delimiter. For example, when the Spark version is cdh5.5.0, Spark of this version can only recognize the default delimiter. If the separator used in the data to be processed is not Spark's default separator, Spark cannot correctly identify and split the data, and thus cannot process it.
Disclosure of Invention
Embodiments of the present invention provide a data processing method and apparatus, so as to solve the problem that in the prior art, each Spark version generally has a preset default delimiter and cannot process data of a non-default delimiter.
In one aspect, an embodiment of the present invention provides a data processing method, where the method is applied to a compute engine Spark, and the method includes:
detecting that the default separator used by Spark cannot split the data to be processed, and acquiring a preset separator library of Spark;
traversing separators in the preset separator library, and acquiring target separators which are matched with the data to be processed in the separators; the target separator is a separator which successfully divides part of data in the data to be processed and meets a preset check rule;
setting the target separator as a default separator of the Spark.
In one aspect, an embodiment of the present invention provides a data processing apparatus, which is applied to a compute engine Spark, where the apparatus includes:
the monitoring module is used for detecting that the default separator used by Spark cannot split the data to be processed, and for acquiring a preset separator library of Spark;
the traversal module is used for traversing the separators in the preset separator library and acquiring target separators which are matched with the data to be processed in the separators; the target separator is a separator which successfully divides part of data in the data to be processed and meets a preset check rule;
a setting module, configured to set the target separator as a default separator of the Spark.
On the other hand, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, a bus, and a computer program stored in the memory and executable on the processor, where the processor implements the steps in the data processing method when executing the program.
In still another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps in the data processing method.
According to the data processing method and device provided by the embodiment of the invention, it is detected that the default separator used by Spark cannot split the data to be processed, and a preset separator library of Spark is acquired; the separators in the preset separator library are traversed to acquire a target separator that matches the data to be processed, namely the separator actually used in the data to be processed; and the target separator is set as the default separator of Spark, so that Spark can identify and process the data to be processed, which improves the universality and the data processing efficiency of Spark.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a data processing method according to another embodiment of the present invention;
Fig. 3 is a schematic flow chart of an example of an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
To make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided only to help the full understanding of the embodiments of the present invention. Thus, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be appreciated that reference throughout this specification to "an embodiment" or "one embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase "in an embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present invention, it should be understood that the sequence numbers of the following processes do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In the embodiments provided herein, it should be understood that "B corresponding to a" means that B is associated with a from which B can be determined. It should also be understood that determining B from a does not mean determining B from a alone, but may be determined from a and/or other information.
Fig. 1 is a schematic flowchart illustrating a data processing method according to an embodiment of the present invention.
As shown in fig. 1, the data processing method provided in the embodiment of the present invention is applied to a compute engine Spark, and the method specifically includes the following steps:
step 101, monitoring and knowing that the default delimiter of the Spark application cannot divide the data to be processed, and acquiring a preset delimiter library of the Spark.
The data to be processed may be log data or other data.
On the one hand, a separator is used to segment the data to be processed while the data is identified or read; on the other hand, separators mark where text is separated, or mark the start of a new row or column.

When Spark identifies the data to be processed, it splits the data using the separator. Each Spark version usually has a preset default separator; if the default separator is inconsistent with the separator carried in the data to be processed, Spark cannot split the data to be processed, and thus cannot identify and process it.

Therefore, if it is detected that Spark cannot split the data to be processed with the default delimiter (for example, the data read by Spark is one continuous block in which no fields are separated), the preset delimiter library of Spark is acquired, and the delimiters in the preset delimiter library are called to split the data to be processed.
Step 102, traversing separators in the preset separator library, and acquiring target separators matched with the data to be processed in the separators; the target separator is a separator which successfully divides part of the data in the data to be processed and meets a preset check rule.
The separators in the preset separator library are traversed in a loop until the target separator is obtained; the target separator is the separator actually used in the data to be processed.
Specifically, for each separator in the preset separator library, part of the data to be processed is split with that separator. If the part of the data is split successfully, other data in the data to be processed is selected to continue checking the separator; once the preset check rule is met, the separator is determined to be the target separator.
Optionally, the preset check rule may be a limit on the number of checks; for example, it may include a minimum-check-count threshold, and the minimum-check-count threshold may be obtained by means of deep learning.
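The check procedure described above (split part of the data, then keep checking with further data until the preset rule is met) can be sketched as follows. This is a minimal illustration, not the patented implementation; the class and method names, the use of Java's String.split, and the sample-based loop are all assumptions of the sketch.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the per-separator check: a candidate separator is verified
// against successive samples of the data, and accepted once the number of
// consecutive successful splits reaches a minimum threshold (the "preset
// check rule"). All names here are illustrative.
class SeparatorCheck {
    // expectedFields is the known attribute count of the data table.
    static boolean isTargetSeparator(List<String> samples, String candidate,
                                     int expectedFields, int minSuccesses) {
        int successes = 0;
        for (String sample : samples) {
            // A sample is "successfully divided" when the split yields
            // exactly the expected number of fields. Pattern.quote guards
            // against separators that are regex metacharacters.
            if (sample.split(Pattern.quote(candidate), -1).length != expectedFields) {
                return false; // one failure excludes the candidate
            }
            if (++successes >= minSuccesses) {
                return true; // preset check rule satisfied
            }
        }
        return false; // too few samples to satisfy the rule
    }
}
```

A separator that happens to split one sample correctly is rejected as soon as a later sample fails, which is why the check keeps drawing further data instead of accepting the first success.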
Step 103, setting the target separator as a default separator of the Spark.
After the target separator is determined, it is set as the default separator of Spark, for example by adding it to the default separator library of Spark (the library that contains Spark's default separators). Spark can then identify and process the current data to be processed, and when Spark subsequently encounters data that uses the target separator again, that data can also be processed.
In the above embodiment of the present invention, it is detected that the default delimiter used by Spark cannot split the data to be processed, and a preset delimiter library of Spark is acquired; the separators in the preset separator library are traversed to acquire a target separator that matches the data to be processed, namely the separator actually used in the data to be processed; and the target separator is set as the default separator of Spark, so that Spark can identify and process the data to be processed, which improves the universality and the data processing efficiency of Spark. The embodiment of the invention solves the problem in the prior art that each Spark version usually has a preset default separator and cannot process data that uses a non-default separator.
Fig. 2 is a flowchart illustrating a data processing method according to another embodiment of the present invention.
As shown in fig. 2, the data processing method provided in the embodiment of the present invention is applied to a compute engine Spark, and the method specifically includes the following steps:
step 201, monitoring and knowing that the default delimiter applied by the Spark cannot divide the data to be processed, and acquiring a preset delimiter library of the Spark.
A separator is used to split the data to be processed when the data is identified or read; separators mark where text is separated, or mark the start of a new row or column. The data to be processed may be log data or other data.
When Spark identifies the data to be processed, it splits the data using the separator. Each Spark version usually has a preset default separator; if the default separator is inconsistent with the separator carried in the data to be processed, Spark cannot split the data to be processed, and thus cannot identify and process it.

Therefore, when it is detected that Spark cannot split the data to be processed with the default delimiter (for example, when the data read is one continuous block with no separated fields), the preset delimiter library of Spark is acquired, and the delimiters in the preset delimiter library are called to split the data to be processed.
Step 202, for each delimiter in the preset delimiter library, sequentially splitting multiple groups of target data that meet a preset string-length requirement in the data to be processed, and obtaining the cumulative number of consecutive successful splits; each time a group of target data is split successfully, the cumulative count is increased by one.

For each separator, part of the data is randomly extracted from the data to be processed as target data; the extracted target data meets the preset string-length requirement, which prevents the string from being too short to split.

After a group of target data is extracted, it is split first. If the split succeeds, the cumulative count is increased by one, new target data is extracted from the remaining data to be processed, and splitting continues. If the split fails, the extraction loop is stopped and the separator is excluded. The cumulative count is the number of consecutive successful splits.
Step 203, if the accumulated times meet a preset check rule, determining the separator as a target separator matched with the data to be processed.
If the cumulative count meets the preset check rule, the separator is determined to be the target separator. Optionally, the preset check rule may be a limit on the cumulative count; for example, it may include a minimum cumulative-count threshold, and the minimum cumulative-count threshold may be obtained by means of deep learning.
Step 204, setting the target separator as the default separator of Spark.
After the target separator is determined, it is set as the default separator of Spark, for example by adding it to the default separator library of Spark (the library that contains Spark's default separators). Spark can then identify and process the current data to be processed, and when Spark subsequently encounters data that uses the target separator again, that data can also be processed.
Optionally, in the above embodiment of the present invention, the step of sequentially segmenting multiple sets of target data that meet a requirement of a preset string length in the data to be processed includes:
aiming at each group of target data, acquiring a preset attribute numerical value of the target data;
dividing the target data by using the separators, wherein if the number of the divided fields is the same as the preset attribute value, the division is successful; otherwise, the segmentation fails.
The preset attribute value is the number of attributes contained in the target data; each attribute corresponds to one data type. When Spark splits the data to be processed, data of different types are separated by separators. Therefore, if the number of split fields is the same as the preset attribute value, the split succeeds; if the number of split fields differs from the preset attribute value, the split fails.
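The field-count test can be illustrated with a short helper. Two Java specifics in this sketch are assumptions not stated in the text: String.split takes a regular expression, so metacharacter separators such as "|" must be quoted, and a negative limit is needed to keep trailing empty fields in the count.

```java
import java.util.regex.Pattern;

// Count the fields a separator produces; the split succeeds when this
// count equals the preset attribute value of the target data.
class FieldCount {
    static int count(String record, String separator) {
        // Pattern.quote treats the separator literally; the -1 limit
        // preserves trailing empty fields.
        return record.split(Pattern.quote(separator), -1).length;
    }
}
```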
As an example, when the data to be processed is log data, taking JAVA as an example, a class SplitCharUtils, a method getSplitLibrary, and a method getRealSplitChar are first created in the Spark source code.

Specifically, in the first step, the class SplitCharUtils may be created in the Spark source code.

This class can be used to define three variables, namely ATTR_COUNT, SPLIT_CHAR_TMP, and SPLIT_CHAR, and a static constant SUCCESS_COUNT = 1000, which is the minimum check threshold specified in the preset check rule.

The variable ATTR_COUNT indicates the number of attributes of the source data table (i.e., the data to be processed) read by the current Spark task; the variable SPLIT_CHAR_TMP indicates one delimiter currently taken from the preset delimiter library of Spark; SPLIT_CHAR represents the target delimiter of the currently acquired data.

In the second step, the method getSplitLibrary is created in the Spark source code. This method can be used to read the split_library table and return a List according to the read result, where the split_library table stores the separators of the preset separator library.

In the third step, the method getRealSplitChar is created. One parameter is passed into this method, which can be used to identify the currently acquired log data log_data. A variable success_count_tmp = 0 is defined at the beginning of the method to record the number of successful checks.

After the creation process is completed, when Spark reads log data, the getSplitLibrary method may be called first to obtain the set of separators in the library; the set is then traversed, the separators in the set are taken out in turn, and each taken-out separator is assigned to the variable SPLIT_CHAR_TMP.

At this time, Spark may use the variable SPLIT_CHAR_TMP as a parameter and obtain, through the splitting function log_data.split(SPLIT_CHAR_TMP).length, the number of fields data_attr_count into which the current log data is split by SPLIT_CHAR_TMP;

the obtained field count data_attr_count is then compared with the attribute count ATTR_COUNT of the source data table read by the current Spark task: if data_attr_count == ATTR_COUNT, the value of success_count_tmp is increased by 1; if data_attr_count != ATTR_COUNT, success_count_tmp is reset to 0.

After that, Spark can take the next separator in the preset separator library and check it according to the above procedure, until the check of some separator satisfies success_count_tmp == SUCCESS_COUNT; the current separator SPLIT_CHAR_TMP is then assigned to SPLIT_CHAR, that is, the target separator of the log data acquired by this task is SPLIT_CHAR.
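The walkthrough above can be condensed into one runnable sketch. The names follow the text (SplitCharUtils, ATTR_COUNT, SPLIT_CHAR_TMP, SPLIT_CHAR, SUCCESS_COUNT, getSplitLibrary, getRealSplitChar), but the separator library is hard-coded here instead of being read from the split_library table, and SUCCESS_COUNT is lowered from the 1000 in the text so the demo terminates on a few records; treat it as an illustration, not the patented source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Condensed sketch of the class described in the walkthrough.
class SplitCharUtils {
    static int ATTR_COUNT;              // attribute count of the source data table
    static String SPLIT_CHAR_TMP;       // candidate separator under test
    static String SPLIT_CHAR;           // detected target separator
    static final int SUCCESS_COUNT = 3; // minimum check threshold (1000 in the text)

    // Stand-in for reading the split_library table.
    static List<String> getSplitLibrary() {
        return Arrays.asList("/", "|", ";", ",");
    }

    // Traverse the library; return the separator whose consecutive
    // successful checks reach SUCCESS_COUNT, or null if none qualifies.
    static String getRealSplitChar(List<String> logData) {
        for (String candidate : getSplitLibrary()) {
            SPLIT_CHAR_TMP = candidate;
            int successCountTmp = 0;
            for (String record : logData) {
                int dataAttrCount =
                        record.split(Pattern.quote(SPLIT_CHAR_TMP), -1).length;
                if (dataAttrCount == ATTR_COUNT) {
                    successCountTmp++;
                    if (successCountTmp >= SUCCESS_COUNT) {
                        SPLIT_CHAR = candidate;
                        return candidate;
                    }
                } else {
                    successCountTmp = 0; // reset and move to the next candidate
                    break;
                }
            }
        }
        return null;
    }
}
```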
Optionally, in the above embodiment of the present invention, the step of sequentially segmenting multiple sets of target data that meet a requirement of a preset string length in the data to be processed includes:
randomly extracting data meeting the length requirement of a preset character string from the data to be processed to serve as a current group of target data, and segmenting the current group of target data by using the separator;
and if the current group of target data is successfully segmented, randomly extracting data meeting the length requirement of the preset character string from the non-current group of target data of the data to be processed again to serve as a new group of target data.
First, a group of data is randomly extracted as the current group of target data and split: if the split succeeds, the extraction loop continues, that is, data meeting the preset string-length requirement is randomly extracted as a new group of target data, and the cumulative count is recorded; if the split fails, the extraction loop is stopped and the separator is excluded.
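The random extraction step can be sketched as follows, assuming the data to be processed is available as a list of records; the class name, the seed parameter, and the use of a shuffle are all illustrative choices, not taken from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Randomly pick records that satisfy the preset string-length requirement,
// to serve as groups of target data for the separator check.
class TargetDataSampler {
    static List<String> sample(List<String> records, int minLength, int groups, long seed) {
        List<String> candidates = new ArrayList<>();
        for (String r : records) {
            if (r.length() >= minLength) {
                candidates.add(r); // too-short strings are never sampled
            }
        }
        Collections.shuffle(candidates, new Random(seed));
        return candidates.subList(0, Math.min(groups, candidates.size()));
    }
}
```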
Optionally, in the above embodiment of the present invention, the preset check rule includes that the number of times of successful continuous segmentation is greater than or equal to a preset value.
The preset value, namely the minimum check threshold, can be set in advance.
Optionally, in the foregoing embodiment of the present invention, the step of segmenting the target data by using the delimiter includes:
and assigning the separator to a preset segmentation function, and segmenting the target data through the preset segmentation function.
Optionally, a preset segmentation function is used to split the target data with the separator. For example, in the above example, the preset segmentation function may be log_data.split(SPLIT_CHAR_TMP).length, and success_count_tmp is used to record the number of successful checks.

Optionally, in the above embodiment of the present invention, the step of setting the target separator as the default separator of Spark includes:
and adding the configuration file of the target separator in the separator configuration file of the Spark.
For example, a configuration file for the target separator is newly added under the Spark directory; the configuration file may be a Spark config.

It should be noted that if at least two target separators are actually used in the data to be processed, a plurality of such Spark config. files may be added.
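The text names a configuration file for the target separator but does not give its format. As a hypothetical sketch, a Java properties file with a single key split.char (an invented name) could carry the separator:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Render a separator configuration file in properties format. The key
// "split.char" is an assumption of this sketch, not taken from the text.
class SeparatorConfigWriter {
    static String render(String separator) {
        Properties props = new Properties();
        props.setProperty("split.char", separator);
        StringWriter out = new StringWriter();
        try {
            props.store(out, "auto-detected separator");
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with StringWriter
        }
        return out.toString();
    }
}
```

One such file per detected separator matches the note that several configuration files may coexist when the data uses more than one separator.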
Specifically, referring to Fig. 3, taking JAVA as an example, the data processing method mainly includes the following steps:
In step 301, after receiving a command to read log data, Spark determines the attribute count ATTR_COUNT of the data table in the log data according to the data in the Hive table.

The data in the Hive table is partitioned based on the separator actually used in the log data. Optionally, when Spark reads data in the Hive table, the number of attributes ATTR_COUNT of the table (the "from deserializer" attributes) may be obtained through a "desc table_name" command.

Step 302, after Spark acquires the attribute count ATTR_COUNT, for the currently acquired data, separator matching is performed one by one against the preset separator library of Spark, namely the database table split_library, to determine the target separator.
For example, suppose the current log data is "attr1,attr2,attr3", and Spark determines from the data in the Hive table that ATTR_COUNT = 3;

the separator taken out of split_library the first time may be "/". Spark splits the log data with "/" and obtains an array of length 1, which does not match the previously acquired ATTR_COUNT = 3, so Spark takes the next separator out of split_library;

if the delimiter obtained the second time is ",", splitting the log data with "," yields an array of length 3, which matches ATTR_COUNT = 3. The candidate SPLIT_CHAR_TMP of the current task is then set to ",", and further data is fetched to check ",".

Specifically, during verification, with SPLIT_CHAR_TMP = ",", new target data "attr4,attr5,attr6" is fetched and Spark splits it directly with SPLIT_CHAR_TMP; the number of elements in the resulting array is again 3, matching the previously acquired ATTR_COUNT = 3. After this check succeeds in a loop the required number of times, the separator actually used by the current Spark task can be determined to be ",", and the system separator SPLIT_CHAR is then set to ",".
Optionally, SPLIT_CHAR is stored in a static storage area of the system as a global variable of the current task; it can be read at any time during the task and is cleared when the task ends. This ensures that the separator is recalculated for each Spark task, which improves the reliability of the system.
Step 303, adding the configuration file of the target separator in the configuration file of the separator of the Spark.
In this scenario, after the separator actually used in the log data is determined, Spark may modify the current default separator to the identified, actually used separator.
Optionally, in the above embodiment of the present invention, after the step of adding the configuration file of the target delimiter, the method further includes:
and acquiring a source identifier of a data table in the data to be processed, and recording the corresponding relation between the source identifier and the configuration file.
After the configuration file is added, the source identifier of the data table in the data to be processed that corresponds to the target separator is acquired, and the correspondence between the source identifier and the configuration file is recorded, so that when the data table corresponding to that source identifier is processed again later, the correct separator is selected directly according to the correspondence.
And when more than two separators exist in the data to be processed, the separators can be quickly searched according to the corresponding relation.
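Recording the correspondence between a table's source identifier and its separator can be sketched with a simple map; the class and method names are illustrative, and a real implementation would key the map on whatever source identifier Spark exposes and point at a configuration file rather than a bare string.

```java
import java.util.HashMap;
import java.util.Map;

// Map a data table's source identifier to the separator determined for it,
// so a later task can reuse the separator without re-running detection.
class SeparatorRegistry {
    private final Map<String, String> tableToSeparator = new HashMap<>();

    void record(String sourceId, String separator) {
        tableToSeparator.put(sourceId, separator);
    }

    // Fall back to a default separator for tables not seen before.
    String lookup(String sourceId, String defaultSeparator) {
        return tableToSeparator.getOrDefault(sourceId, defaultSeparator);
    }
}
```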
Further, in the above embodiment of the present invention, the method further includes:
when data to be processed is read, acquiring an identifier of a data table included in the data to be processed;
and calling the configuration file corresponding to the identifier according to the corresponding relation.
When the data to be processed is read, the identifier of the data table included in the data to be processed is acquired first, and the corresponding configuration file is then called according to the correspondence. Still taking JAVA as an example, after the correspondence between the identifier of the data table and the configuration file has been added to Spark, the separator actually used in the log data may be stored under the name of the data table corresponding to the current log data. Thus, when Spark starts to read and identify log data, the configuration files may be loaded first; the configuration file is then called according to the name of the log data, and the separator in it is assigned to the parameter SPLIT_CHAR. Spark can then read and identify the log data according to the parameter SPLIT_CHAR.
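The loading step can be sketched in the same hypothetical properties format (the key split.char is an invented name, as the text does not specify the file format): the configuration for a table is loaded first, and the separator found there plays the role of SPLIT_CHAR when splitting incoming records.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Pattern;

// Load the separator from a table's configuration text and split a record
// with it. The "split.char" key and the "," fallback are assumptions.
class SeparatorConfigLoader {
    static String[] splitWithConfig(String configText, String record) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with StringReader
        }
        String splitChar = props.getProperty("split.char", ",");
        return record.split(Pattern.quote(splitChar), -1);
    }
}
```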
In the above embodiment of the present invention, it is detected that the default separator used by Spark cannot split the data to be processed, and a preset separator library of Spark is acquired; the separators in the preset separator library are traversed to acquire a target separator that matches the data to be processed, namely the separator actually used in the data to be processed; and the target separator is set as the default separator of Spark, so that Spark can identify and process the data to be processed, which improves the universality and the data processing efficiency of Spark.
The data processing method provided by the embodiment of the present invention is described above, and a data processing apparatus provided by the embodiment of the present invention will be described below with reference to the accompanying drawings.
As shown in fig. 4, an embodiment of the present invention further provides a data processing apparatus, which is applied to a compute engine Spark, where the apparatus includes:
the monitoring module 401 is configured to monitor that the default delimiter of the Spark application cannot be used to segment the data to be processed, and obtain a preset delimiter library of the Spark.
The data to be processed may be log data or other data.
On the one hand, a separator is used to segment the data while the data to be processed is identified or read; on the other hand, separators mark where text is separated, or mark the start of a new row or column.

When Spark identifies the data to be processed, it splits the data using the separator. Each Spark version usually has a preset default separator; if the default separator is inconsistent with the separator carried in the data to be processed, Spark cannot split the data to be processed, and thus cannot identify and process it.

Therefore, if it is detected that Spark cannot split the data to be processed with the default delimiter (for example, the data read by Spark is one continuous block in which no fields are separated), the preset delimiter library of Spark is acquired, and the delimiters in the preset delimiter library are called to split the data to be processed.
A traversal module 402, configured to traverse separators in the preset separator library, and obtain a target separator, which is matched with the to-be-processed data, in the separator; the target separator is a separator which successfully divides part of the data in the data to be processed and meets a preset check rule.
The separators in the preset separator library are traversed in a loop until the target separator is obtained; the target separator is the separator actually used in the data to be processed. Specifically, for each separator in the preset separator library, part of the data to be processed is split with that separator; if the separator splits that part of the data successfully, other data in the data to be processed is selected to continue checking the separator, and once the preset check rule is met, the separator is determined to be the target separator. Optionally, the preset check rule may be a limit on the number of checks; for example, it may include a minimum-check-count threshold, which may be obtained by means of deep learning.
A setting module 403, configured to set the target separator as a default separator of the Spark.
After the target separator is determined, it is set as a default separator of Spark, for example by adding it to the default separator library of Spark (the library containing Spark's default separators). Spark can then identify and process the current data to be processed, and can also process any later data that carries the same target separator.
Optionally, in an embodiment of the present invention, the traversing module 402 includes:
a segmentation sub-module, configured to sequentially segment, for each separator in the preset separator library, multiple groups of target data in the data to be processed that meet a preset character string length requirement, and to obtain the accumulated number of consecutive successful segmentations; each time a group of target data is successfully segmented, the accumulated number is incremented by one;
and the checking submodule is used for determining the separator as a target separator matched with the data to be processed if the accumulated times meet a preset checking rule.
Optionally, in this embodiment of the present invention, the partitioning sub-module is configured to:
for each group of target data, acquiring a preset attribute value of the target data;
and segmenting the target data with the separator, wherein the segmentation is successful if the number of segmented fields equals the preset attribute value.
Optionally, in an embodiment of the present invention, the partitioning sub-module is configured to:
randomly extracting, from the data to be processed, data that meets the preset character string length requirement as a current group of target data, and segmenting the current group of target data with the separator;
and if the current group of target data is successfully segmented, randomly extracting, from the remaining data to be processed, another piece of data that meets the preset character string length requirement as a new group of target data.
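A minimal sketch of this random extraction, assuming the raw data is a single string and that the "length requirement" means a fixed substring length (both are assumptions; the patent does not fix these details):

```python
import random

def sample_target_groups(data, group_len, n_groups, seed=None):
    """Draw `n_groups` fixed-length substrings from distinct random
    start positions to serve as groups of target data."""
    rng = random.Random(seed)
    starts = rng.sample(range(len(data) - group_len + 1), n_groups)
    return [data[s:s + group_len] for s in starts]

groups = sample_target_groups("a,b,c\n1,2,3\n7,8,9\n", 5, 2, seed=0)
print([len(g) for g in groups])  # → [5, 5]
```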
Optionally, in this embodiment of the present invention, the preset check rule includes that the number of times of successful continuous segmentation is greater than or equal to a preset value.
Optionally, in this embodiment of the present invention, the partitioning sub-module is configured to:
and assigning the separator to a preset segmentation function, and segmenting the target data through the preset segmentation function.
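Assigning the separator to a preset segmentation function can be illustrated with `functools.partial`; this is a Python sketch of the idea only, since the patent does not name the segmentation function:

```python
from functools import partial

def make_split_function(separator):
    """Bind the detected separator to a reusable segmentation function."""
    return partial(str.split, sep=separator)

split = make_split_function("|")
print(split("a|b|c"))  # → ['a', 'b', 'c']
```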
Optionally, in this embodiment of the present invention, the setting module 403 includes:
and the configuration submodule is used for adding the configuration file of the target separator in the separator configuration file of the Spark.
Optionally, in an embodiment of the present invention, the apparatus further includes:
and the recording module is used for acquiring a source identifier of a data table in the data to be processed and recording the corresponding relation between the source identifier and the configuration file.
Optionally, in an embodiment of the present invention, the apparatus further includes:
the calling module is used for acquiring an identifier of a data table included in the data to be processed when the data to be processed is read;
and calling the configuration file corresponding to the identifier according to the corresponding relation.
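The recorded correspondence between a source identifier and its configuration file could be kept in a simple mapping. The identifiers and configuration shape below are hypothetical, since the patent does not specify the configuration file format:

```python
# Hypothetical persisted mapping from a data table's source
# identifier to the separator configuration recorded for it.
CONFIG_STORE = {
    "orders_db.sales": {"separator": "|"},
    "logs.clickstream": {"separator": "\t"},
}

def separator_for_source(source_id, default=","):
    """Return the recorded separator for a source identifier, falling
    back to the default separator when none was recorded."""
    return CONFIG_STORE.get(source_id, {}).get("separator", default)

print(separator_for_source("orders_db.sales"))  # → |
```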
In the above embodiment of the present invention, the monitoring module 401 monitors and learns that the default separator applied by Spark cannot divide the data to be processed, and obtains the preset separator library of Spark; the traversal module 402 traverses the separators in the preset separator library and obtains the target separator actually used in the data to be processed; the setting module 403 sets the target separator as a default separator of Spark, so that Spark can identify and process the data to be processed, thereby improving both the universality of Spark and data processing efficiency.
In another aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, a bus, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the data processing method are implemented.
For example, when the electronic device is a server, fig. 5 illustrates the physical structure of the server.
As shown in fig. 5, the server may include: a Processor (Processor)510, a communication Interface (Communications Interface)520, a Memory (Memory)530 and a communication bus 540, wherein the Processor 510, the communication Interface 520 and the Memory 530 communicate with each other via the communication bus 540. Processor 510 may call logic instructions in memory 530 to perform the following method:
monitoring and obtaining that the default separators applied by the Spark cannot be used for dividing the data to be processed, and acquiring a preset separator library of the Spark;
traversing separators in the preset separator library, and acquiring target separators which are matched with the data to be processed in the separators; the target separator is a separator which successfully divides part of data in the data to be processed and meets a preset check rule;
setting the target separator as a default separator of the Spark.
In addition, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, may be embodied as a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps in the data processing method, and details of the embodiments of the present invention are not repeated herein.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A data processing method applied to a computing engine Spark, the method comprising:
monitoring to acquire that the default separator applied by the Spark cannot divide the data to be processed, and acquiring a preset separator library of the Spark;
traversing separators in the preset separator library, and acquiring target separators which are matched with the data to be processed in the separators; the target separator is a separator which successfully divides part of data in the data to be processed and meets a preset check rule;
setting the target separator as a default separator of the Spark;
the step of traversing the separators in the preset separator library to obtain the target separators in the separators, which are matched with the data to be processed, includes:
sequentially segmenting multiple groups of target data meeting the requirement of the length of a preset character string in the data to be processed aiming at each delimiter in the preset delimiter library, and obtaining the accumulated times of successful continuous segmentation; if a group of target data is successfully divided, adding one to the accumulated times;
and if the accumulated times meet a preset check rule, determining the separator as a target separator matched with the data to be processed.
2. The method according to claim 1, wherein the step of sequentially segmenting the plurality of sets of target data satisfying the requirement of the preset string length in the data to be processed comprises:
aiming at each group of target data, acquiring a preset attribute numerical value of the target data;
and segmenting the target data by using the separators, wherein if the number of the segmented fields is the same as the preset attribute value, the segmentation is successful.
3. The method according to claim 1, wherein the step of sequentially segmenting the plurality of sets of target data satisfying the requirement of the preset string length in the data to be processed comprises:
randomly extracting data meeting the length requirement of a preset character string from the data to be processed to serve as a current group of target data, and segmenting the current group of target data by using the separator;
and if the current group of target data is successfully segmented, randomly extracting data meeting the length requirement of the preset character string from the non-current group of target data of the data to be processed again to serve as a new group of target data.
4. The method of claim 2, wherein the step of segmenting the target data using the delimiters comprises:
and assigning the separator to a preset segmentation function, and segmenting the target data through the preset segmentation function.
5. The method according to any of claims 1 to 4, wherein the step of setting the target separator as a default separator of the Spark comprises:
adding the configuration file of the target separator in the separator configuration file of the Spark;
and acquiring a source identifier of a data table in the data to be processed, and recording the corresponding relation between the source identifier and the configuration file.
6. The method of claim 5, further comprising:
when data to be processed is read, a source identifier of a data table included in the data to be processed is obtained;
and calling the configuration file corresponding to the source identifier according to the corresponding relation.
7. A data processing apparatus applied to a compute engine Spark, the apparatus comprising:
the monitoring module is used for monitoring and knowing that the default separator applied by the Spark cannot be used for dividing the data to be processed, and acquiring a preset separator library of the Spark;
the traversal module is used for traversing the separators in the preset separator library and acquiring target separators which are matched with the data to be processed in the separators; the target separator is a separator which successfully divides part of the data in the data to be processed and meets a preset check rule;
a setting module, configured to set the target separator as a default separator of the Spark;
the traversal module is further configured to:
sequentially segmenting multiple groups of target data meeting the requirement of the length of a preset character string in the data to be processed aiming at each separator in the preset separator library, and obtaining the cumulative times of successful continuous segmentation; if a group of target data is successfully divided, adding one to the accumulated times;
and if the accumulated times meet a preset check rule, determining the separator as a target separator matched with the data to be processed.
8. An electronic device, comprising a memory, a processor, a bus and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the data processing method according to any one of claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that: the program, when executed by a processor, implements the steps in the data processing method of any one of claims 1 to 6.
CN201910190563.4A 2019-03-13 2019-03-13 Data processing method and device Active CN109947429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910190563.4A CN109947429B (en) 2019-03-13 2019-03-13 Data processing method and device


Publications (2)

Publication Number Publication Date
CN109947429A CN109947429A (en) 2019-06-28
CN109947429B true CN109947429B (en) 2022-07-26


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131296A (en) * 2020-09-27 2020-12-25 北京锐安科技有限公司 Data exploration method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622088A (en) * 2017-08-17 2018-01-23 郑州云海信息技术有限公司 It is a kind of that method of more characters as separator is supported based on Hive
CN107798035A (en) * 2017-04-10 2018-03-13 平安科技(深圳)有限公司 A kind of data processing method and terminal
CN108235069A (en) * 2016-12-22 2018-06-29 北京国双科技有限公司 The processing method and processing device of Web TV daily record
CN108804697A (en) * 2018-06-15 2018-11-13 中国平安人寿保险股份有限公司 Method of data synchronization, device, computer equipment based on Spark and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10599460B2 (en) * 2017-08-07 2020-03-24 Modelop, Inc. Analytic model execution engine with instrumentation for granular performance analysis for metrics and diagnostics for troubleshooting




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant