CN114258541A - Data merging method and device, electronic equipment and storage medium - Google Patents

Data merging method and device, electronic equipment and storage medium

Info

Publication number
CN114258541A
Authority
CN
China
Prior art keywords
data
data source
merging
file
identity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980099361.7A
Other languages
Chinese (zh)
Inventor
王少丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Opper Communication Co ltd
Original Assignee
Beijing Opper Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Opper Communication Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles

Abstract

The application discloses a data merging method and device, electronic equipment, and a storage medium, and relates to the technical field of data processing. The method comprises: acquiring a configuration file that includes the storage path and the identity of each data source, where the data sources that need to be merged into one data file are configured with the same identity; acquiring each data source according to its storage path in the configuration file; and merging the data sources with the same identity into one data file, thereby making the data merging process more convenient.

Description

Data merging method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data merging method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of internet technology, the data generated in a network platform is increasingly fragmented and diverse. To better support scenarios such as data analysis and search recommendation, associated data generated in the network platform needs to be merged. Common data merging approaches involve complex data processing and are difficult to carry out.
Disclosure of Invention
In view of the foregoing problems, the present application provides a data merging method, apparatus, electronic device, and storage medium to mitigate them.
In a first aspect, an embodiment of the present application provides a data merging method, where the method includes: acquiring a configuration file, wherein the configuration file comprises a storage path of a data source and an identity of the data source, and the data sources needing to be combined into one data file are configured with the same identity; acquiring each data source according to the storage path of the data source in the configuration file; and merging the data sources with the same identity into a data file.
In a second aspect, an embodiment of the present application provides a data merging apparatus, where the apparatus includes: the file acquisition module is used for acquiring a configuration file, wherein the configuration file comprises a storage path of a data source and an identity identifier of the data source, and the data sources which need to be combined into one data file are configured with the same identity identifier; the data source acquisition module is used for acquiring each data source according to the storage path of the data source in the configuration file; and the merging module is used for merging the data sources with the same identity into a data file.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more programs. Wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a program code is stored, and the program code can be called by a processor to execute the above method.
According to the data merging method and device, the electronic device, and the storage medium, the storage path and the identity of each data source to be merged are configured in the configuration file. After the configuration file is obtained, each data source can be obtained according to its storage path in the configuration file, and the data sources with the same identity are merged into one data file. No information needs to be extracted from the data sources to determine the basis for merging, which makes the data merging process more convenient.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flowchart of a data merging method according to an embodiment of the present application.
Fig. 2 shows a flowchart of a data merging method according to another embodiment of the present application.
Fig. 3 shows a schematic diagram of a configuration file provided in an embodiment of the present application.
Fig. 4 is a functional block diagram of a data merging apparatus provided in an embodiment of the present application.
Fig. 5 shows a block diagram of an electronic device provided in an embodiment of the present application.
Fig. 6 shows a storage unit provided in an embodiment of the present application, configured to store or carry program code for implementing a data merging method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In a network platform, there are data documents that are associated with each other. Data documents are associated with each other when they share a commonality, are generated for the same object, and need to be analyzed together when that object is analyzed; equivalently, they are data documents that need to be merged into one document. A data document may be any form of data generated on the internet, such as a table, plain text, text combined with images, or code; the forms are not enumerated here.
Many associated data documents are generated on different platforms and are independent of each other. To analyze them comprehensively, they need to be merged, which makes the analysis more convenient. For example, the content of an article is produced by a self-media publisher, comments on the article are produced by users, and the number of clicks, the identities of the users who clicked, the number of likes, and the identities of the users who liked the article are recorded by a software platform. The article, its comments, and its click and like records are thus distributed across different platforms and are independent of each other.
In some embodiments, the basis for merging is obtained from the data documents themselves, which is cumbersome and not sufficiently accurate. For example, fields must be extracted from each data document to be merged, and documents having the same fields are merged; or the similarity between documents is calculated, and data documents whose similarity exceeds a certain value are merged.
Therefore, the inventor proposes the data merging method, data merging device, electronic device, and storage medium of the embodiments of the present application, in which the basis for merging the data sources is determined from a configuration file. Compared with obtaining that basis from the data sources themselves, this makes processing more convenient and merging faster. A data source to be merged here is a data document as described above. The method, device, electronic device, and storage medium are described in detail below through specific embodiments.
Referring to fig. 1, a data merging method provided in an embodiment of the present application is shown. Specifically, the method comprises the following steps:
Step S110: acquire a configuration file, where the configuration file includes a storage path of each data source and an identity of each data source, and the data sources that need to be merged into one data file are configured with the same identity.
In the embodiment of the present application, a configuration file may be configured according to each data source that needs to be merged, so as to determine a merging policy through the configuration file to merge the data sources.
Among the data sources, those that are associated with each other are merged into one data file. In the configuration file, the associated data sources may be labeled, so that the configuration file determines which data sources need to be merged into the same data file.
In the embodiment of the application, the associated data sources may be labeled by identity: the same identity is configured for data sources that need to be merged into the same data file, different identities are configured for data sources that are not merged into the same data file, and the identities of the data sources are stored in the configuration file. For example, the data sources to be merged include a, b, c, d, and e, where a, b, and c need to be merged into one data file, and d and e need to be merged into another. Then a, b, and c are configured with the same identity A in the configuration file, d and e are configured with the same identity B, and identity A is different from identity B.
In addition, each data source has a corresponding storage location, and the configuration file may include the storage path of each data source, so that each data source can be found according to its storage path. For example, if the data sources to be merged include an article, the comments on the article, and the likes of the article, the storage path of the article may point to a server of the self-media side, the storage path of the comments may point to the server that holds user comments, and the storage path of the likes may point to a server of the software platform on which the article is published.
Step S120: acquire each data source according to the storage path of the data source in the configuration file.
Step S130: merge the data sources with the same identity into one data file.
Each data source is acquired according to its storage path in the acquired configuration file, and the data sources are then merged, with the data sources that have the same identity merged into the same data file. For example, in the foregoing example, data sources a, b, and c have the same identity, so a, b, and c are merged into one data file; d and e have the same identity, so d and e are merged into another data file.
In the embodiment of the application, a storage path and an identity of a data source to be merged are configured in a configuration file. After the configuration file is obtained, each data source can be obtained according to the storage path in the configuration file, and then which data sources need to be merged into one data file is determined according to the identity of each data source in the configuration file. The data sources with the same identity identification are combined into a data file, information does not need to be extracted from the data sources to determine a combination basis, and convenience and speed of a data combination process are improved.
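Steps S110 to S130 can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the configuration layout, the key names `name`, `path`, and `identity`, the example paths, and the in-memory `STORAGE` dictionary that stands in for real storage are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical configuration: one entry per data source, holding its
# storage path and its identity (key names are assumptions).
CONFIG = [
    {"name": "a", "path": "/srv/media/a", "identity": "A"},
    {"name": "b", "path": "/srv/comments/b", "identity": "A"},
    {"name": "c", "path": "/srv/likes/c", "identity": "A"},
    {"name": "d", "path": "/srv/media/d", "identity": "B"},
    {"name": "e", "path": "/srv/comments/e", "identity": "B"},
]

# Stand-in for real storage: maps each storage path to the data source content.
STORAGE = {entry["path"]: {"source": entry["name"]} for entry in CONFIG}


def merge_by_identity(config, load):
    """Fetch every data source by its path and group the results by identity."""
    groups = defaultdict(list)
    for entry in config:
        content = load(entry["path"])                # step S120
        groups[entry["identity"]].append(content)    # step S130
    return dict(groups)


merged = merge_by_identity(CONFIG, STORAGE.__getitem__)
```

Here each value of `merged` plays the role of one merged data file. Note that the grouping key comes only from the configuration; the content of the data sources is never inspected to decide what belongs together.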
The present application also provides another embodiment. In this embodiment, each data source is first structured, and the configuration file is set according to the structured data sources. During merging, the parts of each data source that take part in the merge are selected flexibly according to the structure of the data source. Specifically, referring to fig. 2, the method includes:
Step S210: structure each data source so that it comprises a plurality of fields, and set a configuration file according to each data source.
Step S220: acquire the configuration file.
In this embodiment of the present application, before acquiring the configuration file to merge the data sources, a structuring process may be performed on the data sources to define a data structure of each data source.
The data source may be structured in such a manner that the data source is divided into a plurality of fields, and each field is a part of the data source. For example, a data source is an article, and a plurality of fields divided by the article can be, respectively, a title, an author name, a summary and a text content; the data source is the comment of the article, and the divided fields may be the title, the comment content, the reviewer and the comment time of the article to be commented. Of course, the data source may also be divided into only one field, i.e. the entire data source is used as one field.
The embodiments of the present application do not limit how the fields of a data source are divided. In some embodiments, the user divides the data source into fields and uploads the result to the executing device. In other embodiments, division rules are preset for the various types of data sources, and each data source is divided according to the rule for its type. The two approaches may also be combined: a data source whose type has a division rule is divided according to that rule, and a data source whose type has no division rule is submitted to the user, who either divides it manually or specifies a division rule.
In the configuration file, description information is configured for each field of a data source as the field description information of that field. For example, as shown in fig. 3, the name of each data source in the configuration file corresponds to the field description information of each field of that data source.
Specifically, when the configuration file is set, each field in each data source can be determined, and field description information of each field can be acquired; and configuring field description information of each field in each data source in a configuration file.
Each field description information includes basic information about the corresponding field. For example, data source 2 in fig. 3 includes 3 fields: field 1, field 2, and field 3. The field description information of field 1 is I11, I12, and I13; that of field 2 is I21, I22, I23, and I24; and that of field 3 is I31, I32, and I33. The embodiment of the present application does not limit which information about a field the field description information contains; it may include, for example, the field name and the data type of the field. For instance, I11 in fig. 3 may represent the field name of field 1, and I12 may represent the data type of field 1. The field description information of a field may be extracted from the data source, determined according to a field division rule, or determined by a user, which is likewise not limited in the embodiment of the present application.
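One way to picture the per-field description information of fig. 3 is as a small record type. The sketch below is an assumption for illustration only: the class names `FieldDescription` and `DataSourceConfig`, their attributes, and the sample article fields are not taken from the application, which leaves the content of the field description information open.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldDescription:
    """Description information for one field (hypothetical layout)."""
    name: str        # field name, e.g. the role I11 might play in fig. 3
    data_type: str   # data type of the field, e.g. the role I12 might play


@dataclass
class DataSourceConfig:
    """Configuration entry: a data source name and its field descriptions."""
    name: str
    fields: List[FieldDescription] = field(default_factory=list)


# An article structured into four fields, as in the example given earlier.
article = DataSourceConfig(
    name="article",
    fields=[
        FieldDescription("title", "string"),
        FieldDescription("author", "string"),
        FieldDescription("summary", "string"),
        FieldDescription("body", "string"),
    ],
)

field_names = [f.name for f in article.fields]
```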
In addition, in the configuration file, an identity may be configured for each data source, and the data sources that need to be merged into one data file are configured with the same identity. Therefore, after the configuration file is obtained, the identity of each data source can be obtained from the configuration file.
Optionally, in this embodiment of the present application, the identity of each data source may be assigned by a user and then configured in the configuration file. Alternatively, identities may be assigned to the data sources according to a preset identification rule, for example a rule that assigns the same identity to data sources that are associated with each other and different identities to data sources that are not.
Optionally, in this embodiment of the present application, data sources that are associated with each other are generated with the same identity. The identity can then be obtained from each data source and configured in the configuration file for that data source. For example, when a self-media publisher uploads an article to a server, an identity is generated for the article. When a user comment is generated for the article, the comment is given the same identity as the article. The identity of the article can therefore be obtained from the article and configured in the configuration file for the article, and the identity of the user comment can be obtained from the comment and configured in the configuration file for the comment.
In one embodiment, the identity may be configured in a configuration file as a field of the data source, or the identity of the data source may be configured in field description information of one of the fields of the data source.
In this embodiment, when the identity of the data source is obtained, the identity may be obtained from field description information configured with the identity.
Optionally, in this embodiment, the field description information of each field may include identity indication information, which indicates whether that field description information contains the identity. The identity of the data source is configured in the field description information whose identity indication information indicates that the identity is included. When the identity of a data source is obtained from the configuration file, the identity indication information of each field description information of that data source is checked to determine whether that field description information includes the identity; once one is found that does, the identity is read from it. For example, suppose that in the configuration file data source a includes field 1, field 2, and field 3. To obtain the identity of data source a, the identity indication information in the field description information of field 1 is read first. If it is "yes", the identity of data source a is read from field 1; if it is "no", field 1 holds no identity, and the identity indication information in the field description information of field 2 is read next.
The identity indication information can be represented more simply than the identity itself. Checking the concise indication information first therefore determines whether the more complex identity needs to be read from a given field description information at all.
In addition, optionally, if the field description information does not include the identity indication information, or the identity indication information is null, the identity may be read from the default field description information.
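The lookup described above, including the fallback to a default field description, might look like the following sketch. The dictionary layout and the key names `has_identity` and `identity` are assumptions for illustration; the application does not prescribe how the indication information is encoded.

```python
def read_identity(source_config, default_field):
    """Return the identity of a data source from its field descriptions.

    Each field description may carry a boolean 'has_identity' flag
    (the identity indication information). Fields flagged False are
    skipped without reading anything else from them. If a flag is
    missing or None, the identity is read from the default field.
    """
    saw_missing_flag = False
    for description in source_config["fields"].values():
        flag = description.get("has_identity")
        if flag is None:
            saw_missing_flag = True
        elif flag:
            return description["identity"]
    if saw_missing_flag:
        return source_config["fields"][default_field].get("identity")
    return None


# Data source a: field 1 says "no", field 2 says "yes" and holds the identity.
source_a = {
    "fields": {
        "field1": {"has_identity": False},
        "field2": {"has_identity": True, "identity": "A"},
        "field3": {"has_identity": False},
    }
}

# Data source d: no indication information at all, so the default field is used.
source_d = {"fields": {"field1": {"identity": "B"}, "field2": {}}}

identity_a = read_identity(source_a, default_field="field1")
identity_d = read_identity(source_d, default_field="field1")
```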
In another embodiment, the id may be stored as a separate parameter corresponding to the name of the data source. And when the identity of the data source is acquired, reading the parameter corresponding to the name of the data source.
The configuration file also stores the storage path of each data source, so that after the configuration file is obtained, the storage path of each data source can be obtained from it. The storage path may be stored as a separate parameter corresponding to the name of the data source, or it may be stored in the field description information of one of the fields.
Optionally, in this embodiment of the application, the device that executes step S210 may be different from the device or devices that execute steps S220 to S240. For example, if steps S220 to S240 are executed by a single device, step S210 is executed by a different device; if steps S220 to S240 are executed by a system, such as a Hadoop cluster or another cluster, step S210 is executed by a device outside that system.
Of course, the execution device in step S210 and the execution devices in steps S220 to S240 may also be the same execution device, or devices in the same system or cluster.
The configuration file is acquired from the device that set it. For example, if the configuration file is set by an electronic device and the data merging is performed by a Hadoop cluster, the electronic device can submit the configuration file to the Hadoop cluster through the MR task submission interface, so that the Hadoop cluster acquires the configuration file from the electronic device that set it.
Step S230: acquire each data source according to the storage path of the data source in the configuration file.
Step S240: merge the data sources with the same identity into one data file, and delete the fields that do not participate in merging from each data source.
Each data source is obtained according to its storage path in the configuration file and parsed according to the data structure configured in the configuration file, so that the data sources with the same identity can be merged into one data file.
In some embodiments, to make the merging easier to process, the data sources are sorted after they are acquired according to the identity of each data source in the configuration file, so that data sources with the same identity are adjacent after sorting. The adjacent data sources with the same identity are then merged into one data file. For example, if the identities are English letters, the data sources are sorted in letter order, which places data sources with the same letter next to each other, and the adjacent data sources with the same letter are merged into one data file.
In this embodiment, the adjacent data sources with the same identity may be merged as follows. The first data source after sorting is taken as the starting data source, and the data sources are traversed in order. When the traversal reaches a data source whose identity differs from that of the previous data source, all data sources from the starting data source through the previous data source are merged into one data file. The data source currently reached then becomes the new starting data source, and the traversal continues in the same way until all adjacent data sources with the same identity have been merged into data files. When the traversal reaches the last data source, there is no next data source, so the last data source is merged into one data file together with the data sources before it that have not yet been merged.
Taking the foregoing example, the data sources are a, b, c, d, and e. Data sources a, b, and c are configured with the same identity A in the configuration file, and data sources d and e are configured with the same identity B. After sorting by identity, a, b, and c are adjacent and d and e are adjacent, for example in the order a, b, c, d, e. During the traversal from data source a to data source c, the identities are all A. When the traversal reaches data source d, its identity differs from that of the previous data source c, so data sources a, b, and c are merged into one data file. Data source d becomes the new starting data source; when the traversal reaches data source e, its identity is the same as that of d, and because e is the last data source, d and e are merged into one data file.
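The sort-then-traverse procedure can be sketched as below. This is an illustrative sketch under assumptions: each data source is represented as an `(identity, content)` pair, a merged data file is represented as a list of contents, and the function name `merge_adjacent` is hypothetical.

```python
def merge_adjacent(sources):
    """Sort (identity, content) pairs by identity, then merge adjacent runs.

    Returns one list of contents per identity, in sorted identity order.
    """
    ordered = sorted(sources, key=lambda pair: pair[0])
    merged_files = []
    start = 0  # index of the current starting data source
    for i in range(1, len(ordered) + 1):
        reached_end = i == len(ordered)
        # A run ends at the last data source, or when the identity changes.
        if reached_end or ordered[i][0] != ordered[i - 1][0]:
            merged_files.append([content for _, content in ordered[start:i]])
            start = i  # the data source just reached starts the next run
    return merged_files


sources = [("B", "d"), ("A", "a"), ("A", "b"), ("B", "e"), ("A", "c")]
data_files = merge_adjacent(sources)
```

Because Python's `sorted` is stable, data sources with the same identity keep their original relative order. The standard library's `itertools.groupby` applied to the sorted list would give the same grouping with less bookkeeping; the explicit loop is kept here because it mirrors the traversal described in the text.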
In some implementations, multiple groups of data sources may be merged in parallel. Specifically, several data sources with mutually different identities are selected. For each selected data source, the data sources that have not yet been selected or merged are searched for those with the same identity, and all data sources found are merged with the selected data source. The number of data sources selected at the same time may be determined by the number of parallel processing channels; for example, with 5 parallel processing channels, 5 data sources may be selected at once for searching and merging.
Optionally, in this embodiment, after the sorting is performed according to the sorting method of the foregoing embodiment, the data sources may be searched and merged.
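A rough sketch of the parallel variant is shown below, using Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in for the parallel processing channels. The thread pool, the `(identity, content)` representation, and the function names are assumptions for illustration; the application does not say how the channels are realized (on a Hadoop cluster they would more likely be separate tasks).

```python
from concurrent.futures import ThreadPoolExecutor


def merge_group(identity, sources):
    """Collect every data source that carries the given identity."""
    return identity, [content for ident, content in sources if ident == identity]


def merge_in_parallel(sources, channels=5):
    """Merge each identity's data sources on its own worker.

    'channels' caps how many identities are processed at the same time.
    """
    identities = sorted({ident for ident, _ in sources})
    with ThreadPoolExecutor(max_workers=channels) as pool:
        results = pool.map(lambda ident: merge_group(ident, sources), identities)
        return dict(results)


sources = [("A", "a"), ("B", "d"), ("A", "b"), ("B", "e"), ("A", "c")]
parallel_files = merge_in_parallel(sources, channels=2)
```

Each worker here scans the full list for its identity. If the list has been sorted first, a worker could instead stop scanning at the first data source with a different identity.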
In the embodiment of the present application, the merged data file generally has a corresponding usage scenario; for example, it may provide data support for scenarios such as data analysis and search recommendation. Not all of the content of a data source is necessarily useful in that scenario, so the merged data file may omit the unnecessary content, which makes the data file more compact and reduces the storage space it occupies. For example, in a usage scenario that counts users' attitudes toward an article, content such as the reviewer and the comment time is of no use, and this part may be deleted.
Because the data source is divided into fields, in the embodiment of the present application a field whose content does not need to be merged is set as a field that does not participate in merging, and the unnecessary part of a data source is removed by deleting the fields that do not participate in merging from each data source.
Specifically, the configuration file may include merging indication information as to whether each field in the data source participates in merging. For example, optionally, merge indication information on whether the field participates in merging may be included in the field description information of each field; or optionally, in the configuration file, a parameter may be specifically set for each data source as merging indication information, indicating which fields in the data source do not participate in merging.
In a data file formed by combining data sources with the same identity, fields which do not participate in combination are deleted.
In an embodiment, after the data sources with the same identity are merged into one data file, the fields not participating in the merging may be deleted from the merged data file according to the merging indication information in the configuration file.
In another embodiment, to reduce the amount of data processed during merging and to make the deletion easier, the fields that do not participate in merging are first deleted from each data source according to the merging indication information in the configuration file, and the data sources with the same identity are then merged into one data file.
In this embodiment of the present application, when only one data source is configured in the configuration file, that is, when the configuration file includes configuration information such as the identity, the storage path, and the field description information of each field for only one data source, the merging indication information can be used to filter the content of that data source: after the fields that do not participate in merging are deleted from the data source, what remains is the content of interest.
Specifically, in the embodiment of the present application, before merging the data sources, it may be determined whether the number of the data sources is equal to 1, and specifically, whether there is only configuration information of one data source in the configuration file may be determined. If the number of the data sources is equal to 1, no other data source is merged, and the data sources are used as merged data files after deleting the fields which do not participate in merging from the data sources according to merging indication information in the configuration files, so that the content in the data files is screened. If the number of the data sources is greater than 1, the data sources with the same identity identifier need to be merged into one data file, and the merging operation in the embodiment of the application is executed.
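The delete-then-merge order, together with the single-data-source case, can be sketched as follows. Everything about the data layout is an assumption for illustration: each data source is a dictionary of field values, the merging indication information is a dictionary of booleans per field (with fields participating by default when no flag is given), and the merged data file is a dictionary keyed by data source name.

```python
def drop_unmerged_fields(record, merge_flags):
    """Keep only the fields whose merge flag is not False.

    Assumption: a field with no flag participates in merging.
    """
    return {name: value for name, value in record.items()
            if merge_flags.get(name, True)}


def build_data_file(sources):
    """Build one data file from data sources that share an identity.

    'sources' maps a data source name to (record, merge_flags).
    """
    filtered = {name: drop_unmerged_fields(record, flags)
                for name, (record, flags) in sources.items()}
    if len(filtered) == 1:
        # Only one data source: nothing to merge, the filtered
        # data source itself is the resulting data file.
        return next(iter(filtered.values()))
    return filtered


article = ({"title": "T", "author": "X", "body": "text"}, {"author": False})
comment = (
    {"article_title": "T", "content": "Nice", "reviewer": "u1", "time": "t0"},
    {"reviewer": False, "time": False},
)

merged_file = build_data_file({"article": article, "comment": comment})
filtered_only = build_data_file({"comment": comment})
```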
Additionally, some data sources contain fields of array type. A field of array type holds several parallel contents for the same concept. In the configuration file, the field description information of such a field may configure the data type as an array and may specify which of the parallel contents participate in merging; contents that are not specified do not participate in merging. The specified contents may be determined by a preset rule for array types: for example, the rule may specify that only the first content in a field of array type participates in merging; or, where the field of array type includes a pointer, that the content the pointer points to participates in merging; or that all contents in the field participate in merging. Alternatively, the user may configure in the field description information which contents of the field of array type participate in merging.
Contents of a field of array type that do not participate in merging do not appear in the final merged data file. Specifically, such contents may be deleted from the field of array type before the data source is merged with the other data sources having the same identity; or they may be deleted from the field of array type in the data file after the data sources with the same identity have been merged into it.
For example, in an article data source, three article tags for current events, sports, and footwear are included. If the label in the article is used as a field, the field simultaneously has three parallel contents aiming at the concept of the label of the article, such as current affairs, sports and shoes. If the sports and footwear are the real tags of the article, the field description information of the field can specify that the sports and footwear participate in the merge, so that the merged data file does not include the tags of current events.
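The selection of participating contents in an array-type field might look like the sketch below. The rule names `"first"` and `"all"` and the `keep` parameter are hypothetical stand-ins for the preset rule and the user-configured field description information; the pointer-based rule is omitted because the application does not describe its representation.

```python
from typing import List, Optional, Sequence


def select_array_contents(
    values: Sequence[str], rule: str = "all", keep: Optional[Sequence[str]] = None
) -> List[str]:
    """Return the contents of an array-type field that participate in merging."""
    # An explicit user-configured list takes precedence over the preset rule.
    if keep is not None:
        wanted = set(keep)
        return [value for value in values if value in wanted]
    if rule == "first":
        return list(values[:1])
    if rule == "all":
        return list(values)
    raise ValueError(f"unknown array rule: {rule!r}")


tags = ["current events", "sports", "footwear"]
real_tags = select_array_contents(tags, keep=["sports", "footwear"])
```

With the article example above, `real_tags` holds only the sports and footwear tags, so the current events tag would not reach the merged data file.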
In the embodiment of the application, after the data sources are merged to obtain the data file, the data file can be output. Specifically, the location to which the merged file is to be output may be configured in the configuration file as a designated location, and the data file is output to the designated location after merging. Of course, the output location need not be specified in the configuration file; it may instead be acquired at the same time as the configuration file. For example, when the generator of the configuration file submits the configuration file to the acquirer of the configuration file, the location of the output file is specified.
In one embodiment, all of the merged data files may be output to a specified location.
Optionally, in this embodiment, each data file may be output to the same designated location as an independent file.
Optionally, in this embodiment, all the data files may be combined into one file and output to the designated location.
In another embodiment, the configuration file may include the number of parts into which the output is to be divided, the number being a preset number of parts. After all the data sources are merged to obtain the data files, all the merged data files can be split into the preset number of parts for output. For example, if 100 data files are obtained after merging and the preset number of parts is 5, the data files can be split into 5 parts of 20 data files each. Of course, the number of data files in each part is not limited. The number of parts to be output need not be specified in the configuration file; it may instead be acquired at the same time as the configuration file. For example, when the generator of the configuration file submits the configuration file to the acquirer of the configuration file, the number of parts of the output file is specified.
Optionally, all the data files may be merged into one file, and then the one file is split into a preset number of parts.
Optionally, each data file may be used as an independent file, and all the data files are divided into a preset number of parts.
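One possible way to divide the merged data files into a preset number of parts is sketched below. The even distribution is an assumption made for illustration; the application states that the number of data files in each part is not limited. The function name `split_into_parts` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_parts(data_files: Sequence[T], parts: int) -> List[List[T]]:
    """Split the data files into `parts` contiguous groups of near-equal size."""
    if parts <= 0:
        raise ValueError("the preset number of parts must be positive")
    base, extra = divmod(len(data_files), parts)
    result: List[List[T]] = []
    start = 0
    for index in range(parts):
        # The first `extra` parts receive one additional data file.
        size = base + (1 if index < extra else 0)
        result.append(list(data_files[start:start + size]))
        start += size
    return result
```

For 100 data files and 5 parts this yields 5 parts of 20 data files each, matching the example in the description.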
In this embodiment, the location of each data file output may be included in the configuration file. Then each data file may be output to a location specified in the configuration file when the data file is output.
In another embodiment, the designated location may be configured for each data file, and the configuration file includes the location of each data file output. For example, in the configuration file, a designated position is configured for one of the data sources corresponding to the same identity, and after the data sources having the same identity are combined into one data file, the data file is output to the designated position corresponding to the configuration of one of the data sources, so that each data file is output to the designated position in the configuration file.
In the embodiment of the application, the data sources can be processed and merged in a Hadoop cluster through the Map-Reduce computing model. Specifically, the configuration file can be acquired through a task acquisition interface of the Map-Reduce task. A Map program in the Map-Reduce task acquires each data source according to the storage path of the data source in the configuration file, performs processing such as parsing and sorting on the data sources, and sends the processed data sources to the Linux standard output data stream. The Reduce program in the Map-Reduce task reads data from the data stream, merges the data sources with the same identity into one data file, and outputs the data file to the output location specified in the configuration file.
Optionally, in this embodiment of the application, the configuration file may further specify a number of retries for data exceptions, or the number of retries may be specified when the generator of the configuration file submits the configuration file to the acquirer of the configuration file. During data processing, if a data exception occurs, the Hadoop cluster can retry multiple times; when the number of retries reaches the specified number, the abnormal device in the cluster is located and another device is switched in to replace it for data processing.
According to the data merging method provided by the embodiment of the application, field description information is added in the configuration file for each field of the structured data source, so that the fields that do not participate in merging can be determined according to the field description information. Fields that do not participate in merging can be deleted from the data file formed by merging data sources with the same identity, so that the merging processing is simpler, the merging efficiency is higher, and the merged data file is more targeted.
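The Map and Reduce stages described above can be approximated locally as two line-oriented functions. This is a sketch under several assumptions that are not stated in the application: each data source is one JSON object per line, the identity is carried in a hypothetical `id` field, records travel between the stages as tab-separated `identity<TAB>json` text lines, and the framework's sort between the stages is imitated with `sorted`.

```python
import json
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(raw_lines: Iterable[str], id_field: str = "id") -> Iterator[str]:
    """Map stage: parse each data source and emit 'identity<TAB>json' lines."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank input lines
        record = json.loads(line)
        yield f"{record[id_field]}\t{json.dumps(record, sort_keys=True)}"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reduce stage: merge adjacent lines that share the same identity."""
    def identity_of(line: str) -> str:
        return line.split("\t", 1)[0]

    for _, group in groupby(sorted_lines, key=identity_of):
        merged = {}
        for line in group:
            # Assumed merge semantics: union of the fields of all sources in the group.
            merged.update(json.loads(line.split("\t", 1)[1]))
        yield json.dumps(merged, sort_keys=True)


raw = ['{"id": "a", "title": "t"}', '{"id": "b", "title": "u"}', "", '{"id": "a", "body": "x"}']
merged_files = list(reduce_lines(sorted(map_lines(raw))))
```

In an actual cluster the two functions would read from standard input and write to standard output, and the sorting between them would be performed by the framework rather than by `sorted`.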
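The retry-then-switch behavior can be illustrated with the following sketch. It does not use any Hadoop API; `run_with_failover`, the `task` callable, and the list of worker names are all hypothetical, and the sketch simply treats any raised exception as a data exception.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_failover(task: Callable[[str], T], workers: Sequence[str], max_retries: int) -> T:
    """Run `task` on a worker, retrying up to `max_retries` times before switching workers."""
    if not workers:
        raise ValueError("at least one worker is required")
    last_error: Optional[Exception] = None
    for worker in workers:
        # One initial attempt plus the configured number of retries on this worker.
        for _ in range(1 + max_retries):
            try:
                return task(worker)
            except Exception as exc:  # treated as a data exception in this sketch
                last_error = exc
        # Retry budget exhausted: regard this worker as abnormal and try the next one.
    raise RuntimeError("all workers failed") from last_error
```

The number of retries would come from the configuration file or from the submitter of the configuration file, as described above.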
In addition, the data merging method can support a user-defined document merging mode through configuration of the configuration file, is realized in a software mode, and can be used in any Linux environment providing Map-Reduce computing power.
As shown in fig. 4, an embodiment of the present application further provides a data merging apparatus 300. The data merging apparatus 300 includes: the file obtaining module 310 is configured to obtain a configuration file, where the configuration file includes a storage path of a data source and an identity of the data source, and a data source that needs to be merged into one data file is configured with the same identity. The data source obtaining module 320 is configured to obtain each data source according to a storage path of the data source in the configuration file. And the merging module 330 is configured to merge data sources with the same identity into one data file.
Optionally, the merging module 330 may include a sorting unit, configured to sort the data sources according to the identifiers, so that the data sources with the same identifier are adjacent to each other; and the merging unit is used for merging the adjacent data sources with the same identity into one data file.
Optionally, the merging unit may take the first data source after sorting as the initial data source and traverse the data sources in sequence. When a data source whose identity differs from that of the previous data source is traversed, all data sources from the initial data source to the previous data source are merged into one data file, and the currently traversed data source is taken as a new initial data source; the sequential traversal then continues in the same manner. Optionally, when the last data source is traversed, the last data source and the data sources not yet merged are merged into one data file.
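The traversal performed by the merging unit can be sketched as a single pass over the sorted data sources. The dictionary representation of a data source, the `id` field name, and the union-of-fields merge in `merge_run` are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]


def merge_run(run: Sequence[Record]) -> Record:
    """Merge one run of data sources that share an identity (assumed: union of fields)."""
    merged: Record = {}
    for source in run:
        merged.update(source)
    return merged


def merge_sorted_sources(sorted_sources: Sequence[Record], id_field: str = "id") -> List[Record]:
    """Traverse sorted data sources and merge each run of equal identities."""
    data_files: List[Record] = []
    if not sorted_sources:
        return data_files
    start = 0  # index of the current initial data source
    for index in range(1, len(sorted_sources)):
        if sorted_sources[index][id_field] != sorted_sources[index - 1][id_field]:
            # Identity changed: merge everything from the initial source to the previous one.
            data_files.append(merge_run(sorted_sources[start:index]))
            start = index  # the current source becomes the new initial data source
    # Last data source reached: merge it with the sources not yet merged.
    data_files.append(merge_run(sorted_sources[start:]))
    return data_files
```

Because equal identities are adjacent after sorting, each data source is visited once and no lookup table of identities is needed.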
Optionally, each data source is divided into one or more fields, and the configuration file includes merging indication information indicating whether each field in the data source participates in merging. The merging module 330 may be configured to delete fields that do not participate in merging in each data source according to the merging indication information in the configuration file, and then merge the data sources with the same identity into one data file.
Optionally, each data source is divided into one or more fields, and the configuration file includes merging indication information indicating whether each field in the data source participates in merging. The merging module 330 may be configured to delete, according to the merging indication information in the configuration file, a field that does not participate in merging from the merged data file.
Optionally, each data source is divided into one or more fields, and the configuration file includes merging indication information indicating whether each field in the data source participates in merging. The merge module 330 may be configured to determine whether the number of data sources is equal to 1; if the number of the data sources is equal to 1, deleting fields which do not participate in merging from the data sources according to merging indication information in the configuration file, and taking the data sources as merged data files; if the number of the data sources is larger than 1, the data sources with the same identity are merged into one data file.
Optionally, the configuration file includes field description information of fields in each data source, and the field description information of each field includes merge indication information indicating whether the field participates in merging.
Optionally, the apparatus 300 further includes a configuration module, configured to determine each field in each data source, and obtain field description information of each field; and configuring field description information of each field in each data source in a configuration file.
Optionally, the configuration module may be further configured to configure an identity for each data source in the configuration file.
Optionally, mutually associated data sources are generated with the same identity. The configuration module may be further configured to acquire the identity of each data source from that data source and to configure the identity of the corresponding data source in the configuration file.
Optionally, the apparatus 300 may further include an information obtaining module, configured to obtain the identity of each data source from the configuration file.
Optionally, each data source is divided into one or more fields, the configuration file includes field description information of the fields in each data source, and the field description information of each field includes identity indication information indicating whether the field includes an identity. The information acquisition module can be used for judging, for each data source, whether each piece of field description information includes an identity according to the identity indication information of that field description information; and, when it is judged that the field description information includes an identity, acquiring the identity from that field description information.
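A lookup of the identity through the identity indication information might be sketched as follows. The layout of the configuration entry is entirely hypothetical: the keys `path`, `fields`, `name`, `is_identity`, and `identity` are invented for this illustration, since the application does not fix a concrete configuration format.

```python
from typing import Dict, List, Optional

FieldDescription = Dict[str, object]


def identity_from_config(source_config: Dict[str, object]) -> Optional[str]:
    """Return the identity recorded in the field description flagged as holding it."""
    fields: List[FieldDescription] = source_config.get("fields", [])  # type: ignore[assignment]
    for description in fields:
        # The identity indication information is modeled as a boolean flag.
        if description.get("is_identity", False):
            return str(description["identity"])
    return None  # no field description carries an identity


example_config = {
    "path": "/data/articles/part-0",  # hypothetical storage path
    "fields": [
        {"name": "title", "is_identity": False},
        {"name": "doc_id", "is_identity": True, "identity": "article-42"},
    ],
}
example_identity = identity_from_config(example_config)
```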
Optionally, the configuration file includes a designated location of the merged file output. The apparatus 300 may further include an output module for outputting the merged data file to a specified location.
Optionally, the configuration file includes a preset number of copies of files to be output. The output module may be configured to split all the merged data files into a preset number of parts for output.
Optionally, the configuration file includes a location where each data file is output. The output module can be used for outputting each data file to a position specified in the configuration file.
Optionally, the file obtaining module 310 may be configured to obtain the configuration file through a task obtaining interface of the Map-Reduce task. The data source obtaining module 320 may be configured to obtain each data source according to the storage path of the data source in the configuration file through a Map program in the Map-Reduce task. The merge module 330 may be configured to merge data sources with the same id into one data file through a Reduce program in the Map-Reduce task.
It will be clear to those skilled in the art that, for convenience and brevity of description, the various method embodiments described above may be referred to one another; for the specific working processes of the above-described devices and modules, reference may be made to corresponding processes in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. Each module may be configured in different electronic devices, or may be configured in the same electronic device, and the embodiments of the present application are not limited thereto.
Referring to fig. 5, a block diagram of an electronic device 400 according to an embodiment of the present disclosure is shown. The electronic device 400 may be a smart phone, a tablet computer, a computer, or the like. The data merging method and the data merging device in the embodiment of the application can be executed by an electronic device; or a plurality of electronic devices may cooperate to execute, such as a system cluster composed of a plurality of servers.
The electronic device may include one or more processors 410 (only one shown), memory 420, and one or more programs. Wherein the one or more programs are stored in the memory 420 and configured to be executed by the one or more processors 410. The one or more programs are configured to perform the methods described in the foregoing embodiments. If the method described in the foregoing embodiment is executed by a plurality of electronic devices, each electronic device may be configured with a part of the program to be executed.
Processor 410 may include one or more processing cores. The processor 410 interfaces with various components throughout the electronic device 400 using various interfaces and circuitry to perform various functions of the electronic device 400 and process data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 420 and invoking data stored in the memory 420. Alternatively, the processor 410 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 410 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 410, but may be implemented by a communication chip.
The memory 420 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 420 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 420 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing the various method embodiments described above, and the like. The data storage area may also store data created by the electronic device in use, and the like.
Referring to fig. 6, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 500 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 500 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 500 includes a non-volatile computer-readable storage medium. The computer readable storage medium 500 has storage space for program code 510 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 510 may be compressed, for example, in a suitable form.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims (20)

  1. A method for merging data, the method comprising:
    acquiring a configuration file, wherein the configuration file comprises a storage path of a data source and an identity of the data source, and the data sources needing to be combined into one data file are configured with the same identity;
    acquiring each data source according to the storage path of the data source in the configuration file;
    and merging the data sources with the same identity into a data file.
  2. The method of claim 1, wherein merging the data sources with the same identity into one data file comprises:
    sorting the data sources according to the identity identifiers, so that the data sources with the same identity identifiers are adjacent;
    and merging the adjacent data sources with the same identity into a data file.
  3. The method of claim 2, wherein merging the data sources with the same identity into one data file comprises:
    taking the first data source after sorting as an initial data source;
    traversing each data source in sequence;
    when traversing to a data source with an identity different from the previous data source, merging all data sources from the initial data source to the previous data source into a data file;
    taking the currently traversed data source as a new initial data source,
    and continuing the sequential traversal of the data sources, such that each time a data source with an identity different from that of the previous data source is traversed, all data sources from the initial data source to the previous data source are merged into a data file and the currently traversed data source is taken as a new initial data source.
  4. The method of claim 3, wherein when traversing to the last data source, merging the last data source with the un-merged data source into a data file.
  5. The method according to any one of claims 1 to 4, wherein each data source is divided into one or more fields, the configuration file includes merging indication information of whether each field in the data source participates in merging, and the merging of data sources with the same identity into one data file includes:
    and according to the merging indication information in the configuration file, deleting the fields which do not participate in merging in each data source, and merging the data sources with the same identity into one data file.
  6. The method according to any one of claims 1 to 4, wherein each data source is divided into one or more fields, the configuration file includes merge indication information indicating whether each field in the data source participates in merging, and after merging the data sources with the same identity into one data file, the method further includes:
    and deleting fields which do not participate in the combination from the combined data file according to the combination indication information in the configuration file.
  7. The method according to any one of claims 1 to 6, wherein each data source is divided into one or more fields, the configuration file includes merging indication information of whether each field in the data source participates in merging, and before merging the data sources with the same identity into one data file, the method further includes:
    judging whether the number of the data sources is equal to 1 or not;
    if the number of the data sources is equal to 1, deleting fields which do not participate in merging from the data sources according to merging indication information in the configuration file, and taking the data sources as merged data files;
    and if the number of the data sources is more than 1, executing the step of combining the data sources with the same identity into one data file.
  8. The method according to any one of claims 5 to 7, wherein the configuration file includes field description information of fields in each data source, and the field description information of each field includes merge indication information of whether the field participates in merging.
  9. The method of claim 8, wherein before obtaining the configuration file, further comprising:
    determining each field in each data source, and acquiring field description information of each field;
    and configuring field description information of each field in each data source in a configuration file.
  10. The method according to any of claims 1-9, wherein prior to obtaining the configuration file, further comprising:
    and configuring the identity for each data source in a configuration file.
  11. The method of claim 10, wherein the correlated data sources are generated with the same id when they are generated, and before configuring the id for each data source in the configuration file, the method further comprises:
    and acquiring the identity of the data source from each data source, and configuring the identity of the corresponding data source in a configuration file.
  12. The method according to any one of claims 1-11, wherein before merging the data sources with the same identity into one data file, further comprising: and acquiring the identity of each data source from the configuration file.
  13. The method according to claim 12, wherein each data source is divided into one or more fields, the configuration file includes field description information of the fields in each data source, the field description information of each field includes identity indication information on whether the field includes an identity, and the obtaining the identity of each data source from the configuration file includes:
    for each data source, judging whether the field description information comprises an identity according to the identity indication information of the field description information;
    and when the field description information is judged to comprise the identity, acquiring the identity from the field description information comprising the identity.
  14. The method according to any one of claims 1-13, wherein the configuration file includes a designated location of the merged file output, and after merging the data sources with the same identity into one data file, the method further includes:
    and outputting the merged data file to a specified position.
  15. The method according to any one of claims 1 to 14, wherein the configuration file includes a number of parts of files to be output, the number being a preset number of parts, and the method further comprises:
    and splitting all the merged data files into a preset number of parts for output.
  16. The method of claim 15, wherein the configuration file includes a location of each data file output, and the splitting all the merged data files into a preset number of outputs includes:
    and outputting each data file to a position designated in the configuration file.
  17. The method of claim 1, further comprising:
    acquiring a configuration file through a task acquisition interface of a Map-Reduce task;
    acquiring each data source according to the storage path of the data source in the configuration file through a Map program in the Map-Reduce task;
    and merging the data sources with the same identity into one data file through a Reduce program in the Map-Reduce task.
  18. A data merging apparatus, characterized in that the apparatus comprises:
    the file acquisition module is used for acquiring a configuration file, wherein the configuration file comprises a storage path of a data source and an identity identifier of the data source, and the data sources which need to be combined into one data file are configured with the same identity identifier;
    the data source acquisition module is used for acquiring each data source according to the storage path of the data source in the configuration file;
    and the merging module is used for merging the data sources with the same identity into a data file.
  19. An electronic device, comprising:
    one or more processors;
    a memory;
    one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-17.
  20. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 17.
CN201980099361.7A 2019-10-18 2019-10-18 Data merging method and device, electronic equipment and storage medium Pending CN114258541A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/112037 WO2021072776A1 (en) 2019-10-18 2019-10-18 Data merging method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN114258541A true CN114258541A (en) 2022-03-29

Family

ID=75537404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980099361.7A Pending CN114258541A (en) 2019-10-18 2019-10-18 Data merging method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114258541A (en)
WO (1) WO2021072776A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101902335A (en) * 2009-05-27 2010-12-01 北京启明星辰信息技术股份有限公司 Data filter and combination method
US9083623B2 (en) * 2012-03-12 2015-07-14 Texas Instruments Incorporated Inserting source, sequence numbers into data stream from separate sources
CN103390003A (en) * 2012-05-09 2013-11-13 人人游戏网络科技发展(上海)有限公司 Method and device for combining user data information among servers
CN103577276B (en) * 2012-07-18 2017-11-17 深圳市腾讯计算机系统有限公司 The standby system and method for user's operation data
CN102780780B (en) * 2012-07-25 2014-11-19 中国联合网络通信集团有限公司 Method, equipment and system for data processing in cloud computing mode
CN110097170A (en) * 2019-04-25 2019-08-06 深圳市豪斯莱科技有限公司 Information pushes object prediction model acquisition methods, terminal and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089436A (en) * 2022-11-29 2023-05-09 荣耀终端有限公司 Data auditing method of large data volume and electronic equipment
CN116089436B (en) * 2022-11-29 2023-11-07 荣耀终端有限公司 Data auditing method of large data volume and electronic equipment

Also Published As

Publication number Publication date
WO2021072776A1 (en) 2021-04-22

Similar Documents

Publication Publication Date Title
CN110705214B (en) Automatic coding method and device
US20060004528A1 (en) Apparatus and method for extracting similar source code
CN113110833A (en) Machine learning model visual modeling method, device, equipment and storage medium
CN112464034A (en) User data extraction method and device, electronic equipment and computer readable medium
CN109460398B (en) Time series data completion method and device and electronic equipment
CN110580189A (en) method and device for generating front-end page, computer equipment and storage medium
CN112328552A (en) Bottom layer data management method, system and computer readable storage medium
CN109285024B (en) Online feature determination method and device, electronic equipment and storage medium
US20190147104A1 (en) Method and apparatus for constructing artificial intelligence application
CN111367976A (en) Method and device for exporting EXCEL file data based on JAVA reflection mechanism
CN107273546B (en) Counterfeit application detection method and system
CN110674413B (en) User relationship mining method, device, equipment and storage medium
CN110716739A (en) Code change information statistical method, system and readable storage medium
CN114258541A (en) Data merging method and device, electronic equipment and storage medium
JP2019101889A (en) Test execution device and program
CN109740074B (en) Method, device and equipment for processing parameter configuration information
CN114611039B (en) Analysis method and device of asynchronous loading rule, storage medium and electronic equipment
CN108595395B (en) Nickname generation method, device and equipment
CN113434507B (en) Data textualization method, device, equipment and storage medium
CN112507214B (en) User name-based data processing method, device, equipment and medium
CN114185958A (en) Blood relationship generation method and device, computer equipment and storage medium
CN112182218A (en) Text data classification method and device
CN115004169A (en) User selection method, device, server and storage medium
CN111651531A (en) Data import method, device, equipment and computer storage medium
CN114840743B (en) Model recommendation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination