WO2021072776A1 - Data merging method, device, electronic equipment, and storage medium - Google Patents

Data merging method, device, electronic equipment, and storage medium

Info

Publication number
WO2021072776A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data source
file
configuration file
identity
Application number
PCT/CN2019/112037
Other languages
English (en)
French (fr)
Inventor
王少丹
Original Assignee
北京欧珀通信有限公司
Application filed by 北京欧珀通信有限公司
Priority to CN201980099361.7A (CN114258541A)
Priority to PCT/CN2019/112037 (WO2021072776A1)
Publication of WO2021072776A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles

Definitions

  • This application relates to the field of data processing technology, and more specifically, to a data merging method, device, electronic equipment, and storage medium.
  • This application proposes a data merging method, device, electronic equipment, and storage medium to address the above-mentioned problems.
  • An embodiment of the present application provides a data merging method. The method includes: obtaining a configuration file, where the configuration file includes the storage path of each data source and the identity of each data source, and the data sources that need to be merged into one data file are configured with the same identity; obtaining each data source according to the storage path of the data source in the configuration file; and merging the data sources with the same identity into one data file.
  • An embodiment of the present application provides a data merging device. The device includes: a file acquisition module for acquiring a configuration file, where the configuration file includes the storage path of each data source and the identity of each data source, and the data sources that need to be merged into one data file are configured with the same identity; a data source acquisition module for obtaining each data source according to the storage path of the data source in the configuration file; and a merging module for merging the data sources with the same identity into one data file.
  • An embodiment of the present application provides an electronic device, including: one or more processors; a memory; and one or more programs.
  • The one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to execute the above-mentioned method.
  • An embodiment of the present application provides a computer-readable storage medium having program code stored in it, and the program code can be invoked by a processor to execute the above-mentioned method.
  • The storage path and the identity of each data source to be merged are configured in the configuration file. After the configuration file is obtained, each data source can be obtained according to the storage path in the configuration file, and the data sources with the same identity are merged into one data file. There is no need to extract information from the data source itself to determine the basis for merging, which makes the data merging process more convenient.
  • Fig. 1 shows a flowchart of a data merging method provided by an embodiment of the present application.
  • Fig. 2 shows a flowchart of a data merging method provided by another embodiment of the present application.
  • Fig. 3 shows a schematic diagram of a configuration file provided by an embodiment of the present application.
  • Fig. 4 shows a functional module diagram of a data merging device provided by an embodiment of the present application.
  • Fig. 5 shows a structural block diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 6 shows a storage unit, provided by an embodiment of the present application, for storing or carrying program code that implements the data merging method according to the embodiments of the present application.
  • Interrelated data files can be considered as data files that have commonalities between each other and are generated for the same object and need to be combined and analyzed when analyzing the object.
  • Interrelated data files are data files that need to be merged into one file.
  • A data file may be any of various forms of data generated on the Internet, such as tables, text, combined text and images, images, code, and so on, which are not enumerated here.
  • Usually, when data files are merged, data is extracted from the data files themselves as the basis for the merge, which is cumbersome and not accurate enough. For example, it is necessary to extract fields from each data file to be merged and merge the files that share the same field, or to calculate the similarity between the files and merge the data files whose similarity is higher than a certain value.
  • The inventor therefore proposes the data merging method, device, electronic device, and storage medium provided by the embodiments of the present application.
  • The basis for merging the data sources to be merged is determined through the configuration file. Compared with obtaining that basis from the data sources themselves, this makes processing more convenient and speeds up merging. The data sources to be merged are the aforementioned data files.
  • The data merging method, device, electronic equipment, and storage medium provided in the embodiments of the present application are described in detail below through specific embodiments.
  • FIG. 1 shows a data merging method provided by an embodiment of the present application. Specifically, the method includes:
  • Step S110: Obtain a configuration file.
  • The configuration file includes the storage path of the data source and the identity of the data source.
  • The data sources that need to be merged into one data file are configured with the same identity.
  • A configuration file may be set up for the data sources that need to be merged, so that the merge strategy used to merge the data sources is determined by the configuration file.
  • Data sources that are related to each other are merged into one data file.
  • In the configuration file, data sources that are related to each other can be marked, so that the configuration file determines which data sources need to be merged into the same data file.
  • The data sources that are related to each other can be marked by the identity: the data sources that need to be merged into the same data file are configured with the same identity, the data sources that are not merged into the same data file are configured with different identities, and the identity of each data source is stored in the configuration file.
  • For example, suppose the data sources that need to be merged are a, b, c, d, and e, where a, b, and c need to be merged into one data file, and d and e need to be merged into another. Then a, b, and c are configured with the same identity A in the configuration file, d and e are configured with the same identity B, and identity A is different from identity B.
  • Each data source has a corresponding storage location.
  • The configuration file can include the storage path of each data source, so that each data source can be found according to its storage path.
  • For example, the storage path of an article may be the server of the self-media party, the storage path of a comment may be the server corresponding to the user's comment, and the storage path of the likes may be the server of the software platform where the article is published.
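As an illustration of the configuration described above, the following is a minimal sketch of what such a configuration file might hold once loaded. The key names (`data_sources`, `identity`, `storage_path`) and the paths are assumptions made for this example; the application does not specify a concrete file format.

```python
# Hypothetical configuration content; key names and paths are illustrative only.
CONFIG = {
    "data_sources": [
        {"name": "a", "identity": "A", "storage_path": "/data/articles/a.json"},
        {"name": "b", "identity": "A", "storage_path": "/data/comments/b.json"},
        {"name": "c", "identity": "A", "storage_path": "/data/likes/c.json"},
        {"name": "d", "identity": "B", "storage_path": "/data/articles/d.json"},
        {"name": "e", "identity": "B", "storage_path": "/data/comments/e.json"},
    ]
}


def group_names_by_identity(config):
    """Return {identity: [data source names]} as declared in the configuration."""
    groups = {}
    for source in config["data_sources"]:
        groups.setdefault(source["identity"], []).append(source["name"])
    return groups


groups = group_names_by_identity(CONFIG)
print(groups)
```

The grouping is read entirely from the configuration; none of the data sources is opened to decide what belongs together.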
  • Step S120: Obtain each data source according to the storage path of the data source in the configuration file.
  • Step S130: Merge data sources with the same identity into one data file.
  • After each data source is obtained, the data sources are merged.
  • During merging, the data sources with the same identity are merged into the same data file.
  • For example, if the identities of data sources a, b, and c are the same, a, b, and c are merged into the same data file.
  • In this embodiment, the storage path and the identity of each data source to be merged are configured in the configuration file.
  • After the configuration file is obtained, each data source can be obtained according to the storage path in the configuration file, and the identity of each data source in the configuration file then determines which data sources need to be merged into one data file.
  • The data sources with the same identity are merged into one data file, and there is no need to extract information from the data source itself to determine the basis for merging, which improves both the convenience and the speed of the data merging process.
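Steps S110 to S130 can be sketched end to end as follows. This is only an illustration under stated assumptions: the configuration file and the data sources are JSON files on the local file system, a merged "data file" is modeled as a list of the merged contents, and every function and key name is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_config(config_path):
    """Step S110: obtain the configuration file (assumed here to be JSON)."""
    return json.loads(Path(config_path).read_text(encoding="utf-8"))


def fetch_sources(config):
    """Step S120: read each data source from its configured storage path."""
    sources = []
    for entry in config["data_sources"]:
        content = json.loads(Path(entry["storage_path"]).read_text(encoding="utf-8"))
        sources.append((entry["identity"], content))
    return sources


def merge_by_identity(sources):
    """Step S130: merge data sources that share an identity into one data file."""
    merged = {}
    for identity, content in sources:
        merged.setdefault(identity, []).append(content)
    return merged


# Demo with temporary files standing in for remote storage paths.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    entries = []
    for name, identity, content in [
        ("a", "A", {"title": "Article"}),
        ("b", "A", {"comment": "Nice"}),
        ("d", "B", {"title": "Other"}),
    ]:
        path = tmp_dir / f"{name}.json"
        path.write_text(json.dumps(content), encoding="utf-8")
        entries.append({"identity": identity, "storage_path": str(path)})
    config_path = tmp_dir / "config.json"
    config_path.write_text(json.dumps({"data_sources": entries}), encoding="utf-8")

    merged_files = merge_by_identity(fetch_sources(load_config(config_path)))

print(merged_files)
```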
  • Each data source may be structured first, and the configuration file is then set according to the structured data sources.
  • The part of the data source used for merging can then be selected flexibly according to the structural characteristics of the data source. For details, refer to Fig. 2.
  • The method includes:
  • Step S210: Structure each data source so that it includes multiple fields, and set a configuration file according to the data sources.
  • Step S220: Obtain the configuration file.
  • The data sources may be structured first to define the data structure of each data source.
  • One way to structure a data source is to divide it into multiple fields, where each field is a part of the data source.
  • For example, if a data source is an article, the article can be divided into multiple fields: title, author name, abstract, and body content. If a data source is a comment on an article, the fields may be the title of the commented article, the content of the comment, the commenter, and the time of the comment.
  • The data source can also be divided into only one field, that is, the entire data source is treated as one field.
  • The specific way of dividing a data source into fields is not limited in the embodiments of the present application.
  • The division of a data source into fields may be completed by the user and uploaded to the execution device.
  • Division rules for various types of data sources may also be preset, and the division may be performed according to these rules.
  • A data source whose type has a division rule is divided into fields according to that rule; a data source whose type has no division rule is submitted to the user for manual division, or for the user to specify a division rule.
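The rule-based division with a manual fallback might look like the sketch below. The rule table, the `split_article` rule, and the `manual_divider` callback are all hypothetical names introduced for illustration; treating an unhandled source as a single field follows the single-field option mentioned above.

```python
def split_article(text):
    """Hypothetical division rule: first line is the title, the rest is the body."""
    title, _, body = text.partition("\n")
    return {"title": title, "body": body}


# Preset division rules, keyed by data source type (assumed structure).
DIVISION_RULES = {"article": split_article}


def structure(source_type, raw, manual_divider=None):
    """Divide a raw data source into fields.

    Types with a preset rule use that rule; other types are handed to a
    user-supplied divider; with neither, the whole source is one field.
    """
    rule = DIVISION_RULES.get(source_type)
    if rule is not None:
        return rule(raw)
    if manual_divider is not None:
        return manual_divider(raw)
    return {"content": raw}


article_fields = structure("article", "My title\nBody text")
comment_fields = structure("comment", "great read", lambda raw: {"comment": raw})
like_fields = structure("likes", "3")
```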
  • Description information is configured for each field of the data source as the field description information of that field.
  • In the configuration file, the name of the data source corresponds to the field description information of each field of the data source.
  • Each field in each data source can be determined, and the field description information of each field can be obtained; the field description information of each field in each data source can then be configured in the configuration file.
  • Each piece of field description information includes the basic information of the corresponding field.
  • For example, data source 2 in Fig. 3 includes three fields: field 1, field 2, and field 3.
  • The field description information of field 1 is I11, I12, and I13;
  • the field description information of field 2 is I21, I22, I23, and I24;
  • the field description information of field 3 is I31, I32, and I33.
  • Which information about the field the field description information includes is not limited in the embodiments of this application. For example, it may include the field name, the data type of the field, and so on. As shown in Fig. 3, I11 can represent the field name of field 1, I12 can represent the data type of field 1, and so on.
  • The field description information of a field can be extracted from the data source, determined according to the field division rule, or determined by the user, which is not limited in the embodiments of the present application.
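One possible in-memory model of the field description information in Fig. 3 is sketched below. Only the field name and data type are taken from the text; the class names, the `extra` dictionary for further description items, and the sample field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldDescription:
    name: str        # corresponds to an item such as I11 (field name)
    data_type: str   # corresponds to an item such as I12 (data type)
    extra: Dict[str, str] = field(default_factory=dict)  # any further items (assumed)


@dataclass
class DataSourceConfig:
    name: str                      # the data source name the descriptions belong to
    fields: List[FieldDescription]


# Hypothetical "data source 2" with three fields, as in Fig. 3.
data_source_2 = DataSourceConfig(
    name="data source 2",
    fields=[
        FieldDescription("title", "string"),
        FieldDescription("tags", "array"),
        FieldDescription("body", "string"),
    ],
)

field_types = {f.name: f.data_type for f in data_source_2.fields}
print(field_types)
```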
  • An identity can also be configured for each data source in the configuration file, and the data sources that need to be merged into one data file are configured with the same identity. Therefore, after the configuration file is obtained, the identity of each data source can be obtained from it.
  • The identity of each data source may be assigned by the user and then configured in the configuration file; it may also be assigned to each data source according to a preset identification rule.
  • The identification rule can be that associated data sources are assigned the same identity, while non-associated data sources are assigned different identities.
  • Alternatively, data sources that are related to each other are given the same identity when they are generated.
  • The identity can then be obtained from each data source and used to configure the identity of the corresponding data source in the configuration file.
  • For example, when an article is generated, an identity is generated for the article.
  • When a user comments on the article, an identity that is the same as the identity of the article is generated for the user comment.
  • The identity of the article can be obtained from the article and configured for the article in the configuration file; the identity of the user comment can be obtained from the user comment and configured for the user comment in the configuration file.
  • The identity can be configured as a field of the data source in the configuration file, or the identity of the data source can be configured in the field description information of one of the fields of the data source.
  • In the latter case, the identity of the data source is obtained from the field description information in which it is configured.
  • The field description information of each field may include identity indication information, which indicates whether that field description information includes the identity; the identity of the data source is configured in the field description information whose identity indication information indicates that it includes the identity.
  • The identity is then obtained from the field description information that includes it.
  • For example, data source a includes field 1, field 2, and field 3.
  • The identity indication information in the field description information of each of these fields can be checked to find the field description information that includes the identity.
  • The representation of the identity indication information can be more concise than the identity itself, so the concise identity indication information is checked first, and the more complex identity is read only when the field description information is found to include it.
  • The identity can also be read from default field description information.
  • The identity can also be stored as a separate parameter corresponding to the name of the data source.
  • In that case, the parameter corresponding to the name of the data source is read to obtain the identity.
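The three lookup options above (a flagged field description, a default field description, and a separate parameter) could be combined as in the following sketch. The dictionary layout, the `has_identity` flag, and the `default_field` argument are hypothetical; the order in which the options are tried is also an assumption.

```python
def read_identity(source_config, default_field=None):
    """Return the identity configured for one data source, or None.

    Lookup order (an assumption of this sketch):
    1. a field description whose identity indication flag is set,
    2. the default field description, if one is named,
    3. a separate per-data-source parameter.
    """
    descriptions = source_config.get("fields", [])
    for desc in descriptions:
        # Cheap flag check first; the identity is read only when the flag is set.
        if desc.get("has_identity"):
            return desc["identity"]
    if default_field is not None:
        for desc in descriptions:
            if desc.get("name") == default_field and "identity" in desc:
                return desc["identity"]
    return source_config.get("identity")


source_a = {
    "name": "a",
    "fields": [
        {"name": "field 1", "has_identity": False},
        {"name": "field 2", "has_identity": True, "identity": "A"},
        {"name": "field 3", "has_identity": False},
    ],
}
source_b = {"name": "b", "identity": "B", "fields": []}
source_c = {"name": "c", "fields": [{"name": "field 1", "identity": "C"}]}
```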
  • The storage path of each data source is also stored in the configuration file. After the configuration file is obtained, the storage path of each data source can be obtained from it.
  • The storage path can be stored as a separate parameter corresponding to the name of the data source, or it can be stored in the description information of a certain field.
  • The execution device of step S210 may be different from the execution device of steps S220 to S240.
  • If steps S220 to S240 are executed by a single device, the execution device of step S210 is different from that single device; if steps S220 to S240 are executed by a system, such as a cluster like a Hadoop cluster, the execution device of step S210 is different from that system.
  • The execution device of step S210 and the execution device of steps S220 to S240 may also be the same execution device, or be devices in the same system or cluster.
  • For example, the configuration file is set by an electronic device, and the data merging according to the configuration file is completed by a Hadoop cluster.
  • The electronic device can submit the configuration file to the Hadoop cluster through the MR task submission interface, so that the Hadoop cluster obtains the configuration file from the electronic device that set it.
  • Step S230: Obtain each data source according to the storage path of the data source in the configuration file.
  • Step S240: Merge data sources with the same identity into one data file, and delete the fields in each data source that do not participate in the merge.
  • Each data source can be obtained according to the storage path of the data source in the configuration file, and each data source is parsed according to the data structure configured in the configuration file, so that the data sources with the same identity can be merged into one data file.
  • The data sources can be sorted by identity, so that the data sources with the same identity are adjacent after sorting. The adjacent data sources with the same identity are then merged into one data file. For example, if the identities are English letters, after the data sources are sorted by identity, the data sources whose identity is the same letter are arranged next to each other, and the adjacent data sources with the same letter are merged into one data file.
  • The first data source after sorting can be used as the starting data source, and each data source can be traversed in sequence.
  • When the traversal reaches a data source whose identity differs from that of the previous data source, all data sources from the starting data source to the previous data source are merged into one data file. The currently traversed data source is then used as the new starting data source, and the traversal continues.
  • This cycle repeats, so that all adjacent data sources with the same identity are merged into one data file. It is understandable that when the traversal reaches the last data source, since there is no next data source, the last data source and the data sources not yet merged can be merged into one data file.
  • For example, for data sources a, b, c, d, and e, data sources a, b, and c are configured with the same identity A in the configuration file, and data sources d and e are configured with the same identity B.
  • After sorting, data sources a, b, and c are arranged adjacently, and data sources d and e are adjacent; for example, the arrangement order is a, b, c, d, e.
  • Data sources a, b, and c, whose identity is A, are merged into one data file, and data sources d and e, whose identity is B, are merged into another.
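The sort-then-traverse merge described above can be sketched as follows. Each data source is modeled as an `(identity, content)` pair and a merged data file as a list of contents; both representations are assumptions of this example.

```python
def merge_adjacent(sources):
    """Sort by identity, then merge each run of adjacent equal identities.

    sources: list of (identity, content) pairs.
    Returns a list of (identity, [contents]) pairs, one per merged data file.
    """
    ordered = sorted(sources, key=lambda source: source[0])  # stable sort
    data_files = []
    start = 0  # index of the current starting data source
    for i in range(1, len(ordered) + 1):
        # A run ends when the identity changes or the list is exhausted.
        if i == len(ordered) or ordered[i][0] != ordered[i - 1][0]:
            run = ordered[start:i]
            data_files.append((run[0][0], [content for _, content in run]))
            start = i  # the current data source becomes the new start
    return data_files


example = [("B", "d"), ("A", "a"), ("A", "b"), ("B", "e"), ("A", "c")]
merged = merge_adjacent(example)
print(merged)
```

Because Python's `sorted` is stable, data sources that share an identity keep their original relative order inside the merged data file.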
  • multiple data sources can be merged in parallel. Specifically, data sources with different identities can be selected respectively. For each selected data source, the data source with the same identity identifier is searched from the unselected and merged data sources, and all the data sources with the same identity identifier found are merged with the selected data source. In this embodiment, the number of data sources selected at the same time can be determined according to the parallel processing channels during parallel merging. If there are 5 parallel processing channels, 5 data sources can be simultaneously selected for searching and merging.
  • The data sources may also be searched and merged after being sorted according to the sorting method of the foregoing embodiment.
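A rough sketch of the parallel variant is given below, using a thread pool whose size stands in for the number of parallel processing channels. This mapping, and the function names, are assumptions; in CPython, threads mainly help when reading the data sources is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_in_parallel(sources, channels=5):
    """Merge (identity, content) pairs, handling up to `channels` identities at once."""
    # Select one representative per distinct identity, in first-seen order.
    identities = []
    seen = set()
    for identity, _ in sources:
        if identity not in seen:
            seen.add(identity)
            identities.append(identity)

    def search_and_merge(identity):
        # Find every data source with this identity and merge them.
        return identity, [content for ident, content in sources if ident == identity]

    with ThreadPoolExecutor(max_workers=channels) as pool:
        return dict(pool.map(search_and_merge, identities))


parallel_result = merge_in_parallel(
    [("A", "a"), ("B", "d"), ("A", "b"), ("B", "e"), ("A", "c")], channels=2
)
print(parallel_result)
```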
  • The merged data file usually has a corresponding usage scenario; for example, it can be used for data support in scenarios such as data analysis and search recommendation.
  • Not all content in the data source may be useful in the corresponding usage scenario. Therefore, unnecessary content can be excluded from the merged data file, making the data file more concise and reducing the storage space it occupies.
  • For example, if the article's commenters, comment times, and likes are all useless for the usage scenario, this part can be deleted.
  • The fields holding content that does not need to be merged can be set as fields that do not participate in the merge, and the unnecessary part of a data source can be removed by deleting the fields in each data source that do not participate in the merge.
  • The configuration file may include merge indication information indicating whether each field in the data source participates in the merge.
  • The field description information of each field can include merge indication information indicating whether the field participates in the merge; alternatively, a parameter can be set for each data source in the configuration file as the merge indication information, indicating which fields in the data source do not participate in the merge.
  • The fields that do not participate in the merge can be deleted from the merged data file according to the merge indication information in the configuration file.
  • That is, the data sources with the same identity may be merged into one data file first.
  • Alternatively, according to the merge indication information in the configuration file, the fields in each data source that do not participate in the merge are deleted first, and the data sources with the same identity are then merged into one data file.
  • When only one data source is configured in the configuration file, that is, the configuration file only includes configuration information such as the identity, storage path, and field description information of one data source, content filtering of the data source can be realized through the merge indication information: after the fields that do not participate in the merge are deleted from the data source, the content of interest remains.
  • It can be judged whether the number of data sources is equal to 1; specifically, it can be judged whether the configuration file contains the configuration information of only one data source. If the number of data sources is equal to 1, there is no other data source to merge with, so the fields that do not participate in the merge are deleted from the data source according to the merge indication information in the configuration file, and the data source is used as the merged data file, which filters the content of the data file. If the number of data sources is greater than 1, the data sources with the same identity need to be merged into one data file, and the merge operation of the embodiments of the present application is performed.
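The field deletion and the single-data-source case can be illustrated with the sketch below. The record layout and the `merge_flags` mapping are hypothetical, and treating a field without a flag as participating in the merge is an assumption of this example.

```python
def drop_non_merged_fields(record, merge_flags):
    """Keep only the fields whose merge indication is true.

    Fields missing from merge_flags are kept (an assumption of this sketch).
    """
    return {name: value for name, value in record.items() if merge_flags.get(name, True)}


def filter_or_merge(sources):
    """sources: list of dicts with 'identity', 'record', and 'merge_flags' keys."""
    cleaned = [
        (s["identity"], drop_non_merged_fields(s["record"], s["merge_flags"]))
        for s in sources
    ]
    if len(cleaned) == 1:
        # Only one data source: the filtered source itself is the data file.
        identity, record = cleaned[0]
        return {identity: [record]}
    merged = {}
    for identity, record in cleaned:
        merged.setdefault(identity, []).append(record)
    return merged


single = filter_or_merge([
    {
        "identity": "A",
        "record": {"body": "text", "commenter": "u1", "comment_time": "10:00"},
        "merge_flags": {"commenter": False, "comment_time": False},
    }
])
print(single)
```

Here the fields are deleted before merging; deleting them from the merged data file afterwards, as the text also allows, would give the same result in this model.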
  • For a field of the array type, the data type in the field description information of the field can be configured as an array, and it can be specified which of the multiple parallel contents in the array-type field participate in the merging. Which content in the array-type field participates can be determined according to a preset designation rule for the array type. For example, the preset designation rule may specify that only the first content in the array-type field participates in the merging, that the content pointed to by a pointer specified in the rule participates in the merging, or that all the content in the array-type field participates in the merging.
  • In addition, the user can also configure, in the field description information, which content in the array-type field participates in the merging.
  • The content of an array-type field that does not participate in the merging is not reflected in the final merged data file. Specifically, the content that does not participate in the merge can be deleted from the array-type field before the data sources are merged, and the data source is then merged with the other data sources that have the same identity; alternatively, after the data sources with the same identity are merged into one data file, the content of the array-type field that does not participate in the merging is deleted from the data file.
  • For example, an article data source includes three article tags: current affairs, sports, and footwear. Taking the article tags as a field, the field has three parallel contents: current affairs, sports, and footwear. If sports and footwear are the real tags of the article, sports and footwear can be specified in the field description information of this field as participating in the merge, so that the current affairs tag is not included in the merged data file.
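A minimal sketch of selecting which contents of an array-type field participate in the merge is shown below. The rule names `"first"` and `"all"` and the `chosen` argument are illustrative assumptions; the pointer-based rule is omitted because the text does not describe it in enough detail.

```python
def select_array_contents(values, rule="all", chosen=None):
    """Return the contents of an array-type field that participate in the merge.

    chosen: contents explicitly specified in the field description (takes priority).
    rule:   preset designation rule, either "first" or "all".
    """
    if chosen is not None:
        return [value for value in values if value in chosen]
    if rule == "first":
        return values[:1]
    if rule == "all":
        return list(values)
    raise ValueError(f"unknown designation rule: {rule!r}")


tags = ["current affairs", "sports", "footwear"]
kept_tags = select_array_contents(tags, chosen={"sports", "footwear"})
first_tag = select_array_contents(tags, rule="first")
print(kept_tags, first_tag)
```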
  • After merging, the data file can be output.
  • The location to which the merged file is to be output can be configured in the configuration file, and this location is defined as the designated location.
  • The output location of the file may also not be specified in the configuration file, but obtained at the same time as the configuration file. For example, when the generator of the configuration file submits the configuration file to the acquirer of the configuration file, it specifies the location of the output file.
  • All the merged data files can be output to a designated location.
  • Each data file may be output as an independent file to the same designated location.
  • Alternatively, all data files can be combined into one file and output to a designated location.
  • The configuration file may include the number of copies of the file to be output, which is a preset number of copies. After all data sources are merged into data files, all the merged data files can be split into the preset number of copies for output. For example, if 100 data files are obtained after merging and the preset number of copies is 5, every 20 data files can form 1 copy, giving 5 copies.
  • The number of data files in each copy is not limited.
  • The number of copies to be output may also not be specified in the configuration file, but obtained at the same time as the configuration file. For example, when the generator of the configuration file submits the configuration file to the acquirer of the configuration file, it specifies the number of copies of the output file.
  • All data files can be merged into one file, and that file is then split into the preset number of copies.
  • Alternatively, each data file can be treated as an independent file, and all data files are divided into the preset number of copies.
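Splitting the merged data files into a preset number of copies can be sketched as follows. Since the number of data files per copy is not limited, the near-even split used here is just one possible choice, and the function name is hypothetical.

```python
def split_into_copies(data_files, copies):
    """Split a list of data files into `copies` groups of near-equal size."""
    if copies <= 0:
        raise ValueError("copies must be a positive integer")
    base, extra = divmod(len(data_files), copies)
    result = []
    start = 0
    for i in range(copies):
        # The first `extra` copies receive one additional data file.
        size = base + (1 if i < extra else 0)
        result.append(data_files[start:start + size])
        start += size
    return result


# 100 merged data files and a preset number of 5 copies: 20 data files per copy.
parts = split_into_copies([f"file_{n}" for n in range(100)], 5)
print([len(part) for part in parts])
```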
  • The configuration file may include the output location of each data file.
  • Each data file can be output to the location specified for it in the configuration file.
  • A designated location may be configured for each data file, so that the configuration file includes the output location of each data file.
  • For example, if one of the data sources corresponding to the same identity is configured with a designated location, then after the data sources with that identity are merged into one data file, the data file is output to the designated location configured for that data source, thereby outputting each data file to the location specified in the configuration file.
  • Data source processing and merging can be performed in a Hadoop cluster through the Map-Reduce computing model.
  • The configuration file can be obtained through the task acquisition interface of the Map-Reduce task.
  • The map program in the Map-Reduce task obtains each data source according to the storage path of the data source in the configuration file, parses and sorts the data sources, and sends the processed data sources to the Linux standard output data stream.
  • The reduce program in the Map-Reduce task can read data from the data stream, so that data sources with the same identity are merged into one data file by the reduce program and output to the output location specified in the configuration file.
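The map and reduce roles described above can be illustrated with a small in-process simulation. No Hadoop API is used: the mapper writes `identity<TAB>json` lines to a text stream standing in for standard output, an explicit sort stands in for the ordering step between the two phases, and the reducer merges adjacent lines that share an identity. The line format, function names, and the `read_source` callback are assumptions.

```python
import io
import json
from itertools import groupby


def mapper(config, read_source, out):
    """Emit one 'identity<TAB>json' line per data source to the output stream."""
    for entry in config["data_sources"]:
        record = read_source(entry["storage_path"])
        out.write(f"{entry['identity']}\t{json.dumps(record)}\n")


def reducer(lines):
    """Merge adjacent lines that share an identity; expects input sorted by identity."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for identity, group in groupby(parsed, key=lambda pair: pair[0]):
        yield identity, [json.loads(value) for _, value in group]


# Simulated run: an in-memory store replaces real storage paths.
store = {"p1": {"title": "t"}, "p2": {"title": "u"}, "p3": {"comment": "c"}}
config = {
    "data_sources": [
        {"identity": "A", "storage_path": "p1"},
        {"identity": "B", "storage_path": "p2"},
        {"identity": "A", "storage_path": "p3"},
    ]
}
stream = io.StringIO()
mapper(config, store.__getitem__, stream)
sorted_lines = sorted(stream.getvalue().splitlines(), key=lambda l: l.split("\t", 1)[0])
data_files = dict(reducer(sorted_lines))
print(data_files)
```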
  • The configuration file may also specify the number of retries for abnormal data, or the number of retries may be specified when the generator of the configuration file submits the configuration file to the acquirer of the configuration file.
  • When an exception occurs during data processing, the Hadoop cluster can retry multiple times. When the number of retries reaches the specified number, the abnormal device in the cluster is found, and another device replaces the abnormal device for data processing.
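The retry-then-switch behavior might be modeled as in the sketch below, where each "device" is simply a callable. This is an illustration only; it does not reflect how a Hadoop cluster actually schedules retries, and all names are hypothetical.

```python
def process_with_failover(task, devices, max_retries):
    """Try each device up to max_retries times, then switch to the next device."""
    last_error = None
    for device in devices:
        for _ in range(max_retries):
            try:
                return device(task)
            except Exception as error:  # any failure counts as an abnormal attempt
                last_error = error
    raise RuntimeError("all devices failed") from last_error


calls = []


def flaky_device(task):
    calls.append("flaky")
    raise IOError("abnormal data")


def healthy_device(task):
    calls.append("healthy")
    return task.upper()


result = process_with_failover("merge", [flaky_device, healthy_device], max_retries=3)
print(result, calls)
```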
  • Field description information is added to each field of the structured data source, so that the fields in the data source that do not participate in the merging can be determined according to the field description information.
  • Fields that do not participate in the merge can be deleted, which makes the merge process simpler and more efficient and makes the merged data file more targeted.
  • The data merging method supports user-defined document merging through the configuration file, is implemented in software, and can be used in any Linux environment that provides Map-Reduce computing capabilities.
  • an embodiment of the present application also provides a data merging device 300.
  • the data merging device 300 includes: a file obtaining module 310 for obtaining a configuration file.
  • the configuration file includes the storage path of the data source and the identity of the data source.
  • the data sources that need to be merged into one data file have the same configuration ’S identity.
  • the data source obtaining module 320 is configured to obtain each data source according to the storage path of the data source in the configuration file.
  • the merging module 330 is used to merge data sources with the same identity into one data file.
  • the merging module 330 may include a sorting unit for sorting the data sources according to identities so that data sources with the same identities are adjacent; a merging unit for merging adjacent data sources with the same identities into A data file.
  • the merging unit can use the first sorted data source as the starting data source; traverse each data source in turn; when traversing to a data source with a different identity from the previous data source, move the starting data source to the previous data source. All data sources between a data source are merged into one data file; the data source currently traversed is used as the new starting data source, and each data source is sequentially traversed as described above; when the traversal to the identity is different from the previous data source Data source, merge all data sources from the initial data source to the previous data source into one data file, and use the data source currently traversed as the new initial data source.
  • the last data source merge the last data source and the unmerged data source into one data file.
  • each data source is divided into one or more fields, and the configuration file includes merge indication information indicating whether each field in the data source participates in the merge.
  • the merging module 330 may be configured to delete the fields that are not involved in merging in each data source according to the merging instruction information in the configuration file, and merge the data sources with the same identity into one data file.
  • each data source is divided into one or more fields, and the configuration file includes merge indication information indicating whether each field in the data source participates in the merge.
  • the merging module 330 may be configured to delete fields not participating in merging from the merged data file according to the merge instruction information in the configuration file.
  • each data source is divided into one or more fields, and the configuration file includes merge indication information indicating whether each field in the data source participates in the merge.
  • the merging module 330 can be used to determine whether the number of data sources is equal to 1; if the number of data sources is equal to 1, the fields not participating in the merge are deleted from the data source according to the merge indication information in the configuration file, and that data source is used as the merged data file; if the number of data sources is greater than 1, the data sources with the same identity are merged into one data file.
  • the configuration file includes field description information of fields in each data source, and the field description information of each field includes merge indication information on whether the field participates in merge.
  • the device 300 further includes a configuration module, which is used to determine each field in each data source, obtain the field description information of each field, and configure the field description information of each field in each data source in the configuration file.
  • the configuration module can also be used to configure an identity for each data source in the configuration file.
  • data sources that are related to each other are given the same identity when they are generated.
  • the configuration module can also be used to obtain the identity of each data source from the data source itself, so that the identity of the corresponding data source can be configured in the configuration file.
  • the device 300 may further include an information obtaining module, which is used to obtain the identity of each data source from the configuration file.
  • each data source is divided into one or more fields, the configuration file includes field description information of the fields in each data source, and the field description information of each field includes identity indication information indicating whether the field includes an identity.
  • the information obtaining module can be used to determine, for each data source, whether each piece of field description information includes an identity according to its identity indication information; when a piece of field description information is determined to include an identity, the identity is obtained from that field description information.
  • the configuration file includes a designated location where the merged file is output.
  • the device 300 may also include an output module for outputting the merged data file to a designated location.
  • the configuration file includes a preset number of copies of the document to be output.
  • the output module can be used to split all the merged data files into a preset number of copies for output.
  • the configuration file includes the output location of each data file.
  • the output module can be used to output each data file to the location specified in the configuration file.
  • the file obtaining module 310 may be used to obtain the configuration file through the task obtaining interface of the Map-Reduce task.
  • the data source obtaining module 320 may be used to obtain each data source according to the storage path of the data source in the configuration file through the map program in the Map-Reduce task.
  • the merging module 330 may be used to merge data sources with the same identity into one data file through the reduce program in the Map-Reduce task.
  • the coupling between the modules may be electrical, mechanical or other forms of coupling.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software function modules.
  • Each module can be configured in different electronic devices, and can also be configured in the same electronic device, which is not limited in the embodiment of the present application.
  • FIG. 5 shows a structural block diagram of an electronic device 400 provided by an embodiment of the present application.
  • the electronic device 400 may be a smart device such as a smart phone, a tablet computer, or a computer.
  • the data merging method and device in the embodiments of the present application can be executed by one electronic device; or by multiple electronic devices, such as a system cluster composed of multiple servers.
  • the electronic device may include one or more processors 410 (only one is shown in the figure), a memory 420, and one or more programs.
  • the one or more programs are stored in the memory 420 and configured to be executed by the one or more processors 410.
  • the one or more programs are configured to execute the methods described in the foregoing embodiments. If the method described in the foregoing embodiment is executed by multiple electronic devices, each electronic device may be configured with a part of the program to be executed.
  • the processor 410 may include one or more processing cores.
  • the processor 410 uses various interfaces and lines to connect the parts of the entire electronic device 400, and performs the various functions of the electronic device 400 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 420 and by calling data stored in the memory 420.
  • the processor 410 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • the processor 410 may be integrated with one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is used for rendering and drawing of display content; the modem is used for processing wireless communication. It can be understood that the above-mentioned modem may not be integrated into the processor 410, but may be implemented by a communication chip alone.
  • the memory 420 may include random access memory (RAM) or read-only memory (ROM).
  • the memory 420 may be used to store instructions, programs, codes, code sets or instruction sets.
  • the memory 420 may include a storage program area and a storage data area, where the storage program area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing each of the foregoing method embodiments, and the like.
  • the storage data area may store data created by the electronic device during use, and the like.
  • FIG. 6 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the computer-readable storage medium 500 stores program code, and the program code can be invoked by a processor to execute the method described in the foregoing method embodiment.
  • the computer-readable storage medium 500 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 500 includes a non-transitory computer-readable storage medium.
  • the computer-readable storage medium 500 has a storage space for the program code 510 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • the program code 510 may be compressed in an appropriate form, for example.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

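This application discloses a data merging method, device, electronic equipment, and storage medium, and relates to the field of data processing technology. The method includes: obtaining a configuration file, where the configuration file includes the storage path of each data source and the identity of each data source, and data sources that need to be merged into one data file are configured with the same identity; obtaining each data source according to the storage path of the data source in the configuration file; and merging data sources with the same identity into one data file, which improves the convenience of the data merging process.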

Description

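Data merging method, device, electronic equipment, and storage medium
Technical Field
This application relates to the field of data processing technology, and more specifically, to a data merging method, device, electronic equipment, and storage medium.
Background
With the continuous development of Internet technology, the data generated on network platforms is becoming increasingly fragmented and multi-sourced. To better provide data support for scenarios such as data analysis and search recommendation, related data generated on network platforms needs to be merged. Commonly used data merging approaches process data in a cumbersome way, which makes merging difficult.
Summary
In view of the above problems, this application proposes a data merging method, device, electronic equipment, and storage medium to improve the above-mentioned problems.
In a first aspect, an embodiment of this application provides a data merging method. The method includes: obtaining a configuration file, where the configuration file includes the storage path of each data source and the identity of each data source, and data sources that need to be merged into one data file are configured with the same identity; obtaining each data source according to the storage path of the data source in the configuration file; and merging data sources with the same identity into one data file.
In a second aspect, an embodiment of this application provides a data merging device. The device includes: a file obtaining module for obtaining a configuration file, where the configuration file includes the storage path of each data source and the identity of each data source, and data sources that need to be merged into one data file are configured with the same identity; a data source obtaining module for obtaining each data source according to the storage path of the data source in the configuration file; and a merging module for merging data sources with the same identity into one data file.
In a third aspect, an embodiment of this application provides an electronic device, including: one or more processors; a memory; and one or more programs. The one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to execute the above method.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores program code, and the program code can be invoked by a processor to execute the above method.
In the data merging method, device, electronic equipment, and storage medium provided by the embodiments of this application, the storage paths and identities of the data sources to be merged are configured in the configuration file. After the configuration file is obtained, each data source can be obtained according to its storage path in the configuration file, and data sources with the same identity are merged into one data file. There is no need to extract information from the data sources themselves to determine the basis for merging, which improves the convenience of the data merging process.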
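Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative work.
FIG. 1 shows a flowchart of a data merging method provided by an embodiment of this application.
FIG. 2 shows a flowchart of a data merging method provided by another embodiment of this application.
FIG. 3 shows a schematic diagram of a configuration file provided by an embodiment of this application.
FIG. 4 shows a functional module diagram of a data merging device provided by an embodiment of this application.
FIG. 5 shows a structural block diagram of an electronic device provided by an embodiment of this application.
FIG. 6 shows a storage unit of an embodiment of this application for storing or carrying program code that implements the data merging method according to the embodiments of this application.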
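Detailed Description
To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application.
On network platforms there are data documents that are related to each other. Related data documents can be regarded as data documents that share something in common, are generated for the same object, and need to be analyzed together when that object is analyzed; in other words, related data documents are data documents that need to be merged into one document. In addition, a data document can be data in any form generated on the Internet, such as tables, plain text, text combined with images, images, and code, which are not enumerated one by one here.
Many related data documents are generated on different platforms and are independent of each other. To analyze these related data documents comprehensively and completely, they need to be merged, which makes the analysis more convenient. For example, the content of an article is produced by a self-media account, comments on the article are produced by users, and the number of clicks, the identities of the users who clicked, the number of likes, and the identities of the users who liked are recorded by the software platform. The article, its comments, the click count, the clicking users' identities, the like count, and the liking users' identities are distributed across different platforms and are independent of each other. To provide data support for scenarios such as data analysis and search recommendation, for example analyzing how much users like the article, whether articles of this type are popular with users, and which types of users they are popular with, these independent data documents need to be combined into one document.
In some implementations, data is obtained from the data documents themselves as the basis for merging, which is cumbersome and not accurate enough. For example, fields need to be extracted from each data document to be merged, and documents with the same fields are merged; or the similarity between documents is calculated, and data documents whose similarity is higher than a certain value are merged.
Therefore, the inventor proposes the data merging method, device, electronic equipment, and storage medium provided by the embodiments of this application, in which the basis for merging the data sources to be merged is determined through a configuration file. Compared with obtaining data from the data sources themselves, this makes processing more convenient and increases the merging speed. Here, the data sources to be merged are the aforementioned data documents. The data merging method, device, electronic equipment, and storage medium provided by the embodiments of this application are described in detail below through specific embodiments.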
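Referring to FIG. 1, a data merging method provided by an embodiment of this application is shown. Specifically, the method includes:
Step S110: Obtain a configuration file. The configuration file includes the storage path of each data source and the identity of each data source, where data sources that need to be merged into one data file are configured with the same identity.
In the embodiments of this application, a configuration file can be set up according to the data sources that need to be merged, so that the merging strategy is determined through the configuration file and the data sources are merged accordingly.
Among the data sources, those that are related to each other are merged into one data file. In the configuration file, related data sources can be marked, so that the configuration file determines which data sources need to be merged into the same data file.
In the embodiments of this application, related data sources can be marked by identity: data sources that need to be merged into the same data file are configured with the same identity, data sources that are not merged into the same data file are configured with different identities, and the identity of each data source is stored in the configuration file. For example, the data sources to be merged are a, b, c, d, and e, where a, b, and c need to be merged into one data file and d and e need to be merged into another. Then a, b, and c are configured with the same identity A in the configuration file, d and e are configured with the same identity B, and identity A is different from identity B.
In addition, each data source has its own storage location. The configuration file can include the storage path of each data source, so that each data source can be found according to its storage path. For example, if the data sources to be merged include an article, the comments on the article, and the likes of the article, the storage path of the article may point to the self-media party's server, the storage path of the comments may point to the server that holds user comments, and the storage path of the likes may point to the server of the software platform on which the article is published.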
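Step S120: Obtain each data source according to the storage path of the data source in the configuration file.
Step S130: Merge data sources with the same identity into one data file.
Each data source is obtained according to its storage path in the obtained configuration file, and the data sources are merged, with data sources that have the same identity merged into the same data file. In the foregoing example, data sources a, b, and c have the same identity, so a, b, and c are merged into one file to obtain a merged data file; d and e have the same identity, so d and e are merged to obtain another merged data file.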
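The grouping in steps S110 to S130 can be illustrated with a minimal Python sketch. The in-memory layout of the configuration file, the key names `path` and `identity`, and the storage paths are all assumptions made for illustration; the application does not specify a concrete file format.

```python
from collections import defaultdict

# Hypothetical in-memory form of the configuration file. The key names
# ("path", "identity") and the paths are assumptions, not the patent's format.
CONFIG = {
    "a": {"path": "/data/articles/a", "identity": "A"},
    "b": {"path": "/data/comments/b", "identity": "A"},
    "c": {"path": "/data/likes/c", "identity": "A"},
    "d": {"path": "/data/articles/d", "identity": "B"},
    "e": {"path": "/data/comments/e", "identity": "B"},
}


def group_by_identity(config):
    """Return {identity: [data source names]} in configuration order."""
    groups = defaultdict(list)
    for name, entry in config.items():
        groups[entry["identity"]].append(name)
    return dict(groups)
```

Each resulting group corresponds to one merged data file; the merge basis comes entirely from the configuration, not from the contents of the data sources.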
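In the embodiments of this application, the storage paths and identities of the data sources to be merged are configured in the configuration file. After the configuration file is obtained, each data source can be obtained according to its storage path in the configuration file, and the identities in the configuration file determine which data sources need to be merged into one data file. Data sources with the same identity are merged into one data file without extracting information from the data sources themselves to determine the basis for merging, which improves both the convenience and the speed of the data merging process.
The embodiments of this application also provide another embodiment. In this embodiment, each data source is first structured, and the configuration file is set up according to the structured data sources. During data merging, the parts of each data source used for merging are selected flexibly based on the structure of the data source. Specifically, referring to FIG. 2, the method includes:
Step S210: Structure each data source so that it includes multiple fields, and set up the configuration file according to the data sources.
Step S220: Obtain the configuration file.
In the embodiments of this application, before the configuration file is obtained and the data sources are merged, the data sources can first be structured to define the data structure of each data source.
A data source can be structured by dividing it into multiple fields, each of which is a part of the data source. For example, if a data source is an article, its fields may be the title, the author name, the abstract, and the body; if a data source is a comment on an article, its fields may be the title of the commented article, the comment content, the commenter, and the comment time. Of course, a data source can also be divided into only one field, that is, the entire data source serves as one field.
The specific way of dividing a data source into fields is not limited in the embodiments of this application. In some implementations, the division of a data source into fields can be completed by a user and then uploaded to the executing device. In other implementations, division rules can be preset for each type of data source, and data sources are divided according to those rules. The two implementations can also be combined: data sources of a type that has a division rule are divided into fields according to that rule, and data sources of a type that has no division rule are submitted to the user for manual division, or for the user to specify a division rule.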
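In the configuration file, description information is configured for each field of a data source as the field description information of that field. For example, as shown in FIG. 3, the name of a data source in the configuration file corresponds to the field description information of each of its fields.
Specifically, when the configuration file is set up, the fields in each data source can be determined and the field description information of each field obtained; the field description information of each field in each data source is then configured in the configuration file.
Each piece of field description information contains basic information about the corresponding field. For example, data source 2 in FIG. 3 includes three fields: field 1, field 2, and field 3. The field description information of field 1 is I11, I12, and I13; that of field 2 is I21, I22, I23, and I24; and that of field 3 is I31, I32, and I33. Which information about a field the field description information specifically includes is not limited in the embodiments of this application; for example, it can include the field name and the data type of the field. In FIG. 3, I11 may represent the field name of field 1 and I12 may represent the data type of field 1. The field description information of a field can be extracted from the data source, determined according to the field division rules, or determined by the user, which is not limited in the embodiments of this application.
In addition, an identity can be configured for each data source in the configuration file, with data sources that need to be merged into one data file configured with the same identity. After the configuration file is obtained, the identity of each data source can therefore be obtained from it.
Optionally, in the embodiments of this application, the identity of each data source can be assigned by the user and then configured in the configuration file; alternatively, an identity can be assigned to each data source according to a preset identification rule. The preset identification rule can be that related data sources are assigned the same identity and unrelated data sources are assigned different identities.
Optionally, in the embodiments of this application, data sources that are related to each other are given the same identity when they are generated. Correspondingly, the identity of each data source can be obtained from the data source itself and used to configure the identity of that data source in the configuration file. For example, when an article is uploaded to the self-media server, an identity is generated for the article. When a user comment is produced for the article, an identity equal to the article's identity is generated for the comment. The article's identity can then be obtained from the article and configured for the article in the configuration file, and the comment's identity can be obtained from the comment and configured for the comment in the configuration file.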
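In one implementation, the identity can be configured in the configuration file as a field of the data source, or the identity of the data source can be configured in the field description information of one of the data source's fields.
In this implementation, the identity of a data source can be obtained from the field description information in which the identity is configured.
Optionally, in this implementation, the field description information of each field can include identity indication information, which indicates whether the field includes an identity; the identity of the data source is configured in the field description information whose identity indication information indicates that an identity is included. When the identity of a data source is obtained from the configuration file, it can be determined for each data source, according to the identity indication information in each piece of its field description information, whether that field description information includes an identity. When a piece of field description information is determined to include an identity, the identity is obtained from it. For example, in the configuration file, data source a includes field 1, field 2, and field 3. To obtain the identity of data source a, the identity indication information in the field description information of field 1 is read to see whether it indicates "yes" or "no". If it is "yes", the identity of data source a is read from field 1; if it is "no", field 1 contains no identity, and the identity indication information in the field description information of field 2 is read next.
The identity indication information can be represented more concisely than the identity itself, so the concise indication can be checked first, and the more complex identity is read only when the field description information is found to include one.
In addition, optionally, if the field description information does not include identity indication information, or the identity indication information is empty, the identity can be read from default field description information.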
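The lookup described above can be sketched in Python as follows. The dictionary layout and the key names `has_identity` and `identity` are hypothetical stand-ins for the identity indication information and the configured identity; the fallback argument `default_field` is likewise an assumption about how a "default" field description might be named.

```python
def find_identity(field_descriptions, default_field=None):
    """Scan field descriptions in order and return the configured identity.

    field_descriptions: {field name: {"has_identity": bool, "identity": str}}
    (a hypothetical layout). The cheap boolean flag is checked first, and the
    identity itself is read only from a field whose flag is True.
    """
    for description in field_descriptions.values():
        if description.get("has_identity") is True:
            return description["identity"]
    # No field carries a usable flag: fall back to the default field, if any.
    if default_field is not None:
        return field_descriptions[default_field].get("identity")
    return None
```

For data source a in the example, field 1 would carry `has_identity: False`, so the scan moves on to field 2 and reads the identity there only if its flag is `True`.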
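In another implementation, the identity can be stored as a separate parameter associated with the name of the data source. To obtain the identity of a data source, this parameter is read for the name of that data source.
The configuration file also stores the storage path of each data source, so after the configuration file is obtained, the storage path of each data source can be obtained from it. The storage path can be stored as a separate parameter associated with the name of the data source, or it can be stored in one of the pieces of field description information.
Optionally, in the embodiments of this application, the device that executes step S210 may be different from the device that executes steps S220 to S240. For example, if steps S220 to S240 are executed by a single device, the device that executes step S210 is different from that single device; if steps S220 to S240 are executed by a system, such as a hadoop cluster or other cluster equipment, the device that executes step S210 is not part of that system.
Of course, the device that executes step S210 and the device that executes steps S220 to S240 can also be the same device, or devices in the same system or cluster.
The configuration file is obtained from the device that set it up. For example, if the configuration file is set up by an electronic device and the data merging according to the configuration file is performed by a hadoop cluster, the electronic device can submit the configuration file to the hadoop cluster through the MR task submission interface, so that the hadoop cluster obtains the configuration file from the electronic device that set it up.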
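Step S230: Obtain each data source according to the storage path of the data source in the configuration file.
Step S240: Merge data sources with the same identity into one data file, and delete the fields in each data source that do not participate in the merge.
Each data source can be obtained according to its storage path in the configuration file, and each data source is parsed according to the data structure configured in the configuration file, so that data sources with the same identity are merged into one data file.
In some implementations, to reduce the processing difficulty of data merging, after the data sources are obtained they are sorted by identity according to the identities in the configuration file, so that data sources with the same identity are adjacent after sorting. Adjacent data sources with the same identity are then merged into one data file. For example, if the identities of the data sources are English letters, then after the data sources are sorted in identity order, data sources whose identity is the same letter are adjacent, and adjacent data sources with the same letter are merged into one data file.
In this implementation, when adjacent data sources with the same identity are merged into one data file, the first data source after sorting can be used as the starting data source, and the data sources are traversed in turn. When the traversal reaches a data source whose identity differs from that of the previous data source, all data sources from the starting data source through the previous data source are merged into one data file. The data source currently traversed then becomes the new starting data source and the traversal continues; when it again reaches a data source whose identity differs from that of the previous one, all data sources from the starting data source through the previous data source are merged into one data file, and the data source currently traversed again becomes the new starting data source. This loop continues until all adjacent data sources with the same identity have been merged into data files. It can be understood that when the traversal reaches the last data source, there is no next data source, so the last data source and the data sources not yet merged can be merged into one data file.
Taking the foregoing example, there are data sources a, b, c, d, and e; a, b, and c are configured with the same identity A in the configuration file, and d and e are configured with the same identity B. After sorting by identity, data sources a, b, and c are adjacent and data sources d and e are adjacent, for example in the order a, b, c, d, e. During the traversal from data source a to data source c, the identity is always A; when d is reached, its identity differs from that of the previous data source c, so data sources a, b, and c are merged into one data file. From data source d to data source e the identity is the same, and data source e is the last data source, so the remaining data sources d and e are merged into one data file.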
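The sort-then-traverse procedure can be sketched in Python as follows. Representing each data source as an `(identity, content)` tuple and each merged data file as a list of contents are simplifying assumptions for illustration only.

```python
def merge_sorted_runs(sources):
    """Sort data sources by identity, then merge each run of equal identities.

    sources: list of (identity, content) tuples (an assumed representation).
    Returns a list of (identity, [contents]) pairs, one per merged data file.
    """
    ordered = sorted(sources, key=lambda source: source[0])
    merged = []
    start = 0  # index of the current starting data source
    for i in range(1, len(ordered) + 1):
        # A run ends when the identity changes or when the list is exhausted
        # (the "last data source" case in the text).
        if i == len(ordered) or ordered[i][0] != ordered[start][0]:
            contents = [content for _, content in ordered[start:i]]
            merged.append((ordered[start][0], contents))
            start = i  # the data source just reached becomes the new start
    return merged
```

Because the input is sorted first, one linear pass is enough and no lookup table of identities has to be kept during the merge.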
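In some implementations, multiple data sources can be merged in parallel. Specifically, data sources with different identities can be selected. For each selected data source, the data sources that have not been selected and have not yet been merged are searched for data sources with the same identity, and all data sources found with the same identity are merged with the selected data source. In this implementation, the number of data sources selected at the same time can be determined by the number of parallel processing channels available; for example, with 5 parallel processing channels, 5 data sources can be selected at the same time for searching and merging.
Optionally, in this implementation, the data sources can first be sorted as described in the foregoing implementation before the search and merge are performed.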
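A rough Python sketch of the parallel variant follows. Using a thread pool to stand in for the "parallel processing channels", and the `(identity, content)` tuple representation, are both assumptions; the application does not say how the channels are implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_group(identity, sources):
    """Collect every data source that carries the given identity."""
    return identity, [content for ident, content in sources if ident == identity]


def parallel_merge(sources, channels=5):
    """Merge data sources per identity, with up to `channels` groups in flight.

    sources: list of (identity, content) tuples (an assumed representation).
    """
    identities = sorted({ident for ident, _ in sources})
    with ThreadPoolExecutor(max_workers=channels) as pool:
        results = pool.map(lambda ident: merge_group(ident, sources), identities)
        return dict(results)
```

Each worker handles one identity at a time, which mirrors selecting one data source per channel and searching the remaining data sources for matches.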
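In the embodiments of this application, a merged data file usually has a corresponding usage scenario; for example, it can provide data support in scenarios such as data analysis and search recommendation. However, not all content in a data source is necessarily useful in a given usage scenario, so the merged data file can omit unneeded content, which makes the data file more concise and reduces the storage space it occupies. For example, in a scenario where users' attitudes toward an article need to be counted, the commenters, the comment times, and the users who liked the article are all useless for that scenario and can be deleted.
Therefore, in the embodiments of this application, because each data source is divided into fields, the fields containing content that does not need to be merged can be set as fields that do not participate in the merge, and deleting the unneeded parts of a data source can mean deleting the fields in each data source that do not participate in the merge.
Specifically, the configuration file can include merge indication information indicating whether each field in a data source participates in the merge. For example, the field description information of each field can optionally include merge indication information indicating whether that field participates in the merge; alternatively, a dedicated parameter can be set for each data source in the configuration file as merge indication information, indicating which fields in the data source do not participate in the merge.
In the data file obtained by merging data sources with the same identity, the fields that do not participate in the merge have been deleted.
In one implementation, after data sources with the same identity are merged into one data file, the fields that do not participate in the merge can be deleted from the merged data file according to the merge indication information in the configuration file.
In another implementation, to reduce the amount of data processed during merging and to make it easier to delete the fields that do not participate in the merge, merging data sources with the same identity into one data file can consist of first deleting the fields that do not participate in the merge from each data source according to the merge indication information in the configuration file, and then merging the data sources with the same identity into one data file.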
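The "delete first, then merge" variant can be sketched in Python as follows. The nested-dictionary layout, the per-field boolean flags standing in for the merge indication information, and the `source.field` key naming used to keep fields from different data sources apart are all illustrative assumptions.

```python
def drop_excluded_fields(fields, merge_flags):
    """Keep only fields whose merge indication is True (kept by default)."""
    return {name: value for name, value in fields.items()
            if merge_flags.get(name, True)}


def merge_sources(sources, merge_flags):
    """Merge data sources that share one identity into a single record.

    sources:     {source name: {field name: value}}
    merge_flags: {source name: {field name: bool}}  (hypothetical layout)
    """
    merged = {}
    for source_name, fields in sources.items():
        kept = drop_excluded_fields(fields, merge_flags.get(source_name, {}))
        for field, value in kept.items():
            # Prefixing with the source name is an assumption to avoid clashes.
            merged[f"{source_name}.{field}"] = value
    return merged
```

Filtering before merging means that excluded fields are never copied into the merged record, which is the reduction in processed data that this implementation aims for.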
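In the embodiments of this application, when the configuration file configures only one data source, that is, the configuration file contains configuration information such as the identity, the storage path, and the field description information of the fields for only one data source, the merge indication information can be used to filter the content of that data source: the fields that do not participate in the merge are deleted from the data source, leaving only the content of interest.
Specifically, in the embodiments of this application, before the data sources are merged, it can be determined whether the number of data sources is equal to 1, for example by determining whether the configuration file contains the configuration information of only one data source. If the number of data sources is equal to 1, there is no other data source to merge with, so the fields that do not participate in the merge can be deleted from the data source according to the merge indication information in the configuration file, and the data source is then used as the merged data file, which filters the content of the data file. If the number of data sources is greater than 1, data sources with the same identity need to be merged into one data file, and the merge operation of the embodiments of this application is performed.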
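The branch on the number of data sources can be sketched as follows. The record layout and the `keep_fields` set, used here in place of per-field merge indication information, are hypothetical simplifications.

```python
def merge_or_filter(sources, keep_fields):
    """Filter a lone data source, or merge several by identity.

    sources:     list of {"identity": str, "fields": {name: value}}
    keep_fields: set of field names that participate in the merge
                 (a hypothetical stand-in for merge indication information)
    Returns one dict of fields per resulting data file.
    """
    def filtered(source):
        return {k: v for k, v in source["fields"].items() if k in keep_fields}

    if len(sources) == 1:
        # Nothing to merge with: the filtered source is the "merged" file.
        return [filtered(sources[0])]

    merged = {}
    for source in sources:
        merged.setdefault(source["identity"], {}).update(filtered(source))
    return list(merged.values())
```

With a single data source the function acts purely as a content filter, which matches the use of merge indication information described above.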
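In addition, some data sources have fields of array type. Specifically, a field of array type contains multiple parallel items that relate to the same concept. In the configuration file, the data type in the field description information of such a field can be configured as an array, and it can be specified which of the parallel items in the array participate in the merge. Which items participate can be determined by a preset designation rule for the array type. For example, the preset rule may specify that only the first item in an array-type field participates in the merge; or, if the array-type field includes a pointer, the preset rule may specify that the item the pointer points to is the one that participates in the merge; or the preset rule may specify that all items in the array participate in the merge. Alternatively, the user can configure in the field description information which items of the array-type field participate in the merge.
Items in an array-type field that do not participate in the merge do not appear in the final merged data file. Specifically, before the data sources are merged, the items in the array-type field that do not participate in the merge can be deleted, and the data source is then merged with the data sources that have the same identity; alternatively, after the data sources with the same identity are merged into one data file, the items in the array-type field of the data file that do not participate in the merge can be deleted.
For example, an article data source includes three article tags: current affairs, sports, and footwear. If the tags in the article are treated as one field, that field contains three parallel items for the concept of article tag: current affairs, sports, and footwear. If sports and footwear are the article's real tags, the field description information of the field can specify that sports and footwear participate in the merge, so that the merged data file does not include the current affairs tag.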
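The selection of array items can be sketched in Python as follows. The rule names `"first"`, `"listed"`, and `"all"` are invented for illustration and correspond loosely to the preset designation rules and the user-configured selection described above; the pointer-based rule is omitted.

```python
def select_array_items(items, rule="all", keep=None):
    """Return the items of an array-type field that participate in the merge.

    rule: "first"  -> only the first item participates
          "listed" -> only items named in `keep` participate
          "all"    -> every item participates
    The rule names are hypothetical, not taken from the application.
    """
    if rule == "first":
        return items[:1]
    if rule == "listed":
        wanted = set(keep or [])
        return [item for item in items if item in wanted]
    return list(items)
```

For the article example, calling the function with the `"listed"` rule and `keep=["sports", "footwear"]` drops the current affairs tag before the merge.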
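In the embodiments of this application, after the data sources are merged into data files, the data files can be output. Specifically, the location to which the merged files are to be output can be configured in the configuration file; this location is defined as the designated location. After merging, the data files are output to the designated location. Of course, the output location can also be obtained together with the configuration file instead of being specified in it; for example, the output location is specified when the party that generates the configuration file submits it to the party that obtains it.
In one implementation, all merged data files can be output to one designated location.
Optionally, in this implementation, each data file can be output to the same designated location as an independent file.
Optionally, in this implementation, all data files can be combined into one file and output to the designated location.
In another implementation, the configuration file can include the number of copies of the files to be output, which is a preset number. After all data sources are merged into data files, all merged data files can be split into the preset number of copies for output. For example, if 100 data files are obtained after merging and the preset number is 5, the files can be split into 5 copies of 20 data files each. The number of data files in each copy is not limited. The number of copies to be output can also be obtained together with the configuration file instead of being specified in it; for example, it is specified when the party that generates the configuration file submits it to the party that obtains it.
Optionally, all data files can be combined into one file, and that file is then split into the preset number of copies.
Optionally, each data file can be kept as an independent file, and all the data files are divided into the preset number of copies.
In this implementation, the configuration file can include the output location of each copy. When the data files are output, each copy can be output to the location specified for it in the configuration file.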
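Splitting the merged data files into a preset number of copies can be sketched as follows. Distributing the files as evenly as possible is an assumption made here; the text states only that the number of data files per copy is not limited.

```python
def split_into_copies(files, copies):
    """Split a list of merged data files into `copies` consecutive groups.

    Sizes differ by at most one; the even split is an illustrative choice.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    base, extra = divmod(len(files), copies)
    result, start = [], 0
    for index in range(copies):
        size = base + (1 if index < extra else 0)
        result.append(files[start:start + size])
        start += size
    return result
```

With 100 merged data files and a preset number of 5, every copy contains 20 data files, as in the example above; each copy could then be written to the output location configured for it.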
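In yet another implementation, a designated location can be configured for each data file separately, and the configuration file includes the output location of each data file. For example, in the configuration file, a designated location is configured for one of the data sources that share the same identity; after the data sources with the same identity are merged into one data file, the data file is output to the designated location configured for that data source, so that each data file is output to the location specified in the configuration file.
In the embodiments of this application, the data sources can be processed and merged on a hadoop cluster using the Map-Reduce computing model. Specifically, the configuration file can be obtained through the task obtaining interface of the Map-Reduce task. The map program in the Map-Reduce task obtains each data source according to its storage path in the configuration file, and can also parse, sort, and otherwise process the data sources. The processed data sources are sent to the Linux standard output data stream. The reduce program in the Map-Reduce task can read data from this stream, merge data sources with the same identity into one data file, and output the result to the output location specified in the configuration file.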
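The division of work between the map and reduce programs can be sketched in plain Python. This is not Hadoop code: the tab-separated `identity<TAB>content` line format is an assumption modeled on stream-based map and reduce programs, and the sort that the framework would perform between the two stages is done explicitly by the caller.

```python
from itertools import groupby


def mapper(sources):
    """Emit one 'identity<TAB>content' line per data source.

    sources: iterable of (identity, content) tuples (an assumed representation).
    In a real job these lines would be written to standard output.
    """
    for identity, content in sources:
        yield f"{identity}\t{content}"


def reducer(sorted_lines):
    """Group consecutive lines that share an identity into one merged record.

    sorted_lines must already be ordered by identity, as they would be after
    the sort between the map and reduce stages.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for identity, group in groupby(pairs, key=lambda pair: pair[0]):
        yield identity, [content for _, content in group]
```

A local run can be simulated with `list(reducer(sorted(mapper(sources))))`, which sorts the mapper output by identity before it reaches the reducer.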
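Optionally, in the embodiments of this application, the configuration file can also specify the number of retries for data exceptions, or the number of retries can be specified when the party that generates the configuration file submits it to the party that obtains it. During data processing, if a data exception occurs, the hadoop cluster can retry multiple times; when the number of retries reaches the specified number, the abnormal device in the cluster is located and another device takes over the data processing in its place.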
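The retry-then-switch behavior can be sketched as follows. The function and parameter names are hypothetical, and treating the retry count as the number of attempts made after the first failure is an interpretation; the text does not define it precisely.

```python
def run_with_retries(task, devices, max_retries):
    """Run `task` on the first device; after too many failures, switch devices.

    task:        callable taking a device name and returning a result
    devices:     candidate devices in order of preference
    max_retries: retries allowed per device after its first failed attempt
                 (an assumed reading of the configured retry count)
    """
    last_error = None
    for device in devices:
        for _ in range(max_retries + 1):
            try:
                return task(device)
            except Exception as error:  # any data exception triggers a retry
                last_error = error
    raise RuntimeError("all devices failed") from last_error
```

Once a device has used up its retries it is treated as abnormal, and the next device in the list takes over the processing.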
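In the data merging method provided by the embodiments of this application, field description information is added in the configuration file for each field of the structured data sources, so that the fields in a data source that do not participate in the merge can be determined from the field description information. The fields that do not participate in the merge can be deleted from the data file obtained by merging data sources with the same identity, which makes the merging process simpler and more efficient and makes the merged data file more targeted.
In addition, the data merging method supports user-defined document merging through the configuration of the configuration file, is implemented in software, and can be used in any Linux environment that provides Map-Reduce computing capabilities.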
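As shown in FIG. 4, an embodiment of this application also provides a data merging device 300. The data merging device 300 includes: a file obtaining module 310 for obtaining a configuration file, where the configuration file includes the storage path of each data source and the identity of each data source, and data sources that need to be merged into one data file are configured with the same identity; a data source obtaining module 320 for obtaining each data source according to the storage path of the data source in the configuration file; and a merging module 330 for merging data sources with the same identity into one data file.
Optionally, the merging module 330 may include a sorting unit for sorting the data sources by identity so that data sources with the same identity are adjacent, and a merging unit for merging adjacent data sources with the same identity into one data file.
Optionally, the merging unit can use the first data source after sorting as the starting data source and traverse the data sources in turn; when the traversal reaches a data source whose identity differs from that of the previous data source, all data sources from the starting data source through the previous data source are merged into one data file; the data source currently traversed is then used as the new starting data source and the traversal continues in the same way. Optionally, when the traversal reaches the last data source, the last data source and the data sources not yet merged are merged into one data file.
Optionally, each data source is divided into one or more fields, and the configuration file includes merge indication information indicating whether each field in the data source participates in the merge. The merging module 330 can be used to delete the fields that do not participate in the merge from each data source according to the merge indication information in the configuration file, and then merge the data sources with the same identity into one data file.
Optionally, each data source is divided into one or more fields, and the configuration file includes merge indication information indicating whether each field in the data source participates in the merge. The merging module 330 can be used to delete the fields that do not participate in the merge from the merged data file according to the merge indication information in the configuration file.
Optionally, each data source is divided into one or more fields, and the configuration file includes merge indication information indicating whether each field in the data source participates in the merge. The merging module 330 can be used to determine whether the number of data sources is equal to 1; if the number of data sources is equal to 1, to delete the fields that do not participate in the merge from the data source according to the merge indication information in the configuration file and use the data source as the merged data file; and if the number of data sources is greater than 1, to merge the data sources with the same identity into one data file.
Optionally, the configuration file includes field description information of the fields in each data source, and the field description information of each field includes merge indication information indicating whether the field participates in the merge.
Optionally, the device 300 further includes a configuration module for determining the fields in each data source, obtaining the field description information of each field, and configuring the field description information of each field of each data source in the configuration file.
Optionally, the configuration module can also be used to configure an identity for each data source in the configuration file.
Optionally, data sources that are related to each other are given the same identity when they are generated. The configuration module can also be used to obtain the identity of each data source from the data source itself, for configuring the identity of the corresponding data source in the configuration file.
Optionally, the device 300 may further include an information obtaining module for obtaining the identity of each data source from the configuration file.
Optionally, each data source is divided into one or more fields, the configuration file includes field description information of the fields in each data source, and the field description information of each field includes identity indication information indicating whether the field includes an identity. The information obtaining module can be used to determine, for each data source, whether each piece of field description information includes an identity according to its identity indication information, and, when a piece of field description information is determined to include an identity, to obtain the identity from that field description information.
Optionally, the configuration file includes a designated location to which the merged file is output. The device 300 may further include an output module for outputting the merged data file to the designated location.
Optionally, the configuration file includes the number of copies of the files to be output, which is a preset number. The output module can be used to split all the merged data files into the preset number of copies for output.
Optionally, the configuration file includes the output location of each copy of the data files. The output module can be used to output each copy to the location specified in the configuration file.
Optionally, the file obtaining module 310 can be used to obtain the configuration file through the task obtaining interface of the Map-Reduce task. The data source obtaining module 320 can be used to obtain each data source according to its storage path in the configuration file through the map program in the Map-Reduce task. The merging module 330 can be used to merge data sources with the same identity into one data file through the reduce program in the Map-Reduce task.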
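Those skilled in the art can clearly understand that, for convenience and conciseness of description, the above method embodiments can be referred to one another, and for the specific working processes of the devices and modules described above, reference can be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, the coupling between the modules may be electrical, mechanical, or in other forms.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. The modules can be configured in different electronic devices or in the same electronic device, which is not limited in the embodiments of this application.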
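Referring to FIG. 5, a structural block diagram of an electronic device 400 provided by an embodiment of this application is shown. The electronic device 400 may be a smart device such as a smart phone, a tablet computer, or a computer. The data merging method and device in the embodiments of this application can be executed by one electronic device, or by multiple electronic devices working together, such as a system cluster composed of multiple servers.
The electronic device may include one or more processors 410 (only one is shown in the figure), a memory 420, and one or more programs. The one or more programs are stored in the memory 420 and configured to be executed by the one or more processors 410. The one or more programs are configured to execute the methods described in the foregoing embodiments. If the method described in the foregoing embodiments is executed by multiple electronic devices working together, each electronic device may be configured with the part of the program it is to execute.
The processor 410 may include one or more processing cores. The processor 410 uses various interfaces and lines to connect the parts of the entire electronic device 400, and performs the various functions of the electronic device 400 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 420 and by calling data stored in the memory 420. Optionally, the processor 410 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 410 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, and application programs; the GPU is responsible for rendering and drawing display content; the modem is used for processing wireless communication. It can be understood that the modem may also not be integrated into the processor 410 and may instead be implemented by a separate communication chip.
The memory 420 may include random access memory (RAM) or read-only memory (ROM). The memory 420 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 420 may include a storage program area and a storage data area, where the storage program area may store instructions for implementing an operating system, instructions for implementing at least one function, instructions for implementing each of the foregoing method embodiments, and the like. The storage data area may store data created by the electronic device during use, and the like.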
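Referring to FIG. 6, a structural block diagram of a computer-readable storage medium provided by an embodiment of this application is shown. The computer-readable storage medium 500 stores program code, and the program code can be invoked by a processor to execute the methods described in the foregoing method embodiments.
The computer-readable storage medium 500 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM, a hard disk, or ROM. Optionally, the computer-readable storage medium 500 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 500 has storage space for program code 510 that executes any of the method steps in the above methods. The program code can be read from or written into one or more computer program products. The program code 510 may, for example, be compressed in an appropriate form.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.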

Claims (20)

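  1. A data merging method, characterized in that the method comprises:
    obtaining a configuration file, the configuration file including the storage path of each data source and the identity of each data source, wherein data sources that need to be merged into one data file are configured with the same identity;
    obtaining each data source according to the storage path of the data source in the configuration file;
    merging data sources with the same identity into one data file.
  2. The method according to claim 1, characterized in that the merging data sources with the same identity into one data file comprises:
    sorting the data sources by identity so that data sources with the same identity are adjacent;
    merging adjacent data sources with the same identity into one data file.
  3. The method according to claim 2, characterized in that the merging adjacent data sources with the same identity into one data file comprises:
    using the first data source after sorting as the starting data source;
    traversing the data sources in turn;
    when the traversal reaches a data source whose identity differs from that of the previous data source, merging all data sources from the starting data source through the previous data source into one data file;
    using the data source currently traversed as the new starting data source,
    and performing the traversing of the data sources in turn; when the traversal reaches a data source whose identity differs from that of the previous data source, merging all data sources from the starting data source through the previous data source into one data file, and using the data source currently traversed as the new starting data source.
  4. The method according to claim 3, characterized in that when the traversal reaches the last data source, the last data source and the data sources not yet merged are merged into one data file.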
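  5. The method according to any one of claims 1-4, characterized in that each data source is divided into one or more fields, the configuration file includes merge indication information indicating whether each field in the data source participates in the merge, and the merging data sources with the same identity into one data file comprises:
    deleting the fields that do not participate in the merge from each data source according to the merge indication information in the configuration file, and then merging the data sources with the same identity into one data file.
  6. The method according to any one of claims 1-4, characterized in that each data source is divided into one or more fields, the configuration file includes merge indication information indicating whether each field in the data source participates in the merge, and after the merging data sources with the same identity into one data file, the method further comprises:
    deleting the fields that do not participate in the merge from the merged data file according to the merge indication information in the configuration file.
  7. The method according to any one of claims 1-6, characterized in that each data source is divided into one or more fields, the configuration file includes merge indication information indicating whether each field in the data source participates in the merge, and before the merging data sources with the same identity into one data file, the method further comprises:
    determining whether the number of data sources is equal to 1;
    if the number of data sources is equal to 1, deleting the fields that do not participate in the merge from the data source according to the merge indication information in the configuration file, and then using the data source as the merged data file;
    if the number of data sources is greater than 1, performing the step of merging data sources with the same identity into one data file.
  8. The method according to any one of claims 5-7, characterized in that the configuration file includes field description information of the fields in each data source, and the field description information of each field includes merge indication information indicating whether the field participates in the merge.
  9. The method according to claim 8, characterized in that before the obtaining a configuration file, the method further comprises:
    determining the fields in each data source, and obtaining the field description information of each field;
    configuring the field description information of each field of each data source in the configuration file.
  10. The method according to any one of claims 1-9, characterized in that before the obtaining a configuration file, the method further comprises:
    configuring an identity for each data source in the configuration file.
  11. The method according to claim 10, characterized in that data sources that are related to each other are given the same identity when they are generated, and before the configuring an identity for each data source in the configuration file, the method further comprises:
    obtaining the identity of each data source from the data source, for configuring the identity of the corresponding data source in the configuration file.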
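  12. The method according to any one of claims 1-11, characterized in that before the merging data sources with the same identity into one data file, the method further comprises: obtaining the identity of each data source from the configuration file.
  13. The method according to claim 12, characterized in that each data source is divided into one or more fields, the configuration file includes field description information of the fields in each data source, the field description information of each field includes identity indication information indicating whether the field includes an identity, and the obtaining the identity of each data source from the configuration file comprises:
    for each data source, determining whether each piece of field description information includes an identity according to the identity indication information of that field description information;
    when a piece of field description information is determined to include an identity, obtaining the identity from the field description information that includes it.
  14. The method according to any one of claims 1-13, characterized in that the configuration file includes a designated location to which the merged file is output, and after the merging data sources with the same identity into one data file, the method further comprises:
    outputting the merged data file to the designated location.
  15. The method according to any one of claims 1-14, characterized in that the configuration file includes the number of copies of the files to be output, which is a preset number, and the method further comprises:
    splitting all the merged data files into the preset number of copies for output.
  16. The method according to claim 15, characterized in that the configuration file includes the output location of each copy of the data files, and the splitting all the merged data files into the preset number of copies for output comprises:
    outputting each copy of the data files to the location specified in the configuration file.
  17. The method according to claim 1, characterized in that the method further comprises:
    obtaining the configuration file through the task obtaining interface of a Map-Reduce task;
    obtaining each data source according to the storage path of the data source in the configuration file through the map program in the Map-Reduce task;
    merging data sources with the same identity into one data file through the reduce program in the Map-Reduce task.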
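  18. A data merging device, characterized in that the device comprises:
    a file obtaining module for obtaining a configuration file, the configuration file including the storage path of each data source and the identity of each data source, wherein data sources that need to be merged into one data file are configured with the same identity;
    a data source obtaining module for obtaining each data source according to the storage path of the data source in the configuration file;
    a merging module for merging data sources with the same identity into one data file.
  19. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to execute the method according to any one of claims 1-17.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and the program code can be invoked by a processor to execute the method according to any one of claims 1-17.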
PCT/CN2019/112037 2019-10-18 2019-10-18 Data merging method, device, electronic equipment, and storage medium WO2021072776A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
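CN201980099361.7A CN114258541A (zh) 2019-10-18 2019-10-18 Data merging method, device, electronic equipment, and storage medium
PCT/CN2019/112037 WO2021072776A1 (zh) 2019-10-18 2019-10-18 Data merging method, device, electronic equipment, and storage medium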

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/112037 WO2021072776A1 (zh) 2019-10-18 2019-10-18 Data merging method, device, electronic equipment, and storage medium

Publications (1)

Publication Number Publication Date
WO2021072776A1 true WO2021072776A1 (zh) 2021-04-22

Family

ID=75537404

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/112037 WO2021072776A1 (zh) 2019-10-18 2019-10-18 Data merging method, device, electronic equipment, and storage medium

Country Status (2)

Country Link
CN (1) CN114258541A (zh)
WO (1) WO2021072776A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089436B (zh) * 2022-11-29 2023-11-07 荣耀终端有限公司 Data auditing method for large data volumes and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101902335A (zh) * 2009-05-27 2010-12-01 北京启明星辰信息技术股份有限公司 Method for data filtering and merging
CN102780780A (zh) * 2012-07-25 2012-11-14 中国联合网络通信集团有限公司 Data processing method, device, and system in cloud computing mode
CN103390003A (zh) * 2012-05-09 2013-11-13 人人游戏网络科技发展(上海)有限公司 Method and device for merging user data information between servers
CN103577276A (zh) * 2012-07-18 2014-02-12 深圳市腾讯计算机系统有限公司 Backup system and method for user operation data
US9497097B2 (en) * 2012-03-12 2016-11-15 Texas Instruments Incorporated Inserting sequence numbers into data blocks merged from data streams
CN110097170A (zh) * 2019-04-25 2019-08-06 深圳市豪斯莱科技有限公司 Method for obtaining an information push object prediction model, terminal, and storage medium

Also Published As

Publication number Publication date
CN114258541A (zh) 2022-03-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19949246

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19949246

Country of ref document: EP

Kind code of ref document: A1