WO2017181614A1 - Streaming data positioning method, apparatus and electronic device - Google Patents

Streaming data positioning method, apparatus and electronic device

Info

Publication number
WO2017181614A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
positioning
sampling
target
segment
Prior art date
Application number
PCT/CN2016/101092
Other languages
English (en)
French (fr)
Inventor
赵富欣
Original Assignee
乐视控股(北京)有限公司
乐视网信息技术(北京)股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视网信息技术(北京)股份有限公司
Publication of WO2017181614A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • The present application relates to the field of data processing, and in particular to a streaming data positioning method and apparatus.
  • HDFS: Hadoop Distributed File System.
  • In an Internet environment, for example, the number of messages generated at every moment is very large, and the back end collects these messages through a message pipeline.
  • The received streaming data is not necessarily all kept in memory; it may also be persisted to disk.
  • The data on disk is stored block by block. As time passes and the amount of data grows, a new data block is created once the amount of data stored in the current block reaches a certain size.
  • The difficulty lies in locating the streaming data according to the data requirement sent by the user, for example, locating the data of a certain day or of certain hours within a day.
  • This is because the unit of data storage is the block, not a time granularity.
  • A block is the smallest unit of data storage. Therefore, if a certain type of data has only one block per month, that block contains the 30 days of data for the month, and positioning can only reach month granularity, not day granularity; if a certain type of data has 10 blocks per day, the positioning precision is finer than one day.
  • Existing streaming data positioning solutions cannot locate data precisely; they yield only a coarse position, and the accuracy is low.
  • For example, the user wants the data of March 1, but that data is stored within the March data.
  • Coarse positioning can only establish that the data segment the user needs is stored in the March data block;
  • it cannot tell which piece of data inside that block is required. If all the data in the block were uploaded to the processing system, the workload would be very large.
  • The embodiments of the present application provide a streaming data positioning method, apparatus, and electronic device, which address the defects of inaccurate data positioning and low efficiency when data is uploaded to a distributed system in the prior art, and achieve accurate and efficient data positioning.
  • the embodiment of the present application provides a streaming data positioning method, including:
  • the embodiment of the present application provides a streaming data positioning apparatus, including:
  • a first positioning module configured to receive a data positioning parameter, query the description information of the corresponding data block according to the data positioning parameter, and determine a data segment where the target data is located;
  • a sampling module configured to perform data sampling on the data segment with a preset step size, and to acquire a data identification identifier of the data sampling result;
  • a second positioning module configured to determine, according to the data identification identifier and the data positioning parameter, whether the data sampling result includes the target data and, when the determination is yes, to examine the data sampling result record by record until the target data is located.
  • the embodiment of the present application further provides a non-transitory computer storage medium storing computer executable instructions for performing the above-described streaming data positioning method of the present application.
  • An embodiment of the present application provides a data positioning electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor;
  • the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the web page display method of any of the above-described embodiments of the present application.
  • the embodiment of the present application further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described web page display method of the present application.
  • The streaming data positioning method and apparatus provided by the embodiments of the present application obtain the data segment in which the target data is located according to the data positioning parameter, sample that data segment, and locate the data according to the sampling result. This replaces the cumbersome prior-art operation of comparing data segments one by one when locating streaming data, achieves accurate data positioning, and improves the efficiency of uploading data to a distributed file system.
  • FIG. 1 is a technical flowchart of Embodiment 1 of the present application.
  • FIG. 2 is a technical flowchart of Embodiment 2 of the present application.
  • FIG. 3 is a schematic structural diagram of the apparatus of Embodiment 3 of the present application.
  • FIG. 4 is a schematic structural diagram of a data positioning electronic device according to Embodiment 4 of the present application.
  • The embodiments of the present application may have the following application scenario: streaming data needs to be uploaded to a distributed system for parallel processing, and before the upload, the currently saved data needs to be located according to the target data requirement. That is, the target data is found precisely within a large amount of stored streaming data, and once found it is uploaded to the distributed system for processing.
  • a data positioning method mainly includes the following steps:
  • Step S110 Receive a data positioning parameter, and query the description information of the corresponding data block according to the data positioning parameter to determine a data segment where the target data is located;
  • Step S120 Perform data sampling on the data segment in a preset step size, and acquire a data identification identifier of the data sampling result;
  • Step S130 determining, according to the data identification identifier and the data positioning parameter, whether the data sampling result includes the target data, and when the determination is yes, determining the data sampling result one by one until the target data is located.
  • The data positioning parameter is sent by the data requester and is the basis on which data positioning is performed.
  • The data positioning parameter may include a time label, a line number, an index number, an offset, and the like corresponding to the data. For example, if a user needs the data of a particular date ("xx year x month x day") for analysis, that date is the data positioning parameter; if the user needs the data with index number xxx, "index number xxx" is the data positioning parameter.
  • Querying the description information of the data blocks according to the data positioning parameter means querying the description information of each existing data block, using the data positioning parameter as the reference.
  • The description information of a data block is data that describes the data block, and contains descriptive information about the data block and its information resources. For example, it records from which day to which day the data stored in a given data block runs.
  • The description information of each data block is traversed according to the positioning requirement parameter to roughly locate the data block in which the target data is located, and finer positioning is then performed within those data blocks. This coarse query greatly narrows the scope of data positioning and saves positioning time.
  • The preset step size may be a variable, adjusted according to the total amount of data in the data block and the result of each sampling. For example, if a data block contains a very large amount of data and the sampling step size is too small, the number of samplings increases and positioning efficiency improves little; if a data block contains relatively little data and the sampling step size is too large, the sampled data easily makes up a large proportion of the block, and more data must be compared record by record during the subsequent precise positioning. The preset step size is therefore an empirical value related to the amount of data contained in the data block.
  • After each sampling, the sampling result is identified. If the data identification identifier of the sampled data shows that the sampling result is very close to the target data, sampling can continue with the original step size.
  • The sampling step size can also be increased appropriately, so that the data segment in which the target data is located is found in the most time-saving manner.
  • The data identification identifier is of the same kind as the data positioning parameter in step S110.
  • These identifiers are carried in the data itself, and may include a time label, a line number, an index number, an offset, and the like corresponding to the data.
  • In step S130, determining according to the data identification identifier and the data positioning parameter whether the data sampling result includes the target data means comparing the data positioning parameter received from the user
  • with the data identification identifier read from the sampled data, to see whether the two are consistent. For example, when the data positioning parameter is a time label, the time labels contained in the data identification identifiers of the sampled data are checked against it; if a matching time label exists, the data sampling result is determined to include the target data.
  • If the data sampling result does not include the target data, sampling of the data segment continues. Specifically, the data segment is resampled with the preset step size starting from the end position of the previous data sampling result, and the data identification identifier of the resampling result is obtained, until the resampling result is determined, according to the data identification identifier and the data positioning parameter, to include the target data.
  • Whether the data sampling result includes the target data may be determined as follows:
  • when the data positioning parameter is a data time label,
  • the data time label contained in the data identification identifier of the first record in the data sampling result is looked up and compared with the requested time label to see which of the two is earlier. Because streaming data is stored in chronological order, the records have a fixed order in time, so comparing the time labels of the sampled data is enough to decide whether the sample contains the target data. If the first record of the sampling result is later than the time label of the positioning requirement, the target data must lie before the current sampling result, and the data after that first record cannot be the target data.
  • Conversely, if the first record is earlier than the requested time label, the target data lies after that first record, and sampling must continue, possibly several more times, in the remaining data segment.
  • The preset step size may be updated continuously during sampling. Choosing the updated value is an empirical process, and the embodiments of the present application do not limit the specific value.
  • For example, with a step size of 5000, samples are taken at positions 0+5000*N. First a record is sampled at 0+5000*1 and matched. If the target data is not covered, a record is sampled at 0+5000*2; if the target data is now found to be covered, it can be concluded that the target data lies between the 5000th and the 10000th records. Sampling then continues within the 5,000 to 10,000 range, and the step size needs to be updated.
  • The updated step size is half the number of records in the sampled range. That is, one record is sampled at position 7500 and compared with the positioning parameter. If there is no match, the sampling step size continues to be updated (reduced). Once the number of records in the sampled range falls below a preset data volume threshold, all of those records are fetched at once and matched one by one; the step size at this point is 1.
  • The data volume threshold may be 200: when fewer than 200 records remain, the time cost of a network access is comparable to the time cost of a local query, so stopping the sampling at that point is the most efficient choice.
  • When deciding whether a sampling result includes the target data, the data identification identifier of the first record in the current data sampling result is generally the one compared with the data positioning parameter. If the identifier of the first record is inconsistent with the data positioning parameter, the remaining records in the current sampling result are not compared, and the current sampling result is directly determined not to include the target data.
  • Once a sampling result is determined to include the target data, sampling stops, which yields the full set of sampled data containing the target data. Within that sampled data, each record is then compared one by one against the data positioning parameter to identify the target data, achieving efficient positioning.
  • In this embodiment, the data segment in which the target data is located is obtained according to the data positioning parameter, the data segment is sampled, and the data is located according to the sampling result.
  • a data positioning method of the present application further includes the following feasible implementation steps:
  • Step S210 Receive a data positioning parameter, and query the description information of the corresponding data block according to the data positioning parameter to determine a data segment where the target data is located;
  • Step S220 Perform data sampling on the data segment in a preset step size, and acquire a data identification identifier of the data sampling result;
  • Step S230 Determine whether the data sampling result includes the target data according to the data identification identifier and the data positioning parameter. When the determination is yes, the data sampling result is determined one by one until the target data is located.
  • Step S240 segment the target data according to a preset policy.
  • Step S250 Encapsulate the result of the segmentation and generate a distributed parallel task to upload to the distributed file system.
  • Steps S210 to S230 are the same as steps S110 to S130 of Embodiment 1 and are not described again here.
  • The preset policy may take either of the following forms:
  • the target data is divided equally according to the number of computing nodes in the distributed cluster; or
  • a data segmentation threshold is calculated according to the number of computing nodes in the distributed cluster, the computing efficiency of the computing nodes, and the computing time requirement, and the target data is segmented according to the data segmentation threshold.
  • Equal division does not take into account the remaining computing resources or the computing power of each computing node in the server cluster; it divides the target data to be processed directly according to the number of computing nodes.
  • Its advantage is that it skips the analysis of each computing node's computing power and therefore saves time.
  • Under the threshold policy, the computing capability of each node in the server cluster, such as its computing efficiency and the computing time requirement, is obtained first, and the data is segmented with reference to those values.
  • In step S250, the located target data needs to be encapsulated before being uploaded to the distributed file system.
  • Each segment of data has a start position, an end position, the location of the storage server holding the data, and the metadata information of the data.
  • Encapsulation packages the start position and end position of each segment, the storage server location of the data, the metadata information of the data, and so on into a data object that a Hadoop MapReduce distributed task can recognize. Once the MapReduce task obtains this segment information, it can access the specific data in the segment and save the data to the distributed file system. Because MapReduce tasks are distributed and have ample resources, processing efficiency is very high.
  • In this embodiment, segmenting the data according to a preset policy takes full account of the server resource utilization and computing resource utilization of the distributed processing system, further improving the efficiency of uploading data to the distributed file system.
  • a data locating device includes the following modules:
  • the first positioning module 310 is configured to receive a data positioning parameter, and query the description information of the corresponding data block according to the data positioning parameter to determine a data segment where the target data is located;
  • the sampling module 320 is configured to perform data sampling on the data segment in a preset step size, and acquire a data identification identifier of the data sampling result;
  • the second positioning module 330 is configured to determine, according to the data identification identifier and the data positioning parameter, whether the data sampling result includes the target data and, when the determination is yes, to examine the data sampling result record by record until the target data is located.
  • the data positioning parameter specifically includes: a time label, a line number, an index number, and an offset corresponding to the data.
  • the sampling module 320 is specifically configured to: start from the starting position of the data segment and take the corresponding amount of data from the data segment according to the preset step size.
  • the second positioning module 330 is further configured to: when the determination is no, resample the data segment with the preset step size starting from the end position of the data sampling result, and obtain the data identification identifier of the resampling result, until the resampling result is determined, according to the data identification identifier and the data positioning parameter, to include the target data.
  • the second positioning module 330 is further configured to: during data sampling, update the preset step size according to the result of comparing the data identification identifier with the data positioning parameter.
  • the device further includes a segmentation module 340 configured to: segment the target data according to a preset policy, encapsulate the result of the segmentation, and generate a distributed parallel task to upload it to the distributed file system.
  • the preset policy includes: dividing the target data equally according to the number of computing nodes in the distributed cluster;
  • or calculating a data segmentation threshold according to the number of computing nodes in the distributed cluster, the computing efficiency of the computing nodes, and the computing time requirement, and segmenting the target data according to the data segmentation threshold.
  • the apparatus shown in FIG. 3 can perform the method of the embodiment shown in FIG. 1 and FIG. 2, and the implementation principle and technical effects are referred to the embodiment shown in FIG. 1 and FIG. 2, and details are not described herein again.
  • the embodiment of the present application provides a non-volatile computer storage medium, where the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the data positioning method in any of the foregoing method embodiments.
  • FIG. 4 is a schematic structural diagram of a data locating electronic device according to Embodiment 4 of the present application.
  • the device in this embodiment may be part of a data locating server or a data locating server, and the device may include:
  • One or more processors 401 and a memory 402; one processor 401 is taken as an example in FIG. 4.
  • the web page display electronic device may further include: an input device 403 and an output device 404.
  • the processor 401, the memory 402, the input device 403, and the output device 404 can be connected by a bus or other means.
  • the memory 402 is used as a non-transitory computer readable storage medium, and can be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the web page display method in the embodiment of the present application.
  • the processor 401 executes various functional applications and data processing of the server by running non-transitory software programs, instructions, and modules stored in the memory 402, that is, implementing the web page display method of the above method embodiment.
  • the memory 402 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application required for at least one function; the data storage area may store data created according to use of the web page display device, and the like.
  • memory 402 can include high speed random access memory, and can also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device.
  • memory 402 can optionally include memory located remotely from processor 401, and such remote memory can be connected via a network to the processing device for list item operations. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 403 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the web page display device.
  • Output device 404 can include a display device such as a display screen.
  • the one or more modules are stored in the memory 402, and when executed by the one or more processors 401, execute a webpage page display method in any of the above method embodiments.
  • the electronic device of the embodiment of the present application exists in various forms, including but not limited to:
  • Mobile communication devices: these devices are characterized by mobile communication functions, and their main purpose is to provide voice and data communication.
  • Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
  • Ultra-mobile personal computer devices: this type of device belongs to the category of personal computers, has computing and processing functions, and generally also has mobile Internet access.
  • Such terminals include PDA, MID, and UMPC devices, such as the iPad.
  • Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
  • Servers: a server consists of a processor, a hard disk, memory, a system bus, and so on.
  • A server's architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements in terms of processing power, stability, reliability, security, scalability, and manageability.
  • User comment data from a video website is stored on a storage medium, and the data consumer needs to process the comment data to analyze users' viewing interests and viewing hotspots.
  • The users' comment data is continuously generated streaming data; on the server side, the comment data is stored on the storage medium in the form of data blocks.
  • The data to be stored arrives continuously. Once the current data block is full, a new data block is opened to store the data that follows.
  • Each data block has description information. The description information of each data block is read to find out which time periods' comment data the block stores. Assume that, in this embodiment, the positioning parameter given by the user is the time label April 2016, and that data block 1 and data block 2 on the storage medium are found to store comment data from April 2016. The query then continues inside data block 1 and data block 2 to find which records are from April 2016, or even which records are from April 1, 2016 to April 7, 2016. At this point, all data blocks on the storage medium other than data block 1 and data block 2 can be set aside and are not queried. This avoids the prior-art defect of scanning all data blocks and improves the efficiency of coarse data positioning.
  • Data block 1 and data block 2 are each sampled with a certain sampling step size.
  • The following describes the sampling process in detail, taking data block 1 as an example.
  • Assume the sampling step size is 1000 records per sample.
  • 1000 records are taken from the starting position of data block 1, the time label of the first of these 1000 records is obtained, and the time corresponding to that label is checked. If the time is before April 1, 2016, these 1000 sampled records are discarded and sampling continues on data block 1.
  • The next sampling step can take 1000 records at a time or 5000 records at a time.
  • Each sampling starts where the previous one ended. Assuming another 1000 records are sampled, the first of those 1000 records is taken for comparison.
  • The sampling step size can be further increased. For example, starting from the end of the second sample, another 1000 records are sampled.
  • A record midway between the 2000th and the 3000th sampled records is then checked, that is, whether the 2500th record is later than April 7, 2016; if so, the sampling range continues to be narrowed in the same way.
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
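The segmentation policies of step S240 and the encapsulation of step S250 described above can be illustrated with a minimal sketch. Everything named here is hypothetical: `Segment` is a plain Python record standing in for whatever object the MapReduce task actually consumes, and the threshold formula in `split_by_threshold` is an assumption, since the text only names its inputs (node count, node computing efficiency, computing time requirement) without giving a formula.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for the per-segment object handed to a task."""
    start: int   # index of the first record in the segment (inclusive)
    end: int     # index one past the last record (exclusive)
    server: str  # storage server that holds the data


def split_by_size(start, end, size, server):
    """Cut [start, end) into consecutive segments of at most `size` records."""
    return [Segment(s, min(s + size, end), server) for s in range(start, end, size)]


def split_equally(start, end, node_count, server):
    """Policy 1: divide by node count only; yields at most node_count segments."""
    total = end - start
    if total <= 0:
        return []
    return split_by_size(start, end, math.ceil(total / node_count), server)


def split_by_threshold(start, end, node_count, records_per_second, time_budget_s, server):
    """Policy 2 (assumed formula): cap each segment at what one node can
    process within the time budget, and never exceed an equal share."""
    total = end - start
    if total <= 0:
        return []
    capacity = max(1, int(records_per_second * time_budget_s))
    threshold = min(capacity, math.ceil(total / node_count))
    return split_by_size(start, end, threshold, server)


equal_parts = split_equally(0, 10, 3, "storage-01")
capped_parts = split_by_threshold(0, 1000, 4, 10.0, 20.0, "storage-01")
```

In this sketch the equal split ignores node capability entirely, which matches the time-saving trade-off the text describes, while the threshold variant produces more, smaller segments when a single node could not finish an equal share within the time budget.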

Abstract

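A data positioning method, apparatus and electronic device. The method comprises: receiving a data positioning parameter, and querying the description information of the corresponding data blocks according to the data positioning parameter so as to determine the data segment in which the target data is located (110); sampling the data segment with a preset step size, and obtaining the data identification identifier of the data sampling result (120); and determining, according to the data identification identifier and the data positioning parameter, whether the data sampling result contains the target data and, when the determination is yes, examining the data sampling result record by record until the target data is located (130). The method achieves accurate and efficient data positioning.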

Description

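Streaming data positioning method, apparatus and electronic device
Cross-reference
This application claims priority to Chinese patent application No. 201610252499.4, filed with the Chinese Patent Office on April 21, 2016, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data processing, and in particular to a streaming data positioning method and apparatus.
Background
After being saved in a message queue, large volumes of real-time messages are stored in HDFS (Hadoop Distributed File System) by means of distributed processing. In an Internet environment, for example, the number of messages generated at every moment is enormous, and the back end collects these messages through a message pipeline.
These real-time message data are streamed: they are transmitted without interruption, like flowing water, and they are stored in pieces. Because the data volume is large, processing streaming data efficiently requires sufficient resources. A distributed system has such resources, so before processing, the stored streaming data needs to be uploaded to a distributed file system.
The received streaming data is not necessarily all kept in memory; it may also be persisted to disk, where the data is stored block by block. As time passes and the amount of data grows, a new data block is created once the amount of data stored in a block reaches a certain size.
The difficulty in saving streaming data to a distributed file system lies in locating the streaming data according to the data requirement sent by the user, for example, locating the data of a certain day or of certain hours within a day. This is because data is stored in units of blocks rather than by time granularity. A block is the smallest unit of data storage. Therefore, if a certain type of data has only one block per month, that block contains the 30 days of data for the month, and positioning can only reach month granularity, not day granularity; if a certain type of data has 10 blocks per day, the positioning precision is finer than one day. Existing streaming data positioning solutions cannot locate data precisely; they yield only a coarse position, and the accuracy is low.
For example, the user wants the data of March 1, but that data is stored within the March data. Coarse positioning can only establish that the data segment the user needs is stored in the March data block; it cannot tell which piece of data inside the block is required. If all the data in the block were uploaded to the processing system, the workload would be very large.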
因此,一种流式数据定位方法亟待提出。
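Summary
Embodiments of this application provide a streaming data positioning method, an apparatus, and an electronic device, to overcome the defects of the prior art, namely inaccurate data positioning and low efficiency when uploading data to a distributed system, and to achieve accurate and efficient data positioning.
An embodiment of this application provides a streaming data positioning method, comprising:
receiving a data positioning parameter, querying the description information of the corresponding data blocks according to the data positioning parameter, and determining the data segment in which the target data is located;
sampling the data segment with a preset step size, and obtaining the data identifier of the data sampling result;
determining, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, checking the records in the data sampling result one by one until the target data is located.
An embodiment of this application provides a streaming data positioning apparatus, comprising:
a first positioning module, configured to receive a data positioning parameter, query the description information of the corresponding data blocks according to the data positioning parameter, and determine the data segment in which the target data is located;
a sampling module, configured to sample the data segment with a preset step size and obtain the data identifier of the data sampling result;
a second positioning module, configured to determine, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, check the records in the data sampling result one by one until the target data is located.
An embodiment of this application further provides a non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions being used to perform any of the streaming data positioning methods of this application described above.
An embodiment of this application provides a data positioning electronic device, comprising: at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any of the streaming data positioning methods of this application described above.
An embodiment of this application further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the streaming data positioning methods of this application described above.
The streaming data positioning method and apparatus provided by the embodiments of this application obtain the data segment in which the target data is located according to the data positioning parameter, sample that data segment, and position the data according to the sampling result. This replaces the tedious prior-art operation of comparing data segments one by one when positioning streaming data, achieves precise data positioning, and improves the efficiency of uploading data to the distributed file system.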
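Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a technical flowchart of Embodiment 1 of this application;
FIG. 2 is a technical flowchart of Embodiment 2 of this application;
FIG. 3 is a schematic structural diagram of the apparatus of Embodiment 3 of this application;
FIG. 4 is a schematic structural diagram of the data positioning electronic device of Embodiment 4 of this application.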
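Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The embodiments of this application may be applied in the following scenario: streaming data needs to be uploaded to a distributed system for parallel processing, and before uploading, the currently stored data must be positioned according to the target data requirement; that is, the target data must be found precisely within the large volume of stored streaming data and, once found, uploaded to the distributed system for processing.
FIG. 1 is a technical flowchart of Embodiment 1 of this application. With reference to FIG. 1, a data positioning method according to an embodiment of this application mainly comprises the following steps:
Step S110: receiving a data positioning parameter, and querying the description information of the corresponding data blocks according to the data positioning parameter so as to determine the data segment in which the target data is located;
Step S120: sampling the data segment with a preset step size, and obtaining the data identifier of the data sampling result;
Step S130: determining, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, checking the records in the data sampling result one by one until the target data is located.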
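Specifically, in step S110, the data positioning parameter is sent by the data requester and is used to position the data. The data positioning parameter may include the time label, line number, index number, offset, and so on of the data. For example, if a user needs the data of year xx, month x, day x for analysis, then "year xx, month x, day x" is the data positioning parameter; or, if the user needs the data with index number xxx, then "index number xxx" is the data positioning parameter.
In this step, the description information of the data blocks is queried according to the data positioning parameter; specifically, the data positioning parameter is used as the reference for querying the description information of every existing data block. The description information of a data block is data that describes the block, containing descriptive information about the data block and its information resources, for example that a given data block stores the data from a certain day to a certain day of a certain month.
Because every data block holds a large amount of data, comparing every record against the positioning parameter one by one would waste time and computing resources. The description information, however, is tiny relative to the data in each block. Therefore, in the embodiments of this application, the description information of every data block is first traversed according to the positioning parameter to roughly locate the data blocks containing the target data, and finer positioning is then performed within those blocks. This rough query greatly narrows the positioning range and saves positioning time.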
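The rough query over block description information can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the `BlockInfo` record, its field names, and the `coarse_locate` helper are hypothetical, and it assumes each block's description information records the earliest and latest time label stored in that block.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BlockInfo:
    """Hypothetical description information for one data block."""
    block_id: int
    first_time: datetime  # time label of the earliest record in the block
    last_time: datetime   # time label of the latest record in the block


def coarse_locate(blocks, start, end):
    """Return the blocks whose time range overlaps the requested [start, end]."""
    return [b for b in blocks if b.first_time <= end and b.last_time >= start]


# Assumed sample metadata; only the small descriptions are scanned, not the data.
blocks = [
    BlockInfo(0, datetime(2016, 2, 1), datetime(2016, 3, 20)),
    BlockInfo(1, datetime(2016, 3, 20), datetime(2016, 4, 4)),
    BlockInfo(2, datetime(2016, 4, 4), datetime(2016, 4, 30)),
    BlockInfo(3, datetime(2016, 5, 1), datetime(2016, 6, 15)),
]
candidates = coarse_locate(blocks, datetime(2016, 4, 1), datetime(2016, 4, 7))
```

Only the blocks returned in `candidates` would need to be sampled in the later steps; all other blocks are skipped without reading their contents.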
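Specifically, in step S120, the preset step size may be a variable that is adjusted according to the total amount of data in the data block and the result of each sampling. For example, if a data block contains a very large amount of data and the sampling step is too small, the number of samplings increases and positioning efficiency cannot improve; if a data block contains relatively little data and the sampling step is too large, the sampled data easily makes up a large share of the whole block, and more records must be compared one by one during the subsequent precise positioning. The preset step size is therefore an empirical value related to the amount of data in the data block.
In addition, after the data block has been sampled for the first time with a small step size, the sampling result is examined. If the data identifier of the sampled data shows that the sampling result is very close to the target data, the original step size may be kept or the step size may be increased appropriately, so that the data segment containing the target data is found in the most time-saving way.
The data identifier is of the same kind as the data positioning parameter in step S110. These identifiers are carried by the data itself and may include the time label, line number, index number, offset, and so on of the data.
Specifically, in step S130, determining whether the data sampling result contains the target data according to the data identifier and the data positioning parameter means comparing the data positioning parameter received from the user with the data identifier read from the sampled data to see whether they match. For example, when the data positioning parameter is a time label, the time labels contained in the data identifiers of the sampled data are checked against it; if one matches, the data sampling result is determined to contain the target data.
If the data sampling result does not contain the target data, sampling of the data segment continues. Specifically, starting from the end position of the data sampling result, the data segment is resampled with the preset step size and the data identifier of the resampling result is obtained, until it is determined, according to the data identifier and the data positioning parameter, that the resampling result contains the target data.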
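Whether the data sampling result contains the target data may be determined as follows:
When the data positioning parameter is a time label, the time label contained in the data identifier of the first record in the sampling result is looked up, and the two time labels are compared to see which is earlier. Streaming data is stored in the order in which it arrives and is therefore ordered in time, so comparing the time label of the sampled data is enough to determine whether the sampled data contains the target data. If the first record of the sampling result is later than the time given by the positioning requirement, the target data must lie before the current sampling result, the data after that first record cannot be target data, and no further sampling is needed. Conversely, if the first record of the sampling result is earlier than the time given by the positioning requirement, the target data lies after that first record, and the remaining data segment must be sampled again, possibly several times. During sampling, the preset step size may be updated continually; the size of each update is chosen empirically, and the embodiments of this application do not restrict it.
For example, an embodiment of this application may update the step size as follows. The located data segment is sampled with a fixed step size N (let N=5000), with sampling positions 0+5000*n: one record is first sampled at 0+5000*1 and matched; if it does not contain the target data, one record is sampled at 0+5000*2. If this sample indicates that the target data has been reached, the target data lies between the 5000th and the 10000th records, so sampling must now take place within the segment from 5000 to 10000 and the step size must be updated, for example to (10000-5000)/2=2500, half the number of records in the sampled range. That is, one record is sampled at the 7500th record and compared with the positioning parameter; if it does not match, the sampling step is updated (reduced) again. Once the number of records in the sampling result falls below a preset data volume threshold, all of those records are fetched at once and checked one by one, so the step size is in effect 1. In the embodiments of this application, the data volume threshold may be 200, because below 200 records the time overhead of network access is close to that of a local query, so stopping sampling at that point is most efficient.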
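The step update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `locate_first`, its parameter names, and the in-memory list `keys` (standing in for the ordered time labels of a data segment) are hypothetical, while the defaults 5000 and 200 are taken from the example in the text. The halving stage probes the midpoint of the remaining range, which matches the (10000-5000)/2=2500 update.

```python
def locate_first(keys, target, step=5000, threshold=200):
    """Return the index of the first key >= target, or len(keys) if there is none.

    `keys` must be sorted ascending, as time labels are in a stream.
    Invariant: every key before `lo` is < target, and keys[hi] >= target or hi == len(keys).
    """
    n = len(keys)
    lo, hi = 0, n

    # Stage 1: probe a single record at positions step, 2 * step, 3 * step, ...
    pos = step
    while pos < n:
        if keys[pos] >= target:
            hi = pos            # target starts at or before this probe
            break
        lo = pos + 1            # everything up to this probe is too early
        pos += step

    # Stage 2: halve the remaining range until it is small enough to fetch whole.
    while hi - lo > threshold:
        mid = (lo + hi) // 2
        if keys[mid] >= target:
            hi = mid
        else:
            lo = mid + 1

    # Stage 3: check the remaining records one by one (step size is effectively 1).
    for i in range(lo, hi):
        if keys[i] >= target:
            return i
    return hi


# Assumed sample data: 30000 ordered labels 0, 2, 4, ...
keys = list(range(0, 60000, 2))
index = locate_first(keys, 20001)
```

Each probe in stages 1 and 2 reads one record, so the number of remote reads stays small; only the final range, at most `threshold` records, is fetched in full.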
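It should be noted that, in the embodiments of this application, when the data positioning parameter is compared with the data identifier, the data identifier of the first record in the current sampling result is usually the one compared. If the data identifier of the first record does not match the data positioning parameter, the remaining records in the current sampling result are not compared, and the current sampling result can be judged directly not to contain the target data.
When it is confirmed that the result of a sampling or resampling contains the target data and the result of the next resampling does not, sampling stops; all sampled data containing the target data has then been obtained. Afterwards, each record in the sampled data obtained is compared with the data positioning parameter one by one to determine whether it is target data, which makes the positioning efficient. In this embodiment, the data segment in which the target data is located is obtained according to the data positioning parameter, the data segment is sampled, and the data is positioned according to the sampling result. This replaces the tedious prior-art operation of comparing data segments one by one when positioning streaming data, achieves precise data positioning, and improves the efficiency of uploading data to the distributed file system.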
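FIG. 2 is a technical flowchart of Embodiment 2 of this application. With reference to FIG. 2, a data positioning method of this application further comprises the following feasible implementation steps:
Step S210: receiving a data positioning parameter, and querying the description information of the corresponding data blocks according to the data positioning parameter so as to determine the data segment in which the target data is located;
Step S220: sampling the data segment with a preset step size, and obtaining the data identifier of the data sampling result;
Step S230: determining, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, checking the records in the data sampling result one by one until the target data is located.
Step S240: segmenting the target data according to a preset strategy;
Step S250: packaging the segmentation result, generating distributed parallel tasks, and uploading to the distributed file system.
Steps S210 to S230 above are the same as steps S110 to S130 in Embodiment 1 and are not described again here.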
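Specifically, in step S240, the preset strategy may include the following approaches:
First, dividing the target data evenly according to the number of computing nodes in the distributed cluster;
Second, calculating a data segmentation threshold according to the number of computing nodes in the distributed cluster, the computing efficiency of the computing nodes, and the computing time requirement, and segmenting the target data according to that threshold.
Even division does not consider the remaining computing resources or the computing capability of each node in the server cluster; the target data to be processed is simply divided equally by the number of computing nodes. Its advantage is that no analysis of each node's computing capability is needed, which saves time.
In the other segmentation approach, before segmentation, the computing capability of each node in the server cluster, such as its computing efficiency and the computing time requirement, is obtained first, so that the segmentation can be skewed appropriately according to these reference data, giving a more reasonable distribution of computing tasks.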
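The two segmentation strategies can be sketched as follows. The text does not give a formula for the data segmentation threshold, so the weighted variant here is only one possible reading: it assumes each node's computing efficiency is expressed as a single throughput number and splits the records in proportion to it. The function names `split_evenly` and `split_weighted` are hypothetical.

```python
def split_evenly(total, nodes):
    """Split `total` records into `nodes` contiguous (start, end) ranges of near-equal size."""
    base, extra = divmod(total, nodes)
    ranges, start = [], 0
    for i in range(nodes):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + size))
        start += size
    return ranges


def split_weighted(total, throughputs):
    """Split `total` records in proportion to each node's assumed throughput."""
    weight_sum = sum(throughputs)
    ranges, start, acc = [], 0, 0
    for weight in throughputs:
        acc += weight
        end = total * acc // weight_sum  # cumulative rounding keeps the ranges contiguous
        ranges.append((start, end))
        start = end
    return ranges


even = split_evenly(10, 3)
skewed = split_weighted(1000, [1, 1, 2])
```

The even split needs nothing but the node count, whereas the weighted split gives a node that is twice as fast twice as many records, which is the kind of skew the second strategy describes.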
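Specifically, in step S250, the located target data must be packaged before it is uploaded to the distributed file system.
The preceding steps yield precisely located and segmented target data. Each segment has a start position, an end position, the location of the server storing the data, the metadata of the data, and so on.
In this step, packaging means wrapping each segment's start position, end position, storage server location, metadata, and so on into a data object that a Hadoop distributed MapReduce task understands, so that a MapReduce task that receives this segment information can access the actual data in the segment and save it to the distributed file system. Because MapReduce tasks are distributed and have ample resources, processing efficiency is very high.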
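The packaging step can be illustrated with a small sketch. It does not use any Hadoop API; it only shows, in plain Python, the kind of self-describing segment object the text refers to. The `SegmentDescriptor` class, its field names, the JSON encoding, and the sample host name are all assumptions made for illustration. In Hadoop itself the comparable role is played by input split objects handed to map tasks.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SegmentDescriptor:
    """Hypothetical description of one segment handed to a parallel task."""
    start: int                 # index of the first record in the segment
    end: int                   # index one past the last record in the segment
    host: str                  # server that stores the segment's data
    metadata: dict = field(default_factory=dict)


def package_segments(ranges, host, metadata):
    """Serialize one descriptor per (start, end) range so tasks can fetch their data."""
    return [
        json.dumps(asdict(SegmentDescriptor(s, e, host, dict(metadata))), sort_keys=True)
        for s, e in ranges
    ]


packed = package_segments([(2000, 2125), (2125, 2250)], "storage-01", {"topic": "comments"})
```

A task that receives one of these strings has everything it needs to read its own range from the storage server independently of the other tasks.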
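In this embodiment, after the data required by the user has been precisely located, the data is segmented according to a preset strategy that takes full account of the server resource utilization and computing resource utilization of the distributed processing system, further improving the efficiency of uploading data to the distributed file system.
FIG. 3 is a schematic structural diagram of the apparatus of Embodiment 3 of this application. With reference to FIG. 3, a data positioning apparatus according to an embodiment of this application comprises the following modules:
a first positioning module 310, configured to receive a data positioning parameter and query the description information of the corresponding data blocks according to the data positioning parameter so as to determine the data segment in which the target data is located;
a sampling module 320, configured to sample the data segment with a preset step size and obtain the data identifier of the data sampling result;
a second positioning module 330, configured to determine, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, check the records in the data sampling result one by one until the target data is located.
The data positioning parameter specifically includes: the time label, line number, index number, and offset of the data.
The sampling module 320 is specifically configured to: starting from the start position of the data segment, take the corresponding number of records from the data segment with the preset step size.
The second positioning module 330 is further specifically configured to: if the result is negative, resample the data segment with the preset step size starting from the end position of the data sampling result, and obtain the data identifier of the resampling result, until it is determined, according to the data identifier and the data positioning parameter, that the resampling result contains the target data.
The second positioning module 330 is further configured to: during the data sampling, update the preset step size according to the result of comparing the data identifier with the data positioning parameter.
The apparatus further comprises a segmentation module 340, configured to: segment the target data according to a preset strategy, package the segmentation result, generate distributed parallel tasks, and upload to the distributed file system.
The preset strategy includes: dividing the target data evenly according to the number of computing nodes in the distributed cluster; or calculating a data segmentation threshold according to the number of computing nodes in the distributed cluster, the computing efficiency of the computing nodes, and the computing time requirement, and segmenting the target data according to that threshold.
The apparatus shown in FIG. 3 can perform the methods of the embodiments shown in FIG. 1 and FIG. 2; for the implementation principles and technical effects, refer to those embodiments, which are not described again here.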
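An embodiment of this application provides a non-volatile computer storage medium storing computer-executable instructions that can perform the data positioning method in any of the method embodiments above.
FIG. 4 is a schematic structural diagram of the data positioning electronic device of Embodiment 4 of this application. The device of this embodiment may be a data positioning server or part of one, and may include:
one or more processors 401 and a memory 402; FIG. 4 takes one processor 401 as an example.
The data positioning electronic device may further include an input apparatus 403 and an output apparatus 404.
The processor 401, the memory 402, the input apparatus 403, and the output apparatus 404 may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory 402 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the data positioning method in the embodiments of this application. By running the non-transitory software programs, instructions, and modules stored in the memory 402, the processor 401 executes the various functional applications and data processing of the server, that is, implements the data positioning method of the method embodiments above.
The memory 402 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created through use of the data positioning apparatus, and so on. In addition, the memory 402 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 402 optionally includes memory located remotely from the processor 401, and such remote memory may be connected to the data positioning apparatus via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input apparatus 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the data positioning apparatus. The output apparatus 404 may include a display device such as a display screen.
The one or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the data positioning method in any of the method embodiments above.
The above product can perform the method provided by the embodiments of this application and has the functional modules and beneficial effects corresponding to performing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of this application.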
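The electronic device of the embodiments of this application exists in many forms, including but not limited to:
(1) Mobile communication devices: these devices have mobile communication functions and have voice and data communication as their main purpose. Such terminals include smartphones (for example the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, for example the iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (for example the iPod), handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, the requirements on processing capability, stability, reliability, security, scalability, and manageability are higher.
(5) Other electronic apparatuses with data interaction functions.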
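Application Example
The following section further explains the technical solutions of the embodiments of this application with a practical example in a specific application scenario.
User comment data from a video website is stored on a storage medium, and a data consumer needs to process this comment data to analyze users' viewing interests and viewing hot spots. User comments are streaming data that is generated continuously; on the server side, the comment data is kept on the storage medium in the form of data blocks. Data arrives in a steady stream, and once the current data block is full, newly arriving data is stored in a new data block.
Suppose the video website premiered a film on April 1, 2016, and the user comment data for the week after the premiere is now wanted in order to analyze how the film performed. All comment data, however, is stored in blocks, and there are many data blocks, so locating the data the consumer needs within this large volume is difficult. Under the technical solution of the embodiments of this application, to save time, not all data blocks are scanned; instead, based on the data the consumer requires, the addresses of the data blocks generated in a certain period are obtained first, for example which data blocks hold the comment data for 2016 or which data blocks hold the data for April 2016.
Every data block has description information, and reading it shows which periods of comment data the block holds. Suppose that, in this embodiment, the positioning parameter given by the user, namely the time label April 2016, locates data block 1 and data block 2 on the storage medium as holding the comment data for April 2016. Next, data block 1 and data block 2 are queried further to find which data is from April 2016, and more specifically which data is from April 1, 2016 to April 7, 2016. At this point all data blocks on the storage medium other than data block 1 and data block 2 can be set aside and need not be queried, which removes the prior-art defect of scanning all data blocks and improves the efficiency of rough positioning.
After it is determined that data block 1 and data block 2 contain the required target data, data block 1 and data block 2 are each sampled with a certain sampling step. The following section describes the sampling process in detail, taking data block 1 as an example.
Assume the sampling step is 1000 records per sampling. Starting from the start position of data block 1, 1000 records are extracted, the time label of the first of these 1000 records is obtained, and the corresponding time is looked up. If that time is before April 1, 2016, these 1000 sampled records are discarded and sampling of data block 1 continues. The next sampling may take 1000 records or 5000 records at a time. Each sampling starts from the end of the previous one. Suppose another 1000 records are sampled and the first of them is taken; if comparing its time label shows that the target data is still not included, the sampling step may be increased again, for example by sampling another 1000 records starting from the end of the second sample. The first of these 1000 records is selected and, after its time label is compared, the time label of this first record is found to be later than April 2016, so the target data can be determined to lie between the 2000th and the 3000th records, and the sampling step is updated to (3000-2000)/2=500. Next, with a step of 500, sampling is performed between the 2000th and the 3000th records, that is, it is determined whether the 2500th record is later than April 7, 2016; if so, the sampling step is reduced further to (2500-2000)/2=250, that is, it is determined whether the 2250th record is later than April 7, 2016; if so, the 250 records between 2000 and 2250 are extracted and compared one by one to find which of them fall between April 1, 2016 and April 7, 2016. Because streaming data is ordered, this method finds all the user comment data from April 1, 2016 to April 7, 2016 that the user requires.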
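The final record-by-record comparison over the 250 extracted records can be sketched as follows. The sample data is invented for illustration: it assumes one comment every two hours starting on March 25, 2016, and the helper name `select_in_range` is hypothetical.

```python
from datetime import datetime, timedelta


def select_in_range(records, start, end):
    """Check records one by one and keep those whose time label is in [start, end)."""
    return [record for record in records if start <= record[0] < end]


# Hypothetical narrowed sample: 250 (time label, comment) pairs in arrival order.
base = datetime(2016, 3, 25)
sample = [(base + timedelta(hours=2 * i), f"comment-{i}") for i in range(250)]

# April 1 through April 7 inclusive, expressed as a half-open interval ending on April 8.
hits = select_in_range(sample, datetime(2016, 4, 1), datetime(2016, 4, 8))
```

Because only a few hundred records remain at this stage, a plain linear pass is cheap compared with the probes needed to narrow the range.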
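The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above implementations, a person skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer apparatus (which may be a personal computer, a server, a network apparatus, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.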

Claims (17)

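  1. A data positioning method, applied to an electronic device, characterized by comprising the following steps:
    receiving a data positioning parameter, querying the description information of the corresponding data blocks according to the data positioning parameter, and determining the data segment in which the target data is located;
    sampling the data segment with a preset step size, and obtaining the data identifier of the data sampling result;
    determining, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, checking the records in the data sampling result one by one until the target data is located.
  2. The method according to claim 1, characterized in that sampling the data segment with a preset step size specifically comprises:
    starting from the start position of the data segment, taking the corresponding number of records from the data segment with the preset step size.
  3. The method according to claim 1, characterized in that determining, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data further comprises:
    if not, resampling the data segment with the preset step size starting from the end position of the data sampling result, and obtaining the data identifier of the resampling result, until it is determined, according to the data identifier and the data positioning parameter, that the resampling result contains the target data.
  4. The method according to claim 1, characterized in that the method further comprises:
    during the data sampling, adjusting the preset step size according to the result of comparing the data identifier with the data positioning parameter.
  5. The method according to claim 1, characterized in that the data positioning parameter specifically includes:
    the time label, line number, index number, and offset of the data.
  6. The method according to claim 1, characterized in that the method further comprises:
    segmenting the target data according to a preset method, packaging the segmentation result, generating distributed parallel tasks, and uploading to a distributed file system.
  7. The method according to claim 6, characterized in that the preset method includes:
    dividing the target data evenly according to the number of computing nodes in the distributed cluster; or
    calculating a data segmentation threshold according to the number of computing nodes in the distributed cluster, the computing efficiency of the computing nodes, and the computing time requirement, and segmenting the target data according to the data segmentation threshold.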
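  8. A data positioning apparatus, characterized by comprising the following modules:
    a first positioning module, configured to receive a data positioning parameter, query the description information of the corresponding data blocks according to the data positioning parameter, and determine the data segment in which the target data is located;
    a sampling module, configured to sample the data segment with a preset step size and obtain the data identifier of the data sampling result;
    a second positioning module, configured to determine, according to the data identifier and the data positioning parameter, whether the data sampling result contains the target data and, when it does, check the records in the data sampling result one by one until the target data is located.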
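  9. The apparatus according to claim 8, characterized in that the sampling module is specifically configured to:
    starting from the start position of the data segment, take the corresponding number of records from the data segment with the preset step size.
  10. The apparatus according to claim 8, characterized in that the second positioning module is further specifically configured to:
    if not, resample the data segment with the preset step size starting from the end position of the data sampling result, and obtain the data identifier of the resampling result, until it is determined, according to the data identifier and the data positioning parameter, that the resampling result contains the target data.
  11. The apparatus according to claim 8, characterized in that the second positioning module is further configured to:
    during the data sampling, update the preset step size according to the result of comparing the data identifier with the data positioning parameter.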
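  12. The apparatus according to claim 8, characterized in that the data positioning parameter specifically includes:
    the time label, line number, index number, and offset of the data.
  13. The apparatus according to claim 8, characterized in that the apparatus further comprises a segmentation module, the segmentation module being configured to:
    segment the target data according to a preset method, package the segmentation result, generate distributed parallel tasks, and upload to a distributed file system.
  14. The apparatus according to claim 13, characterized in that the preset method includes:
    dividing the target data evenly according to the number of computing nodes in the distributed cluster; or
    calculating a data segmentation threshold according to the number of computing nodes in the distributed cluster, the computing efficiency of the computing nodes, and the computing time requirement, and segmenting the target data according to the data segmentation threshold.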
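  15. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.
  16. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions which, when executed by an electronic device, cause the electronic device to perform the method according to any one of claims 1-7.
  17. A computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by an electronic device, cause the electronic device to perform the method according to any one of claims 1-7.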
PCT/CN2016/101092 2016-04-21 2016-09-30 Streaming data positioning method and apparatus, and electronic device WO2017181614A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610252499.4 2016-04-21
CN201610252499.4A CN105912274A (zh) 2016-04-21 2016-04-21 Streaming data positioning method and apparatus

Publications (1)

Publication Number Publication Date
WO2017181614A1 true WO2017181614A1 (zh) 2017-10-26

Family

ID=56747677

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101092 WO2017181614A1 (zh) 2016-04-21 2016-09-30 Streaming data positioning method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN105912274A (zh)
WO (1) WO2017181614A1 (zh)

Also Published As

Publication number Publication date
CN105912274A (zh) 2016-08-31

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899199

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899199

Country of ref document: EP

Kind code of ref document: A1