WO2020232612A1 - Method and apparatus for reducing the amount of data used for data visualization - Google Patents

Method and apparatus for reducing the amount of data used for data visualization

Info

Publication number
WO2020232612A1
WO2020232612A1 PCT/CN2019/087661
Authority
WO
WIPO (PCT)
Prior art keywords
data
data distribution
image
classifier
probability
Prior art date
Application number
PCT/CN2019/087661
Other languages
English (en)
French (fr)
Inventor
Luo Zhangwei
Zhu Jingwen
Yu Yue
Yu Shiqiang
Schneegass, Daniel
Li Congchao
Original Assignee
Siemens AG
Siemens Ltd., China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG and Siemens Ltd., China
Priority to PCT/CN2019/087661
Publication of WO2020232612A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock

Definitions

  • the present disclosure generally relates to information processing, and more specifically, to a mechanism that effectively reduces the amount of data used for data visualization.
  • Data visualization refers to the visual representation of data, which aims to convey the information contained in the data clearly and efficiently through graphical means. Data visualization has become an important part of data analysis.
  • a data visualization tool can obtain data collected by multiple Internet of Things (IoT) sensors at a certain frequency, such as temperature data, pressure data, humidity data, etc.
  • A large number of constructed data points are visualized: they are drawn into charts and presented on the visualization user interface.
  • the data distribution shown in the chart can reflect the correlation between at least two component data that constitute the data point to a certain extent.
  • The data distribution graph is interactive: the operator can zoom, pan, and perform other operations on a selected graph to examine the data distribution pattern, and can also select individual data points shown in the graph to view further information, perform related calculations, and so on.
  • In one aspect, a method for reducing the amount of data used for data visualization includes: drawing a plurality of data distribution maps for a set of data points, wherein each data distribution map presents a different proportion of data points selected from the set of data points; providing the image of each data distribution map as input to a neural-network-based classifier to obtain the classifier's output, wherein the output indicates the probability that each image belongs to a particular category; determining a value interval based on the output of the classifier, wherein for the second plurality of data distribution maps, whose corresponding proportions fall within the value interval, the probability of each image belonging to the particular category is not less than a threshold; and, relative to the number of data points in the set, selecting a target ratio from the value interval to reduce the number of data points used for data visualization.
  • This aspect of the present disclosure provides a machine-learning-based method that can effectively reduce the number of data points used for data visualization while still maintaining the valuable information contained in the data distribution, and thus does not affect related data analysis.
  • The substantial reduction in the number of data points used for data visualization reduces the demand for computing resources when the data visualization tool is running, which not only speeds up visualization but also reduces occasional lag during user interactions with the data distribution map, bringing a smoother user experience.
  • the image of each data distribution diagram in the plurality of data distribution diagrams may only include the distribution form of the corresponding data point.
  • The above example also prevents unnecessary information (for example, coordinate axes included in the original data distribution map) from interfering with operations related to the neural network model, thereby improving the classification accuracy of the classifier.
  • The method may further include: using a training data set to train the neural network model, wherein the training data set includes a first part, containing the images of all data distribution maps marked as belonging to the specific category, and a second part, containing the images of all data distribution maps marked as not belonging to the specific category; wherein, based on a division value k selected between 0 and 50%, for the proportion f corresponding to each data distribution map: if f > (100% - k), the image of the data distribution map is marked as belonging to the specific category, and if f < k, the image is marked as not belonging to it.
  • The above example also cleanly separates the two classes of data distribution maps; with such a labeled training data set, the neural network model can be trained in a targeted, supervised manner.
  • Determining the value interval based on the output of the classifier may include: drawing a relationship diagram for the images of the multiple data distribution maps, the diagram reflecting the correlation between the probability that each image belongs to the specific category and the proportion corresponding to the respective data distribution map; and determining, as the value interval, the range of proportions in the relationship diagram for which the probability is not less than the threshold.
  • The relationship diagram clearly reveals the correlation between the probability that an image belongs to the specific category and the proportion of the corresponding data distribution map, and thus makes the appropriate value interval apparent.
  • the selected target ratio may correspond to the lower limit of the value interval.
  • the amount of data used for data visualization can also be minimized.
  • the method may further include: storing the collected data points according to the selected target ratio.
  • the above example can also reduce the demand for storage and computing resources, and enhance the sustainability of data recording and retrieval.
  • The test data set includes the images of all data distribution maps, among the images of the multiple data distribution maps, that do not belong to the training data set. Based on the output of the classifier, if at least one image in the second part is assigned a probability greater than 0 of belonging to the specific category, or if the transition of the probability from 0 to 1 does not appear within the test data set, then: a smaller division value k is reselected; the training data set is reconstructed from the images of the multiple data distribution maps based on the reselected k; and the neural network model is retrained using the reconstructed training data set.
  • a suitable value interval can be determined more accurately.
  • Each data point in the set of data points may include data collected from at least one sensor, and the method may further include: instructing the at least one sensor to reduce its data sampling frequency.
  • the sampling frequency of the sensor as the data source is adjustable, the power consumption of the sensor can be reduced, and the amount of data used for data visualization can be directly reduced from the source.
  • In another aspect, a device for reducing the amount of data used for data visualization includes: a module for drawing a plurality of data distribution maps for a set of data points, wherein each data distribution map presents a different proportion of data points selected from the set; a module for providing the image of each data distribution map as input to a neural-network-based classifier to obtain the classifier's output, wherein the output indicates the probability that each image belongs to a specific category; a module for determining a value interval based on the output of the classifier, wherein for the second plurality of data distribution maps whose corresponding proportions fall within the value interval, the probability of each image belonging to the specific category is not less than a threshold; and a module for reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target ratio selected from the value interval.
  • In another aspect, a computing device includes: a memory for storing instructions; and at least one processor coupled to the memory, wherein the instructions, when executed by the at least one processor, cause the at least one processor to perform the method described herein.
  • In another aspect, a computer-readable storage medium has instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform the method described herein.
  • Figure 1 shows an exemplary environment in which some implementations of the present disclosure may be implemented
  • Figure 2 is a flowchart of an exemplary method according to one implementation of the present disclosure
  • Figure 3 is a flowchart of an exemplary method according to one implementation of the present disclosure.
  • Figures 4A-4C show the data distribution diagrams of three exemplary complete sets of data points
  • Figures 5A-5C show data distribution diagrams of three exemplary data point subsets
  • Figures 6A-6C show images of data distribution diagrams for three exemplary data point subsets
  • Figures 7A-7C show three exemplary relationship diagrams
  • Figures 8A-8C show a comparison of three exemplary data distribution diagrams of the complete set of data points with the data distribution diagram after reducing the amount of data;
  • Figure 9 is a block diagram of an exemplary apparatus according to one implementation of the present disclosure.
  • Figure 10 is a block diagram of an exemplary computing device according to one implementation of the present disclosure.
  • 110: Terminal device; 120: One or more data sources; 130: Network
  • References to "one implementation", "an implementation", "exemplary implementation", "some implementations", "various implementations", etc. indicate that the described implementation of the present invention may include specific features, structures, or characteristics; however, not every implementation necessarily includes them. In addition, some implementations may have some, all, or none of the features described for other implementations.
  • "Coupled" and "connected", along with their derivatives, may be used. These terms are not intended as synonyms for each other. Rather, in particular implementations, "connected" indicates that two or more components are in direct physical or electrical contact with each other, while "coupled" indicates that two or more components cooperate or interact with each other but may or may not be in direct physical or electrical contact.
  • Data visualization tools draw data distribution maps for large numbers of data points (for example, high-frequency data collected from multiple IoT sensors), present them to users, and respond to user interactions with the maps (for example, zooming, dragging, or selecting a group of data points for calculation via pointing tools or touch), all of which requires substantial computing resources. As the amount of data grows, delays and freezes in these operations appear more and more often, affecting both efficiency and user experience.
  • the present disclosure aims to provide a mechanism based on machine learning to solve the above-mentioned problems.
  • With this mechanism, the amount of data used for data visualization can be effectively reduced while still maintaining the valuable information contained in the data distribution, so data analysis is not affected.
  • the data visualization tool's demand for computing resources can be reduced, and processing efficiency and response speed can be accelerated.
  • the operating environment 100 may include a terminal device 110 and one or more data sources 120.
  • the terminal device 110 and the data source 120 may be communicatively coupled to each other through the network 130.
  • a data visualization tool may run on the terminal device 110, which is used to visualize data obtained from one or more data sources 120.
  • the machine learning-based mechanism provided in the present disclosure may be implemented as a part of a data visualization tool, for example as a plug-in. In other examples, the mechanism can be implemented as a separate component.
  • Examples of the terminal device 110 may include, but are not limited to: mobile devices, personal digital assistants (PDAs), wearable devices, smart phones, cellular phones, handheld devices, messaging devices, computers, personal computers (PCs), desktop computers, laptop PCs, notebook computers, handheld computers, tablet computers, workstations, minicomputers, mainframe computers, supercomputers, network equipment, web equipment, processor-based systems, multi-processor systems, consumer electronics, programmable consumer electronics, TVs, digital TVs, set-top boxes, or any combination thereof.
  • One or more data sources 120 are used to provide data for manipulation by a data visualization tool on the terminal device 110.
  • the data source 120 may include various types of sensors, such as temperature sensors, pressure sensors, humidity sensors, current sensors, and so on.
  • The sensor 120 may be configured to collect data at a fixed frequency; in other examples, the data sampling frequency of the sensor 120 is adjustable, for example in response to an indication signal from an external source (such as the terminal device 110).
  • The data collected by one or more data sources 120 may be provided directly to the terminal device 110 for data visualization operations, or may be stored in the terminal device 110 (for example, in its memory) or in a database/server (not shown) communicatively coupled with the terminal device 110 and/or the data source 120 through the network 130, and accessed when needed.
  • the network 130 may include any type of wired or wireless communication network, or a combination of wired and wireless networks.
  • the network 130 may include a wide area network (WAN), a local area network (LAN), a wireless network, a public telephone network, an intranet, the Internet of Things (IoT), and so on.
  • the network 130 may be configured to include multiple networks.
  • The terminal device 110 and the one or more data sources 120 may also be communicatively coupled directly, without going through the network.
  • the present disclosure is not limited to the specific architecture shown in FIG. 1.
  • The data visualization tools mentioned above, and the mechanism provided in the present disclosure for reducing the amount of data used for data visualization, can also be deployed in a distributed computing environment and can be realized using cloud computing technology.
  • Figure 2 shows a flowchart of an exemplary method 200 according to one implementation of the present disclosure.
  • the exemplary method 200 helps reduce the amount of data used for data visualization.
  • the method 200 starts at step 210.
  • A plurality of data distribution maps are drawn for a group of (for example, a total of N) data points, wherein each data distribution map presents a different proportion of data points selected from the set.
  • Each ratio represents a part of the total. For example, it may take the form of a fraction (for example, 1/3, 3/7, etc.) or a percentage (for example, 27%, 43%, etc.); the present disclosure is not limited to these or any other specific forms.
  • each data distribution diagram in the multiple data distribution diagrams presents a data distribution form of a different subset of N data points.
  • the data distribution diagram may include a scatter diagram, which can effectively reflect the relationship between two or more element values of the data point, and better reveal the data distribution pattern/trend.
  • In step 220, the image of each data distribution map is provided as input to the neural-network-based classifier to obtain the classifier's output, wherein the output indicates the probability that each image belongs to a specific category.
  • the classifier based on the neural network model may adopt the convolutional neural network model. Convolutional neural networks have high accuracy for image classification.
  • In step 230, a value interval is determined based on the output of the classifier: for the second plurality of data distribution graphs, whose corresponding proportions fall within the value interval, the probability of each image belonging to the specific category is not less than a threshold.
  • the threshold can be set according to actual needs, for example, it can be set to 95%, or 99%, and so on.
  • For example, a value interval may be determined such that the probability corresponding to proportion values falling within it stabilizes at 1, that is, 100%.
  • In step 240, relative to the number of data points in the set, the number of data points used for data visualization is reduced according to a target ratio selected from the value interval.
  • For example, assuming the value interval is determined to be from 2/5 to 1 (where 2/5 corresponds to the minimum ratio at which the probability begins to stabilize at 1), then relative to the total N, a target ratio f_t can be chosen from 2/5 to 1, reducing the number of data points used for data visualization to N*f_t. In this example, the minimum number of data points used for data visualization is N*2/5.
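  • The arithmetic of this example can be sketched as follows; the function name and the exact-fraction handling are illustrative, not part of the disclosure:

```python
from fractions import Fraction

def min_visualized_points(n_total, interval_lower):
    """Smallest number of data points that still preserves the
    distribution, given the lower bound of the determined value
    interval (e.g. "2/5") and the original total N."""
    return int(n_total * Fraction(interval_lower))
```

For N = 100000 and a lower bound of 2/5, this yields 40000 points, matching the N*2/5 minimum described above.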
  • The disclosed mechanism can accurately determine the allowable range of data volume reduction, that is, the value interval determined in the above steps. A ratio selected in this interval effectively reduces the amount of data relative to the original total N while still maintaining the valuable information contained in the data distribution.
  • each data point can be a vector composed of more than one element value.
  • the element value may come from a sensor, for example.
  • the element value may also include time information.
  • The first type of data point is data from one sensor plus a time stamp;
  • the second type is data from two sensors (without a time stamp);
  • the third type is data from two sensors plus a time stamp.
  • Figures 4A-4C respectively show the data distribution diagrams of the complete set of N data points drawn for these three exemplary situations.
  • the data distribution diagram shown in Figure 4A is for data points from a sensor plus a time stamp.
  • the horizontal axis represents time
  • the vertical axis represents the ordinate; for example, the ordinate may be associated with the measurement unit of the data collected by the sensor.
  • the data distribution diagram shown in Figure 4B is for data points from two sensors (without time stamps).
  • the horizontal axis represents the abscissa
  • the vertical axis represents the ordinate
  • the abscissa and ordinate may be associated with the measurement units of the data collected by the two sensors, respectively.
  • the data distribution diagram shown in Figure 4C is aimed at data points from two sensors plus a time stamp. For example, both sensors are current sensors.
  • the horizontal direction represents the first current value
  • the vertical direction represents the second current value
  • the depth represents the time of occurrence.
  • the exemplary method 300 aims to determine the allowable reduction range relative to the total set of data, that is, the ratio value interval.
  • In this example, the ratio takes the form of a percentage.
  • the method 300 starts at step 305.
  • a percentage f is selected from the range of 0-100%.
  • the percentage f is randomly selected within the aforementioned range.
  • the percentage f is selected at specified intervals within the above range. Other selection methods are also feasible.
  • Data points amounting to the percentage f are selected from the N data points. For example, if the currently selected f is 25%, N*25% data points are selected from the N data points. In some examples, these data points are randomly selected from the N data points; in other examples, they are selected at specified intervals. Other selection methods are also feasible.
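  • The two selection strategies mentioned above (random, and at specified intervals) can be sketched as follows; `select_subset` is a hypothetical helper name, not from the disclosure:

```python
import random

def select_subset(points, f, method="interval"):
    """Select a fraction f of the data points, either at evenly
    spaced indices ("interval") or uniformly at random ("random")."""
    n = max(1, round(len(points) * f))
    if method == "random":
        return random.sample(points, n)
    step = len(points) / n          # evenly spaced index stride
    return [points[int(i * step)] for i in range(n)]
```

With N = 100 and f = 25%, either strategy returns 25 of the 100 points.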
  • the method 300 proceeds to step 315.
  • the data distribution diagram of the data points selected in step 310 is drawn.
  • the data distribution graph may include a scatter graph.
  • the data distribution graph may include a histogram.
  • the present disclosure is not limited to this.
  • Figures 5A-5C respectively show the data distribution diagrams of a subset (N*f) of the N data points drawn for the foregoing three exemplary data point configurations.
  • the current percentage f is selected as 10%
  • the current percentage f is selected as 50%
  • the current percentage f is selected as 5%. It should be noted that the specific selection of the above percentage f is only for illustrative purposes.
  • step 320 the data distribution diagram drawn in step 315 is converted into an image.
  • All non-essential information, including coordinate axes and the like, is removed from the data distribution map so that the resulting image contains only the distribution pattern of the data points. This avoids interference with the training and use of the neural network in subsequent steps and improves classification accuracy.
  • the converted image adopts the JPEG format, but the present disclosure is not limited thereto.
  • Figures 6A-6C respectively show the images obtained after converting the data distribution diagrams shown in Figures 5A-5C. It can be seen that all unnecessary information in the previous data distribution diagrams has been removed.
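  • A sketch of this conversion using matplotlib (an assumed tool; the disclosure does not name a plotting library, and PNG is used here instead of the JPEG format mentioned above, purely for simplicity):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def distribution_image(xs, ys, path):
    """Render a scatter plot of the selected subset as a bare image:
    axes, ticks, and labels are stripped so that the resulting image
    contains only the distribution pattern of the data points."""
    fig, ax = plt.subplots(figsize=(2, 2))
    ax.scatter(xs, ys, s=2, c="black")
    ax.set_axis_off()  # remove coordinate axes and all decorations
    fig.savefig(path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```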
  • In step 325, it is determined whether a desired number (for example, M) of images has been obtained. If the judgment is "no", the method 300 jumps back to step 305, and steps 305-320 are repeated to generate more images. If the judgment is "yes", the method 300 proceeds to step 330.
  • In step 330, a division value k is selected from the range 0-50%. According to the relationship between the percentage f and the division value k, some of the images obtained in the previous steps are marked as belonging to a specific category (for example, category A) and others are marked as not belonging to it; the remaining images are left unmarked, to be used as the test data set.
  • If the percentage f corresponding to an image satisfies f > (100% - k), the image is marked as belonging to the specific category A.
  • If the percentage f corresponding to an image satisfies f < k, the image is marked as not belonging to the specific category A.
  • Of all M images, those that fall into these two sets constitute the training data set; all remaining images are used as the test data set.
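  • The labeling rule of step 330 can be sketched as a small helper; the name `label_image` and the fraction-based representation of f and k are illustrative assumptions:

```python
def label_image(f, k):
    """Label an image by the sampling fraction f of its distribution
    map, given a division value k chosen between 0 and 0.5.
    Returns True (category A), False (not category A), or
    None (unlabeled: the image goes to the test data set)."""
    if f > 1.0 - k:
        return True
    if f < k:
        return False
    return None
```

For k = 0.2, an image drawn from 90% of the points is labeled category A, one drawn from 10% is labeled not-A, and one drawn from 50% stays unlabeled.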
  • the method 300 proceeds to step 335.
  • the images in the training data set are used to train a classifier based on the neural network model so that it can recognize whether an input image belongs to a specific category A and give a corresponding probability.
  • the images in the training data set have corresponding labels (belonging to a specific category A, not belonging to the specific category A). Therefore, the training process is performed in a supervised learning manner.
  • the neural network model may include a convolutional neural network model.
  • After finishing the training of the classifier in step 335, the method 300 proceeds to step 340.
  • all the previously obtained images (including those belonging to the training data set and those belonging to the test data set) are provided as input to the trained classifier.
  • The output of the classifier includes the probability that each image belongs to the particular category A.
  • In step 345, a relationship diagram can be drawn for all images, reflecting the correlation between the probability of each image belonging to the specific category A (that is, the output of the classifier in step 340) and the percentage f corresponding to the image's data distribution map.
  • The relationship diagram presents this correlation more clearly in visual form. It can be understood that in some examples, drawing the relationship diagram in step 345 is not necessary: using the results of the previous steps, the correlation between the probability that each image belongs to the specific category A and the percentage f of the corresponding data distribution map can be determined directly.
  • Refer to Figures 7A-7C, which respectively show the relationship diagrams drawn in step 345 after the foregoing processing for the three exemplary data point configurations.
  • Each point in the figure represents an image of a data distribution map, and its ordinate value indicates the probability that the image belongs to a specific category A, and its abscissa value indicates the percentage f corresponding to the data distribution map/image.
  • In step 350, it is determined whether either of the following conditions is met: at least one image in the part of the training data set satisfying f < k has been assigned a probability greater than 0 by the trained classifier; or the transition of the probability from 0 to 1 does not appear in the test data set. If the judgment is "yes", the previously selected value of k was too large, and the method 300 jumps back to step 330, repeating steps 330-345.
  • If the judgment in step 350 is "no", an interval in which the probability is stably not less than a threshold is found based on the output of the classifier, and serves as the value interval.
  • The position enclosed by the circle in each figure marks the point from which the probability stabilizes at not less than the threshold.
  • the threshold can be set to 95%, or 99%, etc., which can be set according to actual needs.
  • the value interval to be found may be the interval where the probability starts to stabilize to 1 (ie, 100%).
  • The percentage f corresponding to this position can be denoted as f_min.
  • The interval between f_min and 100% is determined as the value interval.
  • A target percentage f_t can be selected from it, and the amount of data used for data visualization can be reduced based on the selected f_t: the number of data points is reduced from the original N to N*f_t, while the valuable information contained in the data distribution is maintained and data analysis is not affected.
  • In some examples, the selected target percentage f_t corresponds to the lower limit of the value interval, that is, f_min, so as to minimize the amount of data.
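  • Finding f_min from the classifier output can be sketched as a scan over (f, probability) pairs; the helper name and the restart-on-dip handling are assumptions, not part of the disclosure:

```python
def find_fmin(prob_by_f, threshold=0.99):
    """Given (percentage f, probability) pairs, return the smallest f
    from which the probability stays at or above the threshold for
    all larger f (the lower limit of the value interval), or None."""
    fmin = None
    for f, p in sorted(prob_by_f):
        if p >= threshold:
            if fmin is None:
                fmin = f        # candidate start of the stable region
        else:
            fmin = None         # stability broken; restart the search
    return fmin
```

For a curve that first reaches and holds probability 1 at f = 0.3, this returns 0.3 as the lower limit of the value interval.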
  • The target ratio f_t selected in the aforementioned value interval can also be used to instruct the sensor to reduce its sampling frequency accordingly, thereby reducing the sensor's power consumption and directly reducing the amount of data used for data visualization at the source.
  • FIG. 9 is a block diagram of an exemplary apparatus 900 according to an implementation of the present disclosure.
  • the apparatus 900 may be implemented in the terminal device 110 shown in FIG. 1 or any similar or related entity.
  • the exemplary device 900 is used to reduce the amount of data used for data visualization.
  • the exemplary apparatus 900 may include a module 910 for drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points.
  • the exemplary apparatus 900 may further include a module 920 for providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category.
  • the exemplary apparatus 900 may further include a module 930 for determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold.
  • the exemplary apparatus 900 may further include a module 940 for reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.
  • the device 900 may also include additional modules for performing other operations that have been described in the specification.
  • the exemplary apparatus 900 may be implemented by software, hardware, firmware, or any combination thereof.
  • the exemplary computing device 1000 may include one or more processing units 1010.
  • the processing unit 1010 may include any type of general-purpose processing unit/core (for example, but not limited to: CPU, GPU), or dedicated processing unit, core, circuit, controller, etc.
  • the exemplary computing device 1000 may also include a memory 1020.
  • the memory 1020 may include any type of media that can be used to store data.
  • the memory 1020 is configured to store instructions that, when executed, cause one or more processing units 1010 to perform the methods described herein, for example, the exemplary method 200, the exemplary method 300, and so on.
  • Various implementations of the present disclosure can be implemented using hardware units, software units, or a combination thereof.
  • Examples of hardware units may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, etc.), integrated circuits, application-specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field-programmable gate arrays (FPGA), memory units, logic gates, registers, semiconductor devices, chips, microchips, chipsets, etc.
  • Examples of software units may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application programming interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Whether an implementation uses hardware units and/or software units may vary depending on a number of factors, such as the desired computation rate, power level, heat tolerance, processing cycle budget, input data rate, output data rate, memory resources, data bus speed, and other design or performance constraints, as desired for a given implementation.
  • Some implementations of the present disclosure may include articles of manufacture.
  • Articles of manufacture may include storage media, which are used to store logic.
  • Examples of storage media may include one or more types of computer-readable storage media capable of storing electronic data, including volatile or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writable or rewritable memory, etc.
  • Examples of logic may include various software units, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application programming interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
  • the article of manufacture may store executable computer program instructions that, when executed by the processing unit, cause the processing unit to perform the methods and/or operations described herein.
  • the executable computer program instructions may include any suitable type of code, for example, source code, compiled code, interpreted code, executable code, static code, dynamic code, and so on.
  • the executable computer program instructions can be implemented according to a predefined computer language, manner, or syntax for instructing a computer to perform a specific function.
  • the instructions can be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

Provided is a method for reducing the amount of data used for data visualization, comprising: drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points; providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category; determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold; and reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.

Description

Method and Apparatus for Reducing the Amount of Data Used for Data Visualization

Technical Field
The present disclosure relates generally to information processing and, more specifically, to mechanisms for effectively reducing the amount of data used for data visualization.
Background
Data visualization refers to the visual representation of data; it aims to convey the information contained in data clearly and efficiently by graphical means. Data visualization has become an important part of data analysis.
In one exemplary, typical application scenario, a data visualization tool may obtain data collected at a certain frequency by multiple Internet of Things (IoT) sensors, for example temperature data, pressure data, humidity data, and so on, perform visualization processing on the large number of data points constructed from these sensor data, and plot them as charts presented on a visual user interface. The data distribution shown by a chart can, to some extent, reflect the correlation between at least two component values that make up the data points. In a typical data visualization tool, the data distribution graph is interactive: an operator can zoom, pan, and otherwise manipulate the graph to examine the shape of the data distribution, and can also select individual data points shown in the graph to view further information, perform related calculations, and so on.
Summary
This Summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify any key or essential features of the claimed subject matter, nor is it intended to be used to help determine the scope of the claimed subject matter.
According to one aspect of the present disclosure, there is provided a method for reducing the amount of data used for data visualization, comprising: drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points; providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category; determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold; and reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.
This aspect of the present disclosure provides a machine-learning-based method that can effectively reduce the number of data points used for data visualization while still retaining the valuable information contained in the data distribution, and thus does not affect related data analysis. Advantageously, a substantial reduction in the number of data points used for data visualization lowers the computing-resource requirements of a running data visualization tool, which not only speeds up visualization processing but also effectively reduces stutters and similar problems during the user's interactions with the data distribution graph, bringing a smoother user experience.
Furthermore, in one example of the foregoing method, the image of each of the plurality of data distribution graphs may contain only the distribution shape of the corresponding data points.
Advantageously, the above example can also prevent non-essential information (for example, the coordinate axes and similar information contained in the original data distribution graph) from interfering with the operations related to the neural network model, thereby improving the classification accuracy of the classifier.
Furthermore, in one example of the foregoing method, the method may further comprise: training the neural network model with a training data set, wherein the training data set comprises: a first part containing the images of all data distribution graphs labelled as belonging to the particular category, and a second part containing the images of all data distribution graphs labelled as not belonging to the particular category, wherein, based on a split value k selected between 0 and 50%, for the proportion f corresponding to each of the plurality of data distribution graphs, the image of that data distribution graph is labelled as belonging to the particular category if f > (100% - k), and as not belonging to the particular category if f < k.
Advantageously, the above example also draws a clear division between the categories of data distribution graphs, and with the labelled training data set the neural network model can be trained in a targeted manner by supervised learning.
Furthermore, in one example of the foregoing method, determining the value interval based on the output of the classifier may comprise: drawing, for the images of the plurality of data distribution graphs, a relationship graph reflecting the correlation between the probability that each image belongs to the particular category and the proportion corresponding to the respective data distribution graph; and determining, as the value interval, the proportion interval in the relationship graph that corresponds to probabilities not less than the threshold.
Advantageously, in the above example, the relationship graph can clearly reflect the correlation between the probability that the image of a data distribution graph belongs to the particular category and the proportion corresponding to the respective data distribution graph, so that a suitable value interval can be identified.
Furthermore, in one example of the foregoing method, the selected target proportion may correspond to the lower limit of the value interval.
Advantageously, in the above example, using the target proportion corresponding to the lower limit of the determined value interval also maximizes the reduction in the amount of data used for data visualization.
Furthermore, in one example of the foregoing method, the method may further comprise: storing the collected data points according to the selected target proportion.
Advantageously, the above example can also lower the demand for storage and computing resources and enhance the sustainability of data recording and retrieval.
Furthermore, in one example of the foregoing method, a test data set comprises the images of all data distribution graphs among the images of the plurality of data distribution graphs that do not belong to the training data set, and wherein, based on the output of the classifier, if it is determined that at least one image in the second part is judged by the classifier to belong to the particular category with a probability greater than 0, or if it is determined that the transition of the probability from 0 to 1 does not occur within the test data set, then: a smaller split value k is re-selected; the training data set is reconstructed from the images of the plurality of data distribution graphs based on the re-selected split value k; and the neural network model is retrained with the reconstructed training data set.
Advantageously, in the above example, adjusting the split value makes it possible to determine a suitable value interval more accurately.
Furthermore, in one example of the foregoing method, each data point in the set of data points may contain data collected from at least one sensor, and the method may further comprise: instructing the sensor to reduce its data sampling frequency according to the selected target proportion.
Advantageously, in the above example, when the sampling frequency of the sensor serving as the data source is adjustable, the power consumption of the sensor can be lowered and the amount of data used for data visualization can be reduced directly at the source.
According to another aspect of the present disclosure, there is provided an apparatus for reducing the amount of data used for data visualization, comprising: a module for drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points; a module for providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category; a module for determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold; and a module for reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.
According to a further aspect of the present disclosure, there is provided a computing device, comprising: a memory for storing instructions; and at least one processor coupled to the memory, wherein the instructions, when executed by the at least one processor, cause the at least one processor to perform the methods described herein.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform the methods described herein.
Brief Description of the Drawings
Implementations of the present disclosure are illustrated in the accompanying drawings by way of example and not limitation, in which like reference numerals denote identical or similar components, and in which:
FIG. 1 shows an exemplary environment in which some implementations of the present disclosure may be practised;
FIG. 2 is a flowchart of an exemplary method according to an implementation of the present disclosure;
FIG. 3 is a flowchart of an exemplary method according to an implementation of the present disclosure;
FIGS. 4A-4C show data distribution graphs of three exemplary full sets of data points;
FIGS. 5A-5C show data distribution graphs of three exemplary subsets of data points;
FIGS. 6A-6C show images of the data distribution graphs of three exemplary subsets of data points;
FIGS. 7A-7C show three exemplary relationship graphs;
FIGS. 8A-8C show comparisons between the data distribution graphs of three exemplary full sets of data points and the corresponding data distribution graphs after data reduction;
FIG. 9 is a block diagram of an exemplary apparatus according to an implementation of the present disclosure; and
FIG. 10 is a block diagram of an exemplary computing device according to an implementation of the present disclosure.
List of Reference Numerals
110: terminal device    120: one or more data sources    130: network
210: draw data distribution graphs
220: obtain the probability that the image of each data distribution graph belongs to a particular category
230: determine a value interval based on the probabilities
240: reduce the amount of data using a target proportion within the value interval
305: select a percentage f
310: select a percentage f of the total number of data points
315: draw a data distribution graph of the selected data points
320: convert the data distribution graph into an image
325: determine whether the desired number of images has been obtained
330: determine the training data set and the test data set based on the selected split value k
335: train the classifier based on a neural network model with the training data set
340: obtain, from the classifier's output, the probability that each input image belongs to the particular category
345: draw a relationship graph for all images
350: use the relationship graph to determine whether the conditions are met
355: find the percentage value interval corresponding to probabilities not less than the threshold
910-940: modules
1010: processor
1020: memory
Detailed Description
In the following description, numerous specific details are set forth for purposes of explanation. It will be understood, however, that implementations of the invention may be practised without these specific details. In other instances, well-known circuits, structures, and techniques are not shown in detail so as not to obscure the understanding of the description.
References throughout the specification to "one implementation", "an implementation", "an exemplary implementation", "some implementations", "various implementations", and the like mean that the described implementation of the invention may include particular features, structures, or characteristics; however, not every implementation necessarily includes those particular features, structures, or characteristics. Moreover, some implementations may have some, all, or none of the features described for other implementations.
In the following description and claims, the terms "coupled" and "connected", along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular implementations, "connected" is used to indicate that two or more components are in direct physical or electrical contact with each other, while "coupled" is used to indicate that two or more components cooperate or interact with each other but may or may not be in direct physical or electrical contact.
A data visualization tool may draw data distribution graphs for a large number of data points (for example, from high-frequency data collection by multiple IoT sensors) to present them to a user, and needs to respond to the user's interactions with the graphs (for example, zooming, dragging, or selecting a group of data points for calculation using a pointing tool or touch), which consumes substantial computing resources. As the amount of data grows, delays and stutters in these operations occur more and more often, harming both efficiency and user experience.
The present disclosure aims to provide a machine-learning-based mechanism to solve the above problems. With this mechanism, the order of magnitude of the data used for data visualization can be effectively reduced while the valuable information contained in the data distribution is retained, so data analysis is not affected. The computing-resource requirements of the data visualization tool can thereby be relieved, improving processing efficiency and responsiveness.
Referring now to FIG. 1, there is shown an exemplary operating environment 100 in which some implementations of the present disclosure may be practised. The operating environment 100 may include a terminal device 110 and one or more data sources 120. In some implementations, the terminal device 110 and the data sources 120 may be communicatively coupled to each other through a network 130.
In some examples, a data visualization tool may run on the terminal device 110 to perform visualization processing on data obtained from the one or more data sources 120. In some examples, the machine-learning-based mechanism provided in this disclosure may be implemented as part of the data visualization tool, for example as a plug-in. In other examples, the mechanism may be implemented as a separate component.
Examples of the terminal device 110 may include, but are not limited to: a mobile device, a personal digital assistant (PDA), a wearable device, a smartphone, a cellular phone, a handheld device, a messaging device, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a handheld computer, a tablet computer, a workstation, a minicomputer, a mainframe computer, a supercomputer, a network appliance, a web appliance, a processor-based system, a multiprocessor system, a consumer electronics device, a programmable consumer electronics device, a television, a digital television, a set-top box, or any combination thereof.
The one or more data sources 120 provide data for the data visualization tool on the terminal device 110 to operate on. By way of example and not limitation, the data sources 120 may include various types of sensors, such as temperature sensors, pressure sensors, humidity sensors, current sensors, and so on. In some examples, a sensor 120 may be configured to collect data at a fixed frequency, while in other examples the data sampling frequency of the sensor 120 is adjustable, for example in response to an instruction signal from outside (for example, from the terminal device 110).
The data collected by the one or more data sources 120 may be provided directly to the terminal device 110 for data visualization operations, or may first be stored in the terminal device 110 (for example, in a memory contained therein) or in a database/server (not shown) communicatively coupled to the terminal device 110 and/or the data sources 120 through the network 130, to be retrieved when needed.
The network 130 may include any type of wired or wireless communication network, or a combination of wired and wireless networks. In some examples, the network 130 may include a wide area network (WAN), a local area network (LAN), a wireless network, a public telephone network, an intranet, the Internet of Things (IoT), and so on. Moreover, although a single network 130 is shown here, the network 130 may be configured to include multiple networks.
Moreover, although an exemplary operating environment according to some implementations of the present disclosure has been described above in connection with FIG. 1, in other implementations the terminal device 110 and the one or more data sources 120 may also be directly communicatively coupled without going through a network. The present disclosure is not limited to the particular architecture shown in FIG. 1.
Furthermore, in some examples, the data visualization tool mentioned above, as well as the mechanism provided in this disclosure for reducing the amount of data used for data visualization, may also be deployed in a distributed computing environment and may be implemented using cloud computing technology.
FIG. 2 shows a flowchart of an exemplary method 200 according to an implementation of the present disclosure.
The exemplary method 200 helps reduce the amount of data used for data visualization. Referring to FIG. 2, the method 200 begins at step 210, in which a plurality of data distribution graphs are drawn for a set of data points (for example, N points in total), wherein each data distribution graph presents a different proportion of data points selected from the set. Here, each proportion represents a part of the total; for example, it may take the form of a fraction (for example, 1/3, 3/7, and so on) or of a percentage (for example, 27%, 43%, and so on), and the present disclosure is not limited to these or any other particular forms. It will be appreciated that the higher the selected proportion, the larger the number of data points selected from the total. That is, each of the plurality of data distribution graphs presents the data distribution shape of a different subset of the N data points. In some examples, the data distribution graphs may include scatter plots, which can effectively reflect the relationship between two or more element values of the data points and better reveal the shape and trend of the data distribution.
Next, at step 220, the image of each data distribution graph is provided as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category. In some examples, the classifier based on a neural network model may employ a convolutional neural network model, which offers high accuracy for image classification.
After the output of the classifier is obtained, the method 200 proceeds to step 230, in which a value interval is determined based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold. The threshold may be set according to actual needs; for example, it may be set to 95%, 99%, and so on. In addition, in one example, a value interval may be determined such that the probabilities corresponding to the proportion values falling within it stabilize at 1, i.e., 100%.
Then, at step 240, relative to the number of the set of data points, the number of data points used for data visualization is reduced according to a target proportion selected from the value interval. For example, assuming the value interval is determined to be from 2/5 to 1 (where, for example, 2/5 corresponds to the smallest proportion value at which the probability starts to stabilize at 1), then, relative to the total N, a target proportion f_t between 2/5 and 1 may be selected to reduce the number of data points used for data visualization to N*f_t. It will be appreciated that, in this example, the minimum number of data points used for data visualization is N*2/5.
By virtue of the excellent ability of neural network models to perform image classification (here, classifying the images of the data distribution graphs of multiple different subsets of the N data points), and aiming at reducing the amount of data used for data visualization, the mechanism provided by the above implementation of the present disclosure can accurately determine the allowable range of data reduction, namely the value interval determined in the steps above. Relative to the original total amount of data N, a proportion selected within this interval can effectively reduce the amount of data while still retaining the valuable information contained in the data distribution.
Referring now to FIG. 3, a concrete implementation of an exemplary method 300 according to an implementation of the present disclosure is described with some specific examples.
First, a set of data points (for example, N points in total) is obtained; each data point may be a vector composed of more than one element value. The element values may, for example, come from sensors. In some examples, the element values may also include time information. By way of example and not limitation, the first kind of data point is data from one sensor plus a timestamp, the second kind is data from two sensors (without a timestamp), and the third kind is data from two sensors plus a timestamp. Those skilled in the art will appreciate that other types of data points may equally be used, and the present disclosure is not intended to be limiting in this respect. FIGS. 4A-4C respectively show the data distribution graphs of the full set of N data points drawn for these three exemplary cases.
The data distribution graph in FIG. 4A is for data points consisting of data from one sensor plus a timestamp. In FIG. 4A, the horizontal direction represents time, and the vertical direction represents the ordinate, which may for example be associated with the unit of measurement of the data collected by that sensor. The data distribution graph in FIG. 4B is for data points consisting of data from two sensors (without a timestamp). In FIG. 4B, the horizontal direction represents the abscissa and the vertical direction the ordinate; for example, they may be associated with the units of measurement of the data collected by the two sensors, respectively. The data distribution graph in FIG. 4C is for data points consisting of data from two sensors plus a timestamp; for example, both sensors are current sensors. In FIG. 4C, the horizontal direction represents the first current value, the vertical direction represents the second current value, and the shading represents the time of occurrence.
Returning to FIG. 3, the exemplary method 300 aims to determine the allowable reduction range relative to the amount of data in the full set, i.e., the proportion value interval. For ease of explanation, in the example of FIG. 3 the specific proportion values take the form of percentages; however, those skilled in the art will appreciate that other ways of expressing a proportion, including fractions, are equally feasible, and the present disclosure is not limited thereto. The method 300 begins at step 305, in which a percentage f is selected from the range 0-100%. In some examples, the percentage f is selected randomly within this range; in other examples, it is selected at specified intervals within this range. Other selection schemes are also feasible.
Next, at step 310, that percentage f of data points is selected from the N data points. For example, if the currently selected f is 25%, then N*25% data points are selected from the N data points. In some examples, these 25% of the data points are selected randomly from the N data points; in other examples, they are selected from the N data points at specified intervals. Other selection schemes are also feasible.
The method 300 then proceeds to step 315, in which a data distribution graph of the data points selected in step 310 is drawn. As mentioned above, in some examples the data distribution graph may include a scatter plot; in other examples it may include a bar chart. However, the present disclosure is not limited thereto.
FIGS. 5A-5C respectively show data distribution graphs of a subset (N*f) of the N data points drawn for the three exemplary data point compositions described above, where in FIG. 5A the current percentage f is chosen as 10%, in FIG. 5B as 50%, and in FIG. 5C as 5%. Note that these specific choices of the percentage f are for illustrative purposes only.
Returning to the method 300, it now proceeds to step 320, in which the data distribution graph drawn in step 315 is converted into an image. In some examples, before the conversion, all non-essential information, including the coordinate axes and the like, is removed from the data distribution graph so that only the data points remain; the resulting image thus contains only the distribution shape of those data points, which avoids interference with the training and use of the neural network in subsequent steps and improves classification accuracy. In addition, in some examples the converted image is in JPEG format, although the present disclosure is not limited thereto.
FIGS. 6A-6C respectively show the images obtained by converting the data distribution graphs shown in FIGS. 5A-5C; it can be seen that all non-essential information in the earlier data distribution graphs has been removed.
Next, the method 300 proceeds to step 325, in which it is determined whether the desired number (for example, M) of images has been obtained. If the determination at this step is "no", the method 300 jumps back to step 305 and repeats steps 305-320 to generate more images. If the determination is "yes", the method 300 proceeds to step 330.
At step 330, a split value k is selected from the range 0-50%, and, based on the relationship between the percentage f and the split value k, some of the images obtained in the preceding steps are labelled as belonging to a particular category (for example, category A) and others as not belonging to that category; the remaining images are left unlabelled for use as a test data set. In some examples, if the percentage f corresponding to an image satisfies f > (100% - k), the image is labelled as belonging to category A, and if it satisfies f < k, the image is labelled as not belonging to category A. Of all M images, those falling into these two sets are used as the training data set; all remaining images are used as the test data set.
Next, the method 300 proceeds to step 335, in which the classifier based on a neural network model is trained with the images in the training data set so that it can recognize whether an input image belongs to category A and give the corresponding probability. The images in the training data set carry the corresponding labels (belonging to category A, not belonging to category A), so the training process is carried out by supervised learning. In some examples, considering that the number of images in the training data set obtained in the preceding steps may not be sufficient to fully train an untrained neural network model, a neural network model partially trained on other image data (for example, image data not converted from data distribution graphs) may be chosen; for example, a neural network model based on transfer learning may be employed. Furthermore, in some examples, the neural network model may include a convolutional neural network model.
After the training of the classifier in step 335 has finished, the method 300 continues to step 340, in which all the images obtained earlier (both those belonging to the training data set and those belonging to the test data set) are provided as input to the trained classifier, whose output includes the probability that each image belongs to category A.
Thereafter, at step 345, a relationship graph may be drawn for all images to reflect the correlation between the probability that each image belongs to category A (i.e., as output by the classifier in step 340) and the percentage f corresponding to that image's data distribution graph. A relationship graph reflects the above correlation more clearly in visual form. It will be appreciated that, in some examples, the operation of drawing the relationship graph in step 345 is not mandatory: the correlation between the probability that each image belongs to category A and the percentage f corresponding to the respective data distribution graph can be determined directly from the results of the preceding steps. See FIGS. 7A-7C, which respectively show the relationship graphs drawn in step 345 for the three exemplary kinds of data points after the processing described above. Each point in a graph represents the image of one data distribution graph; its ordinate indicates the probability that the image belongs to category A, and its abscissa indicates the percentage f corresponding to that data distribution graph/image.
Next, the method 300 proceeds to step 350, where it is determined whether the following condition holds: at least one image in the part of the training data set satisfying f < k is judged by the trained classifier to belong to category A with a probability greater than 0, or the transition of the probability from 0 to 1 does not occur within the test data set. If the determination here is "yes", the previously selected value of k was too large, and the method 300 jumps to step 330 and repeats steps 330-345.
Conversely, if neither of the two situations in step 350 occurs, the method may proceed to step 355, in which, based on the output of the classifier, an interval is found in which the probability stabilizes at not less than a threshold; this interval is taken as the value interval. Continuing with the illustrations of FIGS. 7A-7C, the circled position in each graph marks where the probability begins to stabilize at not less than the threshold. For example, the threshold may be set to 95%, 99%, and so on, according to actual needs. In a preferred implementation, the value interval to be found may be the interval in which the probability starts to stabilize at 1 (i.e., 100%). The percentage f corresponding to this position may be denoted f_min, and the interval between f_min and 100% is determined as the value interval. The method 300 may end here.
From the determined value interval, a target percentage f_t may be selected, and the amount of data used for data visualization is reduced based on the f_t selected in this way; that is, the number of data points is reduced from the original N to N*f_t, while the valuable information contained in the data distribution is still retained, so data analysis is not affected. Preferably, in some examples, the selected target percentage f_t corresponds to the lower limit of the value interval, i.e., f_min, so as to maximize the reduction in the amount of data.
FIGS. 8A-8C respectively show, for the three exemplary kinds of data points described above, a comparison between the data distribution graph of the full set of N data points and the data distribution graph of the f_min subset of data points determined in step 355. It can be seen that, compared with the data distribution graph of the full set, the data distribution graph after data reduction still maintains an overall consistent data distribution.
Furthermore, in some examples, using the target proportion f_t selected within the above value interval, for data visualization applications only the N*f_t data points selected from each collection of N data points need be stored, rather than all N data points. This can greatly lower the demand for storage and computing resources and enhance the sustainability of data recording and retrieval.
Furthermore, in some examples, when the data sampling frequency of the sensor serving as the data source is adjustable, the target proportion f_t selected within the above value interval can be used to instruct the sensor to reduce its sampling frequency accordingly, thereby lowering the sensor's power consumption and directly reducing, at the source, the amount of data used for data visualization.
Referring now to FIG. 9, which is a block diagram of an exemplary apparatus 900 according to an implementation of the present disclosure. For example, the apparatus 900 may be implemented in the terminal device 110 shown in FIG. 1 or in any similar or related entity.
The exemplary apparatus 900 is used to reduce the amount of data used for data visualization. As shown in FIG. 9, the exemplary apparatus 900 may include a module 910 for drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points. The exemplary apparatus 900 may further include a module 920 for providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category. The exemplary apparatus 900 may further include a module 930 for determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold. In addition, the exemplary apparatus 900 may further include a module 940 for reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.
Furthermore, in some examples, the apparatus 900 may also include additional modules for performing other operations already described in the specification. Those skilled in the art will appreciate that the exemplary apparatus 900 may be implemented in software, hardware, firmware, or any combination thereof.
Turning now to FIG. 10, there is shown a block diagram of an exemplary computing device 1000 according to an implementation of the present disclosure. As shown, the exemplary computing device 1000 may include one or more processing units 1010. A processing unit 1010 may include any type of general-purpose processing unit/core (for example, but not limited to, a CPU or GPU) or a dedicated processing unit, core, circuit, controller, and so on. In addition, the exemplary computing device 1000 may also include a memory 1020. The memory 1020 may include any type of medium that can be used to store data. In one implementation, the memory 1020 is configured to store instructions that, when executed, cause the one or more processing units 1010 to perform the methods described herein, for example the exemplary method 200, the exemplary method 300, and so on.
Various implementations of the present disclosure may be implemented using hardware units, software units, or a combination thereof. Examples of hardware units may include devices, components, processors, microprocessors, circuits, circuit elements (for example, transistors, resistors, capacitors, inductors, and so on), integrated circuits, application-specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field-programmable gate arrays (FPGA), memory units, logic gates, registers, semiconductor devices, chips, microchips, chipsets, and so on. Examples of software units may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application programming interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Whether an implementation is carried out using hardware units and/or software units may vary depending on a number of factors, such as the desired computation rate, power level, heat tolerance, processing cycle budget, input data rate, output data rate, memory resources, data bus speed, and other design or performance constraints, as desired for a given implementation.
Some implementations of the present disclosure may include articles of manufacture. An article of manufacture may include a storage medium for storing logic. Examples of the storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writable or rewritable memory, and so on. Examples of the logic may include various software units, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application programming interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. In some implementations, for example, an article of manufacture may store executable computer program instructions that, when executed by a processing unit, cause the processing unit to perform the methods and/or operations described herein. The executable computer program instructions may include any suitable type of code, for example source code, compiled code, interpreted code, executable code, static code, dynamic code, and so on. The executable computer program instructions may be implemented according to a predefined computer language, manner, or syntax for instructing a computer to perform a particular function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled, and/or interpreted programming language.
What has been described above includes examples of the disclosed architecture. It is of course not possible to describe every conceivable combination of components and/or methods, but those skilled in the art will appreciate that many further combinations and permutations are feasible. Accordingly, the novel architecture is intended to embrace all such alternatives, modifications, and variations that fall within the spirit and scope of the appended claims.

Claims (13)

  1. A method for reducing the amount of data used for data visualization, comprising:
    drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points;
    providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category;
    determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold; and
    reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.
  2. The method of claim 1, wherein:
    the image of each of the plurality of data distribution graphs contains only the distribution shape of the corresponding data points.
  3. The method of claim 1, further comprising:
    training the neural network model with a training data set, wherein the training data set comprises:
    a first part containing the images of all data distribution graphs labelled as belonging to the particular category, and
    a second part containing the images of all data distribution graphs labelled as not belonging to the particular category,
    wherein, based on a split value k selected between 0 and 50%, for the proportion f corresponding to each of the plurality of data distribution graphs, the image of that data distribution graph is labelled as belonging to the particular category if f > (100% - k), and as not belonging to the particular category if f < k.
  4. The method of claim 1, wherein determining the value interval based on the output of the classifier comprises:
    drawing, for the images of the plurality of data distribution graphs, a relationship graph reflecting the correlation between the probability that each image belongs to the particular category and the proportion corresponding to the respective data distribution graph; and
    determining, as the value interval, the proportion interval in the relationship graph that corresponds to probabilities not less than the threshold.
  5. The method of claim 1, wherein the selected target proportion corresponds to the lower limit of the value interval.
  6. The method of claim 1, further comprising:
    storing the collected data points according to the selected target proportion.
  7. The method of claim 3, wherein a test data set comprises the images of all data distribution graphs among the images of the plurality of data distribution graphs that do not belong to the training data set, and wherein, based on the output of the classifier, if it is determined that at least one image in the second part is judged by the classifier to belong to the particular category with a probability greater than 0, or if it is determined that the transition of the probability from 0 to 1 does not occur within the test data set, then:
    re-selecting a smaller split value k;
    reconstructing the training data set from the images of the plurality of data distribution graphs based on the re-selected split value k; and
    retraining the neural network model with the reconstructed training data set.
  8. The method of claim 1, wherein each data point in the set of data points contains data collected from at least one sensor, and wherein the method further comprises: instructing the sensor to reduce its data sampling frequency according to the selected target proportion.
  9. An apparatus for reducing the amount of data used for data visualization, comprising:
    a module for drawing a plurality of data distribution graphs for a set of data points, wherein each data distribution graph presents a different proportion of data points selected from the set of data points;
    a module for providing the image of each data distribution graph as input to a classifier based on a neural network model to obtain an output of the classifier, wherein the output of the classifier indicates the probability that each image belongs to a particular category;
    a module for determining a value interval based on the output of the classifier, wherein, for a second plurality of data distribution graphs among the plurality of data distribution graphs whose corresponding proportions fall within the value interval, the probability that the image of each of the second plurality of data distribution graphs belongs to the particular category is not less than a threshold; and
    a module for reducing, relative to the number of the set of data points, the number of data points used for data visualization according to a target proportion selected from the value interval.
  10. The apparatus of claim 9, further comprising:
    a module for training the neural network model with a training data set, wherein the training data set comprises:
    a first part containing the images of all data distribution graphs labelled as belonging to the particular category, and
    a second part containing the images of all data distribution graphs labelled as not belonging to the particular category,
    wherein, based on a split value k selected between 0 and 50%, for the proportion f corresponding to each of the plurality of data distribution graphs, the image of that data distribution graph is labelled as belonging to the particular category if f > (100% - k), and as not belonging to the particular category if f < k.
  11. The apparatus of claim 10, wherein a test data set comprises the images of all data distribution graphs among the images of the plurality of data distribution graphs that do not belong to the training data set, and wherein, based on the output of the classifier, if it is determined that at least one image in the second part is judged by the classifier to belong to the particular category with a probability greater than 0, or if it is determined that the transition of the probability from 0 to 1 does not occur within the test data set, then:
    a smaller split value k is re-selected;
    the training data set is reconstructed from the images of the plurality of data distribution graphs based on the re-selected split value k; and
    the neural network model is retrained with the reconstructed training data set.
  12. A computing device, comprising:
    a memory for storing instructions; and
    at least one processor coupled to the memory, wherein the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any one of claims 1-8.
  13. A computer-readable storage medium having instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1-8.
PCT/CN2019/087661 2019-05-20 2019-05-20 Method and apparatus for reducing the amount of data used for data visualization WO2020232612A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/087661 WO2020232612A1 (zh) 2019-05-20 2019-05-20 Method and apparatus for reducing the amount of data used for data visualization

Publications (1)

Publication Number Publication Date
WO2020232612A1 true WO2020232612A1 (zh) 2020-11-26

Family

ID=73459402

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087661 WO2020232612A1 (zh) 2019-05-20 2019-05-20 Method and apparatus for reducing the amount of data used for data visualization

Country Status (1)

Country Link
WO (1) WO2020232612A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103188419A (zh) * 2011-12-31 2013-07-03 Peking University Founder Group Co., Ltd. Image compression method and device
CN107240136A (zh) * 2017-05-25 2017-10-10 North China Electric Power University Static image compression method based on a deep learning model
CN107832807A (zh) * 2017-12-07 2018-03-23 Shenzhen United Imaging Healthcare Co., Ltd. Image processing method and system
US20180376142A1 (en) * 2009-11-06 2018-12-27 Adobe Systems Incorporated Compression of a collection of images using pattern separation and re-organization
CN109391818A (zh) * 2018-11-30 2019-02-26 Kunming University of Science and Technology Fast-search fractal image compression method based on DCT transform

Similar Documents

Publication Publication Date Title
CN106650780B (zh) Data processing method and device, classifier training method and system
WO2021114832A1 (zh) Sample image data enhancement method and apparatus, electronic device, and storage medium
US10216558B1 (en) Predicting drive failures
WO2017166449A1 (zh) Machine learning model generation method and apparatus
US20210056417A1 (en) Active learning via a sample consistency assessment
WO2016180268A1 (zh) Text aggregation method and apparatus
WO2022042123A1 (zh) Image recognition model generation method and apparatus, computer device, and storage medium
US8958661B2 (en) Learning concept templates from web images to query personal image databases
WO2017107422A1 (zh) User gender identification method and apparatus
CN110399487B (zh) Text classification method and apparatus, electronic device, and storage medium
US9276821B2 (en) Graphical representation of classification of workloads
WO2020154830A1 (en) Techniques to detect fusible operators with machine learning
JP2022058691A (ja) Adversarial network model training method, character library creation method, and corresponding apparatus, electronic device, storage medium, and program
US20200097997A1 (en) Predicting counterfactuals by utilizing balanced nonlinear representations for matching models
CN110969198A (zh) Distributed training method, apparatus, device, and storage medium for a deep learning model
WO2022088632A1 (zh) User data monitoring and analysis method, apparatus, device, and medium
CN112131322B (zh) Time series classification method and apparatus
JP2022058696A (ja) Adversarial network model training method, character library creation method, and corresponding apparatus, electronic device, storage medium, and computer program
JP2021193615A (ja) Quantum data processing method, quantum device, computing device, storage medium, and program
WO2020238303A1 (zh) Method and apparatus for identifying functional regions
CN113128565B (zh) Automatic image annotation system and apparatus agnostic to pre-training annotation data
WO2023174189A1 (zh) Graph network model node classification method, apparatus, device, and storage medium
WO2020232612A1 (zh) Method and apparatus for reducing the amount of data used for data visualization
CN109919324B (zh) Transfer learning classification method, system, and device based on label proportion learning
WO2023193473A1 (zh) Spectrum sensing method, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929546

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929546

Country of ref document: EP

Kind code of ref document: A1