WO2019002379A1 - An interactive interface for improving the management of datasets - Google Patents

Info

Publication number
WO2019002379A1
Authority
WO
WIPO (PCT)
Prior art keywords
variable
adaptation
values
type
objects
Prior art date
Application number
PCT/EP2018/067270
Other languages
French (fr)
Inventor
Christophe GENOLINI
Aygul ABAKIROVA
Original Assignee
Zebrys
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zebrys
Priority to US16/624,880 (US11106866B2)
Publication of WO2019002379A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/20 Drawing from basic elements, e.g. lines or circles
    • G06T11/206 Drawing of charts or graphs

Definitions

  • the present invention relates to the field of database management. More specifically, it relates to the interaction between a user and a dataset through an interface.
  • Spreadsheet applications such as Microsoft Excel®, usually represent data in the form of a two dimensional array. This representation has proven to be effective for managing data describing a large number of objects, wherein the data comprises values of a number of variables for each object.
  • This kind of application can be used for example to perform statistical computations regarding a number of variables, such as the gender, age, grade, etc... of objects.
  • the values of variables are generally displayed in a two dimensional array, wherein each line represents an object, and each column represents a variable.
  • the user can use the interface to modify values of the variables.
  • the user can also perform a number of operations such as filtering values or drawing graphs in order to have a quick overview of the data. For example, if the dataset comprises ages of students, and if a user wishes to have an overview of the ages of the students, he/she can select the column which contains the ages and click on a button.
  • Prior art applications let the user choose a type of graph, for example by presenting all the possible types of graph in a window; the user then selects the desired type to generate a graph, for example a histogram.
  • the user of existing solutions faces a number of problems.
  • One of these problems is that the user needs to manually select the type of graph that he/she wants to display. This may be a burden for users who are not used to manipulating statistics.
  • The sources of data may be very diverse. In a large number of cases, the data is input manually by many different persons. This is for example the case for data issued from a poll, wherein a large number of subjects are invited to provide information about themselves. In this case, many errors may occur in the input data. For example, if the subjects are asked to indicate their ages, many of them could enter their ages in an incorrect format, for example by entering the age in letters instead of numbers. In this case, existing solutions are unable to correctly interpret the data as numbers and to provide a valid graph.
  • Another functionality of spreadsheets is to allow the user to perform statistical calculations. For example, the user can input formulas to calculate statistical values (e.g. the minimum, maximum, median, average, etc.) of a series of values of a variable for all the objects.
  • The existing solutions suffer from at least two major drawbacks. One of them is that the user needs to manually enter formulas to perform these calculations. These solutions thus require the user to be familiar with the semantics of formulas and programming, which is not the case for many users.
  • Another is that, in case of errors in the input (for example if one of the occurrences of the variable "age" is entered in letters instead of numbers), the existing solutions are unable to correctly interpret the data and thus generate errors. They are also unable to indicate to the user the origin of the error. In the case of a large dataset, the user may thus be unable to locate the faulty value and correct it.
  • Inconsistent values may also be problematic. For example, if a subject inputs an age of "200" instead of "20", and the user wishes to display a histogram of ages, the value "200" will be processed as semantically correct, but will affect the histogram by generating an isolated, meaningless bar. Once the user has identified that an inconsistent value is present in the data, the existing solutions do not provide any simple and efficient mechanism to identify and correct the inconsistent values, especially when the amount of data is very large.
  • the invention discloses a device comprising: an access to a display; one or more input interfaces to receive commands from a user; an access to one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects; a processing logic comprising: an adaptation to obtain automatically a type of a variable in one or more sets of variables; an adaptation to generate a raw representation of values of said variable for said plurality of objects; an adaptation to determine a synthetic representation type based on said type of variable; an adaptation to generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects; an adaptation to output said raw representation of values and said synthetic representation of values to said display; an adaptation to receive from said one or more input interfaces a selection by the user of an element of said synthetic representation; an adaptation to perform a selection of all objects in said plurality of objects whose value of the variable is represented
  • The term “object” is similar in scope to the term “individual” in statistics. Therefore, it designates objects, as well as animals or humans or the like, whose properties can be described through variables.
  • a raw representation is a representation that provides to the user values of the variables for each object.
  • the raw representation can be a two dimensional array of a spreadsheet.
  • the invention thus allows a user to easily and intuitively detect interesting, inconsistent or incorrect values within data, and modify them.
  • the dataset comprising the values of the variable for the plurality of objects is stored in one or more files; the adaptation to obtain automatically a type of the variable comprises an adaptation to detect automatically the type of the variable, based on the values of said variable for the plurality of objects.
  • said adaptation to detect automatically the type of the variable comprises one or more adaptations to: retrieve a collection of text strings for all occurrences of the variable; if the number of dictionary words in the collection of text strings is equal to two, detect the type of the variable as a binary type; if the number of dictionary words in the collection of text strings is different from two: if each string in the collection is representative of an integer number, and if the size of each string is below a predefined threshold, detect the type of the variable as an integer type; if each string in the collection is representative of a number, and if at least one number is a non-integer number or if the size of at least one string is above the predefined threshold, detect the type of the variable as a numeric type; if the collection of text strings comprises at least one non-numeric character, detect the type of the variable as a nominal type.
  • the synthetic representation type is an array comprising one or more statistical parameters; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more statistical parameters based on said type of variable; said adaptation to generate the synthetic representation of values comprises an adaptation to calculate one or more statistical parameters for said variable; said adaptation to receive from said one or more input interfaces the selection by the user of the element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a statistical parameter; said adaptation to perform the selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects based on said statistical parameter and an adaptation to highlight values of said variable for said one or more objects in the raw representation.
  • the automatic calculation of statistical parameters based on the type of variables allows an automatic calculation of the most relevant statistical parameters. For example, an average, median, minimum and maximum could be calculated for a numeric variable, while a percentage of occurrences of each value could be calculated for a nominal variable.
  • the selection by the user of a statistical parameter allows the user to intuitively select a statistical value which appears abnormal or of interest. For example, the user may select a maximum value if it is abnormally high.
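  • The type-dependent statistics and the selection of objects through a statistical parameter described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names `statistical_parameters` and `objects_matching_parameter`, the type labels, and the choice to return row indices as the "selected objects" are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median


def statistical_parameters(values, variable_type):
    """Return the statistics judged relevant for the given variable type.

    Hypothetical helper: numeric-like types get summary statistics, any
    other type gets the percentage of occurrences of each value.
    """
    if variable_type in ("integer", "numeric"):
        numbers = [float(v) for v in values]
        return {
            "minimum": min(numbers),
            "maximum": max(numbers),
            "average": mean(numbers),
            "median": median(numbers),
        }
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in counts.items()}


def objects_matching_parameter(values, parameter):
    """Return the indices of the objects whose value equals the selected
    statistical parameter (only "minimum" and "maximum" in this sketch)."""
    numbers = [float(v) for v in values]
    target = {"minimum": min, "maximum": max}[parameter](numbers)
    return [index for index, number in enumerate(numbers) if number == target]
```

  • With ages entered as `["20", "22", "200", "21"]`, selecting the maximum would point at the object holding the suspicious value "200", whose cell could then be highlighted in the raw representation.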
  • said synthetic representation is a graph, each section of the graph representing the number of occurrences of one or more values of the variable for the plurality of objects; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine a type of graph based on the type of variable; said adaptation to generate the synthetic representation of values comprises an adaptation to calculate a size of each section of the graph; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a section of the graph; said adaptation to perform a selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects whose value of the variable is represented by the section selected by the user.
  • the automatic generation of a graph based on the type of variable allows representing the most relevant type of graph for each type of variable, for example a histogram for numeric variables, a spaced bar graph for binary variables, or a circle graph for nominal variables.
  • the selection by the user of a section of the graph allows the user to intuitively select a section of the graph value which visually appears abnormal or of interest. For example, a user can select an isolated bar of a histogram, which visually represents a value with an unlikely high or low value, or a very small section of a circle graph, which represents a value of a nominal variable with a very low, or a unique occurrence, any of these values likely representing an input error.
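  • As an illustration of the graph-based interaction above, the sketch below maps a variable type to a graph type and groups objects by histogram section, so that selecting a section yields the objects it represents. The mapping follows the examples given in the text; treating an integer variable like a numeric one, the fixed bin width, and the names `graph_type_for` and `histogram_sections` are assumptions made for this example.

```python
# Graph types per variable type, taken from the examples in the text.
GRAPH_TYPE_BY_VARIABLE_TYPE = {
    "numeric": "histogram",
    "binary": "spaced bar graph",
    "nominal": "circle graph",
}


def graph_type_for(variable_type):
    """Pick a graph type for a variable type."""
    # Assumption: an integer variable is handled like any numeric variable.
    if variable_type == "integer":
        variable_type = "numeric"
    return GRAPH_TYPE_BY_VARIABLE_TYPE[variable_type]


def histogram_sections(values, bin_width):
    """Group object indices by histogram bar.

    Returns a dict mapping the lower bound of each bar to the indices of
    the objects falling in it; the size of a bar is the list length.
    """
    sections = {}
    for index, value in enumerate(values):
        bin_start = (value // bin_width) * bin_width
        sections.setdefault(bin_start, []).append(index)
    return sections
```

  • For ages `[20, 22, 200, 21, 35]` and a bin width of 10, the bar starting at 200 contains a single object, which is the kind of isolated bar a user might click to locate an input error.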
  • said synthetic representation is a list of modalities of the variable; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable; said adaptation to generate the synthetic representation of values comprises an adaptation to represent, if the variable is a categorical variable, modalities of the variable among the plurality of objects; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces an association between a first and a second modality of the variable; said adaptation to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having the first modality as value of the variable, and assign the second modality as value of the variable for said one or more objects.
  • The representation of the modalities of the variable allows the user to obtain an overview of the values of the variable for the plurality of objects, and to detect values which are similar (for example values which differ by a single letter, by singular versus plural form, or by capitalization) and which likely have the same meaning.
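  • The association between a first and a second modality can be sketched as a simple reassignment. This is an illustrative sketch only; the helper names `list_modalities` and `merge_modality` are hypothetical, and values are assumed to be held as a plain list of strings.

```python
from collections import Counter


def list_modalities(values):
    """Return each modality of the variable with its number of occurrences."""
    return Counter(values)


def merge_modality(values, first, second):
    """Assign the second modality to every object whose value is the first one."""
    return [second if value == first else value for value in values]
```

  • For example, with colors `["red", "Red", "blue", "reds", "red"]`, associating "Red" with "red" leaves three occurrences of "red", and "reds" remains visible in the list as a further candidate for merging.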
  • said synthetic representation is one of a list of variable types or a combination of a list of variable types and a list of modalities of the variable; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable; said adaptation to generate the synthetic representation of values comprises an adaptation to represent a list of variable types and, if the variable is a categorical variable, modalities of the variable among the plurality of objects; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a desired type of the variable, which is different from said type of variable; said adaptation to perform a selection of one or more objects corresponding to said element comprises: an adaptation to identify and output to said display modalities of the variable that do not match the desired type of the variable; an adaptation to receive from said one or more input interfaces a selection of a modality of the variable that does not match the desired type
  • The display of the list of possible types of the variable allows the user to determine instantaneously whether the variable has been cast to a type it should not have, and to display to the end user a proposal of the type that the variable should have.
  • the processing logic further comprises an adaptation to set the type of the variable as the desired type of variable, and modify representations in memory of the occurrences of said variable according to said desired type of variable, when all occurrences of said variable match the desired type of variable.
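  • The type-change workflow above (list the modalities that do not match the desired type, then convert once every occurrence matches) can be sketched as follows. This is a hedged illustration: the `PARSERS` table, the function names, and the use of Python's `int` and `float` as the test for "matching" a type are assumptions, not the method defined by the application.

```python
# Assumption: a value "matches" a type when the corresponding parser accepts it.
PARSERS = {"integer": int, "numeric": float}


def mismatching_modalities(values, desired_type):
    """Return the distinct values that cannot be read as the desired type."""
    parser = PARSERS[desired_type]
    mismatches = set()
    for value in values:
        try:
            parser(value)
        except ValueError:
            mismatches.add(value)
    return sorted(mismatches)


def convert_if_consistent(values, desired_type):
    """Convert all values to the desired type, or return None while at
    least one occurrence still does not match it."""
    if mismatching_modalities(values, desired_type):
        return None
    parser = PARSERS[desired_type]
    return [parser(value) for value in values]
```

  • With ages `["20", "twenty", "31"]` and a desired integer type, the sketch reports "twenty" as the only mismatching modality; once the user has replaced it by "20", the conversion succeeds.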
  • a nominal variable can be represented by a list of all its modalities; an integer variable can be represented using a mix of single values and intervals; a continuous numeric variable can be represented using intervals.
  • said synthetic representation type is one or more types of values to highlight; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more types of values to highlight based on said type of variable; said adaptation to generate the synthetic representation of values comprises an adaptation to select values of the variable where to superimpose colors in the raw representation based on said types of values to highlight, and to superimpose colors on said selected values; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a value highlighted in a color in the raw representation; said adaptation to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having a value of the variable highlighted in the color.
  • For example, if the variable is a nominal variable, empty values can be highlighted.
  • If the variable is a numeric variable, particularly high or low values (for example, the minimum, maximum, top 5% values, etc.) can be highlighted.
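  • A minimal sketch of type-dependent highlighting is given below. It only covers two of the rules mentioned in the text (empty values for nominal variables, minimum and maximum for numeric variables); the function name `cells_to_highlight`, the type labels, and the assumption that every numeric value can be parsed by `float` are choices made for this example.

```python
def cells_to_highlight(values, variable_type):
    """Return the indices of the cells to color in the raw representation."""
    if variable_type == "nominal":
        # Highlight empty cells (missing or whitespace-only values).
        return {
            index
            for index, value in enumerate(values)
            if value is None or str(value).strip() == ""
        }
    if variable_type in ("integer", "numeric"):
        # Highlight every occurrence of the extreme values.
        numbers = [float(value) for value in values]
        lowest, highest = min(numbers), max(numbers)
        return {
            index
            for index, number in enumerate(numbers)
            if number in (lowest, highest)
        }
    return set()
```

  • Clicking one of the highlighted cells could then select all objects sharing that highlight, as described for this embodiment.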
  • the invention also discloses a method comprising: accessing one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects; obtaining a type of a variable in the one or more sets of variables; generating a raw representation of values of said variable for said plurality of objects; determining a synthetic representation type based on said type of variable; generating a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects; displaying said raw representation of values and said synthetic representation of values; receiving a selection by a user of an element of said synthetic representation; selecting all objects in said plurality of objects whose value of the variable is represented by said element.
  • the invention also discloses a computer program product comprising computer code instructions configured to: access one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects; obtain a type of a variable in the one or more sets of variables; generate a raw representation of values of said variable for said plurality of objects; determine a synthetic representation type based on said type of variable; generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects; display said raw representation of values and said synthetic representation of values; receive a selection by a user of an element of said synthetic representation; select all objects in said plurality of objects whose value of the variable is represented by said element.
  • FIG. 1 displays an example of a functional architecture of a device in a number of embodiments of the invention;
  • FIG. 2 displays an example of an interface of a device in a number of embodiments of the invention;
  • FIGS. 4a to 4d display an example of a drawing of a graph representing values of a variable, and of detection and correction of inconsistent data using an interaction of the user with the graph;
  • FIGS. 5a to 5h display an example of association and modification of modalities of a variable in a number of embodiments of the invention;
  • FIGS. 6a to 6g display a first example of modification of a type of variable in a number of embodiments of the invention;
  • FIGS. 8a to 8d display an example of filtering of values based on two variables in a number of embodiments of the invention;
  • FIGS. 9a to 9e display an example of highlighting values of interest depending on the types of each variable in a number of embodiments of the invention.
  • Figure 1 displays an example of a functional architecture of a device in a number of embodiments of the invention.
  • the device 100 may be any kind of device with computing capabilities.
  • The device 100 may be a computer, a smartphone or a tablet, an IoT device or the like.
  • The device comprises an access to a display 110.
  • The display 110 can be any kind of display allowing the user to view images from the device 100.
  • The display 110 may be any screen or the like, such as glasses, a tactile screen, a watch or a video projector.
  • The display 110 may be embedded within the same housing as the device 100. This is for example the case if the display 110 is a screen (whether tactile or not) of a smartphone or a tablet.
  • The display may also be embedded within a housing separate from the device 100, with the device 100 connected to the display 110. This is for example the case if the display is a separate screen or a video projector.
  • The device 100 can be associated with the display 110 through any kind of connection, for example through a wired or a wireless connection (e.g. Bluetooth™, Wi-Fi™, or any kind of wireless connection allowing the device 100 to send images to the display 110).
  • the display can also be a combination of elementary displays.
  • the display can comprise two screens side by side, and the user can use seamlessly the two screens.
  • the device 100 also comprises one or more input interfaces 120 to receive commands from a user.
  • The one or more input interfaces 120 can be any kind of interface allowing the user to input commands, such as a mouse, vocal command system, buttons, keys, keyboard, eye contact or the like.
  • Figure 1 displays the one or more input interfaces 120 as a connection to a mouse and a keyboard in a separate housing.
  • The one or more input interfaces may be either a connection to an external device, a part of the device, or a connection to the display 110.
  • The one or more input interfaces may comprise a wired or wireless connection to a mouse, a wired or wireless connection to a keyboard, a connection to an external tactile screen (which can in this case also serve as display 110), a touchpad or keyboard embedded within the housing of the device 100, a microphone to receive vocal commands, or any other kind of suitable input interface.
  • the device 100 further comprises an access to one or more memories 130 storing a dataset.
  • the one or more memories 130 may be any kind of internal, external, volatile or non-volatile memories.
  • it can be formed by an internal or external hard drive, a memory in a cloud which can be accessed by the device 100, a memory shared among a plurality of devices, a memory on a server, a CD-ROM, or a combination thereof, or more generally any suitable memory or combination thereof.
  • the one or more memories 130 store a dataset comprising values of one or more sets of variables for a plurality of objects.
  • the dataset can take any suitable form to store values of variables for objects.
  • The dataset can be a database, a text file, a raw or compressed file associated with a spreadsheet application (for example an Excel® file such as one with the extension ".xls" or ".xlsx") or more generally any format allowing a storage of values of variables for a plurality of objects.
  • The term "object" is to be understood as a statistical object or individual. Therefore, an object can designate a human, an animal or an item, provided that the object can be characterized at least in part by values of variables.
  • the invention can thus be applied to a large number of datasets.
  • It can for example be applied to datasets of vehicles, wherein the variables may be the number of wheels, the number of doors, the license plate, the color, the brand of the vehicles, the expected consumption, the price, etc., or to datasets of persons, wherein the variables may be the age, gender, weight, etc.
  • the device 100 allows the user to view, modify and interact with the dataset.
  • the device 100 allows the user to view the values of the variables for each object, but also modify these variables. It also allows drawing graphs and displaying colors that allow the user to have a quick overview and understanding of the content of the dataset.
  • the dataset can be retrieved from a wide variety of sources.
  • the values may be entered manually by the user of the device 100, may be retrieved from external sources or databases, or may be input by a large number of persons, for example through a poll. The latter case is for example commonly used if the dataset is created from questions asked to a large number of persons, in particular when users themselves enter the values.
  • the device 100 comprises a processing logic 140.
  • a processing logic may be a processor operating in accordance with software instructions, a hardware configuration of the processor, or a combination thereof. It should be understood that any or all of the functions discussed herein may be implemented in a pure hardware implementation and/or by a processor operating in accordance with software instructions, and/or a configuration of a machine learning engine or neural network.
  • a processing logic may also be a multi-core processor executing operations in parallel, a series of processors, or a combination thereof. It should also be understood that any or all software instructions may be stored in a non-transitory computer-readable medium.
  • the term "adaptation of a processing logic" refers to any means (for example hardware configuration, software instructions, machine learning, training or neural network, or any other adaptation means or combination thereof) of adapting a processing logic to execute operations.
  • In the course of the application, various embodiments of the invention will often be exemplified in relation to a single variable, which will be designated as "the variable". However, it shall be noted that the invention can be applied to a plurality of variables in parallel.
  • the invention relies on a management of values of variables in the one or more sets of variables.
  • the processing described in the application is performed on a single set of data.
  • the invention can be applied concurrently on a plurality of sets of data.
  • the invention can be applied to a plurality of sets of data: it is for example the case if the invention is applied on a spreadsheet comprising a plurality of tabs: each tab can define a set of variables. It can also be the case within a single tab of a spreadsheet, if variables within the tab are clearly separated in a plurality of sets. It can also be the case if the invention is applied to a plurality of files, each file can define its own set of variables.
  • the invention can be applied separately on each set of variables.
  • the processing logic can comprise an adaptation, not represented in the figures, to identify the one or more sets of variables. Any suitable method can be used to identify the one or more sets of variables.
  • some file or database formats explicitly define the variables.
  • the variables can be simply read in a file or database.
  • In other cases, the variables are not explicitly defined, but the presentation of data implicitly indicates what the variables are. This is for example the case for files that organize data by columns, wherein the name of the variable is written in the top cell of each column.
  • The set of variables can then be defined by detecting patterns representative of an organization of the dataset, and reading the names of variables in the place defined by the pattern, for example the top cell of each column.
  • Alternatively, the raw data can be presented to the user, who defines what the variables are, for example by clicking on the names of the variables and selecting the values of each variable.
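  • The column-header convention mentioned above (variable names in the top cell of each column) can be sketched with Python's standard `csv` module. The function name `read_variables` and the choice of comma-separated text as input are assumptions for illustration; the text does not prescribe a file format.

```python
import csv
import io


def read_variables(text):
    """Read column-organized text and return {variable name: list of values}.

    The first row is taken as the list of variable names; each following
    row describes one object. Missing trailing cells become empty strings.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    return {
        name: [row[column] if column < len(row) else "" for row in data]
        for column, name in enumerate(header)
    }
```

  • Each returned list holds the raw text strings of one variable, which is the form assumed by type detection when the file does not state variable types explicitly.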
  • the processing logic 140 comprises an adaptation 141 to obtain automatically a type of a variable in the one or more sets of variables.
  • continuous numeric variable: a numeric variable that can take any value within a range (for example, a temperature);
  • discrete numeric variable: a numeric variable that can take only discrete values (for example, a grade expressed with a precision of 0.1);
  • integer variable: a kind of numeric variable that can take only integer values (for example, the number of wheels of a vehicle);
  • nominal variable: a variable having modalities expressed by words (for example, names of persons).
  • The term "word" is used in the description to designate an alphanumeric string. The strings are interpreted as words of a nominal variable if they do not appear to belong to other categories;
  • ordered variable: a kind of nominal variable, in which the modalities of the variable can be logically ordered (for example, an appreciation of an experience by the user can be expressed using 6 ordered modalities: "very poor", "poor", "average", "good", "excellent", "outstanding").
  • Variables which are not numeric can be generally referred to as "categorical variables".
  • the dataset comprises values of variables for objects in one or more files.
  • the dataset comprises information to explicitly determine the type of data.
  • A variable may be explicitly defined as an 'Integer' variable.
  • The adaptation 141 can comprise an adaptation to retrieve the type of variable from the dataset.
  • In other cases, the dataset comprises values of variables without any explicit mention of the type of variable, but the type of variable can be deduced from the values in the dataset. This is for example the case if the values are stored in a file in the form of text characters, either in a raw or compressed form. This kind of file is very common for storing datasets. For example, the .xls, .xlsx or .odt files belong to these types of storage, unless they comprise words to explicitly define the types of variables.
  • the adaptation 141 can comprise an adaptation to retrieve the text characters corresponding to values of said variable for the plurality of objects, and analyze these text characters to deduce the type of the variable.
  • these adaptations may comprise one or more adaptations to:
  • if the number of dictionary words (i.e. the number of different text strings) in the collection of text strings is equal to two, detect the type of the variable as a binary type. Indeed, if only two different text strings are present among all the values of the variable, it can be assumed that they are representative of a binary option. This can be performed for example by counting the number of different text strings, and setting the type of the variable as binary if there are exactly two different text strings;
  • if the number of words in the collection is different from two, and if each string in the collection comprises only digits, detect the variable type as one of an integer or a numeric type.
  • A number of different options can be used to disambiguate between an integer and a numeric type.
  • An integer type could be selected if all numbers are integers, and a numeric type could be selected if at least one number is not an integer number.
  • The size of the numbers can also be used.
  • A numeric type can be selected instead of an integer type if at least one number has a number of digits above a predefined threshold, for example at least 10 digits;
  • if the number of words in the collection is different from two, and if the text string comprises at least one non-numeric character, detect the variable type as a nominal type.
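  • The detection rules above can be sketched as a small function. This is an illustration under stated assumptions, not the claimed implementation: the threshold value `MAX_INTEGER_DIGITS`, the regular expressions used to recognize numbers, the removal of empty strings, and the function name `detect_variable_type` are all choices made for the example.

```python
import re

# Assumed threshold: the text only gives "at least 10 digits" as an example.
MAX_INTEGER_DIGITS = 10

INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")
NUMBER_PATTERN = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def detect_variable_type(values):
    """Deduce a variable type from the text strings of all its occurrences."""
    # Assumption: empty cells are ignored when detecting the type.
    strings = [value.strip() for value in values if value.strip()]

    # Exactly two different strings: assume a binary option.
    if len(set(strings)) == 2:
        return "binary"

    # Only integers, all short enough: integer type.
    if strings and all(INTEGER_PATTERN.fullmatch(s) for s in strings):
        if all(len(s.lstrip("+-")) < MAX_INTEGER_DIGITS for s in strings):
            return "integer"
        return "numeric"

    # Only numbers, at least one of them non-integer: numeric type.
    if strings and all(NUMBER_PATTERN.fullmatch(s) for s in strings):
        return "numeric"

    # At least one string contains a non-numeric character
    # (an empty collection also falls back to nominal in this sketch).
    return "nominal"
```

  • Note that under these rules a column such as `["20", "twenty", "31"]` is detected as nominal, which is precisely the situation in which the user may want to request a different type and correct the mismatching value.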
  • the processing logic 140 further comprises an adaptation 142 to generate a raw representation of values of said variable for said plurality of objects.
  • the raw representation can be any kind of representation that provides a view of the values themselves, and allows the user to read values of the variables for the objects.
  • the raw representation can be a two dimensional array wherein each column represents a variable, each line represents an object, and each cell in a line and a column contains the value of the variable represented by the column for the object represented by the line.
  • An example of such a representation is provided in figure 2.
  • this kind of representation is provided by means of non-limitative example only, and any kind of representation that allows the user to read each value of the variable can be used as raw representation.
  • the processing logic 140 further comprises an adaptation 143 to determine a synthetic representation type based on said type of variable.
  • a synthetic representation is a representation that is built using an analysis and/or processing of values, in order to display to the user information that is meaningful at first sight.
  • the synthetic representation may be of different types.
  • the synthetic representation types comprise different types of graphs (histogram, circular graph...) or representations of statistical values (average, maximum...) wherein the statistical values are chosen depending on the type of variable. Examples of synthetic representations will be provided hereinafter. Determining the type of synthetic representation depending on the type of variable allows providing the user with the synthesis of information that is the most relevant for this type of variable.
  • the processing logic 140 further comprises an adaptation 144 to generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type.
  • the synthetic representation comprises a plurality of elements and each element represents one or more values for the plurality of objects.
  • for example, in a histogram, each bar represents one or more values of the variable;
  • the height of the bar is representative of the number of occurrences of the one or more values.
  • the generation of the synthetic representation can be performed for example by calculating the synthetic values which characterize the synthetic representation (for example, the heights of the bars of a histogram, or an average of the values of the variable).
  • since the representation type is selected depending on the type of variable, the representation highlights specific or abnormal values.
  • the processing logic 140 further comprises an adaptation 145 to output the raw representation of values and said synthetic representation of values to the display 110.
  • the adaptation 145 can thus send the raw representation and the synthetic representation of values to the display 110 in order for the representations to be seen by the user.
  • the representations can be sent in any kind of communication which is used between the device 100 and the display 110, for example by sending images to the display.
  • the processing logic 140 can further comprise an adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said synthetic representation.
  • the selection of the element the user is interested in can be performed in any way. For example, the user can move a cursor on the display, and click where the element is displayed, or navigate among elements using a keyboard, and press a button when the relevant element is selected.
  • the processing logic 140 further comprises an adaptation 147 to perform a selection of one or more objects corresponding to said element.
  • the one or more objects corresponding to said element are the one or more objects whose values of the variable have been used to generate the element.
  • the one or more objects can thus be selected automatically on the basis of the element selected by the user.
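  • by means of non-limitative example only, the link between an element of a synthetic representation and the corresponding objects can be sketched as follows in Python, taking a histogram as synthetic representation. The function names, the `bin_width` parameter and the sample values of "Age" are assumptions made for illustration:

```python
from collections import defaultdict


def build_bars(values, bin_width):
    """Map the lower bound of each histogram bar to the indices of its objects."""
    bars = defaultdict(list)
    for index, value in enumerate(values):
        lower_bound = (value // bin_width) * bin_width
        bars[lower_bound].append(index)
    return dict(bars)


def bar_heights(bars):
    """The height of a bar is the number of objects it represents."""
    return {lower: len(indices) for lower, indices in bars.items()}


def select_objects(bars, lower_bound):
    """Return the objects whose values were used to generate the selected bar."""
    return bars.get(lower_bound, [])


ages = [21, 22, 21, 41, 23]  # hypothetical values of the variable "Age"
bars = build_bars(ages, bin_width=5)
```

  • because each bar keeps the indices of the objects it was generated from, a selection of the isolated bar starting at 40 directly yields the object whose age is 41, without scanning the dataset again.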
  • the synthetic representation highlights values that are of interest for the user, for example abnormal or incorrect values, because the synthetic representation type is based on the type of the variable.
  • the user can thus intuitively select an element of the synthetic representation that attracts his/her attention.
  • the corresponding objects are then selected. This allows the user to select intuitively and efficiently the one or more objects that he/she may wish to verify or modify, without having to check all objects of the dataset, which is often a cumbersome and impracticable task.
  • a tablet or a smartphone display would also bring powerful data selection, processing and representation.
  • a column of an array may be selected by touching the column with a finger, a pen, a smart watch or by means of a gesture.
  • Display may adapt itself to bring data representation as depicted in the invention.
  • a graphical representation may appear.
  • a long press or a harder press on the screen may bring up a contextual menu.
  • Menu selection may trigger processing and adaptation to the selected values. The speed and efficiency to select and process values of interest makes it possible to work with large datasets on tablets and smartphones despite the reduced computing power capabilities compared to powerful desktop computers used by professional statisticians.
  • Figure 2 displays an example of an interface of a device in a number of embodiments of the invention.
  • the interface 200 allows visualizing and modifying values of a dataset.
  • the interface 200 shares certain common aspects with the interfaces of spreadsheet applications (such as for example the Microsoft Excel ® application).
  • the interface 200 comprises a toolbar 210, and a raw representation of values 220.
  • the interface 200 can be displayed in the display 110, and the user can interact with it using the one or more input interfaces 120.
  • the toolbar 210 comprises:
  • buttons to manage the windowing of the interface, for example by providing a second window, cascading windows, etc.
  • the raw representation 220 represents the values of the dataset using a two dimensional array.
  • a first line 230 displays the names of the variables; each subsequent line 231, 232, 233... represents an object.
  • the raw representation 220 represents a French dataset representing information regarding French university students.
  • a first column 2400 represents the number of the lines.
  • Each subsequent column represents a variable. For example:
  • the column 2401 represents the values of the variable "Id", representative of the IDs of the students;
  • the column 2402 represents the values of the variable "Age", representative of the ages of the students;
  • the column 2403 represents the values of the variable "Sexe" (French for "Gender"), representative of the genders of the students, with two possible modalities:
  • the column 2404 represents the values of the variable "NiveauDEtude", representative of the grades of the students (in French, “Niveau d'Etudes" - literally “Level of Studies”), with three possible modalities:
  • the column 2405 represents the values of the variable "UFR", representative of the faculties of the students (in French, "UFR" - for "Unité de Formation et de Recherche" - literally "Training and Research Unit"), with three possible modalities:
  • "SEGMI" (French for "Faculty of Mathematics and Economy" - "SEGMI" means "Sciences Economiques, Gestion, Mathematiques et Informatique", literally "Economic Sciences, Management, Mathematics and Computing Science");
  • the column 2406 represents the values of the variable "Redoublement", representative of whether the students have already repeated a school year at least once (in French, "Redoublement" means "Repeating a year"), with two modalities:
  • the column 2407 represents the values of the variable "MentionBac", representative of the grade obtained at the French "baccalaureat" (equivalent of A level grade - in French "Mention Bac"), with five possible modalities:
  • the column 2408 represents the values of the variable "Copier", representative of whether students copy off during exams (in French “Copier” means “Copy off”);
  • the column 2409 represents the values of the variable "Communiquer", representative of whether the students communicate during exams (in French “Communiquer” means “Communicate”);
  • the column 2410 represents the values of the variable "EchangeBrouillon", representative of whether the students exchange drafts during exams (in French “EchangeBrouillon” means "Exchange Draft”);
  • the column 2411 represents the values of the variable "Antiseche", representative of whether the students use cheat sheets (in French "Antiseche" means "Cheat sheet");
  • the column 2412 represents the values of the variable "SMS", representative of whether the students use text messages during exams (in French “SMS” means "text message”);
  • the column 2413 represents the values of the variable "CoursGenoux", representative of whether the students hold lessons on the knees during exams (in French "CoursGenoux" means "Lessons on Knees");
  • the column 2414 represents the values of the variable "GarderCopie", representative of whether the students keep their examination test (in French “GarderCopie” means “Keep Copy”);
  • the column 2415 represents the values of the variable "PreparerSalle", representative of whether the students prepare the examination rooms during exams (in French “Preparer Salle” means "Prepare Room”);
  • the column 2416 represents the values of the variable "VolerSujet", representative of whether the students steal exam subjects (in French "VolerSujet" means "Steal subject").
  • Each cell represents the value of a variable for an object.
  • the cell 251 indicates that the value of the variable "Age" for the student having the ID 101 is 21 (i.e. the student is aged 21), and the cell 252 indicates that the variable "NiveauDEtude" for the student having the ID 107 has the value "L3", thereby indicating that the student is in the third year of a Bachelor's degree.
  • the values can thus be extracted from the dataset to be displayed to the user, for example by loading a file.
  • the user can also enter values in the cells to modify one or more values, and save the modified dataset in a file.
  • the figure 2 displays only the first lines and the first columns of the exemplary dataset.
  • the interface 200 comprises sliders, not represented in figure 2, to navigate in the raw representation, see and modify values of the variables for the objects in the dataset.
  • the interface 200 is provided by means of non-limitative example only, and other interfaces may be used in various embodiments of the invention.
  • the invention is applicable to other interfaces with other kinds of raw representations of values of a dataset.
  • Figure 3 displays an example of an automatic generation and display of statistical parameters in a number of embodiments of the invention.
  • the example of figure 3 is based on the exemplary interface 200 of figure 2.
  • the example of figure 3 applies to the same dataset, objects, variables and values as figure 2. Therefore the columns and lines of the raw representation 220 are identical between figure 2 and figure 3.
  • the techniques disclosed in figure 3 can be adapted to any embodiment and interface of the invention, and can be applied to other datasets.
  • the invention advantageously solves this issue by calculating and displaying statistical parameters that depend upon the type of a variable, and, when the user clicks on a statistical parameter, selecting and/or highlighting the corresponding objects and/or values. Therefore, the user is able to intuitively select unusual or abnormal values.
  • the types of one or more variables are automatically defined based on the values in the dataset.
  • the adaptation 143 to determine the synthetic representation types based on said type of variable comprises an adaptation to determine one or more statistical parameters based on said type of variable;
  • the adaptation 144 to generate the synthetic representation of values comprises an adaptation to calculate one or more statistical parameters for said variable.
  • synthetic representations 3401, 3402, 3403, 3404, 3405, 3406, 3407, 3408 are generated and displayed.
  • These representations are arrays comprising one or more statistical parameters depending on the type of variable. In this example:
  • for each integer or numeric variable, a synthetic representation displays the minimum, 1st quartile, median, 3rd quartile, maximum, average and standard deviation values. It is for example the case of synthetic representations 3401, 3402 for variables "Id" and "Age" respectively;
  • for each binary, nominal or ordered variable, a synthetic representation displays a list of the modalities of the variable, and the number of occurrences of each modality. It is for example the case of the synthetic representations 3403, 3404, 3405, 3406, 3407, 3408 for the variables "Sexe", "NiveauDEtude",
  • these statistical parameters are chosen by means of non-limitative example only. In some embodiments of the invention, other statistical parameters may be used, provided that they are selected based on the type of variable. For example, a percentage of occurrences of each modality may be used instead of the number of occurrences for binary, nominal or ordered variables.
  • the variables can take values representative of an undefined or inapplicable value.
  • a "NA” or “N/A” modality can stand for "Not Applicable”.
  • the number of undefined values can be indicated for each variable. For example, it is indicated in 3502 that the variable "Age" has 10 undefined or empty values.
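  • by means of non-limitative example only, the calculation of statistical parameters depending on the type of variable can be sketched as follows in Python. The function name `summarize`, the set of strings treated as undefined and the choice of the "inclusive" quartile method of the standard `statistics` module (Python 3.8 or later) are assumptions made for illustration; other quartile conventions would give slightly different values:

```python
import statistics
from collections import Counter

# Assumed spellings of an undefined or inapplicable value.
MISSING = {"", "NA", "N/A"}


def summarize(values, variable_type):
    """Compute the statistical parameters that suit the type of the variable."""
    present = [v for v in values if v not in MISSING]
    summary = {"undefined": len(values) - len(present)}
    if variable_type in ("integer", "numeric"):
        numbers = [float(v) for v in present]
        q1, median, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
        summary.update(
            minimum=min(numbers), q1=q1, median=median, q3=q3,
            maximum=max(numbers),
            average=statistics.mean(numbers),
            std_dev=statistics.stdev(numbers),
        )
    else:
        # Binary, nominal or ordered variable: occurrences of each modality.
        summary["occurrences"] = dict(Counter(present))
    return summary
```

  • undefined values are counted separately and excluded from the calculation, so that they do not distort the average or appear as a modality of their own.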
  • the synthetic representation of statistical parameters allows the user to have a quick understanding of the distribution of values among the variables, and detect abnormal or unusual values. For example, the user can easily detect that the maximum value 34021 of the variable "Age" is 41, which is far above the median value 34022, or even the third quartile 34023 of the variable "Age" (respectively 21 and 22).
  • the adaptation 146 to receive from said one or more input interfaces the selection by the user of the element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a statistical parameter
  • the adaptation 147 to perform the selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects based on said statistical parameter and an adaptation to highlight values of said variable for said one or more objects in the raw representation.
  • the interface 200 allows this operation by letting the user click on the maximum value 34021.
  • the object whose value of the variable "Age” is 41 is selected and displayed.
  • the user can then modify the value "41" to "21".
  • the user may click on the 3rd quartile 34023 to select all the students whose value of "Age" is in the 3rd quartile, click on the number 34041 of occurrences of the modality "M2" of the variable "NiveauDEtude" to select all the occurrences of this modality, etc.
  • the values that the user wishes to select can be selected and/or highlighted.
  • the object for which the variable has the desired value can also be placed in the top lines of the raw representation, in order to be easily identified by the user.
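  • by means of non-limitative example only, the selection of objects from a statistical parameter, and their placement in the top lines of the raw representation, can be sketched as follows in Python. The function names and the three sample students are assumptions made for illustration:

```python
def rows_matching(rows, variable, predicate):
    """Indices of the objects whose value of the variable satisfies the predicate."""
    return [i for i, row in enumerate(rows) if predicate(row[variable])]


def bring_to_top(rows, selected):
    """Place the selected objects in the top lines, keeping the other lines in order."""
    chosen = set(selected)
    top = [rows[i] for i in selected]
    rest = [row for i, row in enumerate(rows) if i not in chosen]
    return top + rest


# Hypothetical extract of the dataset.
rows = [
    {"Id": 101, "Age": 21},
    {"Id": 102, "Age": 41},
    {"Id": 103, "Age": 22},
]
maximum = max(row["Age"] for row in rows)
# A click on the maximum selects the objects whose value equals the maximum.
selected = rows_matching(rows, "Age", lambda age: age == maximum)
reordered = bring_to_top(rows, selected)
```

  • the same `rows_matching` helper could serve a click on a quartile or on a number of occurrences, by passing a different predicate.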
  • Figures 4a to 4d display an example of a drawing of a graph representing values of a variable, detection and correction of inconsistent data using an interaction of the user with the graph.
  • This example provides another option for allowing the user to detect and correct intuitively and easily unusual or abnormal values.
  • the values of a variable can be represented in the form of a graph, and the user can select values by clicking on elements of the graph.
  • Figure 4a displays an example of synthetic representation of values using graphs, which is generated in this example when the user clicks on the fifth button of the "Windows" section 214.
  • the synthetic representation type is a graph, wherein each section of the graph represents the number of occurrences of one or more values of the variable for the plurality of objects.
  • the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine a type of graph based on the types of variable;
  • the adaptation 144 to generate the synthetic representation of values comprises an adaptation to calculate a size of each section of the graph.
  • graphs are chosen by means of example only. In some embodiments of the invention, other types of graphs may be used, provided that they are selected based on the type of variable.
  • a circle graph may be used instead of a bar graph for binary, nominal or ordered variables.
  • graphs that simply indicate a number of occurrences such as bar or circle graphs, may be used for variables that do not trigger any kind of order between the values, for example binary or nominal variables, while histograms may be used for variables that imply an order in the values of variables, such as integer, numeric or ordered variables.
  • the sizes of the bars of the graphs are representative of the number of occurrences of values, or the number of occurrences of values within a range. In the case of histograms, the positions of the bars are in addition representative of the values themselves.
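  • by means of non-limitative example only, the determination of a type of graph based on the type of variable can be sketched as follows in Python. The function name `choose_graph_type`, the `prefer_circle` parameter and the type labels are assumptions made for illustration:

```python
def choose_graph_type(variable_type, prefer_circle=False):
    """Pick a type of graph that suits the type of the variable."""
    # No order between the values: a graph that only shows occurrences.
    if variable_type in ("binary", "nominal"):
        return "circle" if prefer_circle else "bar"
    # The values imply an order: bar positions carry meaning.
    if variable_type in ("integer", "numeric", "ordered"):
        return "histogram"
    raise ValueError(f"unknown variable type: {variable_type!r}")
```

  • an unknown type raises an error rather than falling back silently to a default graph, so that a missing case is noticed when a new type of variable is introduced.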
  • the graphical representation of values advantageously allows the user to have a quick understanding of the distribution of values among the variables, and detect abnormal or unusual values.
  • the exemplary dataset of figures 4a to 4d is identical to the exemplary dataset of figure 3.
  • the abnormal value "41" of the variable "Age" can be immediately identified by the small and isolated bar 44021a.
  • the user can click on the graph in order to enlarge it: the graph 4402b is an enlarged version of the graph 4402a, and further emphasizes the isolation of bar 44021b corresponding to the value "41" of the variable "Age".
  • the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a section of the graph;
  • the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects whose value of the variable is represented by the section selected by the user.
  • the interface 200 allows this operation by letting the user click on the isolated bar 44021c.
  • the student whose value of the variable "Age" is 41 is selected and displayed: the corresponding line 431 is moved to the top of the raw representation, and highlighted.
  • the abnormal value "41" is thus immediately apparent 4311c.
  • the user can modify the value "41" and replace it by the value "21" 4311d, by simply clicking on the value and entering the new value.
  • the graph 4402d is immediately updated, while the modified value remains highlighted 44021d. The user can thus ensure that the value that he/she entered is a correct one, and that no other abnormal values are present.
  • Figures 5a to 5h display an example of association and modification of modalities of a variable in a number of embodiments of the invention.
  • Figure 5a displays an example of synthetic representation allowing a detection by the user of such inconsistencies within the dataset.
  • Figure 5a also displays the interface 200 for managing the dataset displayed in figures 2 and 3: the variables "Id", "Age", "Sexe", "UFR", "Redoublement", "MentionBac", "Copier", "Communiquer", "EchangeBrouillon" are respectively displayed in columns 2401, 2402, 2403, 2405, 2406, 2407, 2408, 2409 and 2410.
  • the variable "NiveauDEtude" is still represented in column 2404, but the column has been moved between the columns 2407 and 2408.
  • a variable "Taille" (French for "Height") has been added in column 5420, which represents the heights of the students, in meters.
  • the variable "Taille” is classified as a numeric variable, because all values belonging to this variable are non-integer numbers.
  • the interface 200 displays 560 the type of each variable. For example, it displays 5601 that the variable “Id” is an integer variable, it displays 5620 that the variable “Taille” is a numeric variable and it displays 5605 that the variable "UFR" is a nominal variable.
  • the interface 200 displays synthetic representations for each categorical variable.
  • the synthetic representation type is a list of modalities of the variable, if the variable is a categorical variable (for example, a nominal, ordered or binary variable).
  • the variable "Age" is considered as a nominal variable, because at least one student used letters to enter his/her age.
  • An integer or numeric variable can take a large number of different values.
  • the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable
  • the adaptation 144 to generate the synthetic representation of values comprises an adaptation to represent, if the variable is a categorical variable, modalities of the variable among the plurality of objects.
  • the interface thus displays lists of modalities 5703, 5705, 5706, 5707, 5704, 5708, 5709 and 5710 for the variables “Sexe”, “UFR”, “Redoublement”, “MentionBac”, “NiveauDEtude”, “Copier”, “Communiquer”, and “EchangeBrouillon” respectively. Meanwhile, no list of values is displayed for variables “Id”, “Age” and “Taille”, which are integer or numeric variables.
  • the modalities of the variable are represented in alphabetical order. This allows the user to quickly identify values that have been entered using separate words for the same notion. For example, the user can easily identify that the variable "UFR" has a modality "Sjap" 57051a and a modality "SJAP" 57052a, and that these modalities should be associated. The user can also easily identify incorrect inputs such as the modality "piano" 57053a, which does not correspond to any faculty.
  • the modality "economie” 57056a (French word for economics) also does not correspond to a name of a faculty, but should be associated to the modality "SEGMI” 57057a (as discussed before, the "SEGMI” is the faculty of economics).
  • the user may thus wish to remove the value 57053a that does not correspond to a possible modality, and associate the values 57051a and 57052a that should correspond to the same modality.
  • the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces an association between a first and a second modality of the variable;
  • the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having the first modality as value of the variable, and assign the second modality as value of the variable for said one or more objects.
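  • by means of non-limitative example only, the association between a first and a second modality can be sketched as follows in Python. The function name `associate_modalities` and the sample values of "UFR" are assumptions made for illustration:

```python
def associate_modalities(values, first, second):
    """Select the objects having the first modality and assign them the second."""
    selected = [i for i, value in enumerate(values) if value == first]
    updated = list(values)  # work on a copy, the input list is left untouched
    for i in selected:
        updated[i] = second
    return selected, updated


ufr = ["SJAP", "Sjap", "SEGMI", "economie", "Sjap"]  # hypothetical values
selected, updated = associate_modalities(ufr, "Sjap", "SJAP")
```

  • returning the indices of the selected objects together with the updated values lets an interface highlight the cells that are affected by the association.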
  • the user can select the modality "piano" 57053b, then provide an instruction to delete this modality.
  • the information to delete the modality can be provided for example by clicking on a specific button (for example the button 'suppr' or 'delete'). Every suitable option is possible to indicate that the modality should be deleted.
  • once the instruction to delete the modality is received, the corresponding occurrences (in this case, the unique occurrence) are highlighted 57053c.
  • a validation button 57054 and a cancel button 57055 appear. The changes are not performed until the user clicks on the validation button 57054. At any moment, the user can click on the cancel button 57055 to cancel the changes of values.
  • Figures 5d, 5e and 5f display the association by the user of the modality "economie” to the modality "SEGMI".
  • the user drags and drops the word “economie” to the word “SEGMI” 57056d, 57056e, 57056f.
  • the word “economie” is placed on the cell “SEGMI”
  • the cell “SEGMI” is highlighted 57057e.
  • the modality "economie" 57056f is placed under "SEGMI" 57057f.
  • the occurrences 57058f of the modality "economie" are highlighted, in order to indicate the values which are subject to change upon the validation of the user.
  • the figure 5g displays the results of the same operation, to associate the modality "Sjap" 57051g with the modality "SJAP" 57052g.
  • the occurrences 57059g of the modality "Sjap” to be modified are highlighted in addition to the occurrences of the modality "piano” and the occurrences of the modality "economie” that were previously highlighted.
  • the user clicks on the validation button 57054.
  • the figure 5h displays the interface 200 once the user has clicked on the validation button 57054.
  • the modalities "piano", "Sjap" and "economie" have been successfully removed.
  • the occurrence of the modality "piano" has been replaced by a "NA" value 57053h indicating an empty value;
  • the occurrences of the "economie" modality have been replaced by occurrences of the "SEGMI" modality 57058h;
  • the occurrences of the "Sjap" modality have been replaced by occurrences of the "SJAP" modality 57059h.
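  • by means of non-limitative example only, the sequence of figures 5a to 5h, in which changes are first highlighted and only performed upon validation, can be sketched as follows in Python. The class name `PendingChanges`, its methods and the sample values are assumptions made for illustration:

```python
class PendingChanges:
    """Collect modifications of modalities and apply them only upon validation."""

    def __init__(self, values):
        self.values = list(values)
        self.replacements = {}  # modality to replace -> replacement modality

    def delete(self, modality):
        # A deleted modality is replaced by the empty "NA" value.
        self.replacements[modality] = "NA"

    def associate(self, first, second):
        self.replacements[first] = second

    def highlighted(self):
        """Indices of the values that are subject to change upon validation."""
        return [i for i, v in enumerate(self.values) if v in self.replacements]

    def validate(self):
        self.values = [self.replacements.get(v, v) for v in self.values]
        self.replacements.clear()
        return self.values

    def cancel(self):
        self.replacements.clear()


pending = PendingChanges(["SJAP", "piano", "economie", "Sjap", "SEGMI"])
pending.delete("piano")
pending.associate("economie", "SEGMI")
pending.associate("Sjap", "SJAP")
```

  • until `validate` is called, the values are left untouched and `highlighted` indicates which cells would change; `cancel` simply discards the pending replacements.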
  • This example demonstrates the ability of the invention to remove incorrect values and associate values of variables that have the same meaning.
  • This interface has the advantage of being very intuitive and convenient to use.
  • the modifications performed by the user apply automatically to all values of a variable. This advantageously allows a removal of inconsistent values from the dataset, which improves the statistics and graphics that can be built on the dataset afterwards.
  • Figures 6a to 6g display a first example of modification of a type of variable in a number of embodiments of the invention.
  • an incorrect type could be associated to a variable. This is for example the case if a variable should be binary, and the inputs provided by a large number of individuals define at least three values for the variable. The variable can then be automatically detected as a nominal variable instead of a logical one.
  • Figure 6a displays an example of synthetic representation allowing a detection by the user of an incorrect detection of a type of a variable.
  • Figure 6a also displays the interface 200 for managing the same dataset as displayed in figures 5a to 5h.
  • the synthetic representations for each variable are lists of variable types indicating the type to which each variable belongs, and, for categorical variables (i.e. binary, nominal or ordered variables), a list of occurrences.
  • the integer or numeric variable can take a large number of different values.
  • the modalities should be represented only for categorical variables, and:
  • the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable;
  • the adaptation 144 to generate the synthetic representation of values comprises an adaptation to represent a list of variable types and, if the variable is a categorical variable, modalities of the variable among the plurality of objects.
  • the column 2406 represents the values of the variable "Redoublement”.
  • the synthetic representation for the variable "Redoublement" is formed of a list of variable types 6606a indicating that the variable "Redoublement" is a Nominal variable, and a list of occurrences 6706a indicating that the variable "Redoublement" has three modalities: "Oui", "OUI" and "Non".
  • the variable "Redoublement" indicates whether students repeated a school year at least once. It therefore should be a binary variable with two modalities "Oui" (French for "Yes") and "Non" (French for "No"). In this example, some students entered "OUI" (French for "YES") using only capital letters.
  • the synthetic representation using the lists 6606a and 6706a allows the user to detect intuitively and efficiently the issue with the variable "Redoublement".
  • the user may thus wish to associate the "OUI” and “Oui” modalities within a single one, in order for the variable “Redoublement” to be correctly interpreted as a binary variable.
  • the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a desired type of the variable, which is different from said type of variable;
  • the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises:
  • an adaptation to identify and output to said display modalities of the variable that do not match the desired type of the variable;
  • an adaptation to receive from said one or more input interfaces a selection of a modality of the variable that does not match the desired type of the variable, and a replacement modality to replace said modality;
  • the user can click on "logical" 6606b in order to indicate that the variable "Redoublement" should be a logical one. It is impossible to set the type of the variable "Redoublement" to a logical type initially, because the values have three different modalities, while a logical variable should have only two. The list of modalities is thus highlighted 6706b.
  • a validation button 67064 and a cancel button 67065 also appear. At this time, the validation button 67064 is marked in red in order to indicate that it is not yet possible to validate the change of the variable type, because there are still more than two modalities of the variable "Redoublement".
  • the user drags and drops in the list the modality "OUI" 67061c, 67061d, 67061e to the modality "Oui" 67062e.
  • the modality "OUI" 67061f is placed under the modality "Oui" 67062f.
  • the occurrences of the modality "OUI", which may be modified, are highlighted 62061f, 62062f.
  • the validation button 67064 is not highlighted anymore, and the user can click on it to validate the modification.
  • the figure 6g displays the interface when the user clicks on the button 67064 to validate the changes.
  • the type of the variable "Redoublement" is set as "Logical" with two modalities "Oui" 67062g and "Non" 67063g.
  • the previous occurrences of the value "OUI" have been changed to "Oui" 67061g, 67062g.
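  • by means of non-limitative example only, the check that enables or blocks the validation of a change to a logical type can be sketched as follows in Python. The function names and the sample values of "Redoublement" are assumptions made for illustration; the sketch accepts at most two modalities, in line with the statement above that validation is blocked while more than two modalities remain:

```python
def logical_type_request(values):
    """List the modalities and tell whether a change to a logical type can be validated."""
    modalities = sorted(set(values))
    return {"modalities": modalities, "can_validate": len(modalities) <= 2}


def replace_modality(values, modality, replacement):
    """Replace every occurrence of a modality by the replacement modality."""
    return [replacement if v == modality else v for v in values]


answers = ["Oui", "Non", "OUI", "Non"]  # hypothetical values
before = logical_type_request(answers)
after = logical_type_request(replace_modality(answers, "OUI", "Oui"))
```

  • `before` corresponds to the state in which the validation button is marked in red, and `after` to the state reached once "OUI" has been dropped onto "Oui".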
  • Figures 7a to 7g display a second example of modification of a type of variable in a number of embodiments of the invention.
  • the user can click on "integer” 7602a to indicate that the variable "Age” should be an integer one.
  • a validation button 77024, and a cancel button 77025 appear, in order to validate or cancel the modification.
  • the button 77024 is initially highlighted in red, in order to indicate that it is not possible to validate the modification, because the values of the variable "Age" are not all integer values yet.
  • the user can click on the value "Vingt” 72022d, and a pop-up 72024d appears with possible corrections.
  • the user can choose between the empty "NA” value, and manually entering a new value.
  • the user manually enters the value "20" 72024e.
  • the figure 7g displays the output of the modification when the user validates the changes.
  • the type of the "Age" variable is marked as integer 7602g; the previously incorrect values are set to the integer value "20" (for example the value 72021g for the student having the Id 6) and the lines are re-ordered in the raw representation.
  • the representation of the values of the variable "Age" can be changed in the memory of the device. For example, if the values were initially stored in the form of strings of characters, they can be cast to integers. This saves memory space and allows more efficient management of the values of the variable.
  • the method of modification displayed in figures 7a to 7g is provided by way of example only of modifications of the values of a variable in order to assign an integer or numeric type to the variable.
  • This method can be adapted in many different ways to various interfaces. For example, instead of using a pop-up window, the user may directly enter corrected values in the cells that contain nominal values, and the changes can be propagated to all values of the variable that have the same modality.
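
The type change of figures 7a to 7g can be illustrated by a minimal Python sketch. It assumes that the values of the variable are held as strings, with `None` standing for the empty "NA" value. The helper names `blocking_values` and `cast_to_integer` are hypothetical and do not come from the description.

```python
def is_integer_text(text):
    """Return True if the string can be read as an integer (e.g. "20")."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def blocking_values(values):
    """List the distinct non-empty values that prevent an integer cast."""
    seen = []
    for v in values:
        if v is not None and not is_integer_text(v) and v not in seen:
            seen.append(v)
    return seen


def cast_to_integer(values, corrections):
    """Apply the user's corrections, then cast; None stands for "NA"."""
    fixed = [corrections.get(v, v) if v is not None else None for v in values]
    remaining = blocking_values(fixed)
    if remaining:
        # Plays the role of the red validation button: the cast is refused.
        raise ValueError(f"values still not integers: {remaining}")
    return [None if v is None else int(v) for v in fixed]


ages = ["18", "Vingt", "19", None, "Vingt"]
print(blocking_values(ages))                    # values the user must correct
print(cast_to_integer(ages, {"Vingt": "20"}))   # correction propagated to all rows
```

In this sketch a single correction is propagated to every occurrence of the same modality, which corresponds to the propagation described above.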
  • Another use of the interface is to allow the user to re-order ordered values.
  • the order of the modalities 7707g of the variable "MentionBac” is incorrect: they should rank in the following order: “Rattrapage”; “Passable”; “Assez bien”; “Bien”; “Tres bien”; “Felicitations”.
  • this issue can be detected easily by a user by viewing the graph 4407. The user can correct this issue by dragging and dropping modalities of the list 7707g.
  • Figures 8a to 8d display an example of filtering of values based on two variables in a number of embodiments of the invention.
  • Another functionality of spreadsheet applications is the filtering of the objects to display based on the values of their variables.
  • a user of the interface may wish to view only students of a certain age, in a given faculty, etc...
  • the filtering can be performed based on the values of one or more variables. This allows inspecting values only on a desired subset of objects. Graphs can be drawn only for a desired subset of values.
  • a processing logic of a device of the invention is configured to generate a synthetic representation type that is one of a list of modalities, a list of intervals, or a list of modalities and intervals. More specifically:
  • the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to set the representation type as a list of modalities if the variable is of a categorical type, and one of a list of intervals, or a list of modalities and intervals if the variable is of an integer or numeric type;
  • said adaptation 144 to generate the synthetic representation of values comprises an adaptation to generate a list of modalities, intervals, or combination thereof describing all the values of the variable for the plurality of objects, depending on the synthetic representation type;
  • the processing logic 140 further comprises an adaptation to filter the values of the raw representation in order to display only values corresponding to an element of said list of modalities, intervals, or combination thereof selected by the user.
  • Figure 8a displays an interface to filter values according to the invention.
  • a synthetic representation is displayed for each variable in order to allow the user to select objects.
  • the synthetic representation for a categorical variable is a list of all the modalities of the variable.
  • the synthetic representation of the variable "Sexe” is a list 8603a of the two modalities “Homme” and "Femme”.
  • the synthetic representation for integer or numerical variables is a mix of values and intervals. This allows limiting the size of the list for variables that can take a large number of different values.
  • the synthetic representation 8602a comprises:
  • the synthetic representation 8602a encompasses all values of the variable age, and allows the user to select easily a large number of values. Meanwhile, the combination of values and intervals allows the synthetic representation 8602a to remain compact.
  • This synthetic representation is provided by means of example only, and other rules of definition of the values and intervals could be used for integer or numerical variables. If the user wishes to have the option to select separately all the possible values of the variable, he/she can right-click on the name of the variable "Age" on top of the synthetic representation 8602a.
  • Figure 8b discloses a selection of objects by a user.
  • the variable “NiveauDEtude” has 5 modalities: “L1 “, “L2”, “L3”, “M1 " and “M2”.
  • the objects are filtered.
  • as shown in the columns 8400b, only a part of the objects is displayed in the raw representation.
  • the objects shown are those whose value of the variable "NiveauDEtude" is "M1". Therefore, the selection allows the user to view only the data relating to the students studying for an "M1" grade.
  • the selection could also be performed for a range of values, if this option is available.
  • Another option is to let the user enter a formula that defines the values he/she wishes to select. For example, the user may enter simple formulas to select a plurality of values, a combination of values and intervals, etc...
  • in figure 8c, the user modifies his/her selection by also clicking on the modality "M2" 8604c. As shown in the columns 8400c and 8404c, a larger number of objects are thus selected and represented, which corresponds to all the students that are studying for an "M1" or "M2" grade.
  • in figure 8d, the user further performs a filtering on the variable "UFR", which has 3 modalities: "Droit" (French for law studies), "economies" (French for economics studies), and "Sport". The user clicks on "Sport" 8605d.
  • the filtering is performed on two variables at the same time. As shown in columns 8400d, 8404d and 8405d, the raw representation represents the students that are studying in either "M1" or "M2", and are studying sports.
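
The filtering on two variables shown in figures 8b to 8d can be sketched as follows. This is an illustrative Python sketch only: the function name `filter_objects`, the dictionary-based row format and the sample students are assumptions made for the example. The modalities selected for one variable are combined with a logical OR, and the variables are combined with a logical AND.

```python
def filter_objects(rows, selection):
    """Keep the rows matching a selected modality for every filtered variable.

    `selection` maps a variable name to the set of modalities the user clicked.
    """
    return [
        row for row in rows
        if all(row[var] in allowed for var, allowed in selection.items())
    ]


# Made-up sample data, for illustration only.
students = [
    {"Id": 1, "NiveauDEtude": "L1", "UFR": "Sport"},
    {"Id": 2, "NiveauDEtude": "M1", "UFR": "Droit"},
    {"Id": 3, "NiveauDEtude": "M2", "UFR": "Sport"},
    {"Id": 4, "NiveauDEtude": "M1", "UFR": "Sport"},
]

# State of the interface in figure 8d: "M1" and "M2" clicked, then "Sport".
selection = {"NiveauDEtude": {"M1", "M2"}, "UFR": {"Sport"}}
for row in filter_objects(students, selection):
    print(row["Id"], row["NiveauDEtude"], row["UFR"])
```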
  • Figures 9a to 9e display an example of highlighting values of interest depending on the type of each variable in a number of embodiments of the invention.
  • One objective of spreadsheet applications is to highlight the relevant information to the user.
  • Some existing applications allow the user to highlight values by associating the output of a formula to a color. For example, a user may define rules such as "highlight values below 0 in red”. However, these rules are often difficult to use for the user, and the prior art solutions lack a simple and efficient way for the user to highlight values he/she may be interested in.
  • a device of the invention is configured to use a synthetic representation type that consists in highlighting values in the raw representation. Furthermore:
  • the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more types of values to highlight based on said variable type;
  • the adaptation 144 to generate the synthetic representation of values comprises an adaptation to select values of the variable where to superimpose colors in the raw representation based on said types of values to highlight, and to superimpose colors on said selected values.
  • Figures 9a to 9e provide some examples of highlights.
  • the binary variables for example, "Sexe” and “Redoublement”
  • the numeric variables for example, "Taille”
  • the nominal variables for example "NiveauDEtude", "UFR” or "MentionBac”
  • the variable "Age” is highlighted in blue. It is indeed classified as a nominal variable because at least one student entered his/her age using letters.
  • the interface highlights the cells comprising empty values, such as for example the cells 92041 b, 92081 b or 92082b.
  • the invention allows the user to easily select the values he/she is interested in. In order to do so:
  • the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a value highlighted in a color in the raw representation;
  • the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having a value of the variable highlighted in the color.
  • the user can click an empty value highlighted in orange for the variable "Copier” 92081 b in order to indicate that he/she wishes to select and/or modify the empty values of the variable "Copier".
  • the user can click on the button 2141 in order to select all the students whose value of "Copier" is empty. If the user clicks on the button 2142, the lines corresponding to the selected students are furthermore moved to the top of the raw representation. If the user clicks on the button 2143, the lines corresponding to the selected students are moved to the top of the raw representation, and the other lines are hidden.
  • the user can click on one of the values highlighted in blue 92021 c, in order to indicate that he/she wishes to select and/or modify these values.
  • the user can click on the button 2141 in order to select all the students whose value of "Age" is 18. If the user clicks on the button 2142, the lines corresponding to the selected students are furthermore moved to the top of the raw representation. If the user clicks on the button 2143, the lines corresponding to the selected students are moved to the top of the raw representation, and the other lines are hidden.
  • the very low values of integer or numerical variables are highlighted in light blue, and the very high values are highlighted in light red.
  • the limits between very low/normal/very high values can be set for example at 2 times the standard deviation of the values of the variables below or above the average value.
  • the values that are more than two standard deviations below the average can be highlighted in light blue. It is for example the case of the value 92021d for the variable "Taille".
  • the values that are more than two standard deviations above the average can be highlighted in light red. It is for example the case of the values 92022d of the variable "Taille".
  • the user can also view a combination of different types of highlights at the same time.
  • the user clicks on the buttons 2131, 2132, 2133 and 2134 at the same time.
  • All the cells that were identified in figures 9b, 9c and 9d are highlighted at the same time: the empty cells, the cells with minimum values in blue, the cells with maximum values in red, the cells with high values in light red, and the cells with low values in light blue.
  • a cell that could be represented as either a maximum or a high value is highlighted in red, such as for example the cell 92022e, while a cell that could be represented as either a minimum or a low value is highlighted in blue, such as for example the cell 92021e.
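
The highlighting rules of figures 9b to 9e can be sketched in Python as follows. This is a sketch under stated assumptions: the function name `classify_highlights` and its labels are hypothetical, the population standard deviation is used because the description does not say which estimator applies, and the factor `k=2.0` reflects the example limit of two standard deviations. Minimum and maximum labels take precedence over low and high labels, as in figure 9e.

```python
from statistics import mean, pstdev


def classify_highlights(values, k=2.0):
    """Return one label per value: 'empty', 'min', 'max', 'low', 'high' or None."""
    present = [v for v in values if v is not None]
    if not present:
        return ["empty"] * len(values)
    avg, sd = mean(present), pstdev(present)
    lowest, highest = min(present), max(present)
    labels = []
    for v in values:
        if v is None:
            labels.append("empty")       # empty cell
        elif v == highest:
            labels.append("max")         # red, wins over 'high'
        elif v == lowest:
            labels.append("min")         # blue, wins over 'low'
        elif v > avg + k * sd:
            labels.append("high")        # light red
        elif v < avg - k * sd:
            labels.append("low")         # light blue
        else:
            labels.append(None)          # no highlight
    return labels


# Made-up heights in centimetres, for illustration only.
tailles = [170] * 20 + [100, 101, 240, 239, None]
print(classify_highlights(tailles)[-5:])
```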
  • Figure 10 displays an example of a method in a number of embodiments of the invention.
  • the method 1000 comprises a first step 1010 of accessing one or more memories storing a dataset comprising values of a set of variables for a plurality of objects.
  • the method 1000 further comprises a second step 1020 of obtaining a type of a variable in the set of variables.
  • the method 1000 further comprises a third step 1030 of generating a raw representation of values of said variable for said plurality of objects.
  • the method 1000 further comprises a fourth step 1040 of determining a synthetic representation type based on said type of variable.
  • the method 1000 further comprises a fifth step 1050 of generating a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects.
  • the method 1000 further comprises a sixth step 1060 of displaying said raw representation of values and said synthetic representation of values.
  • the method 1000 further comprises a seventh step 1070 of receiving a selection by a user of an element of said synthetic representation.
  • the method 1000 further comprises an eighth step 1080 of selecting one or more objects corresponding to said element.
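
The eight steps of the method 1000 can be chained as in the following Python sketch. It is a simplified illustration, not the claimed implementation: the function name `run_method`, the two variable types, the two synthetic representation types and the `pick` callback (which stands in for the user's click) are all assumptions made for the example.

```python
from collections import Counter


def run_method(dataset, variable, pick):
    """Walk through steps 1010 to 1080 for one variable.

    `dataset` is a list of dicts (one per object); `pick` receives the
    synthetic elements and returns the one chosen by the user.
    """
    values = [obj[variable] for obj in dataset]               # 1010: access the dataset
    is_numeric = all(isinstance(v, (int, float)) for v in values)
    var_type = "numeric" if is_numeric else "nominal"         # 1020: obtain the type
    raw = list(enumerate(values))                             # 1030: raw representation
    rep_type = "statistics" if var_type == "numeric" else "modalities"  # 1040
    if rep_type == "statistics":                              # 1050: synthetic representation
        synthetic = {"min": min(values), "max": max(values)}
    else:
        synthetic = dict(Counter(values))
    print(raw, synthetic)                                     # 1060: display both
    element = pick(synthetic)                                 # 1070: user selection
    target = synthetic[element] if rep_type == "statistics" else element
    return [i for i, v in raw if v == target]                 # 1080: select the objects


data = [{"Age": 18}, {"Age": 20}, {"Age": 200}, {"Age": 19}]
print(run_method(data, "Age", lambda elements: "max"))
```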

Abstract

The invention relates to the field of data representation, for example for spreadsheet applications. The invention comprises accessing a dataset comprising values of variables for a plurality of objects. The invention further comprises generating a raw representation of the values, and a synthetic representation of the values for a variable, which depends on the type of variable. The synthetic representation of values comprises a plurality of elements, each element corresponding to a plurality of values of the variable for a plurality of objects. The invention comprises receiving from a user a selection of an element of the synthetic representation, and selecting the corresponding plurality of objects. The type of synthetic representation being dependent upon the type of variable, the invention allows the user to easily and intuitively select the objects of interest to him/her.

Description

AN INTERACTIVE INTERFACE FOR IMPROVING THE MANAGEMENT OF DATASETS
FIELD OF THE INVENTION
[001] The present invention relates to the field of database management. More specifically, it relates to the interaction between a user and a dataset through an interface.
BACKGROUND PRIOR ART
[002] A number of solutions exist to display and manage datasets, databases, or more generally organized sets of data. Spreadsheet applications, such as Microsoft Excel®, usually represent data in the form of a two dimensional array. This representation has proven to be effective for managing data describing a large number of objects, wherein the data comprises values of a number of variables for each object. This kind of application can be used for example to perform statistical computations regarding a number of variables, such as the gender, age, grade, etc... of objects. The values of variables are generally displayed in a two dimensional array, wherein each line represents an object, and each column represents a variable.
[003] In such solutions, the user can use the interface to modify values of the variables. The user can also perform a number of operations such as filtering values or drawing graphs in order to have a quick overview of the data. For example, if the dataset comprises ages of students, and if a user wishes to have an overview of the ages of the students, he/she can select the column which contains the ages and click on a button. Prior art applications let the user choose a type of graph, for example by presenting all the possible types of graph in a window, and the user can then select the desired type in order to generate a graph of the data, for example a histogram.
[004] However, the user of existing solutions faces a number of problems. One of these problems is that the user needs to manually select the type of graphics that he/she wants to display. This may represent a burden for users who are not used to manipulating statistics. Furthermore, the sources of data may be very diverse. In a large number of cases, the data is entered manually by many different persons. It is for example the case of data issued from a poll, wherein a large number of subjects are invited to provide information about themselves. In this case, many errors may occur in input data. For example, if the subjects are asked to indicate their ages, many of them could enter their ages in an incorrect format, for example by entering the age in letters instead of numbers. In this case, existing solutions are unable to correctly interpret data as numbers, and to provide a valid graph.
[005] Another functionality of spreadsheets is to allow the user to perform statistical calculations. For example, the user can input formulas to calculate statistical values (e.g. the minimum, maximum, median, average, etc.) of a series of values of a variable for all the objects. However, the existing solutions suffer from at least two major drawbacks. One of them is that the user needs to manually enter formulas to perform these calculations. These solutions thus require the user to be familiar with semantics of formulas and programming, which is not the case of many users. Moreover, in case of errors in the input (for example if one of the occurrences of the variable "age" is entered in letters instead of numbers), the existing solutions are unable to correctly interpret data and thus generate errors. Furthermore, they are unable to indicate to the user the origin of the error. In case of a large dataset, the user may thus be unable to locate the faulty value and correct it.
[006] Even in cases of semantically correct inputs, inconsistent values may be problematic. For example, if a subject inputs an age of "200" instead of "20", and the user wishes to display a histogram of ages, the value "200" will be processed as semantically correct, but will affect the histogram by generating an isolated meaningless bar in the histogram. Once the user has identified that an inconsistent value is present in the data, the existing solutions do not provide any simple and efficient mechanism to identify and correct the inconsistent values, especially when the amount of data is very large.
[007] More generally, input errors or inconsistent data generate a number of issues in spreadsheet, statistical, or other kinds of data management applications, and these applications lack the functionality of allowing the user to easily identify and correct the faulty data. It shall be noted that correcting such errors usually requires knowledge from the user of the actual meaning of data. Performing automatic corrections of data without intervention of the user is thus not a satisfactory solution, and creates a risk of incorrect modifications of data.
[008] There is therefore a need for a device, method and application that allow a user to intuitively and easily detect and correct input data errors in any kind of data defining the values of variables for a number of objects.
SUMMARY OF THE INVENTION
[009] To this effect, the invention discloses a device comprising: an access to a display; one or more input interfaces to receive commands from a user; an access to one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects; a processing logic comprising: an adaptation to obtain automatically a type of a variable in one or more sets of variables; an adaptation to generate a raw representation of values of said variable for said plurality of objects; an adaptation to determine a synthetic representation type based on said type of variable; an adaptation to generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects; an adaptation to output said raw representation of values and said synthetic representation of values to said display; an adaptation to receive from said one or more input interfaces a selection by the user of an element of said synthetic representation; an adaptation to perform a selection of all objects in said plurality of objects whose value of the variable is represented by said element.
[0010] In the course of the invention, the term "object" is similar in scope to the term "individual" in statistics. Therefore, it designates objects, as well as animals or humans or the like, whose properties can be described through variables.
[0011] As will be exemplified below, a raw representation is a representation that provides to the user values of the variables for each object. For example, the raw representation can be a two-dimensional array of a spreadsheet.
[0012] The use of type of data to present synthetic values allows the user to have an immediate overview of specific or inconsistent data. The selection of the one or more objects corresponding to this data allows the user to modify easily the values which he/she is looking for.
[0013] The invention thus allows a user to easily and intuitively detect interesting, inconsistent or incorrect values within data, and modify them.
[0014] Advantageously, the dataset comprising the values of the variable for the plurality of objects is stored in one or more files; the adaptation to obtain automatically a type of the variable comprises an adaptation to detect automatically the type of the variable, based on the values of said variable for the plurality of objects.
[0015] This allows the most appropriate type to be used for each variable, from any compressed or non-compressed source, for example a text-based source (for example, an Excel® file). Moreover, the correct typing of variables advantageously allows saving space in memory, because the most relevant representation is used for each value of each variable. A correct typing of variables also allows the use of the most relevant statistical operations on the variables. This is also simpler for the user, who does not need to manually set the type of the variable.
[0016] Advantageously, said adaptation to detect automatically the type of the variable comprises one or more adaptations to: retrieve a collection of text strings for all occurrences of the variable; if the number of dictionary words in the collection of text strings is equal to two, detect the type of the variable as a binary type; if the number of dictionary words in the collection of text strings is different from two: if each string in the collection is representative of an integer number, and if the size of each string is below a predefined threshold, detect the type of the variable as an integer type; if each string in the collection is representative of a number, and if at least one number is a non-integer number or if the size of at least one string is above the predefined threshold, detect the type of the variable as a numeric type; if the collection of text strings comprises at least one non-numeric character, detect the type of the variable as a nominal type.
[0017] This allows good detection of types of variables, and using the most efficient variable type. This also allows a reduction of the memory footprint to store variables in memory. Indeed, storing a variable having only two words in a binary type, or numeric variables using an integer or numeric type uses much less memory than storing these variables as an array of strings. Meanwhile, the detection of the variable as a nominal variable by default ensures that a type is always detected for each variable.
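
The detection rules of paragraph [0016] can be sketched in Python as follows. The sketch rests on two assumptions that are not stated in the description: "dictionary words" is read as "distinct values of the collection", and `max_digits=9` is a hypothetical value for the predefined size threshold. The function name `detect_type` is also hypothetical.

```python
def detect_type(strings, max_digits=9):
    """Detect a variable type from the text of all its occurrences."""
    distinct = set(strings)
    if len(distinct) == 2:
        return "binary"

    def is_int(s):
        try:
            int(s)
            return True
        except ValueError:
            return False

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    # Every string is a short integer: integer type.
    if all(is_int(s) and len(s) <= max_digits for s in distinct):
        return "integer"
    # Every string is a number, but one is non-integer or too long: numeric type.
    if all(is_number(s) for s in distinct):
        return "numeric"
    # Default, so that a type is always detected.
    return "nominal"


print(detect_type(["Oui", "Non", "Oui"]))
print(detect_type(["18", "19", "20"]))
print(detect_type(["1.75", "1.80", "1.62"]))
print(detect_type(["18", "Vingt", "19"]))
```

Note that Python's `float` also accepts strings such as "nan" or "inf"; a stricter numeric test may be preferable in practice.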
[0018] Advantageously, the synthetic representation type is an array comprising one or more statistical parameters; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more statistical parameters based on said type of variable; said adaptation to generate the synthetic representation of values comprises an adaptation to calculate one or more statistical parameters for said variable; said adaptation to receive from said one or more input interfaces the selection by the user of the element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a statistical parameter; said adaptation to perform the selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects based on said statistical parameter and an adaptation to highlight values of said variable for said one or more objects in the raw representation.
[0019] The automatic calculation of statistical parameters based on the type of variables allows an automatic calculation of the most relevant statistical parameters. For example, an average, median, minimum and maximum could be calculated for a numeric variable, while a percentage of occurrences of each value could be calculated for a nominal variable.
[0020] The selection by the user of a statistical parameter allows the user to intuitively select a statistical value which appears abnormal or of interest. For example, the user may select a maximum value if it is abnormally high.
[0021] The selection of the corresponding objects and the highlighting of the corresponding values allow the user to have an immediate overview of the locations of the abnormal or interesting values.
[0022] Therefore, these features allow the user to interact with data, select and modify values of interest intuitively.
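
By way of illustration only, the calculation of type-dependent statistical parameters and the selection of objects from a chosen parameter could look like the following Python sketch. The function names `statistical_parameters` and `objects_for_parameter` are hypothetical, and only the "min" and "max" parameters are made selectable in this simplified example.

```python
from collections import Counter
from statistics import mean, median


def statistical_parameters(values, var_type):
    """Compute the parameters of the synthetic array for one variable."""
    if var_type in ("integer", "numeric"):
        return {"min": min(values), "max": max(values),
                "average": mean(values), "median": median(values)}
    # Binary or nominal variable: percentage of occurrences of each value.
    counts = Counter(values)
    return {m: 100.0 * n / len(values) for m, n in counts.items()}


def objects_for_parameter(values, parameter):
    """Return the indices of the objects whose value equals the clicked 'min' or 'max'."""
    target = {"min": min, "max": max}[parameter](values)
    return [i for i, v in enumerate(values) if v == target]


ages = [18, 20, 200, 19, 18]
print(statistical_parameters(ages, "integer"))
print(objects_for_parameter(ages, "max"))   # rows to highlight in the raw representation
```

Here the abnormally high maximum "200" leads directly to the row that contains it, which the user can then correct.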
[0023] Advantageously, said synthetic representation is a graph, each section of the graph representing the number of occurrences of one or more values of the variable for the plurality of objects; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine a type of graph based on the type of variable; said adaptation to generate the synthetic representation of values comprises an adaptation to calculate a size of each section of the graph; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a section of the graph; said adaptation to perform a selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects whose value of the variable is represented by the section selected by the user.
[0024] The automatic generation of a graph based on the type of variable allows representing the most relevant type of graph for each type of variable, for example a histogram for numeric variables, a spaced bar graph for binary variables, or a circle graph for nominal variables.
[0025] The selection by the user of a section of the graph allows the user to intuitively select a section of the graph value which visually appears abnormal or of interest. For example, a user can select an isolated bar of a histogram, which visually represents a value with an unlikely high or low value, or a very small section of a circle graph, which represents a value of a nominal variable with a very low, or a unique occurrence, any of these values likely representing an input error.
[0026] The selection of the corresponding objects and the highlighting of the corresponding values allow the user to have an immediate overview of the locations of the abnormal or interesting values.
[0027] Therefore, these features allow the user to intuitively and visually select abnormal or interesting values by simply clicking on a graph.
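
A possible way to associate a graph type with each variable type, and to map a clicked histogram bar back to objects, is sketched below in Python. The names `GRAPH_FOR_TYPE` and `histogram_sections` and the fixed bar width are assumptions made for the example; the description only gives a histogram for numeric variables, so using a histogram for integer variables as well is also an assumption.

```python
GRAPH_FOR_TYPE = {
    "integer": "histogram",          # assumption: same graph as numeric
    "numeric": "histogram",
    "binary": "spaced bar graph",
    "nominal": "circle graph",
}


def histogram_sections(values, width):
    """Group object indices into bars of equal width.

    The size of a section is the number of indices it holds.
    """
    sections = {}
    for index, v in enumerate(values):
        start = (v // width) * width
        sections.setdefault((start, start + width), []).append(index)
    return sections


ages = [18, 19, 20, 200, 21, 18]
sections = histogram_sections(ages, width=10)
# An isolated bar far from the others hints at an input error.
clicked = (200, 210)
print(GRAPH_FOR_TYPE["integer"], sections[clicked])
```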
[0028] Advantageously, said synthetic representation is a list of modalities of the variable; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable; said adaptation to generate the synthetic representation of values comprises an adaptation to represent, if the variable is a categorical variable, modalities of the variable among the plurality of objects; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces an association between a first and a second modality of the variable; said adaptation to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having the first modality as value of the variable, and assign the second modality as value of the variable for said one or more objects.
[0029] The representation of the modalities of the variable allows the user to obtain an overview of the values of the variables for the plurality of objects, and detect values which are similar (for example values which differ by a single letter, singular and plurals, or by capital letters), which likely have the same meaning.
[0030] The reception of an association from the user between two modalities allows identifying the modalities that should have been the same. This avoids using automatic rules of associations of modalities that generate a high risk of incorrect modifications. Furthermore, allowing the user to associate the modalities allows obtaining an insight from the comprehension of the user of statistical data.
[0031] The automatic assignment of the second modality to the values of the variable for objects that previously had the first modality allows automatic propagation of the association made by the user to all the objects. Therefore, the occurrences of the first modality are deleted and replaced by occurrences of the second modality, which simplifies the management and understanding of statistical data.
[0032] Therefore, these features allow simplifying the management and understanding of statistical data while letting the user intuitively indicate the modalities that should be associated and deleted.
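
The propagation of an association between two modalities can be illustrated with a short Python sketch. The function name `merge_modality` is hypothetical, and the sample values are made up for the example.

```python
def merge_modality(values, first, second):
    """Replace every occurrence of `first` with `second`.

    Returns the new list of values and the indices of the modified objects.
    """
    changed = [i for i, v in enumerate(values) if v == first]
    merged = [second if v == first else v for v in values]
    return merged, changed


redoublement = ["Oui", "Non", "OUI", "Oui", "OUI"]
merged, changed = merge_modality(redoublement, "OUI", "Oui")
print(sorted(set(merged)), changed)
```

After the merge only two modalities remain, so that the variable could then be typed as a binary variable.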
[0033] Advantageously, said synthetic representation is one of a list of variable types or a combination of a list of variable types and a list of modalities of the variable; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable; said adaptation to generate the synthetic representation of values comprises an adaptation to represent a list of variable types and, if the variable is a categorical variable, modalities of the variable among the plurality of objects; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a desired type of the variable, which is different from said type of variable; said adaptation to perform a selection of one or more objects corresponding to said element comprises: an adaptation to identify and output to said display modalities of the variable that do not match the desired type of the variable; an adaptation to receive from said one or more input interfaces a selection of a modality of the variable that does not match the desired type of the variable, and a replacement modality to replace said modality; an adaptation to select one or more objects whose value of the variable is said modality that does not match the desired type of the variable, and replace said modality that does not match the desired type of the variable by said replacement modality for said one or more objects.
[0034] The display of the list of possible types of the variable allows the user to determine instantaneously if the variable has been cast to a type it should not have, and to display to the end user a proposal of the type that the variable should have.
[0035] Furthermore, this allows the user to automatically verify the modalities of the variable that prevent the relevant type of variable from being determined. Input errors can thus be detected and corrected by the user (for example if a user entered a "O" instead of a "0" for a value, the variable is detected as a nominal variable instead of a numeric variable).
[0036] The automatic replacement of the modalities that prevent the variable from being cast to the relevant type furthermore allows the variable to be determined as belonging to the desired type of variables.
[0037] These features thus allow the user to intuitively and efficiently assign the right type to each variable. Furthermore, this operation is much more intuitive and easy to perform for the user than a manual verification of all occurrences of the variable.
[0038] Advantageously, the processing logic further comprises an adaptation to set the type of the variable as the desired type of variable, and modify representations in memory of the occurrences of said variable according to said desired type of variable, when all occurrences of said variable match the desired type of variable.
[0039] These features allow casting all the occurrences of the variable in memory to a memory type which is best suited for the management of said variable. For example, when a variable is converted from a nominal to an integer variable, the occurrences of said variable can be cast in memory from a string to an integer memory type. This advantageously saves memory space and increases the speed of execution of the program, because the values of variables are expressed in memory in the most relevant form.
[0040] Advantageously, said synthetic representation type is one of a list of modalities, a list of intervals, or a list of modalities and intervals; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to set the representation type as a list of modalities if the variable is of a categorical type, and one of a list of intervals, or a list of modalities and intervals if the variable is of an integer or numeric type; said adaptation to generate the synthetic representation of values comprises an adaptation to generate a list of modalities, intervals, or combination thereof describing all the values of the variable for the plurality of objects, depending on the synthetic representation type; the processing logic further comprises an adaptation to filter the values of the raw representation in order to display only values corresponding to an element of said list of modalities, intervals, or combination thereof selected by the user.
[0041] The generation of a list of elements that encompasses all values of variable for the plurality of objects allows the user to have an overview of all possible values of the variable. Meanwhile, defining the elements of the list based on the type of the variable allows tailoring the representation of the list of elements to best represent the type of the variable. For example, a nominal variable can be represented by a list of all its modalities; an integer variable can be represented using a mix of single values and intervals; a continuous numeric variable can be represented using intervals.
[0042] These features thus allow the user to filter the variable intuitively, using the most appropriate representation of values of the variable depending on the type of said variable.
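By means of non-limitative example only, the generation of a list of modalities or intervals and the corresponding filtering can be sketched as follows in Python. The names `build_elements` and `matching_rows`, the choice of equal-width intervals and the default of four intervals are assumptions made for illustration; the description does not prescribe how intervals are computed.

```python
def build_elements(values, var_type, n_intervals=4):
    """Return the elements offered to the user for filtering.

    Categorical variables yield their modalities; other variables yield
    equal-width intervals (lower bound, upper bound) covering all values.
    """
    if var_type == "categorical":
        return sorted(set(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    edges = [lo + i * width for i in range(n_intervals)] + [hi]
    return list(zip(edges[:-1], edges[1:]))

def matching_rows(values, element):
    """Return the indices of the objects whose value belongs to the element."""
    if not isinstance(element, tuple):
        return [i for i, v in enumerate(values) if v == element]
    lo, hi = element
    top = max(values)
    # Intervals are half-open, except that the overall maximum is kept in the last one.
    return [i for i, v in enumerate(values) if lo <= v < hi or v == hi == top]

# Illustrative values only.
ages = [18, 19, 21, 22, 25, 30]
age_elements = build_elements(ages, "integer")
faculties = ["Staps", "SEGMI", "SJPA", "Staps"]
faculty_elements = build_elements(faculties, "categorical")
```

Each element covers every value exactly once, so that the list as a whole describes all the values of the variable for the plurality of objects.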
[0043] Advantageously, said synthetic representation type is one or more types of values to highlight; said adaptation to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more types of values to highlight based on said type of variable; said adaptation to generate the synthetic representation of values comprises an adaptation to select values of the variable where to superimpose colors in the raw representation based on said types of values to highlight, and to superimpose colors on said selected values; said adaptation to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a value highlighted in a color in the raw representation; said adaptation to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having a value of the variable highlighted in the color.
[0044] These features allow the user to have a direct overview of specific values depending on the type of variable. For example, if the variable is a nominal variable, empty values can be highlighted. On the other hand, if the variable is a numeric variable, particularly high or low values (for example, the minimum, the maximum, the top 5% of values, etc.) can be highlighted. Meanwhile, the user can intuitively select and modify the values he/she is interested in.
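By means of non-limitative example only, the choice of values to highlight depending on the type of variable can be sketched as follows in Python. The function name `rows_to_highlight` is hypothetical, and the sketch is limited, by assumption, to empty values for nominal variables and to the minimum and maximum for other variables.

```python
def rows_to_highlight(values, var_type):
    """Return the set of row indices whose cell would receive a color."""
    if var_type == "nominal":
        # Nominal variable: highlight empty values.
        return {i for i, v in enumerate(values) if v is None or str(v).strip() == ""}
    # Numeric or integer variable: highlight the lowest and highest values.
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    if not present:
        return set()
    lowest = min(v for _, v in present)
    highest = max(v for _, v in present)
    return {i for i, v in present if v in (lowest, highest)}

# Illustrative values only.
names = ["Ann", "", None, "Bob"]
ages = [21, 19, 200, 22]
```

The returned indices can then be used both to superimpose colors on the raw representation and to select the corresponding objects when the user clicks on a highlighted value.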
[0045] The invention also discloses a method comprising: accessing one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects; obtaining a type of a variable in the one or more sets of variables; generating a raw representation of values of said variable for said plurality of objects; determining a synthetic representation type based on said type of variable; generating a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects; displaying said raw representation of values and said synthetic representation of values; receiving a selection by a user of an element of said synthetic representation; selecting all objects in said plurality of objects whose value of the variable is represented by said element.
[0046] The invention also discloses a computer program product comprising computer code instructions configured to: access one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects; obtain a type of a variable in the one or more sets of variables; generate a raw representation of values of said variable for said plurality of objects; determine a synthetic representation type based on said type of variable; generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects; display said raw representation of values and said synthetic representation of values; receive a selection by a user of an element of said synthetic representation; select all objects in said plurality of objects whose value of the variable is represented by said element.
BRIEF DESCRIPTION OF THE DRAWINGS
[0047] The invention will be better understood and its various features and advantages will emerge from the following description of a number of exemplary embodiments provided for illustration purposes only and its appended figures in which:
- Figure 1 displays an example of a functional architecture of a device in a number of embodiments of the invention;
- Figure 2 displays an example of an interface of a device in a number of embodiments of the invention;
- Figure 3 displays an example of an automatic generation and display of statistical parameters in a number of embodiments of the invention;
- Figures 4a to 4d display an example of a drawing of a graph representing values of a variable, detection and correction of inconsistent data using an interaction of the user with the graph;
- Figures 5a to 5h display an example of association and modification of modalities of a variable in a number of embodiments of the invention;
- Figures 6a to 6g display a first example of modification of a type of variable in a number of embodiments of the invention;
- Figures 7a to 7g display a second example of modification of a type of variable in a number of embodiments of the invention;
- Figures 8a to 8d display an example of filtering of values based on two variables in a number of embodiments of the invention;
- Figures 9a to 9e display an example of highlighting values of interest depending on the types of each variable in a number of embodiments of the invention;
- Figure 10 displays an example of a method in a number of embodiments of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0048] Figure 1 displays an example of a functional architecture of a device in a number of embodiments of the invention.
[0049] The device 100 may be any kind of device with computing capabilities. For example, the device 100 may be a computer, a smartphone or a tablet, an IoT device or the like.
[0050] The device comprises an access to a display 110. The display 110 can be any kind of display allowing the user to view images from the device 100. For example, the display 110 may be any screen or the like, such as glasses, a tactile screen, a watch or a video projector. The display 110 may be embedded within the same housing as the device 100. This is for example the case if the display 110 is a screen (whether tactile or not) of a smartphone or a tablet. The display may also be embedded within a housing separated from the device 100, and the device 100 connected to the display 110. This is for example the case if the display is a separate screen, or a video projector. The device 100 can be associated to the display 110 through any kind of connection, for example through a wired or a wireless connection (e.g. Bluetooth™, Wi-Fi™, or any kind of wireless connection allowing the device 100 to send images to the display 110).
[0051] The display can also be a combination of elementary displays. For example, the display can comprise two screens side by side, and the user can use seamlessly the two screens.
[0052] The device 100 also comprises one or more input interfaces 120 to receive commands from a user. The one or more input interfaces 120 can be any kind of interface allowing the user to input commands such as a mouse, vocal command system, buttons, keys, keyboard, eyes contact or the like.
[0053] Although figure 1 displays the one or more input interfaces 120 as being a connection to a mouse and a keyboard in a separate housing, according to various embodiments of the invention, the one or more input interfaces may be either a connection to an external device, a part of the device, or a connection to the display 110. For example, the one or more input interfaces may comprise a wired or wireless connection to a mouse, a wired or wireless connection to a keyboard, a connection to an external tactile screen (which can in this case also serve as display 110), a touchpad or keyboard embedded within the housing of the device 100, a microphone to receive vocal commands, or any other kind of suitable input interface.
[0054] The device 100 further comprises an access to one or more memories 130 storing a dataset. The one or more memories 130 may be any kind of internal, external, volatile or non-volatile memories. For example, it can be formed by an internal or external hard drive, a memory in a cloud which can be accessed by the device 100, a memory shared among a plurality of devices, a memory on a server, a CD-ROM, or a combination thereof, or more generally any suitable memory or combination thereof.
[0055] The one or more memories 130 store a dataset comprising values of one or more sets of variables for a plurality of objects. The dataset can take any suitable form to store values of variables for objects. For example, the dataset can be a database, a text file, a raw or compressed file associated with a spreadsheet application (for example an Excel® file such as one with the extension ".xls" or ".xlsx") or more generally any format allowing a storage of values of variables for a plurality of objects. It should be understood that, in the course of the application, the term "object" is to be defined as a statistical object or individual. Therefore, an object can designate a human, an animal or an item, provided that the object can be characterized at least in part by values of variables.
[0056] The invention can thus be applied to a large number of datasets. For example, it can be applied to datasets of vehicles, wherein the variables may be the number of wheels, the number of doors, the license plate, the color, the brand of the vehicles, the expected consumption, the price, etc... It can also be applied to datasets of persons, wherein the variables may be the age, gender, weight, etc... These examples are provided by means of example only, and the invention can be applied to any kind of dataset.
[0057] As will be described in more detail hereinafter, the device 100 allows the user to view, modify and interact with the dataset. For example, the device 100 allows the user to view the values of the variables for each object, but also to modify these values. It also allows drawing graphs and displaying colors that allow the user to have a quick overview and understanding of the content of the dataset.
[0058] According to various embodiments of the invention, the dataset can be retrieved from a wide variety of sources. For example, the values may be entered manually by the user of the device 100, may be retrieved from external sources or databases, or may be input by a large number of persons, for example through a poll. The latter case is for example commonly used if the dataset is created from questions asked to a large number of persons, in particular when users themselves enter the values.
[0059] In certain circumstances, incorrect values of variables may be input. This case is especially frequent when the dataset is built from inputs of a large number of persons. For example, if a person aged 20 is asked to input his/her age, and if the age is to be entered in the form of a number, he/she should input "20". However, in some cases he/she may enter incorrect inputs such as "200", "twenty" or "20" (with a capital letter "O" instead of the digit "0").
[0060] As will be exemplified hereinafter, such input errors may be very problematic for data representation and statistical analysis. Meanwhile, it is a cumbersome and impractical task for the user to review all data in order to look for errors, especially in the frequent case of large datasets. Moreover, an automatic correction of errors is not desirable, because correcting such input errors requires an in-depth understanding of the meaning of the values of variables within the dataset, which is very difficult for a machine to acquire, especially if new types of objects and variables are used.
[0061] One objective of the invention is to let the user analyze and correct input data easily and intuitively.
[0062] To do so, the device 100 comprises a processing logic 140.
[0063] According to various embodiments of the invention, a processing logic may be a processor operating in accordance with software instructions, a hardware configuration of the processor, or a combination thereof. It should be understood that any or all of the functions discussed herein may be implemented in a pure hardware implementation and/or by a processor operating in accordance with software instructions, and/or a configuration of a machine learning engine or neural network. A processing logic may also be a multi-core processor executing operations in parallel, a series of processors, or a combination thereof. It should also be understood that any or all software instructions may be stored in a non-transitory computer-readable medium. The term "adaptation of a processing logic" refers to any means (for example hardware configuration, software instructions, machine learning, training or neural network, or any other adaptation means or combination thereof) of adapting a processing logic to execute operations.
[0064] In the course of the application, various embodiments of the invention will often be exemplified in relation to a single variable, that will be designated as "the variable". However, it shall be noted that the invention can be applied to a plurality of variables in parallel.
[0065] The invention relies on a management of values of variables in the one or more sets of variables. In a number of embodiments, the processing described in the application is performed on a single set of data. However, in many cases the invention can be applied concurrently on a plurality of sets of data. It is for example the case if the invention is applied on a spreadsheet comprising a plurality of tabs: each tab can define a set of variables. It can also be the case within a single tab of a spreadsheet, if variables within the tab are clearly separated in a plurality of sets. It can also be the case if the invention is applied to a plurality of files, wherein each file can define its own set of variables.
[0066] In such cases, the invention can be applied separately on each set of variables. The processing logic can comprise an adaptation, not represented in the figures, to identify the one or more sets of variables. Any suitable method can be used to identify the one or more sets of variables. For example, some file or database formats explicitly define the variables. In such cases, the variables can be simply read in a file or database. In other cases, the variables are not explicitly defined, but the presentation of data implicitly indicates what are the variables. It is for example the case of files that organize data by columns, wherein the name of the variable is written in the top cell of the columns. In such cases, the set of variables can be defined by detecting patterns representative of an organization of the dataset, and reading the names of variables in the place defined by the pattern, for example the top cell of each column. In yet other cases, the raw data can be presented to the user, who defines what are the variables, for example by clicking on the names of the variable, and selecting the values of each variable.
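By means of non-limitative example only, the case where the names of the variables are read in the top cell of each column can be sketched as follows in Python, using the standard `csv` module. The function name `read_variables`, the comma-separated layout and the sample rows are assumptions made for illustration.

```python
import csv
import io

def read_variables(text):
    """Read column-organized text and return a mapping {variable name: values}.

    The first line is assumed to hold the names of the variables, and each
    subsequent line is assumed to hold the values for one object.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {name: [row[j] for row in body] for j, name in enumerate(header)}

# Illustrative content only.
sample = "Id,Age,Sexe\n101,21,Homme\n107,19,Femme\n"
variables = read_variables(sample)
```

The values are kept as text strings at this stage, so that the type of each variable can be deduced afterwards from the strings themselves.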
[0067] The processing logic 140 comprises an adaptation 141 to obtain automatically a type of a variable in the one or more sets of variables.
[0068] Automatically obtaining the type of the variable allows obtaining the most relevant information depending on the type of variable, without requiring the user to manually define the type of variable.
[0069] The invention applies to any type of variables. Some commonly used types and subtypes of variables include the following:
- numeric variable : defines a number:
o continuous numeric variable : a numeric variable that can take any value within a range (for example, a temperature);
o discrete numeric variable : a numeric variable that can take only discrete values (for example a grade expressed with a precision of 0.1);
- integer variable : a kind of numeric variable that can take only integer values (for example, the number of wheels of a vehicle);
- binary or boolean variable : a variable that can take only two values, each value having a meaning (for example true/false, present/absent...);
- nominal variable: a variable having modalities expressed by words (for example, names of persons). The term 'word' is used in the description to designate alphanumeric strings. The strings are interpreted as words of a nominal variable if they do not appear to belong to other categories;
- ordered variable: kind of nominal variable, in which the modalities of the variable can be logically ordered (for example, an appreciation of an experience by the user can be expressed using 6 ordered modalities : "very poor", "poor", "average", "good", "excellent", "outstanding").
The variables which are not numeric can be generally referred to as "categorical variables".
[0070] In a number of embodiments of the invention, the dataset comprises values of variables for objects in one or more files. In a number of embodiments of the invention, the dataset comprises information to explicitly determine the type of data. For example, a variable may be explicitly defined as an 'Integer' variable. In such cases, the adaptation 141 can comprise an adaptation to retrieve the type of variable from the dataset.
[0071] In other embodiments of the invention, the dataset comprises values of variables without any explicit mention of the type of variable, but the type of variable can be deduced from the values in the dataset. It is for example the case if the values are stored in a file in the form of text characters, either in a raw or compressed form. This kind of file is very common for storing datasets. For example, the .xls, .xlsx or .odt files belong to these types of storage, unless they comprise words to explicitly define the types of variables.
[0072] In such cases, the adaptation 141 can comprise an adaptation to retrieve the text characters corresponding to values of said variable for the plurality of objects, and analyze these text characters to deduce the type of the variable.
[0073] For example, these adaptations may comprise one or more adaptations to:
- retrieve a collection of text strings for all occurrences of the variable, i.e., retrieve the text strings that correspond to each occurrence of the variable;
- if the number of dictionary words (i.e. the number of different text strings) in the collection of text strings is equal to two, to detect the type of the variable as a binary type. Indeed, if only two different text strings are present among all the values of the variable, it can be assumed that they are representative of a binary option. This can be performed for example by counting the number of different text strings, and setting the type of the variable as binary if there are exactly two different text strings;
- if the number of words in the collection is different from two, and if each string in the collection comprises only digits, detect the variable type as one of an integer or a numeric type;
- a number of different options can be used to disambiguate an integer or a numeric type. For example, an integer type could be selected if all numbers are integers, and a numeric type could be selected if at least one number is not an integer number. Alternatively, the size of the numbers can also be used. For example, a numeric type can be selected instead of an integer type, if at least one number has a number of digits above a predefined threshold, for example at least 10 digits;
- if the number of words in the collection is different from two, and if the text string comprises at least one non-numeric character, detect the variable type as a nominal type.
These examples of variable typing are provided by means of non-limitative example only, and other rules can be applied to detect these types of variables, or other types of variables in the invention.
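The rules listed above can be sketched as follows in Python, by means of non-limitative example only. The function name `infer_type` and the regular expressions are hypothetical. As an assumption of this sketch, an optional sign and a decimal point are accepted in numbers, and the integer and numeric types are disambiguated by checking whether all numbers are integers; the alternative based on the number of digits is not shown.

```python
import re

INTEGER = re.compile(r"[+-]?\d+")
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")

def infer_type(strings):
    """Deduce the type of a variable from the text strings of its occurrences."""
    distinct = set(strings)
    # Exactly two different strings: assumed to represent a binary option.
    if len(distinct) == 2:
        return "binary"
    # Only integer numbers: integer type.
    if all(INTEGER.fullmatch(s) for s in distinct):
        return "integer"
    # Only numbers, at least one of which is not an integer: numeric type.
    if all(NUMBER.fullmatch(s) for s in distinct):
        return "numeric"
    # At least one string with a non-numeric character: nominal type.
    return "nominal"
```

For instance, a column of ages in which one value was typed "2O" (with a capital letter "O") is detected as nominal by this sketch, which is precisely the symptom that lets the user spot the input error.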
[0074] The processing logic 140 further comprises an adaptation 142 to generate a raw representation of values of said variable for said plurality of objects.
[0075] The raw representation can be any kind of representation that provides a view of the values themselves, and allows the user to read values of the variables for the objects. For example, the raw representation can be a two dimensional array wherein each column represents a variable, each line represents an object, and each cell in a line and a column contains the value of the variable represented by the column for the object represented by the line. An example of such a representation is provided in figure 2. However this kind of representation is provided by means of non-limitative example only, and any kind of representation that allows the user to read each value of the variable can be used as raw representation.
[0076] The processing logic 140 further comprises an adaptation 143 to determine a synthetic representation type based on said type of variable.
[0077] A synthetic representation is a representation that is built using an analysis and/or processing of values, in order to display to the user information that is meaningful at first sight. The synthetic representation may be of different types. For example, the synthetic representation types comprise different types of graphs (histogram, circular graph...) or representations of statistical values (average, maximum...) wherein the statistical values are chosen depending on the type of variables. Examples of synthetic representations will be provided hereinafter. Determining the type of synthetic representation depending on the type of variable allows providing the user with the synthesis of information that is the most relevant for that type of variable.
[0078] The processing logic 140 further comprises an adaptation 144 to generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type.
[0079] The synthetic representation comprises a plurality of elements and each element represents one or more values for the plurality of objects. For example, if the synthetic representation type is a histogram, each bar of the histogram represents one or more values of the variable, and the height of the bar is representative of the number of occurrences of the one or more values. The generation of the synthetic representation can be performed for example by calculating the synthetic values which characterize the synthetic representation (for example, the heights of the bars of a histogram, or an average of the values of the variable).
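By means of non-limitative example only, the determination of a synthetic representation type and the calculation of its synthetic values can be sketched as follows in Python. The names `representation_type` and `synthetic_representation`, and the particular mapping (statistical values for integer and numeric variables, a histogram otherwise), are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def representation_type(var_type):
    """Choose a synthetic representation type from the type of the variable."""
    return "statistics" if var_type in ("integer", "numeric") else "histogram"

def synthetic_representation(values, var_type):
    """Calculate the synthetic values that characterize the representation."""
    if representation_type(var_type) == "histogram":
        # One bar per modality; the height is the number of occurrences.
        return dict(Counter(values))
    return {"average": mean(values), "minimum": min(values), "maximum": max(values)}

# Illustrative values only.
levels = ["L2", "L3", "L3", "M1"]
ages = [19, 21, 20]
```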
[0080] Since the representation type is selected depending on the type of variable, the representation highlights specific or abnormal values.
[0081] The processing logic 140 further comprises an adaptation 145 to output the raw representation of values and said synthetic representation of values to the display 110.
[0082] The adaptation 145 can thus send the raw representation and the synthetic representation of values to the display 110 in order for the representations to be seen by the user. The representations can be sent in any kind of communication which is used between the device 100 and the display 110, for example by sending images to the display.
[0083] The processing logic 140 can further comprise an adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said synthetic representation.
[0084] The selection of the element the user is interested in can be performed in any way. For example, the user can move a cursor on the display, and click where the element is displayed, or navigate among elements using a keyboard, and press a button when the relevant element is selected.
[0085] The processing logic 140 further comprises an adaptation 147 to perform a selection of one or more objects corresponding to said element.
[0086] The one or more objects corresponding to said elements are the one or more objects whose values of the variable have been used to generate the element. The one or more objects can thus be selected automatically on the basis of the element selected by the user.
[0087] As highlighted above, the synthetic representation highlights values that are of interest for the user, for example abnormal or incorrect values, because the synthetic representation type is based on the type of the variable. The user can thus intuitively select an element of the synthetic representation that attracts his/her attention. The corresponding objects are then selected. This allows the user to select intuitively and efficiently the one or more objects that he/she may wish to verify or modify, without having to check all objects of the dataset, which is often a cumbersome and impracticable task.
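By means of non-limitative example only, the selection of the objects corresponding to an element can be sketched as follows in Python. The names `index_elements` and `select_objects` are hypothetical, and each element is assumed, for simplicity, to be one bar of a histogram representing a single value.

```python
from collections import defaultdict

def index_elements(values):
    """Map each element to the row indices of the objects used to generate it."""
    elements = defaultdict(list)
    for row, value in enumerate(values):
        elements[value].append(row)
    return dict(elements)

def select_objects(elements, selected_element):
    """Return the objects corresponding to the element selected by the user."""
    return elements.get(selected_element, [])

# Illustrative values only: "200" stands out as a bar of its own.
ages = ["21", "19", "200", "21"]
elements = index_elements(ages)
suspicious = select_objects(elements, "200")
```

Keeping the row indices alongside each element at generation time is what makes the selection immediate, without any new scan of the dataset when the user clicks.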
[0088] Without the invention, manipulating large datasets often requires desktop computers with large screens, large memory capacities and powerful computation capabilities. As described with reference to Figure 1, the invention brings great efficiency to the various tasks that users and statisticians have to perform. In addition to desktop computers, the invention advantageously makes the manipulation of datasets much more practical on tablets and smartphones, which may have smaller displays and memories and less computing power.
[0089] While the next Figures (2 and after) correspond to a desktop computer implementation of the invention for greater clarity, a tablet or a smartphone display would also provide powerful data selection, processing and representation. For example, a column of an array may be selected by touching the column with a finger, a pen, a smart watch or by means of a gesture. The display may adapt itself to provide the data representation as depicted in the invention. For example, a graphical representation may appear. A long pressure or a harder pressure on the screen may bring up a contextual menu. Menu selection may trigger processing and adaptation to the selected values. The speed and efficiency of selecting and processing values of interest make it possible to work with large datasets on tablets and smartphones despite their reduced computing power compared to the powerful desktop computers used by professional statisticians.
[0090] Figure 2 displays an example of an interface of a device in a number of embodiments of the invention.
[0091] The interface 200 allows visualizing and modifying values of a dataset. The interface 200 shares certain common aspects with interfaces of spreadsheet applications (such as for example the Microsoft Excel® application). The interface 200 comprises a toolbar 210, and a raw representation of values 220. The interface 200 can be displayed in the display 110, and the user can interact with it using the one or more input interfaces 120.
[0092] The toolbar 210 comprises:
- a "File" section 211 that comprises buttons to manage the files: open, save, close, import, export a file, etc.;
- an "Edit" section 212 that comprises undo and redo buttons, to undo and redo the last action(s);
- a "Color" section 213 that comprises buttons to add color information to the raw representation to highlight specific values;
- a "Windows" section 214 that comprises buttons to manage the windowing of the interface (for example by providing a second window, cascading windows, etc.);
- a "Subgroups" section 215 to separate values into groups.
[0093] The raw representation 220 represents the values of the dataset using a two dimensional array. A first line 230 displays the names of the variables; each subsequent line 231, 232, 233... represents an object.
[0094] In this example, the raw representation 220 represents a French dataset representing information regarding French university students. A first column 2400 represents the number of the lines.
[0095] Each subsequent column represents a variable. For example:
- the column 2401 represents the values of the variable "Id", representative of the IDs of the students;
- the column 2402 represents the values of the variable "Age", representative of the ages of the students;
- the column 2403 represents the values of the variable "Sexe" (French for "Gender"), representative of the genders of the students, with two possible modalities:
o "Homme" (French for "Male");
o "Femme" (French for "Female");
- the column 2404 represents the values of the variable "NiveauDEtude", representative of the grades of the students (in French, "Niveau d'Etudes" - literally "Level of Studies"), with three possible modalities:
o "L2" (second year of Bachelor's degree);
o "L3" (third year of Bachelor's degree);
o "M1" (first year of master's degree);
- the column 2405 represents the values of the variable "UFR", representative of the faculties of the students (in French, "UFR" - for "Unite de Formation et de Recherche" - literally "Training and Research Unit"), with three possible modalities:
o "Staps" (French for "Faculty of Sports" - "STAPS" means "Sciences et Techniques des Activites Physiques et Sportives", literally "Sciences and Techniques of Sportive and Physical Activities");
o "SEGMI" (French for "Faculty of Mathematics and Economy" - "SEGMI" means "Sciences Economiques, Gestion, Mathematiques et Informatique", literally "Economic Sciences, Management, Mathematics and Computing Science");
o "SJPA" (French for "Faculty of Politics Science" - in French "SJPA" means "Sciences Juridiques Administratives et Politiques", literally "Politics, Administrative and legal sciences");
- the column 2406 represents the values of the variable "Redoublement", representative of whether the students have already repeated a school year at least once (in French, "Redoublement" means "Repeating a year"), with two modalities:
o "Oui" (French for "Yes": the student has already repeated at least one year);
o "Non" (French for "No": the student never repeated any year);
- the column 2407 represents the values of the variable "MentionBac", representative of the grade obtained at the French "baccalaureat" (equivalent of an A level grade - in French "Mention Bac"), with six possible modalities:
o "Felicitations" (French for "Congratulations");
o "Tres bien" (French for "Very good");
o "Bien" (French for "Good");
o "Assez Bien" (French for "Fair");
o "Passable" (French for "Poor");
o "Rattrapage" (French for "Re-take");
- the column 2408 represents the values of the variable "Copier", representative of whether students copy off during exams (in French "Copier" means "Copy off");
- the column 2409 represents the values of the variable "Communiquer", representative of whether the students communicate during exams (in French "Communiquer" means "Communicate");
- the column 2410 represents the values of the variable "EchangeBrouillon", representative of whether the students exchange drafts during exams (in French "EchangeBrouillon" means "Exchange Draft");
- the column 2411 represents the values of the variable "Antiseche", representative of whether the students use cheat sheets (in French "Antiseche" means "Cheat sheet");
- the column 2412 represents the values of the variable "SMS", representative of whether the students use text messages during exams (in French "SMS" means "text message");
- the column 2413 represents the values of the variable "CoursGenoux", representative of whether the students hold lessons on their knees during exams (in French "CoursGenoux" means "Lessons on Knees");
- the column 2414 represents the values of the variable "GarderCopie", representative of whether the students keep their examination test (in French "GarderCopie" means "Keep Copy");
- the column 2415 represents the values of the variable "PreparerSalle", representative of whether the students prepare the examination rooms during exams (in French "Preparer Salle" means "Prepare Room");
- the column 2416 represents the values of the variable "VolerSujet", representative of whether the students steal exam subjects (in French "VolerSujet" means "Steal subject").
[0096] Each cell represents the value of a variable for an object. For example, the cell 251 indicates that the value of the variable "Age" for the student having the ID 101 is 21 (i.e. the student is aged 21), and the cell 252 indicates that the variable "NiveauDEtude" for the student having the ID 107 has the value "L3", thereby indicating that the student is in the third year of a Bachelor's degree.
[0097] The values can thus be extracted from the dataset to be displayed to the user, for example by loading a file. The user can also enter values in the cells to modify one or more values, and save the modified dataset in a file.
[0098] Due to the limited size used for drawings, and the large size of typical datasets, figure 2 displays only the first lines and the first columns of the exemplary dataset. The interface 200 comprises sliders, not represented in figure 2, to navigate in the raw representation, and to see and modify values of the variables for the objects in the dataset.
[0099] The interface 200 is provided by means of non-limitative example only, and other interfaces may be used in various embodiments of the invention. The invention is applicable to other interfaces with other kinds of raw representations of values of a dataset.
[00100] Figure 3 displays an example of an automatic generation and display of statistical parameters in a number of embodiments of the invention.
[00101] The example of figure 3 is based on the exemplary interface 200 of figure 2. The example of figure 3 applies to the same dataset, objects, variables and values as figure 2. Therefore the columns and lines of the raw representation 220 are identical between figure 2 and figure 3. However the techniques disclosed in figure 3 can be adapted to any embodiment and interface of the invention, and can be applied to other datasets.
[00102] As discussed above, a dataset may contain errors, incorrect or abnormal values, or more generally normal but unusual values. In prior art solutions, a user can detect such values by 'manually' inspecting all values of the dataset. However this solution is cumbersome and impossible in practice for large datasets. Another option for the user is to enter formulas that provide statistical parameters (for example the minimum, the maximum, the number of occurrences of a value of a variable...) and, if one of these parameters is abnormal or unusual, look for the cells comprising the values that generated this abnormal or unusual parameter. However, this solution is also not satisfactory, for at least two reasons. First, it requires the user to enter correct and adequate formulas, which many users are not accustomed to doing. In addition, once an abnormal or unusual statistical parameter is identified, the user has to find the values that generated that abnormal or unusual statistical parameter, which is not an obvious task.
[00103] In one aspect, the invention advantageously solves this issue by calculating and displaying statistical parameters that depend upon the type of a variable, and, when the user clicks on a statistical parameter, selecting and/or highlighting the corresponding objects and/or values. Therefore, the user is able to intuitively select unusual or abnormal values.
[00104] In this example, statistical parameters are calculated and displayed for all variables when the user clicks on a specific button. However this is provided by means of example only, and the invention could be applied to any means of displaying statistical parameters for one or more variables.
[00105] As discussed above, upon the reception or load of the dataset, the types of one or more variables are automatically defined based on the values in the dataset.
[00106] In this example, the following types are automatically determined for the variables:
- Variable "Id": classified as an 'integer' variable, because all values of the column 2401 are integer numbers with fewer than 10 digits;
- Variable "Age": classified as an 'integer' variable, because all values of the column 2402 are integer numbers with fewer than 10 digits;
- Variables "Sexe" and "Redoublement": classified as 'binary' variables, because each has only two modalities in the dataset (for example "Oui"/"Non" for "Redoublement");
- Variables "NiveauDEtude", "UFR", "MentionBac", "Copier": classified as 'nominal' variables, because each has more than two modalities, and the values comprise letters.
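By means of non-limitative illustration only, the classification rules listed above could be sketched as follows in Python. The function name `infer_variable_type`, the set of markers treated as undefined values, the fallback for empty columns and the order in which the rules are tested are assumptions made for this sketch; the description does not specify them.

```python
import re

# Assumed markers for undefined values; the description only mentions "NA" and "N/A".
MISSING = {"", "NA", "N/A"}


def _is_number(text):
    """Return True if the text parses as a (possibly non-integer) number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_variable_type(values):
    """Guess the type of a variable from the raw strings of its column."""
    present = [v.strip() for v in values if v.strip() not in MISSING]
    if not present:
        return "nominal"  # assumption: no usable value, fall back to nominal
    # Integer: every value is an integer with fewer than 10 digits.
    if all(re.fullmatch(r"[+-]?[0-9]{1,9}", v) for v in present):
        return "integer"
    # Numeric: every value is a number, at least one of them not an integer.
    if all(_is_number(v) for v in present):
        return "numeric"
    # Binary: exactly two distinct modalities.
    if len(set(present)) == 2:
        return "binary"
    # Otherwise the values comprise letters and more than two modalities.
    return "nominal"
```

With these assumed rules, a column containing "Oui", "OUI" and "Non" would be classified as nominal rather than binary, which is the kind of situation discussed with reference to figures 6a to 6g.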
[00107] In order to determine, calculate, and display the statistical values:
- the adaptation 143 to determine the synthetic representation types based on said type of variable comprises an adaptation to determine one or more statistical parameters based on said type of variable;
- the adaptation 144 to generate the synthetic representation of values comprises an adaptation to calculate one or more statistical parameters for said variable.
[00108] In this example, in order to let the user view and select specific values, synthetic representations 3401, 3402, 3403, 3404, 3405, 3406, 3407, 3408 are generated and displayed. These representations are arrays comprising one or more statistical parameters depending on the type of variable. In this example:
- For each integer or numeric variable, a synthetic representation displays the minimum, 1st quartile, median, 3rd quartile, maximum, average and standard deviation values. It is for example the case of synthetic representations 3401, 3402 for variables "Id" and "Age" respectively;
- For each binary, nominal or ordered variable, a synthetic representation displays a list of the modalities of the variable, and the number of occurrences of each modality. It is for example the case of the synthetic representations 3403, 3404, 3405, 3406, 3407, 3408 for the variables "Sexe", "NiveauDEtude", "UFR", "Redoublement", "MentionBac", "Copier" respectively.
These statistical parameters are chosen by means of non-limitative example only. In some embodiments of the invention, other statistical parameters may be used, provided that they are selected based on the type of variable. For example, a percentage of occurrences of each modality may be used instead of the number of occurrences for binary, nominal or ordered variables.
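The selection of statistical parameters according to the type of variable can be illustrated by the following minimal Python sketch. The function name `summarize`, the use of `None` for undefined values and the "inclusive" quartile method are assumptions of this sketch; the description does not state how quartiles are computed.

```python
import statistics
from collections import Counter


def summarize(values, var_type):
    """Compute the statistical parameters shown for one variable.

    `values` holds numbers (integer/numeric types) or strings (categorical
    types); None stands for an undefined value (assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    undefined = len(values) - len(present)
    if var_type in ("integer", "numeric"):
        # Quartile convention is an assumption; at least two values are needed.
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        return {
            "min": min(present),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(present),
            "mean": statistics.mean(present),
            "sd": statistics.stdev(present),
            "undefined": undefined,
        }
    # Binary, nominal or ordered variable: occurrences of each modality.
    return {"counts": dict(Counter(present)), "undefined": undefined}
```

For a hypothetical "Age" column such as `[20, 21, 21, 22, 41, None]`, the maximum stands well apart from the median and the third quartile, which is the kind of discrepancy the synthetic representation is meant to expose.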
[00109] In a number of embodiments of the invention, the variables can take values representative of an undefined or inapplicable value. For example, a "NA" or "N/A" modality can stand for "Not Applicable". In a number of embodiments of the invention, the number of undefined values can be indicated for each variable. For example, it is indicated in 3502 that the variable "Age" has 10 undefined or empty values.
[00110] The synthetic representation of statistical parameters allows the user to have a quick understanding of the distribution of values among the variables, and to detect abnormal or unusual values. For example, the user can easily detect that the maximum value 34021 of the variable "Age" is 41, which is far above the median value 34022, or even the third quartile 34023 of the variable "Age" (respectively 21 and 22).
[00111] There is therefore a strong probability that the value "41" is an incorrect input. The user can determine if it is the case using his/her knowledge of the dataset. In this case, the user is aware that the dataset is representative of university students aged around 20. Therefore, the user suspects that the value "41" results from an incorrect input and wishes to replace the "41" by "21".
[00112] In order to let the user select and modify values:
- the adaptation 146 to receive from said one or more input interfaces the selection by the user of the element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a statistical parameter;
- the adaptation 147 to perform the selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects based on said statistical parameter and an adaptation to highlight values of said variable for said one or more objects in the raw representation.
[00113] In this example, the interface 200 allows this operation by letting the user click on the maximum value 34021. Upon the click, the object whose value of the variable "Age" is 41 is selected and displayed. The user can then change the value "41" to "21". Similarly, the user may click on the 3rd quartile 34023 to select all the students whose value of "Age" is in the 3rd quartile, click on the number 34041 of occurrences of the modality "M2" of the variable "NiveauDEtude" to select all the occurrences of this modality, etc.
[00114] According to various embodiments of the invention, the values that the user wishes to select can be selected and/or highlighted. The object for which the variable has the desired value can also be placed in the top lines of the raw representation, in order to be easily identified by the user.
[00115] Therefore, the user is able to identify and modify abnormal or unusual values.
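The selection triggered by a click on a statistical parameter can be sketched as below, by means of illustration only. The function name `select_by_parameter`, the representation of objects as dictionaries and the restriction to the "min" and "max" parameters are assumptions of this sketch.

```python
def select_by_parameter(rows, variable, parameter):
    """Select the objects whose value equals the clicked statistical parameter.

    Returns the indices of the selected rows and a reordered list of rows in
    which the selected rows are placed on top, as in the raw representation.
    Only "min" and "max" are handled in this sketch.
    """
    values = [r[variable] for r in rows if r[variable] is not None]
    target = {"min": min, "max": max}[parameter](values)
    selected = [i for i, r in enumerate(rows) if r[variable] == target]
    on_top = [rows[i] for i in selected]
    others = [r for i, r in enumerate(rows) if i not in selected]
    return selected, on_top + others


# Hypothetical excerpt of the dataset.
students = [
    {"Id": 101, "Age": 21},
    {"Id": 102, "Age": 41},
    {"Id": 103, "Age": 20},
]
selected, view = select_by_parameter(students, "Age", "max")
view[0]["Age"] = 21  # the user replaces the suspicious value "41" by "21"
```

The reordered view shares its row objects with the dataset, so that a value edited in the top line is also modified in the dataset itself.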
[00116] Figures 4a to 4d display an example of a drawing of a graph representing values of a variable, detection and correction of inconsistent data using an interaction of the user with the graph.
[00117] This example provides another option for allowing the user to detect and correct intuitively and easily unusual or abnormal values. In order to provide to the user an even more intuitive experience, the values of variable can be represented in the form of a graph, and the user can select values by clicking on elements of the graph.
[00118] Figure 4a displays an example of synthetic representation of values using graphs, which is generated in this example when the user clicks on the fifth button of the "Windows" section 214.
[00119] In this example, the synthetic representation type is a graph, wherein each section of the graph represents the number of occurrences of one or more values of the variable for the plurality of objects.
[00120] In order to provide the best overview of the values for each variable, different types of graphs can be used, and:
- the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine a type of graph based on the types of variable;
- the adaptation 144 to generate the synthetic representation of values comprises an adaptation to calculate a size of each section of the graph.
[00121] In this example:
- for each integer, numeric or ordered variable, a histogram is used. It is for example the case of synthetic representations 4401, 4402a for variables "Id" and "Age" respectively;
- for each binary or nominal variable, a bar graph is used. It is for example the case of the synthetic representations 4403, 4404, 4405, 4406, 4407, 4408 for the variables "Sexe", "NiveauDEtude", "UFR", "Redoublement", "MentionBac", "Copier" respectively.
These types of graphs are chosen by means of example only. In some embodiments of the invention, other types of graphs may be used, provided that they are selected based on the type of variable. For example, a circle graph may be used instead of a bar graph for binary, nominal or ordered variables. It shall be noted that graphs that simply indicate a number of occurrences, such as bar or circle graphs, may be used for variables that do not imply any kind of order between the values, for example binary or nominal variables, while histograms may be used for variables that imply an order in the values of variables, such as integer, numeric or ordered variables. Indeed, the sizes of the bars of the graphs are representative of the number of occurrences of values, or the number of occurrences of values within a range. In the case of histograms, the positions of the bars are in addition representative of the values themselves.
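As a non-limitative sketch, the choice of a type of graph and the calculation of the size of each section could look as follows in Python. The function name `graph_sections`, the fixed bin width and the omission of ordered variables (which would require the order of their modalities) are assumptions of this sketch.

```python
from collections import Counter


def graph_sections(values, var_type, bin_width=5):
    """Return the type of graph and the size of each of its sections.

    Integer or numeric variables get a histogram: a list of (bin start, count)
    pairs covering every bin between the lowest and the highest value, so that
    an isolated bar stays visibly isolated. Other variables get a bar graph:
    a list of (modality, count) pairs.
    """
    present = [v for v in values if v is not None]
    if var_type in ("integer", "numeric"):
        counts = Counter(int(v // bin_width) * bin_width for v in present)
        first, last = min(counts), max(counts)
        sections = [(start, counts.get(start, 0))
                    for start in range(first, last + bin_width, bin_width)]
        return "histogram", sections
    return "bar", sorted(Counter(present).items())
```

With hypothetical ages `[20, 21, 21, 22, 41]`, the histogram contains one full bin starting at 20, three empty bins, and a single occurrence in the bin starting at 40; this corresponds to the small and isolated bar discussed for the variable "Age".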
[00122] The graphical representation of the values advantageously allows the user to have a quick understanding of the distribution of values among the variables, and to detect abnormal or unusual values. The exemplary dataset of figures 4a to 4d is identical to the exemplary dataset of figure 3. In this case, the abnormal value "41" of the variable "Age" can be immediately identified by the small and isolated bar 44021a.
[00123] As shown in figure 4b, the user can click on the graph in order to enlarge it: the graph 4402b is an enlarged version of the graph 4402a, and further emphasizes the isolation of the bar 44021b corresponding to the value "41" of the variable "Age".
[00124] There is therefore a strong probability that the value "41" is an incorrect input. The user can determine if it is the case using his/her knowledge of the dataset. In this case, the user is aware that the dataset is representative of "university students aged around 20". Therefore, the user suspects that the value "41" results from an incorrect input and may wish to replace the value "41" by "21".
[00125] In order to let the user select and modify values:
- the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a section of the graph;
- the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects whose value of the variable is represented by the section selected by the user.
[00126] As shown in figure 4c, in this example, the interface 200 allows this operation by letting the user click on the isolated bar 44021c. Upon the click, the student whose value of the variable "Age" is 41 is selected and displayed: the corresponding line 431 is moved to the top of the raw representation, and highlighted. The abnormal value "41" is thus immediately apparent 4311c.
[00127] As shown in figure 4d, the user can modify the value "41" and replace it by the value "21" 4311d, by simply clicking on the value and entering the new value. Upon the modification of the value, the graph 4402d is immediately updated, while the modified value remains highlighted 44021d. The user can thus ensure that the value that he/she entered is a correct one, and that no other abnormal values are present.
[00128] It shall be noted that, while in this example the user selects and modifies a single value, in cases where an element of a graph represents a plurality of values, the user can select and modify the plurality of values at the same time by clicking on the element of the graph, and entering a new value.
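The selection of the objects represented by a section of a histogram, and the replacement of all their values at once, can be sketched as follows. The function names, the dictionary representation of objects and the description of a section by a start value and a width are assumptions of this sketch.

```python
def rows_in_section(rows, variable, start, width):
    """Return the objects whose value falls in the clicked histogram section."""
    return [r for r in rows
            if r[variable] is not None and start <= r[variable] < start + width]


def replace_in_section(rows, variable, start, width, new_value):
    """Assign new_value to every object of the section; return how many changed."""
    selected = rows_in_section(rows, variable, start, width)
    for row in selected:
        row[variable] = new_value
    return len(selected)


# Hypothetical excerpt of the dataset.
students = [
    {"Id": 101, "Age": 21},
    {"Id": 102, "Age": 41},
    {"Id": 103, "Age": 20},
]
changed = replace_in_section(students, "Age", start=40, width=5, new_value=21)
```

After such a replacement the section sizes would be recalculated, so that the graph is updated as soon as the value is modified.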
[00129] Therefore, the user is able to easily and intuitively identify and modify abnormal or unusual values.
[00130] Such graph representations also allow the user to detect values which are placed in a wrong order. It is for example the case upon inspection of the graph 4407, as shown in figure 4d: the order of the sizes of the bars of the graph seems random, with for example a high bar 44071, then smaller bars 44072, then a yet higher bar 44073. This graph is not coherent with the meaning of the variable "MentionBac". Indeed, the variable "MentionBac" represents the appreciations obtained by the students in their "baccalaureat" exams. Therefore, the sizes of the bars should roughly decrease from the middle to the sides of the graph, as the probability of obtaining a very high or very low appreciation is lower than the probability of obtaining an average appreciation. The invention also allows a correction of this issue, as will be explained hereinafter. Therefore, a simple overview of the graphs allows the user to intuitively detect and correct incoherent or abnormal values.
[00131] Figures 5a to 5h display an example of association and modification of modalities of a variable in a number of embodiments of the invention.
[00132] One issue arising with datasets created from a large number of sources (for example datasets created by polling a large number of individuals) is that some inputs may be erroneous inputs, or use several different words for designating the same value of a variable.
[00133] Figure 5a displays an example of synthetic representation allowing a detection by the user of such inconsistencies within the dataset.
[00134] Figure 5a also displays the interface 200 for managing the dataset displayed in figures 2 and 3: the variables "Id", "Age", "Sexe", "UFR", "Redoublement", "MentionBac", "Copier", "Communiquer", "EchangeBrouillon" are respectively displayed in columns 2401, 2402, 2403, 2405, 2406, 2407, 2408, 2409 and 2410. The variable "NiveauDEtude" is still represented in column 2404, but the column has been moved between the columns 2407 and 2408.
[00135] A variable "Taille" (French for "Height") has been added in column 5420, which represents the heights of the students, in meters. The variable "Taille" is classified as a numeric variable, because all values belonging to this variable are non-integer numbers.
[00136] In this example, when the user clicks on the third button of the "Windows" section 214, the interface 200 displays 560 the type of each variable. For example, it displays 5601 that the variable "Id" is an integer variable, it displays 5620 that the variable "Taille" is a numeric variable and it displays 5605 that the variable "UFR" is a nominal variable.
[00137] In addition, the interface 200 displays synthetic representations for each categorical variable. In this example, the synthetic representation type is a list of modalities of the variable, if the variable is a categorical variable (for example, a nominal, ordered or binary variable). In this example, the variable "Age" is considered as a nominal variable, because at least one student used letters to enter his/her age. Methods to correct such input errors will be presented hereinafter.
[00138] An integer or numeric variable can take a large number of different values. Displaying the values of numeric or integer variables would thus reduce the clarity of the presentation of the modalities of the variables to the user. In order to let the user have a clear overview of the modalities of the variables, the modalities should be represented only for categorical variables, and:
- the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable;
- the adaptation 144 to generate the synthetic representation of values comprises an adaptation to represent, if the variable is a categorical variable, modalities of the variable among the plurality of objects.
[00139] In this example, the interface thus displays lists of modalities 5703, 5705, 5706, 5707, 5704, 5708, 5709 and 5710 for the variables "Sexe", "UFR", "Redoublement", "MentionBac", "NiveauDEtude", "Copier", "Communiquer", and "EchangeBrouillon" respectively. Meanwhile, no list of values is displayed for variables "Id", "Age" and "Taille", which are integer or numeric variables.
[00140] In this example, the modalities of the variable are represented in alphabetical order. This allows the user to quickly identify values that have been entered using separate words for the same notion. For example, the user can easily identify that the variable "UFR" has a modality "Sjap" 57051a and a modality "SJAP" 57052a, and that these modalities should be associated. The user can also easily identify incorrect inputs such as the modality "piano" 57053a, which does not correspond to any faculty. The modality "economie" 57056a (French word for economics) also does not correspond to a name of a faculty, but should be associated with the modality "SEGMI" 57057a (as discussed before, "SEGMI" is the faculty of economics).
[00141] The user may thus wish to remove the value 57053a that does not correspond to a possible modality, and associate the values 57051 a and 57052a that should correspond to the same modality.
[00142] In order to do so:
- the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces an association between a first and a second modality of the variable;
- the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having the first modality as value of the variable, and assign the second modality as value of the variable for said one or more objects.
[00143] As shown in figures 5b and 5c, the user can select the modality "piano" 57053b, then provide an instruction to delete this modality. The instruction to delete the modality can be provided for example by clicking on a specific button (for example the button 'suppr' or 'delete'). Any suitable means may be used to indicate that the modality should be deleted. Once the instruction to delete the modality is received, the corresponding occurrences (in this case, the unique occurrence) are highlighted 57053c.
[00144] In this example, when changes of the values are to be performed, a validation button 57054 and a cancel button 57055 appear. The changes are not performed until the user clicks on the validation button 57054. At any moment, the user can click on the cancel button 57055 to cancel the changes of values.
[00145] Figures 5d, 5e and 5f display the association by the user of the modality "economie" to the modality "SEGMI". The user drags and drops the word "economie" onto the word "SEGMI" 57056d, 57056e, 57056f. When the word "economie" is placed on the cell "SEGMI", the cell "SEGMI" is highlighted 57057e. When the modalities are successfully associated, the modality "economie" 57056f is placed under "SEGMI" 57057f. The occurrences 57058f of the modality "economie" are highlighted, in order to indicate the values which are subject to change upon the validation of the user.
[00146] Figure 5g displays the results of the same operation, to associate the modality "Sjap" 57051g with the modality "SJAP" 57052g. The occurrences 57059g of the modality "Sjap" to be modified are highlighted in addition to the occurrences of the modality "piano" and the occurrences of the modality "economie" that were previously highlighted. The user then clicks on the validation button 57054.
[00147] Figure 5h displays the interface 200 once the user has clicked on the validation button 57054. The modalities "piano", "Sjap" and "economie" have been successfully removed. The occurrence of the modality "piano" has been replaced by a "NA" value 57053h indicating an empty value; the occurrences of the "economie" modality have been replaced by occurrences of the "SEGMI" modality 57058h; the occurrences of the "Sjap" modality have been replaced by occurrences of the "SJAP" modality 57059h.
[00148] This example demonstrates the ability of the invention to remove incorrect values and associate values of variables that have the same meaning. This interface has the advantage of being very intuitive and convenient to use. The modifications performed by the user can optionally apply automatically to all values of a variable. This advantageously allows a removal of inconsistent values from the dataset, which improves the statistics and graphics that can be built on the dataset afterwards.
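The sequence of figures 5b to 5h (deletion or association of modalities, highlighting of the affected occurrences, then validation or cancellation) can be sketched as below. The class name `ModalityEditor` and its methods are hypothetical and given by means of illustration only; they are not an interface defined by the description.

```python
NA = "NA"  # marker used in this sketch for an empty value


class ModalityEditor:
    """Collect pending changes on one variable and apply them on validation."""

    def __init__(self, values):
        self.values = list(values)
        self.pending = {}  # modality to replace -> replacement modality

    def associate(self, modality, target):
        """Record that `modality` is to be replaced by `target` (drag and drop)."""
        self.pending[modality] = target

    def delete(self, modality):
        """Record that `modality` is to be replaced by an empty value."""
        self.pending[modality] = NA

    def highlighted(self):
        """Indices of the occurrences that would change upon validation."""
        return [i for i, v in enumerate(self.values) if v in self.pending]

    def validate(self):
        """Apply every pending change to all values of the variable."""
        self.values = [self.pending.get(v, v) for v in self.values]
        self.pending.clear()

    def cancel(self):
        """Discard the pending changes; the values are left untouched."""
        self.pending.clear()
```

Keeping the changes in a separate mapping until validation is one simple way to obtain the behaviour described above, in which no value is modified before the user clicks on the validation button.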
[00149] Figures 6a to 6g display a first example of modification of a type of variable in a number of embodiments of the invention.
[00150] In certain cases of input errors, an incorrect type could be associated with a variable. This is for example the case if a variable should be binary, and the inputs provided by a large number of individuals define at least three values for the variable. The variable can then be automatically detected as a nominal variable instead of a logical one.
[00151] Figure 6a displays an example of synthetic representation allowing a detection by the user of an incorrect detection of a type of a variable.
[00152] Figure 6a also displays the interface 200 for managing the same dataset as displayed in figures 5a to 5h.
[00153] In this example, the synthetic representations for each variable are lists of the variable types indicating to which type the variables belong, and, for categorical variables (i.e. binary, nominal or ordered variables), a list of occurrences.
[00154] An integer or numeric variable can take a large number of different values. Displaying the values of numeric or integer variables would thus reduce the clarity of the presentation of the modalities of the variables to the user. In order to let the user have a clear overview of the modalities of the variables, the modalities should be represented only for categorical variables, and:
- the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable;
- the adaptation 144 to generate the synthetic representation of values comprises an adaptation to represent a list of variable types and, if the variable is a categorical variable, modalities of the variable among the plurality of objects.
[00155] In this example, the interface thus displays lists of modalities for the variables "Sexe", "Age", "UFR", "Redoublement", "MentionBac", "NiveauDEtude", "Copier", "Communiquer" and "EchangeBrouillon" respectively. Meanwhile, no list of values is displayed for variables "Id" and "Taille", which are integer or numeric variables. On the other hand, the type of variable is represented for all variables. It shall be noted that the variable "Age" is interpreted as a nominal variable in this example, whereas it should be interpreted as an integer variable. This issue will be explained in more detail with reference to figures 7a to 7h.
[00156] The column 2406 represents the values of the variable "Redoublement". The synthetic representation for the variable "Redoublement" is formed of a list of variable types 6606a indicating that the variable "Redoublement" is a nominal variable, and a list of occurrences 6706a indicating that the variable "Redoublement" has three modalities: "Oui", "OUI" and "Non".
[00157] As explained above, the variable "Redoublement" indicates if students repeated a school year at least once. It therefore should be a binary variable with two modalities "Oui" (French for "Yes") and "Non" (French for "No"). In this example, some students entered "OUI" (French for "YES") using only capital letters. The synthetic representation using the lists 6606a and 6706a allows the user to detect intuitively and efficiently the issue with the variable "Redoublement".
[00158] The user may thus wish to associate the "OUI" and "Oui" modalities within a single one, in order for the variable "Redoublement" to be correctly interpreted as a binary variable.
[00159] In order to do so:
- the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a desired type of the variable, which is different from said type of variable;
- the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises:
o an adaptation to identify and output to said display modalities of the variable that do not match the desired type of the variable;
o an adaptation to receive from said one or more input interfaces a selection of a modality of the variable that does not match the desired type of the variable, and a replacement modality to replace said modality;
o an adaptation to select one or more objects whose value of the variable is said modality that does not match the desired type of the variable, and replace said modality that does not match the desired type of the variable by said replacement modality for said one or more objects.
[00160] As shown in figure 6b, the user can click on "logical" 6606b in order to indicate that the variable "Redoublement" should be a logical one. It is initially impossible to set the type of the variable "Redoublement" to a logical type, because the values have three different modalities, while a logical variable should have only two. The list of modalities is thus highlighted 6706b. A validation button 67064 and a cancel button 67065 also appear. At this time, the validation button 67064 is marked in red in order to indicate that it is not yet possible to validate the change of the variable type, because there are still more than two modalities of the variable "Redoublement".
[00161] As shown in figures 6c, 6d, 6e and 6f, the user drags and drops in the list the modality "OUI" 67061c, 67061d, 67061e onto the modality "Oui" 67062e. When the drag and drop is complete, the modality "OUI" 67061f is placed under the modality "Oui" 67062f. The occurrences of the modality "OUI", which may be modified, are highlighted 62061f, 62062f. The validation button 67064 is not highlighted anymore, and the user can click on it to validate the modification.
[00162] Figure 6g displays the interface when the user clicks on the button 67064 to validate the changes. The type of the variable "Redoublement" is set as "Logical" with two modalities "Oui" 67062g and "Non" 67063g. The previous occurrences of the value "OUI" have been changed to "Oui" 67061g, 67062g.
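The condition under which the validation button becomes available in figures 6b to 6g can be sketched as follows, by means of illustration only. The function names `can_set_logical` and `merge_modality`, and the exclusion of "NA" from the count of modalities, are assumptions of this sketch.

```python
NA = "NA"  # empty values are not counted as a modality (assumption)


def can_set_logical(values):
    """A logical (binary) type is possible only with at most two modalities."""
    return len({v for v in values if v != NA}) <= 2


def merge_modality(values, modality, replacement):
    """Replace every occurrence of `modality` by `replacement`."""
    return [replacement if v == modality else v for v in values]


# Hypothetical values of the variable "Redoublement".
redoublement = ["Oui", "OUI", "Non", "Oui", NA]
allowed_before = can_set_logical(redoublement)  # three modalities: refused
merged = merge_modality(redoublement, "OUI", "Oui")
allowed_after = can_set_logical(merged)  # two modalities: can be validated
```

In this sketch the check is simply repeated after each association, which mirrors the validation button being marked in red until only two modalities remain.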
[00163] In a number of embodiments of the invention, upon the modification of a type of a variable, all values of the variable are modified in the memory of a device of the invention, in order to be cast to a type adapted to the new type of the variable. In this example, all values of the variable "Redoublement" are initially stored in the form of strings of characters. Upon modification of the type of variable to a binary variable, all the values can be cast in memory to a better suited type, for example to a Boolean type. This allows savings of memory space, and a more efficient performance of operations on the values of the variable.
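The cast of the stored values upon a change of type can be illustrated by the following minimal sketch. The function name `cast_to_boolean`, the choice of which modality maps to `True`, and the use of `None` for empty values are assumptions of this sketch.

```python
def cast_to_boolean(values, true_modality, false_modality, na="NA"):
    """Cast the string values of a binary variable to Booleans.

    Empty values are cast to None. A ValueError is raised if a value matches
    neither modality, i.e. if the variable is not yet a valid binary variable.
    """
    mapping = {true_modality: True, false_modality: False, na: None}
    try:
        return [mapping[v] for v in values]
    except KeyError as exc:
        raise ValueError(f"unexpected modality: {exc.args[0]!r}") from exc
```

For example, `cast_to_boolean(["Oui", "Non", "NA"], "Oui", "Non")` yields `[True, False, None]`, whereas a remaining "OUI" value would be rejected until it has been associated with "Oui".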
[00164] This example demonstrates the ability of the invention to let the user correct an incorrect typing due to inconsistent inputs. The modifications performed by the user can optionally automatically apply to all values of a variable. This advantageously allows a removal of inconsistent values from the dataset, which improves the statistics and graphics that can be built on the dataset afterwards.
[00165] Figures 7a to 7g display a second example of modification of a type of variable in a number of embodiments of the invention.
[00166] Another case of input error arises when some values of a numeric or integer variable are entered using letters. This is for example the case of the variable "Age", which is detected as a nominal instead of an integer variable because some values have been entered using letters: three students aged 20 have entered "2O" (with a capital letter O instead of a zero) or "Vingt" ("Twenty" in French) instead of "20".
[00167] As depicted in figure 7a, the user can click on "integer" 7602a to indicate that the variable "Age" should be an integer one. When he/she does so, a validation button 77024 and a cancel button 77025 appear, in order to validate or cancel the modification. The button 77024 is initially highlighted in red, in order to indicate that it is not possible to validate the modification, because the values of the variable "Age" are not all integer values yet.
[00168] Meanwhile, lines corresponding to objects whose values of the "Age" variable are not compatible with an integer variable are moved to the top of the raw representation 720 and these values are highlighted: the value "2O" 72021a, and the values "Vingt" 72022a, 72023a. Therefore, the user can easily identify the values that caused the variable "Age" to be typed as a nominal variable instead of an integer variable.
[00169] As displayed in figure 7b, when the user clicks on one of the highlighted values, a pop-up appears to let the user modify the value. For example, the user clicks on the value "2O" 72021b. A pop-up 72024b is displayed. The pop-up allows the user to choose between propositions of corrections, or manually entering a correction. In this case, a first proposition of correction is "NA" to indicate an empty value. A second proposition is "20". Indeed, the value "20" has been detected as a possible correction of the value "2O", since "0" is a digit whose shape is similar to the capital letter "O". The user clicks on "20".
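A minimal sketch of how such propositions of correction might be computed is given below in Python. The table `LOOKALIKES` of letters whose shape resembles a digit and the function name `propose_corrections` are hypothetical and serve only to illustrate the principle.

```python
# Hypothetical table of letters whose shape is similar to a digit.
LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}


def propose_corrections(value):
    """Return propositions of correction for a value that is not an integer."""
    proposals = ["NA"]  # the empty value is always proposed
    candidate = "".join(LOOKALIKES.get(ch, ch) for ch in value)
    # Propose the candidate only if the substitution yields a plain integer.
    if candidate != value and candidate.isascii() and candidate.isdigit():
        proposals.append(candidate)
    return proposals


for_2O = propose_corrections("2O")
for_vingt = propose_corrections("Vingt")
```

Under these assumptions, the value "2O" receives the propositions "NA" and "20", while "Vingt" receives only "NA" and has to be corrected manually.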
[00170] As shown in figure 7c, upon the click on "20", the value "2O" is highlighted and/or put in a different color 72021c in order to indicate that this value is to be modified upon validation of the changes and is not highlighted anymore as an incorrect value.
[00171] As shown in figure 7d, the user can click on the value "Vingt" 72022d, and a pop-up 72024d appears with possible corrections. In this case, the user can choose between the empty "NA" value, and manually entering a new value. As shown in figure 7e, the user manually enters the value "20" 72024e.
[00172] As shown in figure 7f, when the user enters the value "20" as the correction to perform, all the occurrences of the value "Vingt" are highlighted in a different color 72022f, 72023f in order to indicate that these values are to be modified upon validation of the changes. Therefore, all incorrect values of the variable "Age" are ready to be modified into an integer value, and no value is highlighted as an incorrect value. The user can click on the validation button 77024, which is no longer highlighted, in order to validate the modification.
[00173] Figure 7g displays the output of the modification when the user validates the changes. The type of the "Age" variable is marked as integer 7602g; the previously incorrect values are set to the integer value "20" (for example the value 72021g for the student having the Id 6) and the lines are re-ordered in the raw representation.
[00174] This example demonstrates the ability of the invention to let the user correct an incorrect type of variable due to inconsistent inputs, when a variable should be an integer. A similar modification can be performed for numeric variables. The modifications performed by the user apply to all values of the variable. This advantageously allows a removal of inconsistent values from the dataset, which improves the statistics and graphics that can be built on the dataset afterwards.
[00175] As discussed above, the representation of the values of the variable "Age" can be changed in the memory of the device. For example, if the values were initially stored in the form of strings of characters, they can be cast to integers. This saves memory space and allows more efficient management of the values of the variable.
[00176] The method of modification displayed in figures 7a to 7g is provided by way of example only of modifications of values of a variable to assign to a variable an integer or numeric type. This method can be adapted in many different ways to various interfaces. For example, instead of using a pop-up window, the user may directly enter corrected values in the cells that contain nominal values, and the changes can be propagated to all values of the variable that have the same modality.
[00177] Another use of the interface is to allow the user to re-order ordered values. For example, as explained above and displayed in figure 7g, the order of the modalities 7707g of the variable "MentionBac" is incorrect: they should rank in the following order: "Rattrapage"; "Passable"; "Assez bien"; "Bien"; "Tres bien"; "Felicitations". As discussed above, this issue can be detected easily by a user by viewing the graph 4407. The user can correct this issue by dragging and dropping modalities of the list 7707g. For example, he/she can drag and drop the modality "Rattrapage" 77071g to the top of the list, then the modality "Passable" 77072g under the modality "Rattrapage", and so on. This allows re-ordering the modalities in order to have proper graphs.
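As a purely illustrative sketch, each drag and drop can be seen as moving one modality to a new position in the ordered list. The function name `move_modality` and the initial (alphabetical) order used below are assumptions; the incorrect order actually displayed in figure 7g is not reproduced here.

```python
def move_modality(order, modality, new_index):
    """Return a new ordering with `modality` moved to position `new_index`."""
    if modality not in order:
        raise ValueError(f"unknown modality: {modality!r}")
    reordered = [m for m in order if m != modality]
    reordered.insert(new_index, modality)
    return reordered


# Hypothetical initial order of some modalities of "MentionBac".
order = ["Assez bien", "Bien", "Passable", "Rattrapage"]
order = move_modality(order, "Rattrapage", 0)  # dropped on top of the list
order = move_modality(order, "Passable", 1)    # dropped under "Rattrapage"
```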
[00178] Figures 8a to 8d display an example of filtering of values based on two variables in a number of embodiments of the invention.
[00179] Another functionality of spreadsheet applications is the filtering of objects to display based on their values of variables. In the example of the dataset of students, a user of the interface may wish to view only students of a certain age, in a given faculty, and so on. The filtering can be performed based on the values of one or more variables. This allows inspecting values only on a desired subset of objects. Graphs can be drawn only for a desired subset of values.
[00180] In existing applications, the selection of the values is performed using lists of all the values of a variable, wherein a user can check all the values to display. However, this action may be difficult to perform in cases of variables that allow a large number of different values, for example integer or numeric variables.
[00181] In order to allow the user to select more efficiently the objects to display, in one aspect of the invention, a processing logic of a device of the invention is configured to generate a synthetic representation type that is one of a list of modalities, a list of intervals, or a list of modalities and intervals. More specifically:
- the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to set the representation type as a list of modalities if the variable is of a categorical type, and one of a list of intervals, or a list of modalities and intervals if the variable is of an integer or numeric type;
- said adaptation 144 to generate the synthetic representation of values comprises an adaptation to generate a list of modalities, intervals, or combination thereof describing all the values of the variable for the plurality of objects, depending on the synthetic representation type;
- the processing logic 140 further comprises an adaptation to filter the values of the raw representation in order to display only values corresponding to an element of said list of modalities, intervals, or combination thereof selected by the user.
[00182] Figure 8a displays an interface to filter values according to the invention. A synthetic representation is displayed for each variable in order to allow the user to select objects. For a categorical variable, the synthetic representation is a list of all the modalities of the variable. For example, the synthetic representation of the variable "Sexe" is a list 8603a of the two modalities "Homme" and "Femme".
[00183] Meanwhile, the synthetic representation for integer or numerical variables is a mix of values and intervals. This limits the size of the list for variables that can take a large number of different values.
[00184] According to various embodiments of the invention, the values and intervals can be defined in different ways. In this example, the synthetic representation 8602a comprises:
- the minimum 86021a of the values of the variable "Age";
- an interval 86022a that encompasses all of the values of the variable "Age" between the minimum and the median;
- the median 86023a of the values of the variable "Age";
- an interval 86024a that encompasses all of the values of the variable "Age" between the median and the maximum;
- the maximum 86025a of the values of the variable "Age".
[00185] Therefore, the synthetic representation 8602a encompasses all values of the variable age, and allows the user to select easily a large number of values. Meanwhile, the combination of values and intervals allows the synthetic representation 8602a to remain compact. This synthetic representation is provided by means of example only, and other rules of definition of the values and intervals could be used for integer or numerical variables. If the user wishes to have the option to select separately all the possible values of the variable, he/she can right-click on the name of the variable "Age" on top of the synthetic representation 8602a.
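The five-element representation described above (minimum, interval, median, interval, maximum) may be sketched as follows in Python. The names `synthetic_representation` and `matches`, the encoding of elements as tuples, and the choice of open intervals that exclude their bounds are assumptions of this sketch.

```python
from statistics import median


def synthetic_representation(values):
    """Build the list [min, ]min, median[, median, ]median, max[, max]."""
    present = sorted(v for v in values if v is not None)
    if not present:
        raise ValueError("the variable has no value")
    lo, med, hi = present[0], median(present), present[-1]
    return [
        ("value", lo),
        ("interval", (lo, med)),
        ("value", med),
        ("interval", (med, hi)),
        ("value", hi),
    ]


def matches(element, value):
    """Tell whether `value` is represented by the selected `element`."""
    kind, content = element
    if kind == "value":
        return value == content
    low, high = content
    return low < value < high  # open interval (assumption)


ages = [18, 19, 20, 21, 24]  # hypothetical values of the variable "Age"
representation = synthetic_representation(ages)
```

With these bounds excluded from the intervals, every value of the variable is represented by exactly one element of the list.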
[00186] Figure 8b discloses a selection of objects by a user. For example, the variable "NiveauDEtude" has 5 modalities: "L1", "L2", "L3", "M1" and "M2". The user clicks on the modality "M1" 8604b. Upon the selection of the user, the objects are filtered. As shown in the column 8400b, only a part of the objects are displayed in the raw representation. As shown in the column 8404b, the objects shown are those whose value of the variable "NiveauDEtude" is "M1". Therefore, the selection allows the user to view only data relating to the students studying for an "M1" grade. The selection could also be performed for a range of values, if this option is available. Another option is to let the user enter a formula that defines the values he/she wishes to select. For example, the user may enter simple formulas to select a plurality of values, a combination of values and intervals, and so on.
[00187] In figure 8c, the user modifies his/her selection by also clicking on the modality "M2" 8604c. As shown in the columns 8400c and 8404c, a larger number of objects are thus selected and represented, which corresponds to all the students that are studying for an "M1" or "M2" grade.
[00188] In figure 8d, the user further performs a filtering on the variable "UFR" that has 3 modalities "Droit" (French for law studies), "Economie" (French for economics studies), and "Sport". The user clicks on "Sport" 8605d.
[00189] Thus, the filtering is performed on two variables at the same time. As shown in columns 8400d, 8404d and 8405d, the raw representation represents the students that are studying in either "M1" or "M2", and are studying sports.
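The combined filtering of figures 8b to 8d can be illustrated by the following Python sketch, in which an object is kept when, for each filtered variable, its value belongs to the set of selected modalities. The function name `filter_objects`, the dictionary-based data layout and the sample students are hypothetical.

```python
def filter_objects(rows, selections):
    """Keep the rows whose value is selected for every filtered variable."""
    return [
        row for row in rows
        if all(row.get(var) in allowed for var, allowed in selections.items())
    ]


# Hypothetical extract of the dataset of students.
students = [
    {"Id": 1, "NiveauDEtude": "M1", "UFR": "Sport"},
    {"Id": 2, "NiveauDEtude": "L3", "UFR": "Sport"},
    {"Id": 3, "NiveauDEtude": "M2", "UFR": "Droit"},
    {"Id": 4, "NiveauDEtude": "M2", "UFR": "Sport"},
]

# The user clicked on "M1" and "M2", then on "Sport".
selected = filter_objects(
    students, {"NiveauDEtude": {"M1", "M2"}, "UFR": {"Sport"}}
)
selected_ids = [row["Id"] for row in selected]
```

Selections on the same variable are thus combined with a logical OR, while selections on different variables are combined with a logical AND.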
[00190] This example demonstrates the ability of the invention to perform selections on single or multiple variables.
[00191] Figures 9a to 9e display an example of highlighting values of interest depending on the type of each variable in a number of embodiments of the invention.
[00192] One objective of spreadsheet applications is to highlight the relevant information to the user. Some existing applications allow the user to highlight values by associating the output of a formula to a color. For example, a user may define rules such as "highlight values below 0 in red". However, these rules are often difficult to use for the user, and the prior art solutions lack a simple and efficient way for the user to highlight values he/she may be interested in.
[00193] In order to solve this issue, in a number of embodiments of the invention, a device of the invention is configured so that the synthetic representation type consists in highlighting values in the raw representation. Furthermore:
- the adaptation 143 to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more types of values to highlight based on said variable type;
- the adaptation 144 to generate the synthetic representation of values comprises an adaptation to select values of the variable where to superimpose colors in the raw representation based on said types of values to highlight, and to superimpose colors on said selected values.
[00194] Figures 9a to 9e provide some examples of highlights.
[00195] In figure 9a, the user clicks on the button 2131. This highlights the values of the variable depending on the type of each variable. For example, the binary variables (for example, "Sexe" and "Redoublement") can be highlighted in yellow; the numeric variables (for example, "Taille") in red, and the nominal variables (for example "NiveauDEtude", "UFR" or "MentionBac") in blue. This allows the user to instantaneously view the types of each variable. In this example the variable "Age" is highlighted in blue. It is indeed classified as a nominal variable because at least one student entered his/her age using letters.
[00196] In figure 9b, the user clicks on the button 2132. Upon the click on the button, the interface highlights the cells comprising empty values, such as for example the cells 92041b, 92081b or 92082b.
[00197] In addition of highlighting values, the invention allows the user to easily select the values he/she is interested in. In order to do so:
- the adaptation 146 to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a value highlighted in a color in the raw representation;
- the adaptation 147 to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having a value of the variable highlighted in the color.
[00198] For example, the user can click an empty value highlighted in orange for the variable "Copier" 92081 b in order to indicate that he/she wishes to select and/or modify the empty values of the variable "Copier". For example, the user can click on the button 2141 in order to select all the students whose value of "Copier" is empty. If the user clicks on the button 2142, the lines corresponding to the students which are selected are furthermore displaced on top of the raw representation. If the user clicks on the button 2143, the lines corresponding to the students which are selected are displaced on top of the raw representation, and the other lines are hidden.
[00199] In the figure 9c, the user clicks on the button 2133. Upon the click, the cells comprising minima of integer or numeric variables are highlighted in blue, while cells comprising maxima are highlighted in red. In this example:
- the minimum of the variable "Id" is 101. It is highlighted in blue 92011c;
- the minimum of the variable "Age" is 18. The cells comprising this value in the corresponding column are highlighted in blue 92021c;
- the maximum of the variable "Age" is 24. The cells comprising this value in the corresponding column are highlighted in red 92022c;
- the minimum of the variable "Taille" is 1,55. The cells comprising this value in the corresponding column are highlighted in blue 92201c;
- the maximum of the variable "Taille" is 167 (likely entered in centimeters instead of meters). The cells comprising this value in the corresponding column are highlighted in red 92202c.
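The highlighting of minima and maxima listed above could, for instance, be computed as in the following Python sketch, which maps the index of each cell to the color to superimpose. The function name `highlight_extrema`, the use of `None` for empty cells and the sample values are assumptions.

```python
def highlight_extrema(values):
    """Map cell index -> color: minima in blue, maxima in red."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    lo, hi = min(present), max(present)
    colors = {}
    for index, value in enumerate(values):
        if value is None:
            continue  # empty cells are handled by a separate highlight
        if value == lo:
            colors[index] = "blue"
        elif value == hi:
            colors[index] = "red"
    return colors


ages = [18, 20, 24, 18, None]  # hypothetical column "Age"
age_colors = highlight_extrema(ages)
```

All the cells that share the minimum or maximum value are colored, which matches the behavior described for the variable "Age".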
[00200] The embodiments discussed above relative to the selection of students based on the values which are highlighted are also applicable here. For example, the user can click on one of the values highlighted in blue 92021 c, in order to indicate that he/she wishes to select and/or modify these values. For example, the user can click on the button 2141 in order to select all the students whose value of "Age" is 18. If the user clicks on the button 2142, the lines corresponding to the students which are selected are furthermore displaced on top of the raw representation. If the user clicks on the button 2143, the lines corresponding to the students which are selected are displaced on top of the raw representation, and the other lines are hidden.
[00201] In figure 9d, the user clicks on the button 2134. Upon the button click, the very low values of integer or numerical variables are highlighted in light blue, and the very high values are highlighted in light red. The limits between very low, normal and very high values can be set, for example, at two standard deviations below and above the average value of the variable. For example, the values lower than the average minus two standard deviations can be highlighted in light blue. This is for example the case of the value 92021d for the variable "Taille". Similarly, the values higher than the average plus two standard deviations can be highlighted in light red. This is for example the case of the values 92022d of the variable "Taille".
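A possible sketch of this rule in Python is given below. It assumes the population standard deviation (`statistics.pstdev`); the description does not state whether the population or the sample standard deviation is meant. The function name `highlight_outliers`, the parameter `k` and the sample values are likewise hypothetical.

```python
from statistics import mean, pstdev


def highlight_outliers(values, k=2):
    """Map cell index -> color for values beyond k standard deviations."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    average = mean(present)
    deviation = pstdev(present)  # population standard deviation (assumption)
    low, high = average - k * deviation, average + k * deviation
    colors = {}
    for index, value in enumerate(values):
        if value is None:
            continue
        if value < low:
            colors[index] = "light blue"
        elif value > high:
            colors[index] = "light red"
    return colors


# Hypothetical column "Taille" in which one height was entered in centimeters.
tailles = [1.60, 1.62, 1.65, 1.70, 1.72, 1.75, 1.68, 1.66, 1.64, 167]
taille_colors = highlight_outliers(tailles)
```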
[00202] In this case, the rules of selections explained above can also be used, using respectively the cells highlighted in light red or light blue.
[00203] The user can also view a combination of different types of highlights at the same time. For example, in figure 9e, the user clicks at the same time on the buttons 2131, 2132, 2133 and 2134. All the cells that were identified in figures 9b, 9c and 9d are highlighted at the same time: the empty cells, the cells with minimum values in blue, the cells with maximum values in red, the cells with high values in light red, and the cells with low values in light blue. A cell that could be represented as both a maximum and a high value is highlighted in red, such as for example the cell 92022e, while a cell that could be represented as both a minimum and a low value is highlighted in blue, such as for example the cell 92021e.
[00204] The embodiments discussed above relative to the selection of objects based on the values that are highlighted are also applicable here. In this example, the user has a much larger choice of variables and highlighting colors with which to select, choose and/or sort objects.
[00205] These examples demonstrate the ability of the invention to highlight the relevant information to the user in a very intuitive and friendly way.
[00206] Figure 10 displays an example of a method in a number of embodiments of the invention.
[00207] The method 1000 comprises a first step 1010 of accessing one or more memories storing a dataset comprising values of a set of variables for a plurality of objects.
[00208] The method 1000 further comprises a second step 1020 of obtaining a type of a variable in the set of variables.
[00209] The method 1000 further comprises a third step 1030 of generating a raw representation of values of said variable for said plurality of objects.
[00210] The method 1000 further comprises a fourth step 1040 of determining a synthetic representation type based on said type of variable.
[00211] The method 1000 further comprises a fifth step 1050 of generating a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects.
[00212] The method 1000 further comprises a sixth step 1060 of displaying said raw representation of values and said synthetic representation of values.
[00213] The method 1000 further comprises a seventh step 1070 of receiving a selection by a user of an element of said synthetic representation.
[00214] The method 1000 further comprises an eighth step 1080 of selecting one or more objects corresponding to said element.
[00215] All the embodiments discussed with reference to figures 1 to 9g are respectively applicable to the method 1000.
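For illustration only, the chain of steps of the method 1000 may be condensed into the following Python sketch. The type of the variable (step 1020) is passed as an argument and the display (step 1060) is omitted; the function name `select_by_element`, the two representation types "extrema" and "modalities", and the sample dataset are assumptions that cover only a small part of the embodiments.

```python
def select_by_element(dataset, variable, var_type, chosen_label):
    """Select all objects whose value is represented by the chosen element."""
    # Steps 1010 and 1030: access the dataset and build the raw representation.
    raw = [obj.get(variable) for obj in dataset]
    present = [v for v in raw if v is not None]
    # Step 1040: determine the synthetic representation type from the type.
    rep_type = "extrema" if var_type in ("integer", "numeric") else "modalities"
    # Step 1050: each element maps a label to the set of values it represents.
    if rep_type == "extrema":
        elements = {"min": {min(present)}, "max": {max(present)}}
    else:
        elements = {str(m): {m} for m in set(present)}
    # Steps 1070 and 1080: receive the selection and select the objects.
    represented = elements[chosen_label]
    return [obj for obj, value in zip(dataset, raw) if value in represented]


# Hypothetical dataset of students.
students = [
    {"Id": 1, "Age": 18, "Sexe": "Femme"},
    {"Id": 2, "Age": 24, "Sexe": "Homme"},
    {"Id": 3, "Age": 18, "Sexe": "Homme"},
]
youngest = select_by_element(students, "Age", "integer", "min")
men = select_by_element(students, "Sexe", "binary", "Homme")
```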
[00216] The examples described above are given as non-limiting illustrations of embodiments of the invention. They do not in any way limit the scope of the invention, which is defined by the following claims.

Claims

1. A device (100) comprising:
- an access to a display (110);
- one or more input interfaces (120) to receive commands from a user;
- an access to one or more memories (130) storing a dataset comprising values of one or more sets of variables for a plurality of objects;
- a processing logic (140) comprising:
o an adaptation (141) to obtain automatically a type of a variable in one or more sets of variables;
o an adaptation (142) to generate a raw representation of values of said variable for said plurality of objects;
o an adaptation (143) to determine a synthetic representation type based on said type of variable;
o an adaptation (144) to generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects;
o an adaptation (145) to output said raw representation of values and said synthetic representation of values to said display;
o an adaptation (146) to receive from said one or more input interfaces a selection by the user of an element of said synthetic representation;
o an adaptation (147) to perform a selection of all objects in said plurality of objects whose value of the variable is represented by said element.
2. The device of claim 1, wherein:
- the dataset comprising the values of the variable for the plurality of objects is stored in one or more files;
- the adaptation (141) to obtain automatically a type of the variable comprises an adaptation to detect automatically the type of the variable, based on the values of said variable for the plurality of objects.
3. The device of claim 2, wherein said adaptation to detect automatically the type of the variable comprises one or more adaptations to:
- retrieve a collection of text strings for all occurrences of the variable;
- if the number of dictionary words in the collection of text strings is equal to two, detect the type of the variable as a binary type;
- if the number of dictionary words in the collection of text strings is different from two:
o if each string in the collection is representative of an integer number, and if the size of each string is below a predefined threshold, detect the type of the variable as an integer type;
o if each string in the collection is representative of a number, and if at least one number is a non-integer number or if the size of at least one string is above the predefined threshold, detect the type of the variable as a numeric type;
- if the collection of text strings comprises at least one non-numeric character, detect the type of the variable as a nominal type.
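By way of a non-limiting sketch, the detection recited above may be expressed as follows in Python. Several points are assumptions of this sketch: the "number of dictionary words" is interpreted as the number of distinct strings, a decimal comma or point is accepted in numbers, the threshold `max_integer_length` is an arbitrary value, and a variable without any value is reported as nominal.

```python
import re

INTEGER = re.compile(r"[+-]?[0-9]+")
NUMBER = re.compile(r"[+-]?[0-9]+(?:[.,][0-9]+)?")


def detect_type(strings, max_integer_length=9):
    """Detect the type of a variable from the text strings of its values."""
    collection = [s.strip() for s in strings if s and s.strip()]
    if not collection:
        return "nominal"  # no value at all (assumption)
    if len(set(collection)) == 2:
        return "binary"
    if not all(NUMBER.fullmatch(s) for s in collection):
        return "nominal"  # at least one string is not a number
    if all(INTEGER.fullmatch(s) and len(s) < max_integer_length
           for s in collection):
        return "integer"
    return "numeric"


redoublement_type = detect_type(["Oui", "Non", "OUI"])
age_type = detect_type(["18", "2O", "Vingt", "20"])
taille_type = detect_type(["1,55", "1,70", "167"])
```

In this sketch, the three modalities "Oui", "Non" and "OUI" prevent the detection of a binary type, and the values "2O" and "Vingt" cause the variable "Age" to be detected as nominal, as in the examples of the description.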
4. The device of one of claims 1 to 3, wherein:
- the synthetic representation type is an array comprising one or more statistical parameters;
- said adaptation (143) to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more statistical parameters based on said type of variable;
- said adaptation (144) to generate the synthetic representation of values comprises an adaptation to calculate one or more statistical parameters for said variable;
- said adaptation (146) to receive from said one or more input interfaces the selection by the user of the element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a statistical parameter;
- said adaptation (147) to perform the selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects based on said statistical parameter and an adaptation to highlight values of said variable for said one or more objects in the raw representation.
5. The device of one of claims 1 to 3, wherein:
- said synthetic representation is a graph, each section of the graph representing the number of occurrences of one or more values of the variable for the plurality of objects;
- said adaptation (143) to determine the synthetic representation type based on said type of variable comprises an adaptation to determine a type of graph based on the type of variable;
- said adaptation (144) to generate the synthetic representation of values comprises an adaptation to calculate a size of each section of the graph;
- said adaptation (146) to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a section of the graph;
- said adaptation (147) to perform a selection of one or more objects corresponding to said element comprises an adaptation to select said one or more objects whose value of the variable is represented by the section selected by the user.
6. The device of one of claims 1 to 3, wherein:
- said synthetic representation is a list of modalities of the variable;
- said adaptation (143) to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable;
- said adaptation (144) to generate the synthetic representation of values comprises an adaptation to represent, if the variable is a categorical variable, modalities of the variable among the plurality of objects;
- said adaptation (146) to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces an association between a first and a second modality of the variable;
- said adaptation (147) to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having the first modality as value of the variable, and assign the second modality as value of the variable for said one or more objects.
7. The device of one of claims 1 to 3, wherein:
- said synthetic representation is one of a list of variable types or a combination of a list of variable types and a list of modalities of the variable;
- said adaptation (143) to determine the synthetic representation type based on said type of variable comprises an adaptation to determine if the variable is a categorical variable;
- said adaptation (144) to generate the synthetic representation of values comprises an adaptation to represent a list of variable types and, if the variable is a categorical variable, modalities of the variable among the plurality of objects;
- said adaptation (146) to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a desired type of the variable, which is different from said type of variable;
- said adaptation (147) to perform a selection of one or more objects corresponding to said element comprises:
o an adaptation to identify and output to said display modalities of the variable that do not match the desired type of the variable;
o an adaptation to receive from said one or more input interfaces a selection of a modality of the variable that does not match the desired type of the variable, and a replacement modality to replace said modality;
o an adaptation to select one or more objects whose value of the variable is said modality that does not match the desired type of the variable, and replace said modality that does not match the desired type of the variable by said replacement modality for said one or more objects.
8. The device of claim 7, wherein the processing logic further comprises an adaptation to set the type of the variable as the desired type of variable, and modify representations in memory of the occurrences of said variable according to said desired type of variable, when all occurrences of said variable match the desired type of variable.
9. The device of one of claims 1 to 3, wherein:
- said synthetic representation type is one of a list of modalities, a list of intervals, or a list of modalities and intervals;
- said adaptation (143) to determine the synthetic representation type based on said type of variable comprises an adaptation to set the representation type as a list of modalities if the variable is of a categorical type, and one of a list of intervals, or a list of modalities and intervals if the variable is of an integer or numeric type;
- said adaptation (144) to generate the synthetic representation of values comprises an adaptation to generate a list of modalities, intervals, or combination thereof describing all the values of the variable for the plurality of objects, depending on the synthetic representation type;
- the processing logic further comprises an adaptation to filter the values of the raw representation in order to display only values corresponding to an element of said list of modalities, intervals, or combination thereof selected by the user.
10. The device of one of claims 1 to 3, wherein:
- said synthetic representation type is one or more types of values to highlight;
- said adaptation (143) to determine the synthetic representation type based on said type of variable comprises an adaptation to determine said one or more types of values to highlight based on said type of variable;
- said adaptation (144) to generate the synthetic representation of values comprises an adaptation to select values of the variable where to superimpose colors in the raw representation based on said types of values to highlight, and to superimpose colors on said selected values;
- said adaptation (146) to receive from said one or more input interfaces a selection by the user of an element of said representation comprises an adaptation to receive from said one or more input interfaces a selection by the user of a value highlighted in a color in the raw representation;
- said adaptation (147) to perform a selection of one or more objects corresponding to said element comprises an adaptation to select one or more objects having a value of the variable highlighted in the color.
11. A method (1000) comprising:
- accessing (1010) one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects;
- obtaining (1020) a type of a variable in the one or more sets of variables;
- generating (1030) a raw representation of values of said variable for said plurality of objects;
- determining (1040) a synthetic representation type based on said type of variable;
- generating (1050) a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects;
- displaying (1060) said raw representation of values and said synthetic representation of values;
- receiving (1070) a selection by a user of an element of said synthetic representation;
- selecting (1080) all objects in said plurality of objects whose value of the variable is represented by said element.
12. A computer program product comprising computer code instructions configured to:
- access one or more memories storing a dataset comprising values of one or more sets of variables for a plurality of objects;
- obtain a type of a variable in the one or more sets of variables;
- generate a raw representation of values of said variable for said plurality of objects;
- determine a synthetic representation type based on said type of variable;
- generate a synthetic representation of values of said variable for said plurality of objects using said synthetic representation type, said synthetic representation comprising a plurality of elements, each element representing one or more values of the variable for one or more objects;
- display said raw representation of values and said synthetic representation of values;
- receive a selection by a user of an element of said synthetic representation;
- select all objects in said plurality of objects whose value of the variable is represented by said element.
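The steps recited in the method and computer program product claims can be sketched, under simplifying assumptions, as below. Only two variable types ("numeric", "categorical") and two synthetic representation types ("histogram", "bar_chart") are modeled. These names, the bin width, and the helper functions are hypothetical choices for illustration, not the implementation described in the application.

```python
from typing import Dict, List, Sequence


def obtain_type(values: Sequence[object]) -> str:
    """Obtain the type of a variable (assumption: numeric or categorical only)."""
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    return "numeric" if all(is_number(v) for v in values) else "categorical"


def raw_representation(values: Sequence[object]) -> List[str]:
    """Generate the raw representation: one text cell per object."""
    return [str(v) for v in values]


def representation_type(variable_type: str) -> str:
    """Determine the synthetic representation type from the variable type."""
    return {"numeric": "histogram", "categorical": "bar_chart"}[variable_type]


def synthetic_representation(
    values: Sequence[object], rep_type: str, bin_width: int = 10
) -> Dict[object, List[int]]:
    """Generate the elements: each bin or modality maps to the objects it represents."""
    elements: Dict[object, List[int]] = {}
    for index, value in enumerate(values):
        # A histogram element is a bin (keyed by its lower bound);
        # a bar chart element is a distinct value of the variable.
        key = (value // bin_width) * bin_width if rep_type == "histogram" else value
        elements.setdefault(key, []).append(index)
    return elements


def select_objects(elements: Dict[object, List[int]], element: object) -> List[int]:
    """Select all objects whose value is represented by the chosen element."""
    return elements.get(element, [])


# A small in-memory dataset: one list of values per variable, one entry per object.
dataset = {"sex": ["F", "M", "F", "F"], "age": [23, 37, 31, 58]}

age_type = obtain_type(dataset["age"])
age_raw = raw_representation(dataset["age"])
age_elements = synthetic_representation(dataset["age"], representation_type(age_type))
# The user clicks the histogram bar covering ages 30 to 39.
age_selection = select_objects(age_elements, 30)

sex_elements = synthetic_representation(
    dataset["sex"], representation_type(obtain_type(dataset["sex"]))
)
```

Keeping, for each element, the indices of the objects it represents makes the final selection step a simple lookup. Display and user input are omitted here because they depend on the graphical toolkit.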
PCT/EP2018/067270 2017-06-27 2018-06-27 An interactive interface for improving the management of datasets WO2019002379A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/624,880 US11106866B2 (en) 2017-06-27 2018-06-27 Interactive interface for improving the management of datasets

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP17305804.1A EP3422199A1 (en) 2017-06-27 2017-06-27 An interactive interface for improving the management of datasets
EP17305804.1 2017-06-27

Publications (1)

Publication Number Publication Date
WO2019002379A1 (en) 2019-01-03

Family

ID=59315532

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/067270 WO2019002379A1 (en) 2017-06-27 2018-06-27 An interactive interface for improving the management of datasets

Country Status (3)

Country Link
US (1) US11106866B2 (en)
EP (1) EP3422199A1 (en)
WO (1) WO2019002379A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1286284A1 (en) * 2001-08-15 2003-02-26 F1F9 (UK) Ltd. Spreadsheet data processing system
WO2007032913A1 (en) * 2005-09-09 2007-03-22 Microsoft Corporation Automated placement of fields in a data summary table
US20140149841A1 (en) * 2012-11-27 2014-05-29 Microsoft Corporation Size reducer for tabular data model

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7536413B1 (en) * 2001-05-07 2009-05-19 Ixreveal, Inc. Concept-based categorization of unstructured objects
US20040205524A1 (en) * 2001-08-15 2004-10-14 F1F9 Spreadsheet data processing system
US20070055556A1 (en) * 2005-07-06 2007-03-08 Frank-Backman Elizabeth G Spreadsheet Generator
US20090300482A1 (en) * 2006-08-30 2009-12-03 Compsci Resources, Llc Interactive User Interface for Converting Unstructured Documents
US20130080444A1 (en) * 2011-09-26 2013-03-28 Microsoft Corporation Chart Recommendations
US9135233B2 (en) * 2011-10-13 2015-09-15 Microsoft Technology Licensing, Llc Suggesting alternate data mappings for charts
US20150074127A1 (en) * 2013-09-10 2015-03-12 Microsoft Corporation Creating Visualizations from Data in Electronic Documents
US9542622B2 (en) * 2014-03-08 2017-01-10 Microsoft Technology Licensing, Llc Framework for data extraction by examples
US10290147B2 (en) * 2015-08-11 2019-05-14 Microsoft Technology Licensing, Llc Using perspective to visualize data
US11093703B2 (en) * 2016-09-29 2021-08-17 Google Llc Generating charts from data in a data table

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Microsoft SQL Server Integration Services: Mixed data types in Excel column", 25 June 2011 (2011-06-25), XP055417679, Retrieved from the Internet <URL:http://microsoft-ssis.blogspot.nl/2011/06/mixed-data-types-in-excel-column.html> [retrieved on 20171020] *
CAMPOS M M ET AL: "Data-Centric Automated Data Mining", MACHINE LEARNING AND APPLICATIONS, 2005. PROCEEDINGS. FOURTH INTERNATI ONAL CONFERENCE ON LOS ANGELES, CA, USA 15-17 DEC. 2005, PISCATAWAY, NJ, USA,IEEE, 15 December 2005 (2005-12-15), pages 97 - 104, XP010902750, ISBN: 978-0-7695-2495-5, DOI: 10.1109/ICMLA.2005.18 *
JEFFREY HEER ET AL: "Interactive Dynamics for Visual Analysis - ACM Queue", 20 February 2012 (2012-02-20), pages 1 - 33, XP055144077, Retrieved from the Internet <URL:http://queue.acm.org/detail.cfm?id=2146416> [retrieved on 20141002] *
RICHARD BRATH ET AL: "Spreadsheet Validation and Analysis through Content Visualization", ARXIV.ORG ARTICLE, 3 March 2008 (2008-03-03), pages 1 - 10, XP055417559, Retrieved from the Internet <URL:https://arxiv.org/ftp/arxiv/papers/0803/0803.0166.pdf> [retrieved on 20171020] *
THOMAS BAUDEL ED - ASSOCIATION FOR COMPUTING MACHINERY: "From information visualization to direct manipulation", PROCEEDINGS OF THE 19TH ANNUAL ACM SYMPOSIUM ON USER INTERFACE SOFTWARE AND TECHNOLOGY : OCTOBER 15 - 18, 2006, MONTREUX, SWITZERLAND; [ACM SYMPOSIUM ON USER INTERFACE SOFTWARE AND TECHNOLOGY], NEW YORK, NY : ACM, US, 15 October 2006 (2006-10-15), pages 67 - 76, XP058346649, ISBN: 978-1-59593-313-3, DOI: 10.1145/1166253.1166265 *

Also Published As

Publication number Publication date
US11106866B2 (en) 2021-08-31
EP3422199A1 (en) 2019-01-02
US20200143110A1 (en) 2020-05-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18732388

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18732388

Country of ref document: EP

Kind code of ref document: A1