CN110874644A - Method and device for assisting user in exploring data set and data table - Google Patents


Publication number
CN110874644A
Authority
CN
China
Prior art keywords
data
field
user
analysis information
data analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911104860.9A
Other languages
Chinese (zh)
Inventor
戴振衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN201911104860.9A
Publication of CN110874644A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus for assisting a user in exploring data sets and data tables are disclosed. The data set comprises a plurality of pieces of data, each comprising values of one or more fields. In response to a field selection operation of a user, first data analysis information is output, which characterizes the data statistics of the field values corresponding to the field selected by the user; one or more pieces of second data analysis information, predicted based on the user-selected field, are then recommended to the user. Outputting the first data analysis information lets the user understand the data statistics of the selected field. The second data analysis information is data analysis information that is predicted from the user-selected field and is likely to interest the user; recommending it reduces the user's data exploration cost and lets the user explore the data conveniently and quickly.

Description

Method and device for assisting user in exploring data set and data table
Technical Field
The present invention relates generally to the field of data exploration, and more particularly, to a method and apparatus for assisting a user in exploring data sets and data tables.
Background
With the progress of modern science and technology and the rapid development and application of information technology, the degree of informatization across industries has improved comprehensively. Data across society is growing at an unprecedented speed; it is large in volume, varied in type, and rapidly updated, and it is gradually becoming one of the important production factors of various industries. This abundance of data contains a great deal of valuable information, but users must perform statistical analysis to extract meaningful results from it.
Agile BI (business intelligence) tools on the market can provide data analysis services for users, but these tools have a high threshold of use: they require users to spend a lot of time learning them and also demand substantial business experience. Using existing tools is a significant challenge for business personnel who have data analysis needs but lack data analysis skills.
Therefore, a solution is needed that facilitates the user's exploration of data.
Disclosure of Invention
Exemplary embodiments of the present invention aim to overcome the high threshold of use of data analysis tools in the prior art.
According to a first aspect of the present invention, there is provided a method of assisting a user in exploring a data set, the data set comprising a plurality of pieces of data, each piece of data comprising values for one or more fields, the method comprising: responding to field selection operation of a user, and outputting first data analysis information, wherein the first data analysis information is used for representing data statistics of a field value corresponding to a field selected by the user; and recommending one or more second data analysis information to the user, the second data analysis information being predicted based on the user-selected field.
Optionally, the second data analysis information characterizes data statistics of field values corresponding to other fields obtained by prediction, and/or characterizes data statistics of field value combinations corresponding to field combinations formed by the user-selected field and other fields obtained by prediction.
Optionally, the step of recommending one or more second data analysis information to the user includes: predicting data analysis information suitable for recommendation to a user based on a machine learning model; and/or predict data analysis information suitable for recommendation to a user based on statistical relevance; and/or predict data analysis information suitable for recommendation to a user based on business rules.
Optionally, the step of predicting data analysis information suitable for recommendation to the user based on the machine learning model comprises: predicting, using a machine learning model, the probability that the user will subsequently select each of the other fields, based on information about the field currently selected by the user, wherein the machine learning model is trained in the following manner: taking the relevant information of the selected field and the relevant information of other fields as input, and taking the probability of the other fields being selected as output; and analyzing the data statistics of the field values corresponding to a field whose probability value is greater than a first preset threshold to obtain the second data analysis information, or analyzing the data statistics of the field values corresponding to a field whose probability value is greater than the first preset threshold on the basis of the first data analysis information of the user-selected field to obtain the second data analysis information.
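The prediction-and-threshold step can be illustrated with a minimal sketch. It assumes a hypothetical log of past selection sessions and uses simple transition frequencies as a stand-in for the machine learning model; the names `fit_next_field_model`, `recommend_fields`, and the sample fields are illustrative and do not come from the application.

```python
from collections import Counter, defaultdict


def fit_next_field_model(sessions):
    """Estimate P(next field | current field) from past selection sessions.

    Assumption: a frequency table stands in for the trained model; each
    session is an ordered list of the fields a user selected.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {f: n / total for f, n in followers.items()}
    return model


def recommend_fields(model, selected_field, threshold):
    """Return fields whose predicted probability exceeds the preset threshold."""
    probs = model.get(selected_field, {})
    return sorted(f for f, p in probs.items() if p > threshold)


# Hypothetical selection history.
sessions = [
    ["age", "income", "city"],
    ["age", "income"],
    ["age", "city"],
    ["income", "city"],
]
model = fit_next_field_model(sessions)
recommended = recommend_fields(model, "age", threshold=0.5)
```

The statistics of each recommended field would then be charted as second data analysis information; a real system would likely use richer field features than transition counts.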
Optionally, the method further comprises: acquiring the user's acceptance feedback on the one or more pieces of second data analysis information recommended to the user, and updating the machine learning model based on the acquired acceptance feedback.
Optionally, the step of predicting data analysis information suitable for recommendation to the user based on statistical correlation comprises: acquiring, according to the statistical correlation among different fields, a field whose statistical correlation with the user-selected field is higher than a second preset threshold; and analyzing the data statistics of the field values corresponding to the field whose statistical correlation is higher than the second preset threshold to obtain the second data analysis information, or analyzing the data statistics of the field values corresponding to that field on the basis of the first data analysis information of the user-selected field to obtain the second data analysis information.
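As a rough illustration of the correlation-based selection, the sketch below assumes numeric fields and uses the absolute Pearson correlation coefficient as the measure of statistical correlation. The application does not name a specific measure, so this choice, the function names, and the sample columns are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlated_fields(columns, selected, threshold):
    """Fields whose |correlation| with the selected field exceeds the threshold."""
    base = columns[selected]
    return [
        name
        for name, values in columns.items()
        if name != selected and abs(pearson(base, values)) > threshold
    ]


# Hypothetical numeric columns of a data set.
columns = {
    "income": [10, 20, 30, 40, 50],
    "spend": [12, 19, 33, 41, 48],
    "height": [170, 160, 175, 165, 170],
}
related = correlated_fields(columns, "income", threshold=0.8)
```

Categorical fields would need a different association measure; that case is not covered here.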
Optionally, the first data analysis information is presented in a first interface area and the second data analysis information is presented in a second interface area, and the method further includes: in response to an operation of the user, presenting in the first interface area the second data analysis information from the second interface area.
Optionally, the method further comprises: connecting one of the two pieces of data analysis information to the other in response to a user's connection operation for the two pieces of data analysis information in the first interface area; and updating the other data analysis information to be used for representing the data statistics of the field value represented by the one data analysis information under the field dimension corresponding to the other data analysis information.
Optionally, the method further comprises: connecting one of the two pieces of data analysis information to the other in response to a user's connection operation for the two pieces of data analysis information in the first interface area; in response to a user's selection operation for a field value characterized by one of the two pieces of data analysis information in the connected state, data statistics of the selected field value in the data analysis information are highlighted relative to those of the unselected field values, and/or the other of the two pieces of data analysis information in the connected state is updated to characterize the data statistics of the selected field value in a field dimension corresponding to the other piece of data analysis information.
Optionally, the method further comprises: displaying an interface for exploring the data set, wherein a left area of the interface is used for displaying name icons corresponding to fields of the data set, a middle area of the interface is used for displaying charts of first data analysis information, and a right area of the interface is used for displaying charts of second data analysis information; and, when the data set is imported, displaying in the left area of the interface the name icons corresponding to the fields included in each piece of data in the data set, wherein outputting the first data analysis information in response to a field selection operation of a user comprises: in response to the user selecting the name icon of a specific field with a cursor, dragging the name icon to the middle area of the interface, and releasing it, displaying a chart of the first data analysis information in the middle area of the interface; and wherein recommending one or more pieces of second data analysis information to the user comprises: while displaying the chart of the first data analysis information in the middle area of the interface, displaying one or more charts of second data analysis information in the right area of the interface.
Optionally, the method further comprises: when the data set is imported, pre-calculating the data statistics of the field values corresponding to each field and caching the calculated data statistics in memory, wherein displaying the chart of the first data analysis information in the middle area of the interface comprises: displaying, in the middle area of the interface, a chart of the first data analysis information generated from the corresponding data statistics retrieved from memory.
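The import-time pre-calculation can be sketched as follows. The sketch assumes that the "data statistics" are per-field value counts and that the data set is a list of dictionaries; the class name `StatsCache` and the sample rows are hypothetical.

```python
from collections import Counter


class StatsCache:
    """Pre-computes per-field value counts once, when the data set is imported."""

    def __init__(self, rows):
        fields = rows[0].keys() if rows else []
        # One pass per field at import time; later chart requests hit memory only.
        self._cache = {
            field: Counter(row[field] for row in rows) for field in fields
        }

    def value_counts(self, field):
        """Return the cached statistics used to draw the field's chart."""
        return self._cache[field]


# Hypothetical imported data set.
rows = [
    {"city": "Beijing", "tier": "gold"},
    {"city": "Shanghai", "tier": "silver"},
    {"city": "Beijing", "tier": "silver"},
]
cache = StatsCache(rows)
city_counts = cache.value_counts("city")
```

Dropping a field icon into the middle area would then only need a lookup such as `cache.value_counts("city")`, not a fresh scan of the data.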
Optionally, the method further comprises: in response to a user selecting a field value in the chart of the first data analysis information, the chart of the corresponding presented one or more second data analysis information is automatically updated to a data statistic of the field value in the corresponding other field dimension.
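One way to picture this linked update is a filtered count: the rows matching the selected field value are kept, and the statistics of another field are recomputed over them. The sketch below assumes value counts as the statistic; the function name and sample rows are illustrative.

```python
from collections import Counter


def conditional_counts(rows, selected_field, selected_value, other_field):
    """Value counts of `other_field`, restricted to rows where
    `selected_field` equals the value the user clicked in the first chart."""
    return Counter(
        row[other_field] for row in rows if row[selected_field] == selected_value
    )


# Hypothetical data set.
rows = [
    {"city": "Beijing", "tier": "gold"},
    {"city": "Beijing", "tier": "silver"},
    {"city": "Shanghai", "tier": "gold"},
    {"city": "Beijing", "tier": "gold"},
]
# The user clicks "Beijing" in the city chart; the tier chart is refreshed.
linked = conditional_counts(rows, "city", "Beijing", "tier")
```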
Optionally, the method further comprises: pre-calculating, for one or more frequently selected field values in the chart of the first data analysis information, the data statistics of those field values in the corresponding other field dimensions, and caching the calculated data statistics in memory, wherein automatically updating the correspondingly displayed charts of the one or more pieces of second data analysis information to the data statistics of a given field value in the corresponding other field dimensions comprises: automatically updating the correspondingly displayed charts to the corresponding data statistics cached in memory.
Optionally, the method further comprises: and in response to the user selecting at least one chart of the second data analysis information displayed in the right area of the interface by the cursor, dragging the at least one chart of the second data analysis information to the middle area of the interface and releasing the at least one chart of the second data analysis information, and displaying the at least one chart of the second data analysis information in the middle area of the interface.
Optionally, the method further comprises: connecting one chart of the two charts to the other chart in response to a user connection operation for the two charts in the interface middle area; in response to a user selection of a field value characterized in the first graph in the connected state, the selected field value is highlighted relative to the unselected field values, and/or the second graph in the connected state is updated with data statistics characterizing the selected field value in the field dimension corresponding to the second graph.
Optionally, the method further comprises: connecting one chart of the two charts to the other chart in response to a user connection operation for the two charts in the interface middle area; the other chart is updated to be used for characterizing the data statistics of the field value characterized by one chart under the field dimension corresponding to the other chart.
Optionally, the method further comprises: responding to a prediction request of a user for a target field, taking field values corresponding to at least part of other fields in single data as input, taking field values corresponding to the target field in the single data as output, and automatically training a machine learning model; first interpretation information characterizing a degree of importance of one or more of at least some of the other fields to a machine learning model prediction target field is presented.
Optionally, the method further comprises: under the condition that a user selects a name icon of a target field in a left area of the interface or selects a chart of first data analysis information of the target field in a middle area of the interface, responding to an operation of starting an automatic training machine learning model executed in the right area of the interface, taking field values corresponding to at least part of other fields in single data as input, taking field values corresponding to the target field in the single data as output, and automatically training the machine learning model; first interpretation information characterizing the importance of one or more of at least some of the other fields to the machine learning model's predicted target field is presented in the area to the right of the interface.
Optionally, automatically training the machine learning model comprises: automatically searching a hyperparameter space that is as small as possible, based on hyperparameters preset from experience, so as to train the machine learning model.
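A search over a deliberately small space around preset values might look like the sketch below. A one-dimensional ridge regression stands in for the machine learning model, and the preset regularization strength and the three scaling factors are assumptions made for illustration only.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression through the origin: y ~ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def small_space_search(train, valid, preset_lam=1.0, factors=(0.1, 1.0, 10.0)):
    """Try only a handful of values around the empirically preset one."""
    best = None
    for factor in factors:
        lam = preset_lam * factor
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mse(w, valid[0], valid[1])
        if best is None or err < best[0]:
            best = (err, lam, w)
    return best


# Hypothetical training and validation splits (y = 2x).
train = ([1, 2, 3, 4], [2, 4, 6, 8])
valid = ([5, 6], [10, 12])
best_err, best_lam, best_w = small_space_search(train, valid)
```

Restricting the candidates to a few values near a trusted default keeps training fast enough for interactive use, at the cost of possibly missing a better setting further away.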
Optionally, presenting first interpretation information for characterizing a degree of importance of one or more of at least some of the other fields to the machine learning model prediction target field comprises: in the process of calculating the degree of importance, the first explanatory information is displayed in a gradation form.
Optionally, the method further comprises: allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for a single piece of data to each of the at least some other fields, to obtain each field's score for that piece of data; and determining the importance of each field to the machine learning model's prediction of the target field according to the sum of the field's scores over the plurality of pieces of data, wherein the importance is positively correlated with the score sum.
Optionally, the method further comprises: allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for a single piece of data to each of the at least some other fields, to obtain each field's score for that piece of data; and, for a single field, presenting the field's scores for the plurality of pieces of data in a two-dimensional coordinate system, wherein one coordinate axis of the two-dimensional coordinate system represents the field value corresponding to the field and the other coordinate axis represents the score of the field.
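The allocation can be illustrated with an exact Shapley computation on a toy model. The sketch assumes that a field is "absent" when it takes a baseline value, enumerates all field orderings (feasible only for a few fields), and sums absolute per-row scores to rank fields. The baseline, the toy `predict` function, and the use of absolute values are assumptions, not details taken from the application.

```python
from itertools import permutations
from math import factorial


def shapley_scores(predict, row, baseline):
    """Exact Shapley allocation of predict(row) - predict(baseline) to fields."""
    fields = list(row)
    scores = {f: 0.0 for f in fields}
    for order in permutations(fields):
        current = dict(baseline)
        prev = predict(current)
        for f in order:
            current[f] = row[f]  # switch field f from baseline to its real value
            new = predict(current)
            scores[f] += new - prev  # marginal contribution in this ordering
            prev = new
    n_orders = factorial(len(fields))
    return {f: s / n_orders for f, s in scores.items()}


def field_importance(predict, rows, baseline):
    """Sum each field's absolute score over all rows (assumed aggregation)."""
    totals = {f: 0.0 for f in baseline}
    for row in rows:
        for f, s in shapley_scores(predict, row, baseline).items():
            totals[f] += abs(s)
    return totals


def predict(row):
    # Toy stand-in for the trained model.
    return 3.0 * row["a"] + 1.0 * row["b"] + 0.0 * row["c"]


baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
rows = [{"a": 1.0, "b": 2.0, "c": 5.0}, {"a": 2.0, "b": 1.0, "c": 7.0}]
first_row_scores = shapley_scores(predict, rows[0], baseline)
importance = field_importance(predict, rows, baseline)
```

For realistic numbers of fields, exact enumeration is impractical and an approximate Shapley estimator would normally be used instead.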
Optionally, the method further comprises: and representing a field value corresponding to another field in the data corresponding to the coordinate point by using the display characteristics of the coordinate point in the two-dimensional coordinate system.
Optionally, the method further comprises: responding to the selection operation of a user on a piece of data, and outputting a prediction result of the machine learning model on the piece of data; and outputting second interpretation information of the importance degree of one or more fields in at least part of other fields in the piece of data to the prediction result.
Optionally, the method further comprises: providing a control for inputting a certain piece of data for a user in the right area of the interface, and receiving the piece of data input by the user; and displaying the predicted result of the machine learning model for the piece of data in the area on the right side of the interface, and displaying second interpretation information of the importance degree of one or more fields in at least part of other fields in the piece of data to the predicted result.
Optionally, the method further comprises: allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for the piece of data to each of the at least some other fields, to obtain each field's score for the piece of data, wherein the importance is positively correlated with the score.
Optionally, the method further comprises: presenting, in a two-dimensional coordinate system, the prediction results obtained by the machine learning model for a plurality of pieces of data, wherein the two-dimensional coordinate system contains a plurality of coordinate points, each coordinate point corresponds to one piece of data, a display characteristic of the coordinate point represents the prediction result for that piece of data, and the distance between two coordinate points in the two-dimensional space is positively correlated with the distance in a multi-dimensional space between the two pieces of data corresponding to those coordinate points.
Optionally, each piece of data has a plurality of dimensions, each dimension corresponding to one of the at least some other fields, and the method further includes: allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for a single piece of data to each of the at least some other fields, to obtain each field's score for that piece of data, wherein the score of a field is the value of the data in the corresponding dimension, and the position of each piece of data in the multi-dimensional space is determined based on the data's values in the plurality of dimensions.
Optionally, the method further comprises: selecting a preset number of coordinate points which are the same as the prediction result of the selected coordinate point from the vicinity of the selected coordinate point in response to the selection operation of a user for one or more coordinate points in the two-dimensional coordinate system to obtain a plurality of clustering coordinate points; extracting one or more key fields from a plurality of pieces of data corresponding to the plurality of clustering coordinate points based on the size sequence of the scores of the fields to obtain a key field group; and outputting the key field group.
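The neighbor selection and key-field extraction can be sketched as below. The sketch assumes each coordinate point carries its 2-D position, its prediction result, and per-field scores, and that key fields are ranked by the summed absolute score over the cluster; the data layout and function names are hypothetical.

```python
import math


def nearby_same_prediction(points, selected_index, k):
    """Indices of the selected point plus its k nearest same-prediction neighbors."""
    sel = points[selected_index]
    candidates = sorted(
        (math.dist(sel["xy"], p["xy"]), i)
        for i, p in enumerate(points)
        if i != selected_index and p["prediction"] == sel["prediction"]
    )
    return [selected_index] + [i for _, i in candidates[:k]]


def key_field_group(points, indices, top_n):
    """Rank fields by total absolute score over the cluster; keep the top ones."""
    totals = {}
    for i in indices:
        for field, score in points[i]["scores"].items():
            totals[field] = totals.get(field, 0.0) + abs(score)
    return sorted(totals, key=lambda f: totals[f], reverse=True)[:top_n]


# Hypothetical coordinate points with per-field scores.
points = [
    {"xy": (0.0, 0.0), "prediction": 1, "scores": {"age": 0.5, "income": 0.2, "city": 0.1}},
    {"xy": (0.1, 0.0), "prediction": 1, "scores": {"age": 0.4, "income": 0.3, "city": 0.0}},
    {"xy": (0.2, 0.1), "prediction": 0, "scores": {"age": 0.1, "income": 0.1, "city": 0.9}},
    {"xy": (5.0, 5.0), "prediction": 1, "scores": {"age": 0.1, "income": 0.9, "city": 0.0}},
    {"xy": (0.3, 0.3), "prediction": 1, "scores": {"age": 0.6, "income": 0.1, "city": 0.2}},
]
cluster = nearby_same_prediction(points, selected_index=0, k=2)
key_fields = key_field_group(points, cluster, top_n=2)
```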
Optionally, the method further comprises: responding to the adjustment operation of a user on the field values of one or more fields in a piece of data, and outputting the prediction result of the machine learning model on the adjusted piece of data; and outputting second interpretation information of the importance degree of one or more fields in at least part of other fields in the adjusted piece of data to the prediction result.
Optionally, the method further comprises: and outputting the change situation of field values of at least part of other fields in the data by utilizing a machine learning model according to the expected prediction result of the user for the target field of the data.
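One simple way to derive such field value changes is a brute-force search over candidate values, shown below as a sketch. It assumes a small, user-supplied set of candidate values per adjustable field (including the current value) and prefers the change set that modifies the fewest fields; the rule-based `predict` function is a toy stand-in for the trained model, and none of these names come from the application.

```python
from itertools import product


def suggest_changes(predict, row, candidates, desired):
    """Find the candidate assignment that yields `desired` with fewest changes.

    Returns {field: (old_value, new_value)}, or None if no candidate works.
    """
    fields = list(candidates)
    best = None
    for combo in product(*(candidates[f] for f in fields)):
        trial = dict(row)
        trial.update(zip(fields, combo))
        if predict(trial) != desired:
            continue
        changes = {f: (row[f], v) for f, v in zip(fields, combo) if v != row[f]}
        if best is None or len(changes) < len(best):
            best = changes
    return best


def predict(row):
    # Toy stand-in for the trained model.
    return "approved" if row["income"] >= 50 and row["debt"] <= 10 else "rejected"


row = {"income": 40, "debt": 5}
candidates = {"income": [40, 50, 60], "debt": [5, 0]}
changes = suggest_changes(predict, row, candidates, desired="approved")
```

The exhaustive product grows quickly with the number of fields and candidate values, so a practical implementation would need a guided search.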
According to a second aspect of the present invention, there is also provided a method of assisting a user in exploring a data table, comprising: in response to a user opening a data table in an application, a plug-in for implementing the method according to the first aspect of the invention is run against a data set in the data table.
According to a third aspect of the present invention, there is also provided a method of assisting a user in exploring a data table, comprising: in response to a user opening a data table in an application, running a plug-in the application to perform the steps of: displaying an exploration area in a preset area of the data table; responding to a prediction request of a user for a target field in a data table, taking field values corresponding to at least part of other fields in single data as input, taking field values corresponding to the target field in the single data as output, and automatically training a machine learning model; first interpretation information for characterizing the importance degree of at least part of other fields to the target field is output in the exploration area.
Optionally, the predetermined area is at least one of a left side, a right side, an upper side, and a lower side of the data table.
Optionally, the data table is an Excel table.
Optionally, automatically training the machine learning model comprises: automatically searching a hyperparameter space that is as small as possible, based on hyperparameters preset from experience, so as to train the machine learning model.
Optionally, outputting, in the exploration area, first interpretation information for characterizing the degree of importance of at least part of the other fields to the target field includes: in the process of calculating the degree of importance, the first explanatory information is displayed in a gradation form.
Optionally, the method further comprises: allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for a single piece of data to each of the at least some other fields, to obtain each field's score for that piece of data; and determining the importance of each field to the machine learning model's prediction of the target field according to the sum of the field's scores over the plurality of pieces of data, wherein the importance is positively correlated with the score sum.
Optionally, the method further comprises: outputting, in the exploration area, a two-dimensional coordinate graph characterizing the influence of a single field on the target field, wherein one coordinate axis of the two-dimensional coordinate graph represents the field value corresponding to the field and the other coordinate axis represents the score of the field, the two-dimensional coordinate graph comprises a plurality of coordinate points, each coordinate point corresponds to one piece of data, and the score of the field is obtained by allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for a single piece of data to each of the at least some other fields.
Optionally, the method further comprises: and representing a field value corresponding to another field in the data corresponding to the coordinate point by using the display characteristic of the coordinate point in the two-dimensional coordinate graph.
Optionally, the method further comprises: responding to a prediction request of a user for a certain piece of data in the data table, and outputting a prediction result of the machine learning model for the certain piece of data in the exploration area; and outputting second interpretation information of the importance degree of one or more fields in at least part of other fields in the piece of data to the prediction result in the exploration area.
Optionally, the method further comprises: allocating, in the manner of Shapley value allocation, the score that the machine learning model predicts for the piece of data to each of the at least some other fields, to obtain each field's score for the piece of data, wherein the importance of a field to the prediction result is positively correlated with the field's score.
Optionally, the method further comprises: responding to the adjustment operation of a user on the field values of one or more fields in a certain piece of data in the data table, and outputting the prediction result of the machine learning model on the adjusted piece of data; and outputting second interpretation information of the importance degree of one or more fields in at least part of other fields in the piece of data to the prediction result after the adjustment in the exploration area.
Optionally, the method further comprises: and outputting the change situation of field values of at least part of other fields in a certain piece of data by utilizing a machine learning model according to the expected prediction result of a user for the target field of the certain piece of data in the data table.
Optionally, the method further comprises: responding to a selection operation of a user on one or more data columns in the data table, and outputting first data analysis information in the exploration area, wherein the first data analysis information is used for representing data statistics of a field value corresponding to the data column selected by the user; and recommending one or more second data analysis information to the user, wherein the second data analysis information is predicted based on the data column selected by the user.
Optionally, the second data analysis information is used for characterizing data statistics of field values corresponding to other data columns obtained through prediction, and/or the second data analysis information is used for characterizing data statistics of field value combinations corresponding to data column combinations formed by data columns selected by users and other data columns obtained through prediction.
Optionally, the step of recommending one or more second data analysis information to the user includes: predicting data analysis information suitable for recommendation to a user based on a machine learning model; and/or predict data analysis information suitable for recommendation to a user based on statistical relevance; and/or predict data analysis information suitable for recommendation to a user based on business rules.
Optionally, the step of predicting data analysis information suitable for recommendation to the user based on the machine learning model comprises: predicting the probability that the user will select each other data column later by using a machine learning model based on the related information of the data column selected by the current user, wherein the machine learning model is trained according to the following modes: taking the relevant information of the previously selected data column and the relevant information of other data columns as input, and taking the probability of the other data columns being selected as output; and analyzing the data statistics of the field value corresponding to the data column with the probability value larger than the first preset threshold value to obtain second data analysis information, or analyzing the data statistics of the field value corresponding to the data column with the probability value larger than the first preset threshold value on the basis of the first data analysis information of the data column selected by the user to obtain the second data analysis information.
Optionally, the method further comprises: acquiring the user's acceptance feedback on the one or more pieces of second data analysis information recommended to the user, and updating the machine learning model based on the acquired acceptance feedback.
Optionally, the step of predicting data analysis information suitable for recommendation to the user based on the statistical relevance comprises: acquiring data columns with the statistical correlation higher than a second preset threshold value with the data columns selected by the user according to the statistical correlation among different data columns; and analyzing the data statistics of the field value corresponding to the data column with the statistical relevance higher than the second preset threshold value to obtain second data analysis information, or analyzing the data statistics of the field value corresponding to the data column with the statistical relevance higher than the second preset threshold value on the basis of the first data analysis information of the data column selected by the user to obtain the second data analysis information.
Optionally, the exploration area includes a first interface area and a second interface area, wherein outputting the first data analysis information in the exploration area in response to a user selection operation on one or more data columns in the data table includes: in response to the user selecting a particular data column with the cursor, displaying a chart of the first data analysis information in the first interface area; and wherein recommending one or more pieces of second data analysis information to the user includes: while displaying the chart of the first data analysis information in the first interface area, displaying one or more charts of second data analysis information in the second interface area.
Optionally, the method further comprises: when the data table is opened in the application, pre-calculating the data statistics of the field values corresponding to each data column and caching the calculated data statistics in memory, wherein displaying the chart of the first data analysis information in the first interface area comprises: displaying, in the first interface area, a chart of the first data analysis information generated from the corresponding data statistics retrieved from memory.
Optionally, the method further comprises: in response to a user selecting a field value in the chart of the first data analysis information, automatically updating the correspondingly displayed charts of the one or more pieces of second data analysis information to show the data statistics of that field value in the corresponding other field dimensions.
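The linked update described above can be sketched as a filter followed by a count. This is an illustrative reading only: the row layout, the field names, and the use of value counts as the data statistics are assumptions.

```python
from collections import Counter

def stats_for_selection(rows, selected_field, selected_value, other_field):
    """Count values of `other_field` among rows where `selected_field` equals `selected_value`."""
    return Counter(
        row[other_field] for row in rows if row[selected_field] == selected_value
    )

# Hypothetical data set: each row is one piece of data.
rows = [
    {"gender": "F", "product": "book"},
    {"gender": "F", "product": "phone"},
    {"gender": "M", "product": "phone"},
    {"gender": "F", "product": "book"},
]
# The user clicks the value "F" in the "gender" chart; the "product" chart
# is refreshed with the statistics of that value in the product dimension.
updated = stats_for_selection(rows, "gender", "F", "product")
```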
Optionally, the method further comprises: pre-calculating the data statistics, in the corresponding other field dimensions, to which the charts of the one or more pieces of second data analysis information would be updated if one or more frequently selected field values in the chart of the first data analysis information were selected, and caching the calculated data statistics in a memory, wherein the step of automatically updating the correspondingly displayed charts of the one or more pieces of second data analysis information to the data statistics of a certain field value in the corresponding other field dimensions comprises: automatically updating the correspondingly displayed charts of the one or more pieces of second data analysis information to the corresponding data statistics cached in the memory.
Optionally, the method further comprises: in response to the user selecting, with the cursor, at least one chart of the second data analysis information displayed in the second interface area, dragging it to the first interface area, and releasing it, displaying the at least one chart of the second data analysis information in the first interface area.
Optionally, the method further comprises: connecting one chart of the two charts to the other chart in response to a user's connection operation for the two charts in the first interface area; in response to a user selection of a field value characterized in the first graph in the connected state, the selected field value is highlighted relative to the unselected field values, and/or the second graph in the connected state is updated with data statistics characterizing the selected field value in the field dimension corresponding to the second graph.
Optionally, the method further comprises: connecting one chart of the two charts to the other chart in response to a user's connection operation for the two charts in the first interface area; the other chart is updated to be used for characterizing the data statistics of the field value characterized by one chart under the field dimension corresponding to the other chart.
According to a fourth aspect of the present invention, there is also provided an apparatus for assisting a user in exploring a data set, the data set comprising a plurality of pieces of data, each piece of data comprising values of one or more fields, the apparatus comprising: an output module configured to output first data analysis information in response to a field selection operation by a user, the first data analysis information characterizing data statistics of the field values corresponding to the field selected by the user; and a recommendation module configured to recommend one or more pieces of second data analysis information to the user, the second data analysis information being data analysis information predicted based on the field selected by the user.
Optionally, the second data analysis information is used for characterizing data statistics of field values corresponding to other fields obtained by prediction, and/or the second data analysis information is used for characterizing data statistics of field value combinations corresponding to fields selected by a user and field combinations formed by other fields obtained by prediction.
Optionally, the recommendation module comprises: a first recommendation module to predict data analysis information suitable for recommendation to a user based on a machine learning model; and/or a second recommendation module for predicting data analysis information suitable for recommendation to a user based on statistical relevance; and/or a third recommending module for predicting data analysis information suitable for recommending to the user based on the business rule.
Optionally, the first recommendation module: predicts, using a machine learning model, the probability that the user will subsequently select each of the other fields, based on information about the field currently selected by the user, wherein the machine learning model is trained by taking information about a previously selected field and information about the other fields as input and taking the probability of each of the other fields being selected as output; and analyzes the data statistics of the field values corresponding to each field whose probability value is greater than a first preset threshold to obtain second data analysis information, or analyzes the data statistics of the field values corresponding to each field whose probability value is greater than the first preset threshold on the basis of the first data analysis information of the field selected by the user to obtain the second data analysis information.
Optionally, the first recommendation module is further configured to obtain the user's acceptance feedback on the one or more pieces of second data analysis information recommended to the user, and to update the machine learning model based on the obtained acceptance feedback.
Optionally, the second recommendation module: according to the statistical correlation among different fields, acquiring a field with the statistical correlation higher than a second preset threshold value with the field selected by the user; and analyzing the data statistics of the field value corresponding to the field with the statistical relevance higher than the second preset threshold value to obtain second data analysis information, or analyzing the data statistics of the field value corresponding to the field with the statistical relevance higher than the second preset threshold value on the basis of the first data analysis information of the field selected by the user to obtain the second data analysis information.
Optionally, the apparatus further comprises: a display module configured to display a first interface area and a second interface area, wherein the first data analysis information is presented in the first interface area and the second data analysis information is presented in the second interface area, and the display module, in response to a user operation, presents second data analysis information from the second interface area in the first interface area.
Optionally, in response to a connection operation of a user for two pieces of data analysis information in the first interface area, the display module connects one piece of data analysis information to another piece of data analysis information, and the other piece of data analysis information is updated to be used for characterizing data statistics of a field value characterized by the one piece of data analysis information in a field dimension corresponding to the other piece of data analysis information.
Optionally, in response to a user's operation of connecting two pieces of data analysis information in the first interface region, the display module connects one piece of data analysis information to another piece of data analysis information, in response to a user's operation of selecting a field value represented by one piece of data analysis information in the connected state, the display module highlights data statistics of the selected field value in the piece of data analysis information relative to data statistics of the unselected field values, and/or the display module updates the other piece of data analysis information in the connected state to a data statistics for representing the selected field value in a field dimension corresponding to the other piece of data analysis information.
Optionally, the apparatus further comprises: a display module configured to display an interface for exploring the data set, wherein the left area of the interface displays name icons corresponding to all fields of the data set, the middle area of the interface displays charts of the first data analysis information, and the right area of the interface displays charts of the second data analysis information. When the data set is imported, the display module displays, in the left area of the interface, the name icons corresponding to the fields included in each piece of data in the data set. In response to the user selecting the name icon of a specific field with the cursor, dragging it to the middle area of the interface, and releasing it, the output module displays the chart of the first data analysis information in the middle area of the interface. While the chart of the first data analysis information is displayed in the middle area of the interface, the recommendation module presents one or more charts of the second data analysis information in the right area of the interface.
Optionally, the apparatus further comprises: a first calculation module configured to pre-calculate, when the data set is imported, the data statistics of the field values corresponding to each field and to cache the calculated data statistics in a memory, wherein the output module displays, in the middle area of the interface, a chart of the first data analysis information generated based on the corresponding data statistics retrieved from the memory.
Optionally, in response to a user selecting a certain field value in the chart of the first data analysis information, the display module automatically updates the chart of the one or more second data analysis information correspondingly displayed to the data statistics of the certain field value in the corresponding other field dimension.
Optionally, the apparatus further comprises: a second calculation module configured to pre-calculate the data statistics, in the corresponding other field dimensions, to which the charts of the one or more pieces of second data analysis information would be updated if one or more frequently selected field values in the chart of the first data analysis information were selected, and to cache the calculated data statistics in the memory, wherein the display module automatically updates the correspondingly displayed charts of the one or more pieces of second data analysis information to the corresponding data statistics cached in the memory.
Optionally, in response to the user selecting, with the cursor, at least one chart of the second data analysis information displayed in the right area of the interface, dragging it to the middle area of the interface, and releasing it, the display module displays the at least one chart of the second data analysis information in the middle area of the interface.
Optionally, in response to a user's operation of connecting two charts in the interface middle area, the display module connects one chart of the two charts to the other chart, in response to a user's operation of selecting a field value characterized in a first chart in a connected state, the display module highlights the selected field value relative to the unselected field value, and/or the display module updates a second chart in a connected state to be used for characterizing data statistics of the selected field value in a field dimension corresponding to the second chart.
Optionally, in response to a user's operation of connecting two charts in the interface middle area, the display module connects one chart of the two charts to the other chart, and the display module updates the other chart to be used for representing data statistics of the field value represented by the one chart in the field dimension corresponding to the other chart.
Optionally, the apparatus further comprises: a training module configured to automatically train a machine learning model in response to a prediction request of a user for a target field, taking the field values corresponding to at least some of the other fields in a single piece of data as input and the field value corresponding to the target field in that piece of data as output; and a display module configured to display first interpretation information characterizing the importance of one or more of the at least some other fields to the machine learning model's prediction of the target field.
Optionally, the apparatus further comprises: a training module configured to automatically train a machine learning model, taking the field values corresponding to at least some of the other fields in a single piece of data as input and the field value corresponding to the target field in that piece of data as output, in response to the user performing, in the right area of the interface, an operation that starts automatic training of the machine learning model while a name icon of the target field is selected in the left area of the interface or a chart of the first data analysis information of the target field is selected in the middle area of the interface; and a display module configured to display, in the right area of the interface, first interpretation information characterizing the importance of one or more of the at least some other fields to the machine learning model's prediction of the target field.
Optionally, the training module trains the machine learning model by performing an automatic search within a minimal hyper-parameter space, based on hyper-parameters preset according to experience.
Optionally, the display module displays the first interpretation information in a gradual (progressive) form while the importance degree is being calculated.
Optionally, the apparatus further comprises: a third calculation module configured to allocate the score predicted by the machine learning model for a single piece of data to each of the at least some other fields in the manner of Shapley value allocation, so as to obtain the score of each field for that piece of data; and a determining module configured to determine the importance of each field to the machine learning model's prediction of the target field according to the sum of the scores of that field over a plurality of pieces of data, wherein the importance is positively correlated with the sum of the scores.
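To make the allocation concrete, the following sketch computes exact Shapley values for one piece of data by enumerating field coalitions, then sums the per-field scores over several pieces of data. Several things here are assumptions not stated in the text: fields outside a coalition are replaced by a baseline value, the model is a hypothetical fixed linear scorer standing in for a trained model, and scores are summed as signed values. Exhaustive enumeration is only practical for a handful of fields.

```python
from itertools import combinations
from math import factorial

def shapley_scores(predict, row, baseline):
    """Exact Shapley values of each field for one row.

    Fields outside a coalition take their `baseline` value
    (an assumption of this sketch, not something the text specifies).
    """
    fields = list(row)
    n = len(fields)

    def value(coalition):
        mixed = {f: (row[f] if f in coalition else baseline[f]) for f in fields}
        return predict(mixed)

    scores = {}
    for f in fields:
        others = [g for g in fields if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        scores[f] = total
    return scores

def field_importance(predict, rows, baseline):
    """Sum each field's per-row scores over all rows, as the text describes."""
    totals = {f: 0.0 for f in baseline}
    for row in rows:
        for f, score in shapley_scores(predict, row, baseline).items():
            totals[f] += score
    return totals

# Hypothetical stand-in for a trained model: a fixed linear scorer.
def predict(x):
    return 2.0 * x["age"] + 0.5 * x["income"]

baseline = {"age": 0.0, "income": 0.0}
rows = [{"age": 1.0, "income": 4.0}, {"age": 3.0, "income": 2.0}]
per_row = shapley_scores(predict, rows[0], baseline)
importance = field_importance(predict, rows, baseline)
```

A useful sanity check is that the scores for one row add up to the difference between the prediction for that row and the prediction for the baseline.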
Optionally, the apparatus further comprises: a display module configured to display, for a single field, the scores of that field over a plurality of pieces of data in a two-dimensional coordinate system, wherein one coordinate axis of the two-dimensional coordinate system represents the field value corresponding to the field and the other coordinate axis represents the score of the field.
Optionally, the displaying module further represents a field value corresponding to another field in the data corresponding to the coordinate point by using a display characteristic of the coordinate point in the two-dimensional coordinate system.
Optionally, in response to a selection operation of a user on a piece of data, the presentation module further outputs a predicted result of the machine learning model on the piece of data, and outputs second interpretation information of the importance degree of one or more fields of at least some other fields in the piece of data on the predicted result.
Optionally, the presentation module is further configured to provide a control for inputting a piece of data to the user in the right area of the interface, receive the piece of data input by the user through the control, and present a predicted result of the machine learning model for the piece of data in the right area of the interface, and present second interpretation information of the importance degree of one or more fields of at least some other fields of the piece of data to the predicted result.
Optionally, the apparatus further comprises: a fifth calculation module configured to allocate the score predicted by the machine learning model for the piece of data to each of the at least some other fields in the manner of Shapley value allocation, so as to obtain the score of each field for that piece of data, wherein the importance is positively correlated with the score.
Optionally, the display module is further configured to present, in a two-dimensional coordinate system, a prediction result obtained by predicting the machine learning model for the plurality of pieces of data, where the two-dimensional coordinate system includes a plurality of coordinate points, each coordinate point corresponds to one piece of data, a display characteristic of the coordinate point is used to represent the prediction result of the data, and a distance between two coordinate points in the two-dimensional space is positively correlated to a distance between two pieces of data corresponding to the two coordinate points in the multi-dimensional space.
Optionally, each piece of data has a plurality of dimensions, each dimension corresponding to one of the at least some other fields, and the apparatus further includes: a sixth calculation module configured to allocate the score predicted by the machine learning model for a single piece of data to each of the at least some other fields in the manner of Shapley value allocation, so as to obtain the score of each field for that piece of data, wherein the score of a field is the value of the data in the corresponding dimension, and the position of each piece of data in the multi-dimensional space is determined based on the values of the data in the plurality of dimensions.
Optionally, the apparatus further comprises: a selection module configured to, in response to a selection operation of a user on one or more coordinate points in the two-dimensional coordinate system, select from the vicinity of the selected coordinate points a preset number of coordinate points having the same prediction result as the selected coordinate points, so as to obtain a plurality of clustered coordinate points; and an extraction module configured to extract one or more key fields from the data corresponding to the clustered coordinate points, based on the ranking of the fields' scores by magnitude, so as to obtain a key field group, wherein the display module is further configured to output the key field group.
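One possible reading of this selection and extraction step is sketched below. The point layout (`xy`, `pred`, `scores`) is hypothetical, nearness is measured by Euclidean distance in the two-dimensional view, and fields are ranked by the sum of the absolute values of their scores over the clustered points; the text only says the ranking is based on score magnitude, so that aggregation is an assumption.

```python
from math import dist

def cluster_points(points, selected_idx, k):
    """Pick the k points nearest to the selected one that share its prediction."""
    target = points[selected_idx]
    same = [
        i for i, p in enumerate(points)
        if i != selected_idx and p["pred"] == target["pred"]
    ]
    same.sort(key=lambda i: dist(points[i]["xy"], target["xy"]))
    return [selected_idx] + same[:k]

def key_fields(points, indices, top_n):
    """Rank fields by their summed absolute score over the clustered points."""
    totals = {}
    for i in indices:
        for field, score in points[i]["scores"].items():
            totals[field] = totals.get(field, 0.0) + abs(score)
    return sorted(totals, key=totals.get, reverse=True)[:top_n]

# Hypothetical coordinate points: 2-D position, prediction result, per-field scores.
points = [
    {"xy": (0.0, 0.0), "pred": 1, "scores": {"age": 0.9, "income": 0.1}},
    {"xy": (0.1, 0.0), "pred": 1, "scores": {"age": 0.7, "income": -0.2}},
    {"xy": (0.2, 0.1), "pred": 0, "scores": {"age": -0.5, "income": -0.6}},
    {"xy": (5.0, 5.0), "pred": 1, "scores": {"age": 0.1, "income": 0.8}},
]
cluster = cluster_points(points, selected_idx=0, k=1)
group = key_fields(points, cluster, top_n=1)
```

In this example the nearby point with a different prediction is skipped, and the distant point with the same prediction is excluded by the preset number `k`.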
Optionally, the presentation module is further configured to, in response to an adjustment operation performed by a user on a field value of one or more fields in a piece of data, output a prediction result of the machine learning model for the adjusted piece of data, and output second interpretation information of a degree of importance of one or more fields in at least some other fields in the adjusted piece of data to the prediction result.
Optionally, the presentation module is further configured to output, by using the machine learning model, the changes in the field values of at least some of the other fields in a piece of data that correspond to a prediction result desired by the user for the target field of that piece of data.
According to a fifth aspect of the present invention, there is further provided an apparatus for assisting a user in exploring a data table, comprising: an execution module, configured to execute, in response to a user opening a data table in an application, a plug-in for implementing the method according to the first aspect of the present invention with respect to a data set in the data table.
According to a sixth aspect of the present invention, there is further provided an apparatus for assisting a user in exploring a data table, comprising: a display module configured to display an exploration area in a predetermined area of the data table in response to a user opening the data table in an application program; and a training module configured to automatically train a machine learning model in response to a prediction request of the user for a target field in the data table, taking the field values corresponding to at least some of the other fields in a single piece of data as input and the field value corresponding to the target field in that piece of data as output, wherein the display module is further configured to output, in the exploration area, first interpretation information characterizing the importance of the at least some other fields to the target field.
Optionally, the predetermined area is at least one of a left side, a right side, an upper side, and a lower side of the data table.
Optionally, the data table is an Excel table.
Optionally, the training module trains the machine learning model by performing an automatic search within a minimal hyper-parameter space, based on hyper-parameters preset according to experience.
Optionally, the display module displays the first interpretation information in a gradual (progressive) form while the importance degree is being calculated.
Optionally, the apparatus further comprises: a first calculation module configured to allocate the score predicted by the machine learning model for a single piece of data to each of the at least some other fields in the manner of Shapley value allocation, so as to obtain the score of each field for that piece of data; and a determining module configured to determine the importance of each field to the machine learning model's prediction of the target field according to the sum of the scores of that field over a plurality of pieces of data, wherein the importance is positively correlated with the sum of the scores.
Optionally, the display module is further configured to output, in the exploration area, a two-dimensional coordinate graph representing the influence of a single field on the target field, wherein one coordinate axis of the two-dimensional coordinate graph represents the field value corresponding to the field and the other coordinate axis represents the score of the field, the two-dimensional coordinate graph includes a plurality of coordinate points, each coordinate point corresponds to one piece of data, and the score of the field is obtained by allocating the score predicted by the machine learning model for a single piece of data to each of the at least some other fields in the manner of Shapley value allocation.
Optionally, the display module is further configured to represent a field value corresponding to another field in the data corresponding to the coordinate point by using a display characteristic of the coordinate point in the two-dimensional coordinate graph.
Optionally, in response to a prediction request of a user for a piece of data in the data table, the display module outputs a prediction result of the machine learning model for the piece of data in the exploration area, and outputs second interpretation information of the importance degree of one or more fields of at least part of other fields in the piece of data to the prediction result in the exploration area.
Optionally, the apparatus further comprises: a second calculation module configured to allocate the score predicted by the machine learning model for the piece of data to each of the at least some other fields in the manner of Shapley value allocation, so as to obtain the score of each field for that piece of data, wherein the importance of a field to the prediction result is positively correlated with the score of the field.
Optionally, in response to an adjustment operation of a user on a field value of one or more fields in a certain piece of data in the data table, the display module further outputs a predicted result of the machine learning model for the adjusted piece of data, and outputs second interpretation information of the importance degree of the one or more fields in at least some other fields in the adjusted piece of data on the predicted result in the exploration area.
Optionally, the display module further outputs, by using the machine learning model, the changes in the field values of at least some of the other fields in a certain piece of data in the data table that correspond to a prediction result desired by the user for the target field of that piece of data.
Optionally, the apparatus further comprises: an output module configured to output first data analysis information in the exploration area in response to a selection operation of a user on one or more data columns in the data table, the first data analysis information characterizing data statistics of the field values corresponding to the data column selected by the user; and a recommendation module configured to recommend one or more pieces of second data analysis information to the user, the second data analysis information being data analysis information predicted based on the data column selected by the user.
Optionally, the second data analysis information is used for characterizing data statistics of field values corresponding to other data columns obtained through prediction, and/or the second data analysis information is used for characterizing data statistics of field value combinations corresponding to data column combinations formed by data columns selected by users and other data columns obtained through prediction.
Optionally, the recommendation module comprises: a first recommendation module to predict data analysis information suitable for recommendation to a user based on a machine learning model; and/or a second recommendation module for predicting data analysis information suitable for recommendation to a user based on statistical relevance; and/or a third recommending module for predicting data analysis information suitable for recommending to the user based on the business rule.
Optionally, the first recommendation module: predicting the probability that the user will select each other data column later by using a machine learning model based on the related information of the data column selected by the current user, wherein the machine learning model is trained according to the following modes: taking the relevant information of the previously selected data column and the relevant information of other data columns as input, and taking the probability of the other data columns being selected as output; and analyzing the data statistics of the field value corresponding to the data column with the probability value larger than the first preset threshold value to obtain second data analysis information, or analyzing the data statistics of the field value corresponding to the data column with the probability value larger than the first preset threshold value on the basis of the first data analysis information of the data column selected by the user to obtain the second data analysis information.
Optionally, the first recommendation module further obtains the user's acceptance feedback on the one or more pieces of second data analysis information recommended to the user, and updates the machine learning model based on the obtained acceptance feedback.
Optionally, the second recommendation module: acquiring data columns with the statistical correlation higher than a second preset threshold value with the data columns selected by the user according to the statistical correlation among different data columns; and analyzing the data statistics of the field value corresponding to the data column with the statistical relevance higher than the second preset threshold value to obtain second data analysis information, or analyzing the data statistics of the field value corresponding to the data column with the statistical relevance higher than the second preset threshold value on the basis of the first data analysis information of the data column selected by the user to obtain the second data analysis information.
Optionally, the exploration area includes a first interface area and a second interface area, the output module displays a chart of the first data analysis information in the first interface area in response to the user selecting a specific data column with the cursor, and the recommendation module displays one or more charts of the second data analysis information in the second interface area while the chart of the first data analysis information is displayed in the first interface area.
Optionally, the apparatus further comprises: a third calculation module configured to pre-calculate, when the data table is opened in the application program, the data statistics of the field values corresponding to each data column and to cache the calculated data statistics in the memory, wherein the output module displays, in the first interface area, a chart of the first data analysis information generated based on the corresponding data statistics retrieved from the memory.
Optionally, the recommendation module automatically updates the displayed chart of the one or more second data analysis information into the data statistics of the certain field value in the corresponding other field dimension in response to the user selecting the certain field value in the chart of the first data analysis information.
Optionally, the apparatus further comprises: a fourth calculation module configured to pre-calculate the data statistics, in the corresponding other field dimensions, to which the charts of the one or more pieces of second data analysis information would be updated if one or more frequently selected field values in the chart of the first data analysis information were selected, and to cache the calculated data statistics in the memory, wherein the recommendation module automatically updates the correspondingly displayed charts of the one or more pieces of second data analysis information to the corresponding data statistics cached in the memory.
Optionally, in response to the user selecting, with the cursor, at least one chart of the second data analysis information displayed in the second interface area, dragging it to the first interface area, and releasing it, the display module displays the at least one chart of the second data analysis information in the first interface area.
Optionally, in response to a user's operation of connecting two charts in the first interface region, the display module connects one of the two charts to the other chart, in response to a user's operation of selecting a field value characterized in the first chart in a connected state, the display module highlights the selected field value relative to the unselected field value, and/or the display module updates the second chart in a connected state to characterize data statistics of the selected field value in a field dimension corresponding to the second chart.
Optionally, in response to a user's connection operation for the two charts in the first interface region, the display module connects one of the two charts to the other chart, and the display module updates the other chart to be used for characterizing data statistics of field values characterized by the one chart in a field dimension corresponding to the other chart.
According to a seventh aspect of the present invention, there is also presented a system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the method as set forth in any one of the first to third aspects of the present invention.
According to an eighth aspect of the present invention, there is also provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method as recited in any one of the first to third aspects of the present invention.
In the method and apparatus for assisting a user in exploring a data set or a data table according to exemplary embodiments of the present invention, outputting the first data analysis information lets the user understand the data statistics of the selected field. The second data analysis information is data analysis information that is predicted from the field selected by the user and is likely to interest the user; recommending it to the user reduces the user's data exploration cost and lets the user explore the data set conveniently and quickly.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a flow diagram of a method of assisting a user in exploring a data set, according to an exemplary embodiment of the invention;
FIGS. 2-9 illustrate schematic diagrams of a data exploration interface, according to an exemplary embodiment of the present invention;
FIGS. 10-14 illustrate schematic diagrams of exploration areas shown in a data table, according to an exemplary embodiment of the present invention;
FIG. 15 illustrates a block diagram of an apparatus for assisting a user in exploring a data set, according to an exemplary embodiment of the present invention;
FIG. 16 illustrates a block diagram of an apparatus for assisting a user in exploring a data table, according to an exemplary embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, exemplary embodiments thereof will be described in further detail below with reference to the accompanying drawings and detailed description.
Fig. 1 shows a flowchart of a method of assisting a user in exploring a data set according to an exemplary embodiment of the present invention. The data set includes a plurality of pieces of data, each piece of data including values of one or more fields. The method shown in fig. 1 may be implemented entirely in software via a computer program, and the method shown in fig. 1 may also be executed by a specifically-configured computing device.
Referring to fig. 1, in step S110, in response to a field selection operation by a user, first data analysis information is output, where the first data analysis information is used to characterize data statistics of a field value corresponding to a field selected by the user.
Outputting the first data analysis information may refer to outputting the first data analysis information in a visualized manner. By way of example, the first data analysis information may be presented to the user by way of, but not limited to, a chart. That is, the first data analysis information may be a chart obtained by analyzing a field value corresponding to a field selected by a user in the data set.
In step S120, one or more pieces of second data analysis information, predicted based on the field selected by the user, are recommended to the user.
Recommending one or more pieces of second data analysis information to the user may refer to presenting them to the user in a visual manner. As an example, the second data analysis information may be presented to the user by, but not limited to, a chart; that is, the second data analysis information may be a chart recommended to the user.
The second data analysis information may be used to characterize data statistics of field values corresponding to other predicted fields, and/or may be used to characterize data statistics of field-value combinations corresponding to combinations of the field selected by the user and other predicted fields.
Taking an employee information table as an example of the data set, when the user selects the "department" field, the first data analysis information may be a chart representing the data statistics of the field values (such as human resources department, research and development department, and sales department) corresponding to the "department" field. The second data analysis information recommended to the user may be a chart of data statistics obtained by analyzing the "whether overtime" field grouped by "department", or a chart of data statistics obtained by analyzing the field values (yes or no) corresponding to the "whether overtime" field.
In the present invention, one or more of a variety of ways, including but not limited to machine learning models, statistical correlations, business rules, etc., may be employed to predict data analysis information suitable for recommendation to a user.
Taking prediction based on a machine learning model as an example, the machine learning model can be used to predict, from the relevant information (such as field name, data type, and data distribution) of the field currently selected by the user, the probability that the user will subsequently select each of the other fields. The machine learning model is trained by taking the relevant information of the selected field and the relevant information of the other fields as input and taking the probability of each other field being selected as output. Second data analysis information is then obtained either by analyzing the data statistics of the field values corresponding to each field whose probability is greater than a first predetermined threshold, or by analyzing those data statistics on the basis of the first data analysis information of the field selected by the user.
Optionally, the user's acceptance feedback on the one or more pieces of second data analysis information recommended to the user may be acquired, and the machine learning model may be updated based on the acquired acceptance feedback.
Taking prediction based on statistical relevance as an example, fields whose statistical relevance to the field selected by the user is higher than a second predetermined threshold can be obtained from the statistical relevance between different fields. Second data analysis information is then obtained either by analyzing the data statistics of the field values corresponding to each such field, or by analyzing those data statistics on the basis of the first data analysis information of the field selected by the user.
Taking prediction based on business rules as an example, the data analysis and visualization modes frequently used in a business scenario can be summarized from business experience, and the second data analysis information can be predicted based on business experience in the same or a similar scenario. For example, if it is known through user interviews or actual business experience that a user analyzing "sales" also tends to pay attention to "net profit", data analysis information related to "net profit" can be recommended when the user selects the "sales" field.
In the present invention, step S110 and step S120 need not be performed in any particular order and may be performed simultaneously. That is, in response to a field selection operation by the user, the first data analysis information may be output and one or more pieces of second data analysis information may be recommended to the user at the same time.
By outputting the first data analysis information, the user can learn the data statistics of the selected field. Because the second data analysis information is predicted from the field selected by the user and is likely to interest the user, recommending it reduces the user's data exploration cost and lets the user explore the data set conveniently and quickly.
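The statistical-relevance approach can be sketched as follows. The text does not name a particular relevance measure, so this sketch assumes Cramér's V for categorical fields; the function names, the toy table, and the threshold of 0.5 are all hypothetical and only illustrate the thresholding step.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical columns (0 = no association, 1 = perfect).

    Assumed relevance measure; the source does not specify one.
    """
    n = len(xs)
    row, col, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def related_fields(table, selected, threshold):
    """Fields whose relevance to `selected` exceeds `threshold`, strongest first."""
    scores = {
        name: cramers_v(table[selected], column)
        for name, column in table.items()
        if name != selected
    }
    return sorted((f for f, s in scores.items() if s > threshold),
                  key=lambda f: -scores[f])


# Toy table invented for illustration only.
table = {
    "whether_to_leave": ["yes", "yes", "no", "no", "yes", "no"],
    "whether_overtime": ["yes", "yes", "no", "no", "yes", "no"],
    "gender": ["f", "m", "f", "m", "m", "f"],
}
recommended = related_fields(table, "whether_to_leave", threshold=0.5)
```

Charts for the fields in `recommended` would then be candidates for the second data analysis information.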
As an example, when the first/second data analysis information is presented, its outliers and/or salient values can be emphasized (e.g., highlighted), where an outlier is a value that differs markedly from the other values in the data, and a salient value is a value or value range that occurs more frequently.
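One way such values might be detected before highlighting is sketched below. The source does not say how outliers or salient values are identified; this sketch assumes Tukey's interquartile-range rule for outliers and the most frequent value as the salient value, and the data are made up.

```python
from collections import Counter
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, an assumed criterion)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def most_frequent(values):
    """The value that occurs most often, used here as a simple 'salient' value."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical monthly income figures (in thousands), for illustration only.
monthly_income = [5, 6, 6, 7, 6, 5, 7, 6, 40]
outliers = find_outliers(monthly_income)
salient = most_frequent(monthly_income)
```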
The present invention may display an interface for exploring a data set. The first data analysis information and the second data analysis information may be presented in different areas of the interface, for example, the first data analysis information may be presented in a first interface area, and the second data analysis information may be presented in a second interface area, where the first interface area and the second interface area refer to two different interface areas in the same interface.
The first interface area can be regarded as an interaction area, in which a user can perform a predetermined operation on the presented data analysis information to realize a corresponding function. The second interface area is a recommendation area, from which the user can drag second data analysis information that meets the user's needs to the first interface area in order to perform subsequent operations there.
That is, in response to an operation by the user, the second data analysis information in the second interface area may also be presented in the first interface area. For example, in response to the user selecting, with a cursor, the second data analysis information displayed in the second interface area, dragging it to the first interface area, and performing a release operation, the second data analysis information may be displayed in the first interface area. After the second data analysis information has been dragged to the first interface area, the second interface area may either stop displaying it or continue to display it.
Thus, the first interface region may present one or more data analysis information. The data analysis information displayed in the first interface area may be the first data analysis information or the second data analysis information from the second interface area.
As an example of the present invention, in response to a connection operation of a user for two pieces of data analysis information in the first interface area, one piece of data analysis information of the two pieces of data analysis information is connected to the other piece of data analysis information, and the other piece of data analysis information is updated to represent a data statistic of a field value represented by the one piece of data analysis information in a field dimension corresponding to the other piece of data analysis information. Therefore, the user can obtain the data statistics condition under the dimension represented by the two pieces of data analysis information by connecting the two pieces of data analysis information.
For example, the first interface area may include a chart 1 obtained by counting the field values corresponding to the "department" field and a chart 2 obtained by counting the field values corresponding to the "leave or not" field. In response to a user's connection operation for chart 1 and chart 2, chart 1 may be visually connected to chart 2 (for example, by a connecting line), at which time chart 2 may be automatically updated to a chart showing the "leave or not" counts grouped by department, that is, the departure situation of each department.
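The update applied to chart 2 after the connection amounts to a grouped count. A minimal sketch follows, assuming the data set is held as a list of dictionaries; the function name, field keys, and rows are hypothetical.

```python
from collections import Counter, defaultdict


def grouped_counts(rows, group_field, value_field):
    """Count the values of `value_field` separately for each value of `group_field`."""
    result = defaultdict(Counter)
    for row in rows:
        result[row[group_field]][row[value_field]] += 1
    return {group: dict(counts) for group, counts in result.items()}


# Invented rows for illustration only.
rows = [
    {"department": "sales", "leave": "yes"},
    {"department": "sales", "leave": "no"},
    {"department": "research and development", "leave": "no"},
    {"department": "research and development", "leave": "no"},
    {"department": "human resources", "leave": "yes"},
]

# Data behind chart 2 once chart 1 ("department") is connected to it.
chart2_after_connection = grouped_counts(rows, "department", "leave")
```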
As another example of the present invention, in response to a connection operation of a user with respect to two pieces of data analysis information in the first interface area, one piece of data analysis information of the two pieces of data analysis information is connected to the other piece of data analysis information; in response to a user's selection operation for a field value characterized by one of the two pieces of data analysis information in the connected state, data statistics of the selected field value in the data analysis information are highlighted relative to those of the unselected field values, and/or the other of the two pieces of data analysis information in the connected state is updated to characterize the data statistics of the selected field value in a field dimension corresponding to the other piece of data analysis information. Therefore, in response to the user connecting the two pieces of data analysis information and screening one piece of data analysis information, the other piece of data analysis information can be automatically updated to the screened data statistics.
For example, the first interface area may include a chart 1 obtained by counting the field values corresponding to the "department" field and a chart 2 obtained by counting the field values corresponding to the "whether to leave work" field, where chart 1 may represent the headcount distribution of the "human resources department", "research and development department", and "sales department". In response to a user's connection operation for charts 1 and 2, chart 1 may be visually connected to chart 2 (for example, by a connecting line). Then, in response to the user selecting "research and development department" in chart 1, the data statistics of the "research and development department" in chart 1 are highlighted relative to those of the "human resources department" and "sales department", and/or chart 2 is updated to a chart characterizing the departure statistics of the "research and development department".
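The filtering behavior of connected charts can be sketched as a count restricted to the rows that match the selected field value. The function name, field keys, and rows below are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def filtered_counts(rows, filter_field, selected_value, value_field):
    """Count `value_field` values using only rows where `filter_field` equals `selected_value`."""
    return dict(Counter(
        row[value_field] for row in rows if row[filter_field] == selected_value
    ))


# Invented rows for illustration only.
rows = [
    {"department": "sales", "leave": "yes"},
    {"department": "research and development", "leave": "no"},
    {"department": "research and development", "leave": "yes"},
    {"department": "research and development", "leave": "no"},
    {"department": "human resources", "leave": "no"},
]

# Data behind chart 2 after the user selects one bar in the connected chart 1.
chart2_filtered = filtered_counts(
    rows, "department", "research and development", "leave"
)
```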
FIGS. 2 through 9 show schematic diagrams of a data exploration interface, according to an exemplary embodiment of the present invention. The method for assisting a user in exploring a data set according to the present invention will be further described with reference to fig. 2 to 9.
Referring to FIG. 2, an interface for exploring a data set can be displayed, the interface being divided into three sections: the left area of the interface, the middle area of the interface, and the right area of the interface (i.e., the directions area shown in the figure).
The left area of the interface is used for displaying name icons corresponding to the fields of the data set. Fig. 2 takes an employee information table as an example of the data set and shows name icons corresponding to some of its fields (department, gender, whether overtime, and whether to leave work). It should be appreciated that all or only some of the fields of the data set may be shown in the left area of the interface, as is practical. For example, when the number of fields of the data set is small (e.g., lower than a threshold), all the fields may be displayed in the left area of the interface, and when the number of fields is large (e.g., higher than the threshold), some of the fields may be selectively displayed in the left area of the interface.
The interface middle area corresponds to the first interface area mentioned above, and is used for displaying a chart of the first data analysis information. The right area of the interface corresponds to the second interface area mentioned above, and is used for displaying a chart of the second data analysis information.
According to the method and the device, when the data set is imported, the name icons corresponding to the fields included in each piece of data in the data set are displayed in the left area of the interface.
In response to a user selecting the name icon of a specific field with a cursor, dragging the name icon to the middle area of the interface, and releasing it, the chart of the first data analysis information is displayed in the middle area of the interface and, at the same time, one or more charts of the second data analysis information are displayed in the right area of the interface. The charts displayed in the right area of the interface may be predicted using any one or a combination of the prediction modes mentioned above.
As shown in fig. 3, in response to a user selecting the name icon of the "department" field with a cursor, dragging the name icon to the middle area of the interface, and releasing it, a chart representing the data statistics of the field values corresponding to the "department" field in the data set is displayed in the middle area of the interface. At the same time, a chart counting "whether overtime" grouped by "department" and a chart summing "monthly income" grouped by "department" are displayed in the right area of the interface.
In order to enable the data analysis information to be displayed to the user in time, when the data set is imported, the data statistics of the field values corresponding to the fields can be calculated in advance, and the calculated data statistics can be cached in the memory. Therefore, in response to the field selection operation of the user, the chart of the first data analysis information generated based on the corresponding data statistics extracted from the memory can be displayed in the interface middle area without analyzing the field value corresponding to the field selected by the user.
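The precompute-and-cache idea can be sketched as below, assuming the data set is a list of dictionaries and that the statistic of interest is a simple value count per field. The class name `FieldStatsCache` and its methods are hypothetical.

```python
from collections import Counter


class FieldStatsCache:
    """Compute per-field value counts once at import time and serve them from memory."""

    def __init__(self, rows):
        # One pass over the imported data set fills the in-memory cache.
        self._stats = {}
        for row in rows:
            for field, value in row.items():
                self._stats.setdefault(field, Counter())[value] += 1

    def counts(self, field):
        """Return cached counts for `field` without re-scanning the data set."""
        return dict(self._stats[field])


# Invented rows for illustration only.
rows = [
    {"department": "sales", "whether_overtime": "yes"},
    {"department": "sales", "whether_overtime": "no"},
    {"department": "research and development", "whether_overtime": "no"},
]
cache = FieldStatsCache(rows)           # done when the data set is imported
department_chart_data = cache.counts("department")  # done on field selection
```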
In response to a user selecting a field value in the chart of the first data analysis information, the chart of the corresponding presented one or more second data analysis information is automatically updated to a data statistic of the field value in the corresponding other field dimension.
Still taking fig. 3 as an example, in response to the user selecting "research and development department" in the chart displayed in the middle area of the interface, the chart in the right area of the interface that counts "whether overtime" grouped by "department" is automatically updated to a chart that counts "whether overtime" for the "research and development department", and the chart in the right area of the interface that sums "monthly income" grouped by "department" is automatically updated to a chart that sums "monthly income" for the "research and development department".
Correspondingly, so that the data analysis information can be displayed to the user promptly, the data statistics of one or more frequently selected field values of the chart of the first data analysis information in the corresponding other field dimensions may be calculated in advance and cached in memory. In that case, automatically updating the charts of the correspondingly displayed one or more pieces of second data analysis information to the data statistics of a given field value in the corresponding other field dimensions includes automatically updating those charts to the corresponding data statistics cached in memory.
The user can move a chart of the second data analysis information shown in the right area of the interface to the middle area of the interface by performing a predetermined operation. For example, in response to the user selecting, with a cursor, at least one chart of the second data analysis information displayed in the right area of the interface, dragging it to the middle area of the interface, and performing a release operation, the at least one chart of the second data analysis information is displayed in the middle area of the interface. After a chart has been dragged from the right area to the middle area of the interface, the right area may either stop displaying the chart or continue to display it.
The user may continue to select new fields to present a plurality of graphs of the first data analysis information in the middle area of the interface. As shown in fig. 4, after the user selects the name icon of the "department" field, drags the name icon to the interface middle area and releases the name icon, the name icon of the "whether to leave work" field may also be selected, and the name icon is dragged to the interface middle area and released, so that a graph representing the data statistics of the field value corresponding to the "whether to leave work" field in the data set is displayed in the interface middle area.
In response to a chart of first data analysis information being newly added to the middle area of the interface, the charts of the second data analysis information displayed in the right area of the interface are automatically updated. They may be automatically updated to charts of data analysis information predicted based on the field most recently selected by the user, and/or to charts of data analysis information predicted based on a plurality of fields previously selected by the user. The prediction methods are described above and are not repeated here.
As shown in fig. 4, the charts of the second data analysis information shown in the right area of the interface may be updated to a chart indicating that "whether to leave work" is highly correlated with "whether overtime", and a chart counting "whether to leave work" grouped by "department".
As shown in FIG. 5, in response to a user's connection operation for two charts in the interface middle area, one of the two charts is connected to the other chart. Then, in response to a user's selection of a field value (e.g., sales department) characterized in the first chart in the connected state (e.g., the chart corresponding to "department"), the selected field value (e.g., sales department) is highlighted relative to the unselected field values (e.g., human resources department, research and development department), and/or the second chart in the connected state (e.g., the chart corresponding to "whether to leave work") is updated to characterize the data statistics of the selected field value (sales department) in the field dimension ("whether to leave work") corresponding to the second chart. Therefore, by connecting two charts and filtering one of them, the other chart can be automatically updated to the filtered data statistics.
In response to a user's connection operation for two charts in the interface middle area, one of the two charts is connected to the other chart, and the other chart can be updated to characterize the data statistics of the field values characterized by the one chart in the field dimension corresponding to the other chart. Therefore, by connecting the two charts, the data statistics under the dimensions represented by both charts can be obtained.
As shown in fig. 4, the interface middle area includes a chart 1 obtained by counting the field values corresponding to the "department" field and a chart 2 obtained by counting the field values corresponding to the "whether to leave work" field, where chart 1 may represent the headcount distribution of the "human resources department", "research and development department", and "sales department". In response to a user's connection operation for chart 1 and chart 2, chart 1 may be visually connected to chart 2 (for example, by a connecting line), at which time chart 2 may be automatically updated to a chart showing the "whether to leave work" counts grouped by department, that is, the departure situation of each department.
In response to a user's prediction request for a target field, the present invention can automatically train a machine learning model by taking the field values corresponding to at least some of the other fields in a single piece of data as input and taking the field value corresponding to the target field in that piece of data as output. When the machine learning model is automatically trained, an automatic search can be performed within a very small hyper-parameter space built from hyper-parameters preset according to experience. In this way, an effective, concise, and highly interpretable machine learning model can be generated in a short time to assist the user in making business decisions.
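Searching a small, preset hyper-parameter space can be sketched as an exhaustive loop over a few candidate values. The source does not name the model type, the hyper-parameters, or the search strategy, so the grid, the parameter names, and the scoring callback below are all hypothetical placeholders; a real `train_and_score` would fit a model and return a validation score.

```python
from itertools import product


def search_small_grid(train_and_score, grid):
    """Try every combination in a small preset grid; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid of experience-based presets (six combinations in total).
PRESET_GRID = {"max_depth": [2, 3, 4], "learning_rate": [0.05, 0.1]}


def toy_train_and_score(params):
    # Placeholder for "train a model and return its validation score".
    # It simply pretends that max_depth=3 and learning_rate=0.1 score best.
    return -abs(params["max_depth"] - 3) - abs(params["learning_rate"] - 0.1)


best_params, best_score = search_small_grid(toy_train_and_score, PRESET_GRID)
```

Keeping the grid this small trades some accuracy for a short training time, which matches the stated goal of producing a usable model quickly.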
After the machine learning model is trained, first interpretation information can be presented that characterizes the importance of one or more of the at least some other fields to the machine learning model's prediction of the target field. While the importance is being calculated, the first interpretation information may be displayed progressively. The first interpretation information may characterize the importance of all the fields to the machine learning model's prediction of the target field, or only of those fields whose importance is higher.
As shown in figs. 3 to 5, the right area of the interface may provide a control for starting automatic training of a machine learning model. When the user has selected the name icon of a target field in the left area of the interface, or has selected the chart of the first data analysis information of the target field in the middle area of the interface, then in response to an operation that starts automatic training (for example, a click on the "auto ml-auto-trained parsing model" control) performed in the right area of the interface, the machine learning model is automatically trained with the field values corresponding to at least some of the other fields in a single piece of data as input and the field value corresponding to the target field in that piece of data as output. First interpretation information characterizing the importance of one or more of the at least some other fields to the machine learning model's prediction of the target field may then be presented in the right area of the interface.
In the present invention, the importance of a field can be calculated using the Shapley value (SHAP value) allocation scheme from game theory, which converts the calculation of field importance into the problem of fairly distributing a payoff (namely, the predicted value) among multiple cooperating fields. As an example, the score predicted by the machine learning model for a single piece of data can be allocated among the at least some other fields according to the Shapley value allocation scheme to obtain each field's score under that piece of data; the importance of each field to the machine learning model's prediction of the target field is then determined from the sum of that field's scores over multiple pieces of data, where the importance is positively correlated with the sum of the scores.
The score of a field under a single piece of data, obtained according to the Shapley value allocation scheme, can be positive or negative. A positive value indicates that the field raises the predicted value that the machine learning model produces for the target field, that is, it has a positive effect on the target field; a negative value indicates that the field lowers that predicted value, that is, it has a negative effect on the target field.
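The allocation can be illustrated with an exact Shapley computation for one piece of data. This is a minimal sketch under stated assumptions: a field that is absent from a coalition is replaced by a baseline value (one common convention, not specified in the source), the model is a made-up linear function, and every coalition is enumerated, which is only feasible for a handful of fields. Practical SHAP implementations use approximations instead.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, row, baseline):
    """Exact Shapley values of `predict` for one row, relative to a baseline row.

    Fields outside a coalition take their baseline value (an assumed convention).
    The cost grows as 2**n, so this is only suitable for a few fields.
    """
    fields = list(row)
    n = len(fields)

    def value(coalition):
        mixed = {f: (row[f] if f in coalition else baseline[f]) for f in fields}
        return predict(mixed)

    scores = {}
    for f in fields:
        others = [g for g in fields if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        scores[f] = total
    return scores


def toy_model(r):
    # Hypothetical model: predicted probability of leaving.
    return 0.2 + 0.3 * r["overtime"] - 0.1 * r["stock_option_level"]


row = {"overtime": 1, "stock_option_level": 2}
baseline = {"overtime": 0, "stock_option_level": 0}
scores = shapley_values(toy_model, row, baseline)
```

For this toy row the scores add up to the difference between the prediction for the row and the prediction for the baseline, which is the "fair distribution" property the text refers to.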
When calculating the sum of the scores of each field under the plurality of pieces of data, the sum of the absolute values of the scores of the fields under the plurality of pieces of data may be used as the importance degree of the field to the machine learning model prediction target field, or the average of the absolute values of the scores of the fields under the plurality of pieces of data may be used as the importance degree of the field to the machine learning model prediction target field.
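The aggregation step can be sketched as follows, using the mean-of-absolute-values variant described above. The per-row scores are invented numbers and the function names are hypothetical.

```python
def field_importance(per_row_scores):
    """Mean absolute per-row score for each field."""
    n = len(per_row_scores)
    fields = per_row_scores[0].keys()
    return {f: sum(abs(r[f]) for r in per_row_scores) / n for f in fields}


def top_fields(importance, k):
    """The k fields with the highest importance, most important first."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


# Invented per-row scores (one dictionary per piece of data), for illustration only.
per_row_scores = [
    {"overtime": 0.30, "stock_option_level": -0.20, "gender": 0.01},
    {"overtime": -0.25, "stock_option_level": 0.10, "gender": -0.02},
]
importance = field_importance(per_row_scores)
most_important = top_fields(importance, k=2)
```

Taking absolute values before averaging prevents positive and negative scores of the same field from cancelling each other out.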
While the importance is being calculated, the first interpretation information may be displayed progressively. For example, since calculating the importance takes a long time, the first interpretation information may first be displayed roughly, in a relatively blurred state, and then gradually become clearer as the calculation approaches completion. As for the display form of the first interpretation information, as an example, different display characteristics may be given to a field according to the magnitude of its calculated importance. For example, the greater the importance of the field, the darker its color. Alternatively, different colors may be used to identify whether a field has a positive or negative effect on the target field; for example, red may characterize a positive effect and blue a negative effect. That is, the more important a field is to the target field with a positive effect, the redder it is colored, and the more important a field is to the target field with a negative effect, the bluer it is colored.
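The color rule in the last example can be sketched as a mapping from a signed score to an RGB tuple. The exact color scale is not given in the source; this sketch assumes a linear fade from white to pure red or pure blue, and the function name is hypothetical.

```python
def field_color(score, max_abs_score):
    """Map a signed score to (r, g, b): red for positive, blue for negative.

    Larger magnitudes give a more saturated color (assumed linear scale).
    """
    if max_abs_score <= 0:
        return (255, 255, 255)
    strength = min(abs(score) / max_abs_score, 1.0)
    fade = round(255 * (1 - strength))
    if score >= 0:
        return (255, fade, fade)
    return (fade, fade, 255)


strong_positive = field_color(1.0, 1.0)
strong_negative = field_color(-1.0, 1.0)
neutral = field_color(0.0, 1.0)
```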
As shown in fig. 6A, the first interpretation information (i.e., the model automatic analysis result shown on the right side) may be presented in the right area of the interface. Fig. 6B shows an enlarged schematic view of the results of the automatic analysis of the model in fig. 6A.
As shown in fig. 6B, based on the calculated importance of the fields, the several fields that have the greatest influence (i.e., the greatest importance) on the target field may be described to the user in business language: whether overtime, stock option level, and job role.
Each row of the chart in fig. 6B represents a field, the abscissa represents the SHAP value, and each point represents a piece of data; the SHAP value of a point in a given row is the score, under the corresponding piece of data, of that field's value. It should be noted that fig. 6B is shown in grayscale, without color, to meet application filing requirements. In practice, the vertical bar on the right side of fig. 6B indicates that the magnitude of a field's value can be represented by a color gradient, that is, by a display characteristic. As an example, field values from small to large may be represented by a gradient from color A (e.g., blue) to color B (e.g., red): the larger the field value, the redder the color, and the smaller the field value, the bluer the color.
In fig. 6B, the red portion of the "whether overtime" field indicates that a long overtime duration increases the probability of leaving, and the blue portion indicates that little or no overtime decreases the probability of leaving. Similarly, the red field values of the "stock option level" field are located in the left half, indicating that a high stock option level decreases the probability of leaving, and the blue field values of the "stock option level" field are mostly located in the right half, indicating that a low stock option level increases the probability of leaving.
Therefore, the user can intuitively know the influence of the field value size of the different fields on the target field by displaying the relationship between the field value size of the different fields in the data set and the scores (namely, SHAP values) of the field under different data in the data set in different colors.
Fig. 6B illustrates how the size of a field value characterizing a field in terms of color affects the target field. As shown in fig. 7, the present invention can also show the influence of the field value of a certain field on the target field in a two-dimensional coordinate graph.
Specifically, the present invention may distribute the score predicted by the machine learning model for a single piece of data among at least some of the other fields in the manner of Shapley value allocation, to obtain the score of each field under that piece of data. The score may be positive or negative; reference may be made to the above description, which is not repeated herein. For a single field, the scores of the field under multiple pieces of data in the data set may be presented in a two-dimensional coordinate system, one axis of which characterizes the field value of the field and the other axis of which characterizes the score (i.e., the SHAP value) of the field.
Referring to fig. 7, taking the "Age" field as an example, the abscissa is the age value, the ordinate is the SHAP value, and each point in the figure represents a piece of data. As can be seen from fig. 7, the smaller the age, the larger the SHAP value, indicating that the younger the employee, the greater the probability of leaving the job. Employees between the ages of 30 and 45 have smaller SHAP values, indicating that their probability of leaving the job is smaller.
Therefore, by representing the relationship between the field value size of a field and the score (namely, the SHAP value) in the two-dimensional coordinate system, the user can intuitively know the influence of the field value size of a certain field on the target field.
Optionally, the display characteristic of a coordinate point in the two-dimensional coordinate system may also be used to characterize the value of another field in the data corresponding to that coordinate point. Fig. 7 is shown in gray scale without color to meet the requirements for application documents. In practice, the vertical bar on the right side of fig. 7 indicates that the length of service (work age) can be represented by color, for example from short to long by a gradient from color A (e.g., blue) to color B (e.g., red). For example, the longer the length of service, the redder the color, and the shorter the length of service, the bluer the color. In fig. 7, the length of service in the data corresponding to a coordinate point may thus be represented by the display characteristic of the coordinate point: a red coordinate point indicates a longer length of service, and a blue coordinate point indicates a shorter length of service. Referring to fig. 7, the coordinate points with high SHAP values between the ages of 20 and 30 are blue, indicating that employees aged 20 to 30 with high SHAP values mostly have a short length of service, and the coordinate points with low SHAP values between the ages of 30 and 45 are mostly blue, indicating that for employees aged 30 to 45, the shorter the length of service, the lower the SHAP value. Therefore, from the distribution of the display characteristics of the coordinate points in the two-dimensional coordinate system, the user can understand how multiple fields jointly influence the target field.
In response to a user's selection operation on a piece of data, the prediction result of the machine learning model for the piece of data can be output, together with second interpretation information on the importance, to the prediction result, of one or more of at least some of the other fields in the piece of data. For example, a control for inputting a piece of data can be provided to the user in the right area of the interface, and the piece of data input by the user can be received. The prediction result of the machine learning model for the piece of data is then displayed in the right area of the interface, together with the second interpretation information on the importance, to the prediction result, of one or more of at least some of the other fields in the piece of data.
The second interpretation information is used for characterizing the importance of each field in the piece of data to the prediction result of the target field obtained when the machine learning model predicts the piece of data. The importance can also be calculated by Shapley value allocation: the score predicted by the machine learning model for the piece of data is distributed among at least some of the other fields in the manner of Shapley value allocation to obtain the score of each field under the piece of data, and the importance is positively correlated with the score.
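The allocation of a single prediction score to individual fields can be sketched as an exact Shapley value computation. This is a minimal sketch under stated assumptions: the toy model `toy_model`, the field names, and the use of a baseline record to stand in for "absent" fields are all hypothetical, and the exhaustive enumeration is only practical for a handful of fields (real systems typically use approximations).

```python
from itertools import combinations
from math import factorial


def shapley_scores(predict, row, baseline):
    """Distribute predict(row) - predict(baseline) over the fields of `row`.

    A field that is absent from a coalition takes its baseline value.
    Exact enumeration over all coalitions, so cost grows as 2^n.
    """
    fields = list(row)
    n = len(fields)

    def value(coalition):
        mixed = {f: (row[f] if f in coalition else baseline[f]) for f in fields}
        return predict(mixed)

    scores = {}
    for f in fields:
        others = [g for g in fields if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition without f
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value(s | {f}) - value(s))
        scores[f] = total
    return scores


def toy_model(r):
    """Hypothetical attrition model, used only for illustration."""
    return (0.2 + 0.3 * r["overtime"] - 0.1 * r["stock_option_level"]
            + 0.05 * r["overtime"] * (r["age"] < 30))


row = {"overtime": 1, "stock_option_level": 0, "age": 25}
baseline = {"overtime": 0, "stock_option_level": 1, "age": 40}
scores = shapley_scores(toy_model, row, baseline)
```

The per-field scores sum to the difference between the prediction for the record and the prediction for the baseline, which is what allows them to be read as positive or negative contributions to the prediction result.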
As shown in fig. 8, predicting a certain piece of data with the machine learning model gives a probability of leaving the job of 0.49, and the importance (i.e., the degree of influence) of the field values that most strongly influence the predicted probability is shown. The field values above the predicted value 0.49 (such as level 2, age 41, distance from home to the company 1, and work satisfaction 4) have a negative influence on the prediction result, and the field values below the predicted value 0.49 (such as working overtime, job role of supervisor, work-life balance 1, length of service 8, stock option level 0, relationship satisfaction 1, environment satisfaction 2, and employee number 1) have a positive influence on the prediction result. The value to the left of each field value indicates its degree of influence on the prediction result. Fig. 8 is shown in gray scale without color to meet the requirements for application documents. In practice, fig. 8 may use different colors to indicate whether a field value has a positive or negative influence on the prediction result. For example, color A (e.g., blue) may be used above the predicted value 0.49 and color B (e.g., red) below the predicted value 0.49.
By way of example, in response to an adjustment operation of a user on a field value of one or more fields in a piece of data, a prediction result of the machine learning model on the adjusted piece of data can be output, and second interpretation information of the importance degree of the one or more fields in at least some other fields in the adjusted piece of data on the prediction result can be output. For the second interpretation information and the display manner thereof, reference may be made to the above description, which is not repeated herein.
By way of example, given the prediction result that the user expects for the target field of a piece of data, the machine learning model can be used to output the changes in the field values of at least some of the other fields in the piece of data that would be needed to reach it. For example, when the machine learning model predicts an employee's probability of leaving the job from the employee's information, and the probability of leaving is expected to decrease by 20%, the required field value changes can be output, such as how much the salary needs to be increased, how much the overtime needs to be decreased, and how much the stock options need to be increased.
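One simple way to obtain such field value changes is to try candidate values for the adjustable fields and keep a combination that reaches the expected prediction with the fewest changes. The sketch below assumes this brute-force strategy, which is not stated in this description; the model `leave_probability`, the field names, and the candidate values are hypothetical.

```python
from itertools import product


def suggest_changes(predict, row, candidates, target):
    """Return a smallest set of field changes whose prediction is <= target.

    `candidates` maps each adjustable field to the values it may take.
    All combinations are enumerated; among combinations with the same
    number of changes, the first one found is kept. Returns None if no
    combination reaches the target.
    """
    fields = list(candidates)
    best = None
    for values in product(*(candidates[f] for f in fields)):
        changes = {f: v for f, v in zip(fields, values) if v != row[f]}
        if predict({**row, **changes}) <= target:
            if best is None or len(changes) < len(best):
                best = changes
    return best


def leave_probability(r):
    """Hypothetical stand-in for the trained machine learning model."""
    return (0.5 + 0.25 * r["overtime"]
            - 0.125 * r["stock_option_level"]
            - 0.0625 * r["raise_steps"])


employee = {"overtime": 1, "stock_option_level": 0, "raise_steps": 0}
options = {
    "overtime": [1, 0],
    "stock_option_level": [0, 1, 2],
    "raise_steps": [0, 1, 2],
}
# The current prediction is 0.75; the user expects at most 0.55.
suggestion = suggest_changes(leave_probability, employee, options, target=0.55)
```

With more fields or finer candidate grids the enumeration grows quickly, so a practical system would likely restrict the search to the fields with the highest importance.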
The method can also display, in a two-dimensional coordinate system, the prediction results obtained when the machine learning model predicts a plurality of pieces of data. The two-dimensional coordinate system contains a plurality of coordinate points, each coordinate point corresponds to one piece of data, the display characteristic of a coordinate point characterizes the prediction result of its data, and the distance between two coordinate points in the two-dimensional space is positively correlated with the distance, in the multi-dimensional space, between the two pieces of data corresponding to those coordinate points. Each piece of data has multiple dimensions, and each dimension corresponds to one of at least some of the other fields, so the plurality of pieces of data can be displayed in a plane (i.e., a two-dimensional coordinate system) through dimension reduction. The distance relationship between different pieces of data is preserved during dimension reduction: two pieces of data that are close in the multi-dimensional space are also close on the two-dimensional plane after dimension reduction. As a result, data that receive the same predicted class for similar reasons are clustered together, and the user can clearly understand the classification characteristics of the data set. The detailed description of the dimension reduction is omitted here.
As an example, the score predicted by the machine learning model for a single piece of data may be distributed among at least some of the other fields in the manner of Shapley value allocation to obtain the score of each field under that piece of data. The score of a field is the value of the data in the corresponding dimension, and the position of each piece of data in the multi-dimensional space is determined by its values in the multiple dimensions.
As an example, the two-dimensional coordinate system shown in the lower half of fig. 9 may include a plurality of coordinate points. Fig. 9 is shown in gray scale without color to meet the requirements for application documents. In practice, the coordinate points in fig. 9 may carry colors, and the color of a coordinate point may represent the prediction result obtained when the machine learning model predicts the data corresponding to that coordinate point. For example, the prediction result "no" (i.e., not leaving the job) may be represented by color A (e.g., blue), and the prediction result "yes" (i.e., leaving the job) by color B (e.g., red). Therefore, the user can intuitively understand the distribution of the prediction results (such as whether employees leave the job) over the plurality of pieces of data.
As an example, in response to a user's selection operation on one or more coordinate points in the two-dimensional coordinate system, a predetermined number of coordinate points having the same prediction result as the selected coordinate point are selected from the vicinity of the selected coordinate point to obtain a plurality of cluster coordinate points; one or more key fields are extracted from the pieces of data corresponding to the cluster coordinate points based on the ranking of the field scores, to obtain a key field group; and the key field group is output. The score of a field mentioned here may be the sum of the scores of the field under the data corresponding to the cluster coordinate points. For the calculation of the score of a field under a single piece of data, reference may be made to the above description, which is not repeated herein. The output key field group may include the extracted key fields and statistics (e.g., averages) of the field values of the key fields under the corresponding data.
As shown in fig. 9, in response to a user's selection operation on a coordinate point predicted to leave the job in the two-dimensional coordinate system, two groups of typical features may be output, that is, similar features shared by employees predicted to leave the job. Group 1 may include fields such as stock option level, number of companies worked for, and work environment satisfaction, together with statistics of the field values of those fields, and group 2 may include fields such as whether the employee works overtime, monthly income, and stock option level, together with statistics of the field values of those fields.
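The extraction of a key field group around a selected coordinate point can be sketched as follows. The record layout (`xy`, `prediction`, `scores`, `values`), the Euclidean distance, and the use of the average as the statistic are assumptions made for this sketch.

```python
from math import dist


def key_field_group(points, selected, k, top_n):
    """Summarize the fields that drive a cluster of similar predictions.

    `points` is a list of dicts with keys:
      'xy'         - position in the two-dimensional coordinate system
      'prediction' - predicted class of the piece of data
      'scores'     - per-field scores under this piece of data
      'values'     - numeric field values of this piece of data
    Returns {key field: average field value over the cluster}.
    """
    target = points[selected]
    # Candidates: other points with the same prediction result.
    same = [p for i, p in enumerate(points)
            if i != selected and p["prediction"] == target["prediction"]]
    # Keep the k nearest ones in the two-dimensional plane.
    same.sort(key=lambda p: dist(p["xy"], target["xy"]))
    cluster = [target] + same[:k]
    # Score of a field = sum of its scores over the cluster.
    totals = {}
    for p in cluster:
        for field, score in p["scores"].items():
            totals[field] = totals.get(field, 0.0) + score
    keys = sorted(totals, key=totals.get, reverse=True)[:top_n]
    return {f: sum(p["values"][f] for p in cluster) / len(cluster) for f in keys}


# Hypothetical data: three employees predicted to leave, one predicted to stay.
points = [
    {"xy": (0.0, 0.0), "prediction": "yes",
     "scores": {"overtime": 0.3, "stock_option_level": 0.1, "age": 0.05},
     "values": {"overtime": 1, "stock_option_level": 0, "age": 25}},
    {"xy": (0.1, 0.0), "prediction": "yes",
     "scores": {"overtime": 0.2, "stock_option_level": 0.2, "age": 0.0},
     "values": {"overtime": 1, "stock_option_level": 1, "age": 30}},
    {"xy": (5.0, 5.0), "prediction": "yes",
     "scores": {"overtime": 0.0, "stock_option_level": 0.0, "age": 0.4},
     "values": {"overtime": 0, "stock_option_level": 3, "age": 22}},
    {"xy": (0.05, 0.0), "prediction": "no",
     "scores": {"overtime": -0.2, "stock_option_level": -0.1, "age": -0.1},
     "values": {"overtime": 0, "stock_option_level": 2, "age": 40}},
]
group = key_field_group(points, selected=0, k=1, top_n=2)
```

Here the nearby point predicted "no" is ignored even though it is the closest, because only points sharing the selected point's prediction result join the cluster.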
The method of assisting a user in exploring a data set of the present invention has been described above with reference to fig. 2 through 9. The method can be implemented as a data analysis platform, and the user can explore a data set by importing it into the data analysis platform. The data analysis platform can be designed as a web application that can be opened in a browser, or as an application program that can be installed and run on electronic devices such as a mobile phone or an iPad.
The method of assisting a user in exploring a data set of the present invention may also be implemented as a plug-in that runs in an application capable of opening a data table (e.g., the spreadsheet software Excel). That is, in response to a user opening a data table in the application, the plug-in implementing the method of assisting a user in exploring a data set of the present invention may be run on the data set in the data table.
The flow of the method for assisting a user to explore a data table is further described below by taking the implementation of the present invention as a plug-in running in an application.
The invention also provides a method for assisting the user in exploring a data table, which comprises: in response to the user opening the data table in an application, running a plug-in in the application to perform the following steps S1 to S3. The application is an application capable of opening a data table, such as, but not limited to, Excel or other spreadsheet software.
In step S1, an exploration area is displayed in a predetermined area of the data table. The predetermined area may be at least one of the left side, the right side, the upper side, and the lower side of the data table. The data table may be an Excel table.
In step S2, in response to a user's prediction request for a target field in the data table, the machine learning model is automatically trained with the field values of at least some of the other fields in a single piece of data as input and the field value of the target field in that piece of data as output. When the machine learning model is automatically trained, an automatic search can be performed in a very small hyper-parameter space around hyper-parameters preset according to experience.
In step S3, first interpretation information characterizing the importance of at least some of the other fields to the target field is output in the exploration area. The first interpretation information may characterize the importance of all of the fields to the machine learning model's prediction of the target field, or only of the subset of fields with higher importance. For the calculation of the importance of the fields, see the above description, which is not repeated herein.
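The search in a small hyper-parameter space around preset values can be sketched as a tiny grid search. Everything concrete here is an assumption made for illustration: the hyper-parameter names, the offsets, and in particular `train_fn` and `score_fn`, which are stand-ins for real training and validation.

```python
from itertools import product


def auto_train(train_fn, score_fn, preset, deltas):
    """Search a small grid around empirically preset hyper-parameters.

    `preset` maps hyper-parameter names to their preset values and
    `deltas` maps each name to the offsets to try around the preset.
    Returns (best score, best hyper-parameters, best model).
    """
    names = list(preset)
    best = None
    for offsets in product(*(deltas[n] for n in names)):
        params = {n: preset[n] + o for n, o in zip(names, offsets)}
        model = train_fn(params)
        score = score_fn(model)
        if best is None or score > best[0]:
            best = (score, params, model)
    return best


# Stand-ins: "training" just records the parameters, and the validation
# score peaks at max_depth=4, n_trees=50. A real plug-in would fit and
# validate an actual model here.
def train_fn(params):
    return dict(params)


def score_fn(model):
    return -abs(model["max_depth"] - 4) - abs(model["n_trees"] - 50) / 100


preset = {"max_depth": 3, "n_trees": 50}
deltas = {"max_depth": [-1, 0, 1], "n_trees": [-10, 0, 10]}
best_score, best_params, best_model = auto_train(train_fn, score_fn, preset, deltas)
```

Because only nine combinations are tried, the search stays fast enough to run interactively after the user clicks the modeling tool.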
While the importance is being calculated, the first interpretation information may be displayed progressively. For example, because calculating the importance takes a long time, the first interpretation information may first be displayed roughly in a relatively blurred state and then become gradually clearer as the calculation approaches completion. As for the display form of the first interpretation information, as an example, different display characteristics may be given to a field according to the magnitude of its calculated importance. For example, the greater the importance of the field, the darker its color may be. Alternatively, different colors may be used to identify whether a field has a positive or negative effect on the target field, e.g., red may characterize a positive effect and blue a negative effect. That is, the more important a field is to the target field with a positive effect, the redder it is colored, and the more important a field is to the target field with a negative effect, the bluer it is colored.
Take the case where the application is Excel and the data table is an Excel table as an example. As shown in fig. 10, an "automatic modeling" tool may be provided in the Excel toolbar. The user may first select the attrition (Attrition) data column as the target field to be predicted, and then click the "automatic modeling" tool to start automatic training of the machine learning model.
As shown in fig. 11, after the training is completed, the exploration area may be displayed on the right side of the Excel table, and interpretation information on how the field values of different fields affect the target field may be output in the exploration area. For the interpretation information output on the right side of fig. 11, reference may be made to the above description in conjunction with fig. 6B, which is not repeated herein.
As shown in fig. 12, a two-dimensional coordinate graph characterizing the influence of a single field on the target field may be output in the exploration area. One coordinate axis of the graph characterizes the field value of the field and the other characterizes the score of the field. The graph includes a plurality of coordinate points, each corresponding to one piece of data, and the score of the field is obtained by distributing the score predicted by the machine learning model for a single piece of data among at least some of the other fields in the manner of Shapley value allocation. Optionally, the display characteristic of a coordinate point in the graph may also be used to characterize the value of another field in the data corresponding to that coordinate point. For the two-dimensional coordinate graph output on the right side of fig. 12, reference may be made to the above description in conjunction with fig. 7, which is not repeated herein.
As shown in fig. 13, in response to a prediction request of a user for a piece of data in the data table, a prediction result of the machine learning model for the piece of data is output in the exploration area, and second interpretation information of the importance degree of one or more fields of at least some other fields in the piece of data to the prediction result is output in the exploration area.
The second interpretation information is used for characterizing the importance of each field in the piece of data to the prediction result of the target field obtained when the machine learning model predicts the piece of data. The importance can also be calculated by Shapley value allocation: the score predicted by the machine learning model for the piece of data is distributed among at least some of the other fields in the manner of Shapley value allocation to obtain the score of each field under the piece of data, and the importance is positively correlated with the score.
For the second interpretation information output on the right side of fig. 13, reference may be made to the above description in conjunction with fig. 8, which is not repeated herein. In response to a user's prediction request for a certain piece of data in the data table, specific display characteristics can be given to the field values of different fields of that piece of data in the data table, so that the display characteristic of a field value characterizes its influence on the prediction result. Fig. 13 is shown in gray scale without color to meet the requirements for application documents. In practice, in fig. 13, a negative influence of a field value on the prediction result can be represented by color A (e.g., blue) and a positive influence by color B (e.g., red), and the darker the color of a field value, the greater its influence on the prediction result.
As an example, in response to a user's adjustment operation on the field values of one or more fields in a certain piece of data in the data table, the prediction result of the machine learning model for the adjusted piece of data is output, and second interpretation information on the importance, to the prediction result, of one or more of at least some of the other fields in the adjusted piece of data is output in the exploration area. For the second interpretation information and the manner of displaying it, reference may be made to the above description, which is not repeated herein.
As shown in fig. 14, a control capable of adjusting one or more field values in a piece of data may be presented to a user, the user may adjust the one or more field values through the control, and after the adjustment is completed, a prediction (Predict) button may be clicked to output a prediction result and second interpretation information of the machine learning model for the adjusted piece of data.
By way of example, given the prediction result that the user expects for the target field of a piece of data in the data table, the machine learning model can be used to output the changes in the field values of at least some of the other fields in the piece of data that would be needed to reach it. For example, when the machine learning model predicts an employee's probability of leaving the job from the employee's information, and the probability of leaving is expected to decrease by 20%, the required field value changes can be output, such as how much the salary needs to be increased, how much the overtime needs to be decreased, and how much the stock options need to be increased.
By way of example, in response to a user's selection operation on one or more data columns in the data table, first data analysis information used for characterizing data statistics of a field value corresponding to the data column selected by the user may be output in the exploration area, and one or more second data analysis information may be recommended to the user, where the second data analysis information is predicted based on the data column selected by the user.
The second data analysis information is used for representing the data statistics of the field values corresponding to the other predicted data columns, and/or the second data analysis information is used for representing the data statistics of the field value combination corresponding to the data column combination formed by the data column selected by the user and the other predicted data columns.
In the present invention, one or more of a variety of ways, including but not limited to machine learning models, statistical correlations, business rules, etc., may be employed to predict data analysis information suitable for recommendation to a user.
Taking the prediction of data analysis information suitable for recommendation to a user based on a machine learning model as an example, the machine learning model can be used to predict, from the relevant information of the data column currently selected by the user, the probability that the user will subsequently select each of the other data columns. The machine learning model is trained by taking the relevant information of a previously selected data column and the relevant information of the other data columns as input, and taking the probability of each other data column being selected as output. The second data analysis information is then obtained either by analyzing the data statistics of the field values of the data columns whose probability is greater than a first predetermined threshold, or by analyzing those data statistics on the basis of the first data analysis information of the data column selected by the user.
Optionally, the user's feedback on whether the one or more pieces of recommended second data analysis information were accepted may be acquired, and the machine learning model may be updated based on the acquired feedback.
Taking the statistical correlation-based prediction of data analysis information suitable for recommendation to a user as an example, a data column with statistical correlation higher than a second predetermined threshold with respect to a data column selected by the user can be obtained according to the statistical correlation between different data columns; and analyzing the data statistics of the field value corresponding to the data column with the statistical relevance higher than the second preset threshold value to obtain second data analysis information, or analyzing the data statistics of the field value corresponding to the data column with the statistical relevance higher than the second preset threshold value on the basis of the first data analysis information of the data column selected by the user to obtain the second data analysis information.
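The correlation-based recommendation can be sketched with the Pearson correlation coefficient. The choice of Pearson correlation, the use of its absolute value, and the column names and threshold are assumptions made for this sketch; the description does not fix a particular correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def recommend_columns(table, selected, threshold):
    """Columns whose absolute correlation with `selected` exceeds `threshold`.

    `table` maps column names to lists of numeric field values.
    The result is sorted from the strongest to the weakest correlation.
    """
    found = []
    for name, column in table.items():
        if name == selected:
            continue
        r = pearson(table[selected], column)
        if abs(r) > threshold:
            found.append((name, r))
    return sorted(found, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data table with three numeric data columns.
table = {
    "age": [25, 30, 35, 40, 45],
    "years_of_service": [2, 6, 10, 15, 20],
    "employee_number": [3, 1, 5, 2, 4],
}
recommended = recommend_columns(table, "age", threshold=0.8)
```

The data statistics of each recommended column would then be analyzed to produce the second data analysis information.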
As an example, the exploration area may include a first interface area and a second interface area. Outputting the first data analysis information in the exploration area in response to a user's selection operation on one or more data columns in the data table includes: presenting a chart of the first data analysis information in the first interface area in response to the user selecting a specific data column with the cursor. Recommending one or more pieces of second data analysis information to the user includes: displaying the chart of the first data analysis information in the first interface area and displaying one or more charts of the second data analysis information in the second interface area.
Optionally, when the data table is opened in the application, the data statistics of the field values of the data columns may be pre-calculated and cached in memory. In this case, presenting the chart of the first data analysis information in the first interface area includes: displaying, in the first interface area, a chart of the first data analysis information generated from the corresponding data statistics retrieved from memory.
As an example, in response to a user selecting a certain field value in the chart of the first data analysis information, the one or more displayed charts of the second data analysis information are automatically updated to show the data statistics of that field value in the corresponding other field dimensions.
Optionally, in response to a user selecting with the cursor at least one chart of the second data analysis information displayed in the second interface area, dragging it to the first interface area, and releasing it, the at least one chart of the second data analysis information is displayed in the first interface area.
As an example, in response to a user's connection operation on two charts in the first interface area, one of the two charts is connected to the other. In response to the user selecting a field value characterized in the one chart in the connected state, the selected field value is highlighted relative to the unselected field values, and/or the other chart in the connected state is updated to characterize the data statistics of the selected field value in the field dimension corresponding to the other chart.
By way of example, in response to a user's connection operation on two charts in the first interface area, one of the two charts is connected to the other, and the other chart is updated to characterize the data statistics, in the field dimension corresponding to the other chart, of the field values characterized by the one chart.
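Pre-calculating and caching the per-column statistics can be sketched as below. The class name `ColumnStatsCache`, the choice of statistics, and the assumption that every column is non-empty are made for illustration only.

```python
from collections import Counter


class ColumnStatsCache:
    """Pre-compute data statistics for every column and keep them in memory."""

    def __init__(self, table):
        # Computed once, e.g. when the data table is opened.
        self._stats = {name: self._summarize(col) for name, col in table.items()}

    @staticmethod
    def _summarize(column):
        # Numeric columns get a range and mean; others get value counts.
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in column)
        if numeric:
            return {"kind": "numeric", "min": min(column), "max": max(column),
                    "mean": sum(column) / len(column)}
        return {"kind": "categorical", "counts": dict(Counter(column))}

    def get(self, name):
        """Return the cached statistics without touching the raw data again."""
        return self._stats[name]


# Hypothetical data table with one numeric and one categorical column.
cache = ColumnStatsCache({
    "age": [30, 35, 40],
    "overtime": ["yes", "no", "yes"],
})
age_stats = cache.get("age")
overtime_stats = cache.get("overtime")
```

When the user later selects a data column, the chart can be drawn directly from `cache.get(...)`, so the first chart appears without recomputation.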
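The update of a connected chart can be sketched as a filtered count: the second chart is recomputed using only the pieces of data whose field value matches the selection in the first chart. The function name and the field names below are hypothetical.

```python
from collections import Counter


def linked_chart_counts(rows, source_field, selected_value, target_field):
    """Counts for the connected chart after a selection in the source chart.

    Only pieces of data whose `source_field` equals `selected_value`
    contribute to the statistics shown in the `target_field` dimension.
    """
    return dict(Counter(r[target_field] for r in rows
                        if r[source_field] == selected_value))


# Hypothetical data set: the first chart shows "overtime", the second "job_role".
rows = [
    {"overtime": "yes", "job_role": "supervisor"},
    {"overtime": "yes", "job_role": "engineer"},
    {"overtime": "no", "job_role": "engineer"},
    {"overtime": "yes", "job_role": "engineer"},
]
updated = linked_chart_counts(rows, "overtime", "yes", "job_role")
```

Selecting "yes" in the overtime chart thus redraws the job role chart with the distribution of job roles among employees who work overtime.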
The method for assisting the user in exploring the data set can also be realized as a device for assisting the user in exploring the data set. Fig. 15 illustrates a block diagram of an apparatus for assisting a user in exploring a data set according to an exemplary embodiment of the present invention. Wherein the functional elements of the apparatus that assist a user in exploring data sets may be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional units described in fig. 15 may be combined or divided into sub-units to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional units described herein.
In the following, a brief description is given of functional units that the apparatus for assisting a user in exploring a data set may have and operations that each functional unit may perform, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 15, an apparatus 200 for assisting a user in exploring a data set includes an output module 210 and a recommendation module 220. The data set includes a plurality of pieces of data, each piece of data including values of one or more fields.
The output module 210 is configured to output, in response to a field selection operation of a user, first data analysis information, where the first data analysis information is used to characterize data statistics of the field values corresponding to the field selected by the user. The recommendation module 220 is configured to recommend one or more pieces of second data analysis information to the user, where the second data analysis information is predicted based on the field selected by the user. For the second data analysis information, see the above description, which is not repeated herein.
By way of example, the recommendation module 220 may include, but is not limited to, one or any combination of a first recommendation module, a second recommendation module, and a third recommendation module. The first recommendation module is used for predicting data analysis information suitable for recommendation to the user based on a machine learning model, the second recommendation module is used for predicting such information based on statistical correlation, and the third recommendation module is used for predicting such information based on business rules. For the prediction mechanisms of the first, second, and third recommendation modules, reference may be made to the above description, which is not repeated herein.
The apparatus 200 for assisting a user in exploring a data set may further comprise a display module.
Embodiment 1
In this embodiment, the display module is configured to display a first interface area and a second interface area. The display module displays the first data analysis information in the first interface area and the second data analysis information in the second interface area. In response to an operation of the user, the display module further displays, in the first interface area, second data analysis information from the second interface area.
In response to a connection operation of a user on two pieces of data analysis information in the first interface area, the display module may further connect one piece of data analysis information to another piece of data analysis information, and the other piece of data analysis information is updated to be used for characterizing data statistics of a field value characterized by the one piece of data analysis information in a field dimension corresponding to the other piece of data analysis information.
In response to a user's operation of connecting two pieces of data analysis information in the first interface area, the display module connects one piece of data analysis information to the other piece. In response to a user's operation of selecting a field value characterized by one piece of data analysis information in the connected state, the display module highlights the data statistics of the selected field value in that piece of data analysis information relative to the data statistics of the unselected field values, and/or the display module updates the other piece of data analysis information in the connected state to characterize data statistics of the selected field value in the field dimension corresponding to the other piece of data analysis information.
Embodiment 2
In this embodiment, the display module may be configured to display an interface for exploring a data set, where a left area of the interface is used to display name icons corresponding to each field of the data set, a middle area of the interface (corresponding to the first interface area) is used to display a graph of first data analysis information, and a right area of the interface (corresponding to the second interface area) is used to display a graph of second data analysis information. When a data set is imported, the display module displays, in the left area of the interface, name icons corresponding to the fields included in each piece of data in the data set. In response to the user selecting the name icon of a specific field with a cursor, dragging it to the middle area of the interface, and releasing it, the output module displays a chart of first data analysis information in the middle area of the interface, and at the same time the recommendation module displays one or more charts of second data analysis information in the right area of the interface.
The apparatus 200 for assisting a user in exploring a data set may further include a first calculating module, configured to calculate, in advance when the data set is imported, data statistics of the field values corresponding to each field, and to cache the calculated data statistics in a memory. The output module then displays, in the middle area of the interface, a graph of first data analysis information generated from the corresponding data statistics extracted from the memory.
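The pre-calculation and caching described above can be sketched as follows. This is a minimal illustration, assuming that the "data statistics" of a field are simple value counts and that the cache is an in-memory dictionary; the function name `precompute_field_statistics` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def precompute_field_statistics(rows: List[Dict[str, Hashable]]) -> Dict[str, Counter]:
    """Count how often each value occurs, per field, in one pass over the data."""
    stats: Dict[str, Counter] = {}
    for row in rows:
        for field, value in row.items():
            stats.setdefault(field, Counter())[value] += 1
    return stats


# Hypothetical data set with two fields.
rows = [
    {"city": "Beijing", "plan": "basic"},
    {"city": "Shanghai", "plan": "pro"},
    {"city": "Beijing", "plan": "pro"},
]

# Computed once at import time and kept in memory.
stats_cache = precompute_field_statistics(rows)

# When the user later selects the "city" field, the chart data is a lookup,
# not a fresh scan of the data set.
first_chart_data = stats_cache["city"]
```

The design choice is a trade of memory for latency: one pass at import time makes every later field selection a constant-time dictionary lookup.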
In response to a user selecting a certain field value in the chart of the first data analysis information, the display module automatically updates the chart of the corresponding displayed one or more second data analysis information into data statistics of the certain field value under the corresponding other field dimensions.
The apparatus 200 for assisting a user in exploring a data set may further include a second calculating module, configured to pre-calculate, for one or more frequently selected field values in the graph of the first data analysis information, the data statistics of each such field value in the corresponding other field dimensions (that is, the statistics to which the graphs of the corresponding one or more second data analysis information are automatically updated when that field value is selected), and to cache the calculated data statistics in the memory. When such a field value is selected, the display module automatically updates the correspondingly displayed graphs of the one or more second data analysis information to the corresponding data statistics cached in the memory.
In response to the user selecting, with the cursor, at least one chart of the second data analysis information displayed in the right area of the interface, dragging the chart to the middle area of the interface, and releasing it, the display module displays the at least one chart of the second data analysis information in the middle area of the interface.
In response to a user's connection operation on two charts in the middle area of the interface, the display module connects one of the two charts to the other chart. In response to a user's selection operation on field values characterized in the first chart in the connected state, the display module highlights the selected field values relative to the unselected field values, and/or the display module updates the second chart in the connected state to characterize data statistics of the selected field values in the field dimension corresponding to the second chart.
In response to a user's connection operation for two charts in the interface middle area, the display module connects one of the two charts to the other chart, and the display module updates the other chart to be used for representing data statistics of field values represented by the one chart under field dimensions corresponding to the other chart.
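The update of a connected chart can be illustrated with a short sketch. It assumes that each chart shows value counts for one field and that connecting chart A to chart B means "filter the data by the value selected in A, then recount B's field"; the function name `linked_chart_statistics` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def linked_chart_statistics(rows: List[Dict[str, Hashable]],
                            source_field: str,
                            selected_value: Hashable,
                            target_field: str) -> Counter:
    """Statistics of the selected value in the target chart's field dimension."""
    return Counter(
        row[target_field]
        for row in rows
        if row[source_field] == selected_value
    )


rows = [
    {"city": "Beijing", "plan": "basic"},
    {"city": "Shanghai", "plan": "pro"},
    {"city": "Beijing", "plan": "pro"},
]

# The user selects "Beijing" in the city chart; the connected plan chart
# is redrawn from the filtered counts.
plan_chart_for_beijing = linked_chart_statistics(rows, "city", "Beijing", "plan")
```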
In embodiments 1 and 2, the apparatus 200 for assisting a user in exploring a data set may further include a training module and a presentation module.
In embodiment 1, the training module is configured to, in response to a prediction request for a target field by a user, automatically train the machine learning model by taking field values corresponding to at least some other fields in a single piece of data as input and taking field values corresponding to the target field in the single piece of data as output. The presentation module is used for presenting first interpretation information for characterizing the importance degree of one or more fields of at least part of other fields to the machine learning model prediction target field.
In embodiment 2, the training module is configured to, in a case where a user selects a name icon of a target field in the left area of the interface or selects a chart of first data analysis information of the target field in the middle area of the interface, and in response to an operation performed in the right area of the interface to start automatic training of the machine learning model, automatically train the machine learning model with field values corresponding to at least some of the other fields in a single piece of data as inputs and the field value corresponding to the target field in the single piece of data as output. The display module is used for displaying, in the right area of the interface, first interpretation information used for representing the importance degree of one or more fields in at least part of other fields to the machine learning model prediction target field.
In embodiments 1 and 2, the training module may train the machine learning model by performing an automatic search in a minimal hyper-parameter space based on hyper-parameters preset according to experience. The presentation module may display the first interpretation information in a gradual manner while the degree of importance is being calculated.
In embodiments 1 and 2, the apparatus 200 for assisting a user in exploring a data set may further include a third calculation module and a determination module. The third calculation module is used for distributing the score predicted by the machine learning model for a single piece of data to each field of the at least part of other fields according to the allocation manner of Shapley values, so as to obtain the score of each field under the single piece of data. The determination module is used for determining the importance degree of each field to the machine learning model prediction target field according to the score sum of each field of the at least part of other fields under a plurality of pieces of data, wherein the importance degree is positively correlated with the score sum.
In embodiments 1 and 2, the apparatus 200 for assisting a user in exploring a data set may further include a fourth calculation module. The fourth calculation module is used for distributing the score predicted by the machine learning model for a single piece of data to each field of at least part of other fields according to the allocation manner of Shapley values, so as to obtain the score of each field under the single piece of data. For a single field, the display module displays the scores of the field under a plurality of pieces of data in a two-dimensional coordinate system, wherein one coordinate axis in the two-dimensional coordinate system is used for representing the field value corresponding to the field, and the other coordinate axis is used for representing the score of the field. The display module can also use the display characteristic of a coordinate point in the two-dimensional coordinate system to represent the field value corresponding to another field in the data corresponding to that coordinate point.
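A search over a small, empirically preset hyper-parameter space can be sketched as follows. The text names neither the model nor the hyper-parameters; a one-dimensional ridge regression with a single regularization strength is used here purely as an assumed stand-in, and `fit_ridge_1d`, `mse`, `search_small_space` and the data are hypothetical.

```python
from typing import List, Sequence, Tuple

Data = Tuple[Sequence[float], Sequence[float]]


def fit_ridge_1d(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form weight of y = w * x with an L2 penalty of strength lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def search_small_space(train: Data, valid: Data,
                       preset_lams: List[float]) -> Tuple[float, float]:
    """Try only the preset candidates; return (best lam, its validation error)."""
    (xt, yt), (xv, yv) = train, valid
    scored = [(mse(fit_ridge_1d(xt, yt, lam), xv, yv), lam) for lam in preset_lams]
    best_err, best_lam = min(scored)
    return best_lam, best_err


train = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
valid = ([4.0, 5.0], [8.1, 9.8])
preset_lams = [0.0, 0.1, 1.0]  # a deliberately tiny, hand-picked search space

best_lam, best_err = search_small_space(train, valid, preset_lams)
```

Restricting the search to a few preset candidates keeps training fast enough for interactive use, at the cost of possibly missing a better setting outside the preset space.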
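The Shapley-value allocation and the resulting field importance can be sketched as follows. This is a minimal illustration under stated assumptions: the Shapley values are computed exactly by enumerating every ordering of the fields, which is feasible only for a handful of fields; a field that is "absent" takes its value from a baseline row; and the per-field sum uses absolute scores so that positive and negative contributions do not cancel. All three choices, and the names `shapley_scores`, `field_importance` and `predict`, are assumptions of this sketch rather than details given in the text.

```python
from itertools import permutations
from typing import Callable, Dict, List

Row = Dict[str, float]


def shapley_scores(predict: Callable[[Row], float],
                   row: Row, baseline: Row) -> Dict[str, float]:
    """Exact Shapley allocation of predict(row) - predict(baseline) to fields."""
    fields = list(baseline)
    scores = {f: 0.0 for f in fields}
    orderings = list(permutations(fields))
    for order in orderings:
        current = dict(baseline)       # start with every field "absent"
        previous = predict(current)
        for f in order:
            current[f] = row[f]        # reveal one more field of the row
            value = predict(current)
            scores[f] += value - previous  # marginal contribution of f
            previous = value
    return {f: s / len(orderings) for f, s in scores.items()}


def field_importance(predict: Callable[[Row], float],
                     rows: List[Row], baseline: Row) -> Dict[str, float]:
    """Sum of absolute per-row scores for each field (assumed aggregation)."""
    totals = {f: 0.0 for f in baseline}
    for row in rows:
        for f, s in shapley_scores(predict, row, baseline).items():
            totals[f] += abs(s)
    return totals


def predict(r: Row) -> float:
    # Hypothetical stand-in for the trained machine learning model.
    return 3.0 * r["a"] + 1.0 * r["b"]


baseline = {"a": 0.0, "b": 0.0}
rows = [{"a": 1.0, "b": 2.0}, {"a": -1.0, "b": 0.0}]

scores_first_row = shapley_scores(predict, rows[0], baseline)
importance = field_importance(predict, rows, baseline)
```

For each row the field scores add up to the difference between the prediction for that row and the prediction for the baseline, which is the property that makes the allocation a decomposition of the model's score.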
In embodiments 1 and 2, the display module is further configured to present, in a two-dimensional coordinate system, a prediction result obtained by predicting, by the machine learning model, a plurality of pieces of data, where the two-dimensional coordinate system includes a plurality of coordinate points, each coordinate point corresponds to one piece of data, a display characteristic of each coordinate point is used to represent the prediction result of the data, and a distance between two coordinate points in the two-dimensional space is positively correlated with a distance between two pieces of data corresponding to the two coordinate points in the multi-dimensional space.
As an example, each piece of data has multiple dimensions, each dimension corresponding to one field of the at least some other fields. The apparatus 200 for assisting a user in exploring a data set may further include a sixth calculating module, configured to allocate, according to the allocation manner of Shapley values, the score predicted by the machine learning model for a single piece of data to each field of the at least some other fields, so as to obtain the score of each field under the single piece of data. The score of a field is the value of the data in the corresponding dimension, and the position of each piece of data in the multidimensional space is determined based on the values of the multiple dimensions of the data.
Optionally, the apparatus 200 for assisting a user in exploring a data set may further comprise a selecting module and an extracting module. The selecting module is used for responding to the selection operation of a user on one or more coordinate points in the two-dimensional coordinate system, and selecting a preset number of coordinate points which are the same as the prediction result of the selected coordinate points from the vicinity of the selected coordinate points to obtain a plurality of clustering coordinate points. The extraction module is used for extracting one or more key fields from a plurality of pieces of data corresponding to the clustering coordinate points based on the size sequence of the scores of the fields so as to obtain a key field group. The presentation module is further configured to output the set of key fields.
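The selection and extraction steps can be sketched as follows. This is an illustration only: Euclidean distance in the two-dimensional plane, the inclusion of the selected point in the cluster, and ranking fields by the sum of absolute scores are all assumptions, as are the name `cluster_and_key_fields` and the sample points.

```python
import math
from typing import Dict, List, Tuple


def cluster_and_key_fields(points: List[dict], selected: int,
                           n_neighbors: int,
                           n_key_fields: int) -> Tuple[List[int], List[str]]:
    """Pick the nearest same-prediction points, then rank fields by score."""
    target = points[selected]
    same = [
        i for i, p in enumerate(points)
        if i != selected and p["prediction"] == target["prediction"]
    ]
    # Nearest first, measured in the two-dimensional coordinate system.
    same.sort(key=lambda i: math.dist(points[i]["xy"], target["xy"]))
    cluster = [selected] + same[:n_neighbors]

    totals: Dict[str, float] = {}
    for i in cluster:
        for field, score in points[i]["scores"].items():
            totals[field] = totals.get(field, 0.0) + abs(score)
    key_fields = sorted(totals, key=lambda f: (-totals[f], f))[:n_key_fields]
    return cluster, key_fields


# Hypothetical coordinate points: position, prediction result, per-field scores.
points = [
    {"xy": (0.0, 0.0), "prediction": 1, "scores": {"age": 0.5, "income": 0.1}},
    {"xy": (0.1, 0.0), "prediction": 1, "scores": {"age": 0.4, "income": 0.2}},
    {"xy": (0.2, 0.1), "prediction": 0, "scores": {"age": -0.3, "income": -0.6}},
    {"xy": (5.0, 5.0), "prediction": 1, "scores": {"age": 0.1, "income": 0.9}},
]

cluster, key_fields = cluster_and_key_fields(points, selected=0,
                                             n_neighbors=1, n_key_fields=1)
```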
In embodiment 1 and embodiment 2, the presentation module is further configured to, in response to an adjustment operation performed by a user on a field value of one or more fields in a piece of data, output a prediction result of the machine learning model for the adjusted piece of data, and output second interpretation information of a degree of importance of one or more fields in at least some other fields in the adjusted piece of data to the prediction result.
In embodiment 1 and embodiment 2, the presentation module is further configured to output, by using a machine learning model, a change in field value of at least some other fields in a piece of data according to a predicted result expected by a user for a target field of the piece of data.
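One simple way to derive such a change of field values is a search over candidate values, sketched below. The text does not describe the procedure; trying a single-field change at a time over a user-supplied list of candidate values is an assumption, and `suggest_change`, `predict` and the sample row are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Row = Dict[str, float]


def suggest_change(predict: Callable[[Row], Hashable],
                   row: Row,
                   candidates: Dict[str, List[float]],
                   desired: Hashable) -> Optional[Dict[str, Tuple[float, float]]]:
    """Return {field: (old value, new value)} for the first single-field
    change that makes the model output the desired result, else None."""
    for field, values in candidates.items():
        for value in values:
            if value == row[field]:
                continue  # not a change
            trial = {**row, field: value}
            if predict(trial) == desired:
                return {field: (row[field], value)}
    return None


def predict(r: Row) -> int:
    # Hypothetical stand-in for the trained model: approve (1) or reject (0).
    return int(r["income"] >= 50 and r["debt"] < 20)


row = {"income": 40.0, "debt": 10.0}
change = suggest_change(predict, row, {"income": [40.0, 50.0, 60.0]}, desired=1)
```

A fuller implementation would likely consider combined changes across several fields and prefer the smallest change; this sketch stops at the first candidate that works.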
It should be understood that the specific implementation manner of the apparatus 200 for assisting a user in exploring a data set according to an exemplary embodiment of the present invention may be implemented with reference to the related specific implementation manners described in conjunction with fig. 1 to 9, and will not be described in detail herein.
The present invention may also be embodied as an apparatus for assisting a user in exploring a data table, which may include an execution module for executing, for a data set in the data table, a plug-in implementing the method for assisting a user in exploring a data set of the present invention, in response to the user opening the data table in an application.
The method for assisting the user in exploring the data table can also be realized as a device for assisting the user in exploring the data table. Fig. 16 illustrates a block diagram of an apparatus for assisting a user in exploring a data table according to an exemplary embodiment of the present invention. Wherein the functional elements of the means for assisting a user in exploring data tables may be implemented by hardware, software or a combination of hardware and software implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional units described in fig. 16 may be combined or divided into sub-units to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional units described herein.
In the following, a brief description is given of functional units that the apparatus for assisting a user to explore a data table may have and operations that each functional unit may perform, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 16, an apparatus 300 for assisting a user in exploring a data table includes a display module 310 and a training module 320. The apparatus 300 for assisting a user in exploring a data table may be implemented as a plug-in installed in an application. The apparatus 300 for assisting a user in exploring a data table may be run in response to the user opening the data table in an application.
The display module 310 is configured to display an exploration area in a predetermined area of a data table in response to a user opening the data table in an application. The predetermined area may be at least one of a left side, a right side, an upper side, and a lower side of the data table. The data table may be an excel table.
The training module 320 is configured to automatically train the machine learning model in response to a prediction request of a user for a target field in the data table, with field values corresponding to at least some other fields in a single piece of data as input, and with the field value corresponding to the target field in the single piece of data as output. The training module 320 may train the machine learning model by performing an automatic search in a minimal hyper-parameter space based on hyper-parameters preset according to experience.
The display module 310 is further configured to output, in the exploration area, first interpretation information characterizing a degree of importance of at least some of the other fields to the target field. The first interpretation information may be used to characterize the importance degree of all the fields to the machine learning model prediction target field, and may also be used to characterize a part of the fields with higher importance degree to the machine learning model prediction target field. For the calculation of the importance of the fields, see the above description, and are not described herein.
The display module may display the first interpretation information in a gradual manner while the degree of importance is being calculated. For example, since it takes a long time to calculate the degree of importance, the first interpretation information may first be displayed roughly in a relatively blurred state, and the displayed first interpretation information gradually becomes clearer as the calculation progresses. As for the display form of the first interpretation information, as an example, different display characteristics may be given to a field according to the magnitude of the calculated degree of importance of the field. For example, the greater the importance of the field, the darker the color. Alternatively, different colors may be used to identify whether a field has a positive or negative effect on the target field, e.g., red may be used to characterize a positive effect and blue may be used to characterize a negative effect. That is, the more important a field is to the target field with a positive effect, the redder it is colored, and the more important a field is to the target field with a negative effect, the bluer it is colored.
As an example, the apparatus 300 for assisting a user in exploring a data table may further include a first calculation module and a determination module. The first calculation module is used for distributing the score predicted by the machine learning model for a single piece of data to each field of at least some other fields according to the allocation manner of Shapley values, so as to obtain the score of each field under the single piece of data. The determination module is used for determining the importance degree of each field to the machine learning model prediction target field according to the score sum of each field in at least part of other fields under the plurality of pieces of data, wherein the importance degree is positively correlated with the score sum.
As an example, the display module may be further configured to output, in the exploration area, a two-dimensional coordinate map for characterizing the influence of a single field on the target field, where one coordinate axis in the two-dimensional coordinate map is used to characterize the field value corresponding to the field, and the other coordinate axis is used to characterize the score of the field. The two-dimensional coordinate map includes a plurality of coordinate points, each coordinate point corresponding to one piece of data, and the score of the field is obtained by allocating, according to the allocation manner of Shapley values, the score predicted by the machine learning model for the single piece of data to each field of the at least some other fields. Optionally, the display module may be further configured to use the display characteristic of a coordinate point in the two-dimensional coordinate map to characterize the field value corresponding to another field in the data corresponding to that coordinate point.
By way of example, in response to a prediction request of a user for a piece of data in the data table, the display module outputs, in the exploration area, a prediction result of the machine learning model for the piece of data, and outputs, in the exploration area, second interpretation information of the importance degree of one or more fields of the at least part of other fields in the piece of data to the prediction result. The apparatus 300 for assisting a user in exploring a data table may further include a second calculating module, configured to assign, according to the allocation manner of Shapley values, the score predicted by the machine learning model for the piece of data to each field of the at least some other fields to obtain a score of each field under the piece of data, where the degree of importance of a field to the prediction result is positively correlated with the score of the field. In response to an adjusting operation of a user on field values of one or more fields in a piece of data in the data table, the display module may further output, in the exploration area, a prediction result of the machine learning model for the adjusted piece of data, and output second interpretation information of the importance degree of the one or more fields in at least some other fields in the adjusted piece of data to the prediction result.
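The color scheme described above can be sketched as a mapping from a signed importance score to an RGB color. The linear fade from white and the name `importance_color` are assumptions of this sketch; the text only states that red marks a positive effect, blue a negative effect, and that a more important field is colored more strongly.

```python
from typing import Tuple


def importance_color(score: float, max_abs: float) -> Tuple[int, int, int]:
    """Map a signed score to RGB: red if positive, blue if negative,
    fading towards white as the magnitude approaches zero."""
    strength = 0.0 if max_abs == 0 else min(abs(score) / max_abs, 1.0)
    fade = round(255 * (1.0 - strength))  # 255 = white, 0 = fully saturated
    if score >= 0:
        return (255, fade, fade)   # shades of red
    return (fade, fade, 255)       # shades of blue


strong_positive = importance_color(1.0, max_abs=1.0)
strong_negative = importance_color(-1.0, max_abs=1.0)
neutral = importance_color(0.0, max_abs=1.0)
```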
By way of example, the display module may further output, by using the machine learning model, a change in field value of at least some other fields in a piece of data according to a user's expected prediction result for a target field of the piece of data in the data table.
The apparatus 300 for assisting a user in exploring a data table may further include an output module 310 and a recommendation module 320. The output module 310 is configured to output, in response to a user's selection operation on one or more data columns in the data table, first data analysis information in the exploration area, where the first data analysis information is used to characterize data statistics of a field value corresponding to the data column selected by the user.
The recommending module 320 is configured to recommend one or more second data analysis information to the user, where the second data analysis information is predicted based on the data column selected by the user. For the second data analysis information, see the above description, which is not repeated herein.
The recommendation module 320 may include, but is not limited to, one or any combination of a first recommendation module, a second recommendation module, and a third recommendation module. The recommendation system comprises a first recommendation module, a second recommendation module and a third recommendation module, wherein the first recommendation module is used for predicting data analysis information suitable for being recommended to a user based on a machine learning model, the second recommendation module is used for predicting data analysis information suitable for being recommended to the user based on statistical relevance, and the third recommendation module is used for predicting data analysis information suitable for being recommended to the user based on business rules. The prediction mechanisms of the first recommending module, the second recommending module and the third recommending module can be referred to the above related description, and are not repeated herein.
The exploration area may include a first interface area and a second interface area. In response to the user selecting a particular data column with a cursor, the output module may present a graph of the first data analysis information in the first interface area, and the recommendation module may present one or more graphs of the second data analysis information in the second interface area while the graph of the first data analysis information is presented in the first interface area.
The apparatus 300 for assisting a user in exploring a data table may further include a third calculating module, configured to calculate data statistics of field values corresponding to each data column in advance when the data table is opened in an application program, and cache the calculated data statistics in a memory, where the output module displays a graph of first data analysis information generated based on corresponding data statistics extracted from the memory in the first interface region.
In response to the user selecting a certain field value in the chart of the first data analysis information, the recommendation module automatically updates the correspondingly displayed chart or charts of the second data analysis information to the data statistics of that field value in the corresponding other field dimensions. The apparatus 300 for assisting a user in exploring a data table may further include a fourth calculating module, configured to pre-calculate, for frequently selected field values in the graph of the first data analysis information, the data statistics of each such field value in the corresponding other field dimensions (that is, the statistics to which the graphs of the corresponding second data analysis information are automatically updated when that field value is selected), and to cache the calculated data statistics in the memory. When such a field value is selected, the recommendation module automatically updates the correspondingly displayed graphs of the second data analysis information to the corresponding data statistics cached in the memory.
In response to the user selecting, with the cursor, at least one chart of the second data analysis information displayed in the second interface area, dragging the chart to the first interface area, and releasing it, the display module displays the at least one chart of the second data analysis information in the first interface area.
In response to a user's connection operation on two charts in the first interface area, the display module connects one of the two charts to the other chart. In response to a user's selection operation on field values characterized in the first chart in the connected state, the display module highlights the selected field values relative to the unselected field values, and/or updates the second chart in the connected state to characterize data statistics of the selected field values in the field dimension corresponding to the second chart.
In response to a user's connection operation for two charts in the first interface region, the display module connects one of the two charts to the other chart, and the display module updates the other chart to be used for characterizing data statistics of field values characterized by the one chart in a field dimension corresponding to the other chart.
It should be understood that, according to the present invention, the specific implementation manner of the apparatus 300 for assisting a user to explore a data table can be implemented by referring to the related description of the method for assisting a user to explore a data table in conjunction with fig. 10 to 14, and will not be described herein again.
The method and apparatus for assisting a user in exploring a data set or data table according to exemplary embodiments of the present invention are described above with reference to fig. 1 to 16. It should be understood that the above-described method may be implemented by a program recorded on a computer-readable medium. For example, according to an exemplary embodiment of the present invention, there may be provided a computer-readable storage medium storing instructions, on which is recorded a computer program for executing the method of assisting a user in exploring a data set (such as shown in fig. 1) or the method of assisting a user in exploring a data table of the present invention.
The computer program in the computer-readable medium may be executed in an environment deployed in a computer device such as a client, a host, a proxy device, a server, etc., and it should be noted that the computer program may be used to perform additional steps or perform more specific processes when performing the steps in addition to or instead of the steps shown in fig. 1, and the contents of the additional steps and the further processes are described with reference to fig. 1, and will not be described again to avoid repetition.
It should be noted that the apparatus for assisting a user in exploring a data set and the apparatus for assisting a user in exploring a data table according to an exemplary embodiment of the present invention may be completely dependent on the execution of a computer program to implement the corresponding functions, that is, each apparatus corresponds to each step in the functional architecture of the computer program, so that the entire apparatus is called by a special software package (e.g., lib library) to implement the corresponding functions.
On the other hand, each of the devices shown in fig. 15 and 16 may be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present invention may also be embodied as a computing apparatus comprising a storage component and a processor, the storage component having stored therein a set of computer-executable instructions that, when executed by the processor, perform a method of assisting a user in exploring a data set or a method of assisting a user in exploring a data table.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions described above.
The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the instructions (or sets of instructions) described above, individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the method for assisting a user in exploring a data set or the method for assisting a user in exploring a data table according to an exemplary embodiment of the present invention may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.
The processor may execute instructions or code stored in the storage component, which may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.
Further, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
The operations involved in a method of assisting a user in exploring a data set or a method of assisting a user in exploring a data table according to exemplary embodiments of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or may operate along boundaries that are not exact.
For example, as described above, an apparatus for assisting a user in exploring a data set or an apparatus for assisting a user in exploring a data table according to an exemplary embodiment of the present invention may include a storage component and a processor, wherein the storage component stores therein a set of computer-executable instructions that, when executed by the processor, perform the above-mentioned method of assisting a user in exploring a data set or method of assisting a user in exploring a data table.
While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims (10)

1. A method of assisting a user in exploring a data set, said data set comprising a plurality of pieces of data, each said piece of data comprising values for one or more fields, the method comprising:
in response to a field selection operation by a user, outputting first data analysis information, wherein the first data analysis information characterizes data statistics of the field values corresponding to the field selected by the user; and
recommending one or more second data analysis information to the user, the second data analysis information being predicted based on the user-selected field.
2. The method of claim 1, wherein,
the second data analysis information characterizes data statistics of field values corresponding to other fields obtained by prediction; and/or
the second data analysis information characterizes data statistics of field value combinations corresponding to field combinations formed by the field selected by the user and other fields obtained by prediction.
3. The method of claim 1, wherein recommending one or more second data analysis information to the user comprises:
predicting data analysis information suitable for recommendation to the user based on a machine learning model; and/or
predicting data analysis information suitable for recommendation to the user based on statistical correlation; and/or
predicting data analysis information suitable for recommendation to the user based on a business rule.
4. The method of claim 3, wherein predicting data analysis information suitable for recommendation to a user based on a machine learning model comprises:
predicting, using the machine learning model, a probability that the user will subsequently select each of the other fields, based on information about the field currently selected by the user, wherein the machine learning model is trained by taking information about a selected field and information about other fields as input and taking the probabilities of the other fields as output; and
analyzing the data statistics of the field values corresponding to each field whose probability is greater than a first predetermined threshold to obtain the second data analysis information, or analyzing, on the basis of the first data analysis information of the field selected by the user, the data statistics of the field values corresponding to each field whose probability is greater than the first predetermined threshold to obtain the second data analysis information.
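The prediction and thresholding steps of claim 4 can be illustrated with a minimal sketch. The claim does not specify the model, so a simple transition-frequency estimate over past selection sequences stands in for the machine learning model here; the function names, the sample selection histories, the sample rows, and the threshold value 0.5 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_next_field_model(selection_histories):
    """Estimate P(next field | current field) from past selection sequences."""
    transitions = defaultdict(Counter)
    for history in selection_histories:
        for current, following in zip(history, history[1:]):
            transitions[current][following] += 1
    model = {}
    for current, counts in transitions.items():
        total = sum(counts.values())
        model[current] = {field: n / total for field, n in counts.items()}
    return model


def recommend_fields(model, selected_field, first_threshold):
    """Return the other fields whose predicted probability exceeds the threshold."""
    probabilities = model.get(selected_field, {})
    return sorted(f for f, p in probabilities.items() if p > first_threshold)


def field_statistics(rows, field):
    """Data statistics of one field: here, simply a count of each value."""
    return Counter(row[field] for row in rows)


# Hypothetical selection histories and data set, for illustration only.
histories = [["age", "income", "city"], ["age", "income"], ["age", "city"]]
rows = [
    {"age": 30, "income": "high", "city": "Beijing"},
    {"age": 41, "income": "low", "city": "Shanghai"},
    {"age": 30, "income": "high", "city": "Beijing"},
]

model = fit_next_field_model(histories)
recommended = recommend_fields(model, "age", first_threshold=0.5)
# Second data analysis information: statistics of each recommended field.
second_analysis = {f: field_statistics(rows, f) for f in recommended}
```

In this sketch, acceptance feedback of the kind described in claim 5 could be folded in by appending each accepted recommendation to the selection histories and refitting, although the claims leave the update mechanism open.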
5. The method of claim 4, further comprising:
acquiring the user's acceptance feedback on the one or more second data analysis information recommended to the user, and updating the machine learning model based on the acquired acceptance feedback.
6. The method of claim 3, wherein predicting data analysis information suitable for recommendation to a user based on statistical relevance comprises:
acquiring, according to the statistical correlation among different fields, a field whose statistical correlation with the field selected by the user is higher than a second predetermined threshold; and
analyzing the data statistics of the field values corresponding to the field whose statistical correlation is higher than the second predetermined threshold to obtain the second data analysis information, or analyzing, on the basis of the first data analysis information of the field selected by the user, the data statistics of the field values corresponding to the field whose statistical correlation is higher than the second predetermined threshold to obtain the second data analysis information.
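The correlation-based selection of claim 6 can be sketched as follows. The claim does not name a correlation measure, so the absolute Pearson coefficient over numeric fields is assumed here; the function names, the sample rows, and the threshold value 0.8 are likewise illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant field carries no correlation information
    return cov / (norm_x * norm_y)


def correlated_fields(rows, selected_field, second_threshold):
    """Fields whose absolute correlation with the selected field exceeds the threshold."""
    base = [row[selected_field] for row in rows]
    result = {}
    for field in rows[0]:
        if field == selected_field:
            continue
        r = pearson(base, [row[field] for row in rows])
        if abs(r) > second_threshold:
            result[field] = r
    return result


# Hypothetical numeric data set, for illustration only.
rows = [
    {"age": 20, "income": 2000, "visits": 3},
    {"age": 30, "income": 3100, "visits": 1},
    {"age": 40, "income": 3900, "visits": 4},
    {"age": 50, "income": 5200, "visits": 1},
]

related = correlated_fields(rows, "age", second_threshold=0.8)
```

Taking the absolute value treats strong negative correlation as equally worth recommending; an implementation that only wants positively related fields would compare `r` directly.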
7. The method of claim 1, wherein,
the first data analysis information is presented in a first interface area and the second data analysis information is presented in a second interface area, the method further comprising: in response to a user operation, presenting, in the first interface area, the second data analysis information from the second interface area.
8. A method of assisting a user in exploring a data table, comprising:
in response to a user opening a data table in an application, running a plug-in in the application to perform the following steps:
displaying an exploration area in a preset area of the data table;
in response to a user's prediction request for a target field in the data table, automatically training a machine learning model by taking the field values corresponding to at least some of the other fields in a single piece of data as input and the field value corresponding to the target field in that piece of data as output; and
outputting, in the exploration area, first interpretation information characterizing the degree of importance of the at least some other fields to the target field.
9. An apparatus for assisting a user in exploring a data set, said data set comprising a plurality of pieces of data, each of said pieces of data comprising values for one or more fields, the apparatus comprising:
an output module configured to output first data analysis information in response to a field selection operation by a user, wherein the first data analysis information characterizes data statistics of the field values corresponding to the field selected by the user; and
a recommending module configured to recommend one or more second data analysis information to the user, wherein the second data analysis information is data analysis information predicted based on the field selected by the user.
10. An apparatus that assists a user in exploring a data table, comprising:
a display module configured to display an exploration area in a preset area of a data table in response to a user opening the data table in an application; and
a training module configured to, in response to a user's prediction request for a target field in the data table, automatically train a machine learning model by taking the field values corresponding to at least some of the other fields in a single piece of data as input and the field value corresponding to the target field in that piece of data as output,
wherein the display module is further configured to output, in the exploration area, first interpretation information characterizing the degree of importance of the at least some other fields to the target field.
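The training and interpretation steps recited in claims 8 and 10 can be illustrated with a minimal sketch. The claims do not fix a model type or an importance measure, so a linear model fitted by gradient descent on standardized fields is assumed, with normalized absolute weights standing in for the first interpretation information. All names and the sample table below are hypothetical.

```python
def standardize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]


def train_linear_model(rows, target_field, input_fields, epochs=500):
    """Fit target ~ inputs by gradient descent; returns one weight per input field."""
    columns = [standardize([row[f] for row in rows]) for f in input_fields]
    samples = list(zip(*columns))  # one tuple of input values per piece of data
    targets = standardize([row[target_field] for row in rows])
    weights = [0.0] * len(input_fields)
    learning_rate = 0.5 / len(input_fields)  # small enough for standardized inputs
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0] * len(input_fields)
        for x, y in zip(samples, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                gradient[j] += 2 * error * xi / n
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return dict(zip(input_fields, weights))


def field_importance(weights):
    """Normalize absolute weights so that the importances sum to 1."""
    total = sum(abs(w) for w in weights.values()) or 1.0
    return {field: abs(w) / total for field, w in weights.items()}


# Hypothetical data table: "price" is the target field requested by the user.
table = [
    {"area": 50, "floor": 3, "price": 150},
    {"area": 60, "floor": 1, "price": 181},
    {"area": 70, "floor": 4, "price": 209},
    {"area": 80, "floor": 1, "price": 240},
    {"area": 90, "floor": 5, "price": 272},
    {"area": 100, "floor": 2, "price": 299},
]

weights = train_linear_model(table, "price", ["area", "floor"])
importance = field_importance(weights)
```

Standardizing the fields first makes the weight magnitudes comparable across fields with different units; a tree-based model or permutation importance would be equally valid choices under the claim wording.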
CN201911104860.9A 2019-11-13 2019-11-13 Method and device for assisting user in exploring data set and data table Pending CN110874644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911104860.9A CN110874644A (en) 2019-11-13 2019-11-13 Method and device for assisting user in exploring data set and data table

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911104860.9A CN110874644A (en) 2019-11-13 2019-11-13 Method and device for assisting user in exploring data set and data table

Publications (1)

Publication Number Publication Date
CN110874644A true CN110874644A (en) 2020-03-10

Family

ID=69717958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911104860.9A Pending CN110874644A (en) 2019-11-13 2019-11-13 Method and device for assisting user in exploring data set and data table

Country Status (1)

Country Link
CN (1) CN110874644A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347161A (en) * 2020-11-18 2021-02-09 未来电视有限公司 Data analysis processing method, device, equipment and storage medium
CN112364208A (en) * 2020-11-24 2021-02-12 北京海联捷讯科技股份有限公司 Operation and maintenance analysis method and system based on big data visualization and storage medium

Similar Documents

Publication Publication Date Title
TWI703458B (en) Data processing model construction method, device, server and client
US10467634B2 (en) Generating metadata and visuals related to mined data habits
US20200211103A1 (en) Systems and methods of assisted strategy design
US20200320100A1 (en) Sytems and methods for combining data analyses
US9390142B2 (en) Guided predictive analysis with the use of templates
EP3035189A1 (en) Automated approach for integrating automated function library functions and algorithms in predictive analytics
CN110045953A (en) Generate the method and computing device of business rule expression formula
EP2706494A1 (en) Energy efficient display of control events of an industrial automation system
JP7069029B2 (en) Automatic prediction system, automatic prediction method and automatic prediction program
US20130262444A1 (en) Card view for project resource search results
CA2910808A1 (en) Systems, devices, and methods for determining an operational health score
US11775144B2 (en) Place-based semantic similarity platform
CN118093801A (en) Information interaction method and device based on large language model and electronic equipment
CN110874644A (en) Method and device for assisting user in exploring data set and data table
CN115687672A (en) Chart visualization intelligent recommendation method, device and equipment and readable storage medium
US20140019206A1 (en) Predictive confidence determination for sales forecasting
CN107729424B (en) Data visualization method and equipment
AU2021204470A1 (en) Benefit surrender prediction
CN110781378B (en) Data graphical processing method and device, computer equipment and storage medium
JP2022010749A (en) Contribution aggregation system, contribution aggregation method, and program
US12131166B2 (en) Methods and systems for workflow automation
WO2022267364A1 (en) Information recommendation method and device, and storage medium
CN113468354A (en) Method and device for recommending chart, electronic equipment and computer readable medium
JP5444282B2 (en) Data shaping system, method, and program
AU2020201689A1 (en) Cognitive forecasting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination