CN112115699A

CN112115699A - Method and system for analyzing data

Info

Publication number: CN112115699A
Application number: CN201910535971.9A
Authority: CN
Inventors: 黄飞; 纪大猛; 江敏
Original assignee: Hangzhou Dtwave Technology Co ltd
Current assignee: Hangzhou Dtwave Technology Co ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2020-12-22

Abstract

Methods and systems for analyzing data content in a data set are disclosed. The method includes identifying a data type of a field in the data set; acquiring an application scenario of the data set; determining an analysis model based on input features, wherein the input features include the data type and the application scenario; and analyzing the data content of the data set according to the analysis model.

Description

Method and system for analyzing data

Technical Field

The present invention relates to big data analysis technology, and particularly to a method and system for analyzing data.

Background

Big data refers to large-scale data sets with four major characteristics: large data volume, fast data flow, diverse data types, and low value density. The value of big data in the emerging internet industry is increasingly evident: big data not only helps enterprises optimize management, but also lets them locate target customers more accurately and benefit from those customers.

However, these four characteristics make it difficult for traditional database operation tools to handle the acquisition, storage, management, and analysis of big data. For example, conventional data analysis in the field either relies on a data analyst issuing instructions to a database operation tool or is performed entirely by hand. The workload is heavy, the analysis is inefficient, and the results are poorly visualized, which hinders their direct use.

Disclosure of Invention

One aspect of the invention discloses a method for analyzing data content in a data set, comprising identifying a data type of a field in the data set; acquiring an application scenario of the data set; determining an analysis model based on input features, wherein the input features include the data type and the application scenario; and analyzing the data content of the data set according to the analysis model.

In an embodiment of the invention, the method further comprises preparing the data set.

In an embodiment of the invention, the preparing operation comprises selecting a range of the data set in a source database; defining a database adapter based on the type of the source database; and processing metadata and data content in the data set by the database adapter.

In an embodiment of the present invention, the identifying operation includes: when an explicit declaration exists for the field, taking the data type in the explicit declaration as the data type of the field; and when no explicit declaration exists for the field, taking as the data type of the field the data type into which the data content of the field is successfully converted according to a data type conversion matrix.

In an embodiment of the invention, the method further comprises normalizing the data type identified in the identifying step.

In an embodiment of the present invention, the normalization operation includes normalizing the data type into four types, numeric value, date, boolean, and string.

In an embodiment of the present invention, the acquiring operation includes performing NLP text analysis on at least one of the data content and the description information of the field to obtain an entity or an entity triple; performing knowledge reasoning on the entity or the entity triple by using a knowledge graph to obtain a keyword describing the field; and matching the keyword with a scene library to obtain the application scenario.

In an embodiment of the invention, the matching operation filters out useless keywords of the fields.

In an embodiment of the present invention, the input features further include one or more of descriptive information of the field, an entity, and a unit of measure of data content.

In embodiments of the present invention, the analytical model comprises one or more statistical methods, one or more analytical functions, or a combination thereof.

In an embodiment of the present invention, the statistical method includes TopN statistics, cluster statistics, segment statistics, and correlation statistics, and the analysis function includes a maximum value, a minimum value, a median, a standard deviation, a null rate, and an outlier rate.

In an embodiment of the invention, the method further comprises visually presenting the results of the analysis operation.

In an embodiment of the invention, the visualization presentation operation comprises selecting a visualization chart based on at least one of a data type of the field, a number of records of the data set, the statistical method, and the analysis function.

In an embodiment of the present invention, the determining operation includes comparing the input feature with respective features of a plurality of analysis models, calculating a matching degree, and finding an analysis model that has a highest matching degree with the input feature as the determined analysis model.

In an embodiment of the present invention, the determining operation includes building an analysis model library including a plurality of analysis models; and on the basis of the analysis model library, generating an analysis model with better matching degree by adopting a machine learning technology to serve as the determined analysis model.

Another aspect of the invention discloses a system for analyzing data content in a data set, comprising means for identifying a data type of a field in the data set; means for acquiring an application scenario of the data set; means for determining an analysis model based on input features, wherein the input features include the data type and the application scenario; and means for analyzing the data content of the data set according to the analysis model.

In an embodiment of the invention, the system further comprises means for preparing the data set.

In an embodiment of the invention, the system further comprises means for normalizing the data type identified in the identifying means.

In an embodiment of the present invention, the acquiring means includes means for performing NLP text analysis on at least one of the data content and the description information of the field to obtain an entity or an entity triple; means for performing knowledge reasoning on the entity or entity triple by using a knowledge graph to obtain a keyword describing the field; and means for matching the keyword with a scene library to obtain the application scenario.

In an embodiment of the invention, the system further comprises means for visually presenting the analysis results of the analysis means.

Yet another aspect of the invention discloses a computer-readable medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the above-mentioned method for analyzing data content in a data collection.

Embodiments of the present invention can automatically detect, access and process data without relying on external input (e.g., application scenarios or business requirements). Embodiments of the invention can select a suitable analysis model for the analysis and display the analysis result visually, so that browsing the data amounts to analyzing it. Compared with the prior art, embodiments of the invention reduce the workload of a data analyst and improve the efficiency of data analysis.

Drawings

FIG. 1 is a flow diagram of a method of data analysis according to one embodiment of the invention.

FIG. 2 is a schematic diagram of a process of identifying and normalizing data types of fields according to one embodiment of the invention.

FIG. 3 is a schematic diagram of a HIVE data type conversion matrix according to one embodiment of the invention.

FIG. 4 is a schematic diagram of scene acquisition according to one embodiment of the present invention.

FIG. 5 is a schematic diagram of determining an analytical model according to one embodiment of the present invention.

FIG. 6 is a schematic illustration of a visualization presentation according to an embodiment of the invention.

Detailed Description

The content of the invention will now be described with reference to a number of exemplary embodiments. It is to be understood that these examples are set forth merely to enable those of ordinary skill in the art to better understand and thereby implement the teachings of the present invention, and are not intended to suggest any limitation as to the scope of the invention.

As used herein, the term "include" and its variants should be read as open-ended terms meaning "including, but not limited to". The term "based on" should be read as "based, at least in part, on". The terms "one embodiment" and "an embodiment" should be read as "at least one embodiment". The term "another embodiment" should be read as "at least one other embodiment".

FIG. 1 shows a flow diagram of a data analysis method according to one embodiment of the invention. The method includes steps 1-6, each of which is described in detail below.

Step 1-data set preparation

First, a range of data sets to be analyzed is selected in a source database (i.e., data source). In embodiments of the present invention, a data set may include one or more full databases, one or more partial databases, or a combination thereof (e.g., one or more full databases and one or more partial databases) in a source database. A database (e.g., a relational database) is typically made up of one or more tables, while the smallest unit for analyzing data is typically a field, and a field refers to a column in a table.

In an embodiment of the invention, after the data set is selected, the metadata of the relevant databases in the data set is read and the data content actually stored in these databases is read. As is generally understood in the art, metadata, also referred to as intermediate data or relay data, is data used to describe data, including information describing database definitions (e.g., database names, database descriptions, database types, tables contained in databases), table definitions (e.g., table names, table descriptions, table types, fields contained in tables), and field definitions (e.g., field names, field descriptions, field data types) used to implement functions such as indicating storage locations, historical data, resource lookups, file records, and the like.

There are many different types of databases in the art, such as relational databases (MySQL, Oracle, etc.) and non-relational databases (HBase, Redis, MongoDB, etc.), and different types of databases may store metadata in different locations and in different ways. The database adapters used to connect to different types of databases therefore vary accordingly. In an embodiment of the present invention, a matching database adapter is defined for the type of database before the metadata is obtained and the data content is accessed. Database adapters are generally constructed as known in the art and typically include database connection protocols (e.g., JDBC, ODBC, etc.). Through the connection mode defined by the database connection protocol, the database adapter can read metadata and data content in the target database and can optionally delete, modify, and query them.

Additionally, in embodiments of the present invention, the database adapter may exchange (optionally, read or write) data between the aforementioned selected data set and the analytics database to be built by the present invention. The construction and use of the analytical database will be described in detail below.
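The adapter idea in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the class and function names (`DatabaseAdapter`, `SQLiteAdapter`, `define_adapter`) are hypothetical and do not come from the patent, and Python's built-in `sqlite3` module stands in for a JDBC or ODBC connection so that the example is self-contained.

```python
import sqlite3
from abc import ABC, abstractmethod


class DatabaseAdapter(ABC):
    """Hypothetical adapter interface: one subclass per source database type."""

    @abstractmethod
    def read_metadata(self, table):
        """Return field definitions (name and declared type) for a table."""

    @abstractmethod
    def read_content(self, table, limit=100):
        """Return up to `limit` rows of data content from a table."""


class SQLiteAdapter(DatabaseAdapter):
    """Adapter for SQLite, used here only as a stand-in source database."""

    def __init__(self, connection):
        self.conn = connection

    def read_metadata(self, table):
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        rows = self.conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # An empty declared type means the field has no explicit declaration.
        return [{"name": r[1], "declared_type": r[2] or None} for r in rows]

    def read_content(self, table, limit=100):
        query = f'SELECT * FROM "{table}" LIMIT ?'
        return self.conn.execute(query, (limit,)).fetchall()


# Registry mapping a source database type to its adapter class (assumed design).
ADAPTERS = {"sqlite": SQLiteAdapter}


def define_adapter(db_type, connection):
    """Pick the adapter that matches the type of the source database."""
    return ADAPTERS[db_type](connection)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER, brand_prefer TEXT, regst_date)")
conn.execute("INSERT INTO users VALUES (1001001, 'Regional brand', '2018-09-09')")

adapter = define_adapter("sqlite", conn)
metadata = adapter.read_metadata("users")
content = adapter.read_content("users")
```

Here `regst_date` is created without a declared type, so its metadata entry carries `None`; that is the case in which step 2 falls back to conversion-based identification.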

Step 2-field data type identification and normalization

FIG. 2 is a diagram illustrating a process of identifying and normalizing data types of fields, according to an embodiment of the invention, which is described in detail below.

Common data types in the art are listed in Table 1; they may have different names and definitions in different database tools. The HIVE database is used here as an example to explain how the data type of a field is identified.

After the metadata and the data content are read in step 1, an explicit declaration of the data type of the relevant field can be obtained. When an explicit declaration exists for a field, the data type in the explicit declaration is the data type of the field; for example, when the explicitly declared data type is BIGINT, the data type of the field is BIGINT.

When no explicit declaration exists for the field, the data content of the field is converted according to the data type conversion matrix, and the data type of the field is determined by whether the conversion succeeds. This conversion operation is also referred to as "data recognition". FIG. 3 illustrates a HIVE data type conversion matrix according to one embodiment of the present invention. For example, the matrix entry for BOOLEAN data content is true only for conversion to BOOLEAN, so such data content can only be of the BOOLEAN type; the entries for TINYINT data content are true for conversion to TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, STRING, and VARCHAR, so the data type of such data content may be any of these nine types. In an embodiment of the present invention, the priority of the successfully converted data types decreases from left to right along the first row of FIG. 3. For example, although TINYINT data content can be converted into nine types, once it has been successfully converted to TINYINT no further conversion is attempted, and TINYINT is recorded as the data type of the field; otherwise, conversions of successively lower priority are attempted until one succeeds.
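The identification rule above (use the explicit declaration if present, otherwise try conversions in priority order and keep the first that succeeds) might look like the following sketch. The priority list and the converter functions are assumptions made for illustration; the actual order is given by the first row of FIG. 3, which is not reproduced in this text, and only a subset of HIVE types is covered.

```python
from datetime import datetime


def _to_bool(text):
    # Accept only the literals true/false, in any letter case.
    if text.strip().lower() not in ("true", "false"):
        raise ValueError(text)


def _int_in_range(bits):
    # Build a converter for a signed integer type of the given width.
    def convert(text):
        value = int(text.strip())
        limit = 2 ** (bits - 1)
        if not -limit <= value < limit:
            raise ValueError(text)
    return convert


def _to_double(text):
    float(text)


def _to_date(text):
    datetime.strptime(text.strip(), "%Y-%m-%d")


# Assumed priority, highest first; a stand-in for the first row of FIG. 3.
PRIORITY = [
    ("BOOLEAN", _to_bool),
    ("TINYINT", _int_in_range(8)),
    ("SMALLINT", _int_in_range(16)),
    ("INT", _int_in_range(32)),
    ("BIGINT", _int_in_range(64)),
    ("DOUBLE", _to_double),
    ("DATE", _to_date),
    ("STRING", str),
]


def identify_type(values, declared_type=None):
    """Return the field's data type from its declaration or its content."""
    if declared_type:
        return declared_type.upper()
    if not values:
        return "STRING"
    for name, convert in PRIORITY:
        try:
            for value in values:
                convert(value)
        except ValueError:
            continue  # conversion failed, try the next lower priority
        return name  # every value converted, stop at this type
    return "STRING"
```

For instance, `identify_type(["12", "-7"])` stops at `TINYINT`, while `identify_type(["1001001"])` fails the two narrower integer ranges and settles on `INT`.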

According to an embodiment of the present invention, after the data types of all the fields are identified, they are normalized according to Table 1. In an embodiment of the present invention, the data types of all fields can be normalized to four types: numeric, date, boolean, and string.
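Normalization to the four categories can be sketched as a simple lookup. The grouping below is an assumption made for illustration, since the contents of Table 1 are not reproduced in this text; the fallback to string for unlisted types is likewise an assumption.

```python
# Assumed grouping of HIVE types into the four normalized categories.
NORMALIZED = {
    "numeric": {"TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE", "DECIMAL"},
    "date": {"DATE", "TIMESTAMP"},
    "boolean": {"BOOLEAN"},
    "string": {"STRING", "VARCHAR", "CHAR"},
}


def normalize_type(hive_type):
    """Map an identified data type to numeric, date, boolean, or string."""
    # Drop any precision or length suffix such as DECIMAL(10,2).
    base = hive_type.upper().split("(")[0].strip()
    for category, members in NORMALIZED.items():
        if base in members:
            return category
    return "string"  # assumption: unlisted types are treated as strings
```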

TABLE 1

Step 3-scene acquisition

For a data set to be analyzed, examining the data content alone may not reveal what that content actually means, let alone support application-oriented analysis based on it. Therefore, according to some embodiments of the present invention, information from which the true application scenario (also simply called the "scene") can be inferred, hereinafter "key information", is extracted from the description information in the metadata of the data set; in many cases only a small amount of description information is needed. According to further embodiments of the present invention, when the metadata lacks description information, the key information is extracted directly from the data content.

FIG. 4 shows a schematic diagram of scene acquisition according to an embodiment of the present invention, wherein scene acquisition includes three steps of Natural Language Processing (NLP) text analysis, knowledge inference and scene matching.

First, the description information and data content of several fields are obtained from the data set to be analyzed and processed using NLP text analysis techniques. Embodiments of the present invention may use existing NLP text analysis techniques. In one embodiment of the present invention, NLP text analysis includes word segmentation, named entity recognition (NER), and entity relationship extraction, from which entity triples are constructed as described in detail below. In another embodiment of the invention, NLP text analysis does not include entity relationship extraction, so no entity triples are constructed. Next, knowledge reasoning is performed on the obtained entities and/or entity triples according to the knowledge graph. Finally, the best matching application scenario is selected according to the reasoning result and serves as a reference for determining the analysis model.

In the embodiment of the present invention, word segmentation refers to Chinese word segmentation, the process of splitting a sequence of Chinese characters into individual characters or words. Chinese word segmentation is the basis of text mining: segmenting a passage of input Chinese lets a computer automatically identify the meaning of a sentence. In the embodiment of the invention, description information or data content that is not in Chinese is first translated into Chinese.

Named entity recognition, also known as "proper name recognition," refers to the recognition of entities in text that have specific meaning, including mainly personal names, place names, organization names, proper nouns, etc. In general, the task of named entity recognition is to identify three major classes in the text to be processed: the subject class (person name, organization name, place name), the time class (time, date), and the value class (currency, percentage).

Entity relationship extraction refers to extracting a relationship from the characters or words produced by word segmentation, so that an entity triple can be constructed together with the entities obtained by named entity recognition. An entity triple is a way of describing the relationship between entities, and its basic form is "entity 1-relationship-entity 2". In embodiments of the present invention, the relationship in the entity triple may come directly from an entity found by named entity recognition, or may be derived by inference (i.e., when the recognized entities do not contain the relationship). For example, the named entity "headquarters" can be identified from the text "headquarters in Hangzhou" and is taken as the relationship between the named entity "headquarters" and the named entity "Hangzhou". As another example, the named entity "headquarters" cannot be identified directly from the text "in Hangzhou, billows", but the same "headquarters" relationship can be deduced through semantic analysis.

In one embodiment of the present invention, the data set to be analyzed is shown in Table 2. NLP text analysis is illustrated using the field in column 2 as an example. First, the description information (row 1) and the data content of this field are segmented into words. Then, the entities in the description information and the data content are extracted using named entity recognition. In this embodiment, there is no definite relationship between the extracted entities, and thus no entity triple is constructed.

TABLE 2

uid	brand_prefer	avg_trd_amt	active_usr_flg	regst_date
1001001	Regional brand	165.0	TRUE	2018-09-09
1001002	Domestic brand	209.5	FALSE	2016-02-12
1001003	International brand	98.3	FALSE	2015-11-03
1001004	Global brand	127.9	TRUE	2016-07-25
…	…	…	…	…

In another embodiment of the present invention, the data set to be analyzed is shown in Table 3. NLP text analysis is illustrated using the field in column 2 as an example. First, the data content of the field is segmented into words; this field has no description information. Then, the entities in the data content are extracted using named entity recognition. Finally, the relationships between the extracted entities are determined using entity relationship extraction, and entity triples are constructed. In this embodiment, the entity triple constructed from the data content in row 1 is, for example, (brand A, version, stature), which indicates that the version of brand A is stature; the entity triple constructed from the data content of row 2 is, for example, (brand B, version, tight), which indicates that the version of brand B is tight.

TABLE 3

As is generally understood in the art, a knowledge graph is a graph structure of entities and relationships that reflects the interrelationships among entities, between entities and relationships, and among relationships. According to embodiments of the present invention, a knowledge graph may be constructed by connecting at least one entity and/or entity triple. In embodiments of the present invention, knowledge reasoning refers to the process of linking at least one entity to at least one other entity using a knowledge graph composed of a plurality of entities and/or entity triples.

For example, after deriving the entity and entity triples of the respective column 2 fields from tables 2 and 3, it may be determined through knowledge reasoning that at least these entity and entity triples relate to the keywords "brand preference" and "layout preference". Further, "consumption level" and "body shape" and the like can be inferred from the above keywords "brand preference" and "layout preference", respectively.
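One way to picture this kind of knowledge reasoning is a graph walk over entity triples. The sketch below is an assumption-laden illustration: the triples, the relation names (`is_a`, `describes`, `suggests`), and the breadth-first traversal are hypothetical choices, not the reasoning procedure of the patent.

```python
from collections import defaultdict, deque

# Hypothetical knowledge graph expressed as (entity 1, relationship, entity 2).
TRIPLES = [
    ("Regional brand", "is_a", "brand"),
    ("International brand", "is_a", "brand"),
    ("brand", "describes", "brand preference"),
    ("brand preference", "suggests", "consumption level"),
]


def build_graph(triples):
    """Index the triples by their first entity."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def infer_keywords(graph, entities, max_hops=3):
    """Breadth-first walk from the extracted entities.

    Returns the entities reached, in discovery order, as candidate keywords.
    """
    seen = set(entities)
    found = []
    queue = deque((entity, 0) for entity in entities)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for _relation, tail in graph.get(node, []):
            if tail not in seen:
                seen.add(tail)
                found.append(tail)
                queue.append((tail, hops + 1))
    return found


graph = build_graph(TRIPLES)
keywords = infer_keywords(graph, ["Regional brand", "International brand"])
```

Starting from the entities extracted from column 2 of Table 2, the walk reaches "brand preference" and then "consumption level", mirroring the chain of inference described in the text.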

In the embodiment of the present invention, the inferred keyword is written into the description information of the field, or may be written into a new field. In the case of a new field, the keyword is the description information of the field, and the data content is an entity associated with or inferred from the extracted entity and/or entity triple. For example, a new field may be constructed according to the NLP text analysis of Table 3, where the field description information is "layout preference" or "body size", and the field data content is "build", "tight", and "standard", "thin", etc., respectively.

In embodiments of the present invention, inference may draw on one or more fields in the same table or in different tables within the data set, or even on tables outside the data set, to obtain description information and data content associated with a field, and this description information and data content may be written to new fields of the data set. For example, the column 2 field of Table 2 may be inferred from a large amount of ordering information recorded in a plurality of other tables.

Finally, the finalized keywords (including inferred keywords and description information that was directly available) of all or some of the fields in the data set (which contains one or more tables) are matched against an established scene library. In an embodiment of the invention, the matching process may filter out useless description information of the fields. In the examples above, the application scenario of Table 2 is determined to be "user transaction information", and that of Table 3 "electronic commerce ordering information". The construction and establishment of the scene library are known in the art.
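Matching keywords against a scene library could be sketched as a set-overlap score. Everything here is assumed for illustration: the library contents, the list of useless keywords, and the choice of Jaccard similarity as the score; the patent does not specify the matching method.

```python
# Hypothetical scene library: scene name -> keywords that characterize it.
SCENE_LIBRARY = {
    "user transaction information": {
        "brand preference", "transaction amount", "registration date", "active user",
    },
    "electronic commerce ordering information": {
        "order", "brand", "layout preference", "body shape",
    },
}

# Assumed examples of keywords that carry no scene information.
USELESS_KEYWORDS = {"id", "flag"}


def match_scene(keywords, library=SCENE_LIBRARY, useless=USELESS_KEYWORDS):
    """Return (best scene, score) using Jaccard overlap of keyword sets."""
    kept = {keyword.lower() for keyword in keywords} - useless
    best_scene, best_score = None, 0.0
    for scene, scene_keywords in library.items():
        union = kept | scene_keywords
        score = len(kept & scene_keywords) / len(union) if union else 0.0
        if score > best_score:
            best_scene, best_score = scene, score
    return best_scene, best_score


scene, score = match_scene(["brand preference", "transaction amount", "id"])
```

The keyword "id" is filtered out before scoring, and the remaining two keywords overlap only with the first scene, so that scene is returned. If nothing overlaps, the function returns `None` rather than guessing.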

Step 4-analytical model determination

As is known in the art, an analysis model (or simply "model") is a computer-executable module that implements one or more statistical methods and/or analysis functions (or a combination of such statistical methods and functions). In an embodiment of the invention, the analysis model is determined by feature matching. First, a model library that includes a plurality of models is built. Then, the model most similar to the input features is found through feature matching. Feature matching compares the input features with the features of all existing models in the model library, calculates a matching degree for each, and identifies the model with the highest matching degree for the current input features. In embodiments of the present invention, the matching degree may be calculated in a variety of ways known in the art, such as linear, polynomial, weighted, logarithmic, or exponential calculation.
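Feature matching with a weighted matching degree can be sketched as below. The model names follow the examples the text gives for FIG. 5, but the feature keys, the weights, and the exact-equality comparison are assumptions chosen only to make the idea concrete.

```python
SCENE = "user transaction information"

# Hypothetical model library: each model lists the features it is suited to.
MODEL_LIBRARY = [
    {
        "name": "user brand preference analysis",
        "features": {"scene": SCENE, "data_type": "string",
                     "description": "user brand preference"},
    },
    {
        "name": "transaction amount segmented statistics",
        "features": {"scene": SCENE, "data_type": "numeric",
                     "description": "transaction amount"},
    },
    {
        "name": "analysis of the number of new registered users in the last 1 year",
        "features": {"scene": SCENE, "data_type": "date",
                     "description": "registration date"},
    },
]

# Assumed weights for a weighted matching degree.
WEIGHTS = {"scene": 0.4, "data_type": 0.3, "description": 0.3}


def matching_degree(input_features, model_features, weights=WEIGHTS):
    """Sum the weights of the features on which input and model agree."""
    return sum(
        weight
        for key, weight in weights.items()
        if key in input_features and input_features[key] == model_features.get(key)
    )


def determine_model(input_features, library=MODEL_LIBRARY):
    """Return the model with the highest matching degree."""
    return max(library, key=lambda m: matching_degree(input_features, m["features"]))


chosen = determine_model(
    {"scene": SCENE, "data_type": "numeric", "description": "transaction amount"}
)
```

A real matching degree would likely compare descriptions by similarity rather than exact equality; exact comparison keeps the sketch short.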

In another embodiment of the invention, a machine learning technique may be employed to generate a model with a better matching degree based on the model library that has been built. First, a label set of input features and output models is created by accumulating the existing models in the model library (the composition of a label set is known in the art). Then, a machine learning model that maps input features to an output model is built from the label set using a machine learning classification algorithm (e.g., XGBoost, random forest, LightGBM, or a CNN deep neural network). When new input features are encountered, the trained machine learning model is used for inference to obtain the best matching combination of analysis functions, and the best matching model is generated from that combination and output.
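The label-set idea can be illustrated with a deliberately tiny classifier. Note the substitution: the text names XGBoost, random forest, LightGBM, and CNNs, whereas the sketch below uses a plain nearest-neighbour vote in pure Python so that it runs without third-party libraries. The label set itself is hypothetical.

```python
from collections import Counter

# Hypothetical label set: (input features, name of the model that was chosen).
LABEL_SET = [
    ({"scene": "user transaction information", "data_type": "string"},
     "user brand preference analysis"),
    ({"scene": "user transaction information", "data_type": "numeric"},
     "transaction amount segmented statistics"),
    ({"scene": "user transaction information", "data_type": "date"},
     "analysis of the number of new registered users in the last 1 year"),
    ({"scene": "electronic commerce ordering information", "data_type": "numeric"},
     "transaction amount segmented statistics"),
]


def distance(a, b):
    """Number of feature keys on which two feature dicts disagree."""
    return sum(a.get(key) != b.get(key) for key in set(a) | set(b))


def predict_model(features, label_set=LABEL_SET, k=1):
    """Vote among the k labelled examples closest to the input features."""
    nearest = sorted(label_set, key=lambda pair: distance(features, pair[0]))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


predicted = predict_model({
    "scene": "user transaction information",
    "data_type": "numeric",
    "description": "transaction amount",
})
```

In practice the label set would be far larger and the categorical features would be encoded for one of the classifiers the text mentions; the control flow (train on labelled pairs, then infer for new input features) stays the same.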

In an embodiment of the present invention, the input features include a scene and a data type. In another embodiment of the present invention, the input features further comprise one or more of a field description, an entity, and a unit of measure. After the data type and scene have been determined in steps 2 and 3, either of the two schemes above can be used to decide which analysis model to adopt for them, and intelligent data analysis is then performed on the input data content and data type using the adopted analysis model.

FIG. 5 is a schematic diagram of determining an analysis model according to one embodiment of the present invention. For example, when the scenario is "user transaction information": if the field is described as "user brand preference" and the data type is STRING, the best matching model is "user brand preference analysis"; if the field is described as "transaction amount" and the data type is DOUBLE, the best matching model is "transaction amount segmented statistics"; and if the field is described as "registration date" and the data type is DATE, the best matching model is "analysis of the number of new registered users in the last 1 year".

In one embodiment of the invention, the input features are not limited to the scene, data type, etc. of a single field; input features from multiple fields in the same or different tables may also be used to determine an analysis model. For example, from the fields in Table 2, models such as "activity analysis of users registered for more than 5 years" and "brand preference analysis of users with a transaction amount above 10,000" may be matched.

Step 5-data analysis

Based on the analysis model obtained in step 4, data analysis can be performed on the data in the data set according to the statistical method and the analysis function specified in the analysis model. Common statistical methods include TopN statistics, cluster statistics, segment statistics, correlation statistics, and the like. Common analytical functions include maximum, minimum, median, standard deviation, null rate, outlier rate, and the like.
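A few of the statistical methods and analysis functions listed above can be sketched with the Python standard library. The function names are hypothetical, and the outlier rule (more than three standard deviations from the mean) is an assumption; the patent does not define how outliers are detected.

```python
import statistics
from collections import Counter


def analyze_numeric(values):
    """Compute the listed analysis functions for a numeric field."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    std = statistics.pstdev(present)
    # Assumed outlier rule: more than three standard deviations from the mean.
    outliers = [v for v in present if std and abs(v - mean) > 3 * std]
    return {
        "max": max(present),
        "min": min(present),
        "median": statistics.median(present),
        "std": std,
        "null_rate": (len(values) - len(present)) / len(values),
        "outlier_rate": len(outliers) / len(present),
    }


def top_n(values, n=3):
    """TopN statistics: the n most frequent non-null values with counts."""
    return Counter(v for v in values if v is not None).most_common(n)


def segment_counts(values, edges):
    """Segment statistics: count values per half-open range [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        if value is None:
            continue
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# avg_trd_amt values from Table 2, plus one missing value for illustration.
amounts = [165.0, 209.5, 98.3, 127.9, None]
summary = analyze_numeric(amounts)
segments = segment_counts(amounts, [0, 100, 200, 300])
brands = top_n(["Regional brand", "Domestic brand", "Regional brand", None], n=2)
```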

Step 6-visual display

FIG. 6 shows a schematic diagram of analysis result visualization according to an embodiment of the invention. The data analysis results from step 5 are automatically matched to a chart and displayed. The elements consulted when generating the visual chart include, but are not limited to, field type, number of records, statistical method, and analysis function; a visualization mode is then selected automatically based on the chart-matching logic and these elements. For example, when the results of a user brand preference analysis are to be presented, the logic described above presents them as a pie chart. Here, the number of records refers to the number of rows in a table in the data set.
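Chart selection from these elements can be sketched as a small rule table. The rules, chart names, and the record-count threshold below are all assumptions made for illustration; only the pie chart for brand preference results comes from the text.

```python
# Assumed record count above which a scatter plot becomes unreadable.
LARGE_TABLE = 100_000


def select_chart(field_type, record_count, statistical_method=None,
                 analysis_function=None):
    """Pick a chart type from field type, record count, method and function."""
    if analysis_function and not statistical_method:
        # A single aggregate such as a median needs no full chart.
        return "single-value card"
    if statistical_method == "correlation":
        return "heatmap" if record_count > LARGE_TABLE else "scatter plot"
    if statistical_method == "segment":
        return "histogram"
    if field_type == "date":
        return "line chart"
    if field_type in ("string", "boolean") and statistical_method == "topn":
        return "pie chart"
    return "bar chart"


brand_chart = select_chart("string", 5000, statistical_method="topn")
```

With these rules, a TopN analysis of the string field `brand_prefer` yields a pie chart, consistent with the brand preference example in the text.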

The method, apparatus and system of the embodiments of the present invention may be implemented as a pure software module (for example, a software program written in Java language), as a pure hardware module (for example, a special ASIC chip or FPGA chip) as required, or as a module combining software and hardware (for example, a firmware system storing fixed codes).

Another aspect of the invention is a computer-readable medium having stored thereon computer-readable instructions that, when executed, implement the method of embodiments of the invention.

It will be appreciated by persons skilled in the art that the foregoing description is only exemplary of the invention and is not intended to limit the invention. The present invention may include various modifications and variations. Any modifications and variations within the spirit and scope of the present invention should be included within the scope of the present invention.

Claims

1. A method for analyzing data content in a data set, comprising:

identifying a data type of a field in the data set;

acquiring an application scenario of the data set;

determining an analysis model based on input features, wherein the input features include the data type and the application scenario; and

analyzing the data content of the data set according to the analysis model.

2. The method of claim 1, further comprising:

preparing the data set.

3. The method of claim 2, wherein the preparing operation comprises:

selecting a range of the data set in a source database;

defining a database adapter based on the type of the source database; and

processing, by the database adapter, metadata and data content in the data set.

4. The method of claim 1, wherein the identifying operation comprises:

when an explicit declaration exists for the field, taking the data type in the explicit declaration as the data type of the field, and when no explicit declaration exists for the field, taking as the data type of the field the data type into which the data content of the field is successfully converted according to a data type conversion matrix.

5. The method of claim 1, further comprising:

normalizing the data type identified in the identifying step.

6. The method of claim 5, wherein the normalizing operation comprises:

normalizing the data types into four types: numeric, date, boolean, and string.

7. The method of claim 1, wherein the acquiring operation comprises:

performing NLP text analysis on at least one of the data content and the description information of the field to obtain an entity or an entity triple;

performing knowledge reasoning on the entity or the entity triple by using a knowledge graph to obtain a keyword describing the field; and

matching the keyword with a scene library to obtain the application scenario.

8. The method of claim 7, wherein the matching operation filters out useless field keywords.

9. The method of claim 7, wherein the input features further comprise one or more of descriptive information of the field, an entity, and a unit of measure of data content.

10. The method of claim 1, wherein the analytical model comprises one or more statistical methods, one or more analytical functions, or a combination thereof.

11. The method of claim 10, wherein the statistical methods include TopN statistics, cluster statistics, segment statistics, and correlation statistics, and the analysis functions include maxima, minima, medians, standard deviations, null rates, and outlier rates.

12. The method of claim 1, further comprising:

visually displaying the result of the analysis operation.

13. The method of claim 12, wherein the visualization presentation operation comprises selecting a visualization chart based on at least one of a data type of the field, a number of records of the data set, the statistical method, and the analytic function.

14. The method of claim 1, wherein the determining comprises:

comparing the input features with the respective features of a plurality of analysis models, calculating a matching degree, and finding the analysis model with the highest matching degree with the input features as the determined analysis model.

15. The method of claim 1, wherein the determining comprises:

establishing an analysis model library, wherein the analysis model library comprises a plurality of analysis models; and

on the basis of the analysis model library, generating an analysis model with a better matching degree by adopting a machine learning technology according to the input features, wherein the generated analysis model is used as the determined analysis model.

16. A system for analyzing data content in a data set, comprising:

means for identifying a data type of a field in the data set;

means for acquiring an application scenario of the data set;

means for determining an analysis model based on input features, wherein the input features include the data type and the application scenario; and

means for analyzing data content of the data set according to the analysis model.

17. The system of claim 16, further comprising:

means for preparing the data set.

18. The system of claim 16, further comprising:

means for normalizing said data types identified in said identifying means.

19. The system of claim 16, wherein the acquiring means comprises:

means for performing NLP text analysis on at least one of data content and description information of the field to obtain an entity or entity triple;

means for performing knowledge reasoning on the entity or entity triple by using a knowledge graph to obtain a keyword describing the field; and

means for matching the keyword with a scene library to obtain the application scenario.

20. The system of claim 16, further comprising:

means for visually displaying the analysis result of the analyzing means.

21. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-15.