Data mining method, device, equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of data processing, and in particular to a data mining method, a data mining apparatus, a computer device and a storage medium.
Background
Structured data mining is an interdisciplinary field that involves data mining, statistics, computer languages, computer network technologies, ETL (extract, transform, load) and other areas. It extracts the log data, basic data and relationship data that are of interest to a business domain from massive log data. Because massive data come from highly heterogeneous sources, vary in quality and are large in volume, traditional structured data extraction is labor-intensive, and judging the reliability of the extraction results has become a main problem facing data mining today.
Current structured data extraction methods include real-time stream extraction and offline task extraction, and in most systems the data model of the extracted target data is a two-dimensional data table designed around a business application. The drawback is that the data model changes whenever business requirements change, which happens frequently, and the historical data must then be migrated accordingly, which is time-consuming and laborious. Moreover, because there are many data sources, the values of the same attribute may differ between sources, data quality varies, and reliability cannot be judged.
Disclosure of Invention
The invention provides a data mining method, a data mining apparatus, a computer device and a storage medium, which aim to solve the problems in the prior art that data held in a data model based on a business application are difficult to integrate and migrate and that the reliability of the data cannot be judged.
In a first aspect, an embodiment of the present invention provides a data mining method, including:
extracting target data according to a pre-designed data model, wherein the data structure of the target data comprises an entity, an attribute, an attribute value and data source information;
calculating a reliability value for each attribute value according to a trusted weight of the data source information in the data structure of the target data;
sorting the attribute values of the same attribute of the same entity according to the reliability value;
and selecting the attribute value with the highest reliability value as the merged attribute value of the target attribute of the target entity.
In a second aspect, an embodiment of the present invention further provides a data mining apparatus, including:
a target data extraction module for extracting target data according to a pre-designed data model, wherein the data structure of the target data comprises an entity, an attribute, an attribute value and data source information;
a reliability value calculation module for calculating a reliability value for each attribute value according to a trusted weight of the data source information in the data structure of the target data;
a reliability sorting module for sorting the attribute values of the same attribute of the same entity according to the reliability value;
and an attribute value merging module for selecting the attribute value with the highest reliability value as the merged attribute value of the target attribute of the target entity.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the data mining method according to any embodiment of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data mining method according to any embodiment of the present invention.
According to the data mining method, apparatus, computer device and storage medium provided by the embodiments of the invention, target data are extracted according to a pre-designed data model, wherein the data structure of the target data comprises an entity, an attribute, an attribute value and data source information; a reliability value is calculated for each attribute value according to the trusted weight of the data source information in the data structure of the target data; the attribute values of the same attribute of the same entity are sorted according to the reliability value; and the attribute value with the highest reliability value is selected as the merged attribute value of the target attribute of the target entity. These technical means achieve the beneficial effect of extracting data with high value density from massive data, and overcome the problems in the prior art that data held in a data model based on a business application are difficult to integrate and migrate and that the reliability of the data cannot be judged.
Drawings
FIG. 1 is a flow chart of a data mining method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of storage in a star data relationship according to a first embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data mining apparatus according to a second embodiment of the present invention;
FIG. 4 is a schematic hardware structure diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Embodiment One
FIG. 1 is a flowchart of a data mining method provided in Embodiment One of the present invention. This embodiment is applicable to mining data with high value density from massive data. The method may be executed by the data mining apparatus provided in an embodiment of the present invention; the apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device. As shown in FIG. 1, the method of this embodiment specifically includes:
S110: Extract target data according to a pre-designed data model. The data structure of the target data comprises entities, attributes, attribute values and data source information.
The data model is designed in advance, and the data structure in the data model includes at least an entity, an attribute, an attribute value and data source information. Taking the compilation of basic personnel information data as an example, the data structure in the data model may include an entity ID, an attribute ID, an attribute value and data source information, where the entity ID may be an MD5 value computed from the certificate type and the certificate number, and the attribute ID may be a unique identifier of the corresponding attribute field. A data model designed in advance for basic personnel information data may be, for example, as shown in Table 1. The data source information may include an extracted data source system and an extracted source data set, and may further include other extraction source fields, such as the data extraction time, the data extraction source region and the operation type of the data extraction source; the operation type may be a login operation, a binding operation, a registration operation, and the like.
Target data are extracted from the massive log data according to the pre-designed data model, and the data structure of the target data includes entities, attributes, attribute values and data source information. Because data source systems are diverse, the data sources for basic personnel information may include a railway station ticketing system, a hotel system, Internet data and so on. Each data source has a different data structure, so structured data must be extracted according to the pre-designed data model. For example, when target data are extracted according to the data model in Table 1, the result may be as shown in Table 2 (in the gender field, 1 represents male, 2 represents female and 9 represents unknown).
Table 1. Data model example
| Serial number | Name | Remarks |
| 1 | Entity ID | MD5 value calculated from the certificate type and certificate number |
| 2 | Attribute ID | Unique identifier of the corresponding attribute field |
| 3 | Attribute value | |
| 4 | Extracted data source system | |
| 5 | Extracted source data set | |
Table 2. Target data extraction example
| Entity ID | Attribute ID | Attribute value | Extracted data source system | Extracted source data set |
| OBJId1 | Name | Zhang San | Railway station ticketing system | Ticketing information |
| OBJId1 | Name | Zhang Sanqiang | Internet | Registration information |
| OBJId1 | Gender | 1 | Railway station ticketing system | Ticketing information |
| OBJId1 | Gender | 9 | Internet | Registration information |
| OBJId1 | Age | 27 | Railway station ticketing system | Ticketing information |
| OBJId1 | Age | 25 | Internet | Registration information |
| OBJId1 | Household registration area | Shanxi Province | Railway station ticketing system | Ticketing information |
| OBJId1 | Household registration area | Taiyuan, Shanxi | Internet | Registration information |
| OBJId2 | Name | Wang Licha | Hotel system | Hotel check-in information |
| OBJId2 | Gender | 2 | Hotel system | Hotel check-in information |
| OBJId2 | Age | 28 | Hotel system | Hotel check-in information |
| OBJId2 | Household registration area | Dongcheng, Beijing | Hotel system | Hotel check-in information |
| OBJId2 | Name | Wang Jingli | Internet | Mail information |
| OBJId2 | Gender | Null | Internet | Mail information |
| OBJId2 | Age | Null | Internet | Mail information |
| OBJId2 | Household registration area | Beijing | Internet | Mail information |
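The extraction step can be illustrated with a minimal Python sketch. It assumes one flat record per attribute value, following the columns of Table 1. The names `TargetRecord` and `make_entity_id`, the `|` separator used before hashing, and the certificate values are hypothetical; the description only states that the entity ID may be an MD5 value derived from the certificate type and certificate number.

```python
import hashlib
from typing import NamedTuple, Optional


class TargetRecord(NamedTuple):
    """One extracted row: a single attribute value of a single entity."""
    entity_id: str
    attribute_id: str
    attribute_value: Optional[str]
    source_system: str
    source_data_set: str


def make_entity_id(certificate_type: str, certificate_number: str) -> str:
    # Assumption: the two fields are joined with "|" and UTF-8 encoded
    # before hashing; the description does not fix a concatenation format.
    raw = f"{certificate_type}|{certificate_number}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()


# Hypothetical certificate data, for illustration only.
entity_id = make_entity_id("ID_CARD", "000000000000000000")

# Two sources report different names for the same entity, as in Table 2.
records = [
    TargetRecord(entity_id, "Name", "Zhang San",
                 "Railway station ticketing system", "Ticketing information"),
    TargetRecord(entity_id, "Name", "Zhang Sanqiang",
                 "Internet", "Registration information"),
]
```

Because the entity ID depends only on the certificate fields, rows extracted from different source systems for the same person share one ID and can later be grouped for merging.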
S120: Calculate a reliability value for each attribute value according to the trusted weight of the data source information in the data structure of the target data.
The reliability value of each attribute value of the extracted target data is calculated from the trusted weight of the data source information in the data structure of that target data. The data source information includes at least an extracted data source system and an extracted source data set, so the reliability value of an attribute value can be calculated from the trusted weights of these two fields. The trusted weight of the extracted data source system is the first trusted weight, and the trusted weight of the extracted source data set is the second trusted weight.
Specifically, the trusted weight of the data source information is the product of the first trusted weight and the second trusted weight, that is, the product of the trusted weights of the extracted data source system and the extracted source data set.
If the data source information further includes other extraction source dimension fields, such as the data extraction time, the data extraction source region and the operation type of the data extraction source (for example a login, binding or registration operation), the trusted weight of the data source information is the product of the trusted weights of all source dimensions. The specific weight of each source dimension can generally be determined from statistical experience.
Taking the attributes "name" and "age" as examples, the trusted weights used in this calculation are shown in Table 3. Correspondingly, the reliability value of an attribute value is the trusted weight of its data source information.
Table 3. Example trusted weights of data source information
| Attribute name | Extracted data source system | First trusted weight | Extracted source data set | Second trusted weight |
| Name | Railway station ticketing system | 1 | Ticketing information | 1 |
| Name | Hotel system | 0.9 | Hotel check-in information | 0.9 |
| Name | Internet | 0.8 | Registration information | 0.6 |
| Name | Internet | 0.7 | Mail information | 0.5 |
| Age | Railway station ticketing system | 0.8 | Ticketing information | 0.7 |
| Age | Hotel system | 0.7 | Hotel check-in information | 0.7 |
| Age | Internet | 0.6 | Registration information | 0.5 |
| Age | Internet | 0.2 | Mail information | 0.2 |
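The weight calculation of S120 can be sketched in Python as follows. The lookup table reproduces the values of Table 3; the dictionary layout and the function name `reliability` are assumptions made for illustration and are not prescribed by the description.

```python
from math import prod

# (attribute, source system, source data set) -> per-dimension trusted weights.
# Values are copied from Table 3; the tuple holds (first weight, second weight).
TRUSTED_WEIGHTS = {
    ("Name", "Railway station ticketing system", "Ticketing information"): (1.0, 1.0),
    ("Name", "Hotel system", "Hotel check-in information"): (0.9, 0.9),
    ("Name", "Internet", "Registration information"): (0.8, 0.6),
    ("Name", "Internet", "Mail information"): (0.7, 0.5),
    ("Age", "Railway station ticketing system", "Ticketing information"): (0.8, 0.7),
    ("Age", "Hotel system", "Hotel check-in information"): (0.7, 0.7),
    ("Age", "Internet", "Registration information"): (0.6, 0.5),
    ("Age", "Internet", "Mail information"): (0.2, 0.2),
}


def reliability(attribute: str, source_system: str, source_data_set: str) -> float:
    """Reliability value = product of the trusted weights of all source dimensions."""
    weights = TRUSTED_WEIGHTS[(attribute, source_system, source_data_set)]
    return prod(weights)


name_from_ticketing = reliability(
    "Name", "Railway station ticketing system", "Ticketing information")
name_from_internet = reliability("Name", "Internet", "Registration information")
```

Because the function multiplies every weight in the tuple, further source dimensions (extraction time, source region, operation type) could be supported by lengthening the tuples without changing the function.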
S130: Sort the attribute values of the same attribute of the same entity according to the reliability value.
The reliability values of the attribute values, determined from the calculated trusted weights of the data source information, are sorted from high to low. Taking the name attribute as an example, for the entity OBJId1 the reliability value of the name attribute value "Zhang San" is 1 × 1 = 1, and the reliability value of the name attribute value "Zhang Sanqiang" is 0.8 × 0.6 = 0.48.
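A minimal sketch of the sorting step, assuming each scored row is a tuple of entity ID, attribute ID, attribute value and reliability value. The function name `rank_by_reliability` is hypothetical; the scores are the products of the Table 3 weights for entity OBJId1.

```python
from collections import defaultdict

# (entity ID, attribute ID, attribute value, reliability value)
scored_rows = [
    ("OBJId1", "Name", "Zhang Sanqiang", 0.8 * 0.6),
    ("OBJId1", "Name", "Zhang San", 1.0 * 1.0),
    ("OBJId1", "Age", "25", 0.6 * 0.5),
    ("OBJId1", "Age", "27", 0.8 * 0.7),
]


def rank_by_reliability(rows):
    """Group rows by (entity, attribute) and sort each group from high to low."""
    groups = defaultdict(list)
    for entity_id, attribute_id, value, score in rows:
        groups[(entity_id, attribute_id)].append((value, score))
    return {
        key: sorted(candidates, key=lambda pair: pair[1], reverse=True)
        for key, candidates in groups.items()
    }


ranked = rank_by_reliability(scored_rows)
```

Grouping before sorting ensures that values are only compared with other values of the same attribute of the same entity.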
S140: Select the attribute value with the highest reliability value as the merged attribute value of the target attribute of the target entity.
The top-ranked attribute values, for example the top three or the top five, may be taken as high-reliability attribute values and stored as the data mining result; alternatively, only the attribute value with the highest reliability may be stored and taken as the merged attribute value of the target attribute of the target entity.
For example, after sorting, the attribute values of the name attribute of entity OBJId1 may be merged into "Zhang San". The other attributes are handled similarly, and finally a complete entity can be output, which includes all of its attributes and the merged attribute value of each attribute.
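The merging step can be sketched as follows, assuming the candidates of each attribute are already sorted from high to low reliability. The function name `merge_entities` and the `top_k` parameter are hypothetical; `top_k` models the option of keeping the top few values instead of only the best one.

```python
# (entity ID, attribute ID) -> candidates sorted by reliability, high to low.
ranked_candidates = {
    ("OBJId1", "Name"): [("Zhang San", 1.0), ("Zhang Sanqiang", 0.48)],
    ("OBJId1", "Age"): [("27", 0.56), ("25", 0.30)],
}


def merge_entities(ranked, top_k=1):
    """Keep the top_k values per attribute and assemble one dict per entity.

    With top_k == 1 each attribute maps to a single merged value;
    otherwise it maps to the list of the top_k values.
    """
    entities = {}
    for (entity_id, attribute_id), candidates in ranked.items():
        kept = [value for value, _score in candidates[:top_k]]
        merged = kept[0] if top_k == 1 else kept
        entities.setdefault(entity_id, {})[attribute_id] = merged
    return entities


complete_entities = merge_entities(ranked_candidates)
top_two = merge_entities(ranked_candidates, top_k=2)
```

How Null values or ties in reliability should be handled is not specified in the description, so this sketch leaves both cases untreated.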
In the data mining method provided by this embodiment, target data are extracted according to a pre-designed data model, wherein the data structure of the target data comprises an entity, an attribute, an attribute value and data source information; a reliability value is calculated for each attribute value according to the trusted weight of the data source information in the data structure of the target data; the attribute values of the same attribute of the same entity are sorted according to the reliability value; and the attribute value with the highest reliability value is selected as the merged attribute value of the target attribute of the target entity. These technical means achieve the beneficial effect of extracting data with high value density from massive data, and overcome the problems in the prior art that data held in a data model based on a business application are difficult to integrate and migrate and that the reliability of the data cannot be judged.
On the basis of the foregoing embodiments, before extracting the target data according to the pre-designed data model, the data mining method further includes:
defining a data structure based on an objectification idea, wherein the data structure comprises classes, entities, attributes and attribute values; and designing a data model of a first class according to the data structure, wherein the data model comprises an entity ID, an attribute ID, an attribute value and data source information, the entity ID is a unique identifier of the entity, and the attribute ID is a unique identifier of the corresponding attribute field.
The data structure is designed according to the objectification idea: the concepts of class, entity and attribute are introduced, and a relationship is established between attributes and attribute values. A class may be, for example, a basic personnel information class, a doctor professional information class or a student information class. A data model for the corresponding class is then designed according to the data structure, for example a data model for the basic personnel information class, which includes an entity ID, an attribute ID, an attribute value and data source information, where the entity ID may be an MD5 value computed from the certificate type and the certificate number.
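One possible reading of the objectified data structure is sketched below in Python. The type names `SourceInfo`, `AttributeValue`, `Entity` and `EntityClass` and their fields are illustrative assumptions; the description only requires that classes contain entities, entities have attributes, and attributes are related to attribute values that carry data source information.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SourceInfo:
    source_system: str
    source_data_set: str


@dataclass(frozen=True)
class AttributeValue:
    value: str
    source: SourceInfo


@dataclass
class Entity:
    entity_id: str
    # attribute ID -> every candidate value observed for that attribute
    attributes: Dict[str, List[AttributeValue]] = field(default_factory=dict)

    def add_value(self, attribute_id: str, value: str, source: SourceInfo) -> None:
        self.attributes.setdefault(attribute_id, []).append(AttributeValue(value, source))


@dataclass
class EntityClass:
    name: str
    entities: Dict[str, Entity] = field(default_factory=dict)

    def get_or_create(self, entity_id: str) -> Entity:
        return self.entities.setdefault(entity_id, Entity(entity_id))


personnel = EntityClass("Basic personnel information")
person = personnel.get_or_create("OBJId1")
person.add_value("Name", "Zhang San",
                 SourceInfo("Railway station ticketing system", "Ticketing information"))
person.add_value("Name", "Zhang Sanqiang",
                 SourceInfo("Internet", "Registration information"))
```

Because attributes are stored as keyed entries rather than table columns, a new attribute for a class would need no schema change, which is in line with the stated goal of avoiding migration when business requirements change.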
Specifically, after selecting the attribute value with the highest reliability value as the merged attribute value of the target attribute of the target entity, the data mining method further includes: outputting and storing the attributes and attribute values of the target entity in a star data relationship.
After the attribute values of each attribute of the target entity have been merged, a complete entity may be output, and its attributes and attribute values may be output and stored in a star data relationship, as shown in FIG. 2. Storage in a star data relationship facilitates graph computation on the data.
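As a rough illustration, a star data relationship can be modeled with the entity as the central node and one edge per merged attribute. This layout is an assumption, since the exact structure of FIG. 2 is not reproduced in the text, and the function name `to_star_edges` is hypothetical.

```python
def to_star_edges(entity_id, merged_attributes):
    """Return one (center, edge label, leaf) triple per merged attribute.

    Assumed layout: the entity ID is the center of the star, the attribute
    ID labels the edge, and the merged attribute value is the leaf node.
    """
    return [
        (entity_id, attribute_id, value)
        for attribute_id, value in merged_attributes.items()
    ]


# Merged values for OBJId1, using the highest-reliability value per attribute.
star_edges = to_star_edges("OBJId1", {"Name": "Zhang San", "Age": "27"})
```

An edge list of this form can be loaded directly into a graph library or graph database, which is one way the star layout could support graph computation.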
Embodiment Two
FIG. 3 is a schematic structural diagram of a data mining apparatus according to Embodiment Two of the present invention. This embodiment is applicable to mining data with high value density from massive data. The data mining apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device. As shown in FIG. 3, the data mining apparatus specifically includes a target data extraction module 310, a reliability value calculation module 320, a reliability sorting module 330 and an attribute value merging module 340, wherein:
the target data extraction module 310 is configured to extract target data according to a pre-designed data model, where the data structure of the target data includes an entity, an attribute, an attribute value and data source information;
the reliability value calculation module 320 is configured to calculate a reliability value for each attribute value according to a trusted weight of the data source information in the data structure of the target data;
the reliability sorting module 330 is configured to sort the attribute values of the same attribute of the same entity according to the reliability value;
and the attribute value merging module 340 is configured to select the attribute value with the highest reliability value as the merged attribute value of the target attribute of the target entity.
In the data mining apparatus provided by this embodiment, target data are extracted according to a pre-designed data model, wherein the data structure of the target data comprises an entity, an attribute, an attribute value and data source information; a reliability value is calculated for each attribute value according to the trusted weight of the data source information in the data structure of the target data; the attribute values of the same attribute of the same entity are sorted according to the reliability value; and the attribute value with the highest reliability value is selected as the merged attribute value of the target attribute of the target entity. These technical means achieve the beneficial effect of extracting data with high value density from massive data, and overcome the problems in the prior art that data held in a data model based on a business application are difficult to integrate and migrate and that the reliability of the data cannot be judged.
Specifically, the data source information at least includes an extracted data source system and an extracted source data set, where a trusted weight of the extracted data source system is a first trusted weight, and a trusted weight of the extracted source data set is a second trusted weight.
Specifically, the trusted weight of the data source information is a product of the first trusted weight and the second trusted weight.
On the basis of the foregoing embodiments, the data mining apparatus further includes a data structure definition module and a data model design module, which operate before the target data are extracted according to the pre-designed data model, wherein:
the data structure definition module is configured to define a data structure based on an objectification idea, where the data structure includes classes, entities, attributes and attribute values;
and the data model design module is configured to design a data model of a first class according to the data structure, where the data model includes an entity ID, an attribute ID, an attribute value and data source information, the entity ID is a unique identifier of the entity, and the attribute ID is a unique identifier of the corresponding attribute field.
Specifically, the data mining apparatus further includes a data output module, which is configured to output and store the attributes and attribute values of the target entity in a star data relationship after the attribute value with the highest reliability value has been selected as the merged attribute value of the target attribute of the target entity.
The data mining device can execute the data mining method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed data mining method.
Embodiment Three
FIG. 4 is a schematic diagram of the hardware structure of a computer device according to Embodiment Three of the present invention. As shown in FIG. 4, the computer device includes:
one or more processors 410, one processor 410 being illustrated in FIG. 4;
a memory 420;
the computer device may further include: an input device 430 and an output device 440.
The processor 410, the memory 420, the input device 430 and the output device 440 in the computer apparatus may be connected by a bus or other means, and fig. 4 illustrates the connection by a bus as an example.
The memory 420, which is a non-transitory computer-readable storage medium, may be used for storing software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the data mining method in the embodiments of the present invention (for example, the target data extraction module 310, the reliability value calculation module 320, the reliability sorting module 330 and the attribute value merging module 340 shown in FIG. 3). By running the software programs, instructions and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the computer device, that is, implements the data mining method of the above method embodiment.
The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 440 may include a display device such as a display screen.
Embodiment Four
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a data mining method, the method comprising:
extracting target data according to a pre-designed data model, wherein the data structure of the target data comprises an entity, an attribute, an attribute value and data source information;
calculating a reliability value for each attribute value according to a trusted weight of the data source information in the data structure of the target data;
sorting the attribute values of the same attribute of the same entity according to the reliability value;
and selecting the attribute value with the highest reliability value as the merged attribute value of the target attribute of the target entity.
Optionally, the computer executable instructions, when executed by the computer processor, may be further configured to implement a technical solution of a data mining method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.