CN107832440B

CN107832440B - Data mining method, device, server and computer readable storage medium

Info

Publication number: CN107832440B
Application number: CN201711144782.6A
Authority: CN
Inventors: 谢永恒; 谭罗乐; 火一莽; 万月亮
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2017-11-17
Filing date: 2017-11-17
Publication date: 2020-10-13
Anticipated expiration: 2037-11-17
Also published as: CN107832440A

Abstract

The invention discloses a data mining method, a device, a server and a computer readable storage medium, wherein the data mining method comprises the following steps: acquiring original data from an HDFS; extracting object information and object relation information in the original data; storing the extracted object information and the object relation information into Hbase; and mining relation data based on the stored object information and the object relation information. The technical scheme of the invention solves the problem that existing data mining processes place high requirements on memory and network bandwidth configuration, enables rapid relational data mining on a cluster with modest memory and network cards slower than ten-gigabit, and reduces the implementation cost.

Description

Data mining method, device, server and computer readable storage medium

Technical Field

The embodiment of the invention relates to the technical field of big data and data mining, in particular to a data mining method, a data mining device, a server and a computer readable storage medium.

Background

Currently, as computer and network applications become more widespread and business in different fields grows richer, it is increasingly important to mine desired data from large volumes of data records quickly and efficiently, for example to mine objects of different categories, or objects that are related in some way (i.e., objects that share the same relationship), from a large number of data records related to a particular object.

A task that frequently arises in practical applications of big data platforms is processing data that belongs to the same relationship, for example finding the friends of a person. How to merge scattered data into the same relationship within a big data system is a problem often encountered in big data processing.

At present, computation over data in the same relationship generally uses GraphX in Spark to compute a graph and then output the calculation result. However, although computing with Spark offers many conveniences, its requirements on memory and network bandwidth are high: the network card recommended on the official Spark website is ten-gigabit (10,000 Mbit/s), which is not suitable for some low-configuration clusters. If Neo4j is used for the computation instead, the open source version has limited memory capacity and cannot meet the memory requirements of mass data.

Disclosure of Invention

The invention provides a data mining method, a data mining device, a server and a computer readable storage medium. The data mining method is suitable for clusters with modest memory and network cards slower than ten-gigabit, so that cost is reduced and the problem that existing big data mining processes place high requirements on memory and network bandwidth configuration is solved.

In a first aspect, an embodiment of the present invention provides a data mining method, including:

acquiring original data from an HDFS (Hadoop Distributed File System);

extracting object information and object relation information in the original data;

storing the extracted object information and the object relation information into Hbase (Hadoop Database);

and mining relation data based on the stored object information and the object relation information.

In a second aspect, an embodiment of the present invention further provides a data mining apparatus, including:

the data acquisition module is used for acquiring original data from the HDFS;

the information extraction module is used for extracting the object information and the object relation information in the original data;

the information storage module is used for storing the extracted object information and the object relation information into Hbase;

and the data mining module is used for mining the relation data based on the stored object information and the object relation information.

In a third aspect, an embodiment of the present invention further provides a server, where the server includes:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data mining method of any embodiment of the present invention.

In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data mining method according to any embodiment of the present invention.

According to the technical scheme, original data are obtained from an HDFS; object information and object relation information in the original data are extracted; the extracted object information and object relation information are stored into Hbase; and relation data are mined based on the stored object information and object relation information. This solves the problem that existing big data mining processes place high requirements on memory and network bandwidth configuration, enables rapid relational data mining on a cluster with modest memory and network cards slower than ten-gigabit, and reduces the implementation cost.

Drawings

Fig. 1 is a flowchart of a data mining method according to a first embodiment of the present invention.

Fig. 2 is a flowchart of mining relationship data according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a data mining device according to a third embodiment of the present invention.

Fig. 4 is a schematic structural diagram of a server according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a data mining method according to an embodiment of the present invention, where the present embodiment is applicable to a case of relational data mining, the method may be executed by a data mining device, the device may be implemented by hardware and/or software, and the device may be integrated in a server. As shown in fig. 1, the method specifically includes the following steps:

Step 110, obtaining raw data from the HDFS.

The original data is unprocessed, unstructured data. Unstructured data refers to data that cannot be expressed through a predefined data model or stored in a relational database table, such as plain text (web pages, mails and the like), images, data, audio, video, log-type data and the like. The original data is placed in an HDFS based on the Hadoop architecture for persistent storage. Therefore, during data mining or processing, the relevant raw data can be obtained from the HDFS according to the business requirements.

Step 120, extracting the object information and the object relation information in the original data.

Related object information and object relation information are extracted from the acquired original data according to business needs. The object information describes a complete business object and contains a series of attributes related to that object. For example, if a person is an object, the attributes of the person include name, gender, age, height, blood type, and the like; if a mobile phone is an object, the attributes of the mobile phone include a mobile phone number, a mobile phone model, a price, a color, a manufacturer, an IMEI (International Mobile Equipment Identity), and the like. The object relation information is relationship information between objects; for example, between the two objects of a person and a mobile phone, the object relation information may be a one-to-one relationship. Preferably, the object information in the original data may be extracted using MapReduce, and the object relation information may then be extracted based on the object information.
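The extraction described in step 120 can be illustrated with a minimal map/reduce-style sketch in plain Python. It is not the Hadoop MapReduce API; the raw line format, the `RELATION_KEYS` set and the function names `map_line` and `reduce_pairs` are all assumptions made for illustration and do not appear in the original text.

```python
from collections import defaultdict

# Hypothetical raw records: "<object id> key=value key=value ...".
RAW_LINES = [
    "person:A name=Alice owns=phone:1",
    "person:B name=Bob owns=phone:2",
    "person:A age=30 father=person:B",
]

# Assumption: these keys point at other objects rather than plain attributes.
RELATION_KEYS = {"father", "owns"}


def map_line(line):
    """Map phase: emit (object_id, (key, value)) pairs for one raw line."""
    object_id, *fields = line.split()
    for field in fields:
        key, value = field.split("=", 1)
        yield object_id, (key, value)


def reduce_pairs(pairs):
    """Reduce phase: split the pairs into object attributes and relations."""
    objects = defaultdict(dict)
    relations = []
    for object_id, (key, value) in pairs:
        if key in RELATION_KEYS:
            relations.append((object_id, key, value))
        else:
            objects[object_id][key] = value
    return dict(objects), relations


pairs = [pair for line in RAW_LINES for pair in map_line(line)]
objects, relations = reduce_pairs(pairs)
```

Attributes scattered across several raw lines end up merged under one object identifier, while relation-bearing keys are collected separately as (source, relation, target) triples.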

Step 130, storing the extracted object information and the object relation information in Hbase.

After the relevant object information and object relation information are extracted, they are stored in Hbase.
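The text does not specify how the rows are laid out in Hbase. As one possible sketch, the following uses an in-memory dictionary as a stand-in for an Hbase table. The row-key prefixes `obj|` and `rel|`, the column names, and the helper functions are assumptions for illustration only, not a real client API.

```python
# In-memory stand-in for an Hbase table: row key -> {column: value}.
table = {}


def put(row_key, columns):
    """Merge columns into a row, creating the row if it does not exist."""
    table.setdefault(row_key, {}).update(columns)


def store_object(object_id, attributes):
    """Store object attributes under a hypothetical 'attr' column family."""
    put(f"obj|{object_id}", {f"attr:{k}": v for k, v in attributes.items()})


def store_relation(source_id, relation, target_id):
    """Store one relation per row; the key prefix groups edges by source and type."""
    put(f"rel|{source_id}|{relation}|{target_id}", {"rel:target": target_id})


def scan_prefix(prefix):
    """Return the sorted row keys starting with prefix (mimics a prefix scan)."""
    return sorted(key for key in table if key.startswith(prefix))


store_object("person:A", {"name": "Alice", "age": "30"})
store_relation("person:A", "father", "person:B")
store_relation("person:A", "owns", "phone:1")
father_rows = scan_prefix("rel|person:A|father|")
```

Placing the source object and relation type at the front of the row key means that all relations of one kind for one object can be fetched together, which suits the level-by-level lookups used later in mining.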

Step 140, mining relation data based on the stored object information and the object relation information.

The relation data is obtained by mining the stored object information and object relation information. Specifically, the data required by each further mining step is obtained from the object relation information. For example, to find the direct family members of A, the father B of A is first found according to the relationships between A and other people, then the father of B is found according to the relationships between B and other people, and so on, so that the required relation data is obtained step by step from the relationships between the objects. In this way, after the data has been loaded, the data related to the central data can be located within it, and further relation data can be mined according to the relationships between the objects.

According to the technical scheme of this embodiment, original data are obtained from an HDFS; object information and object relation information in the original data are extracted; the extracted object information and object relation information are stored into Hbase; and relation data are mined based on the stored object information and object relation information. This solves the problem that existing big data mining processes place high requirements on memory and network bandwidth configuration, enables rapid relational data mining on a cluster with modest memory and network cards slower than ten-gigabit, and reduces the implementation cost. Data storage is based on the Hadoop architecture, so the requirement on server performance is not high; the object information and object relation information are extracted and stored, and the stored information is searched and mined by hierarchical recursion, so the requirement on machine memory is not high and data mining can be realized at low hardware cost.

Example two

Fig. 2 is a flowchart of mining relationship data according to a second embodiment of the present invention. As shown in fig. 2, on the basis of the first embodiment, optionally, step 140, mining relation data based on the stored object information and the object relation information, specifically includes:

Step 210, loading a preset configuration file.

The preset configuration file can be set flexibly according to service requirements. It contains the configuration settings used by all users, and may include program items, recursion levels, special conditions, business rules, and the like. The recursion level is the number of data levels to be mined according to business needs; the larger the level, the deeper the data that can be mined. A special condition is a condition, set according to a specific service, under which mining stops once it is met. For example, if data of a certain type makes mining difficult and hinders obtaining results quickly, a special condition can be set in the configuration file so that, before the configured recursion level is reached, mining stops as soon as data of that type is encountered. Business rules are rules that can be customized according to a specific business or need; before the configured recursion level is reached, mining stops as soon as the mining process no longer conforms to the business rules. In either case, after mining stops, whether to record the mining result can be chosen according to specific requirements. All of these settings can be placed in the configuration file, which is the core of the overall recursive mining and can be designed by those familiar with the business.
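The text does not fix a format for the configuration file. The sketch below assumes a JSON layout purely for illustration; every field name (`recursion_level`, `special_conditions`, `business_rules`, `record_result_on_stop`) and the `MiningConfig` class are hypothetical.

```python
import json
from dataclasses import dataclass, field

# Hypothetical configuration file content.
CONFIG_TEXT = """
{
  "recursion_level": 3,
  "special_conditions": {"stop_on_types": ["phone"]},
  "business_rules": {"allowed_relations": ["father"]},
  "record_result_on_stop": true
}
"""


@dataclass
class MiningConfig:
    recursion_level: int                                     # how many levels to mine
    stop_on_types: list = field(default_factory=list)        # special conditions
    allowed_relations: list = field(default_factory=list)    # business rules
    record_result_on_stop: bool = True                       # keep partial results?


def load_config(text):
    """Parse the configuration text and validate the recursion level."""
    raw = json.loads(text)
    if raw["recursion_level"] < 1:
        raise ValueError("recursion_level must be at least 1")
    return MiningConfig(
        recursion_level=raw["recursion_level"],
        stop_on_types=raw.get("special_conditions", {}).get("stop_on_types", []),
        allowed_relations=raw.get("business_rules", {}).get("allowed_relations", []),
        record_result_on_stop=raw.get("record_result_on_stop", True),
    )


config = load_config(CONFIG_TEXT)
```

Keeping the recursion level, special conditions and business rules in one declarative file lets someone familiar with the business adjust mining behavior without touching the mining code.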

Step 220, loading object information and object relation information corresponding to the mining target from the Hbase.

The mining target refers to the relation data that currently needs to be mined, for example the direct family members of A. After the preset configuration file is loaded, the object information and object relation information corresponding to the mining target required by the specific service are loaded from the Hbase. For example, the attribute information of A and the relationships between A and other people are loaded from the Hbase, while the relationships between A and other objects (such as mobile phones) do not need to be loaded. This avoids the slow processing caused by loading too much useless data.
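The selective loading in step 220 can be sketched as a filter over relation rows. The row tuples and the function `load_relations` below are assumptions for illustration; in practice the rows would come from an Hbase query rather than a Python list.

```python
# Hypothetical relation rows: (source, relation type, target).
RELATION_ROWS = [
    ("A", "father", "B"),
    ("A", "owns", "phone:1"),
    ("B", "father", "C"),
    ("B", "friend", "D"),
]


def load_relations(rows, wanted_relations):
    """Index only the relation types the mining target needs, keyed by source."""
    index = {}
    for source, relation, target in rows:
        if relation in wanted_relations:
            index.setdefault(source, []).append((relation, target))
    return index


# A family-member mining target only needs person-to-person family relations.
family = load_relations(RELATION_ROWS, {"father"})
```

Here the person-to-phone and friend relations are never brought into memory, which mirrors the point that unrelated relations need not be loaded.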

Step 230, recursively inquiring object relation information corresponding to the mining target according to the configuration file, and mining the object information corresponding to the mining target based on the current object relation information to obtain relation data.

The method mainly uses a recursive algorithm to mine relation data according to the recursion level in the configuration file. Specifically, starting from the relationships between the objects mined so far, the recursive algorithm further mines the objects corresponding to the target to obtain the relation data. For example, to find the direct family members of A, the relationships between A and other people are first queried in Hbase to find the father B of A, then the father of B is found according to the relationships between B and other people, and so on, so that the required relation data is obtained step by step from the relationships between the objects.

When the object relation information is mined in the Hbase, tools suited to the platform architecture and the specific service can be used to improve query efficiency. For example, tools such as Hive or Presto can simplify the process, and a secondary index can be built for the Hbase to enable fast queries. If the mining time needs to be kept within a certain range, the quantity of data mined can be reduced as the recursion level increases; the specifics can be configured flexibly according to business needs.
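The level-limited recursion of step 230 can be sketched as follows. The `RELATIONS` dictionary stands in for relation lookups against Hbase, and the function name `mine` and the result tuple layout are assumptions for illustration.

```python
# Hypothetical relation index: person -> list of fathers.
RELATIONS = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}


def mine(target, max_level, level=1, seen=None):
    """Recursively collect (level, source, related) triples up to max_level."""
    seen = {target} if seen is None else seen
    if level > max_level:
        return []  # configured recursion level reached
    results = []
    for related in RELATIONS.get(target, []):
        if related in seen:
            continue  # guard against cycles in the relation data
        seen.add(related)
        results.append((level, target, related))
        results.extend(mine(related, max_level, level + 1, seen))
    return results


relation_data = mine("A", max_level=2)
```

With a recursion level of 2, the sketch finds the father of A and the father of B and then stops, without ever visiting D.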

Step 240, generating an Hbase warehousing file after the mining is stopped, writing the relation data into the Hbase warehousing file, and importing the Hbase warehousing file into the Hbase.

When the recursion level in the configuration file is reached, mining stops and an Hbase warehousing file is generated. The mined relation data is written into the Hbase warehousing file, and the file is then imported into Hbase, so that the relation data obtained according to the service requirement is stored in Hbase, where it is convenient for the service to query and use.

Preferably, after step 230 and before step 240, step 231 is performed: judging whether the mining stop condition is met according to the special condition and/or the business rule in the configuration file. If the mining stop condition is not met, step 230 continues to be executed; if the mining stop condition is met, mining stops. That is, even before the configured recursion level is reached, mining stops immediately once a special condition is met and/or a stop condition specified in a business rule is met.
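The format of the Hbase warehousing file is not given in the text. The sketch below assumes a tab-separated file and uses a dictionary as a stand-in for the Hbase table; the row-key layout and both function names are hypothetical, and on a real cluster a bulk-load tool would take the place of `import_warehousing_file`.

```python
import csv
import os
import tempfile


def write_warehousing_file(path, relation_data):
    """Write one tab-separated row per mined relation."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for level, source, target in relation_data:
            writer.writerow([f"{source}|{level}|{target}", source, target, level])


def import_warehousing_file(path, table):
    """Load the file into a dict that stands in for the Hbase result table."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row_key, source, target, level in csv.reader(handle, delimiter="\t"):
            table[row_key] = {"source": source, "target": target, "level": int(level)}
    return table


relation_data = [(1, "A", "B"), (2, "B", "C")]
path = os.path.join(tempfile.mkdtemp(), "relations.tsv")
write_warehousing_file(path, relation_data)
result_table = import_warehousing_file(path, {})
```

Writing the results to a file first and importing them in one pass keeps the mining loop free of per-record writes to the store.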
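The stop check of step 231 can be sketched as a predicate evaluated at every level of the recursion. The edge layout (one outgoing edge per object), the configuration keys and the function names are simplifying assumptions for illustration.

```python
# Hypothetical data: each object has at most one outgoing edge.
EDGES = {
    "A": {"target": "B", "relation": "father", "type": "person"},
    "B": {"target": "phone:1", "relation": "owns", "type": "phone"},
}

CONFIG = {
    "recursion_level": 5,
    "stop_on_types": ["phone"],               # special condition
    "allowed_relations": ["father", "owns"],  # business rule
}


def should_stop(candidate, config):
    """Return True when a special condition or business rule halts mining."""
    if candidate["type"] in config["stop_on_types"]:
        return True  # special condition: a hard-to-mine data type was reached
    if candidate["relation"] not in config["allowed_relations"]:
        return True  # the relation does not conform to the business rules
    return False


def mine_with_stop(edges, current, config, level=1):
    """Recurse level by level, stopping early when should_stop fires."""
    if level > config["recursion_level"]:
        return []
    candidate = edges.get(current)
    if candidate is None or should_stop(candidate, config):
        return []
    found = [(level, current, candidate["target"])]
    return found + mine_with_stop(edges, candidate["target"], config, level + 1)


mined = mine_with_stop(EDGES, "A", CONFIG)
```

Although the configured recursion level is 5, mining ends at the second level because the next object is of a type listed under the special condition. In this sketch the result that triggered the stop is not recorded; the text leaves that choice to the specific requirements.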

Optionally, after obtaining the relationship data, step 232 may be executed: and judging whether the relation data meets the preset service requirement or not. If the relation data meets the preset service requirement, step 240 is directly executed to store the relation data. If the relation data does not meet the preset service requirement, step 233 is executed to perform information completion on the relation data according to the corresponding object information, and then step 240 is executed to perform storage of the relation data.

Information completion means that, when the obtained relation data does not meet the preset service requirement, the attribute information that the object lacks is filled in. For example, the original data is stored using a key-value model, and the object information mined according to the object relationship may carry only a certain key or a certain attribute. In that case, the other attributes of the object need to be completed from the object information according to the business requirements. For example, if a person object in the obtained relation data carries only a name, information such as sex and age can be supplemented. This ensures that the stored relation data is complete and does not lack information the service requires.

It should be noted that the data mining method according to the embodiment of the present invention has low requirements on memory configuration and network bandwidth configuration, and can be implemented on a server with an ordinary configuration, for example, 32 GB of memory, a 2 TB hard disk, and a gigabit network card.

According to the technical scheme of this embodiment, the data is mined level by level using a recursive algorithm, which avoids blind mining in a large amount of data and allows the relation data to be mined rapidly. Before the configured recursion level is reached, whether a mining stop condition is met is judged according to the special conditions and/or business rules in the configuration file, and mining stops immediately once the condition is met, which avoids excessive mining difficulty and the mining of irrelevant data. Richer and more accurate relation data can be obtained through information completion.
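Steps 232 and 233 can be sketched as a check for required attributes followed by a fill-in from the stored object information. The `REQUIRED` tuple, the sample data and the function names are assumptions for illustration.

```python
# Hypothetical stored object information, keyed by object identifier.
OBJECT_INFO = {"B": {"name": "Bob", "sex": "male", "age": "58"}}

# Assumption: the service requires these attributes on every person record.
REQUIRED = ("name", "sex", "age")


def needs_completion(record):
    """Step 232: does the record lack any attribute the service requires?"""
    return any(key not in record for key in REQUIRED)


def complete(record, object_info):
    """Step 233: fill missing attributes from the stored object information."""
    if not needs_completion(record):
        return record
    stored = object_info.get(record["id"], {})
    completed = dict(record)  # leave the mined record itself unchanged
    for key in REQUIRED:
        if key not in completed and key in stored:
            completed[key] = stored[key]
    return completed


mined_record = {"id": "B", "name": "Bob"}
full_record = complete(mined_record, OBJECT_INFO)
```

A record that already carries every required attribute is passed through unchanged, so completion costs an extra lookup only for incomplete records.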

Example three

Fig. 3 is a schematic structural diagram of a data mining apparatus according to a third embodiment of the present invention, and referring to fig. 3, the data mining apparatus includes:

a data obtaining module 310, configured to obtain raw data from the HDFS;

an information extraction module 320, configured to extract object information and object relationship information in the original data;

an information storage module 330, configured to store the extracted object information and the object relationship information in Hbase;

and a data mining module 340, configured to mine the relation data based on the stored object information and the object relation information.

Wherein the raw data is unstructured data that has not been processed; the object information includes object-related attribute information; the object relationship information includes relationship information between objects.

Preferably, the data mining module 340 includes:

the configuration file loading unit is used for loading a preset configuration file;

the information loading unit is used for loading object information and object relation information corresponding to the mining target from the Hbase;

the data mining unit is used for recursively inquiring object relation information corresponding to the mining target according to the configuration file and mining the object information corresponding to the mining target based on the current object relation information to obtain relation data;

and the data import unit is used for generating an Hbase warehousing file after the mining is stopped, writing the relation data into the Hbase warehousing file, and importing the Hbase warehousing file into the Hbase.

Further, the data mining module 340 further includes: a first judgment unit, configured to judge, after the object relation information corresponding to the mining target has been recursively queried according to the configuration file and the object information corresponding to the mining target has been mined based on the current object relation information to obtain relation data, whether a mining stop condition is met according to the special conditions and/or business rules in the configuration file; and if so, to stop the mining.

Optionally, the data mining module 340 further includes:

a second judging unit, configured to judge whether the relationship data meets a preset service requirement before writing the relationship data into the Hbase warehousing file;

and the information completion unit is used for performing information completion on the relation data according to the corresponding object information under the condition that the judgment result is negative.

The device can execute the data mining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the data mining method.
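The division of the device into four modules can be sketched as a small class whose fields are pluggable callables. The class name `DataMiningDevice`, the toy callables and the `store_backend` dictionary are assumptions for illustration; they stand in for the HDFS reader, extractor, Hbase writer and miner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataMiningDevice:
    acquire: Callable[[], list]            # data acquisition module 310
    extract: Callable[[list], tuple]       # information extraction module 320
    store: Callable[[dict, list], None]    # information storage module 330
    mine: Callable[[str], list]            # data mining module 340

    def run(self, target):
        """Run the four modules in order for one mining target."""
        raw = self.acquire()
        objects, relations = self.extract(raw)
        self.store(objects, relations)
        return self.mine(target)


store_backend = {"objects": {}, "relations": []}  # stand-in for Hbase


def acquire():
    return [("A", "father", "B"), ("B", "father", "C")]


def extract(raw):
    objects = {name: {"id": name} for edge in raw for name in (edge[0], edge[2])}
    return objects, list(raw)


def store(objects, relations):
    store_backend["objects"].update(objects)
    store_backend["relations"].extend(relations)


def mine(target):
    return [t for s, _, t in store_backend["relations"] if s == target]


device = DataMiningDevice(acquire, extract, store, mine)
found = device.run("A")
```

Because each module is an independent callable, the functional split can be rearranged or replaced as long as the corresponding functions are still provided.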

It should be noted that, in the embodiment of the data mining apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

The technical solution of this embodiment provides a data mining apparatus, including: a data acquisition module, used for acquiring original data from the HDFS; an information extraction module, used for extracting the object information and the object relation information in the original data; an information storage module, used for storing the extracted object information and the object relation information into Hbase; and a data mining module, used for mining the relation data based on the stored object information and the object relation information. This solves the problem that existing big data mining processes place high requirements on memory and network bandwidth configuration, enables rapid relational data mining on a cluster with modest memory and network cards slower than ten-gigabit, and reduces the implementation cost.

Example four

The present embodiment provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a data mining method as described in any of the embodiments above.

Fig. 4 is a schematic structural diagram of a server according to a fourth embodiment of the present invention, as shown in fig. 4, the server includes a processor 410 and a memory 420; the number of the processors 410 in the server may be one or more, and one processor 410 is taken as an example in fig. 4; the processor 410 and the memory 420 in the server may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.

The memory 420 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the data mining method in the embodiment of the present invention (e.g., the data acquisition module 310, the information extraction module 320, the information storage module 330, and the data mining module 340 in the data mining apparatus). The processor 410 executes various functional applications of the server and data processing by executing software programs, instructions and modules stored in the memory 420, that is, implements the data mining method described above.

The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the server, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to the server over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The server provided by the embodiment of the present invention may not only execute and implement the data mining method provided by any of the above embodiments, but also execute other programs or methods according to specific requirements of a service.

According to the technical scheme of this embodiment, the problem that existing big data mining processes place high requirements on memory and network bandwidth configuration is solved, rapid relational data mining can be realized on a cluster with modest memory and network cards slower than ten-gigabit, and the implementation cost is reduced.

Example five

The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data mining method described in any of the above embodiments.

In addition to the program that, when executed by the processor, implements the data mining method described in any embodiment of the present invention, the computer program stored in the computer-readable storage medium provided in this embodiment may include other programs for implementing specific business requirements.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of data mining, comprising:

acquiring original data from an HDFS;

extracting object information and object relation information in the original data; wherein the object information in the original data is attribute information of an object;

storing the extracted object information and the object relation information into Hbase;

mining relation data based on the stored object information and the object relation information;

wherein mining relationship data based on the stored object information and object relationship information comprises:

loading a preset configuration file;

loading object information and object relation information corresponding to a mining target from Hbase;

recursively inquiring object relation information corresponding to the mining target according to the configuration file, and mining the object information corresponding to the mining target based on the current object relation information to obtain relation data;

and after the mining is stopped, generating an Hbase warehousing file, writing the relation data into the Hbase warehousing file, and importing the Hbase warehousing file into the Hbase.

2. The method of claim 1, wherein after recursively querying object relationship information corresponding to the mining target according to the configuration file, and mining object information corresponding to the mining target based on current object relationship information to obtain relationship data, the method further comprises:

judging whether the mining stop condition is met or not according to the special condition and/or the service rule in the configuration file;

if so, the mining is stopped.

3. The method of claim 1, wherein prior to writing the relationship data to the Hbase warehousing file, the method further comprises:

judging whether the relation data meets the preset service requirement or not;

and if not, performing information completion on the relational data according to the corresponding object information.

4. The method of claim 1, wherein the raw data is unstructured data that has not been processed; the object information includes object-related attribute information; the object relationship information includes relationship information between objects.

5. A data mining device, comprising:

the data acquisition module is used for acquiring original data from the HDFS;

the information extraction module is used for extracting the object information and the object relation information in the original data; wherein the object information in the original data is attribute information of an object;

the data mining module is used for mining relation data based on the stored object information and the object relation information;

wherein the data mining module comprises:

6. The apparatus of claim 5, wherein the data mining module further comprises:

the judging unit is used for recursively inquiring the object relation information corresponding to the mining target according to the configuration file, mining the object information corresponding to the mining target based on the current object relation information to obtain relation data, and then judging whether a mining stop condition is met according to special conditions and/or business rules in the configuration file; if so, the mining is stopped.

7. A server, characterized in that the server comprises:

one or more processors;

a storage device for storing one or more programs,

when executed by the one or more processors, cause the one or more processors to implement the data mining method of any one of claims 1-4.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the data mining method of any one of claims 1 to 4.