CN110837520A

CN110837520A - Data processing method, platform and system

Info

Publication number: CN110837520A
Application number: CN201910959014.9A
Authority: CN
Inventors: 万鹏程; 吕勇; 李春生; 贾洪园
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Suning Cloud Computing Co Ltd
Priority date: 2019-10-10
Filing date: 2019-10-10
Publication date: 2020-02-25
Also published as: CA3154438A1; WO2021068549A1

Abstract

The embodiments of the present application disclose a data processing method, platform and system. The method comprises: storing original commodity content data in a first relational database, partitioned by cluster, database and table; establishing index data according to the original commodity content data and storing the index data in an index database, the index data comprising keyword fields and query dimension identification data corresponding to each keyword field; and calculating the original commodity content data through a calculation program to obtain calculation result data, and storing the calculation result data in the first relational database in association with the query dimension identification data. This scheme improves calculation efficiency; because the index database is built along the query dimensions and a query consults the index first, query efficiency is improved as well.

Description

Data processing method, platform and system

Technical Field

The present application relates to the field of service data calculation and query, and in particular, to a data processing method, platform, and system.

Background

When selling goods, a merchant often needs analysis data to guide operations. A platform obtains such data by analyzing and calculating over a large amount of commodity content data. One example is the commodity content quality score, which characterizes the quality of a commodity's description information and can guide merchants who sell physical goods. The platform derives this data by summarizing, analyzing and calculating the content data of many commodities from many merchants. At present, this summary analysis and calculation is mostly implemented with Java and the relational database MySQL, and when a merchant needs the calculation result data, MySQL is queried directly.

However, with the rapid development of e-commerce, the volume of commodity content data has grown explosively, and it increases sharply during platform promotions such as 'Double Eleven', '618', '818' and 'Double Twelve'. Calculation with Java and MySQL is inefficient at this scale, and so are merchants' queries of the calculation result data. For complex query conditions in particular, query times are typically on the order of seconds.

Disclosure of Invention

The application provides a data processing method, a data processing platform and a data processing system, which aim to solve the problem that in the prior art, the efficiency of calculating and querying commodity content data is low.

The application provides the following scheme:

one aspect provides a data processing method, including:

the original commodity content data is stored in a first relational database in a clustering, database-dividing and table-dividing manner;

establishing index data according to the original commodity content data and storing the index data in an index database; the index data comprises keyword fields and query dimension identification data corresponding to each keyword field;

and calling a calculation program to calculate the original commodity content data to obtain calculation result data, and storing the calculation result data and the query dimension identification data in the first relational database in an associated manner.

Preferably, the method further comprises:

receiving a query request of a user;

analyzing the query request to obtain a keyword to be queried;

inquiring in the index database to obtain inquiry dimension identification data corresponding to the keywords to be inquired as target identifications;

and inquiring in the first relational database to obtain the calculation result data corresponding to the target identification.

Preferably, the method further comprises:

storing at least a portion of the computed result data in association with the query dimension identification data in the index database.

Preferably,

the step of calling the calculation program to calculate the original commodity content data to obtain calculation result data comprises the following steps:

calling a calculation program to calculate, from the original commodity content data, a content quality score for each commodity in each of at least two content dimensions, and calculating a total content quality score for each commodity from its dimension scores;

the storing the calculation result data in association with the query dimension identification data in the first relational database comprises:

storing the per-dimension content quality scores of each commodity and the total content quality score of each commodity in the first relational database in association with the query dimension identification data;

the storing at least a portion of the data of the computation result data in association with the query dimension identification data in the index database comprises:

and storing the total content quality score of each commodity in the index database in association with the query dimension identification data.

Preferably, the query dimension identification data is a commodity code and/or a merchant code.

Preferably, the method further comprises:

receiving the original commodity content data and storing it in a second relational database, partitioned by cluster, database and table;

synchronizing the original commodity content data in the second relational database to the first relational database.

Preferably, receiving the original commodity content data and storing it in the second relational database, partitioned by cluster, database and table, includes:

receiving the original commodity content data and storing it in the second relational database, partitioned into clusters, databases and tables according to the commodity code.

Preferably, the first relational database is HBase, the second relational database is MySQL, the calculation program is Spark, and the index database is Elasticsearch.

The application also provides a data processing platform, which comprises a data storage layer and a data calculation layer;

the data storage layer is used for storing original commodity content data in a first relational database, partitioned by cluster, database and table, and for establishing index data according to the original commodity content data and storing the index data in an index database; the index data comprises keyword fields and query dimension identification data corresponding to each keyword field;

and the data calculation layer is used for calling a calculation program to calculate the original commodity content data to obtain calculation result data and storing the calculation result data and the query dimension identification data in the first relational database in an associated manner.

Yet another aspect of the present application provides a computer system, including:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:

According to the specific embodiments provided herein, the present application discloses the following technical effects:

According to the technical scheme, the original commodity data is stored in the relational database partitioned by cluster and database, and a calculation program is called to perform the calculation, which improves calculation efficiency. The index database is established according to the query dimensions, and a query consults the index first, which improves query efficiency. Compared with the prior art, the scheme can quickly serve multi-dimensional queries of the calculation result data and avoids the inefficiency of querying the relational database directly.

Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

FIG. 1 is a block diagram of a data processing platform provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of the cluster and database partitioning provided in an embodiment of the present application;

FIG. 3 is a flowchart of original merchandise content data synchronization provided by an embodiment of the present application;

FIG. 4 is a flowchart of a product content quality score query provided by an embodiment of the present application;

FIG. 5 is a flow chart of a data processing method provided by an embodiment of the present application;

FIG. 6 is a diagram of a computer system architecture provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.

The present application aims to provide a commodity content data processing method. Original commodity content data is stored in a relational database partitioned by cluster, database and table; a calculation program is called to calculate the sub-databases in parallel, which improves calculation efficiency; and index data is established according to the required query dimensions, so that a later query first matches identification data in the index database and then queries the relational database with it.

As shown in FIG. 1, the data processing platform according to an embodiment of the present application includes a MySQL database, an HBase database, a Spark calculation program, the search engine Elasticsearch, a remote service framework (RSF), and a query service.

The MySQL database receives the original commodity content data and stores the massive volume of data partitioned by cluster, database and table. The partitioning can be based on the commodity code; the specific operations are described in detail later.

The HBase database is synchronized from the data in the MySQL database, specifically through a data replication platform and a data exchange platform. After synchronization, the HBase database likewise stores the original commodity content data partitioned by cluster, database and table.

In other embodiments of the present application, the original commodity content data may be stored directly in the HBase database without passing through the MySQL database. The MySQL database is used here for the stability of data backup and because other business processes depend on it.

An index is established for the result data that will later be queried, and an association is established between the index and the calculation result data, so that the result data can be queried by way of the index data:

The index database Elasticsearch stores keyword fields used for queries, such as the commodity brand, together with the identification data, such as the commodity code, corresponding to each keyword field. Based on this index, a query keyword entered by the user (merchant) can be matched to the corresponding commodity codes.

The Spark calculation program performs MapReduce (a programming model for parallel processing of large data sets, larger than 1 TB) on the original commodity content data of each cluster, by commodity code number segment and according to expression rules, to obtain calculation results such as the commodity content quality score. After the calculation result is obtained, it is stored in the HBase database together with identification data such as the commodity code.

Through the above steps, the index data in Elasticsearch and the calculation result data in the HBase database are associated through the identification data.

When a user enters a query keyword, the RSF first queries the index to determine the matching identification data, such as commodity codes, and then retrieves the calculation result data from the HBase database according to those commodity codes.

The index may also be established independently of the calculation process, and at least a part of the calculation result may be stored in the index database. A query for that part of the result can then be completed by Elasticsearch alone, without a further query in the HBase database.
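The two query paths described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: plain dictionaries stand in for Elasticsearch and HBase, and the names `brand_index`, `index_total_scores`, `result_store` and `query_results`, as well as all codes and scores, are hypothetical.

```python
# Stand-in for the index database: keyword field value -> commodity codes.
brand_index = {"BrandA": ["000000000000000011", "000000000000000012"]}

# Stand-in for partial results kept in the index: commodity code -> total score.
index_total_scores = {"000000000000000011": 86.0, "000000000000000012": 91.5}

# Stand-in for the relational store: commodity code -> full calculation result.
result_store = {
    "000000000000000011": {"total": 86.0, "title": 9.0, "detail": 7.5},
    "000000000000000012": {"total": 91.5, "title": 9.5, "detail": 8.0},
}


def query_results(keyword: str, totals_only: bool = False) -> dict:
    """Resolve a keyword to commodity codes via the index, then fetch results."""
    codes = brand_index.get(keyword, [])
    if totals_only:
        # The partial result lives in the index, so no second lookup is needed.
        return {code: index_total_scores[code] for code in codes}
    # Full results require the second hop into the relational store.
    return {code: result_store[code] for code in codes if code in result_store}
```

With `totals_only=True` the function never touches `result_store`, which mirrors the case where the query is completed by the index database alone.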

It should be noted that the MySQL database, HBase database, Spark calculation program and search engine Elasticsearch described above can be replaced by modules with similar functions; FIG. 1 shows only one specific system structure of the present application.

Taking the system shown in FIG. 1 and the calculation of commodity content quality as an example, the following processes are described in detail: partitioned storage of the original commodity content data by cluster, database and table; commodity content data synchronization; commodity content quality score calculation; commodity content quality score synchronization; index establishment; and commodity content quality score query:

Storing the original commodity content data by cluster, database and table:

The original commodity content data is stored in 4 MySQL clusters according to the number segment of the commodity code. Within each cluster, the second-to-last digit of the commodity code, taken modulo 10, selects one of 10 sub-libraries (sub-databases); within each sub-library, the last digit of the commodity code, taken modulo 10, selects one of 10 sub-tables. Billions of commodity content records are thus dispersed over hundreds of sub-tables. A schematic diagram of the cluster and sub-library partitioning is shown in FIG. 2.

The number segments of commodity codes stored by each cluster are defined as follows: commodity data in the segment 000000000000000000 to 000000000500000000 is stored in cluster 1; the segment 000000000500000001 to 000000001000000000 in cluster 2; the segment 000000001000000001 to 000000001500000000 in cluster 3; and the segment 000000001500000001 to 000000002000000000 in cluster 4.

The sub-library within a cluster is defined as follows: the second-to-last digit of the commodity code is taken modulo 10, and the result designates the sub-library in which the commodity is stored.

The sub-table within a sub-library is defined as follows: the last digit of the commodity code is taken modulo 10, and the result designates the sub-table in which the commodity is stored.

For example, the commodity code 000000001500000023 belongs to cluster 4, sub-library 3, sub-table 4; the digits 2 and 3 mapping to sub-library 3 and sub-table 4 suggests that sub-libraries and sub-tables are numbered from 1.
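The routing rule above can be sketched as follows. This is a minimal illustration, not the implementation of the present application. The function name `route_commodity_code`, the 1-based numbering of sub-libraries and sub-tables, and the reading that the sub-library is selected by the second-to-last digit are assumptions, chosen because together they reproduce the worked example (code 000000001500000023 in cluster 4, sub-library 3, sub-table 4).

```python
SEGMENT_SIZE = 500_000_000  # codes per cluster, from the number segments above
CLUSTER_COUNT = 4


def route_commodity_code(code: str) -> tuple[int, int, int]:
    """Return (cluster, sub_library, sub_table) for a commodity code.

    Assumption: sub-libraries and sub-tables are numbered 1..10.
    """
    value = int(code)
    if not 0 <= value <= SEGMENT_SIZE * CLUSTER_COUNT:
        raise ValueError(f"commodity code out of range: {code}")
    # Segment k covers (k-1)*SEGMENT_SIZE + 1 .. k*SEGMENT_SIZE;
    # code 0 is part of the first segment, hence the max().
    cluster = max((value - 1) // SEGMENT_SIZE + 1, 1)
    sub_library = (value // 10) % 10 + 1  # second-to-last digit, modulo 10
    sub_table = value % 10 + 1            # last digit, modulo 10
    return cluster, sub_library, sub_table
```

Under these assumptions, `route_commodity_code("000000001500000023")` returns `(4, 3, 4)`, matching the example above.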

Original commodity content data synchronization:

The original commodity content data is synchronized in three ways: near-real-time incremental updates, daily incremental updates, and weekly full updates. The daily incremental and weekly full updates both serve fault tolerance.

As shown in FIG. 3, a real-time data replication platform, RDRS (Realtime Data Replication System), may be defined for synchronizing MySQL data to HBase in near real time, and a data exchange platform, IDE, for synchronizing MySQL data to HBase incrementally every day and in full every week:

The RDRS platform synchronizes the commodity content data to HBase in near real time by parsing the binlog of the MySQL database clusters.

Every day, the data exchange platform synchronizes the incremental commodity content data to HBase, and compares it with the near-real-time commodity content data in HBase and corrects any differences.

Every week, the data exchange platform synchronizes the full commodity data to HBase, and compares it with the current commodity content data in HBase and corrects any differences.
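The weekly compare-and-correct step can be sketched as follows. This is a hedged illustration only: dictionaries stand in for the MySQL source and the HBase target, the function name `reconcile` is hypothetical, and deleting target rows that are absent from the source is an assumption that the text does not state.

```python
def reconcile(source: dict, target: dict) -> dict:
    """Make `target` match `source` and report what was corrected.

    Both arguments map commodity code -> content record.
    """
    corrected = {"upserted": [], "deleted": []}
    for code, record in source.items():
        if target.get(code) != record:
            target[code] = record  # missing or stale row in the target
            corrected["upserted"].append(code)
    for code in list(target):
        if code not in source:
            del target[code]  # assumption: rows gone from the source are removed
            corrected["deleted"].append(code)
    return corrected
```

A daily incremental comparison would use only the first loop, since an incremental batch says nothing about rows it does not contain.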

Calculating the quality of the commodity content:

The commodity content quality is mainly determined by seven content dimensions: basic information, parameter information, category information, main image information, title information, selling point information and detail information. Based on expression rules, the Spark program can calculate the sub-libraries in parallel, computing the score in each of these seven dimensions for the commodities of every sub-library, and finally collects all dimension scores and writes them into Hive (a data warehouse tool for Hadoop). Specifically:

First, for each sub-library, the scores for basic information, parameter information, category information, main image information, title information, selling point information and detail information are calculated with MapReduce. Calculating per sub-library mainly reduces data skew and thereby improves calculation efficiency.

Then, the scores for basic information, parameter information, category information, main image information, title information, selling point information and detail information are combined to obtain a total score.
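The per-sub-library scoring described above can be sketched as follows. This is a simplified stand-in rather than the Spark job itself: a thread pool replaces Spark, the scoring rule in `score_dimension` is a hypothetical placeholder for the expression rules, and combining the dimension scores by summation is an assumption, since the text does not say how they are combined.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ("basic", "parameter", "category", "main_image",
              "title", "selling_point", "detail")


def score_dimension(dimension: str, commodity: dict) -> float:
    """Hypothetical rule: 10 points if the dimension's field is non-empty."""
    return 10.0 if commodity.get(dimension) else 0.0


def score_sub_library(commodities: list) -> dict:
    """Score every commodity of one sub-library in all seven dimensions."""
    scores = {}
    for commodity in commodities:
        per_dimension = {d: score_dimension(d, commodity) for d in DIMENSIONS}
        # Assumption: the total is the plain sum of the dimension scores.
        per_dimension["total"] = sum(per_dimension.values())
        scores[commodity["code"]] = per_dimension
    return scores


def score_all(sub_libraries: list) -> dict:
    """Score the sub-libraries in parallel and merge the partial results."""
    merged = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(score_sub_library, sub_libraries):
            merged.update(partial)
    return merged
```

Each sub-library is an independent unit of work, so no task depends on the data of another; this is the property that lets the work be spread evenly.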

The following is a test of computational efficiency for this application and the prior art:

1 million, 10 million and 100 million records to be calculated are inserted into the table used for commodity quality evaluation. The calculation is then carried out with Java+MySQL and with Spark+HBase respectively. The test results are reported in Table 1.

Table 1. Comparison of Spark+HBase and Java+MySQL computational efficiency

Number of records in table	Spark+HBase	Java+MySQL
1 million	30 minutes	8 hours
10 million	2 hours	3 days
100 million	5 hours	30 days

According to the test results, calculation based on the Spark+HBase combination greatly improves calculation efficiency, and performance remains good even as the number of records grows tenfold and a hundredfold.
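The size of the improvement can be read off Table 1 with a little arithmetic. The snippet below converts the reported durations to minutes and divides them; it assumes that "100w" and "1000w" denote 1 million and 10 million records, and the variable names are illustrative.

```python
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24

# (records, Spark+HBase minutes, Java+MySQL minutes), taken from Table 1.
table_1 = [
    ("1 million", 30, 8 * MINUTES_PER_HOUR),
    ("10 million", 2 * MINUTES_PER_HOUR, 3 * HOURS_PER_DAY * MINUTES_PER_HOUR),
    ("100 million", 5 * MINUTES_PER_HOUR, 30 * HOURS_PER_DAY * MINUTES_PER_HOUR),
]

speedups = {records: java / spark for records, spark, java in table_1}
for records, factor in speedups.items():
    print(f"{records}: {factor:.0f}x faster")
```

This prints factors of 16, 36 and 144, so the gap widens as the data volume grows.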

Synchronizing the quality score data of the commodity contents:

All scores are summarized according to the configured query dimensions, such as commodity and merchant, to obtain the corresponding total scores, for example the total score of a given commodity or of a given merchant. Other dimensions are of course possible. The per-dimension commodity content quality scores, the total commodity content quality scores, and the scores summarized by the configured query dimensions are then synchronized to HBase.
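Summarizing by a query dimension can be sketched as a simple group-by. The example below groups commodity totals by merchant code; the function name `totals_by_merchant` and the field names are hypothetical, and using a sum as the summary is an assumption, since the text does not specify the aggregation.

```python
from collections import defaultdict


def totals_by_merchant(rows: list) -> dict:
    """Sum the commodity total scores per merchant code."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["merchant_code"]] += row["total"]
    return dict(totals)
```

The same pattern applies to any other query dimension by changing the grouping key.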

Establishing a query dimension index:

Index data is established according to query dimensions such as the commodity code and the merchant code. The index data comprises keyword fields and the corresponding query dimension identification data, for example the commodity brand and the corresponding commodity code.

The index can be established as part of the commodity content quality score synchronization process: when the commodity content quality score data has been calculated and is synchronized to HBase, the correspondence between the keyword fields in the original commodity data and the query dimension identification data is established, and the total score data summarized by query dimensions such as commodity code and merchant code is synchronized into the index data.

The calculation result data held in Elasticsearch and in HBase, such as the commodity content quality score data, is updated incrementally in both.
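The structure of the index data can be sketched as a mapping from keyword field values to identification data. This is an in-memory illustration, not an Elasticsearch mapping; the function name `build_keyword_index` and the record field names are hypothetical.

```python
from collections import defaultdict


def build_keyword_index(commodities: list, keyword_field: str) -> dict:
    """Map each value of `keyword_field` to query dimension identification data.

    Each entry holds the commodity code and merchant code of a matching commodity.
    """
    index = defaultdict(list)
    for commodity in commodities:
        keyword = commodity.get(keyword_field)
        if keyword is None:
            continue  # nothing to index for this commodity
        index[keyword].append(
            {"commodity_code": commodity["commodity_code"],
             "merchant_code": commodity["merchant_code"]}
        )
    return dict(index)
```

Looking up a brand in the resulting dictionary yields the commodity codes and merchant codes that a later query would pass to the result store.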

Querying the commodity content quality score:

Each type of query condition a user may need has a corresponding query interface and request parameters. A query first goes to Elasticsearch to obtain the commodity codes and merchant codes matching the query conditions, then goes to HBase to fetch the required data for those codes, and finally the data meeting the conditions is integrated, filtered and returned to the user. Specifically:

First, a remote service framework (RSF) is defined to provide the remote query service, and a query service (QueryService) is defined for the querier component to handle merchants' queries. According to the query conditions entered by a merchant, the RSF service is called to carry out the various types of queries iteratively, and the intersection of the results of all sub-query conditions is then taken; each sub-query runs concurrently.
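Running the sub-queries concurrently and intersecting their results can be sketched as follows. This is a minimal sketch under stated assumptions: each sub-query is modeled as a zero-argument callable returning commodity codes, a thread pool stands in for the RSF calls, and the name `intersect_sub_queries` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def intersect_sub_queries(sub_queries: list) -> set:
    """Run zero-argument sub-queries concurrently and intersect their results.

    Each sub-query returns an iterable of commodity codes.
    """
    if not sub_queries:
        return set()
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        results = [set(codes) for codes in pool.map(lambda q: q(), sub_queries)]
    # Only codes that satisfy every sub-query condition survive.
    return set.intersection(*results)
```

Because the intersection is taken only after all sub-queries have returned, the overall latency is bounded by the slowest sub-query rather than by their sum.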

FIG. 4 is a flowchart of the commodity content quality score query process, which includes the following steps:

the client sends a query service request for the commodity quality score;

the query server parses the expression of the commodity quality score query service request sent by the client;

the query server submits the parsed query request to the Elasticsearch cluster (Elasticsearch Cluster); in this embodiment, Elasticsearch is deployed as a cluster to prevent a single point of failure;

the Elasticsearch cluster returns a query result (commodity code + merchant code) to the query server;

the query server submits a query request to the HBase cluster (HBaseCluster) according to the query result returned by the Elasticsearch cluster;

the HBase cluster returns a final query result corresponding to the commodity code and the merchant code to the query server;

and the query server returns a final query result to the client.

The following is a test of the query efficiency of the present application and the prior art:

1 million, 10 million, 100 million and 1 billion records, each with 15 fields, are inserted into separate tables of Elasticsearch and HBase. Queries are then run using Java+MySQL and Elasticsearch+HBase respectively.

The test results are reported in table 2.

Table 2. Comparison of Elasticsearch+HBase and Java+MySQL query efficiency

Number of records in table	Elasticsearch+HBase	Java+MySQL
1 million	125 ms	0.564 s
10 million	140 ms	2.543 s
100 million	162 ms	Timeout exception
1 billion	190 ms	Timeout exception

According to the test results, queries based on the Elasticsearch+HBase combination are much more efficient, and query performance remains good even as the number of records grows tenfold and more.

Example One

As described above, the databases and the Spark calculation program may be replaced by modules with similar functions, and the calculation result may be data other than the commodity content quality score, according to user requirements. Based on this, an embodiment of the present application provides a data processing method which, as shown in FIG. 5, includes the following steps:

S51, storing the original commodity content data in a first relational database in a clustering, database-dividing and table-dividing manner;

S52, establishing index data according to the original commodity content data and storing the index data in an index database; the index data comprises keyword fields and query dimension identification data corresponding to each keyword field;

and S53, calling a calculation program to calculate the original commodity content data to obtain calculation result data, and storing the calculation result data and the query dimension identification data in the first relational database in an associated manner.

Preferably, the method further comprises the following steps:

receiving a query request of a user;

analyzing the query request to obtain a keyword to be queried;

In addition, the method may further include:

In another preferred embodiment, the method further comprises: receiving the original commodity content data and storing it in a second relational database partitioned by cluster, database and table; in particular, the partitioning can be carried out according to the commodity code;

Example Two

Corresponding to the method, the application also provides a data processing platform, wherein the platform comprises a data storage layer and a data calculation layer;

the data storage layer is used for storing the original commodity content data in a first relational database, partitioned by cluster, database and table, and for establishing index data according to the original commodity content data and storing the index data in an index database; the index data comprises keyword fields and query dimension identification data corresponding to each keyword field;

In a preferred embodiment, the data processing platform further includes a data application layer, configured to receive a query request from a user, perform parsing to obtain a keyword to be queried, perform querying in the index database to obtain query dimension identification data corresponding to the keyword to be queried as a target identification, perform querying in the first relational database to obtain calculation result data corresponding to the target identification, so as to return the result data to the user.

In a preferred embodiment, the storage layer is further configured to store at least a portion of the computed result data in association with the query dimension identification data in the index database.

In a preferred embodiment, the storage layer is further configured to receive the original commodity content data, store it in a second relational database partitioned by cluster, database and table, and synchronize the original commodity content data in the second relational database to the first relational database.

Example Three

Corresponding to the above method and platform, a third embodiment of the present application provides a computer system, including:

one or more processors; and

storing the original commodity content data in a first relational database, partitioned by cluster, database and table;

and calculating the original commodity content data through a calculation program to obtain calculation result data, and storing the calculation result data and the query dimension identification data in the first relational database in an associated manner.

Fig. 6 illustrates an architecture of a computer system, which may include, in particular, a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, video display adapter 1511, disk drive 1512, input/output interface 1513, network interface 1514, and memory 1520 may be communicatively coupled via a communication bus 1530.

The processor 1510 may be implemented by a general-purpose CPU (Central processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present Application.

The Memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (random access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like can also be stored. The icon font processing system 1525 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided by the present application is implemented by software or firmware, the relevant program codes are stored in the memory 1520 and called for execution by the processor 1510.

The input/output interface 1513 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The network interface 1514 is used to connect a communication module (not shown) to enable the device to communicatively interact with other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.

In addition, the computer system 1500 may also obtain information of specific extraction conditions from the virtual resource object extraction condition information database 1541 for performing condition judgment, and the like.

It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the devices may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and which includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments, or some parts of the embodiments, of the present application.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The above-described system embodiments are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement the solution without inventive effort.

The data processing method, the data processing platform and the data processing system provided by the present application have been described in detail above. Specific examples are used in the description to explain the principle and the implementation of the application, and the description of the embodiments is only intended to help in understanding the method and the core idea of the application. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be taken as limiting the application.

Claims

1. A method of data processing, the method comprising:

2. The data processing method of claim 1, wherein the method further comprises:

receiving a query request of a user;

analyzing the query request to obtain a keyword to be queried;

3. The data processing method of claim 1, wherein the method further comprises:

4. The data processing method of claim 3,

calling a calculation program to calculate, from the original commodity content data, a content quality score of each commodity in each of at least two content dimensions, and calculating a content quality total score of each commodity according to the content quality score of each dimension;

storing the content quality score of each dimension of each commodity and the content quality total score of each commodity in the first relational database in association with the identification data;

and storing the content quality total score of each commodity in the index database in association with the query dimension identification data.

5. A data processing method as claimed in any one of claims 1 to 4, wherein the identification data is a goods code and/or a merchant code.

6. The data processing method of any of claims 1 to 4, wherein the method further comprises:

7. The data processing method of claim 6, wherein said receiving said original commodity content data and storing it in a second relational database, split across sub-databases and sub-tables, comprises:

8. The data processing method of claim 6,

9. A data processing platform, said platform comprising a data storage layer and a data computation layer;

10. A computer system, comprising:

one or more processors; and

establishing index data according to the original commodity content data and storing the index data in an index database; the index data comprises keyword fields and identification data corresponding to each keyword field;

and calling a calculation program to calculate the original commodity content data to obtain calculation result data, and storing the calculation result data and the identification data in the first relational database in an associated manner.
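For illustration only, the steps recited above (building index data that maps keyword fields to identification data, running a calculation program, and storing the results in association with the identification data) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: an in-memory SQLite database stands in for the first relational database, a Python dictionary stands in for the index database, and the field names, the sample data, and the scoring rule in `quality_score` are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample of original commodity content data (field names are assumptions).
RAW_ITEMS = [
    {"goods_code": "G001", "title": "red cotton shirt", "image_count": 5},
    {"goods_code": "G002", "title": "shirt", "image_count": 0},
]


def build_index(items):
    """Build index data: each keyword maps to the identification data (goods codes)."""
    index = defaultdict(set)
    for item in items:
        for keyword in item["title"].split():
            index[keyword].add(item["goods_code"])
    return index


def quality_score(item):
    """Toy calculation program; the scoring rule is invented for this sketch."""
    title_score = min(len(item["title"].split()), 3) * 10  # at most 30
    image_score = min(item["image_count"], 5) * 10         # at most 50
    return title_score + image_score


def store_results(conn, items):
    """Store calculation results in association with the identification data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS result (goods_code TEXT PRIMARY KEY, score INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO result (goods_code, score) VALUES (?, ?)",
        [(item["goods_code"], quality_score(item)) for item in items],
    )
    conn.commit()


def query(conn, index, keyword):
    """Look up the index first, then fetch results by identification data."""
    codes = sorted(index.get(keyword, ()))
    if not codes:
        return []
    placeholders = ",".join("?" for _ in codes)
    rows = conn.execute(
        "SELECT goods_code, score FROM result "
        f"WHERE goods_code IN ({placeholders}) ORDER BY goods_code",
        codes,
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    index = build_index(RAW_ITEMS)
    store_results(conn, RAW_ITEMS)
    print(query(conn, index, "shirt"))
```

In this sketch the query touches the relational store only by primary key, after the index has narrowed the keyword to a set of goods codes; that mirrors the index-first query order described in the abstract, while the concrete storage engines remain stand-ins.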