CN114116714A

CN114116714A - Big data tag storage method, analysis method and system

Info

Publication number: CN114116714A
Application number: CN202111370093.3A
Authority: CN
Inventors: 邓唯玉; 余毅; 熊纯; 李显锋; 张永强
Original assignee: Wuhan Dayun Data Technology Co ltd
Current assignee: Wuhan Dayun Data Technology Co ltd
Priority date: 2021-11-18
Filing date: 2021-11-18
Publication date: 2022-03-01

Abstract

The application relates to a big data tag storage method, an analysis method, and a system. The storage method comprises: obtaining a target tag, wherein the target tag comprises a target object and a tag name; matching a target subject database from among different preset specific subject databases according to the target object of the target tag; and judging whether the tag name of the target tag exists in the target subject database and, if not, storing the target tag in the target subject database according to a set data storage structure. With this method and system, big data tags can be stored quickly by category, which increases data storage speed and reduces data congestion.

Description

Big data tag storage method, analysis method and system

Technical Field

The application relates to the field of big data storage and analysis, and in particular to a big data tag storage method, a big data tag analysis method, and corresponding systems.

Background

In daily work, subjects such as persons, objects, vehicles, and cases need data tags that carry subject characteristics so that they can be searched quickly and their tag data analyzed.

Storing tag data in a relational database currently presents several problems. The number of tagged subjects reaches the hundreds of millions, so a relational database cannot guarantee the speed of inserts, updates, and queries, and cannot meet the requirements of frequent, rapid updates and millisecond-level retrieval in tag data production. The data tags of each subject are different and the tag columns are dynamic, so a relational database can only enumerate every possible column and create a table with an extremely large number of fields.

As the target service expands, the number of data tags keeps growing. Because the columns of a relational database cannot be extended dynamically, the original table structure must be modified to add each new tag column, so the maintenance cost of the database is high. When a tag is searched, its source information and its number of occurrences are needed for tag tracing, but the existing tag data storage structure cannot accommodate new columns for these two pieces of information, so the tag tracing requirement has never been met. Because many tag columns in a single record are NULL and tag searches use fuzzy matching, the relational database indexes become ineffective during retrieval and searches become very slow. Because each data tag is a separate column, a keyword search must traverse hundreds of tag columns, so fast full-text search cannot be achieved.

Disclosure of Invention

In view of this, the present application provides a big data tag storage method, an analysis method, and a system, so as to solve the technical problem that the query service is blocked due to too slow speed or low efficiency in storing big data tags in the existing database.

In order to solve the above problem, in a first aspect, the present application provides a big data tag storage method, including:

acquiring a target label, wherein the target label comprises a target object and a label name;

matching a target main body database from preset different specific main body databases according to the target object of the target label;

and judging whether the label name of the target label exists in the target main body database, if not, storing the target label in the target main body database according to a set data storage structure.

Optionally, the target tag further comprises a tag source; storing the target tag in the target subject database according to a set data storage structure, including:

storing a target object and a tag name of a target tag in an array form;

setting a label code for the label name of the target label;

creating a label field and a source field for the label code of a target label, wherein the source field is embedded in the label field;

and correspondingly storing the label name and the label code of the target label in the label field, and storing the label source of the target label in the source field.

Optionally, the storing the target tag in the target subject database according to a set data storage structure further includes:

setting a retrieval mode, comprising: when the label name and the label code of the target label are searched simultaneously, a word segmentation index mode is adopted; when the label name or the label code of the target label is independently searched, a non-word-segmentation index mode is adopted;

recording the occurrence frequency of the label source of the target label as 1;

the number of occurrences of the tag source of the target tag is stored in the source field.

Optionally, after the target tag is stored in the target subject database according to the set data storage structure, the method further includes:

if a first deleting instruction about the target tag is acquired, wherein the first deleting instruction comprises a source of the tag to be deleted, deleting the source of the tag to be deleted corresponding to the target tag from a target main body database;

judging whether other label sources exist in the target label after deletion, and if so, keeping the label code and the label name of the target label; if not, deleting the label code and the label name of the target label;

and if a second deleting instruction about the target label is acquired, wherein the second deleting instruction comprises a label code to be deleted, deleting the label code to be deleted, the label name and the label source corresponding to the target label from the target main body database.

In a second aspect, the present application provides a big data tag analysis method, including:

executing the big data label storage method;

acquiring data to be queried, wherein the data to be queried comprises a plurality of label names to be queried and dimension parameters corresponding to the same object to be queried;

determining a main database to be queried corresponding to the object to be queried;

constructing a data analysis script according to the plurality of label names to be inquired and the dimension parameters;

and querying in the main database to be queried by using the data analysis script to obtain a data analysis result.

Optionally, constructing a data analysis script according to the multiple to-be-queried tag names and the dimension parameters corresponding to the to-be-queried object, where the constructing includes:

determining a query logic relation of a plurality of label names to be queried according to the hit requirement condition of the plurality of label names to be queried;

constructing a data query grammar according to the inner and outer layer logic sequence of a main database to be queried;

constructing an aggregation grammar according to the dimension parameters;

and constructing a data analysis script according to the query logic relation, the data query grammar, the aggregation grammar and a preset statistical algorithm of the plurality of the label names to be queried.

Optionally, the query logic relationship of the tag names to be queried is determined according to the hit requirements of the plurality of tag names to be queried, where the query logic relationship at least includes AND/OR logical relationships.

In a third aspect, the present application provides a big data tag storage system, the system comprising:

an acquisition data module, used for acquiring a target tag, wherein the target tag comprises a target object and a tag name;

the matching module is used for matching a target main body database from preset different specific main body databases according to the target object of the target label;

and the storage module is used for judging whether the label name of the target label exists in the target main body database, and if not, storing the target label in the target main body database according to a set data storage structure.

In a fourth aspect, the present application provides a big data tag analysis system, the system comprising:

an acquisition data module, used for acquiring data to be queried, wherein the data to be queried comprises a plurality of tag names to be queried and dimension parameters corresponding to the same object to be queried;

the determining database module is used for determining a main database to be queried corresponding to the object to be queried;

the script construction module is used for constructing a data analysis script according to the names of the labels to be inquired and the dimension parameters;

and the analysis module is used for inquiring in the main database to be inquired by using the data analysis script to obtain a data analysis result.

The beneficial effects of the above embodiment are as follows. The target tag is obtained, and the target subject database is matched according to the target object of the target tag, so that the target tag can be conveniently stored in the target subject database. It is then judged whether the tag name of the target tag exists in the target subject database; if not, the target tag is added to the target subject database according to a set storage structure. The target tag is thus stored in the corresponding target subject database, which achieves rapid classified storage of big data tags, improves data storage speed, and reduces data congestion.

Drawings

FIG. 1 is a flowchart of a method of one embodiment of a big data tag storage method provided herein;

FIG. 2 is a flowchart of a method of one embodiment of step S103 of the big data tag storage method provided in the present application;

FIG. 3 is a schematic diagram of a data storage structure script provided herein;

FIG. 4 is a flowchart of another embodiment of a method for storing big data tags, step S103;

FIG. 5 is a flowchart of a method of one embodiment of a big data tag analysis method provided herein;

FIG. 6 is a flowchart of a method of one embodiment of step S503 of the big data tag analysis method provided herein;

FIG. 7 is a schematic diagram of a data analysis script provided herein;

FIG. 8 is a flow chart of a method of data analysis provided herein;

FIG. 9 is a functional block diagram of an embodiment of a big data tag storage system provided herein;

FIG. 10 is a functional block diagram of an embodiment of a big data tag analytics system as provided herein.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the application and together with the description, serve to explain the principles of the application and not to limit the scope of the application.

In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Referring to FIG. 1, which is a flowchart of an embodiment of the big data tag storage method provided in the present application, the big data tag storage method includes the following steps:

S101, obtaining a target tag, wherein the target tag comprises a target object and a tag name;

S102, matching a target subject database from among different preset specific subject databases according to the target object of the target tag;

S103, judging whether the tag name of the target tag exists in the target subject database, and if not, storing the target tag in the target subject database according to a set data storage structure.
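Steps S101 to S103 can be illustrated with a minimal sketch. It uses in-memory dictionaries in place of the preset subject databases, and the names `TargetTag`, `store_tag`, `target_type`, and `target_id` are hypothetical and introduced only for illustration; the sketch assumes the existence check is made per target object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetTag:
    target_type: str  # kind of subject, e.g. "person" or "vehicle" (assumed values)
    target_id: str    # identifier of the tagged subject (assumed field)
    tag_name: str     # the tag attached to the subject


def store_tag(tag, databases):
    """Store a tag in its subject database; return True only if it was newly stored.

    `databases` is a stand-in for the preset subject databases:
    {target_type: {target_id: set of tag names}}.
    """
    # S102: match the target subject database by the tag's target object.
    database = databases[tag.target_type]
    # S103: store the tag only when its name is not already present.
    names = database.setdefault(tag.target_id, set())
    if tag.tag_name in names:
        return False
    names.add(tag.tag_name)
    return True
```

A tag for an unknown `target_type` raises `KeyError` in this sketch; how an unmatched target object should be handled is not stated in the text.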

In this embodiment, the specific subject databases are created using Elasticsearch, and the different types of specific subject database include a personnel information database, a vehicle information database, a case information database, an object information database, and the like. The target object refers to the data subject being tagged, such as a person, a vehicle, or a case. The tag name refers to the name of a tag attached to a target object; for example, a given person may be tagged as a person involved in a case, a person involved in an alarm, or a self-employed business owner. In this embodiment, one target tag includes the target object to be tagged, one tag name, and the corresponding tag source. Storing the data in Elasticsearch addresses the storage of hundreds of millions of records: storage is sharded, so it can scale horizontally and handle large data volumes without strain. Each shard is a Lucene instance; when data is searched, every shard is searched and the results are then merged, which improves search efficiency. Even while massive amounts of data are being written, near-real-time search is achieved, and reads and writes do not interfere with each other.

In this embodiment, the target tag is obtained, and the target subject database is matched according to the target object of the target tag, so that the target tag can be conveniently stored in the target subject database. It is then judged whether the tag name of the target tag exists in the target subject database; if not, the target tag is added to the target subject database according to a set storage structure. The target tag is thus stored in the corresponding target subject database, which achieves rapid classified storage of big data tags, improves data storage speed, and reduces data congestion.

In this embodiment, the data storage structure is created by using a table structure, an array, a field, and a word segmentation and/or non-word-segmentation retrieval mode. Referring to FIG. 2, in step S103, storing the target tag in the target subject database according to the set data storage structure includes:

S201, storing the target object and the tag name of the target tag in an array form;

S202, setting a tag code for the tag name of the target tag;

S203, creating a tag field and a source field for the tag code of the target tag, wherein the source field is embedded in the tag field;

S204, correspondingly storing the tag name and the tag code of the target tag in the tag field, and storing the tag source of the target tag in the source field.

In a specific embodiment, referring to the data storage structure script shown in FIG. 3, the same target object may have multiple target tags. All target tags are stored in one tag field, which is an array in which each element is one target tag. Storing a target tag therefore requires only one tag field and one source field, so the index rules configured when the fields are created do not need to be modified later, which greatly reduces the maintenance cost of the table structure.
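The script of FIG. 3 is not reproduced in this text. As a rough sketch of what such a structure could look like, the following expresses an Elasticsearch mapping as a Python dictionary, with a nested tag array and a nested source array inside each tag. Only `tag_code` is named in the text; `subject_id`, `tags`, `tag_name`, `sources`, `source`, and `count` are assumed names.

```python
import json

# Assumed mapping: one nested "tags" array per subject, and one nested
# "sources" array embedded in each tag (field names are illustrative).
TAG_INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "subject_id": {"type": "keyword"},
            "tags": {
                "type": "nested",
                "properties": {
                    "tag_code": {"type": "keyword"},
                    "tag_name": {"type": "keyword"},
                    "sources": {
                        "type": "nested",
                        "properties": {
                            "source": {"type": "keyword"},
                            "count": {"type": "integer"},
                        },
                    },
                },
            },
        }
    }
}

# One subject document holding one tag that came from two sources.
EXAMPLE_DOCUMENT = {
    "subject_id": "person-0001",
    "tags": [
        {
            "tag_code": "T001",
            "tag_name": "person involved in a case",
            "sources": [
                {"source": "rule-17", "count": 2},
                {"source": "rule-42", "count": 1},
            ],
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(TAG_INDEX_MAPPING, indent=2))
```

Because a new tag is just another element of the `tags` array, adding a tag in this layout does not require a mapping change, which matches the maintenance argument above.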

Referring to FIG. 4, in step S103, storing the target tag in the target subject database according to the set data storage structure further includes:

S401, setting a retrieval mode, comprising: when the tag name and the tag code of the target tag are searched together, a word segmentation index mode is adopted; when the tag name or the tag code of the target tag is searched on its own, a non-word-segmentation index mode is adopted;

S402, recording the number of occurrences of the tag source of the target tag as 1;

and S403, storing the number of occurrences of the tag source of the target tag in the source field.

When the target tag is stored, both its tag code and its tag name are stored. The tag code serves as the key field for insert and update operations, and because it is indexed without word segmentation, it supports exact tag lookup and can be used as a data analysis parameter.

The tag name and the tag code are also stored together in a field indexed with general word segmentation, which enables full-text search of the target tag. Full-text search also matches users' habit of searching by tag name, and its response time is only a few milliseconds.
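The two retrieval modes can be sketched as two query builders that return Elasticsearch query DSL as Python dictionaries. The sketch assumes a nested `tags` field in which `tag_code` is a `keyword` (non-segmented) field and a hypothetical `tag_text` field of type `text` holds the tag code and tag name together; `tags` and `tag_text` are assumed names, not taken from the text.

```python
# Assumed field definitions inside the nested "tags" object.
TAG_FIELDS = {
    "tag_code": {"type": "keyword"},  # non-segmented: exact match only
    "tag_name": {"type": "keyword"},  # non-segmented: exact match only
    "tag_text": {"type": "text"},     # segmented: code and name combined
}


def exact_tag_query(tag_code):
    """Exact lookup on the non-segmented tag code (term query)."""
    return {
        "query": {
            "nested": {
                "path": "tags",
                "query": {"term": {"tags.tag_code": tag_code}},
            }
        }
    }


def full_text_tag_query(keywords):
    """Full-text lookup on the segmented combined field (match query)."""
    return {
        "query": {
            "nested": {
                "path": "tags",
                "query": {"match": {"tags.tag_text": keywords}},
            }
        }
    }
```

The `term` query compares the whole stored value, which suits tag codes used as analysis parameters, while the `match` query analyzes the input and suits keyword searches by tag name.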

In addition, a source field is embedded in the tag field, so that each target tag in each record can be traced to its source. The same target tag may have multiple sources, so multiple sources may need to be stored with it; the source information is therefore also stored in an array structure. The same tag of the same target object may appear multiple times, and the number of occurrences differs by source, so the occurrence count is placed in the tag source information.

In this embodiment, when the target tag is inserted into or updated in the target subject database, the insert/update logic is implemented with a Painless script provided by Elasticsearch. Inserting and updating tag data in Elasticsearch is faster than in a relational database: inserting or updating tens of millions of tag records takes tens of minutes, search performance is not affected during the process, and data search is almost real-time.

It should be noted that, because the data storage structure for the target tag is nested, the number of tag occurrences must be calculated at storage time. If the target object does not yet have the tag name being written, the target object is updated directly: the new tag information and the corresponding source information are added, and the occurrence count is recorded as 1. If the target object already has the tag name being written, it is judged whether the current source exists for that tag name. If the same source information does not exist, a new source is added and its occurrence count is recorded as 1; if the source information already exists, the occurrence count of that source is incremented by 1.
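The counting rules above can be sketched in plain Python operating on one subject record. This is only an illustration of the branching logic, not the Painless script used in the embodiment; the function name `upsert_tag` and the field names `tags`, `sources`, `source`, and `count` are assumptions.

```python
def upsert_tag(subject, tag_code, tag_name, source):
    """Insert a tag or update its per-source occurrence count, in place."""
    tags = subject.setdefault("tags", [])
    for tag in tags:
        if tag["tag_code"] == tag_code:
            for entry in tag["sources"]:
                if entry["source"] == source:
                    # Tag and source both exist: increment the count.
                    entry["count"] += 1
                    return subject
            # Tag exists but this source is new: add it with count 1.
            tag["sources"].append({"source": source, "count": 1})
            return subject
    # Tag does not exist yet: add the tag and its first source with count 1.
    tags.append(
        {
            "tag_code": tag_code,
            "tag_name": tag_name,
            "sources": [{"source": source, "count": 1}],
        }
    )
    return subject
```

For example, writing the same tag twice from one source and once from another leaves a single tag with two source entries, counted 2 and 1 respectively.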

Optionally, the method for storing the big data tag according to this embodiment further includes:

if a first deletion instruction about the target tag is obtained, where the first deletion instruction includes a tag source to be deleted, deleting the tag source to be deleted corresponding to the target tag from the target subject database;

judging whether the target tag still has other tag sources after the deletion; if so, keeping the tag code and the tag name of the target tag; if not, deleting the tag code and the tag name of the target tag;

optionally, if a second deletion instruction about the target tag is obtained, where the second deletion instruction includes a tag code to be deleted, deleting the tag code to be deleted, the tag name, and the tag source corresponding to the target tag from the target subject database.
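The two deletion paths can be sketched in plain Python on one subject record. The function names `delete_tag_source` and `delete_tag_code` and the field names are assumptions made for illustration; the embodiment itself performs the clearing with a Painless script.

```python
def delete_tag_source(subject, tag_code, source):
    """First deletion instruction: remove one source from a tag.

    The tag code and tag name are kept while other sources remain and
    are removed together with the last source.
    """
    remaining = []
    for tag in subject.get("tags", []):
        if tag["tag_code"] == tag_code:
            tag["sources"] = [s for s in tag["sources"] if s["source"] != source]
            if not tag["sources"]:
                continue  # no source left: drop the whole tag
        remaining.append(tag)
    subject["tags"] = remaining
    return subject


def delete_tag_code(subject, tag_code):
    """Second deletion instruction: remove the tag with all of its sources."""
    subject["tags"] = [
        tag for tag in subject.get("tags", []) if tag["tag_code"] != tag_code
    ]
    return subject
```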

In this embodiment, clearing tag data is likewise performed as an update, implemented with a Painless script.

It should be noted that this way of clearing tags has the following advantages over the Update By Query method in Elasticsearch. Update By Query is similar to a "set … where …" statement in a relational database. Updating a large volume of data inevitably times out, which stops the update; Update By Query takes a snapshot of the data when it starts, so when several tag clearing operations run in parallel, version conflicts arise and cause the data update to fail; and a failed query also stops the data update. The tag clearing logic of this embodiment separates the query from the update. It first queries the set of primary keys of the records whose tags are to be cleared, according to the tag code and the rule ID, which can be completed within hundreds of milliseconds, and then executes the tag clearing script against that set of primary keys. In this way, tens of millions of tag records can be cleared quickly, mid-run stops caused by errors are unlikely, and the drawbacks of Update By Query are avoided.
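The separation of query and update can be sketched with an in-memory store standing in for the subject database. This is a simplified illustration under stated assumptions: the rule ID is assumed to identify the tag source, the batch size is arbitrary, and all function and field names are hypothetical. A real deployment would issue a search followed by scripted updates rather than loop over a dictionary.

```python
def find_ids_to_clear(store, tag_code, rule_id):
    """Phase 1: collect the primary keys of records carrying the tag from this rule."""
    ids = []
    for doc_id, doc in store.items():
        for tag in doc.get("tags", []):
            if tag["tag_code"] == tag_code and any(
                s["source"] == rule_id for s in tag["sources"]
            ):
                ids.append(doc_id)
                break
    return ids


def clear_tag(doc, tag_code, rule_id):
    """Remove one source from one tag; drop the tag when no source remains."""
    kept = []
    for tag in doc.get("tags", []):
        if tag["tag_code"] == tag_code:
            tag["sources"] = [s for s in tag["sources"] if s["source"] != rule_id]
            if not tag["sources"]:
                continue
        kept.append(tag)
    doc["tags"] = kept


def clear_in_batches(store, ids, tag_code, rule_id, batch_size=1000):
    """Phase 2: apply the clearing logic to the collected keys, batch by batch."""
    cleared = 0
    for start in range(0, len(ids), batch_size):
        for doc_id in ids[start:start + batch_size]:
            doc = store.get(doc_id)
            if doc is not None:  # a record may have disappeared since phase 1
                clear_tag(doc, tag_code, rule_id)
                cleared += 1
    return cleared
```

Because phase 2 addresses records by primary key, a failure in one batch does not invalidate the key set, and the remaining batches can still be processed or retried.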

Referring to FIG. 5, this embodiment further discloses a big data tag analysis method, which includes executing the big data tag storage method described above and the following steps:

S501, acquiring data to be queried, wherein the data to be queried comprises a plurality of tag names to be queried and dimension parameters corresponding to the same object to be queried;

S502, determining a subject database to be queried corresponding to the object to be queried;

S503, constructing a data analysis script according to the plurality of tag names to be queried and the dimension parameters;

and S504, querying the subject database to be queried by using the data analysis script to obtain a data analysis result.

In this embodiment, the data storage structure of the target tag makes tag data analysis much simpler. Regardless of how many data analysis parameters are passed in or how the tag names are combined, only the tag code field tag_code needs to be queried; the overall structure of the data analysis script does not change, and only different parameter values need to be passed in.

In an embodiment, referring to FIG. 6, in step S503, constructing a data analysis script according to the plurality of tag names to be queried and the dimension parameters includes:

S601, determining a query logic relationship of the plurality of tag names to be queried according to the hit requirements of the plurality of tag names to be queried, wherein the query logic relationship at least includes AND/OR logical relationships;

S602, constructing a data query grammar according to the inner and outer layer logic sequence of the subject database to be queried;

S603, constructing an aggregation grammar according to the dimension parameters;

S604, constructing a data analysis script according to the query logic relationship of the plurality of tag names to be queried, the data query grammar, the aggregation grammar, and a preset statistical algorithm.
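Steps S601 to S604 can be sketched as a function that assembles an Elasticsearch request body as a Python dictionary. The script of FIG. 7 is not reproduced in this text, so this is only one plausible shape: it assumes a nested `tags` field with a `tag_code` keyword sub-field, and the function name, the `logic` parameter, and the aggregation names are hypothetical.

```python
def build_analysis_query(tag_codes, dimension_codes, logic="or"):
    """Build a query body: tag conditions plus a grouped count per dimension code."""
    # S601/S602: one nested exact-match clause per tag code, combined by AND or OR.
    clauses = [
        {"nested": {"path": "tags", "query": {"term": {"tags.tag_code": code}}}}
        for code in tag_codes
    ]
    if logic == "or":
        bool_query = {"should": clauses, "minimum_should_match": 1}
    elif logic == "and":
        bool_query = {"must": clauses}
    else:
        raise ValueError("logic must be 'and' or 'or'")

    # S603: group the matching subjects by the dimension tag codes.
    aggregation = {
        "tags": {
            "nested": {"path": "tags"},
            "aggs": {
                "dimension": {
                    "filter": {"terms": {"tags.tag_code": list(dimension_codes)}},
                    "aggs": {
                        "by_code": {
                            "terms": {
                                "field": "tags.tag_code",
                                "size": max(len(dimension_codes), 1),
                            }
                        }
                    },
                }
            },
        }
    }

    # S604: assemble the script; "size": 0 returns only aggregation results.
    return {"size": 0, "query": {"bool": bool_query}, "aggs": aggregation}
```

Changing the set of tags or the dimension only changes the parameter values passed in; the structure of the returned body stays the same, which is the property the embodiment relies on.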

Referring to the data analysis script shown in FIG. 7, the script returns the analysis result within hundreds of milliseconds, which is hundreds of times faster than the same analysis using a relational database.

In a specific embodiment, the data analysis method is as shown in FIG. 8. The key to the method lies in dynamically constructing the logical relationship of the data analysis parameters and the dimension of the analysis. For example, during an epidemic, the occupational distribution of people with different case types is analyzed to obtain a distribution map of the infected population, and the groups at high risk of infection are identified from the infection proportions of the different groups.

There are five tag names to be queried in this example: 'confirmed case', 'suspected case', 'positive detection', 'other case', and 'close contact person'. All five tag names must be hit exactly, so tag_code is searched without word segmentation, the logical relationship is set to 'or', and the union of the query conditions is taken. The dimension parameter is the tag_code of each occupation: grouped statistics over dozens of occupations give the number of people in each group who satisfy the query condition, and the proportion for each occupation is then calculated.
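The grouping and proportion step of this example can be sketched in plain Python over in-memory records. The tag codes, the `tag_codes` field, and the function name are invented for illustration, and the proportion is assumed here to be each occupation's share of all matching people; the text does not specify the exact denominator.

```python
from collections import Counter


def occupation_distribution(people, case_codes, occupation_codes):
    """Count people hitting any case tag, grouped by occupation tag, with shares."""
    case_codes = set(case_codes)
    occupation_codes = set(occupation_codes)
    counts = Counter()
    for person in people:
        codes = set(person["tag_codes"])
        if codes & case_codes:  # "or" logic: any one case tag is a hit
            for occupation in codes & occupation_codes:
                counts[occupation] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Map each occupation code to (count, share of all matching people).
    return {code: (n, n / total) for code, n in counts.items()}


if __name__ == "__main__":
    sample = [
        {"tag_codes": ["CASE_CONFIRMED", "OCC_DRIVER"]},
        {"tag_codes": ["CASE_SUSPECTED", "OCC_DRIVER"]},
        {"tag_codes": ["CASE_CONFIRMED", "OCC_NURSE"]},
        {"tag_codes": ["OCC_NURSE"]},  # no case tag: not counted
    ]
    print(occupation_distribution(
        sample,
        ["CASE_CONFIRMED", "CASE_SUSPECTED"],
        ["OCC_DRIVER", "OCC_NURSE"],
    ))
```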

Different from the prior art, the embodiment greatly reduces the maintenance cost of the data table by optimizing the data storage structure, improves the writing performance and the searching performance of the target tag, realizes full-text search and rapid data analysis of the target tag, makes up for the deficiency of tag source information records, provides a uniform tag data analysis method, saves a large amount of development work, and reduces the working cost. And the requirements of users on full-text retrieval, millisecond-level response, data analysis and source tracing in the use of the label names are met.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

This embodiment also provides a big data tag storage system, which corresponds one-to-one to the big data tag storage method in the above embodiment. As shown in FIG. 9, the big data tag storage system includes an obtain data module 901, a matching module 902, and a storage module 903. The functional modules are explained in detail as follows:

an obtain data module 901, configured to obtain a target tag, where the target tag includes a target object and a tag name;

a matching module 902, configured to match a target subject database from preset different specific subject databases according to a target object of a target tag;

and the storage module 903 is configured to determine whether the tag name of the target tag exists in the target subject database, and if not, store the target tag in the target subject database according to a set data storage structure.

For specific limitations of each module of the big data tag storage system, reference may be made to the above limitations on the big data tag storage method, which are not repeated here. The modules in the big data tag storage system described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

This embodiment also provides a big data tag analysis system, which corresponds one-to-one to the big data tag analysis method in the above embodiment. As shown in FIG. 10, the big data tag analysis system includes an obtain data module 1001, a determine database module 1002, a script construction module 1003, and an analysis module 1004. The functional modules are explained in detail as follows:

the data acquiring module 1001 is configured to acquire data to be queried, where the data to be queried includes multiple tag names and dimension parameters to be queried corresponding to the same object to be queried;

a determining database module 1002, configured to determine a main database to be queried corresponding to an object to be queried;

the script construction module 1003 is used for constructing a data analysis script according to the multiple to-be-queried tag names and the dimension parameters;

the analysis module 1004 is configured to query the main database to be queried by using the data analysis script, and obtain a data analysis result.

For specific limitations of each module of the big data tag analysis system, reference may be made to the above limitations on the big data tag analysis method, which are not repeated here. The modules in the big data tag analysis system may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above.

Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application.

Claims

1. A big data tag storage method, the method comprising:

2. The big data tag storage method of claim 1, wherein the target tag further comprises a tag source; storing the target tag in the target subject database according to a set data storage structure, including:

storing a target object and a tag name of a target tag in an array form;

setting a label code for the label name of the target label;

3. The big data tag storage method according to claim 2, wherein the storing the target tag in the target subject database according to the set data storage structure further comprises:

4. The big data tag storage method according to claim 3, wherein after storing the target tag in the target subject database according to the set data storage structure, the method further comprises:

judging whether other label sources exist in the target label after deletion, and if so, keeping the label code and the label name of the target label; and if not, deleting the label code and the label name of the target label.

5. The big data tag storage method according to claim 3, wherein after storing the target tag in the target subject database according to the set data storage structure, the method further comprises:

6. A big data label analysis method is characterized by comprising the following steps:

executing the big data tag storage method of any of claims 1-5;

7. The big data tag analysis method according to claim 6, wherein constructing a data analysis script according to a plurality of to-be-queried tag names and dimension parameters corresponding to the to-be-queried object comprises:

constructing an aggregation grammar according to the dimension parameters;

8. The big data tag analysis method according to claim 7, wherein the query logical relationship of the tag names to be queried is determined according to hit requirements of the plurality of tag names to be queried, wherein the query logical relationship at least comprises AND/OR logical relationships.

9. A big data tag storage system, the system comprising:

10. A big data tag analytics system, the system comprising: