CN116684437A - Distributed data management method based on natural language analysis - Google Patents

Distributed data management method based on natural language analysis

Info

Publication number
CN116684437A
Authority
CN
China
Prior art keywords
data
information
calculation module
information calculation
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310976377.XA
Other languages
Chinese (zh)
Other versions
CN116684437B (en)
Inventor
张玉磊
梅雪明
丁皓
张敬超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Liangjie Data Technology Co ltd
Original Assignee
Jiangsu Liangjie Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Liangjie Data Technology Co ltd filed Critical Jiangsu Liangjie Data Technology Co ltd
Priority to CN202310976377.XA
Publication of CN116684437A
Application granted
Publication of CN116684437B
Legal status: Active (granted)


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/2866 Architectures; Arrangements
    • H04L67/288 Distributed intermediate devices, i.e. intermediate devices for interaction with other intermediate devices on the same level
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed data management method based on natural language analysis, comprising: in response to an instruction to gather data information, obtaining the data information from a data storage device; calculating the data information by using a second information calculation module to obtain intermediate data and sending the intermediate data to a data exchange device; sending the intermediate data in the data exchange device to a server; and calculating, by the server, the intermediate data by using a first information calculation module to obtain target data, wherein the second information calculation module is obtained from the first information calculation module through natural language analysis. When the volume of data the server must obtain from the data storage devices is very large, the technical solution of the present application greatly reduces the transmission of data information, saves time, and avoids the security risks, such as data leakage, caused by transmitting a large amount of data in segments.

Description

Distributed data management method based on natural language analysis
Technical Field
The application belongs to the field of distributed data management, and particularly relates to a distributed data management method based on natural language analysis.
Background
In the prior art, massive data are often distributed across different data storage devices; when a server needs to access the data, the data are typically read from the different data storage devices through a data exchange device and gathered at the server. However, the data storage devices themselves may be dispersed across different cities in different countries. Therefore, when data are acquired from the data storage devices through the data exchange device, the huge data volume forces the data to be transmitted in segments, which makes transmission slow and leaves the data vulnerable to attack, causing data leakage.
Disclosure of Invention
In order to overcome the defects in the prior art, the present application provides a distributed data management method based on natural language analysis.
The application adopts the following technical scheme.
The first aspect of the application discloses a distributed data management method based on natural language analysis, which comprises the following steps:
step S1, responding to an instruction for collecting data information, and acquiring the data information from a data storage device;
step S2, calculating the data information by using a second information calculation module to obtain intermediate data and sending the intermediate data to the data exchange equipment;
step S3, intermediate data in the data exchange equipment are sent to a server;
step S4, the server calculates the intermediate data by using a first information calculation module to obtain target data; the second information calculation module is obtained by the first information calculation module through natural language analysis.
A second aspect of the present application discloses a distributed data management system based on natural language analysis for performing the method of the first aspect, comprising: a plurality of data storage devices, a data exchange device, and a server;
the data information is stored in a distributed manner across the plurality of data storage devices; the data storage devices obtain the data information in response to the instruction for gathering the data information and calculate the data information by using the second information calculation module, thereby obtaining intermediate data, which is sent to the data exchange device;
the data exchange equipment is used for sending the intermediate data to the server;
the server side is used for sending an instruction for collecting data information, and calculating the intermediate data by using the first information calculation module to obtain target data; the second information calculation module is obtained by the first information calculation module through natural language analysis.
A third aspect of the present application discloses a terminal comprising a processor and a storage medium, characterized in that:
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of the first aspect.
A fourth aspect of the present application discloses a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to the first aspect.
Compared with the prior art, the application has the following advantages:
when the volume of data the server must obtain from the data storage devices is very large, the technical solution of the present application greatly reduces the transmission of data information, saves time, and avoids the security risks, such as data leakage, caused by transmitting a large amount of data in segments.
Drawings
FIG. 1 is a schematic diagram of a distributed data management system.
FIG. 2A is a schematic diagram of a distributed data management system based on natural language analysis, according to an embodiment of the present application.
FIG. 2B is a schematic diagram of a first logic module and a second logic module according to an embodiment of the application.
Fig. 3A is a code schematic diagram of a machine learning algorithm of an embodiment of the present application.
FIG. 3B is a code schematic diagram of another machine learning algorithm according to an embodiment of the present application.
Fig. 3C is a code schematic diagram after the first cutting module adapts to the interface according to an embodiment of the present application.
FIG. 4A is a schematic diagram of a communication method between a first logic module and a second logic module according to an embodiment of the present application.
Fig. 4B is a schematic diagram of a communication method between the first and second logic modules according to an embodiment of the present application based on the scenario of fig. 3B.
FIG. 5 is a flow chart of a distributed data management method based on natural language analysis according to an embodiment of the present application.
Detailed Description
The application is further described below with reference to the accompanying drawings. The following examples are only intended to illustrate the technical solutions of the present application more clearly and are not intended to limit its scope of protection.
A distributed data management system is shown in FIG. 1. The distributed data management system may include a plurality of data storage devices, at least one data exchange device, and a server.
The data storage device is used for storing data and generally consists mainly of magnetic disks, on which massive data are stored in the form of a database. The data storage device should also include a first interface for external access to the above-mentioned massive data and for storing it.
The data exchange device may be integrated inside the server, but it is usually deployed separately from the server so as to relieve the server's load. The data exchange device comprises at least a second interface and a third interface. It can be understood that the second interface is communicatively connected to the first interface of the data storage device, and the third interface is communicatively connected to the logic module of the server. The switching end of the data exchange device is also generally responsible for the secure transmission of data.
It can be understood that the first interface, the second interface, and the third interface are essentially also logic modules whose internal logic can be implemented in code. They are named separately only because the functions they bear are limited to serving as external interfaces.
In one general scenario of the present application, for example, the year-round wind power information of the cities of Guangzhou, Chengdu, and Harbin needs to be collected, and the wind power information with the highest wind power value then needs to be obtained. With reference to FIG. 1, it is not difficult to see that this process may include steps 1 to 3.
Step 1, data information is acquired from a data storage device in response to an instruction for gathering the data information.
In the general scenario, the data information corresponds to the wind power information. It will be appreciated that the instruction for gathering data information is typically sent by the server to the data storage devices via the data exchange device.
Step 2, the data information is sent to the server through the data exchange device.
Step 3, the server performs calculation according to the data information to obtain the target data.
In the general scenario, the server performs the calculation according to the data information by sorting the wind power information by its wind power value, and the target data corresponds to the wind power information with the highest wind power value.
The drawbacks of the above steps are readily apparent from the Background. In step 2, massive data information must be transmitted from the data storage devices to the data exchange device and then forwarded to the server. In addition, since the data storage devices are distributed across different cities, the transmission from a data storage device to the data exchange device may pass through many routing nodes, which puts no small pressure on communication transmission, and this transmission mode also makes data leakage very likely.
In view of the above, the present application discloses a distributed data management system based on natural language analysis, as shown in FIG. 2A. In contrast to FIG. 1, each data storage device further includes a second logic module, and the second logic module includes the first interface.
To illustrate the second logic module in more detail, FIG. 2B subdivides the first logic module and the second logic module. The first logic module includes a fourth interface, a first information calculation module, and other modules, where the first information calculation module is configured to handle the calculation process described in step 3 above, and the other modules are configured to handle other processes, for example, issuing the instruction to gather data information. The second information calculation module is obtained from the first information calculation module through natural language analysis.
Accordingly, referring to fig. 2A, it is not difficult to generalize the above process to include steps S1 to S4.
Step S1, data information is acquired from a data storage device in response to an instruction for gathering the data information.
And S2, calculating the data information by using a second information calculation module to obtain intermediate data and sending the intermediate data to the data exchange equipment.
And step S3, transmitting the intermediate data in the data exchange equipment to the server.
Step S4, the server calculates the intermediate data by using a first information calculation module to obtain target data; the second information calculation module is obtained by the first information calculation module through natural language analysis.
It will be appreciated that the above communication process may be as shown in FIG. 4A. It should be noted that the data exchange device is not shown in FIG. 4A, but it should be understood that when the data storage devices send the intermediate data to the server, the intermediate data must be combined at the data exchange device and then forwarded to the server. Likewise, when the server sends the second information calculation module to the data storage devices, it relies on the data exchange device as an intermediary to distribute the second information calculation module to the plurality of data storage devices.
It will be appreciated that, in the general scenario of the present application described above, natural language analysis readily shows that the second information calculation module is identical to the first information calculation module.
It should be noted that, in the present application, natural language analysis is mainly used to analyze the meaning of code. In fact, since compilers already come with certain analysis tools and the grammar of code is much stricter, analyzing the meaning of code is far less difficult than analyzing natural language. Therefore, a natural language analysis model only needs to be specifically trained on code to achieve the effects that the present application obtains through natural language analysis.
To describe the general principles of natural language analysis, this paragraph describes a non-general scenario. In some machine learning or deep learning scenarios, a neural network must be trained by collecting massive wind data, temperature data, and humidity data in order to produce a weather prediction; the code framework may be as shown in FIG. 3A. The first information calculation module may take the form of the function f1 in FIG. 3A or the function f3 in FIG. 3B. The input parameters are the wind data wind_datas, the temperature data temp_datas, and the humidity data humi_datas, respectively. The function ft_pre typically preprocesses the initial data, e.g., processes the "raw" temperature data to obtain the processed temperature data temp_datas2. The function ft_class classifies the data: one part of the processed temperature data, l_temp_datas, is used for machine learning, and the other part, s_temp_datas, is used for verifying the learning result. The functions ft_learning, ft_confirm, and f_adjustment are, respectively, the machine learning method, the method of verifying the learning result, and the method of adjusting the machine learning method according to the verification confidence.
It should be noted that, for simplicity, the code of FIG. 3A is written in C++-like form, but it does not strictly follow C++ syntax; for example, the input parameters on row #40 lack types. This does not affect the analysis of the embodiments of the present application.
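Since the figures themselves are not reproduced in this text, the following is a minimal C++ sketch of what the function f1 of FIG. 3A might look like, assembled purely from the description above. Only the names f1, ft_pre, ft_class, ft_learning, ft_confirm, f_adjustment, alg_params and the row numbers #40 to #44 come from the text; all types, the stub bodies, and the exact splitting logic are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Records = std::vector<double>;      // flat sensor readings
using DataSet = std::vector<Records>;     // readings cut into parts
struct ModelParams { Records weights; };  // the training model variable

// Stub bodies; in the figure these are real algorithm steps.
Records ft_pre(const Records& raw) { return raw; }            // preprocessing
void ft_class(const Records& flat, std::size_t base,          // cut into parts of
              DataSet& l, DataSet& s) {                       // `base` records,
    for (std::size_t off = 0; off < flat.size(); off += base) {  // split 9:1
        const std::size_t end = std::min(off + base, flat.size());
        const std::size_t cut = off + (end - off) * 9 / 10;
        l.emplace_back(flat.begin() + off, flat.begin() + cut);  // learning part
        s.emplace_back(flat.begin() + cut, flat.begin() + end);  // verification part
    }
}
void ft_learning(ModelParams&, const Records&) {}             // one learning step
double ft_confirm(ModelParams&, const Records&) { return 1.0; } // -> confidence
void f_adjustment(ModelParams&, double) {}                    // tune by confidence

// #40: the first information calculation module f1 (parameter types added here)
ModelParams f1(Records wind_datas, Records temp_datas, Records humi_datas) {
    (void)wind_datas; (void)humi_datas;  // handled analogously to temp_datas
    ModelParams alg_params;                                   // training model variable
    Records temp_datas2 = ft_pre(temp_datas);                 // #41
    DataSet l_temp_datas, s_temp_datas;
    ft_class(temp_datas2, 100, l_temp_datas, s_temp_datas);   // #42
    for (std::size_t i = 0; i < s_temp_datas.size(); ++i) {   // #43 (cf. #64, FIG. 3B)
        ft_learning(alg_params, l_temp_datas[i]);
        f_adjustment(alg_params, ft_confirm(alg_params, s_temp_datas[i]));
    }
    // #44: remaining work kept on the server after the cut; rendered here simply
    // as returning the result, since the actual figure row is not reproduced.
    return alg_params;
}
```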
It can be appreciated that, in the scenario where the data information is applied to machine learning or deep learning, in step S4 the second information calculation module is obtained from the first information calculation module through natural language analysis specifically via steps S101 to S103.
Step S101, obtaining the name of the training model variable. The training model variables are the set of all parameters in the algorithm model.
Step S102, finding the name of the training model variable in the first information calculation module, and cutting the first information calculation module into a first cutting module and a second cutting module, wherein the second cutting module does not contain any information of the training model variable.
Step S103, the training model variable is referenced in an input interface of the first cutting module, the first cutting module is used as a second information calculation module, and the second cutting module is used for replacing the first information calculation module.
Correspondingly, the data storage devices comprise at least a first data storage device, a second data storage device, and a third data storage device, and step S2 specifically comprises steps S201 to S203.
Step S201, calculating the data information by using the second information calculation module of the first data storage device so as to iterate the reference quantity, wherein the reference quantity comprises the training model variable.
Step S202, calculating the data information by using the second information calculation module of the second data storage device so as to iterate the reference quantity.
It will be appreciated that the second information calculation module in the second data storage device is transmitted not by the server but by the first data storage device via the data exchange device; it therefore carries the information of the reference quantity. The same applies to the second information calculation module of the third data storage device below.
Step S203, calculating the data information by using the second information calculation module of the third data storage device so as to iterate the reference quantity, and sending the iterated reference quantity to the server.
The communication method between the first logic module and the second logic module in the machine learning or deep learning scenario differs from the communication method in the general scenario; the difference can be seen by comparing FIG. 4B with FIG. 4A. The first data storage device, the second data storage device, and the third data storage device may correspond to the data storage device 1, the data storage device 2, and the data storage device 3 in FIG. 4B, respectively. It will be appreciated that the data exchange device may act as an intermediary when the data storage device 1 transmits the training model variables to the data storage device 2. Furthermore, when the data storage device 1 transmits the training model variables to the data storage device 2, the variables have at that point already been iterated by the data storage device 1. This explains why, as mentioned in step S103, the training model variables must be referenced in the input interface of the first cutting module; a sketch of this relay follows.
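A hedged sketch of the relay, continuing the types and helpers of the FIG. 3A sketch above; DataStorageDevice, second_info_calc, and relay are illustrative names, not from the original.

```cpp
#include <vector>

// Assumes Records, ModelParams, and ft_learning from the FIG. 3A sketch above.
struct DataStorageDevice {
    Records local_temp;  // this device's shard of the data information
    // Second information calculation module: rows #41-#43 of FIG. 3A resumed
    // from the incoming variables; the body is elided to a single stand-in step
    // and is spelled out in the cutting sketch further below.
    ModelParams second_info_calc(ModelParams alg_params) {
        ft_learning(alg_params, local_temp);   // stands in for rows #41-#43
        return alg_params;                     // the iterated reference quantity
    }
};

// S201 -> S202 -> S203: the variables enter and leave through the module's
// interface, so each device resumes where its predecessor stopped.
ModelParams relay(std::vector<DataStorageDevice>& devices) {
    ModelParams alg_params;                    // initial variables from the server
    for (auto& dev : devices)                  // forwarded via the exchange device
        alg_params = dev.second_info_calc(alg_params);
    return alg_params;                         // finally sent back to the server
}
```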
It will be appreciated that, in FIG. 3A for example, the training model variable is alg_params. In general, taking FIG. 3A as an example, the training result is stored in a class type and should at least include: the algorithm of the training model, the parameters of the training model, and the interface of the training model. The algorithm of the training model refers to the machine learning algorithm, deep neural network algorithm, or the like; the parameters of the training model refer to all the parameters corresponding to the algorithm of the training model, which are continuously optimized as each piece of data information is learned; and the interface of the training model internally references the training model variable and is used to obtain a specific conclusion from the input of new data information. The specific conclusion is the final goal of the algorithm of the training model. In short, the training model variables are trained with known data information so that, when new data information is input, a specific conclusion can be reached through the interface of the training model containing the training model variables.
It is not difficult to see that the parameters of a training model must generally be given certain parameter names, regardless of the content of the algorithm itself, for example: model parameters, tuning parameters, hyper-parameters, and so on. Thus, obtaining the name of the training model variable in step S101 may specifically include: analyzing the role of each variable in the first information calculation module one by one through natural language analysis; and determining the name of the training model variable according to the role of each variable. More importantly, an interface is inevitably overloaded in the training model, and the input parameters of that interface contain the type of the data information. Obtaining the name of the training model variable in step S101 may therefore further include: obtaining the type of the data information, searching for a function whose input parameters contain the type of the data information, and determining the name of the training model variable from that function, as sketched below.
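A hypothetical illustration of this second clue, continuing the sketch above; TrainResult, predict, and Conclusion are invented names. The point is only that the inference interface takes the data information type as input and references the training model variable internally, so the variable's name can be traced from it.

```cpp
// Assumes Records and ModelParams from the FIG. 3A sketch above.
struct Conclusion { double value; };   // the "specific conclusion" of the model

struct TrainResult {                   // training result stored in a class type
    ModelParams alg_params;            // <- candidate training model variable
    // Interface of the training model: its input parameter carries the type of
    // the data information, and its body references alg_params.
    Conclusion predict(const Records& new_datas) {
        double sum = 0.0;
        for (double w : alg_params.weights) sum += w;
        return Conclusion{sum + (new_datas.empty() ? 0.0 : new_datas.front())};
    }
};
```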
Typically, an algorithm is packaged in a separately written algorithm file, which may include a number of comments on the algorithm. Thus, step S101 further includes: confirming the role of a variable according to the comment information in the first information calculation module. In addition, considering that current natural language processing capability is still at a developing stage, such confirmation may also be assisted by manual judgment.
In FIG. 3A, it can be understood that the first cutting module includes the row codes #41 to #43, and the second cutting module includes the row code #44; the cut is sketched below.
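Rendered against the f1 sketch above, the cut of steps S102 and S103 might look as follows; this is an assumed reconstruction, not the figure itself.

```cpp
// Assumes the types and helpers of the FIG. 3A sketch above.
// First cutting module (rows #41-#43): alg_params is lifted into the input
// interface (step S103) so that a device can resume from a predecessor's state.
ModelParams first_cutting_module(ModelParams alg_params, Records temp_datas) {
    Records temp_datas2 = ft_pre(temp_datas);                // #41
    DataSet l_temp_datas, s_temp_datas;
    ft_class(temp_datas2, 100, l_temp_datas, s_temp_datas);  // #42
    for (std::size_t i = 0; i < s_temp_datas.size(); ++i) {  // #43
        ft_learning(alg_params, l_temp_datas[i]);
        f_adjustment(alg_params, ft_confirm(alg_params, s_temp_datas[i]));
    }
    return alg_params;
}

// Second cutting module (row #44): replaces f1 on the server. It contains no
// information about the training model variable itself; it only passes the
// relayed intermediate result on as the target data.
ModelParams second_cutting_module(ModelParams intermediate) {
    return intermediate;                                     // #44
}
```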
FIG. 3B gives another example scenario. In FIG. 3B, the training model variables are iteratively adjusted through the variable belief in the for loop; the variable belief represents the credibility of the current training model variables. In FIG. 3B, it can be understood that the first cutting module includes the row codes #61 to #68, and the second cutting module is empty. With reference to FIGS. 3B and 4B, it is not difficult to verify the feasibility of the transmission and invocation of the data information in steps S101 to S103 and S201 to S203 in the scenario of FIG. 3B.
More specifically regarding this feasibility, assume a first scenario in which the ft_class function cuts the data information into equal parts of a base data amount of 100w, where the ratio of l_temp_datas to s_temp_datas is 9:1. Assume further that there are three data storage devices and that the amounts of data on the first, second, and third data storage devices are 520w, 730w, and 930w, respectively. If the most primitive method is adopted, that is, all the data are transmitted to the server at once and processed uniformly by the server, then s_temp_datas.length() at the row code #64 is ceiling((520+730+930)/100) = 22, where ceiling denotes rounding up. If the methods of steps S101 to S103 and S201 to S203 are adopted instead, s_temp_datas.length() in the second information calculation modules of the first, second, and third data storage devices is 6, 8, and 10, respectively. The only difference between the two is that the number of cut parts is greater in the example of the present application. In many cases, when the amount of data in the final cut is too small compared with the base data amount, that data is effectively wasted. That is, the last 20w of data in the first data storage device, the last 30w in the second data storage device, and the last 30w in the third data storage device may not achieve a good training effect because their amounts are too small compared with the base data amount, thereby wasting a total of 80w of data.
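These counts can be checked mechanically. A small standalone program, assuming only the 100w base amount and the per-device volumes stated above (w denotes a unit of 10,000 records in the original):

```cpp
#include <cstdio>

int main() {
    const int base = 100;                  // base data amount per part, in w
    const int devs[3] = {520, 730, 930};   // data on the three storage devices
    int parts = 0, leftover = 0, total = 0;
    for (int d : devs) {
        parts    += (d + base - 1) / base; // ceiling(d / base): 6, 8, 10
        leftover += d % base;              // sub-threshold tails: 20, 30, 30
        total    += d;
    }
    std::printf("centralised parts : %d\n", (total + base - 1) / base); // 22
    std::printf("distributed parts : %d\n", parts);                     // 24
    std::printf("wasted data       : %dw\n", leftover);                 // 80w
    return 0;
}
```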
The above situation is usually left unaddressed, because handling it requires more advanced natural language analysis techniques that must at least analyze the effect of ft_class and track the flow of the verification data, e.g., s_temp_datas. This is a great difficulty for both current code semantic analysis and natural language analysis. However, since the function ft_class is generally independent of the actual algorithm module and is usually determined by the actual project and engineering, ft_class can be handled by preprocessing means such as manual marking. For example, a first mapping table can be defined directly and manually, in which a key represents a function name, e.g., ft_class, and a value represents the role of the function, e.g., "data cutting function". The first mapping table is used to perform targeted refinement of details, and the values of the first mapping table should be expressed as uniformly as possible. The reason for the uniform expression is that the preprocessing means further comprise a second mapping table, whose keys correspond exactly to the values of the first mapping table and whose values contain the processing means information; a sketch of the two tables follows.
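A minimal sketch of the two tables, assuming plain string keys; only the ft_class entry and its "data cutting function" value come from the text, and the ProcessingMeans fields paraphrase the processing means information described below.

```cpp
#include <map>
#include <string>

// Processing means information attached to a role (fields are paraphrases).
struct ProcessingMeans {
    std::string code_segment_condition;  // condition on the code segment that
                                         // references the output of g (e.g. #65)
    std::string param_threshold;         // threshold setting on a parameter,
                                         // e.g. the size of l_temp_datas[i]
    std::string adapt_function;          // adaptation function, e.g. "merge"
};

// First mapping table: function name -> role of the function.
const std::map<std::string, std::string> first_mapping = {
    {"ft_class", "data cutting function"},
};

// Second mapping table: role (exactly the values above) -> processing means.
const std::map<std::string, ProcessingMeans> second_mapping = {
    {"data cutting function",
     {"code segments referencing the output of ft_class",
      "l_temp_datas[i] smaller than the base data amount",
      "merge"}},
};
```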
Based on this, in some embodiments, in step S4 the process by which the second information calculation module is obtained from the first information calculation module through natural language analysis further includes steps S301 to S304, where steps S301 to S304 are performed before step S103.
Step S301, determining whether a function in the first information calculation module appears in a preset first mapping table.
Step S302, if a function g appears in the preset first mapping table, obtaining the role of the function g according to the first mapping table, and obtaining the processing means information according to a second mapping table.
Step S303, determining the verification data according to the processing means information in combination with natural language analysis.
Step S304, referencing the verification data in the input interface of the first cutting module, and modifying the first cutting module, in combination with the processing means information, so that it adapts to the interface.
Correspondingly, the reference quantity also includes the verification data.
Taking FIG. 3B as an example, assume that the function g is ft_class and that its role is "data cutting function"; the processing means information may then include condition settings for a specific code segment, threshold settings for parameters, and an adaptation function. The specific code segment here is typically the code segment that references the output of the function g (e.g., l_temp_datas), e.g., the row code #65. The threshold setting for the parameter may be a determination of the size of l_temp_datas[i]. The adaptation function is used to modify the first cutting module so that it adapts to the interface.
For convenience of explanation, the verification data is divided into first verification data and second verification data, where the first verification data is associated with the verification data in the reference quantity in step S201, and the second verification data with the verification data in the reference quantity in step S202. In the first scenario, it can be understood that the first verification data are l_temp_datas[5], l_wind_datas[5], l_humi_datas[5], s_temp_datas[5], s_wind_datas[5], and s_humi_datas[5], and the second verification data are l_temp_datas[7], l_wind_datas[7], l_humi_datas[7], s_temp_datas[7], s_wind_datas[7], and s_humi_datas[7]. In step S201, when the row code #64 begins its 6th execution, the data amount of l_temp_datas[5] is only 20×9/10 = 18w, so, according to the threshold setting for the parameter, the data not yet processed are packed and transmitted together (i.e., the verification data are referenced in the input interface of the first cutting module) to the second data storage device. In the second data storage device, the adaptation function may, as in the merge function of FIG. 3C, merge the 20w of data referenced from the first data storage device, i.e., quote. This solves the problem of data waste. The first cutting module, after being modified to adapt to the interface, is as shown in FIG. 3C and sketched below.
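A sketch of this adaptation under the same assumptions as above; merge and first_cutting_module_adapted are assumed renderings of the FIG. 3C code, which is not reproduced here.

```cpp
// Assumes Records, ModelParams, and first_cutting_module from the sketches above.
Records merge(const Records& quote, const Records& local) {
    Records out = quote;                               // predecessor's tail first
    out.insert(out.end(), local.begin(), local.end()); // then the local data
    return out;
}

// Adapted first cutting module: the sub-threshold shard referenced from the
// previous device (quote) is merged in front of the local data before cutting,
// so the 20w tail of the first device is no longer wasted.
ModelParams first_cutting_module_adapted(ModelParams alg_params,
                                         const Records& quote,      // verification
                                         const Records& temp_datas) // data in the
{                                                                   // reference
    return first_cutting_module(alg_params, merge(quote, temp_datas)); // quantity
}
```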
In summary, the present application discloses a distributed data management system based on natural language analysis, comprising a plurality of data storage devices, a data exchange device, and a server.
The data information is stored in a distributed manner across the plurality of data storage devices; the data storage devices obtain the data information in response to the instruction for gathering the data information and calculate the data information by using the second information calculation module, thereby obtaining intermediate data, which is sent to the data exchange device.
The data exchange device is used for sending the intermediate data to the server.
The server side is used for sending an instruction for collecting data information, and calculating the intermediate data by using the first information calculation module to obtain target data; the second information calculation module is obtained by the first information calculation module through natural language analysis.
While the applicant has described and illustrated the embodiments of the present application in detail with reference to the drawings, those skilled in the art should understand that the above embodiments are only preferred embodiments of the present application. The detailed description is intended only to help the reader better understand the spirit of the present application, not to limit its scope of protection; on the contrary, any improvement or modification based on the spirit of the present application shall fall within the scope of protection of the present application.

Claims (9)

1. A distributed data management method based on natural language analysis is characterized by comprising the following steps S1-S4;
step S1, responding to an instruction for collecting data information, and acquiring the data information from a data storage device;
step S2, calculating the data information by using a second information calculation module to obtain intermediate data and sending the intermediate data to the data exchange equipment;
step S3, intermediate data in the data exchange equipment are sent to a server;
step S4, the server calculates the intermediate data by using a first information calculation module to obtain target data; the second information calculation module is obtained by the first information calculation module through natural language analysis.
2. The distributed data management method based on natural language analysis according to claim 1, wherein in step S4, when the data information is applied to a machine learning or deep learning scenario, the second information calculation module is obtained from the first information calculation module through natural language analysis specifically via steps S101 to S103;
step S101, obtaining the name of the training model variable;
step S102, finding the name of the training model variable in the first information calculation module, and cutting the first information calculation module into a first cutting module and a second cutting module, wherein the second cutting module does not contain any information about the training model variable;
step S103, referencing the training model variable in an input interface of the first cutting module, taking the first cutting module as the second information calculation module, and replacing the first information calculation module with the second cutting module;
correspondingly, the data storage devices comprise at least a first data storage device, a second data storage device and a third data storage device, and step S2 specifically comprises steps S201-S203;
step S201, calculating the data information by using the second information calculation module of the first data storage device so as to iterate the training model variables;
step S202, calculating the data information by using the second information calculation module of the second data storage device, according to the training model variables iterated in step S201, so as to iterate the training model variables;
step S203, calculating the data information by using the second information calculation module of the third data storage device, according to the training model variables iterated in step S202, so as to iterate the training model variables, and sending the iterated training model variables to the server.
3. The distributed data management method based on natural language analysis according to claim 2, wherein obtaining the name of the training model variable in step S101 specifically comprises: analyzing the role of each variable in the first information calculation module one by one through natural language analysis; and determining the name of the training model variable according to the role of each variable.
4. The method of claim 2, wherein obtaining the name of the training model variable in step S101 further comprises: obtaining the type of the data information, searching for a function whose input parameters contain the type of the data information, and determining the name of the training model variable from that function.
5. The distributed data management method according to claim 2, wherein step S101 further comprises: confirming the role of a variable according to the comment information in the first information calculation module.
6. The distributed data management method based on natural language analysis according to claim 2, wherein in step S4, the process by which the second information calculation module is obtained from the first information calculation module through natural language analysis further comprises steps S301 to S304, wherein steps S301 to S304 are performed before step S103;
step S301, judging whether a function in the first information calculation module appears in a preset first mapping table;
step S302, if a function g appears in the preset first mapping table, obtaining the role of the function g according to the first mapping table, and obtaining the processing means information according to a second mapping table;
step S303, determining the verification data according to the processing means information in combination with natural language analysis;
step S304, referencing the verification data in the input interface of the first cutting module, and modifying the first cutting module, in combination with the processing means information, so that it adapts to the interface;
correspondingly, the reference quantity also includes the verification data.
7. A distributed data management system based on natural language analysis for performing the method of any one of claims 1-6, comprising: a plurality of data storage devices, a data exchange device, and a server;
the data information is stored in a distributed manner across the plurality of data storage devices; the data storage devices obtain the data information in response to the instruction for gathering the data information and calculate the data information by using the second information calculation module, thereby obtaining intermediate data, which is sent to the data exchange device;
the data exchange equipment is used for sending the intermediate data to the server;
the server side is used for sending an instruction for collecting data information, and calculating the intermediate data by using the first information calculation module to obtain target data; the second information calculation module is obtained by the first information calculation module through natural language analysis.
8. A terminal comprising a processor and a storage medium, characterized in that:
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1-6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-6.
CN202310976377.XA 2023-08-04 2023-08-04 Distributed data management method based on natural language analysis Active CN116684437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310976377.XA CN116684437B (en) 2023-08-04 2023-08-04 Distributed data management method based on natural language analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310976377.XA CN116684437B (en) 2023-08-04 2023-08-04 Distributed data management method based on natural language analysis

Publications (2)

Publication Number Publication Date
CN116684437A true CN116684437A (en) 2023-09-01
CN116684437B CN116684437B (en) 2023-10-03

Family

ID=87789512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310976377.XA Active CN116684437B (en) 2023-08-04 2023-08-04 Distributed data management method based on natural language analysis

Country Status (1)

Country Link
CN (1) CN116684437B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164083A1 (en) * 2017-11-28 2019-05-30 Adobe Inc. Categorical Data Transformation and Clustering for Machine Learning using Natural Language Processing
US20200301932A1 (en) * 2019-03-20 2020-09-24 Promethium, Inc. Using stored execution plans for efficient execution of natural language questions
US20220019934A1 (en) * 2020-07-15 2022-01-20 Bank Of America Corporation System for artificial intelligence-based electronic data analysis in a distributed server network
CN112769907A (en) * 2020-12-29 2021-05-07 苏宁消费金融有限公司 Internal data exchange system of distributed system based on internet financial industry
CN113297218A (en) * 2021-05-20 2021-08-24 广州光点信息科技有限公司 Multi-system data interaction method, device and system
CN113807950A (en) * 2021-09-22 2021-12-17 平安银行股份有限公司 Business analysis method based on natural language processing model and related device
CN114035936A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Multidimensional parallel processing method, system and equipment based on artificial intelligence and readable storage medium
CN114238438A (en) * 2021-12-10 2022-03-25 北京天融信网络安全技术有限公司 Method, device, equipment and medium for real-time calculation and statistics of data
CN115017420A (en) * 2022-05-06 2022-09-06 上海捷晓信息技术有限公司 Intelligent address searching system and method based on deep learning
CN115879541A (en) * 2022-12-05 2023-03-31 阿里巴巴(中国)有限公司 Data processing method, data representation learning method, system and equipment
CN116306897A (en) * 2023-02-06 2023-06-23 杭州电子科技大学 Neural network distributed automatic parallel training method based on AC reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SEONGIK PARK et al.: "A Neural Language Model for Multi-Dimensional Textual Data based on CNN-LSTM Network", 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) *
刘旭东; 苏马婧; 朱广宇: "Research and Design of a Multi-source Intelligence Analysis System Based on Natural Language Processing" (in Chinese), Information Technology and Network Security, No. 05
杨越童: "Research on Distributed Training Optimization Technology for Sparse Models in the NLP Field" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology

Also Published As

Publication number Publication date
CN116684437B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN110704518B (en) Business data processing method and device based on Flink engine
CN106528880B (en) Method and system for regulating data structure format of multi-source power business data
CN105072130B (en) A kind of ASN.1 decoders code automatic generation method
CN111585344B (en) Substation intelligent checking method and device based on total station IED simulation
CN111191767A (en) Vectorization-based malicious traffic attack type judgment method
CN111427940B (en) Self-adaptive database conversion method and device
CN111048080A (en) Intelligent dispatching command system based on voice recognition technology
CN115170344A (en) Intelligent processing method and device, medium and equipment for operation events of regulation and control system
CN116684437B (en) Distributed data management method based on natural language analysis
CN101021916A (en) Business process analysis method
CN114817178A (en) Industrial Internet data storage method, system, storage medium and electronic equipment
CN113434123A (en) Service processing method and device and electronic equipment
CN109033483B (en) Method, device and system for defining data relationship in YANG model
CN116628451B (en) High-speed analysis method for information to be processed
CN113094932A (en) Method, device, equipment and storage medium for acquiring construction cost of power transformation project
CN111522705A (en) Intelligent operation and maintenance solution method for industrial big data
CN111507477A (en) Automatic machine learning platform based on block chain
CN115205030A (en) Wind-controlled user portrait system based on configurable big data analysis
CN112015726B (en) User activity prediction method, system and readable storage medium
CN113553728A (en) Method and system for generating intelligent substation station control layer application system model
CN113689310A (en) Data processing method based on industrial Internet of things intelligent chip and related equipment
CN113051445A (en) Industrial production data processing method and device, computer equipment and storage medium
CN110647546A (en) Third-party rule engine generation method and device
CN114884937B (en) New energy centralized control system data breakpoint continuous transmission method
CN117786705B (en) Statement-level vulnerability detection method and system based on heterogeneous graph transformation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant