CN115080602A

CN115080602A - Method for realizing accurate search of data assets based on NLP algorithm

Info

Publication number: CN115080602A
Application number: CN202210275470.3A
Authority: CN
Inventors: 于洋; 高经郡; 谢晋
Original assignee: Beijing Kejie Technology Co ltd
Current assignee: Beijing Kejie Technology Co ltd
Priority date: 2022-03-21
Filing date: 2022-03-21
Publication date: 2022-09-20
Anticipated expiration: 2042-03-21
Also published as: CN115080602B

Abstract

The invention discloses a method for realizing accurate search of data assets based on an NLP algorithm. The method retrieves data assets by natural language and treats data as an asset, for which relationships are maintained automatically and model generation, intelligent retrieval and the like are carried out. It achieves efficient retrieval and a high hit rate for the asset directory generated from big data and for the field lineage (consanguinity) relationships, labels, indexes and the like in that directory. Frequently queried TopN hot data can also be built up effectively. By using the redis cache, queries on similar keywords are answered quickly, which greatly shortens the query link and reduces the pressure on the relational database.

Description

Method for realizing accurate search of data assets based on NLP algorithm

Technical Field

The invention relates to the technical field of data processing, in particular to a method for realizing accurate search of data assets based on an NLP algorithm.

Background

In existing data asset search methods, asset retrieval is limited to the maintenance and retrieval of physical assets, such as data on building materials; the query link is long, and interfaces and calling relationships must be maintained manually. The prior art cannot efficiently retrieve metadata asset tags or cross-asset lineage (consanguinity) relationships. In addition, the hit rate and recall are low.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a method for realizing accurate search of data assets based on an NLP algorithm.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for realizing accurate search of data assets based on an NLP algorithm comprises the following specific processes:

Step one, generation of asset metadata:

Step two, index construction:

(1) Construction of the metadata index:

An acquisition module pulls asset metadata according to a timed acquisition task. Pulling is either full or incremental: full pulling is used for a newly added table and incremental pulling for an existing table. Null values and numerical values are then removed from the content of each field, and the remaining keywords are kept as the index of the metadata;

(2) Construction of the data index:

1) For an existing table, the synchronization/construction service pulls the data incrementally. Keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library. After extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;

2) For a newly added table, the synchronization/construction service pulls the data in full. Keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library. After extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data;
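The two sub-steps of index construction can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `metadata_keywords` and `build_dictionary` are hypothetical, empty strings are treated as null values by assumption, and keyword extraction is approximated by plain substring matching against the keyword library.

```python
import re

# Matches plain integers and decimals such as "42" or "-3.14".
NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")


def metadata_keywords(field_values):
    """Sub-step (1): drop null and numerical contents, keep the rest as keywords."""
    keywords = []
    for value in field_values:
        if value is None:
            continue  # null value removal
        text = str(value).strip()
        if not text:
            continue  # assumption: blank strings count as null values
        if NUMERIC.fullmatch(text):
            continue  # numerical value removal
        keywords.append(text)
    return keywords


def build_dictionary(field_values, keyword_library):
    """Sub-step (2): extract library keywords found in the fields, de-duplicated."""
    found = set()  # a set removes duplicate keywords
    for value in field_values:
        if value is None:
            continue
        text = str(value)
        for keyword in keyword_library:
            if keyword in text:  # assumption: substring match stands in for extraction
                found.add(keyword)
    return sorted(found)


if __name__ == "__main__":
    print(metadata_keywords(["crude oil", None, "42", "  ", "refinery", "3.14"]))
    # ['crude oil', 'refinery']
    library = {"oil", "refinery", "pipeline"}
    print(build_dictionary(["crude oil price", "oil refinery", None], library))
    # ['oil', 'refinery']
```

In a real deployment the returned dictionary would be written into es by the synchronization/construction service; that write is left out here because the text does not specify it.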

Step three, data asset retrieval:

The NLP2SQL service receives the content to be retrieved from an input entry and, after mapping it against the code table, generates the SQL statement of the query; the NLP2SQL service parses the natural language through lexical analysis to generate a machine-executable SQL statement;

The NLP service searches the redis cache. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, the retrieval ends directly. The returned data result includes: the positions of the library and table where the asset data is located and of the field where the keyword is located, the mapping relationship between the field and other libraries and tables, the mapping relationship between the field and the tags, and the mapping relationship between the field and the indexes. If the redis cache does not hit all of the results, data results must further be obtained from the es library according to the index: if the index is not hit, the keyword does not exist in the es library and the retrieval ends; if the index is hit, a data result is returned;

The returned data results are sorted and then returned to the NLP2SQL front end for display as a list;

When a user queries the details of a list entry, the asset data must be requested again from the data middle platform (data center station), and presto is used to query hive according to the library, table and row position where the asset data is located. The query results form TopN hot-spot data, which is cached in redis and synchronized to the es library at set time intervals.
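How detail queries could turn into TopN hot-spot data can be sketched as follows. This is only an illustration under stated assumptions: the class `HotQueryTracker` and its methods are hypothetical, hotness is assumed to mean the number of detail queries per keyword, and plain Python objects stand in for redis.

```python
from collections import Counter


class HotQueryTracker:
    """Counts detail queries and exposes the N most frequent ones with their results."""

    def __init__(self, top_n=3):
        self.top_n = top_n
        self.counts = Counter()  # keyword -> number of detail queries
        self.results = {}        # keyword -> latest detail result

    def record(self, keyword, result):
        """Register one detail query and remember its result."""
        self.counts[keyword] += 1
        self.results[keyword] = result

    def hot_entries(self):
        """Return the TopN keywords with their results, most frequent first."""
        return {
            keyword: self.results[keyword]
            for keyword, _ in self.counts.most_common(self.top_n)
        }


if __name__ == "__main__":
    tracker = HotQueryTracker(top_n=2)
    for keyword in ["oil", "gas", "oil", "coal", "oil", "gas"]:
        tracker.record(keyword, {"keyword": keyword})
    print(list(tracker.hot_entries()))  # ['oil', 'gas']
```

The dictionary returned by `hot_entries` is what would be placed in the cache and later synchronized to the es library at the configured interval.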

Further, the specific process of step one is as follows: the asset management module generates asset metadata at regular intervals, together with the corresponding asset lineage (consanguinity) relationships, code table mapping relationships and label data; the asset metadata is stored in the asset metadata library, and the asset lineage relationships are stored in the lineage relationship library; newly added data of the lineage relationship library, the asset metadata library and the code table/term library is first cached in redis, and redis writes the cached data into the es library at regular intervals, finally forming the asset directory.

The invention has the following beneficial effects: the invention retrieves data assets by natural language and treats data as an asset, for which relationships are maintained automatically and model generation, intelligent retrieval and the like are carried out. It achieves efficient retrieval and a high hit rate for the asset directory generated from big data and for the field lineage (consanguinity) relationships, labels, indexes and the like in that directory. Frequently queried TopN hot data can also be built up effectively. By using the redis cache, queries on similar keywords are answered quickly, which greatly shortens the query link and reduces the pressure on the relational database.

Drawings

FIG. 1 is a flow chart illustrating the construction of an index according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of data retrieval according to an embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings. It should be noted that the present embodiment is based on the above technical solution and provides a detailed implementation and a specific operation process, but the protection scope of the present invention is not limited to the present embodiment.

This embodiment provides a method for realizing accurate search of data assets based on an NLP algorithm. As shown in FIG. 1, the specific process is as follows:

Step one, generation of asset metadata:

The asset management module generates asset metadata at regular intervals, together with the corresponding asset lineage (consanguinity) relationships, code table mapping relationships and label data. The asset metadata is stored in the asset metadata library, and the asset lineage relationships are stored in the lineage relationship library. Industry terms are collected and stored in the code table/term library. If the lineage relationship library, the asset metadata library or the code table/term library has new data, the new data is first cached in redis; redis writes the cached data into es at fixed time intervals, and an asset directory is finally formed.
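The buffering pattern in step one (new data lands in the cache first and is written to the search index at fixed intervals) can be sketched as below. This is a simplified illustration, not the described system: `WriteBuffer` is a hypothetical name, an in-memory dict stands in for redis, another dict stands in for the es library, and the caller supplies the current time so the example stays deterministic.

```python
class WriteBuffer:
    """In-memory stand-in for a cache that is flushed to an index on a timer."""

    def __init__(self, flush_interval_s):
        self.flush_interval_s = flush_interval_s
        self.pending = {}      # records cached since the last flush
        self.last_flush = 0.0  # time of the last flush, in seconds

    def put(self, key, record):
        """Cache a newly added record (lineage, metadata or code table entry)."""
        self.pending[key] = record

    def flush_if_due(self, now, index):
        """Write all pending records to `index` once the interval has elapsed."""
        if now - self.last_flush < self.flush_interval_s:
            return 0  # too early, keep buffering
        index.update(self.pending)
        written = len(self.pending)
        self.pending.clear()
        self.last_flush = now
        return written


if __name__ == "__main__":
    asset_directory = {}  # stand-in for the es library
    buffer = WriteBuffer(flush_interval_s=60)
    buffer.put("lineage:orders", {"upstream": ["raw_orders"]})
    print(buffer.flush_if_due(now=30, index=asset_directory))  # 0
    print(buffer.flush_if_due(now=60, index=asset_directory))  # 1
    print(asset_directory)  # {'lineage:orders': {'upstream': ['raw_orders']}}
```

Batching writes this way trades a bounded delay (at most one interval) for fewer write operations against the index.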

Step two, index construction:

(1) Construction of the metadata index:

(2) Construction of the data index:

2) For a newly added table, the synchronization/construction service pulls the data in full. Keywords are extracted from the fields, using the code table/term library together with the most frequently accessed keywords as the keyword library. After extraction, all keywords are de-duplicated to generate a dictionary, which the synchronization/construction service writes into es as the index of the data.
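The choice between full and incremental pulling can be sketched as follows. The text does not say how the increment is determined, so this sketch assumes each row carries a monotonically increasing `version` value and that the service remembers the highest version seen per table; the function name `pull_rows` and the `watermarks` mapping are hypothetical.

```python
def pull_rows(table, rows, watermarks):
    """Return the rows to index for `table` and advance its watermark.

    A table with no watermark is treated as newly added and pulled in full;
    otherwise only rows newer than the watermark are pulled.
    """
    last_seen = watermarks.get(table)
    if last_seen is None:
        selected = list(rows)  # full pull for a newly added table
    else:
        # incremental pull: only rows added since the previous pull
        selected = [row for row in rows if row["version"] > last_seen]
    if rows:
        watermarks[table] = max(row["version"] for row in rows)
    return selected


if __name__ == "__main__":
    watermarks = {}
    rows = [{"version": 1, "name": "well A"}, {"version": 2, "name": "well B"}]
    print(len(pull_rows("wells", rows, watermarks)))  # 2
    rows.append({"version": 3, "name": "well C"})
    print(pull_rows("wells", rows, watermarks))
    # [{'version': 3, 'name': 'well C'}]
```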

Step three, data asset retrieval:

As shown in FIG. 2, the NLP2SQL service receives the content to be retrieved (natural language) from an input entry and then generates the SQL statement of the query by mapping it against the code table. Here SQL covers both structured and unstructured query languages: structured queries mainly go to the mapped code table library (mysql), and unstructured queries mainly go to the es index library. The NLP2SQL service parses the natural language through lexical analysis and generates a machine-executable SQL statement.
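A very small sketch of the idea behind such a translation is given below. It is not the NLP2SQL service itself: the lexical analysis is reduced to a regular-expression tokenizer with a stop-word list, the code table is assumed to be a simple mapping from user terms to canonical terms, and the table and column names (`asset_index`, `keyword`, `db_name`, `table_name`, `field_name`) are hypothetical.

```python
import re

# Assumed stop words; a real lexical analyzer would be far richer.
STOPWORDS = {"a", "all", "and", "find", "for", "in", "me", "of", "show", "the"}


def tokenize(text):
    """Split the query into lowercase word tokens and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9_]+", text.lower()) if t not in STOPWORDS]


def to_sql(text, code_table):
    """Map tokens through the code table and build a parameterized query.

    Returns the SQL text and the list of parameters to bind to it.
    """
    terms = []
    for token in tokenize(text):
        term = code_table.get(token, token)  # unmapped tokens are used as-is
        if term not in terms:
            terms.append(term)
    if not terms:
        raise ValueError("no searchable keyword in query")
    where = " OR ".join("keyword = %s" for _ in terms)
    sql = f"SELECT db_name, table_name, field_name FROM asset_index WHERE {where}"
    return sql, terms


if __name__ == "__main__":
    code_table = {"crude": "petroleum", "oil": "petroleum"}
    sql, params = to_sql("Find crude oil in the wells", code_table)
    print(sql)
    print(params)  # ['petroleum', 'wells']
```

Keeping the keywords as bound parameters rather than splicing them into the SQL text avoids injection problems when the input comes directly from a user.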

The NLP service searches the redis cache. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, the retrieval ends directly. The returned data result includes: the positions of the library and table where the asset data is located and of the field where the keyword is located (for example, oil is in row 100 of the field form_Point), the mapping relationship between the field and other libraries and tables, the mapping relationship between the field and the label, and the mapping relationship between the field and the index. If the redis cache does not hit all of the results, data results must further be obtained from the es library according to the index: if the index is not hit, the keyword (index) does not exist in the es library and the retrieval ends; if the index is hit, a data result is returned.
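The lookup order described above (cache first, index only for what the cache missed) can be sketched as follows. Two dicts stand in for the redis cache and the es library, and the function name `search` and the shape of the result records are assumptions made for illustration.

```python
def search(keywords, cache, index):
    """Look up each keyword in the cache first, then in the index.

    Returns a mapping keyword -> (source, result). Keywords found in neither
    store are omitted, which corresponds to the retrieval ending without a result.
    """
    results = {}
    missed = []
    for keyword in keywords:
        if keyword in cache:
            results[keyword] = ("cache", cache[keyword])
        else:
            missed.append(keyword)
    # If everything was hit, `missed` is empty and the index is never queried.
    for keyword in missed:
        hit = index.get(keyword)
        if hit is not None:
            results[keyword] = ("index", hit)
    return results


if __name__ == "__main__":
    cache = {"oil": {"library": "energy", "table": "wells", "field": "product"}}
    index = {"gas": {"library": "energy", "table": "pipes", "field": "medium"}}
    found = search(["oil", "gas", "coal"], cache, index)
    print(found["oil"][0], found["gas"][0], "coal" in found)  # cache index False
```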

Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims

1. A method for realizing accurate search of data assets based on NLP algorithm is characterized by comprising the following specific processes:

step one, generation of asset metadata:

step two, index construction:

(1) construction of the metadata index:

(2) construction of the data index:

step three, data asset retrieval:

the NLP service searches the redis cache; if the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, the retrieval ends directly; the returned data result includes: the positions of the library and table where the asset data is located and of the field where the keyword is located, the mapping relationship between the field and other libraries and tables, the mapping relationship between the field and the tags, and the mapping relationship between the field and the indexes; if the redis cache does not hit all of the results, data results must further be obtained from the es library according to the index; if the index is not hit, the keyword does not exist in the es library and the retrieval ends, and if the index is hit, a data result is returned;

2. The method according to claim 1, wherein the specific process of step one is as follows: the asset management module generates asset metadata at regular intervals, together with the corresponding asset lineage (consanguinity) relationships, code table mapping relationships and label data; the asset metadata is stored in the asset metadata library, and the asset lineage relationships are stored in the lineage relationship library; newly added data of the lineage relationship library, the asset metadata library and the code table/term library is first cached in redis, and redis writes the cached data into the es library at regular intervals, finally forming the asset directory.