CN115080602B

CN115080602B - Method for realizing accurate search of data assets based on NLP algorithm

Info

Publication number: CN115080602B
Application number: CN202210275470.3A
Authority: CN
Inventors: 于洋; 高经郡; 谢晋
Original assignee: Beijing Kejie Technology Co ltd
Current assignee: Beijing Kejie Technology Co ltd
Priority date: 2022-03-21
Filing date: 2022-03-21
Publication date: 2023-05-26
Anticipated expiration: 2042-03-21
Also published as: CN115080602A

Abstract

The invention discloses a method for realizing accurate search of data assets based on an NLP algorithm. The method supports natural-language search over data assets, treats data as an asset, and automates relationship maintenance, model generation, intelligent search, and the like. It provides efficient retrieval and a high hit rate for asset catalogs generated from big data and for the field lineage (blood relationships), labels, indexes, and the like within those catalogs. It also efficiently builds TopN data for frequently issued (hot) queries. Results for similar keyword queries are served quickly from the redis cache, which greatly shortens the query path and reduces the load on the relational database.

Description

Method for realizing accurate search of data assets based on NLP algorithm

Technical Field

The invention relates to the technical field of data processing, in particular to a method for realizing accurate search of data assets based on an NLP algorithm.

Background

In existing data asset search methods, the searchable asset types are fixed to the maintenance and search of physical assets, for example searching data on building materials. The query path is long, and interfaces and calling relationships must be maintained manually. The prior art does not provide efficient retrieval of metadata asset labels or of lineage (blood relationships) that spans assets. In addition, the hit rate and recall rate are low.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a method for realizing accurate search of data assets based on an NLP algorithm.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A method for realizing accurate search of data assets based on an NLP algorithm comprises the following specific process:

1. Generation of asset metadata:

2. Construction of the index:

(1) Construction of metadata indexes:

The acquisition module pulls asset metadata according to scheduled acquisition tasks. Pulling is either full or incremental: full pulling is used for newly added tables and incremental pulling for existing tables. Null values and numerical values are then removed from the content of each field, and the remaining keywords are retained as the index of the metadata;

(2) Construction of data indexes:

1) For existing tables, the synchronization/construction service pulls data incrementally; the code table/term library and the most frequently accessed keywords serve as the keyword library, and keywords are extracted from the fields; after extraction, all keywords are de-duplicated, a dictionary is generated, and the dictionary is written into the es library through the synchronization/construction service to serve as the index of the data;

2) For newly added tables, the synchronization/construction service performs a full pull; the code table/term library and the most frequently accessed keywords serve as the keyword library, and keywords are extracted from the fields; after extraction, all keywords are de-duplicated, a dictionary is generated, and the dictionary is written into the es library through the synchronization/construction service to serve as the index of the data;

3. Data asset retrieval

The NLP2SQL service receives the content to be retrieved from the input portal and generates the SQL statement for the query after passing the content through the mapping code table; the NLP2SQL service parses the natural language through lexical analysis to generate machine-executable SQL statements;

The NLP service first searches the redis cache. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, the search ends directly. The returned data result includes: the library and table where the asset data is located and the position of the field containing the keyword, the lineage (blood-relationship) mapping between the field and other libraries and tables, the mapping between the field and labels, and the mapping between the field and indexes. If the redis cache does not hit everything, the data result must be obtained from the es library according to the index; if the index is not hit, the keyword does not exist in the es library; if the index is hit once retrieval completes, the data result is returned;

The returned data result is collated and returned to the NLP2SQL front end for display as a list;

When a user queries the details of a list entry, the user must apply again to the data middle platform for the asset data, and hive is queried using presto according to the library, table, and row position of the asset data; the query results form topN hot-spot data, which is cached in redis and synchronized to the es library at set intervals.
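The topN hot-spot step can be illustrated with a minimal sketch. It assumes that "hot" is measured by how often a keyword is queried, which the text does not state explicitly; the class name `HotQueries` and its methods are hypothetical, and a plain in-memory counter stands in for redis.

```python
from collections import Counter


class HotQueries:
    """Track query frequency and expose the N most frequent keywords.

    Hypothetical helper: an in-memory Counter stands in for the redis cache.
    """

    def __init__(self, n):
        self.n = n
        self.counts = Counter()

    def record(self, keyword):
        # Called once per detail query for the given keyword.
        self.counts[keyword] += 1

    def top_n(self):
        # most_common sorts by descending count.
        return [keyword for keyword, _ in self.counts.most_common(self.n)]


if __name__ == "__main__":
    hot = HotQueries(n=2)
    for kw in ["petroleum", "pipeline", "petroleum", "refinery", "petroleum", "pipeline"]:
        hot.record(kw)
    print(hot.top_n())
```

In a deployment along the lines described, the list returned by `top_n` would be the data that is cached and periodically synchronized to the es library.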

Further, the specific process of step 1 is as follows: the asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood-relationship) data, code table mapping relationships, and label data; asset metadata is stored in the asset metadata library, and asset lineage is stored in the lineage library; newly added data in the lineage library, the asset metadata library, and the code table/term library is first cached in redis, and redis writes the cached data into the es library at fixed intervals, finally forming the asset catalog.

The beneficial effects of the invention are as follows: the invention supports natural-language search over data assets, treats data as an asset, and automates relationship maintenance, model generation, intelligent search, and the like. It provides efficient retrieval and a high hit rate for asset catalogs generated from big data and for the field lineage (blood relationships), labels, indexes, and the like within those catalogs. It also efficiently builds TopN data for frequently issued (hot) queries. Results for similar keyword queries are served quickly from the redis cache, which greatly shortens the query path and reduces the load on the relational database.

Drawings

FIG. 1 is a schematic flow chart of index construction in an embodiment of the invention;

FIG. 2 is a schematic flow chart of data retrieval in an embodiment of the invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings. It should be noted that, although this embodiment provides a detailed implementation and a specific operating process based on the present technical solution, the protection scope of the invention is not limited to this embodiment.

This embodiment provides a method for realizing accurate search of data assets based on an NLP algorithm, as shown in FIG. 1, comprising the following specific process:

1. Generation of asset metadata:

The asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood-relationship) data, code table mapping relationships, and label data. Asset metadata is stored in the asset metadata library, and asset lineage is stored in the lineage library. Industry terms are collected and stored in the code table/term library. Whenever new data appears in the lineage library, the asset metadata library, or the code table/term library, it is first cached in redis; redis writes the cached data into the es library at fixed intervals, finally forming the asset catalog.
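The cache-then-periodic-write pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the class `BufferedWriter`, its fields, and the explicit `now` argument are hypothetical, and plain Python lists stand in for redis and the es library so that the example runs without external services.

```python
class BufferedWriter:
    """Buffer new records and write them to the index store at a fixed interval.

    Hypothetical stand-ins: `buffer` plays the role of the redis cache and
    `index_store` plays the role of the es library.
    """

    def __init__(self, flush_interval_s):
        self.flush_interval_s = flush_interval_s
        self.buffer = []
        self.index_store = []
        self.last_flush = 0.0

    def add(self, record):
        # New lineage, metadata, or term data is cached first.
        self.buffer.append(record)

    def tick(self, now):
        """Flush the buffer if the interval has elapsed; return records written."""
        if now - self.last_flush < self.flush_interval_s:
            return 0
        written = len(self.buffer)
        self.index_store.extend(self.buffer)
        self.buffer.clear()
        self.last_flush = now
        return written


if __name__ == "__main__":
    writer = BufferedWriter(flush_interval_s=60)
    writer.add({"table": "t1", "kind": "metadata"})
    print(writer.tick(now=30))   # interval not yet elapsed
    print(writer.tick(now=60))   # interval elapsed, buffer is written out
```

Passing the clock in as `now` keeps the sketch deterministic; a real scheduler would supply the current time.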

2. Construction of the index:

(1) Construction of metadata indexes:

(2) Construction of data indexes:

2) For newly added tables, the synchronization/construction service performs a full pull; the code table/term library and the most frequently accessed keywords serve as the keyword library, and keywords are extracted from the fields; after extraction, all keywords are de-duplicated, a dictionary is generated, and the dictionary is written into the es library through the synchronization/construction service to serve as the index of the data.
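The two index-construction steps (removing null and numerical values for the metadata index, and extracting and de-duplicating keywords for the data index) can be sketched as below. The function names, the regular expression used to recognize numerical values, and the use of plain substring matching against the keyword library are all assumptions made for illustration; the text does not specify the matching rule or the dictionary layout.

```python
import re

# Assumption: "numerical values" means signed integers or decimals.
_NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")


def metadata_keywords(field_contents):
    """Drop null/empty and purely numeric contents; keep the rest as keywords."""
    kept = []
    for value in field_contents:
        if value is None:
            continue
        text = str(value).strip()
        if text and not _NUMERIC.fullmatch(text):
            kept.append(text)
    return kept


def build_dictionary(field_values, keyword_library):
    """Map each library keyword found in the field values to a stable id.

    Matching is plain substring search (an assumption). A keyword is added
    only once, so the resulting dictionary is already de-duplicated.
    """
    dictionary = {}
    for value in field_values:
        for keyword in keyword_library:
            if keyword in value and keyword not in dictionary:
                dictionary[keyword] = len(dictionary)
    return dictionary


if __name__ == "__main__":
    print(metadata_keywords(["crude oil", None, "", "42", "3.14", "pipeline_id"]))
    library = ["petroleum", "pipeline", "refinery"]
    rows = ["petroleum output 2021", "pipeline A petroleum", "misc"]
    print(build_dictionary(rows, library))
```

The returned dictionary is what would then be written to the es library by the synchronization/construction service.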

3. Data asset retrieval

As shown in FIG. 2, the NLP2SQL service receives the content to be retrieved (natural language) from the input portal and generates the SQL statement for the query after passing the content through the mapping code table. The SQL here covers both structured and unstructured queries: structured queries mainly go to the mapping code table library (mysql), and unstructured queries mainly go to the es index library. The NLP2SQL service parses the natural language through lexical analysis to generate machine-executable SQL statements.
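A deliberately simplified sketch of the structured branch is given below. It assumes that lexical analysis amounts to tokenizing the input and looking each token up in the mapping code table; the table contents, the names `CODE_TABLE` and `nl_to_sql`, and the `asset_catalog` table and `keyword` column are hypothetical. The `%s` placeholders follow the parameter style commonly used by Python MySQL drivers, so the terms are passed as parameters rather than concatenated into the SQL text.

```python
import re

# Hypothetical mapping code table: business term -> (table, column).
CODE_TABLE = {
    "petroleum": ("asset_catalog", "keyword"),
    "pipeline": ("asset_catalog", "keyword"),
}


def nl_to_sql(query):
    """Return (sql, params) for the known terms in the query, or None.

    Simplification: all matched terms are assumed to map to the same
    table and column as the first matched term.
    """
    tokens = re.findall(r"\w+", query.lower())
    hits = [token for token in tokens if token in CODE_TABLE]
    if not hits:
        return None
    table, column = CODE_TABLE[hits[0]]
    placeholders = ", ".join(["%s"] * len(hits))
    sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders})"
    return sql, hits


if __name__ == "__main__":
    print(nl_to_sql("Where is the petroleum pipeline data?"))
    print(nl_to_sql("hello"))
```

Only table and column names taken from the code table are interpolated into the statement; user-supplied terms stay in the parameter list.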

The NLP service first searches the redis cache. If the historical data being searched for exists in the cache, the cache is hit and the NLP service returns the hit data result; if everything is hit, the search ends directly. The returned data result includes: the position of the field where the asset data is located (e.g., petroleum in field form_post line 100), the lineage (blood-relationship) mapping between the field and other libraries and tables, the mapping between the field and labels, and the mapping between the field and indexes. If the redis cache does not hit everything, the data result must be obtained from the es library according to the index; if the index is not hit, the keyword (index) does not exist in the es library; if the index is hit once retrieval completes, the data result is returned.
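The lookup order described above (cache first, then the index for whatever the cache missed) can be sketched as follows. The function name `search` and the result layout are hypothetical, and ordinary dictionaries stand in for the redis cache and the es library.

```python
def search(keywords, cache, index):
    """Look up each keyword in the cache first, then in the index.

    Returns (results, not_found). `cache` and `index` are plain dicts
    standing in for the redis cache and the es library.
    """
    results = {}
    missing = []
    for keyword in keywords:
        if keyword in cache:
            results[keyword] = cache[keyword]   # cache hit
        else:
            missing.append(keyword)
    if not missing:
        return results, []                      # everything hit: stop here
    not_found = []
    for keyword in missing:
        if keyword in index:
            results[keyword] = index[keyword]   # index hit
        else:
            not_found.append(keyword)           # keyword absent from the index
    return results, not_found


if __name__ == "__main__":
    cache = {"petroleum": {"table": "form_post", "line": 100}}
    index = {"pipeline": {"table": "form_pipe", "line": 7}}
    print(search(["petroleum", "pipeline", "refinery"], cache, index))
```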

Various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims

1. A method for realizing accurate search of data assets based on an NLP algorithm, characterized in that the specific process is as follows:

1. Generation of asset metadata:

the asset management module generates asset metadata on a schedule, together with the corresponding asset lineage (blood-relationship) data, code table mapping relationships, and label data; asset metadata is stored in the asset metadata library, and asset lineage is stored in the lineage library; newly added data in the lineage library, the asset metadata library, and the code table/term library is first cached in redis, and redis writes the cached data into the es library at fixed intervals, finally forming the asset catalog;

2. Construction of the index:

(1) Construction of metadata indexes:

(2) Construction of data indexes:

1) For existing tables, the synchronization/construction service pulls data incrementally; the code table/term library and the most frequently accessed keywords serve as the keyword library, and keywords are extracted from the fields; after extraction, all keywords are de-duplicated, a dictionary is generated, and the dictionary is written into the es library through the synchronization/construction service to serve as the index of the data;

2) For newly added tables, the synchronization/construction service performs a full pull; the code table/term library and the most frequently accessed keywords serve as the keyword library, and keywords are extracted from the fields; after extraction, all keywords are de-duplicated, a dictionary is generated, and the dictionary is written into the es library through the synchronization/construction service to serve as the index of the data;

3. Data asset retrieval

The NLP2SQL service receives the content to be retrieved from the input portal and generates the SQL statement for the query after passing the content through the mapping code table; the NLP2SQL service parses the natural language through lexical analysis to generate machine-executable SQL statements;