CN112183110A

CN112183110A - Artificial intelligence data application system and application method based on data center

Info

Publication number: CN112183110A
Application number: CN202011037663.2A
Authority: CN
Inventors: 范馨月; 沈齐; 何清龙; 韩云杰; 杜逆索; 祖兴水
Original assignee: Guizhou Yunteng Zhiyuan Technology Development Co ltd
Current assignee: Guizhou Yunteng Zhiyuan Technology Development Co ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2021-01-05

Abstract

The invention discloses an artificial intelligence data application system based on a data center, and an application method thereof. The system comprises: a normalization module for normalizing the data center to obtain a normalized data center; a semantic extraction module for performing semantic extraction on a natural sentence carrying request information to obtain semantic structural information carrying the request information; and a request processing module for constructing an escape dictionary for escaped interpretation of industry vocabulary, and for completing request processing against the normalized data center based on the semantic structural information and returning a request result. The invention is efficient, returns accurate query results, and is simple and convenient to use.

Description

Artificial intelligence data application system and application method based on data center

Technical Field

The invention relates to the technical field of data query statistics, in particular to an artificial intelligence data application system and an application method based on a data center.

Background

With the continuous development of artificial intelligence technology, intelligent dialogue systems have improved greatly and have attracted wide attention and application in industry. Developing a dialogue system, however, remains difficult for most developers because of its high technical and data requirements. Baidu has opened up its years of natural language understanding and interaction technology to the public and provides UNIT, an intelligent dialogue customization and service platform. However, such a dialogue system relies on its own corpus for a particular industry and on relatively fixed templates; it cannot manage data or compute statistics over it, the questions it can answer are very limited, and it cannot be generalized or extended. A large-scale data center typically holds multiple databases and multiple tables, so the data volume is large, the data content is complex, and retrieval speed is difficult to improve. To improve retrieval speed, the common approach today is to build a series of data storage frameworks, which consumes a large amount of server resources and requires the data to be stored in full; selective extraction with automatic judgment and automatic storage of data is rarely performed, so many service resources are wasted. Moreover, for a large-scale data center, determining which tables a statistical query needs requires professionals who understand both the business and the table structures.

Many domestic organizations and companies face a high threshold for using their data centers: customized development platforms are inconvenient to use, diversified statistics are difficult to realize through such platforms, and, in particular, when a data center has many data sources, users who are not professional maintainers find it difficult to sort out the relationships among the data. Developing an intelligent query application system for data centers is therefore an urgent need in the current application industry.

Disclosure of Invention

The invention aims to provide an artificial intelligence data application system and an application method based on a data center. The invention has the characteristics of high efficiency, high accuracy of query return results and simple and convenient use.

The technical scheme of the invention is as follows: an artificial intelligence data application system based on a data center comprises:

a normalization module, used for normalizing the data center to obtain a normalized data center;

a semantic extraction module, used for performing semantic extraction on a natural sentence carrying request information to obtain semantic structural information carrying the request information;

a request processing module, used for constructing an escape dictionary for escaped interpretation of industry vocabulary, and for completing request processing against the normalized data center based on the semantic structural information and returning a request result;

wherein the normalization processing normalizes the multi-dimensional, multi-database data tables of the data center so as to simplify the data content and clarify the data structure of the data center, which facilitates rapid adaptation, matching and response to requests.

The aforementioned data center-based artificial intelligence data application system further includes:

a report output module, used for receiving the request result returned by the request processing module and outputting a visual report.

In the artificial intelligence data application system based on the data center, the normalization process specifically includes:

performing data directory retrieval on the data storage path of the data center through a data dictionary; performing adaptive data judgment on the retrieval-related content, the retrieval data volume and the retrieval time expressed by the request; extracting the retrieved data judged suitable for local data storage; establishing a customized data extraction task; and generating a data storage directory.

In the artificial intelligence data application system based on the data center, the semantic extraction is specifically as follows:

word segmentation: performing word segmentation, part-of-speech tagging and named entity recognition on the natural sentence to obtain vocabulary units;

reduction: using the parts of speech of the vocabulary units, the modification relations among the vocabulary units and syntax generation principles, combined with prior information about the application scenario, to form coarse-grained sentence units and tag their parts of speech, obtaining a concise syntax after reduction;

semantic dependency analysis: constructing a semantic dependency graph of the concise syntax, and analyzing the dependency relationship information between the root node and the arguments in the semantic dependency graph;

unstructured semantic extraction: performing a breadth-first traversal of the semantic dependency graph based on the dependency relationship information among the arguments, and quickly extracting the semantic structural information of the concise syntax.

In the aforementioned data center-based artificial intelligence data application system, the escape dictionary is constructed as follows:

segmenting the industry vocabulary into words and tagging their parts of speech; matching in the escape dictionary according to part of speech, and after matching, replacing the corresponding vocabulary in the original escape dictionary with the escaped word;

computing the word sense similarity Sim, identifying words whose similarity is higher than 90% as synonyms, and adding the identified synonyms to the escape dictionary.

In the above artificial intelligence data application system based on a data center, the request processing specifically includes:

matching field information corresponding to the semantic structural information in the normalized data center, storing and recording the field information and establishing an index;

generating a directed graph which is formed by taking a table and a keyword as nodes and taking an ordered pair as an edge;

and constructing a weight matrix based on the directed graph, evaluating the tables and fields with an optimized PageRank algorithm, returning the obtained request result, and then generating, according to the database type, SQL (Structured Query Language) corresponding to the semantic structural information.

The application method of the artificial intelligence data application system comprises the following steps:

S1, after a natural sentence carrying request information is received, the semantic extraction module performs semantic extraction on the natural sentence to obtain semantic structural information carrying the request information;

S2, after receiving the semantic structural information extracted by the semantic extraction module, the request processing module performs request processing in the normalized data center and returns a request result.

In step S1 of the application method of the artificial intelligence data application system, when the semantic structural information of the current request is a supplementary description of the previous request, the previous request result is retrieved and combined with the current request result.

In the request processing in step S2 of the application method of the artificial intelligence data application system, when matching of the semantic structural information in the normalized data center fails, a guidance statement explaining the reason for the matching failure is generated to guide the user to supplement and refine the request information of the current request accordingly.

Advantageous effects: compared with the prior art, the invention has the following advantages:

The invention lowers the threshold for using different data centers and supports query and statistics over data centers in different fields and industries. Data processing and analysis are carried out automatically through human-machine communication, so that users can skip the complicated big data processing and analysis process and concentrate on applying the data. This effectively improves working efficiency and reduces both the difficulty of data application research and development and the professional requirements placed on developers.

To address the large data volume, low retrieval speed and high server resource consumption of conventional data centers, the invention provides a normalization module that normalizes the data center. Specifically, it performs data directory retrieval on the data storage path of the data center through a data dictionary; performs adaptive data judgment on the retrieval-related content, the retrieval data volume and the retrieval time expressed by the request; extracts the retrieved data judged suitable for local data storage; establishes a customized data extraction task; and generates a data storage directory. In a data center processed by the normalization module, the data volume and the retrieved content can be judged by partition and by segment, so that different data storage carriers are selected, which ultimately speeds up retrieval over multi-table data and saves data storage resources.

When the invention is used for query and statistics over a large-scale data center, the user does not need to know the table structures in the data center; analysis, query and statistics tasks are executed automatically once the data dictionary is provided. This greatly reduces the professional requirements on the user. The invention can also generate Structured Query Language (SQL) queries for different databases, and is simple and convenient to use.

The invention adds a self-learning escape dictionary for different industries and can learn the professional vocabulary of each, so that query results are more industry-specific and more practical.

The semantic extraction module extracts the semantics of the natural sentence, which effectively improves query precision and greatly reduces noise in the query results.

The invention self-learns the meanings of industry vocabulary in the escape lexicon, which avoids query results that lack industry relevance because the same word has different meanings in different industries, and thereby improves the fit between the query results and the user's query needs.

In conclusion, the invention has the characteristics of high efficiency, high accuracy of query return results and simple and convenient use.

Drawings

FIG. 1 is a scheduling flow diagram of the present invention;

FIG. 2 is a flow chart of the scheduling of the normalization module of the present invention;

FIG. 3 is a flow chart of the scheduling of the request processing module of the present invention.

Detailed Description

The invention is further illustrated by the following figures and examples, which are not to be construed as limiting the invention.

Example 1. A data center-based artificial intelligence data application system, comprising:

the semantic extraction module is used for performing semantic extraction on the natural sentence carrying the request information to obtain semantic structural information carrying the request information (the request information comprises user intention);

The artificial intelligence data application system based on the data center further comprises a report output module, which is used for receiving the request result returned by the request processing module and outputting a visual report.

The foregoing normalization process is specifically as follows:

The semantic extraction is specifically as follows:

Unstructured semantics refers to the semantic information that natural language is meant to express; semantic structuring processes the meaning expressed in natural language into another, formatted language with the same meaning (the language here is a non-natural language, and is a broader notion than a language that a computer can recognize).

Unstructured semantic extraction is the extraction of the semantic structural information of a sentence; the difficulty lies in accurately identifying the relationships among, and the roles of, the components of the sentence. When the sentence structure is complex and the sentence pattern varies, it is difficult to extract structured information through a syntactic dependency tree. Therefore, during semantic extraction the invention first performs word segmentation, then uses the parts of speech of the units, the modification relations among the units and syntax generation principles, combined with prior information about the application scenario, to form coarse-grained sentence units and tag them, obtaining a concise syntactic structure after reduction. For example, for 'driving record', conventional word segmentation and part-of-speech tagging give 'driving/NN' and 'record/NN'. In this phrase the noun 'driving' modifies the noun 'record', and by syntax generation principles the two form a noun phrase with a specific meaning, so they are reduced to 'driving record/NN'. As another example, the entity noun 'G210 New Column Village Committee road segment' is conventionally segmented and tagged as 'G210/NR', 'New Column/NR', 'Village Committee/NN' and 'road segment/NN'. This granularity is clearly too fine and does not fit the application scenario; using the prior information of the scenario, the phrase can be reduced to 'G210 New Column Village Committee road segment/NR'. Reduction not only decreases the number of sentence units but, most importantly, simplifies the sentence structure.
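The reduction step in the 'driving record' example can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes that every run of adjacent nominal tokens (tags NN or NR) is merged into one unit, and that a run containing a proper noun (NR) is tagged NR. The function name `reduce_tokens` and the tag set are hypothetical, and the scenario-specific prior information is not modeled.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumption: only these tags may be merged into one coarse-grained unit.
NOMINAL_TAGS = {"NN", "NR"}


def reduce_tokens(tokens: List[Token]) -> List[Token]:
    """Merge each run of adjacent nominal tokens into one sentence unit."""
    reduced: List[Token] = []
    run: List[Token] = []

    def flush() -> None:
        # Emit the pending run of nominal tokens as a single unit.
        if not run:
            return
        words = " ".join(word for word, _ in run)
        # Assumption: a run that contains a proper noun is a named entity.
        tag = "NR" if any(t == "NR" for _, t in run) else "NN"
        reduced.append((words, tag))
        run.clear()

    for word, tag in tokens:
        if tag in NOMINAL_TAGS:
            run.append((word, tag))
        else:
            flush()
            reduced.append((word, tag))
    flush()
    return reduced


units = reduce_tokens([("query", "VV"), ("driving", "NN"), ("record", "NN")])
print(units)
```

A real system would take the tokens from a segmenter and tagger and would decide which runs to merge from the modification relations, not from adjacency alone.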

Finally, the concise structured key information of the reduced sentence is obtained using the semantic dependency graph. The key to semantic dependency graph analysis is to determine accurately the dependency relationship between the root node and the arguments, that is, between the sentence predicate and its arguments. The relationships between predicate and arguments, and among the arguments, directly determine the semantic information of the sentence. When the sentence structure is complex, the semantic dependency graph is particularly complex, and the crux of sentence semantic analysis is how to traverse the graph to obtain the relationship between the predicate and each argument and thereby the semantic structural key information of the sentence. The dependency graph is therefore traversed breadth-first based on the dependency information among the arguments, and the semantic structural information of the sentence is extracted quickly.
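A minimal sketch of this traversal follows, reading the traversal described above as a breadth-first traversal starting at the root predicate. The graph representation (a dictionary from head word to labeled dependents), the relation labels and the function name `extract_structure` are all assumptions made for illustration; the text does not specify them.

```python
from collections import deque
from typing import Dict, List, Tuple

# head word -> list of (relation label, dependent word); labels are hypothetical
DependencyGraph = Dict[str, List[Tuple[str, str]]]


def extract_structure(graph: DependencyGraph, root: str) -> List[Tuple[str, str, str]]:
    """Walk the graph breadth-first from the root, collecting (head, relation, dependent)."""
    triples: List[Tuple[str, str, str]] = []
    visited = {root}
    queue = deque([root])
    while queue:
        head = queue.popleft()
        for relation, dependent in graph.get(head, []):
            triples.append((head, relation, dependent))
            # A dependency graph, unlike a tree, may reach a node twice.
            if dependent not in visited:
                visited.add(dependent)
                queue.append(dependent)
    return triples


graph = {
    "count": [("Agt", "user"), ("Pat", "driving record")],
    "driving record": [("Loc", "G210 road segment"), ("Time", "yesterday")],
}
triples = extract_structure(graph, "count")
for triple in triples:
    print(triple)
```

Because the walk is breadth-first, the predicate's direct arguments are emitted before the modifiers of those arguments, which keeps the predicate-argument relations at the front of the result.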

The aforementioned escape dictionary is constructed as follows:

The foregoing request processing is specifically as follows:

generating a directed graph which is formed by taking a table and a keyword containing semantic structural information as nodes and taking an ordered pair as an edge;

Referring to FIG. 1, Ti (i = 1, 2, 3, ...) denotes an execution step of the process; A, B and C denote the normalization module, the semantic extraction module and the request processing module, respectively; and N denotes a scheduled external program awaiting a response. Specifically:

S1, after a natural sentence carrying request information is received, the semantic extraction module performs semantic extraction on the natural sentence to obtain semantic structural information carrying the request information; see T1-T4 of FIG. 1;

S2, after receiving the semantic structural information extracted by the semantic extraction module, the request processing module performs request processing in the normalized data center and returns a request result; see T5-T7 of FIG. 1.

In step S1, when the semantic structural information of the current request is a supplementary description of the previous request, the previous request result is retrieved and combined with the current request result. See T3-T4 of FIG. 1.

In the request processing in step S2, when matching of the semantic structural information in the normalized data center fails, a guidance statement explaining the reason for the matching failure is generated to guide the user to supplement and refine the request information of the current request accordingly. See T7-2 of FIG. 1.

Example 2. In Example 1, the normalization module normalizes the data center through the following steps (see FIG. 2):

a. After a query request (i.e., request information) is received, a data query is performed on the request. The data query determines: whether the query request has been requested before; the processing time of the previously requested query; whether the related request data is stored locally; and whether the related request data resides in a local database or a local search engine;

After the data processing receiving service receives the query request, it first performs a data query on the related data request. This query does not concern the content of the data being queried; it determines whether the query request has been requested before, the processing time of the query, whether the related request data is stored locally, and whether it resides in a database or a search engine. Once this judgment is complete, the next step of data processing is determined; if the related request data is in a third-party database, the data is fetched directly from the third-party database. In the initial stage of the whole process, data is generally retrieved from the third-party database and cached, and the local data content is progressively built up. In this way the retrieval process is continuously optimized and retrieval speed is improved.

b. The query request is split, and the related SQL statement query is logged, recording the full data query time state, including the query statement, the specific query time, the related SQL, and whether the SQL statement was optimized during the query.

c. For a first-time query request, the related request data is obtained directly from the third-party database, and the updated data content is synchronized to the data query record so that the data location can be retrieved easily;

When the query is a first-time query and must go through the third-party database, the result is sometimes slow to return. In that case a data query interface is returned while waiting, the query content is temporarily held in the cache, the returned query data is added to the data cache, and the updated data content is synchronized to the data query record so that the data location can be retrieved easily.

d. The tables of the third-party database are scanned automatically, and data with strong relevance and large volume (judged, for example, by data storage volume and related data content) is synchronized automatically and placed on the data search engine server to provide data retrieval service; only large-volume data is stored in the local data store and the data search engine.

e. After the update synchronization, the data operation records are maintained in the data dictionary table and the related data operation records are updated; at the next data service update the data is maintained and updated again, so that data storage locations can be located quickly and the data operation records stay consistent with the data content.

Through the processing of the steps c to e, the invention can form a uniform data retrieval catalog according to the database resources.

In step a, after the query request is received, the local cache is queried first; when the related request data already exists in the cache, it is returned directly. A user data query is a retrieval request from a third-party service for multi-dimensional data tables and multi-dimensional data content, which the user issues as needed. After the request service receives the query request, it first checks the cached results of recently queried data; if the related request data is in the cache, the result is returned to the user directly from the cache, which greatly improves retrieval speed. If the related request data is not in the local cache, the system determines whether it resides locally or in a third-party database.
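The lookup order described for step a (cache, then local storage, then the third-party database) can be sketched as below. The class `QueryRouter`, its in-memory dictionaries and the `third_party` callable are hypothetical stand-ins for the real cache, local database and third-party database; the sketch only illustrates the routing order and the caching of first-time results.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[object]


class QueryRouter:
    """Answer a request from the cache first, then local storage, then the third party."""

    def __init__(self, local: Dict[str, Rows], third_party: Callable[[str], Rows]) -> None:
        self.cache: Dict[str, Rows] = {}
        self.local = local              # stand-in for the local database
        self.third_party = third_party  # stand-in for a third-party database call

    def query(self, request: str) -> Tuple[str, Rows]:
        """Return (source, rows), where source names the store that answered."""
        if request in self.cache:
            return "cache", self.cache[request]
        if request in self.local:
            source, rows = "local", self.local[request]
        else:
            source, rows = "third_party", self.third_party(request)
        # Cache the result so that a repeated request is served from the cache.
        self.cache[request] = rows
        return source, rows


router = QueryRouter(local={"a": [1]}, third_party=lambda request: [request.upper()])
first = router.query("b")
second = router.query("b")
print(first, second)
```

Here the first request for "b" goes to the third-party stand-in and the second is served from the cache.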

In step c, for a first-time query request, the query request and the related request data are temporarily stored in the cache so that a repeated query can be answered directly from the cache; the target database for the query, which may be a local database or a third-party database, is located according to the data dictionary directory.

In step c, for a first-time query request, the related query SQL statement is automatically optimized for the database.

In step c, the optimized SQL statement is executed again; when the related request data is still slow to return, the data volume of the database is checked, and a data table synchronization operation is performed on the database whose query is slow;

specifically, a standard SQL query is run according to the query feedback times in the data query record: the standard SQL data query is executed 3 times, and if the mean query time exceeds 5 seconds, the data is synchronized automatically (the data extraction threshold judgment method). The related data table is judged automatically; if the data extraction condition is met, a data extraction task is established for the related data content, ensuring that data from the third-party database can be extracted incrementally into the local database. For conditions whose extraction task has already been established, judgment and extraction are performed again during automatic retrieval. This effectively guarantees real-time data synchronization and decides whether data is extracted from the third-party database into a local data table.
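The data extraction threshold judgment can be illustrated with a short sketch. The run count of 3 and the 5-second threshold come from the text; the function names and the `run_query` callable, which stands in for executing the standard SQL query against the third-party database, are hypothetical.

```python
import time
from statistics import mean
from typing import Callable, List

RUNS = 3                 # the text executes the standard query 3 times
THRESHOLD_SECONDS = 5.0  # the text synchronizes when the mean exceeds 5 seconds


def time_query(run_query: Callable[[], object], runs: int = RUNS) -> List[float]:
    """Execute `run_query` `runs` times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def needs_sync(durations: List[float], threshold: float = THRESHOLD_SECONDS) -> bool:
    """Return True when the mean query time exceeds the threshold."""
    return mean(durations) > threshold


# Durations as they might be read back from the data query record (made-up values).
print(needs_sync([6.0, 5.5, 4.0]))
print(needs_sync([1.2, 0.9, 1.1]))
```

A table for which `needs_sync` returns True would then get an incremental extraction task into the local database, as described above.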

Through the data query of step a, the SQL statement optimization of step c, and the re-execution of the optimized SQL statement, a multi-database global data mapping of the multi-dimensional, multi-database data tables is realized, making the data structure better organized and clearer and facilitating rapid matching of the user's retrieval target.

For the third-party database, if the query runs quickly, automatic database optimization of the related query SQL statement is unnecessary; if the query is slow, the query speed of the third-party database can be effectively improved by optimizing the query SQL statement and/or re-executing the optimized SQL statement. In this way, the resulting data retrieval catalog supports rapid retrieval across multiple databases and multiple tables.

In step c, for a data table whose query is slow, the data content of the related table is queried and a data synchronization operation is performed on the table: a data directory is created automatically from the table header and data structure, the data is synchronized automatically, and the data storage location state is updated; the next query for the data proceeds from step a.

Example 3. In Example 1, the working steps of the request processing module are shown in FIG. 3, where Ci (i = 1, 2, 3, ...) denotes the i-th step of the request processing module's workflow, i.e., steps a, b, ... below; A denotes the normalization module, B denotes the semantic extraction module, N denotes that no parameters need to be received, and 2s denotes the longest allowable waiting time. The specific steps are as follows:

a. receiving the normalized data information, which may be a JSON file;

b. receiving the semantic structural information extracted by the semantic extraction module, and generating user requirement keywords or entities after semantic analysis; the user requirement keywords or entities are denoted $K_1, K_2, \ldots, K_n$;

c. constructing an escape dictionary with a self-learning function for escaped interpretation of industry vocabulary. For example, 'track' is escaped to 'license plate': to count vehicle trajectories, a computer cannot make this substitution the way a person can; the data center has no field resembling 'track', and the analysis task must be completed in the vehicle passing table, which carries license plate information, so the computer needs an automatic learning capability. Building the escape dictionary requires running the analysis task of the invention multiple times;

d. matching the data information of step a against the user requirement keywords or entities of step b, their synonyms, and/or the synonyms of the escaped keywords; storing and recording the matched vocabulary (namely the user requirement keywords or entities, their synonyms, and/or the synonyms of the escaped keywords) together with the corresponding field information, and establishing an index; recording the similarity of the matched words as $K_{i,j}^{Sim}$ ($i = 1, 2, 3, \ldots, n$; $j = 1, 2, \ldots, m$), where i indexes the keywords or entities, j indexes the tables containing the matched fields, n is the number of keywords, and m is the number of matched tables among the data information tables.

e. generating a directed graph which is formed by taking tables and keywords as nodes and ordered pairs as edges;

f. constructing a weight matrix based on the directed graph, evaluating the tables and fields with an optimized PageRank algorithm and recommending them to the user, and then generating SQL corresponding to the matched vocabulary according to the database type.

Specifically, in step a, the data information is one or more data information tables composed of all the databases, tables, fields, Chinese data dictionary, data quality (quantized as a percentage), field usage frequency, etc. of the data center.

Specifically, in step c, the basic lexicon of the escape dictionary may be based on the Wikipedia lexicon; the escape dictionary is constructed as follows:

c1. segmenting the industry vocabulary into words and tagging their parts of speech; matching in the escape dictionary according to part of speech, and after matching, replacing the corresponding vocabulary in the original escape dictionary with the escaped word; the parts of speech include nouns, verbs, place words and the like; part-of-speech tagging can use an existing Python package such as jieba or HanLP;

c2. computing the word sense similarity Sim, identifying words whose similarity is higher than 90% as synonyms, and adding the identified synonyms to the escape dictionary; the lexicon used may be based on the extended edition of the Harbin Institute of Technology (HIT) synonym thesaurus Tongyici Cilin. This avoids the influence of adjectives, adverbs and the like within an entity on the subsequent recommendation algorithm. For example, for the entity 'uploading time', 'time/n' may also be expressed as a date; similarity calculation finds the word 'date', which is similar to 'time' (Sim(x, y) > 90%), and 'uploading' and 'date' are then recombined into 'uploading date' to realize semantic concatenation; a similar-word vector is constructed and the similarity value of the word vector is recorded.

In step c1, when the matching fails, the reason for the failure is returned and the interpretation is manually added to the escape dictionary.

In step c2, the method for identifying synonyms is specifically: identifying words with similarity value Sim > 90% by calculating the word sense similarity Sim, constructing synonym vectors according to semantic concatenation, and recording the similarity values of the word vectors.
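The synonym identification and semantic concatenation of step c2 can be sketched as follows, using the 'uploading time' example. The similarity function is an assumption: the text does not define how Sim is computed, so the sketch takes it as a parameter and uses a small lookup table with made-up scores. The names `find_synonyms` and `expand_entity` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

SIM_THRESHOLD = 0.90  # the text treats words with Sim > 90% as synonyms

SimFn = Callable[[str, str], float]


def find_synonyms(word: str, candidates: Iterable[str], sim: SimFn,
                  threshold: float = SIM_THRESHOLD) -> List[Tuple[str, float]]:
    """Return (candidate, similarity) for candidates scoring above the threshold."""
    synonyms = []
    for candidate in candidates:
        if candidate == word:
            continue
        score = sim(word, candidate)
        if score > threshold:
            synonyms.append((candidate, score))
    return synonyms


def expand_entity(modifier: str, head: str, candidates: Iterable[str],
                  sim: SimFn) -> Dict[str, float]:
    """Semantic concatenation: recombine the modifier with each synonym of the head."""
    return {f"{modifier} {synonym}": score
            for synonym, score in find_synonyms(head, candidates, sim)}


# Made-up similarity scores, for illustration only.
SCORES = {("time", "date"): 0.93, ("time", "place"): 0.20}


def toy_sim(a: str, b: str) -> float:
    return SCORES.get((a, b), 0.0)


expanded = expand_entity("uploading", "time", ["date", "place", "time"], toy_sim)
print(expanded)
```

The recorded score plays the role of the similarity value kept with the word vector, and can be used later when matched fields are ranked.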

Specifically, in step d, if the matching fails, information is returned to the user so that an interpretation can be added to the escape dictionary.

Specifically, in the foregoing step e, the directed graph is generated as follows:

the directed graph is denoted D = <V, E>; wherein,

the vertex set is V = {K_i, T_l}, where K_i is the i-th user requirement keyword or entity, i = 1, …, n, and T_l denotes the l-th data information table, l = 1, …, m;

the edge set is defined as E = {<K_i, T_l, r>, <T_l, K_i> | i = 1, 2, 3, …, n, l = 1, …, m, r is the data information table associated with the i-th user requirement keyword or entity};

when a user requirement keyword or entity corresponds to several fields of the same data information table, the field with the largest similarity is taken, and Σ r = m.
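
The rule of keeping only the most similar field per keyword and table can be sketched as follows. The record layout `(keyword, table, field, similarity)`, the function name `build_edges` and the sample data are assumptions made for the illustration, not part of the invention.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical match record: (keyword, table, field, similarity)
Match = Tuple[str, str, str, float]


def build_edges(matches: List[Match]) -> Dict[Tuple[str, str], Tuple[str, float]]:
    """Keep one edge per (keyword, table) pair: the field with the
    largest similarity, as required when several fields match."""
    best: Dict[Tuple[str, str], Tuple[str, float]] = {}
    for keyword, table, field, sim in matches:
        key = (keyword, table)
        if key not in best or sim > best[key][1]:
            best[key] = (field, sim)
    return best


def vertices(edges: Dict[Tuple[str, str], Tuple[str, float]]) -> Set[str]:
    """Vertex set V: every keyword K_i and every table T_l that has an edge."""
    return {name for pair in edges for name in pair}


sample = [
    ("K1", "T1", "upload_date", 0.93),
    ("K1", "T1", "create_time", 0.91),  # same table, lower similarity: dropped
    ("K1", "T2", "date", 0.95),
    ("K2", "T1", "department", 1.00),
]
edges = build_edges(sample)
print(edges)
print(sorted(vertices(edges)))
```

Each remaining pair stands for the edge from K_i to T_l; the reverse edge from T_l to K_i can be derived from the same dictionary.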

Specifically, the foregoing step f is specifically as follows:

permissions are not always open across all departments of a large-scale data center, so users without permission cannot operate on certain data;

K_ij^p denotes the permission on the j-th data information table corresponding to the i-th user requirement keyword or entity, where i = 1, 2, 3, …, n and j = n+1, n+2, …, n+m;

K_ij^f is the usage frequency of the field; fields that are used more frequently should be recommended preferentially;

K_ij^q is the data quality of the field, quantified as a percentage;

K_ij^Sim is the similarity of the matched word, where i = 1, 2, 3, …, n and j = 1, 2, …, m; here i denotes the i-th user requirement keyword or entity, j denotes the matched field in the j-th data information table, n is the number of keywords, and m is the number of matched data information tables;

defining the weights [formula not reproduced in this text];

constructing an initial weight matrix [matrix not reproduced in this text];

calculating the adjacency matrix A of the directed graph D = <V, E>, normalizing A by rows, and denoting the normalized matrix as A'; the final weight matrix is

M＝Q·(A')^T (2)
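
The row normalization and the product in formula (2) can be sketched as follows. The entries of the initial weight matrix are not given in this text, so the matrix `Q` below is a made-up placeholder of conformable size, and reading Q as the initial weight matrix is itself an assumption; the small graph (one keyword K_1 linked to two tables T_1 and T_2) is likewise invented for the example.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(a: Matrix) -> Matrix:
    """Divide every row by its sum; all-zero rows are left as zeros."""
    out = []
    for row in a:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product x · y."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


# Vertex order: K_1, T_1, T_2. K_1 points to both tables, each table points back.
A = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
# Hypothetical initial weight matrix (values invented for illustration).
Q = [
    [1.0, 0.0, 0.0],
    [0.0, 0.8, 0.0],
    [0.0, 0.0, 0.6],
]

A_norm = row_normalize(A)          # A'
M = matmul(Q, transpose(A_norm))   # M = Q · (A')^T
print(A_norm)
print(M)
```

After normalization every non-zero row of A' sums to 1, so the out-going weight of each vertex is shared equally among its neighbours before Q is applied.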

in a data center, the importance of a data information table is determined by its IPR value, where IPR^{k+1} denotes the IPR value after the k-th iteration [iteration formula not reproduced in this text];

I(T_l) is the set of keywords associated with T_l, and Δ⁺(K_j) is the out-degree of the keyword K_j associated with T_l; the initial value is [not reproduced in this text];

typically β = 0.85;

the stable value IPR^*(T_l) reached after iteration is the evaluation value of each data information table;

the values IPR^*(T_l) of the tables T_l, l = 1, 2, …, m, are sorted; if the table T_j with the maximum value has an edge <K_i, T_j> for every keyword K_i (i = 1, …, n), the statistics and query task can be completed within the data information table T_j, and the SQL corresponding to the keywords is generated; if a single data information table cannot complete the query and statistical analysis task, tables are selected in order of their IPR^* values and the SQL is completed by splicing.
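
The table selection described above can be sketched as follows. The function name `select_tables`, the IPR^* values and the keyword coverage sets are assumptions made for the illustration; the sketch also assumes that tables are taken in descending order of IPR^* and that a table contributing no new keyword is skipped. The SQL generation itself is not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_tables(
    ipr: Dict[str, float],
    covers: Dict[str, Set[str]],
    keywords: Iterable[str],
) -> Tuple[List[str], Set[str]]:
    """Pick tables in descending IPR* order until every keyword is covered.

    Returns the chosen tables and any keywords left uncovered."""
    ranked = sorted(ipr, key=ipr.get, reverse=True)
    needed = set(keywords)
    chosen: List[str] = []
    for table in ranked:
        if not needed:
            break
        gained = covers.get(table, set()) & needed
        if gained:
            chosen.append(table)
            needed -= gained
    return chosen, needed


# Invented stable IPR* values and keyword coverage (edges <K_i, T_j>).
ipr_star = {"T1": 0.5, "T2": 0.3, "T3": 0.2}
coverage = {"T1": {"K1", "K2"}, "T2": {"K2"}, "T3": {"K3"}}

tables, missing = select_tables(ipr_star, coverage, ["K1", "K2", "K3"])
print(tables, missing)
```

When the top-ranked table already covers every keyword, the returned list contains that table alone, which corresponds to the single-table case; otherwise the returned tables are the ones whose SQL fragments would be spliced.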

Claims

1. An artificial intelligence data application system based on a data center is characterized by comprising

2. The data center-based artificial intelligence data application system of claim 1, further comprising:

3. The data center-based artificial intelligence data application system of claim 1, wherein the normalization process is specifically as follows:

4. The data center-based artificial intelligence data application system of claim 1, wherein the semantic extraction is specifically as follows:

5. The data center-based artificial intelligence data application system of claim 1, wherein the escape dictionary is constructed as follows:

6. The data center-based artificial intelligence data application system of claim 1, wherein the request processing is specifically as follows:

7. An application method of an artificial intelligence data application system based on a data center is characterized by comprising the following steps:

8. The method as claimed in claim 7, wherein in step S1, when the semantic structural information of the current request is a supplementary description of the previous request, the previous request result is retrieved and combined with the current request result.

9. The method as claimed in claim 7, wherein in the request processing of step S2, when the semantic structural information fails to match in the normalized data center, a guidance statement for the reason of the failure in matching is generated to guide the user to supplement and refine the requested information of the current request.