CN111104422A

CN111104422A - Training method, device, equipment and storage medium of data recommendation model

Info

Publication number: CN111104422A
Application number: CN201911264372.4A
Authority: CN
Inventors: 陈旭; 王磊; 叶淑强; 李冬冬; 刘明明; 董晃
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-05-05
Anticipated expiration: 2039-12-10
Also published as: CN111104422B

Abstract

The invention provides a training method, a training device, equipment and a storage medium of a data recommendation model, and relates to the technical field of model training. The method comprises the following steps: obtaining the description of the service requirement in the application knowledge base of the data warehouse and the information of the data table used by the service requirement; determining input variables of a preset algorithm according to the description of the service requirement and preset service terms; determining an output variable of the preset algorithm according to the information of the data table used by the service requirement; and training by adopting the preset algorithm according to the input variable and the output variable to obtain the data recommendation model. By applying the embodiment of the invention, the data recommendation model has historical experience, the text content can be understood semantically, and the amount of search results can be reduced and the accuracy of the search results can be improved by using the recommendation model.

Description

Training method, device, equipment and storage medium of data recommendation model

Technical Field

The invention relates to the technical field of model training, in particular to a training method, a training device, training equipment and a storage medium of a data recommendation model.

Background

With the development of the internet, the number of data tables in data warehouses keeps growing. This is especially true of enterprise-level data warehouses: one of their core functions is to integrate cross-system data across the whole enterprise, and their data sources include the enterprise's various source systems, so their data tables are usually numerous and cover complicated business, particularly in large enterprises. How to recommend data tables in the data warehouse to a user according to the user's search content has therefore become a current research focus.

Currently, search results are produced by calculating the similarity between the search keyword and the metadata of the data warehouse: data tables with higher similarity to the search keyword are ranked first, and those with lower similarity are ranked last. The similarity calculation methods include edit distance similarity, cosine similarity, Euclidean distance and the like.

However, matching by similarity score alone matches data tables only at the level of plain-text resemblance, which yields too many search results and poor accuracy.

Disclosure of Invention

The present invention aims to provide a training method, apparatus, device and storage medium for a data recommendation model, so that using the recommendation model reduces the number of search results and improves their accuracy.

In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:

in a first aspect, an embodiment of the present invention provides a method for training a data recommendation model, where the method includes:

obtaining the description of the service requirement in the application knowledge base of the data warehouse and the information of the data table used by the service requirement;

determining input variables of a preset algorithm according to the description of the service requirement and preset service terms;

determining an output variable of the preset algorithm according to the information of the data table used by the service requirement;

and training by adopting the preset algorithm according to the input variable and the output variable to obtain the data recommendation model.

Further, the data recommendation model includes a first recommendation model, and determining the input variables of the preset algorithm according to the description of the service requirement and the preset service terms includes:

performing word segmentation processing on the description of the service requirement to obtain a plurality of word segmentation words, and respectively determining the maximum value of the similarity of each word segmentation word and the service term in each data table of the data warehouse;

determining the plurality of word segmentation words with the greatest similarity in each data table of the data warehouse, and the types of the service terms corresponding to those word segmentation words in the data table, as input variables of the preset algorithm;

the training by adopting the preset algorithm according to the input variable and the output variable to obtain the data recommendation model comprises the following steps:

and training by adopting the preset algorithm according to the input variable and the output variable to obtain the first recommendation model.

Further, the data recommendation model further includes a second recommendation model, and determining the input variables of the preset algorithm according to the description of the service requirement and the preset service terms includes:

performing word segmentation processing on the description of the service requirement to obtain a plurality of word segmentation words, and respectively determining the maximum value of the similarity between each word segmentation word and each service term in a service term dictionary;

determining the maximum value of the similarity of each word segmentation word and each service term as an input variable of a preset algorithm;

and training by adopting the preset algorithm according to the input variable and the output variable to obtain the second recommendation model.

Further, the determining an output variable of the preset algorithm according to the information of the data table used by the service requirement includes:

determining usage information of all data tables in the data warehouse according to the information of the data tables used by a plurality of service requirements, and using the usage information as the output variable of the preset algorithm, where the usage information of each data table indicates whether that data table is used by the service requirement.

Further, the method further comprises:

performing word segmentation processing on input retrieval content, and determining the similarity between each word segmentation word in the retrieval content and each service term in each data table of the data warehouse, and the type, within the data table, of the service term corresponding to each similarity;

and determining a first probability value of the retrieval content using each data table by adopting the first recommendation model according to the similarity and the type of the service term corresponding to each similarity in the data table.

Further, the method further comprises:

performing word segmentation processing on input retrieval content, and determining the similarity between each word segmentation word in the retrieval content and each service term;

and determining a second probability value of using each data table by the retrieval content by adopting the second recommendation model according to the similarity.

Further, the method further comprises:

integrating a first probability value of the retrieval content using each data table and a second probability value of the retrieval content using each data table to determine a plurality of data tables with the highest probability value of the retrieval content using each data table;

determining the maximum similarity of each word segmentation word of the retrieval content and the service terms in the data tables with the maximum probability value;

and outputting a plurality of business terms with the maximum similarity to each participle word of the retrieval content and information of a data table where the business terms are located.

In a second aspect, an embodiment of the present invention further provides a device for training a data recommendation model, where the device includes:

the first acquisition module is used for acquiring the description of the service requirement in the application knowledge base of the data warehouse and the information of the data table used by the service requirement;

the first determining module is used for determining input variables of a preset algorithm according to the description of the service requirement and preset service terms;

the second determining module is used for determining the output variable of the preset algorithm according to the information of the data table used by the service requirement;

and the second acquisition module is used for training by adopting the preset algorithm according to the input variable and the output variable to acquire the data recommendation model.

Further, the data recommendation model includes a first recommendation model, and the first determining module is specifically configured to:

the second obtaining module is specifically configured to:

Further, the data recommendation model further includes a second recommendation model, and the first determining module is further specifically configured to:

the second obtaining module is further specifically configured to:

Further, the second determining module is specifically configured to:

Further, the apparatus further comprises:

the first word segmentation module is used for performing word segmentation processing on input retrieval content, and determining the similarity of each word segmentation word in the retrieval content and each business term in each data table in the data warehouse and the type of each similarity corresponding to the business term in the data table;

and a third determining module, configured to determine, according to the similarity and the type of the service term corresponding to each similarity in the data tables, a first probability value that the retrieved content uses each data table by using the first recommendation model.

Further, the apparatus further comprises:

the second word segmentation module is used for carrying out word segmentation processing on input retrieval contents and determining the similarity of each word segmentation word and each service term in the retrieval contents;

and the fourth determining module is used for determining a second probability value of the retrieval content using each data table by adopting the second recommendation model according to the approximation degree.

Further, the apparatus further comprises:

the integration module is used for integrating the first probability value of each data table used by the retrieval content and the second probability value of each data table used by the retrieval content so as to determine a plurality of data tables with the highest probability value of each data table used by the retrieval content;

a fifth determining module, configured to determine maximum similarity between each participle word of the search content and a service term in the data tables with the maximum probability value;

and the output module is used for outputting a plurality of business terms with the maximum similarity to each word segmentation word of the retrieval content and information of a data table where the business terms are located.

The embodiment of the invention also provides model training equipment, which comprises a memory and a processor, wherein a computer program capable of running on the processor is stored in the memory, and the processor realizes the steps of the training method of any data recommendation model when executing the computer program.

The embodiment of the invention also provides a storage medium, wherein a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the training method of any data recommendation model are executed.

The invention has the beneficial effects that:

According to the training method, apparatus, device and storage medium for a data recommendation model provided by the embodiments of the invention, the description of a business requirement and the information of the data tables used by the business requirement are first obtained from the data warehouse application knowledge base; the input variables of the preset algorithm used to train the data recommendation model are then determined according to the description of the business requirement and preset business terms; the output variable of the preset algorithm is determined according to the information of the data tables used by the business requirement; and finally the preset algorithm is trained on the input variables and the output variable to obtain the data recommendation model. In this way, the input and output variables of the preset algorithm are derived from the relationship between business requirement descriptions and data tables and from the preset business terms, so the trained data recommendation model embodies historical experience and can interpret text content semantically. Using the recommendation model therefore reduces the number of search results and improves their accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a schematic flowchart of a training method of a data recommendation model according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a testing method of a first recommendation model according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a testing method for a second recommendation model according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a method for applying a recommendation model according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a training apparatus for a data recommendation model according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a testing apparatus of a first recommendation model according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a testing apparatus of a second recommendation model according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an application apparatus of a recommendation model according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a model training device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.

The training method of the data recommendation model provided by the following embodiments of the invention can be used when developing an application program for data information search. The data recommendation model can be trained with a binary classification algorithm from machine learning, such as a decision tree algorithm or a random forest algorithm; the data recommendation model in the following embodiments can be trained with the random forest algorithm. FIG. 1 is a schematic flowchart of a training method of a data recommendation model according to an embodiment of the present invention. As shown in FIG. 1, the method may include:

S101, obtaining the description of the service requirement in the application knowledge base of the data warehouse and the information of the data table used by the service requirement.

Specifically, the data warehouse application knowledge base may reflect a corresponding relationship between the service demand and the data tables, where the service demand may be acquired from an online demand system or an offline demand document, and the data in the data warehouse may be stored in a plurality of data tables. The data warehouse application knowledge base can be built in the following mode, and the building process can be based on a computer programming language Python.

First, the keywords of the Structured Query Language (SQL) used by the data warehouse that are generally followed by table names, such as from, select, insert and update, are collected. An SQL script is then selected; any carriage returns in the script are replaced with spaces, and runs of consecutive spaces are collapsed into a single space. The script is then split on spaces with a split function (split), and the position number of each resulting token is recorded. The position number of the first occurrence of a keyword is found, then the position number of the next keyword occurrence after that position, and the table name of the data table is parsed from between the two position numbers. Finally, the table names parsed from the script corresponding to a service requirement, the description of the service requirement and the knowledge code corresponding to the service requirement form the content shown in Table 1.

TABLE 1

Knowledge code	Description of business requirement	Names of data tables used
1	xxxxxx	TABLE-1, TABLE-2
2	……	……

In Table 1, one knowledge code corresponds to the description of one service requirement, and the description of one service requirement may be associated with one data table or with several. The content of the table is the content of the data warehouse application knowledge base; once the knowledge base has been built, the description of a service requirement and the table names of the data tables it uses can be obtained from it.
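The table-name parsing described for building the knowledge base can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `extract_table_names` and the two keyword sets are assumptions (the text only names from, select, insert and update as examples), and subqueries or column lists in parentheses are not handled.

```python
import re

# Assumed keyword sets; the source text only lists from, select, insert, update.
TABLE_KEYWORDS = {"from", "join", "into", "update"}
BOUNDARY_KEYWORDS = TABLE_KEYWORDS | {
    "select", "insert", "where", "set", "on", "group", "order", "values",
}


def extract_table_names(sql_script):
    """Return table names found between a table keyword and the next keyword."""
    # Replace line breaks with spaces and collapse runs of whitespace.
    normalized = re.sub(r"\s+", " ", sql_script).strip()
    tokens = normalized.split(" ")
    # Record the position number of every keyword occurrence.
    positions = [i for i, tok in enumerate(tokens)
                 if tok.lower() in BOUNDARY_KEYWORDS]
    tables = []
    for n, pos in enumerate(positions):
        if tokens[pos].lower() not in TABLE_KEYWORDS:
            continue
        # The span ends at the next keyword, or at the end of the script.
        end = positions[n + 1] if n + 1 < len(positions) else len(tokens)
        segment = " ".join(tokens[pos + 1:end])
        # Entries are comma-separated; the first word of each is the table
        # name, and anything after it (an alias) is ignored.
        for entry in segment.split(","):
            words = entry.split()
            if words:
                name = words[0].strip(";()")
                if name and name not in tables:
                    tables.append(name)
    return tables


script = "SELECT a,\r\n b\nFROM TABLE_1 t1,   TABLE_2\nWHERE a > 1;"
print(extract_table_names(script))
```

Each parsed list of table names would then be stored next to the knowledge code and the requirement description, giving one row of Table 1.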

S102, determining input variables of a preset algorithm according to the description of the business requirement and preset business terms.

Specifically, the business terms may be stored in advance in a data warehouse business term dictionary. The sources of the business terms may be the table Chinese names, table Chinese descriptions, field Chinese names and field Chinese descriptions in the metadata information of the data warehouse. The data warehouse business term dictionary may be built as follows.

First, the table Chinese names and field Chinese names in the metadata information of the data warehouse are collected and used directly as initial terms, without word segmentation. These initial terms are reviewed and revised to form a first batch of business terms, which is added to the custom dictionary of the word segmentation tool. Next, the table Chinese descriptions in the metadata information are collected and segmented, and the segmentation results are used as a second batch of initial terms; the field Chinese descriptions are likewise collected and segmented, and their segmentation results are also added to the second batch of initial terms. Finally, the first and second batches of initial terms are reviewed and revised, and the reviewed and revised terms are added to the data warehouse business term dictionary as business terms.

Further, in the process of building the data warehouse business term dictionary, the word segmentation process may include the following steps: (1) word segmentation may be performed with jieba, a mainstream Python word segmentation tool, combined with the custom dictionary; (2) the segmentation results may be filtered: numbers and punctuation are removed first, then stop words, such as words of length 1, which carry no actual meaning; finally, invalid words are removed by means of a defined invalid word library, where invalid words are words that do not describe the table itself and have no value for recommendation, such as "year" and "month".
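The filtering in step (2) can be sketched as below. The helper name `filter_tokens` and the sample word lists are illustrative assumptions. The input is taken to be an already segmented token list; in practice it could come from jieba (for example `jieba.load_userdict(path)` followed by `jieba.lcut(text)`), which is left out here so that the sketch runs with the standard library alone.

```python
import re


def filter_tokens(tokens, stop_words, invalid_words):
    """Drop numbers, punctuation, stop words and invalid words from tokens."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue
        # Tokens made up only of digits, punctuation or underscores.
        if re.fullmatch(r"[\d\W_]+", tok):
            continue
        # Stop words, including all words of length 1.
        if len(tok) == 1 or tok in stop_words:
            continue
        # Words listed in the invalid word library.
        if tok in invalid_words:
            continue
        kept.append(tok)
    return kept


# Hypothetical segmentation result for a table description.
tokens = ["设备", "检修", "，", "2019", "年", "的", "月份"]
print(filter_tokens(tokens, stop_words={"的"}, invalid_words={"月份"}))
```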

Further, in step (2) of building the data warehouse business term dictionary, the invalid word library may be defined by means of the inverse document frequency (IDF), as follows. First, the table Chinese names, field Chinese names, table Chinese descriptions and field Chinese descriptions, together with the business requirements in the data warehouse application knowledge base, are collected. The collected content is segmented, the IDF values of all resulting word segmentation words are calculated, and the words are sorted by IDF value from low to high. A preset number of the lowest-ranked words is then selected for analysis, for example the first 100, and these form the invalid word library. The word segmentation words in the result obtained after removing punctuation and stop words are compared with the 100 words with the lowest IDF values in the invalid word library; any word in the result that also appears among those 100 words is regarded as an invalid word and removed. The IDF value is calculated by the following formula:

It can be seen from the formula that the more data tables contain a word segmentation word, the smaller its IDF value; that is, the word appears in many places in the data warehouse and discriminates poorly. The 100 word segmentation words with the lowest IDF values can therefore be used to judge which words in the result, after punctuation and stop words have been removed, are invalid. Those words are deleted so that only valid word segmentation words remain, and the valid words are used as the second batch of initial terms.
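As an illustration of the IDF-based invalid word library, the sketch below assumes the common form IDF(w) = log(N / n_w), where N is the number of collected texts (for example, one per data table) and n_w is the number of texts containing the word w. This form is an assumption that agrees with the behaviour described above (more containing tables, smaller IDF); the exact formula used in the embodiment may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def idf_values(documents):
    """Compute IDF(w) = log(N / n_w) over a list of token lists."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def build_invalid_word_library(documents, top_n=100):
    """Return the top_n words with the lowest IDF values."""
    idf = idf_values(documents)
    # Sort from low to high IDF; ties are broken alphabetically.
    ranked = sorted(idf, key=lambda word: (idf[word], word))
    return set(ranked[:top_n])


# Hypothetical segmented texts, one list per data table.
docs = [["设备", "检修", "年"], ["设备", "巡视", "年"], ["线路", "年"]]
print(build_invalid_word_library(docs, top_n=1))
```

Words returned by `build_invalid_word_library` would then be removed from the segmentation results before the second batch of initial terms is formed.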

The data warehouse business term dictionary can later be revised periodically according to the business term maintenance process. During the building of the dictionary, if a business term belonging to one data table comes from several source types, one source type is selected in the priority order of table Chinese name, field Chinese name, table Chinese description and field Chinese description, so that each business term is unique within a data table. Take the national power grid business term "equipment maintenance" as an example: if, within one data table, this term occurs in the table Chinese name, a field Chinese name, the table Chinese description and a field Chinese description, then according to the priority order its source type is the table Chinese name. The form in which business terms are stored in the data warehouse business term dictionary is shown in Table 2, where each business term corresponds to a term code.

TABLE 2

It can be seen from the table that one term code corresponds to one business term, and that each business term has exactly one source type; the source name may be, for example, a table Chinese name or a field Chinese name. The content of the table is the content of the data warehouse business term dictionary; once the dictionary has been built, business terms and their source types can be obtained from it.
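The priority rule for choosing a single source type can be sketched as follows. The label strings and the function name `pick_source_type` are illustrative assumptions, not identifiers from the embodiment.

```python
# Priority order stated in the text, highest priority first.
SOURCE_PRIORITY = [
    "table Chinese name",
    "field Chinese name",
    "table Chinese description",
    "field Chinese description",
]


def pick_source_type(candidate_types):
    """Return the highest-priority source type among the candidates."""
    for source_type in SOURCE_PRIORITY:
        if source_type in candidate_types:
            return source_type
    raise ValueError("no known source type among the candidates")


# A term that occurs in a field name and in a table description.
print(pick_source_type({"table Chinese description", "field Chinese name"}))
```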

The preset algorithm can be a decision tree algorithm or a random forest algorithm. Generally speaking the random forest algorithm performs better, so the preset algorithm in the embodiment of the invention can be the random forest algorithm. A random forest is a learning method that combines a series of classifiers to make a decision jointly, with the aim of obtaining a more balanced learner. Each classifier is built by randomly drawing part of the samples from the original data set as a sample subspace, randomly selecting a feature subspace from that sample subspace, and building a decision tree in the new space; the final decision is obtained by voting. The input variables of the random forest algorithm can be determined according to the description of the business requirements in the data warehouse application knowledge base and the business terms in the data warehouse business term dictionary.
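To make the sampling and voting concrete, here is a deliberately simplified sketch using only the standard library. It is not the implementation used in the embodiment: each tree is reduced to a depth-1 decision tree (a stump), and all function names and default parameters are assumptions. A production system would more likely rely on an existing random forest library.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Fit a depth-1 tree: the (feature, threshold) split with fewest errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= thr]
            right = [y for row, y in zip(rows, labels) if row[f] > thr]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, thr, left_label, right_label)
    if best is None:
        # No split possible: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, max_features=1, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Sample subspace: draw rows with replacement.
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Feature subspace: draw a random subset of the features.
        feats = rng.sample(range(n_features), max_features)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(left if row[f] <= thr else right
                    for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]
```

In the setting of this method, each row would hold the input variables for one (business requirement, data table) pair, and the label would state whether the requirement uses that table.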

Further, in one embodiment, the data recommendation model may be a first recommendation model, and the input variables of the random forest algorithm used to train the first recommendation model may be determined according to the description of the business requirements in the data warehouse application knowledge base and the business terms in the data warehouse business term dictionary. Specifically, the description of the business requirement is first segmented to obtain a plurality of word segmentation words. The segmentation process is almost the same as the one used when building the data warehouse business term dictionary, except that the custom dictionary used for segmentation contains all business terms in the data warehouse business term dictionary; the other steps are the same and are not repeated here. The similarity between each word segmentation word and the business terms in each data table of the data warehouse is then calculated, and the maximum similarity between each word segmentation word and the business terms in each data table is determined. For example, if segmenting the description of the business requirement corresponding to one knowledge code yields 10 word segmentation words, the data warehouse has 800 data tables in total, and each data table has 10 business terms on average, then calculating the similarity between each word segmentation word and each business term of each data table in the data warehouse generates 10 × 800 × 10 = 80000 records.
Table 3 takes the description of one business requirement of the national power grid as an example and records the similarity between a word segmentation word in that description and each business term in a data table, where "the source type of the business term in the table" can be obtained from the data warehouse business term dictionary. Based on the similarity calculated between each word segmentation word and each business term of each data table in the data warehouse, the maximum similarity between each word segmentation word and the business terms in each data table can be determined. Table 4 takes the same kind of example and records the maximum similarity between a word segmentation word in the description of the business requirement and the business terms in a data table.

TABLE 3

TABLE 4

Finally, the related information of the several word segmentation words with the greatest similarity in each data table of the data warehouse is determined. Taking the 5 word segmentation words with the greatest similarity as an example, the related information may include: the knowledge code, the table name, the TOP1 maximum similarity and its corresponding type (that is, the in-table type of the business term with the highest similarity in Table 4), the TOP2 maximum similarity and its corresponding type, the TOP3 maximum similarity and its corresponding type, the TOP4 maximum similarity and its corresponding type, and the TOP5 maximum similarity and its corresponding type. The TOP1 to TOP5 maximum similarities and their corresponding types are used as the input variables of the random forest algorithm. If an input variable would be null, that is, when a business requirement yields fewer word segmentation words than the number of input variables to be taken, the available maximum similarity values are averaged and the average is assigned to the remaining input variables.
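The TOP-k selection with average filling can be sketched as follows for one (knowledge code, data table) pair. The function name `top_k_features` is hypothetical, and giving the filled slots a type of `None` is an assumption, since the text does not say which type a filled slot receives.

```python
def top_k_features(word_scores, k=5):
    """Select the k highest (max_similarity, term_type) pairs for one table.

    word_scores holds one (max_similarity, term_type) pair per word
    segmentation word of the business requirement.
    """
    if not word_scores:
        raise ValueError("at least one word segmentation word is required")
    ranked = sorted(word_scores, key=lambda pair: pair[0], reverse=True)[:k]
    if len(ranked) < k:
        # Fewer words than input variables: fill with the mean similarity.
        mean = sum(score for score, _ in ranked) / len(ranked)
        # Assumption: filled slots carry no term type.
        ranked += [(mean, None)] * (k - len(ranked))
    return ranked


scores = [(0.5, "field Chinese name"), (0.75, "table Chinese name"),
          (0.25, "table Chinese description")]
for rank, (score, term_type) in enumerate(top_k_features(scores), start=1):
    print(f"TOP{rank}", score, term_type)
```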

Further, in another embodiment, the data recommendation model may be a second recommendation model. As with the first recommendation model, the input variables of the random forest algorithm used to train the second recommendation model may be determined according to the description of the business requirements in the data warehouse application knowledge base and the business terms in the data warehouse business term dictionary. Specifically, the description of the business requirement is first segmented to obtain a plurality of word segmentation words; the segmentation process is the same as in the training of the first recommendation model and is not repeated here. The similarity between each word segmentation word and each business term in the data warehouse business term dictionary is then calculated, and the maximum similarity between the word segmentation words and each business term is determined. For example, if segmenting the description of the business requirement corresponding to one knowledge code yields 10 word segmentation words, and the data warehouse business term dictionary contains 5000 business terms, then calculating the similarity between the 10 word segmentation words and the 5000 business terms generates 10 × 5000 = 50000 records. Table 5 takes the description of one business requirement of the national power grid as an example and records the similarity between a word segmentation word in that description and each business term.

TABLE 5

Knowledge code	Word segmentation word	Business term	Similarity between word segmentation word and business term
1	Equipment maintenance	Equipment repair test	0.XXX
1	Equipment maintenance	Equipment patrol	0.XXX

Based on the similarity calculated between each word segmentation word and each business term, the maximum value of the similarity between the word segmentation words and each business term in the data warehouse business term dictionary can be determined. Table 6 takes the description of one service requirement of the national grid as an example and records the maximum similarity between the word segmentation words in that description and each business term.

TABLE 6

Business term	Maximum similarity between word segmentation words and business term
Equipment repair test	0.XXX
Equipment patrol	0.XXX
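
The relation between Table 5 (one record per word and term pair) and Table 6 (one maximum per business term) may be sketched in Python as follows. This is only an illustrative sketch: `difflib.SequenceMatcher` is used as a stand-in similarity measure and is not the similarity calculation of the embodiment, and the function names and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(word, term):
    # Stand-in string similarity in [0, 1]; an assumption for illustration,
    # not the similarity measure used in the embodiment.
    return SequenceMatcher(None, word, term).ratio()


def pairwise_records(knowledge_code, words, terms):
    """Table 5 layout: one record per (word segmentation word, business term)."""
    return [
        (knowledge_code, word, term, similarity(word, term))
        for word in words
        for term in terms
    ]


def max_per_term(records):
    """Table 6 layout: for each business term, the maximum over all words."""
    best = {}
    for _code, _word, term, score in records:
        best[term] = max(score, best.get(term, 0.0))
    return best


words = ["equipment maintenance", "plan"]
terms = ["equipment maintenance", "equipment patrol", "repair test"]
records = pairwise_records(1, words, terms)  # len(words) * len(terms) records
table6 = max_per_term(records)
```

With 10 word segmentation words and 5000 business terms, `pairwise_records` would return the 50000 records mentioned above.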

Finally, the contents of Table 6 are transformed into the following form: knowledge code, similarity of business term 1, similarity of business term 2, and so on, with as many variables as there are business terms. The maximum value of the similarity between the word segmentation words and each business term can thus be determined as an input variable of the random forest algorithm. If the similarity values of all input variables corresponding to one knowledge code are smaller than a preset value, for example 0.2, this indicates that the description of the service requirement corresponding to that knowledge code is of poor quality, and the record can be deleted.
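
The transformation and the quality filter described above may be sketched as follows. The function name `build_feature_rows`, the nested-dictionary input, and the default of 0.0 for a business term with no recorded similarity are assumptions for illustration; only the threshold of 0.2 comes from the example in the text.

```python
def build_feature_rows(max_sim, term_order, min_quality=0.2):
    """Turn per-term maxima into one fixed-width feature row per knowledge code.

    max_sim: {knowledge_code: {business_term: maximum similarity}}
    term_order: list of all business terms, fixing the column order.
    Rows whose values are all below min_quality are dropped, since they
    indicate a poorly written service requirement description.
    """
    rows = {}
    for code, sims in max_sim.items():
        # Missing terms default to 0.0 (an assumption, not from the source).
        row = [sims.get(term, 0.0) for term in term_order]
        if all(value < min_quality for value in row):
            continue  # low-quality description: delete the record
        rows[code] = row
    return rows


rows = build_feature_rows(
    {1: {"equipment patrol": 0.8, "repair test": 0.1},
     2: {"equipment patrol": 0.1, "repair test": 0.05}},
    term_order=["equipment patrol", "repair test"],
)
```

In this example only knowledge code 1 is kept, because every value for knowledge code 2 is below 0.2.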

S103, determining an output variable of the preset algorithm according to the information of the data table used by the service requirement.

Specifically, the usage information of all data tables in the data warehouse is determined according to the information of the data tables used by a plurality of service requirements, that is, the content of the data warehouse application knowledge base, and may be used as the output variable of the preset algorithm. The usage information of each data table indicates whether the data table is used by the service requirement.

Further, in an embodiment, the data recommendation model may be a first recommendation model. In combination with Table 1 and the input variables of the first recommendation model, if a data table name used in Table 1 is consistent with the table name in Table 4, the record is marked as 1. For example, if one knowledge code in the data warehouse application knowledge base uses 3 data tables and the data warehouse contains 1000 data tables in total, the knowledge code and the 1000 tables generate 1000 records, of which 3 are marked as 1 and the rest as 0. This usage information may be used as the output variable of the random forest algorithm.

Further, in another embodiment, the data recommendation model may be a second recommendation model. The content of the data warehouse application knowledge base is processed as follows: taking the information of the data tables used by the service requirement corresponding to one knowledge code as an example, it is split into the variables "whether table 1 is used", "whether table 2 is used", and so on, with as many variables as there are data tables in the database. If the knowledge code uses a table, the corresponding variable takes the value 1; otherwise it takes the value 0. This usage information may be used as the output variable of the random forest algorithm.
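
The two output layouts of step S103 may be contrasted with a short Python sketch: one record per (knowledge code, table) pair for the first recommendation model, and one 0/1 vector per knowledge code for the second. The function names and the table naming scheme are hypothetical and serve only as an illustration.

```python
def pair_labels(knowledge_code, used_tables, all_tables):
    """First model: one (knowledge code, table, label) record per data table."""
    used = set(used_tables)
    return [(knowledge_code, table, int(table in used)) for table in all_tables]


def multi_hot(used_tables, all_tables):
    """Second model: one 0/1 variable per data table, in a fixed table order."""
    used = set(used_tables)
    return [int(table in used) for table in all_tables]


# 1000 data tables in the warehouse, 3 of them used by knowledge code "K001".
all_tables = [f"table_{i}" for i in range(1, 1001)]
used = ["table_3", "table_42", "table_900"]

long_rows = pair_labels("K001", used, all_tables)  # 1000 records, 3 labelled 1
wide_row = multi_hot(used, all_tables)             # 1000 variables, 3 equal to 1
```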

S104, training by adopting the preset algorithm according to the input variable and the output variable to obtain the data recommendation model.

Specifically, in an embodiment, the data recommendation model may be a first recommendation model, training is performed by using a random forest algorithm according to the input variable and the output variable, an optimal parameter may be obtained through cross validation, and finally, the first recommendation model is obtained and stored.

Specifically, in another embodiment, the data recommendation model may be a second recommendation model. Training is performed by using the random forest algorithm according to the input variables and the output variables, and may be implemented by using a deep learning framework (TensorFlow), where the number of input variables is equal to the number of business terms and the number of output variables is equal to the number of data tables in the database. Training takes sigmoid as the activation function and binary_cross as the loss function to obtain the optimal parameters, and the second recommendation model is finally obtained and stored.
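
For reference, the activation and loss named above may be written out as follows, on the assumption that binary_cross denotes the usual binary cross-entropy loss. The symbols are introduced here for illustration only: K is the number of data tables, z_k is the raw model output for table k, y_k ∈ {0, 1} is the usage label of table k, and ŷ_k is the predicted probability that table k is used.

```latex
\hat{y}_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}},
\qquad
\mathcal{L} = -\frac{1}{K} \sum_{k=1}^{K}
  \Bigl[\, y_k \log \hat{y}_k + (1 - y_k) \log\bigl(1 - \hat{y}_k\bigr) \Bigr]
```

Under this reading, each data table is treated as an independent yes/no prediction, which matches the one-variable-per-table output layout of the second recommendation model.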

According to the training method of the data recommendation model, provided by the invention, the input variable and the output variable of the preset algorithm can be determined by utilizing the relation between the description of the service requirement and the data table and the preset service term, and the data recommendation model is finally obtained through training, so that the data recommendation model has historical experience, the text content can be understood semantically, and the amount of the search result can be reduced and the accuracy of the search result can be improved by using the recommendation model.

Fig. 2 is a schematic flowchart of a testing method of a first recommendation model according to an embodiment of the present invention, and as shown in fig. 2, the method may include:

S201, performing word segmentation processing on input retrieval content, and determining the similarity between each word segmentation word in the retrieval content and the business terms in each data table in the data warehouse, and the type in the data table of the business term corresponding to each similarity.

Specifically, the search content input by the user may be subjected to word segmentation processing; the process is the same as the word segmentation processing of the description of the service requirement and is not repeated here. Then, the similarity between each word segmentation word in the search content and the business terms in each data table in the data warehouse, and the type in the data table of the business term corresponding to each similarity, may be determined. This process is the same as that for calculating the similarity between each word segmentation word in the description of the service requirement and the business terms in each data table in the data warehouse, and is not repeated here.

S202, according to the similarity and the type of the business term corresponding to each similarity in the data table, a first probability value of using each data table by the retrieval content is determined by adopting a first recommendation model.

Specifically, the first recommendation model receives the similarity between the word segmentation words in the search content and the business terms in each data table in the data warehouse. According to the received similarity values, the first recommendation model may determine a first probability value that the search content uses each data table, and the result is stored in the following format: search content, table name, and probability value that the table needs to be used.

Fig. 3 is a flowchart illustrating a testing method of a second recommendation model according to an embodiment of the present invention, and as shown in fig. 3, the method may include:

S301, performing word segmentation processing on the input search content, and determining the similarity between each word segmentation word in the search content and each business term.

Specifically, the search content input by the user may be subjected to word segmentation processing; the process is the same as the word segmentation processing of the description of the service requirement and is not repeated here. Then, the similarity between each word segmentation word in the search content and each business term may be determined. This process is the same as that for calculating the similarity between each word segmentation word in the description of the service requirement and the business terms in the data warehouse business term dictionary, and is not repeated here.

S302, determining, by adopting the second recommendation model and according to the similarity, a second probability value that the retrieval content uses each data table.

Specifically, the second recommendation model receives the similarity between each word segmentation word in the search content and the business terms in the data warehouse business term dictionary. According to the received similarity values, the second recommendation model may determine a second probability value that the search content uses each data table, and the result is stored in the following format: search content, table name, and probability value that the table needs to be used.

Fig. 4 is a schematic flowchart of an application method of a recommendation model according to an embodiment of the present invention, and as shown in fig. 4, the method may include:

S401, integrating the first probability value that the search content uses each data table and the second probability value that the search content uses each data table, so as to determine a plurality of data tables with the maximum probability values of being used by the search content.

Specifically, the search content may be a sentence or a paragraph of service requirement description. The first probability values and the second probability values that the search content uses each data table in the database are sorted from high to low, and the data tables corresponding to a preset number of the highest probability values are taken. For example, if the preset number is 20, the top 20 tables are saved.

In the embodiment of the present invention, before the plurality of data tables with the maximum probability values are determined, the search content input by the user may first be compared with business terms such as the names of the tables in the database. If the similarity exceeds a preset value, for example 0.9, the corresponding tables can be recommended to the user directly. If several data tables have a similarity above 0.9, they can be recommended to the user together in order of similarity from high to low.
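
Step S401 and the direct-recommendation check may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the source does not say how the two probability values of the same table are combined, so the larger of the two is kept here; it is also assumed that a direct match replaces the model-based ranking. The function name and argument layout are hypothetical.

```python
def recommend_tables(first_probs, second_probs, name_sims=None,
                     top_n=20, direct_threshold=0.9):
    """Return table names to recommend, best first.

    first_probs / second_probs: {table name: probability} from the first
    and second recommendation models.
    name_sims: optional {table name: similarity between the search content
    and business terms such as the table name}.
    """
    # Direct recommendation: tables whose similarity exceeds the threshold.
    if name_sims:
        direct = [(t, s) for t, s in name_sims.items() if s > direct_threshold]
        if direct:
            direct.sort(key=lambda pair: pair[1], reverse=True)
            return [table for table, _ in direct]

    # Integration: keep the larger probability per table (an assumption),
    # sort from high to low, and take the preset number of tables.
    merged = dict(first_probs)
    for table, prob in second_probs.items():
        merged[table] = max(prob, merged.get(table, 0.0))
    ranked = sorted(merged.items(), key=lambda pair: pair[1], reverse=True)
    return [table for table, _ in ranked[:top_n]]


tables = recommend_tables({"A": 0.2, "B": 0.9}, {"A": 0.6, "C": 0.4}, top_n=2)
```

Other combination rules, such as averaging the two probability values, would fit the same structure.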

S402, determining the maximum similarity between each word segmentation word of the retrieval content and the service terms in the data tables with the maximum probability value.

Specifically, the maximum similarity between each word segmentation word in the search content and the business terms in the tables saved in the previous step is calculated. The process is the same as that for determining the maximum similarity between each word segmentation word in the description of the service requirement and the business terms in each data table of the data warehouse, and is not repeated here. The business terms in the tables saved in the previous step may be field business terms, including field Chinese names and field Chinese descriptions; that is, the maximum similarity between each word segmentation word in the search content and the field business terms in the tables saved in the previous step may be calculated.

S403, outputting a plurality of business terms with the maximum similarity to each word segmentation word of the search content and information of the data tables where the business terms are located.

Specifically, the maximum similarity between each word segmentation word in the search content and the field business terms in the tables saved in the previous step may be calculated. For example, the top 5 fields with the highest similarity, together with the names of the tables to which these fields belong, may be recommended to the user.
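
The selection in steps S402 and S403 may be sketched as follows. The record layout `(word, table, field, similarity)` and the function name `top_fields` are assumptions for illustration, and the sketch ranks fields over all word segmentation words together rather than per word, which is one possible reading of the text.

```python
import heapq


def top_fields(records, k=5):
    """Return the k (table, field, similarity) entries with the highest
    maximum similarity over all word segmentation words.

    records: iterable of (word, table, field, similarity) tuples.
    """
    best = {}
    for _word, table, field, sim in records:
        key = (table, field)
        if sim > best.get(key, -1.0):
            best[key] = sim  # keep the maximum over the words
    candidates = ((table, field, sim) for (table, field), sim in best.items())
    return heapq.nlargest(k, candidates, key=lambda row: row[2])


result = top_fields(
    [("maintenance", "T1", "repair_date", 0.4),
     ("equipment", "T1", "repair_date", 0.6),
     ("equipment", "T2", "device_name", 0.9)],
    k=2,
)
```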

Fig. 5 is a schematic structural diagram of a training apparatus for a data recommendation model according to an embodiment of the present invention, and as shown in fig. 5, the apparatus may include:

a first obtaining module 501, configured to obtain a description of a service requirement in a data warehouse application knowledge base and information of a data table used by the service requirement;

a first determining module 502, configured to determine an input variable of a preset algorithm according to the description of the service requirement and a preset service term;

a second determining module 503, configured to determine an output variable of the preset algorithm according to information of a data table used by the service requirement;

a second obtaining module 504, configured to perform training by using the preset algorithm according to the input variable and the output variable, and obtain the data recommendation model.

Further, the data recommendation model includes a first recommendation model, and the first determining module 502 is specifically configured to:

determining a plurality of word segmentation words with the maximum approximation degree in each data table of the data warehouse and the types of business terms corresponding to the word segmentation words with the maximum approximation degree in the data table as input variables of a preset algorithm;

the second obtaining module 504 is specifically configured to:

Further, the data recommendation model includes a second recommendation model, and the first determining module 502 is further specifically configured to:

the second obtaining module 504 is further specifically configured to:

Further, the second determining module 503 is specifically configured to:

determining the usage information of all data tables in the data warehouse according to the information of the data tables used by a plurality of service requirements, and using the usage information as the output variable of the preset algorithm; the usage information of each data table is used to indicate whether the data table is used by the service requirement.

Fig. 6 is a schematic structural diagram of a testing apparatus of a first recommendation model according to an embodiment of the present invention, and as shown in fig. 6, the apparatus may include:

a first word segmentation module 601, configured to perform word segmentation processing on input search content, and determine the similarity between each word segmentation word in the search content and the business terms in each data table in the data warehouse, and the type in the data table of the business term corresponding to each similarity;

and a third determining module 602, configured to determine, according to the similarity and the type of the service term corresponding to each similarity in the data tables, a first probability value that each data table is used by the retrieved content using the first recommendation model.

Fig. 7 is a schematic structural diagram of a testing apparatus of a second recommendation model according to an embodiment of the present invention, and as shown in fig. 7, the apparatus may include:

a second word segmentation module 701, configured to perform word segmentation processing on the input search content, and determine the similarity between each word segmentation word and each service term in the search content;

a fourth determining module 702, configured to determine, according to the approximation, a second probability value of using each data table for the retrieved content by using the second recommendation model.

Fig. 8 is a schematic structural diagram of an application apparatus of a recommendation model according to an embodiment of the present invention, and as shown in fig. 8, the apparatus may include:

an integrating module 801, configured to perform integration processing on the first probability value of each data table used by the search content and the second probability value of each data table used by the search content to determine multiple data tables with the highest probability value of each data table used by the search content;

a fifth determining module 802, configured to determine maximum similarity between each participle word of the search content and a service term in the data tables with the maximum probability value;

the output module 803 is configured to output a plurality of business terms with the maximum similarity to each participle word of the search content, and information of a data table in which the plurality of business terms are located.

The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 9 shows a model training device provided in an embodiment of the present invention, including: a memory 901 and a processor 902.

The memory 901 is used for storing a computer program that is executable on the processor 902, and the processor 902 is used for implementing the above-mentioned method embodiments when executing the computer program. The specific implementation and technical effects are similar, and are not described herein again.

Optionally, the present invention also provides a storage medium, for example a computer-readable storage medium, comprising a program which, when executed by a processor, is adapted to perform the above-described method embodiments.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

Claims

1. A method for training a data recommendation model, the method comprising:

2. The method of claim 1, wherein the data recommendation model comprises a first recommendation model, and the determining of the input variables of the preset algorithm according to the description of the service requirement and the preset business terms comprises:

3. The method of claim 2, wherein the data recommendation model further comprises a second recommendation model, and the determining of the input variables of the preset algorithm according to the description of the service requirement and the preset business terms comprises:

4. The method according to claim 2 or 3, wherein the determining the output variable of the preset algorithm according to the information of the data table used by the service requirement comprises:

determining the usage information of all data tables in the data warehouse according to the information of the data tables used by the plurality of service requirements, wherein the usage information is used as the output variable of the preset algorithm; the usage information of each data table is used to indicate whether the data table is used by the service requirement.

5. The method of claim 3, further comprising:

6. The method of claim 5, further comprising:

and determining a second probability value of using each data table by the retrieval content by adopting the second recommendation model according to the approximation degree.

7. The method of claim 6, further comprising:

8. An apparatus for training a data recommendation model, the apparatus comprising:

9. A model training device, comprising a memory and a processor, wherein a computer program executable on the processor is stored in the memory, and the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.

10. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.