CN116629132A - Semi-automatic model building method and machine learning platform based on PCFG - Google Patents


Publication number
CN116629132A
Authority
CN
China
Prior art keywords
algorithm
machine learning
module
model
data
Prior art date
Legal status
Pending
Application number
CN202310635240.8A
Other languages
Chinese (zh)
Inventor
张微
冯天
周必群
胡开添
王宇
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Application filed by Zhejiang University ZJU
Priority to CN202310635240.8A
Publication of CN116629132A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a semi-automatic model building method and a machine learning platform based on PCFG. The method comprises the following steps: a Chinese sentence describing the machine learning model to be constructed is input; the sentence is first segmented into words, and the word segmentation result is then parsed with a probabilistic context-free grammar (PCFG) to obtain a syntax tree of the sentence. A level-order traversal is performed on the syntax tree generated by the PCFG. Because the child nodes of the same parent node generally represent a left-to-right sequence, the level-order traversal yields the order of the nodes within each level. By constructing a mapping between verbs and algorithm modules, the execution order of the algorithm modules corresponding to the phrases of the same level can be obtained, which is the algorithm module flow chart of the machine learning model corresponding to the given sentence. Finally, the invention provides a machine learning platform that semi-automatically constructs models based on this method, thereby greatly reducing the difficulty of model construction for the user and improving efficiency.

Description

Semi-automatic model building method and machine learning platform based on PCFG
Technical Field
The invention belongs to the field of computer application and machine learning, and relates to a method for semi-automatically constructing a model based on PCFG and a machine learning platform.
Background
Existing machine learning platforms are built on cloud platforms: the user submits data to the cloud platform, draws a flow chart by dragging a series of processing steps on the machine learning platform, and finally runs the flow to obtain a machine learning result. Such platforms have several shortcomings. They are generally implemented as Java Web applications, but Java is rarely used for writing machine learning algorithms and has poor compatibility with common machine learning programming languages such as Python. Data can only be supplied by uploading a local file to the server; if the file is too large and network fluctuation occurs during transmission, the upload may be too slow or may even fail. The machine learning model can only use the algorithms provided by the platform, so a user cannot use a self-optimized algorithm as an intermediate node. Finally, constructing a machine learning model flow chart by dragging is cumbersome and time-consuming for the user.
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a PCFG-based method for semi-automatically building a machine learning model, and a machine learning platform that supports reading data from a user database, supports custom algorithms, and supports data exchange between cross-language algorithms.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
In a first aspect, the present invention provides a semi-automatic model building method based on PCFG, which is used for automatically building a model flow chart on a machine learning platform. Callable algorithm modules are prestored in the machine learning platform, and all the algorithm modules are divided into a plurality of sets according to function category; each set contains all the algorithm modules on the platform that belong to the same function category, each set is named by its function category name, and each algorithm module is named by its algorithm name;
The semi-automatic model building method comprises the following steps:
S1, acquiring a Chinese description sentence which is input by a user and describes the machine learning model flow chart expected to be constructed;
S2, performing word segmentation on the Chinese description sentence to obtain a word segmentation result with part-of-speech tags;
S3, matching each verb in the word segmentation result against an algorithm keyword library, and replacing each verb in the Chinese description sentence that matches a keyword in the library with a placeholder representing that verb, to obtain an input sentence; the module names and function attribute labels of all callable algorithm modules in the machine learning platform are prestored in the algorithm keyword library;
S4, performing syntactic analysis on the input sentence based on a probabilistic context-free grammar (PCFG), so as to parse a sentence structure conforming to the grammar rules and generate a syntax tree; if only one syntax tree is generated, it is used as the final syntax tree; if a plurality of syntax trees are generated because of parsing ambiguity, the occurrence probability of each generated syntax tree is calculated, and the syntax tree with the largest occurrence probability is used as the final syntax tree;
S5, extracting all placeholders, in order of level and of leaf-node sequence in the final syntax tree, to form a placeholder sequence; mapping each placeholder in the placeholder sequence, in order, to a corresponding algorithm module prestored in the machine learning platform, so that all the mapped algorithm modules are connected in sequence to form a model flow chart; wherein:
If a placeholder has an adverb belonging to the same subtree in the syntax tree, the corresponding set is searched for in the machine learning platform according to the replaced verb corresponding to the placeholder, the corresponding algorithm module is then searched for in that set according to the adverb in the same subtree, and the algorithm module is added to the model flow chart;
If a placeholder has no adverb belonging to the same subtree in the syntax tree, the corresponding set is searched for in the machine learning platform directly according to the replaced verb corresponding to the placeholder, all algorithm modules in that set are associated, as candidate modules, with an algorithm module identifier, and the algorithm module identifier is added to the model flow chart so that the user can specify the final algorithm module.
As a preference of the first aspect, in S3, the algorithm keyword library stores, in addition to the module name of each algorithm module, the synonyms of that module name, and the synonyms of the module names are included in the matching range when a verb is matched with the algorithm module names in the algorithm keyword library.
As a preference of the first aspect, each algorithm module in the machine learning platform has a name attribute tag describing the algorithm name of the algorithm module and a function attribute tag describing the function category of the algorithm module; all the algorithm modules are grouped according to function category, and the algorithm modules with the same function attribute tag are stored in the set of the same function category, keyed by their name attribute tags.
As a preference of the first aspect, in S3, the relationship between each placeholder and the verb it replaces is stored in a record table; when the mapping is performed in S5, the replaced verb in the Chinese description sentence corresponding to each placeholder is looked up in the record table.
Preferably, after the model flow chart is constructed, if an algorithm module identifier exists, a manual assignment process is initiated for the user; the user selects one of the candidate modules, and the selected algorithm module replaces the algorithm module identifier in the model flow chart.
As a preference of the first aspect, the probabilistic context-free grammar (PCFG) parses the input sentence according to a preset rule set, each rule being provided with a corresponding probability.
In a second aspect, the present invention provides a machine learning platform for semi-automatically constructing a machine learning model, where the machine learning platform includes a data import module, a data format conversion module, a machine learning model construction module, and a resource layer;
The data import module is used for reading specified data from a user database and storing it on the local server side as the data source of the machine learning model;
The data format conversion module is used for data exchange between adjacent algorithm modules of the model flow chart, with JSON as the data exchange format: the output data of the previous algorithm module is first converted into a unified JSON format, and the JSON data is then converted into the input data of the next algorithm module;
The machine learning model construction module comprises a semi-automatic construction module and a visual drag construction module, wherein the semi-automatic construction module implements the PCFG-based semi-automatic model building method according to any one of claims 1-6 so as to generate an initial model flow chart, and the visual drag construction module is used for modifying the initial model flow chart by visual dragging;
The resource layer provides bottom-layer support for the construction and running of the machine learning model, and integrates the callable algorithm modules and the runtime environments they require.
As a preferable mode of the first aspect, the machine learning platform is connected to the underlying database through database middleware. The user data is stored in the underlying database, and the database middleware is located between the underlying database and the machine learning platform. After the user inputs the URL address of the host on which the database is installed, the port number, and the user name and password, the machine learning platform automatically selects suitable database middleware according to the database and connects to the database.
As a preferred mode of the first aspect, the database middleware is designed in a server proxy mode: a proxy service is deployed independently and manages a plurality of database instances behind it; the application establishes a connection with the proxy server through a data source, and the proxy operates the underlying database and returns the corresponding result.
As a preference of the first aspect, each machine learning model object finally constructed by the machine learning platform includes an algorithm module part, a data set part and a model parameter part. The algorithm module part comprises a directed acyclic graph formed by interconnecting the algorithm modules used by the machine learning model; the algorithm modules are stored in a linked list in the topological order of the nodes of the directed acyclic graph, and the connections among the algorithm modules are stored in the form of an adjacency matrix. The data set part includes the training set and the test set used by the machine learning model. The model parameter part comprises the initial preset parameter values of the machine learning model and the parameter values after training is completed.
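The model object described above can be sketched as follows. This is a minimal illustration rather than the platform's actual data structure: the class name `ModelObject`, the function `build_model_object` and the module names are hypothetical, and a Python list stands in for the linked list mentioned in the text.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class ModelObject:
    """Hypothetical container for the three parts named in the text."""
    modules: list    # algorithm module names, in topological order
    adjacency: list  # adjacency[i][j] == 1 means modules[i] feeds modules[j]
    dataset: dict = field(default_factory=dict)     # e.g. training and test sets
    parameters: dict = field(default_factory=dict)  # e.g. initial and trained values


def build_model_object(edges):
    """Build a ModelObject from (source, target) edges of a directed acyclic graph."""
    # TopologicalSorter expects a mapping: node -> set of predecessor nodes.
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    order = list(TopologicalSorter(predecessors).static_order())
    index = {name: i for i, name in enumerate(order)}
    adjacency = [[0] * len(order) for _ in order]
    for src, dst in edges:
        adjacency[index[src]][index[dst]] = 1
    return ModelObject(modules=order, adjacency=adjacency)


# Hypothetical three-module chain.
edges = [("read_data", "normalize"), ("normalize", "random_forest")]
model = build_model_object(edges)
print(model.modules)
print(model.adjacency)
```

Because the modules are stored in topological order, every edge in the adjacency matrix points from a lower index to a higher one, so the modules can be executed by walking the list from front to back.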
Due to the adoption of the technical scheme, the invention has the following advantages:
1) According to the invention, the user inputs a Chinese description sentence describing the machine learning model flow chart expected to be constructed, and the machine learning platform constructs the corresponding model flow chart through natural language processing and module calling, which greatly reduces the difficulty of model construction for the user and improves efficiency.
2) According to the invention, database tables can be read through the database middleware as a data source, which simplifies read/write splitting and database and table sharding operations and hides the underlying implementation details, so that multiple databases and tables can be operated as if they were a single database with a single table, further facilitating the uploading of data by the user.
3) The invention unifies the input and output format standards of the algorithm modules, so that a user only needs to write a custom algorithm according to these standards in order to use it as an intermediate node of a machine learning model in the system, which further improves the customizability of the machine learning model. Custom algorithms can also be stored on the server, so that the user can reuse them without uploading the same algorithm each time.
Drawings
Fig. 1 is the machine learning platform framework.
Fig. 2 is a syntax tree based on PCFG analysis.
Fig. 3 is the machine learning model flow chart corresponding to Fig. 2.
Fig. 4 is the flow of the PCFG-based semi-automatic construction algorithm.
Fig. 5 is the mapping relationship and mapping flow between verbs and algorithm modules.
Fig. 6 is the structural design for connecting to the user database.
Fig. 7 is the design pattern of the database middleware.
Fig. 8 is the structural design of a machine learning model object.
Fig. 9 is the structural design of the data exchange between algorithm modules.
Fig. 10 is the structural design of the resource layer.
Detailed Description
The invention is further illustrated and described below with reference to the drawings and detailed description. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.
In order that the above objects, features and advantages of the invention may be readily understood, a more particular description of the invention is given with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be embodied in many forms other than those described herein, and those skilled in the art may make similar modifications without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below.
Against the background of the rapid development of Internet application technology, people are increasingly active on the Internet in teaching, entertainment, sharing and other activities. This creates massive amounts of data, which gives an opportunity to the development of machine learning technology and provides massive data for training machine learning models. However, from the viewpoint of performance, users generally choose to store very large data in a database. If the user exports a data file from the database and uploads it to the platform, the upload is likely to be slow or even to fail, so it is difficult for the user to upload the data file to the platform for training, and difficult for the platform to receive such a large data file at one time. From the viewpoint of practical use, if a user wants to train a machine learning model with a self-optimized algorithm and compare it with a conventional algorithm, general platforms do not support this. Meanwhile, users work in a variety of machine learning programming languages, and it is difficult for a platform to support connection and switching between algorithm nodes written in different languages. Finally, to obtain the final machine learning model flow chart, the user must carry out tedious operations such as parameter setting and algorithm module connection, which is time-consuming.
In view of the above problems, the embodiments of the present application are intended to solve: 1. the problem that it is difficult for a user to upload data to a platform because the data volume is too large; 2. the problem that the user cannot use a self-optimized algorithm; 3. the problem of data exchange between cross-language algorithms; 4. the problem that constructing a machine learning model flow chart is time-consuming. Therefore, the embodiments of the present application provide a PCFG-based machine learning platform which automatically builds a machine learning model, works across languages and supports custom algorithms.
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.
In a preferred embodiment of the present application, a semi-automatic model building method based on PCFG is provided for automatically building a model flow chart on a machine learning platform. Callable algorithm modules are stored in the machine learning platform in advance; each algorithm module is packaged beforehand and can then be invoked directly by the platform without redevelopment. All the algorithm modules are divided into a plurality of sets according to function category; each set contains all the algorithm modules on the platform that belong to the same function category, each set is named by its function category name, and each algorithm module is named by its algorithm name. To facilitate matching, each algorithm module in the machine learning platform has a name attribute tag describing the algorithm name of the algorithm module and a function attribute tag describing the function category of the algorithm module. All the algorithm modules are grouped according to function category, and the algorithm modules with the same function attribute tag are stored in the set of the same function category, keyed by their name attribute tags, so that a lookup table of the algorithm modules is formed.
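The lookup table of algorithm modules can be sketched as a two-level dictionary. This is only an illustration under stated assumptions: the module names and function categories below are invented examples, and the text does not specify how the platform actually stores the sets.

```python
# Hypothetical algorithm modules. Each carries a name attribute tag and a
# function attribute tag, as described in the text.
MODULES = [
    {"name": "min_max", "function": "normalize"},
    {"name": "z_score", "function": "normalize"},
    {"name": "random_forest", "function": "classify"},
    {"name": "svm", "function": "classify"},
]


def build_lookup_table(modules):
    """Group modules into one set per function category, keyed by module name."""
    table = {}
    for module in modules:
        table.setdefault(module["function"], {})[module["name"]] = module
    return table


LOOKUP = build_lookup_table(MODULES)
print(sorted(LOOKUP))              # names of the function-category sets
print(sorted(LOOKUP["classify"]))  # module names inside one set
```

With this layout, finding a set by function category and then a module by algorithm name are both single dictionary lookups.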
The semi-automatic model building method comprises the following steps:
S1, acquiring a Chinese description sentence which is input by a user and describes the machine learning model flow chart expected to be constructed.
S2, performing word segmentation on the Chinese description sentence to obtain a word segmentation result with part-of-speech tags.
S3, matching each verb in the word segmentation result against an algorithm keyword library, and replacing each verb in the Chinese description sentence that matches a keyword in the library with a placeholder representing that verb, to obtain an input sentence. The module names and function attribute labels of all callable algorithm modules in the machine learning platform are prestored in the algorithm keyword library.
In the embodiment of the invention, when verbs are matched against the algorithm keyword library, it should be considered that the verbs input by the user may include many synonyms of an algorithm, such as abbreviations, aliases, and Chinese and English names. Therefore, in order to identify the user's intention accurately, the algorithm keyword library stores, in addition to the module name of each algorithm module, the synonyms of that module name, and the synonyms of the module names are included in the matching range when a verb is matched with the algorithm module names in the algorithm keyword library.
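Step S3 can be sketched as follows, taking the part-of-speech-tagged output of S2 as given (the text does not name a word segmentation tool, so none is assumed here). The keyword library contents, the tag `"v"` for verbs, the placeholder format `VERB1`, `VERB2`, and the function names are all hypothetical.

```python
# Hypothetical keyword library: canonical keyword -> set of synonyms.
KEYWORD_LIBRARY = {
    "归一化": {"标准化", "normalize"},  # "normalize" and its synonyms
    "分类": {"classify"},               # "classify" and its synonyms
}


def canonical_keyword(word):
    """Return the library keyword that `word` matches, or None if there is none."""
    for keyword, synonyms in KEYWORD_LIBRARY.items():
        if word == keyword or word.lower() in synonyms:
            return keyword
    return None


def replace_verbs(tagged_words):
    """Replace each matched verb with a placeholder.

    Returns the words of the input sentence and a record table that maps
    each placeholder to the keyword of the verb it replaced.
    """
    input_words, record_table = [], {}
    for word, pos in tagged_words:
        keyword = canonical_keyword(word) if pos == "v" else None
        if keyword is None:
            input_words.append(word)  # not a verb, or a verb outside the library
            continue
        placeholder = f"VERB{len(record_table) + 1}"
        record_table[placeholder] = keyword
        input_words.append(placeholder)
    return input_words, record_table


# Assumed output of S2 for "first normalize the data, then classify".
tagged = [("先", "d"), ("标准化", "v"), ("数据", "n"), ("再", "d"), ("分类", "v")]
input_words, record_table = replace_verbs(tagged)
print(input_words)
print(record_table)
```

Storing the canonical keyword rather than the raw verb in the record table is a design choice of this sketch; it lets a synonym such as "标准化" resolve to the same function category as "归一化" when the mapping is performed later.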
S4, performing syntactic analysis on the input sentence based on a probabilistic context-free grammar (PCFG), so as to parse a sentence structure conforming to the grammar rules and generate a syntax tree. If only one syntax tree is generated, it is used as the final syntax tree; if a plurality of syntax trees are generated because of parsing ambiguity, the occurrence probability of each generated syntax tree is calculated, and the syntax tree with the largest occurrence probability is used as the final syntax tree.
The probabilistic context-free grammar belongs to the prior art. It generally performs syntactic analysis on the input sentence according to a preset rule set, in which each rule is provided with a corresponding probability.
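The disambiguation in S4 can be illustrated with a small sketch. It does not implement a parser: the two candidate trees are written by hand, and the rule set and its probabilities are invented for illustration. It shows only the standard PCFG scoring, in which the probability of a tree is the product of the probabilities of the rules used in it.

```python
# Hypothetical rule set: (left-hand side, right-hand side) -> probability.
# The probabilities of rules sharing a left-hand side sum to 1.
RULES = {
    ("S", ("VP", "VP")): 1.0,
    ("VP", ("V", "ADV")): 0.2,
    ("VP", ("ADV", "V")): 0.5,
    ("VP", ("V",)): 0.3,
    ("V", ("VERB1",)): 0.5,
    ("V", ("VERB2",)): 0.5,
    ("ADV", ("random_forest",)): 1.0,
}


def label(node):
    """Label of a node: the word itself for a leaf, else the first tuple element."""
    return node if isinstance(node, str) else node[0]


def tree_probability(tree, rules):
    """Product of the probabilities of all rules used in the tree."""
    if isinstance(tree, str):  # a leaf word applies no rule
        return 1.0
    lhs, children = tree[0], tree[1:]
    rhs = tuple(label(child) for child in children)
    probability = rules.get((lhs, rhs), 0.0)  # unknown rule: probability 0
    for child in children:
        probability *= tree_probability(child, rules)
    return probability


def most_probable_tree(trees, rules):
    """Pick the candidate tree with the largest probability."""
    return max(trees, key=lambda tree: tree_probability(tree, rules))


# Two parses of "VERB1 random_forest VERB2": the adverb attaches to the
# first verb (TREE_A) or to the second verb (TREE_B).
TREE_A = ("S", ("VP", ("V", "VERB1"), ("ADV", "random_forest")), ("VP", ("V", "VERB2")))
TREE_B = ("S", ("VP", ("V", "VERB1")), ("VP", ("ADV", "random_forest"), ("V", "VERB2")))

final_tree = most_probable_tree([TREE_A, TREE_B], RULES)
print(final_tree is TREE_B)
```

Under these example probabilities TREE_A scores 0.2 × 0.5 × 0.3 × 0.5 = 0.015 and TREE_B scores 0.3 × 0.5 × 0.5 × 0.5 = 0.0375, so TREE_B is selected as the final syntax tree.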
S5, extracting all placeholders in sequence to form a placeholder sequence according to the level and the sequence of leaf nodes in the final syntax tree; and mapping each placeholder in the placeholder sequence into a corresponding algorithm module stored in the machine learning platform in advance, so that all the mapped algorithm modules are connected in sequence to form a model flow chart. Wherein:
If a placeholder has an adverb belonging to the same subtree in the syntax tree, the corresponding set is searched for in the machine learning platform according to the replaced verb corresponding to the placeholder, the corresponding algorithm module is then searched for in that set according to the adverb in the same subtree, and the algorithm module is added to the model flow chart;
If a placeholder has no adverb belonging to the same subtree in the syntax tree, the corresponding set is searched for in the machine learning platform directly according to the replaced verb corresponding to the placeholder, all algorithm modules in that set are associated, as candidate modules, with an algorithm module identifier, and the algorithm module identifier is added to the model flow chart so that the user can specify the final algorithm module.
Note that a placeholder and an adverb belonging to the same subtree in the syntax tree means that the placeholder and the adverb have a common parent node.
In the embodiment of the present invention, to facilitate subsequent lookup, the relationship between each placeholder and the verb it replaces is stored in a record table while verbs are being replaced with placeholders. Thus, when the mapping is performed in S5, the replaced verb in the Chinese description sentence corresponding to each placeholder is looked up in the record table.
In addition, after the model flow chart is built, if an algorithm module identifier exists, a manual assignment flow can be initiated for the user; the user selects one of the candidate modules, and the selected algorithm module replaces the algorithm module identifier in the model flow chart.
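The mapping in S5 can be sketched as a level-order traversal followed by a lookup. Everything concrete here is an assumption made for illustration: the tree encoding (a leaf is a `(tag, word)` pair, an inner node is a `(label, [children])` pair), the tags `"V"` and `"ADV"`, the table contents, and the rule that an adverb names a module directly. The text does not say what happens when an adverb matches no module; this sketch falls back to an identifier in that case.

```python
from collections import deque

# Hypothetical lookup table (function category -> module names) and record
# table (placeholder -> function category of the replaced verb).
LOOKUP = {"normalize": {"min_max", "z_score"}, "classify": {"random_forest", "svm"}}
RECORD = {"VERB1": "normalize", "VERB2": "classify"}

# Hypothetical final syntax tree for "VERB1 random_forest VERB2".
TREE = ("S", [
    ("VP", [("V", "VERB1")]),
    ("VP", [("ADV", "random_forest"), ("V", "VERB2")]),
])


def placeholders_with_adverbs(tree):
    """Level-order walk returning (placeholder, sibling adverb or None) pairs."""
    result, queue = [], deque([tree])
    while queue:
        _label, children = queue.popleft()
        if isinstance(children, str):
            continue  # a leaf; it was already handled through its parent
        # Adverbs that share this parent node with the placeholders below.
        adverbs = [c[1] for c in children if c[0] == "ADV" and isinstance(c[1], str)]
        for child in children:
            if child[0] == "V" and isinstance(child[1], str):
                result.append((child[1], adverbs[0] if adverbs else None))
            else:
                queue.append(child)
    return result


def build_flow(tree, record_table, lookup):
    """Map each placeholder to a module, or to an identifier with candidates."""
    flow = []
    for placeholder, adverb in placeholders_with_adverbs(tree):
        candidates = lookup[record_table[placeholder]]
        if adverb in candidates:
            flow.append({"module": adverb})
        else:
            # No usable adverb: leave the choice to the user.
            flow.append({"identifier": placeholder, "candidates": sorted(candidates)})
    return flow


flow = build_flow(TREE, RECORD, LOOKUP)
print(flow)
```

In this example the first step becomes an identifier whose candidates are the two normalization modules, which the user resolves manually, while the second step resolves directly to `random_forest`.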
The semi-automatic model building method based on PCFG can be applied to a machine learning platform, so that the machine learning platform has the capability of semi-automatically building a machine learning model. Therefore, the invention provides a machine learning platform which is used for semi-automatically constructing a machine learning model, supporting reading data from a user database, supporting a self-defined algorithm and supporting data exchange among cross-language algorithms. The machine learning platform comprises a data import module, a data format conversion module, a machine learning model construction module and a resource layer.
The data import module is used when the user's data set is so large that uploading it as a file fails, for example because of network fluctuation. It connects to the user database, executes a query according to the user's SQL statement to read the specified data from the user database, and stores the data on the server side for use by subsequent algorithms.
The data format conversion module is used, when data is exchanged between adjacent algorithm modules, for converting the intermediate data produced by the previous algorithm module into a format that the next algorithm module can parse, and for passing it to the next algorithm module. The data format conversion module converts the output data of the previous algorithm module into JSON format, and then converts the JSON data into the input data format of the next algorithm module.
The machine learning model construction module consists of two construction services: PCFG-based semi-automatic construction and visual drag construction. The PCFG-based semi-automatic construction performs syntactic analysis on the sentence. Syntactic analysis is the process of obtaining a syntactic structure by analyzing how words combine, and thereby determining the syntactic structure of the input sentence; in essence it is an evaluation method over a set of candidate trees, in which the syntax tree with the highest score is finally selected as the parsing result. The generated machine learning model flow chart can then be modified in detail through the visual drag construction service.
Because algorithms written in various machine learning programming languages are used when training a machine learning model, the resource layer serves as the bottom-layer support and integrates the algorithm libraries of the various languages and the runtime environments they require.
In one embodiment of the invention, a machine learning platform framework employing the above PCFG-based semi-automatic model building method is illustrated. As shown in Fig. 1, the machine learning platform includes a data import module, a data format conversion module, a machine learning model construction module, and a resource layer. Because algorithms written in various machine learning programming languages are used when training a machine learning model, the resource layer serves as the bottom-layer support and integrates the algorithm libraries of the various languages and the runtime environments they require, such as the Python algorithm library and runtime environment and the R algorithm library and runtime environment illustrated in Fig. 1. A MATLAB algorithm library and runtime environment and the like can also be included; the examples are not limiting.
The data import module is used when the user data is stored in a database and the platform needs to connect to the user database to read it. By executing a query according to the user's SQL statement, it reads the specified data from the user database and stores it on the server for use by subsequent algorithms. For example, if the user's data is stored in a plurality of tables in the user database, the user specifies the field names and table names to be queried, and the specified data is obtained through the data import module.
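The data import step can be sketched as follows. An in-memory SQLite database stands in for the user database purely so that the example is self-contained; the text does not say which database systems the platform supports. The table, the field names, the CSV storage format and the function names are all hypothetical.

```python
import csv
import io
import sqlite3


def import_data(connection, sql):
    """Run the user's SQL query and return (column names, rows)."""
    cursor = connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


def store_as_csv(columns, rows, target):
    """Write the imported data to a file-like object on the server side."""
    writer = csv.writer(target)
    writer.writerow(columns)
    writer.writerows(rows)


# Stand-in for the user database, with one hypothetical table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE samples (id INTEGER, height REAL, label TEXT)")
connection.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, 1.7, "a"), (2, 1.6, "b")],
)

# The user specifies the field names and the table name in the SQL statement.
columns, rows = import_data(connection, "SELECT height, label FROM samples ORDER BY id")
buffer = io.StringIO()  # stands in for a file on the server
store_as_csv(columns, rows, buffer)
print(columns)
print(rows)
```

Only the selected fields leave the database, so the platform never has to receive a full exported data file.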
The data format conversion module is used for converting the output data of an algorithm into JSON format and for converting JSON data into the input data of an algorithm; the appropriate conversion can be selected automatically according to the programming language of the algorithm module. In a machine learning model, two adjacent algorithm modules may be written in different programming languages; for example, algorithm module 1 may be written in Python and algorithm module 2 in R. Data exchange between the two modules then requires a data exchange format that both can parse. JSON is a lightweight text data exchange format that is self-descriptive, easy for people to read and write, easy for machines to parse and generate, and independent of language and platform; JSON parsers and JSON libraries are available for many different programming languages. JSON is therefore selected as the data exchange format. Since every algorithm module would otherwise need to perform format conversion on its input and output data, the functions of converting the output data of an algorithm into JSON format and converting JSON data into the input data of an algorithm are extracted from the algorithm modules as an independent data format conversion module, which ensures high cohesion within each algorithm module and low coupling between algorithm modules.
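The JSON exchange between two adjacent algorithm modules can be sketched as follows. The payload layout (`columns` plus `rows`) and the function names are assumptions made for illustration; the text fixes only that JSON is the exchange format.

```python
import json


def to_exchange_format(columns, rows):
    """Serialize the previous module's tabular output as a JSON string."""
    return json.dumps({"columns": columns, "rows": rows})


def from_exchange_format(payload):
    """Parse the JSON string into a column-oriented dict for the next module."""
    data = json.loads(payload)
    return {
        name: [row[i] for row in data["rows"]]
        for i, name in enumerate(data["columns"])
    }


# Output of a hypothetical upstream module.
payload = to_exchange_format(["height", "label"], [[1.7, "a"], [1.6, "b"]])

# Input for a hypothetical downstream module.
downstream_input = from_exchange_format(payload)
print(downstream_input)
```

A downstream module written in another language would read the same payload with its own JSON library, which is what keeps the two modules decoupled.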
The machine learning model construction module provides two services: PCFG-based automatic construction and visual drag-and-drop construction. PCFG-based automatic construction performs syntactic analysis of sentences. Syntactic analysis is the process of determining the syntactic structure of an input sentence by analyzing how its words combine. In essence it is a method for scoring candidate trees: a correct syntax tree receives a higher score, an unreasonable syntax tree receives a lower score, and the tree with the highest score is selected as the final analysis result. With the PCFG-based automatic construction service, the user enters a Chinese sentence describing the target machine learning model; the PCFG parses the sentence into a syntax tree, which is then converted into the corresponding machine learning model flow chart. The user can then modify the details of the generated flow chart through the visual drag-and-drop construction service. Alternatively, the user can build a flow chart directly by dragging and connecting a series of algorithm modules. The machine learning model flow chart produced by either service, or by both together, is a directed acyclic graph.
Phrase structure analysis based on a probabilistic context-free grammar (PCFG) is currently a relatively mature syntactic analysis model, and can be regarded as a combination of rule-based and statistical methods. A PCFG is a generative method whose phrase structure grammar can be expressed as a five-tuple (X, V, S, R, P):
X is a finite set of words, whose elements are called words or terminal symbols.
V is a finite set of labels, called the set of non-terminal symbols.
S is the start symbol of the grammar and is a member of V.
R is a set of ordered pairs (α, β), that is, the set of production rules.
P assigns a statistical probability to each production rule.
The following example shows how a PCFG derives a syntax tree. Table 1 lists the PCFG rule set: the first and third columns contain rules, and the second and fourth columns contain the probabilities of those rules. The names of the labels used in Table 1 are given in Table 2. Each rule in Table 1 is divided into two parts by an arrow. Taking IP→NP VP as an example, the left side of the arrow, IP, is a simple clause; the right side is a noun phrase followed by a verb phrase; and the probability that a simple clause is divided into a noun phrase and a verb phrase is 0.7.
TABLE 1
IP→NP VP 0.7 VP→VV QP 0.2
IP→VP 0.2 VP→ADVP VP 0.2
IP→NP 0.1 VP→VV 0.2
NP→NN 0.6 QP→CD 1.0
NP→NN NN 0.4 ADVP→AD 1.0
VP→VP PU 0.4
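As a minimal sketch, the rule set of Table 1 can be held in a small Python structure that mirrors the five-tuple (X, V, S, R, P). The class name `PCFG`, its field names, and the choice of IP as the start symbol are assumptions made for illustration; they do not come from the original text. The check confirms that the probabilities of all rules sharing a left-hand side sum to 1, which is the case for Table 1.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PCFG:
    terminals: frozenset      # X: finite set of words
    nonterminals: frozenset   # V: finite set of labels
    start: str                # S: start symbol, a member of V
    rules: dict               # R with P: (left side, right side) -> probability

    def is_normalized(self) -> bool:
        """True if the rules sharing a left-hand side have probabilities summing to 1."""
        totals = defaultdict(float)
        for (lhs, _rhs), probability in self.rules.items():
            totals[lhs] += probability
        return all(math.isclose(total, 1.0) for total in totals.values())


# Rules and probabilities copied from Table 1.
TABLE_1 = {
    ("IP", ("NP", "VP")): 0.7,
    ("IP", ("VP",)): 0.2,
    ("IP", ("NP",)): 0.1,
    ("NP", ("NN",)): 0.6,
    ("NP", ("NN", "NN")): 0.4,
    ("VP", ("VV", "QP")): 0.2,
    ("VP", ("ADVP", "VP")): 0.2,
    ("VP", ("VV",)): 0.2,
    ("VP", ("VP", "PU")): 0.4,
    ("QP", ("CD",)): 1.0,
    ("ADVP", ("AD",)): 1.0,
}

# Every symbol in Table 1 is a label; the words themselves are not listed there.
labels = {lhs for lhs, _ in TABLE_1} | {s for _, rhs in TABLE_1 for s in rhs}

grammar = PCFG(
    terminals=frozenset(),           # left empty: Table 1 gives no word list
    nonterminals=frozenset(labels),
    start="IP",                      # assumption: IP is used as the start symbol
    rules=TABLE_1,
)
print(grammar.is_normalized())
```

Storing each rule as a (left side, right side) key keeps the rule set R and the probability function P in one mapping, which is convenient for the lookups a parser performs.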
TABLE 2
For illustration, take the sentence ROOT: "the training set is normalized, then features are extracted, then clustered, and finally evaluated". The sentence is first segmented into words, with the words separated by spaces, and the segmentation result is then parsed with the PCFG using the rule set above. Fig. 2 shows the syntax tree obtained for the sentence by PCFG parsing. The sentence is first divided into a verb phrase and a noun phrase; the noun phrase is divided into two nouns, and the verb phrase is divided into several verb phrases combined with punctuation marks. Among these, the first verb phrase is divided into a verb phrase and an adverb phrase, and the remaining three verb phrases are each divided into an adverb and a verb. In this way the PCFG parses the input sentence, obtains a sentence structure that conforms to the grammar rules, and generates a syntax tree. If only one syntax tree is generated, it is taken as the final syntax tree. If syntactic ambiguity yields several syntax trees, the occurrence probability of each generated tree is calculated and the tree with the highest probability is taken as the final syntax tree. The final syntax tree is then traversed level by level. In general, the child nodes of the same parent node are ordered from left to right, so a level-order traversal yields the order of the nodes within each level. Because a mapping is established between verbs and algorithm modules, the execution order of the algorithm modules corresponding to the phrases of each level can be obtained from this mapping, as shown in fig. 3.
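The selection of the most probable tree can be sketched as follows. Under a PCFG, the probability of a syntax tree is the product of the probabilities of the rules used to build it. The tree encoding (nested tuples), the function names, and the assumption that every rule from a part-of-speech tag to a word has probability 1 are illustrative choices, since Table 1 lists no lexical probabilities. The two candidate trees are toy inputs for exercising the functions, not the parses of fig. 2.

```python
import math

# Rule probabilities copied from Table 1, keyed by (left side, right side).
RULES = {
    ("IP", ("NP", "VP")): 0.7,
    ("IP", ("VP",)): 0.2,
    ("IP", ("NP",)): 0.1,
    ("NP", ("NN",)): 0.6,
    ("NP", ("NN", "NN")): 0.4,
    ("VP", ("VV", "QP")): 0.2,
    ("VP", ("ADVP", "VP")): 0.2,
    ("VP", ("VV",)): 0.2,
    ("VP", ("VP", "PU")): 0.4,
    ("QP", ("CD",)): 1.0,
    ("ADVP", ("AD",)): 1.0,
}


def tree_probability(tree, rules=RULES):
    """Multiply the probabilities of every rule used in the tree.

    A tree is (label, child, ...). A part-of-speech node has a single
    string child, the word itself.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # assumption: tag -> word rules have probability 1
    rhs = tuple(child[0] for child in children)
    probability = rules[(label, rhs)]
    for child in children:
        probability *= tree_probability(child, rules)
    return probability


def best_tree(candidates, rules=RULES):
    """Return the candidate tree with the highest probability."""
    return max(candidates, key=lambda tree: tree_probability(tree, rules))


# Two toy candidates that differ in how the noun phrase is built.
tree_a = ("IP", ("NP", ("NN", "training_set")), ("VP", ("VV", "normalize")))
tree_b = ("IP",
          ("NP", ("NN", "training"), ("NN", "set")),
          ("VP", ("VV", "normalize")))

for tree in (tree_a, tree_b):
    print(tree_probability(tree))
print(best_tree([tree_b, tree_a]) is tree_a)
```

Here tree_a uses IP→NP VP, NP→NN and VP→VV, giving 0.7 × 0.6 × 0.2, while tree_b replaces NP→NN with NP→NN NN, giving 0.7 × 0.4 × 0.2, so tree_a is selected.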
Fig. 4 shows the flow of the PCFG-based semi-automatic construction algorithm. The vocabulary set of the PCFG algorithm used by the machine learning platform is Chinese, whereas most algorithm names in the machine learning field are English terms. Therefore, after the given sentence has been segmented and before it is converted into a syntax tree, keyword replacement must be performed on its verbs. To support this replacement, the semi-automatic construction module maintains an algorithm keyword library for the algorithm modules available on the platform, which records the name attribute tags and function attribute tags of all algorithm modules. The verbs tagged during word segmentation are scanned once, and every verb found in the algorithm keyword library is replaced by a special mark. This mark is a placeholder that is included in the vocabulary set as a verb phrase. The replaced verbs are collected, in their order of appearance in the sentence, into a list of replaced verbs. After the sentence with placeholders has passed through the PCFG algorithm flow, the replaced verbs are put back into their original positions in order. Each verb keyword is then mapped to its designated algorithm module according to the mapping between verb keywords and algorithm modules. The flow in fig. 4 can be described as follows:
Step 1: Acquire the Chinese description sentence that is input by the user and describes the machine learning model flow chart to be constructed.
Step 2: Perform word segmentation on the Chinese description sentence to obtain a word segmentation result with part-of-speech tags.
Step 3: Match each verb in the word segmentation result against the algorithm keyword library. If a verb matches a keyword in the library, replace it with a placeholder representing that verb, so that the Chinese description sentence is converted into an input sentence with placeholders.
Step 4: Parse the input sentence with placeholders using the PCFG to obtain the final syntax tree.
Step 5: Extract all placeholders in order, according to the level and order of the leaf nodes in the final syntax tree, to form a placeholder sequence. Map each placeholder in the sequence, in order, to the corresponding algorithm module stored in advance on the machine learning platform, and connect the mapped algorithm modules in order to form the model flow chart. In this mapping:
If an adverb belongs to the same subtree of the syntax tree as the placeholder, the set corresponding to the replaced verb of the placeholder is looked up on the machine learning platform, the algorithm module corresponding to the adverb is then looked up within that set, and the algorithm module is added to the model flow chart.
If no adverb belongs to the same subtree of the syntax tree as the placeholder, the set corresponding to the replaced verb of the placeholder is looked up on the machine learning platform, all algorithm modules in that set are associated with an algorithm module identifier as candidate modules, and the identifier is added to the model flow chart so that the user can specify the final algorithm module.
The process described above of looking up a set or an algorithm module by verb or adverb is essentially word matching: a successful match is treated as having found the corresponding set or module.
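The placeholder handling and the level-order extraction in the flow above can be sketched as follows. The placeholder spelling `<ALG>`, the function names, the English stand-in words, and the toy syntax tree are all assumptions made for illustration; the platform's actual word segmenter and parser are not shown.

```python
from collections import deque

PLACEHOLDER = "<ALG>"  # hypothetical spelling of the special mark


def replace_verbs(tagged_words, keyword_library):
    """Swap verbs found in the keyword library for the placeholder.

    Returns the masked sentence and the replaced verbs in order of appearance.
    """
    replaced, output = [], []
    for word, tag in tagged_words:
        if tag == "VV" and word in keyword_library:
            replaced.append(word)
            output.append((PLACEHOLDER, tag))
        else:
            output.append((word, tag))
    return output, replaced


def restore_verbs(tagged_words, replaced):
    """Put the replaced verbs back into their original positions, in order."""
    pending = iter(replaced)
    return [(next(pending), tag) if word == PLACEHOLDER else (word, tag)
            for word, tag in tagged_words]


def verbs_in_level_order(tree):
    """Breadth-first walk: shallower nodes first, left to right within a level."""
    order = []
    queue = deque([tree])
    while queue:
        label, *children = queue.popleft()
        if len(children) == 1 and isinstance(children[0], str):
            if label == "VV":
                order.append(children[0])
        else:
            queue.extend(children)
    return order


sentence = [("training_set", "NN"), ("go", "VV"), ("normalize", "VV"),
            (",", "PU"), ("then", "AD"), ("cluster", "VV")]
library = {"normalize", "cluster", "evaluate"}

masked, replaced = replace_verbs(sentence, library)
restored = restore_verbs(masked, replaced)

# Toy tree with the verbs already restored at the leaves.
toy_tree = ("VP",
            ("VP", ("VV", "normalize")),
            ("PU", ","),
            ("VP", ("ADVP", ("AD", "then")), ("VP", ("VV", "cluster"))))
print(replaced, verbs_in_level_order(toy_tree))
```

Keeping the replaced verbs in a list ordered by appearance is what allows a single placeholder symbol to be restored unambiguously after parsing.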
In this embodiment, when verb keywords are mapped to algorithm modules, only the verbs at placeholder positions are considered; other verbs, such as "go", are not mapped to algorithm modules. When an adverb belongs to the same subtree of the syntax tree as the placeholder, fig. 5 shows the specific steps for building the mapping between verbs and adverbs and the algorithm modules, and for looking up the corresponding algorithm module. Each algorithm module on the machine learning platform has a name attribute tag and a function attribute tag. The name attribute tag is the name of the algorithm module, and different names denote different algorithm modules. The function attribute tag denotes the functional category of the algorithm module. Machine learning algorithms can generally be divided into two broad groups: conventional algorithms and deep learning algorithms. Conventional algorithms can be further divided into decision trees, random forests, logistic regression, k-means clustering, and so on. Deep learning algorithms can be further divided into back propagation, feedforward neural networks, convolutional neural networks, and so on, and these neural networks are composed of basic modules such as convolution modules, activation functions, linear interpolation, and attention modules. The algorithms on the machine learning platform are therefore grouped into several major classes by functional attribute, such as clustering, convolution, pooling, and activation functions. Within each major class, the algorithms are divided into independent algorithm modules by algorithm name; for example, the activation function class contains ReLU, GELU, softmax, and so on.
The name attribute and function attribute of each algorithm module serve as tags, so every algorithm module has two tags: a name tag and a function tag. The algorithm modules are first divided into major classes by function tag, and each major class is then divided into independent algorithm modules by name tag, which establishes the mapping from function tag and name tag to algorithm module. Each time the machine learning model construction module automatically builds a machine learning model, it maps to an algorithm module using the current verb keyword and adverb keyword. The lookup has two steps. Step 1: find the major class of the algorithm module according to the verb keyword. Step 2: find the specific algorithm module within that major class according to the adverb keyword. For example, to look up the module for "softmax activation function", the major class of activation functions is found from the verb keyword "activation function", and the specific softmax algorithm module is then found from the keyword "softmax". If, however, no adverb belongs to the same subtree of the syntax tree as the placeholder, the corresponding set is added to the flow chart only in the form of an algorithm module identifier. The user is later prompted graphically in the chart to make the choice, and the candidates are all the algorithm modules contained in the set. Once the user has chosen, the algorithm module identifier in the flow chart is replaced by the chosen algorithm module.
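The two-step lookup can be sketched with a nested dictionary, where the outer key is the function tag and the inner key is the name tag. The dictionary contents, the function name `find_module`, and the shape of the returned values are assumptions made for illustration. Falling back to the candidate list when the adverb matches nothing in the set is also an assumption, since the text only covers the case where no adverb is present.

```python
# Function tag -> {name tag -> algorithm module}. Module objects are
# represented by plain strings in this sketch.
MODULE_LIBRARY = {
    "activation function": {
        "relu": "ReLU module",
        "gelu": "GELU module",
        "softmax": "softmax module",
    },
    "clustering": {
        "k-means": "k-means module",
    },
}


def find_module(verb, adverb=None, library=MODULE_LIBRARY):
    """Step 1: find the major class by verb. Step 2: find the module by adverb."""
    candidates = library[verb]
    if adverb is not None and adverb in candidates:
        return {"module": candidates[adverb]}
    # No usable adverb: return an identifier that carries every module in
    # the set, so the user can pick one later.
    return {"identifier": verb, "candidates": sorted(candidates)}


print(find_module("activation function", "softmax"))
print(find_module("activation function"))
```

The first call resolves directly to a module, while the second returns an identifier with the whole activation function set as candidates, matching the case in which the user completes the choice in the flow chart.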
As shown in fig. 6, the underlying database and the machine learning platform are connected through database middleware. The database middleware sits between the underlying database and the machine learning platform; its main purpose is to hide the low-level details of heterogeneous databases, and it acts as the bridge between the machine learning platform and the user database. For example, if the user's data is stored in a MySQL database, the user does not need to consider how to connect to the database to obtain the data. The user only needs to provide the URL address and port number of the host on which the database is installed, together with the user name and password, and the platform automatically selects the appropriate database middleware for the selected database and connects to it.
The machine learning platform can query the data in database tables through the middleware and export it. When the data volume is large, reading all of the data into memory before processing it can easily cause a memory overflow. To avoid this, the platform processes the data in segments using streaming reads: it reads one segment of data at a time and writes it to a temporary file on the server where the platform runs, until all of the data has been read and written to the server. Data obtained in this way, more efficiently and more safely, serves as the data source of the machine learning model. For example, when 2,000,000 records must be read from the database to the server, 1,000 records are read from the database into memory at a time, and the buffered records are written to the file every 100,000 records. This requires 2,000 interactions with the database and 20 file writes, which reduces the network cost of accessing the database while making as much use as possible of the memory of the server where the platform runs, and so speeds up data reading.
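The segmented read can be sketched with Python's built-in sqlite3 module standing in for the user database. The function name `export_rows`, the table, and the CSV-style output are assumptions made for illustration. The demonstration uses 20,000 rows, 10 rows per read, and a flush every 1,000 rows, so that the counts match the example above: 2,000 reads and 20 file writes.

```python
import os
import sqlite3
import tempfile


def export_rows(conn, query, out_path, fetch_size=1_000, flush_every=100_000):
    """Stream query results to a file; return (database reads, file writes)."""
    cursor = conn.execute(query)
    buffer, reads, writes = [], 0, 0
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            rows = cursor.fetchmany(fetch_size)  # one segment per round trip
            if not rows:
                break
            reads += 1
            buffer.extend(rows)
            if len(buffer) >= flush_every:
                out.writelines(",".join(map(str, row)) + "\n" for row in buffer)
                buffer.clear()
                writes += 1
        if buffer:  # flush whatever is left after the last read
            out.writelines(",".join(map(str, row)) + "\n" for row in buffer)
            writes += 1
    return reads, writes


# In-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, value REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 ((i, i * 0.5) for i in range(20_000)))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.csv")
    reads, writes = export_rows(conn, "SELECT id, value FROM samples", path,
                                fetch_size=10, flush_every=1_000)
    with open(path, encoding="utf-8") as f:
        line_count = sum(1 for _ in f)
conn.close()
print(reads, writes, line_count)
```

At no point does the buffer hold more than `flush_every` rows plus one segment, which is what keeps memory use bounded regardless of the table size.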
As shown in fig. 7, the database middleware is designed in server-side proxy mode. A proxy service is deployed independently and manages multiple database instances behind it. The application establishes a connection to the proxy server through a data source, and the proxy operates on the underlying database and returns the corresponding results. This design supports multiple languages and is transparent to the business logic. All data sources are managed uniformly through the middleware proxy layer, so the back-end database cluster is transparent to the front-end application and easy to scale, and an independent service can provide more processing capacity. For example, database 1 of user 1 may hold data with a strict tabular structure of rows and columns, in which each row contains one record and each column contains one specific type of information; such data is generally stored in a relational database such as MySQL. Database 2 of user 2 may be a non-relational database, such as a key-value store, document store, graph store, time-series database, or wide-column store, for example MongoDB. The database middleware hides the low-level details of these heterogeneous databases so that they can be operated efficiently and safely.
Suppose database 3 of user 3 is also a MySQL database, like that of user 1, but of a different version. The driver class in the JAR dependency for MySQL version 5.1.6 is com.mysql.jdbc.Driver, whereas the driver class in the JAR dependency for MySQL version 8.0.25 is com.mysql.cj.jdbc.Driver. If the version 8.0.25 JAR is used, driving a version 5.1.6 MySQL database with com.mysql.cj.jdbc.Driver produces an error. A unified database middleware service in server-side proxy mode avoids this problem: because the service is deployed separately, upgrading the proxy server is effectively equivalent to upgrading every application connected to the proxy.
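The proxy idea can be sketched in a few lines: the application talks to one object, and that object routes each request to a connector chosen by database kind. Everything here is hypothetical, including the class names, the `fetch` interface, and the two in-memory stand-ins for a relational and a document database; a real proxy would run as a separate service and use actual database drivers.

```python
class DatabaseProxy:
    """Single entry point that hides which kind of database sits behind it."""

    def __init__(self):
        self._factories = {}   # database kind -> connector factory
        self._sources = {}     # data source name -> connector instance

    def register(self, kind, factory):
        self._factories[kind] = factory

    def add_source(self, name, kind, **settings):
        if kind not in self._factories:
            raise ValueError(f"no connector registered for {kind!r}")
        self._sources[name] = self._factories[kind](**settings)

    def fetch(self, name, request):
        return self._sources[name].fetch(request)


class InMemoryTable:
    """Stand-in for a relational connector: rows are tuples."""

    def __init__(self, rows):
        self.rows = rows

    def fetch(self, request):
        return [row[request["column"]] for row in self.rows]


class InMemoryDocuments:
    """Stand-in for a document-store connector: records are dicts."""

    def __init__(self, documents):
        self.documents = documents

    def fetch(self, request):
        return [doc[request["field"]] for doc in self.documents]


proxy = DatabaseProxy()
proxy.register("relational", InMemoryTable)
proxy.register("document", InMemoryDocuments)
proxy.add_source("user1", "relational", rows=[(1, 0.5), (2, 0.7)])
proxy.add_source("user2", "document", documents=[{"x": 3}, {"x": 4}])

print(proxy.fetch("user1", {"column": 1}))
print(proxy.fetch("user2", {"field": "x"}))
```

Because callers only see `DatabaseProxy.fetch`, swapping or upgrading a connector changes one registration in the proxy rather than every application that uses it.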
Fig. 8 shows the structural design of a machine learning model object. Each machine learning model object comprises an algorithm module part, a data set part, and a model parameter part. The algorithm module part contains the algorithms used by the machine learning model and the connections among the algorithm modules. The connected algorithm modules form a directed acyclic graph, so the algorithm modules are stored in a linked list in the topological order of the graph's nodes, and the connections among them can be stored as an adjacency matrix. Because the resulting directed acyclic graph is generally sparse, an adjacency list can be used instead to save storage space: each module uses a linked list to record its successor modules by their sequence numbers in the module list. The data set part represents the training set and test set used by the machine learning model. The model parameter part represents the initial preset values of the machine learning model or the parameter values after training is complete. For example, the user generates a directed acyclic graph with the machine learning model construction module that contains four algorithm modules, 1, 2, 3, and 4: algorithm module 1 may be a data cleaning algorithm, algorithm module 2 a K-means algorithm, algorithm module 3 an ANN algorithm, and algorithm module 4 an intelligent weighting algorithm over algorithm modules 2 and 3.
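The two storage forms and the topological ordering can be sketched as follows for a four-module graph in which module 1 feeds modules 2 and 3, and both feed module 4. The function names are illustrative, and Kahn's algorithm is used here as one standard way to compute a topological order; the source does not say which algorithm the platform uses.

```python
from collections import deque

# Adjacency list: module number -> successor module numbers.
ADJACENCY = {1: [2, 3], 2: [4], 3: [4], 4: []}


def topological_order(adjacency):
    """Kahn's algorithm: repeatedly emit a node with no unprocessed predecessor."""
    indegree = {node: 0 for node in adjacency}
    for successors in adjacency.values():
        for successor in successors:
            indegree[successor] += 1
    queue = deque(node for node in adjacency if indegree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for successor in adjacency[node]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                queue.append(successor)
    if len(order) != len(adjacency):
        raise ValueError("graph contains a cycle")
    return order


def to_adjacency_matrix(adjacency):
    """Dense form of the same graph: matrix[i][j] is 1 if node i points to node j."""
    nodes = sorted(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for node, successors in adjacency.items():
        for successor in successors:
            matrix[index[node]][index[successor]] = 1
    return matrix


print(topological_order(ADJACENCY))
print(to_adjacency_matrix(ADJACENCY))
```

The matrix for this graph holds 16 cells for 4 edges, while the adjacency list stores only the 4 successor entries, which illustrates why the list form saves space on sparse graphs.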
An example of a machine learning model object follows. The name field is the name of the machine learning model and identifies different machine learning models. The algorithms field lists the algorithm modules used by the model in topological order; there are four here. The graph field gives the connections among the modules listed in algorithms: the successors of algorithm module 1 are algorithm modules 2 and 3, the successor of algorithm modules 2 and 3 is algorithm module 4, and algorithm module 4 has no successor. The parameters field holds the parameters of the model, three in this example. The datas field contains the training set, test set, and validation set of the model.
{
  "model": {
    "name": "",  // name of the machine learning model
    "algorithms": ["algorithm1", "algorithm2", "algorithm3", "algorithm4"],
    "graph": [
      [2, 3],
      [4],
      [4],
      []
    ],
    "parameters": {  // empty if there are no parameters
      "key1": "value1",
      "key2": "value2",
      "key3": "value3"
    },
    "datas": {
      "trainingSet": {},  // empty if there is no training set
      "testSet": {},  // empty if there is no test set
      "validationSet": {}  // empty if there is no validation set
    }
  }
}
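As a small sketch of how such a model object could be consumed, the snippet below parses a strict-JSON version of the example (comments removed, since standard JSON does not allow them) and turns the graph field into named edges. The model name "example-model", the key spellings trainingSet, testSet, and validationSet, and the function name `edges_by_name` are assumptions made for illustration. The graph entries are treated as 1-based positions in the algorithms list, which is how the example reads.

```python
import json

MODEL_JSON = """
{
  "model": {
    "name": "example-model",
    "algorithms": ["algorithm1", "algorithm2", "algorithm3", "algorithm4"],
    "graph": [[2, 3], [4], [4], []],
    "parameters": {"key1": "value1", "key2": "value2", "key3": "value3"},
    "datas": {"trainingSet": {}, "testSet": {}, "validationSet": {}}
  }
}
"""


def edges_by_name(model):
    """Translate the 1-based successor lists into (source, target) name pairs."""
    names = model["algorithms"]
    return [(names[i], names[successor - 1])
            for i, successors in enumerate(model["graph"])
            for successor in successors]


model = json.loads(MODEL_JSON)["model"]
edges = edges_by_name(model)
print(edges)
```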
Fig. 9 shows the data exchange between two adjacent algorithm modules. When the algorithm execution module actually executes the algorithm modules of the directed acyclic graph built by the machine learning model construction module, data is exchanged between adjacent algorithm modules, so that the preceding algorithm module can pass the parameters of the constructed machine learning model, together with the intermediate data produced as each algorithm module processes the original data, to the next algorithm module. The data format conversion module converts the data in the output part of the preceding algorithm module into JSON format, and then converts that JSON data into the data format of the input part of the next algorithm module. Each algorithm module therefore only needs to implement its own execution logic and does not need to handle data format conversion, which gives high cohesion within modules and low coupling between them. For example, suppose the output of algorithm module 1 contains output data of the y class in Python and three model parameters of the float class, while the input of algorithm module 2 requires input data of the y class in R and model parameters of the numeric class. The Python data of algorithm module 1 is first converted into JSON format by the output format conversion function of the data format conversion module, and the JSON data is then converted into a format acceptable to R by the module's input format conversion function.
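The separation between algorithm logic and format conversion can be sketched with a wrapper that handles the JSON on both sides of a module. The decorator name `json_exchange` and the two toy modules are assumptions made for illustration. Both modules are written in Python here so that the sketch runs on its own; in the scenario described above the receiving side would decode the same JSON text with an R JSON library instead.

```python
import json


def json_exchange(module_fn):
    """Wrap a module so that it receives and returns JSON text.

    The wrapped function only sees native Python values; decoding the
    input and encoding the output happen here, outside the module logic.
    """
    def wrapper(payload: str) -> str:
        inputs = json.loads(payload)
        outputs = module_fn(**inputs)
        return json.dumps(outputs)
    return wrapper


@json_exchange
def scale(data, factor):
    # Toy module 1: outputs processed data plus a parameter for the next module.
    return {"data": [x * factor for x in data], "offset": 1.0}


@json_exchange
def shift(data, offset):
    # Toy module 2: consumes the data and parameter produced by module 1.
    return {"data": [x + offset for x in data]}


first_output = scale(json.dumps({"data": [1, 2, 3], "factor": 2.0}))
result = shift(first_output)
print(result)
```

The keys of one module's output must match the argument names of the next module's input, which is the contract that the JSON exchange format makes explicit.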
An example of an algorithm module follows. The name field is the name of the algorithm module and identifies different algorithm modules. The inputs field holds the input that the module requires, here two parameters and one item of raw data to be processed. The output field holds the parameters output by the module and the processed intermediate data. The parameters field holds the initial parameter settings internal to the algorithm.
{
  "algorithm": {
    "name": "",  // algorithm name
    "function": "",  // algorithm function
    "inputs": {  // algorithm inputs
      "data": {},
      "key1": "value1",
      "key2": "value2"
    },
    "output": {  // algorithm output
      "data": {},
      "key1": "value1",
      "key2": "value2"
    },
    "parameters": {  // algorithm parameters
      "key1": "value1",
      "key2": "value2"
    }
  }
}
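One possible in-memory counterpart of this structure is a small data class whose fields mirror the example. The class name `AlgorithmModule`, the method name `to_json`, and the sample values are assumptions made for illustration, not part of the platform described here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AlgorithmModule:
    name: str                                       # name attribute tag
    function: str                                   # function attribute tag
    inputs: dict = field(default_factory=dict)      # raw data plus input parameters
    output: dict = field(default_factory=dict)      # intermediate data plus output parameters
    parameters: dict = field(default_factory=dict)  # initial internal settings

    def to_json(self) -> str:
        """Serialize in the same shape as the example, under an "algorithm" key."""
        return json.dumps({"algorithm": asdict(self)})


module = AlgorithmModule(
    name="softmax",
    function="activation function",
    inputs={"data": {}, "key1": "value1", "key2": "value2"},
    parameters={"key1": "value1"},
)
decoded = json.loads(module.to_json())
print(decoded["algorithm"]["name"])
```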
Fig. 10 shows the structural design of the resource layer. Because the resource layer is the underlying support of the platform, the server host on which it runs must have the runtime environments of several languages installed, and different servers may need different versions of those runtime environments because their operating systems differ. The runtime environments are therefore packaged and run in containers. Unlike a virtual machine, a container does not need to bundle a complete operating system, so the system is efficient and lightweight, and an application runs consistently in whatever environment it is deployed. For example, the Python runtime environment and its algorithm libraries are packaged from the bottom layer up and run in a container as an independent application process; the R runtime environment and its algorithm libraries are packaged in the same way and run in a container as an independent application process; and the Java runtime environment and its libraries are likewise packaged and run in a container as an independent application process. When the resource layer needs to be expanded, it is enough to deploy the corresponding container images on other servers, and the application is not affected by inconsistencies in the underlying infrastructure or operating system that would otherwise introduce new problems.
The embodiment described above is only a preferred embodiment of the present invention and is not intended to limit it. A person of ordinary skill in the relevant art may make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, all technical solutions obtained by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.

Claims (10)

1. A PCFG-based semi-automatic model construction method for automatically constructing a model flow chart on a machine learning platform, characterized in that callable algorithm modules are pre-stored on the machine learning platform, all algorithm modules are divided into a plurality of sets according to function category, each set contains all algorithm modules on the platform that belong to the same function category, each set is named by its function category name, and each algorithm module is named by its algorithm name;
the semi-automatic model construction method comprises the following steps:
S1, acquiring a Chinese description sentence that is input by a user and describes the machine learning model flow chart to be constructed;
S2, performing word segmentation on the Chinese description sentence to obtain a word segmentation result with part-of-speech tags;
S3, matching each verb in the word segmentation result against an algorithm keyword library, and replacing each verb in the Chinese description sentence that matches a keyword in the library with a placeholder representing the verb, to obtain an input sentence; wherein the module names and function attribute tags of all callable algorithm modules on the machine learning platform are pre-stored in the algorithm keyword library;
S4, parsing the input sentence based on a probabilistic context-free grammar (PCFG), so as to obtain a sentence structure conforming to the grammar rules and generate a syntax tree; if only one syntax tree is generated, taking it as the final syntax tree; if syntactic ambiguity yields a plurality of syntax trees, calculating the occurrence probability of each generated syntax tree and taking the syntax tree with the highest occurrence probability as the final syntax tree;
S5, extracting all placeholders in order, according to the level and order of the leaf nodes in the final syntax tree, to form a placeholder sequence; mapping each placeholder in the placeholder sequence, in order, to the corresponding algorithm module pre-stored on the machine learning platform, and connecting the mapped algorithm modules in order to form the model flow chart; wherein:
if an adverb belongs to the same subtree of the syntax tree as the placeholder, the set corresponding to the replaced verb of the placeholder is looked up on the machine learning platform, the algorithm module corresponding to the adverb in that subtree is then looked up within the set, and the algorithm module is added to the model flow chart;
if no adverb belongs to the same subtree of the syntax tree as the placeholder, the set corresponding to the replaced verb of the placeholder is looked up on the machine learning platform, all algorithm modules in the set are associated with an algorithm module identifier as candidate modules, and the algorithm module identifier is added to the model flow chart for the user to specify the final algorithm module.
2. The PCFG-based semi-automatic model construction method according to claim 1, characterized in that, in S3, the algorithm keyword library stores, in addition to the module name of each algorithm module, the synonyms of the module name, and when a verb is matched against the algorithm module names in the algorithm keyword library, the synonyms of the module names are included in the matching range.
3. The PCFG-based semi-automatic model construction method according to claim 1, characterized in that each algorithm module on the machine learning platform has a name attribute tag describing the algorithm name of the algorithm module and a function attribute tag describing the function category of the algorithm module; all algorithm modules are grouped by function category, and algorithm modules with the same function attribute tag are stored in the set of that function category, keyed by their name attribute tags.
4. The PCFG-based semi-automatic model construction method according to claim 1, characterized in that, in S3, the correspondence between each placeholder and the verb it replaced is stored in a record table; and when the mapping is performed in S5, the replaced verb in the Chinese description sentence corresponding to each placeholder is looked up from the record table.
5. The PCFG-based semi-automatic model construction method according to claim 1, characterized in that, after the model flow chart has been constructed, if it contains an algorithm module identifier, a manual specification procedure is initiated for the user, the user selects one of the candidate modules, and the algorithm module identifier in the model flow chart is replaced by the selected algorithm module.
6. The PCFG-based semi-automatic model construction method according to claim 1, characterized in that the probabilistic context-free grammar (PCFG) parses the input sentence according to a set of preset rules, each rule having a corresponding probability.
7. A machine learning platform, characterized by comprising a data import module, a data format conversion module, a machine learning model construction module, and a resource layer;
the data import module is used for reading specified data from a user database and storing it on the local server as the data source of the machine learning model;
the data format conversion module is used for exchanging data between adjacent algorithm modules of the model flow chart, wherein the data exchange uses JSON as the data exchange format, the output data of the preceding algorithm module is converted into a unified JSON format, and the converted JSON data is then converted into the input data of the next algorithm module;
the machine learning model construction module comprises a semi-automatic construction module and a visual drag-and-drop construction module, wherein the semi-automatic construction module is used for implementing the PCFG-based semi-automatic model construction method according to any one of claims 1-6 to generate an initial model flow chart, and the visual drag-and-drop construction module is used for modifying the initial model flow chart by visual dragging;
the resource layer is used for providing underlying support for the construction and operation of machine learning models, and integrates the callable algorithm modules and the runtime environments required to run them.
8. The machine learning platform for semi-automatically constructing a machine learning model according to claim 7, characterized in that the machine learning platform is connected to an underlying database through database middleware, the user data is stored in the underlying database, the database middleware is located between the underlying database and the machine learning platform, and after the user inputs the URL address and port number of the host on which the database is installed, together with the user name and password, the machine learning platform automatically selects the appropriate database middleware for the database and connects to it.
9. The machine learning platform for semi-automatically constructing a machine learning model according to claim 8, characterized in that the database middleware is designed in server-side proxy mode, in which a proxy service is deployed independently and manages a plurality of database instances behind it, the application establishes a connection to the proxy server through a data source, and the proxy operates on the underlying database and returns the corresponding results.
10. The machine learning platform for semi-automatically constructing a machine learning model according to claim 7, characterized in that each machine learning model object finally constructed by the machine learning platform comprises an algorithm module part, a data set part, and a model parameter part; the algorithm module part comprises a directed acyclic graph formed by interconnecting the algorithm modules used by the machine learning model, the algorithm modules are stored in a linked list in the topological order of the nodes of the directed acyclic graph, and the connections among the algorithm modules are stored in the form of an adjacency matrix; the data set part comprises the training set and test set used by the machine learning model; the model parameter part comprises the initial preset parameter values of the machine learning model and the parameter values after training is complete.
CN202310635240.8A 2023-05-31 2023-05-31 Semi-automatic model building method and machine learning platform based on PCFG Pending CN116629132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310635240.8A CN116629132A (en) 2023-05-31 2023-05-31 Semi-automatic model building method and machine learning platform based on PCFG

Publications (1)

Publication Number Publication Date
CN116629132A (en) 2023-08-22

Family

ID=87637987

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202310635240.8A Pending CN116629132A (en) 2023-05-31 2023-05-31 Semi-automatic model building method and machine learning platform based on PCFG

Country Status (1)

Country Link
CN (1) CN116629132A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination