CN113296784A

CN113296784A - Container base mirror image recommendation method and system based on configuration code representation

Info

Publication number: CN113296784A
Application number: CN202110539905.6A
Authority: CN
Inventors: 毛新军; 张银园; 张洋; 卢遥; 王涛; 张璋
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-05-18
Filing date: 2021-05-18
Publication date: 2021-08-24
Anticipated expiration: 2041-05-18
Also published as: CN113296784B

Abstract

The invention relates to a container base image recommendation method and system based on configuration code representation. The method comprises: parsing each container image configuration file in a container image configuration data set to obtain the functional code segment and the base image corresponding to that file; representing each functional code segment as an abstract syntax tree structure; obtaining the paths of the abstract syntax tree structure from the root node to each leaf node, wherein each path comprises the structure sequence from the root node to the corresponding leaf node and the leaf node itself; training a neural network model with the structure sequences and corresponding leaf nodes of each functional code segment as input and the base image corresponding to each functional code segment as output; and obtaining, from the trained neural network model, the base image corresponding to a functional code segment for which a recommendation is sought. The invention improves the efficiency and accuracy of obtaining a container base image.

Description

Container base image recommendation method and system based on configuration code representation

Technical Field

The invention relates to the field of container base images, and in particular to a container base image recommendation method and system based on configuration code representation.

Background

In recent years, Docker container technology has attracted a great deal of attention in industry thanks to the rapid deployment that containers make possible. However, software development based on Docker containers requires writing configuration files such as a Dockerfile. To complete a Dockerfile, a developer first needs to specify the base image on which it depends, and this choice often rests on the developer's personal experience. More importantly, selecting a suitable base image helps both to reduce the size of the resulting image and to improve its build success rate. In image hosting communities such as Docker Hub, however, searching for a container image still relies heavily on the developer's personal experience.

Disclosure of Invention

The invention aims to provide a container base image recommendation method and system based on configuration code representation that improve the efficiency and accuracy of obtaining a container base image.

To achieve this purpose, the invention provides the following scheme:

A container base image recommendation method based on configuration code representation, the method comprising:

obtaining a container image configuration data set, the container image configuration data set comprising a plurality of container image configuration files;

parsing each container image configuration file in the container image configuration data set to obtain the functional code segment and the base image corresponding to each container image configuration file;

representing each functional code segment as an abstract syntax tree structure;

obtaining a plurality of paths of the abstract syntax tree structure from the root node to each leaf node, wherein each path comprises the structure sequence from the root node to the corresponding leaf node and the corresponding leaf node;

training a neural network model with the plurality of structure sequences and corresponding leaf nodes of each functional code segment as input and the base image corresponding to each functional code segment as output, so as to obtain a container base image recommendation model;

obtaining a plurality of structure sequences and corresponding leaf nodes of a functional code segment for which a base image is to be recommended;

and inputting the plurality of structure sequences and corresponding leaf nodes of that functional code segment into the container base image recommendation model to obtain the corresponding base image.

Optionally, obtaining the container image configuration data set specifically comprises:

obtaining an open source project set;

screening out, from the open source project set, the projects that contain image configuration files to obtain a container image database;

and removing duplicate container image configuration files from the container image database to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents.

Optionally, obtaining the open source project set specifically comprises:

screening, from an open source community code hosting platform, the open source projects whose star count is greater than a first set value and whose Issue count is greater than a second set value to obtain the open source project set.

Optionally, removing duplicate container image configuration files from the container image database to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents specifically comprises:

obtaining the hash value of each container image configuration file in the container image database;

and removing duplicate container image configuration files from the container image database according to these hash values to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents.

Optionally, the neural network model is an attention-based neural network model.

The invention also discloses a container base image recommendation system based on configuration code representation, the system comprising:

a data set acquisition module, used for obtaining a container image configuration data set, the container image configuration data set comprising a plurality of container image configuration files;

a data analysis module, used for parsing each container image configuration file in the container image configuration data set to obtain the functional code segment and the base image corresponding to each container image configuration file;

a code segment representation module, used for representing each functional code segment as an abstract syntax tree structure;

a multi-path acquisition module, used for obtaining a plurality of paths of the abstract syntax tree structure from the root node to each leaf node, wherein each path comprises the structure sequence from the root node to the corresponding leaf node and the corresponding leaf node;

a container base image recommendation model training module, used for training a neural network model with the plurality of structure sequences and corresponding leaf nodes of each functional code segment as input and the base image corresponding to each functional code segment as output, so as to obtain a container base image recommendation model;

an input feature acquisition module, used for obtaining a plurality of structure sequences and corresponding leaf nodes of a functional code segment for which a base image is to be recommended;

and a container base image recommendation model application module, used for inputting the plurality of structure sequences and corresponding leaf nodes of that functional code segment into the container base image recommendation model to obtain the corresponding base image.

Optionally, the data set acquisition module specifically comprises:

an open source project set acquisition unit, used for obtaining an open source project set;

a container image database acquisition unit, used for screening out, from the open source project set, the projects that contain image configuration files to obtain a container image database;

and a container image configuration data set acquisition unit, used for removing duplicate container image configuration files from the container image database to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents.

Optionally, the open source project set acquisition unit specifically comprises:

an open source project set acquisition subunit, used for screening, from an open source community code hosting platform, the open source projects whose star count is greater than a first set value and whose Issue count is greater than a second set value to obtain the open source project set.

Optionally, the container image configuration data set acquisition unit specifically comprises:

a hash value acquisition subunit, used for obtaining the hash value of each container image configuration file in the container image database;

and a duplicate removal subunit, used for removing duplicate container image configuration files from the container image database according to these hash values to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

In the container base image recommendation method and system based on configuration code representation, a functional code segment is represented as an abstract syntax tree structure, from which the semantic and structural features of the configuration information are obtained. A neural network model is trained with the structure sequences and corresponding leaf nodes of each functional code segment as input and the corresponding base image as output, yielding a container base image recommendation model. The base image for a functional code segment is then obtained from this model. Compared with the traditional practice of selecting a base image according to personal experience, this improves the efficiency and accuracy of obtaining a container base image.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described here show only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of the container base image recommendation method based on configuration code representation according to the present invention;

FIG. 2 is a schematic structural diagram of the container base image recommendation system based on configuration code representation according to the present invention;

FIG. 3 is a detailed flowchart of the container base image recommendation method based on configuration code representation according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a schematic flow chart of the container base image recommendation method based on configuration code representation according to the present invention. As shown in Fig. 1, the method includes:

Step 101: obtain a container image configuration data set; the container image configuration data set includes a plurality of container image configuration files.

Obtaining the container image configuration data set specifically includes:

Obtaining an open source project set.

Screening out, from the open source project set, the projects that contain image configuration files to obtain a container image database.

Removing duplicate container image configuration files from the container image database to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents.

Obtaining the open source project set specifically includes:

Screening, from an open source community code hosting platform, the open source projects whose star count is greater than a first set value and whose Issue count is greater than a second set value to obtain the open source project set. Screening by star count and Issue count improves the reliability of the selected open source projects, and therefore the reliability of the model trained on them as sample data.

Removing duplicate container image configuration files from the container image database to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents specifically includes:

Obtaining the hash value of each container image configuration file in the container image database.

Removing duplicate container image configuration files from the container image database according to these hash values.

Step 102: parse each container image configuration file in the container image configuration data set to obtain the functional code segment and the base image corresponding to each container image configuration file.

Step 103: represent each functional code segment as an abstract syntax tree structure.

Step 104: obtain a plurality of paths of the abstract syntax tree structure from the root node to each leaf node, wherein each path comprises the structure sequence from the root node to the corresponding leaf node and the corresponding leaf node.

Step 105: train a neural network model with the plurality of structure sequences and corresponding leaf nodes of each functional code segment as input and the base image corresponding to each functional code segment as output, so as to obtain a container base image recommendation model.

The neural network model is an attention-based neural network model.

Step 106: obtain a plurality of structure sequences and corresponding leaf nodes of a functional code segment for which a base image is to be recommended.

Step 107: input the plurality of structure sequences and corresponding leaf nodes of that functional code segment into the container base image recommendation model to obtain the corresponding base image.

The container base image recommendation method based on configuration code representation is described in detail below; its detailed flowchart is shown in Fig. 3.

S1: construct an active open source project set according to metrics such as the star and Issue counts of an open source community code hosting platform.

S2: based on the active open source project set obtained in step S1, use an API (application programming interface) to check whether each open source project includes a Dockerfile image configuration file, screen out the open source projects that include an image configuration, and construct a container image database from the Dockerfile image configuration data they contain.

S3: based on the container image database obtained in step S2, remove duplicate Dockerfiles so that only container data with different Dockerfile contents is retained, and parse each container configuration file (Dockerfile) to obtain a functional code segment X and a base image Y.

S4: represent the functional code segment X obtained in step S3 as an abstract syntax tree (AST) structure, and obtain a plurality of paths from the root node to the leaf nodes based on the AST structure.

S5: split each path obtained in step S4 into a structure sequence and a leaf node; taking the structure sequences and their corresponding leaf nodes as features and the base image Y obtained in step S3 as the label (output), train a neural network model with a multi-encoding attention mechanism. The resulting model (the container base image recommendation model) can predict a base image from a Dockerfile functional code segment.

In the present invention, step S1 includes the following:

S1.1: in the collaborative development community GitHub, collect the basic information of projects through the API, and screen out popular open source projects according to the star count.

S1.2: from the popular open source projects, screen out active open source projects according to the Issue data submitted by developers, and construct the active open source project set.

In the present invention, step S2 includes the following:

S2.1: for the active open source project set obtained in step S1, obtain the file names contained in each project, and remove the projects that do not contain an image configuration file.

S2.2: traverse the image configuration information of the remaining projects, and construct the container image database.
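Step S2.1 can be illustrated with a minimal sketch. It assumes the file names of each project have already been fetched (for example through a hosting platform API, which is not shown) and that an image configuration file is any file whose name ends with "Dockerfile". The project names, file paths and the helper names `has_dockerfile` and `select_projects_with_dockerfile` are hypothetical.

```python
from pathlib import PurePosixPath


def has_dockerfile(file_paths):
    """Return True if any file name in the project ends with 'Dockerfile'."""
    return any(PurePosixPath(p).name.endswith("Dockerfile") for p in file_paths)


def select_projects_with_dockerfile(projects):
    """Keep only the projects (name -> list of file paths) that contain a Dockerfile."""
    return {name: paths for name, paths in projects.items() if has_dockerfile(paths)}


# Hypothetical project listings used only for illustration.
projects = {
    "org/app": ["README.md", "docker/Dockerfile", "src/main.py"],
    "org/lib": ["README.md", "setup.py"],
    "org/svc": ["build/prod.Dockerfile"],
}
selected = select_projects_with_dockerfile(projects)
```

Only the file name, not the directory, is tested, so a Dockerfile nested in a subdirectory still qualifies the project.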

In the present invention, step S3 includes the following:

S3.1: traverse the content of each configuration file in the container image database and remove duplicate image configuration data to obtain the image configuration data set.

S3.2: parse the instruction information of each container Dockerfile image configuration file, and extract the functional instruction data X (all instructions except FROM) and the base image instruction data, namely the base image name Y declared by the FROM instruction.

In the present invention, step S4 includes the following:

S4.1: parse the ordinary Dockerfile functional instruction data X into an AST structure (the root node is DOCKER-FILE, state nodes hold abstract instruction or command information, and leaf nodes hold information such as a package or ARG).

S4.2: traverse the abstract syntax tree structure of each Dockerfile to obtain a plurality of syntax paths, where each path is the set of node information from the root node to a leaf node.

In the present invention, step S5 includes the following:

S5.1: split each path into the structure sequence between the root node and the leaf node and the semantic information expressed by the leaf node.

S5.2: input the structure sequence and semantic information features into the model, with the base image name supplied as the label, and train to obtain an automatic base image recommendation model (the container base image recommendation model), which can automatically recommend a base image for a Dockerfile containing only functional code segments.
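The split described in step S5.1 can be sketched as follows. This is an illustration under one assumption: the structure sequence is taken to be the state nodes strictly between the root and the leaf, with the root kept as a separate element. The function name `split_path` and the node labels other than DOCKER-FILE, APT-GET-INSTALL and GCC-Y are hypothetical.

```python
def split_path(path):
    """Split a root-to-leaf path into (root, structure sequence, leaf)."""
    if len(path) < 2:
        raise ValueError("a path needs at least a root and a leaf")
    root, *structure, leaf = path
    return root, tuple(structure), leaf


# One root-to-leaf path of a Dockerfile AST (labels partly hypothetical).
path = ["DOCKER-FILE", "RUN", "APT-GET-INSTALL", "GCC-Y"]
root, structure, leaf = split_path(path)
```

Returning the structure sequence as a tuple makes it hashable, so the whole sequence can serve as a single key in an embedding lookup.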

The invention achieves the following technical effects:

The invention proposes a method for recommending an image from structured Dockerfile functional segments. Representing the functional segments of a Dockerfile as abstract syntax trees makes it possible to capture the semantic and structural features of the configuration information, and the attention mechanism in the neural network can capture the important paths, so that a correct base image is recommended. The method effectively helps developers select a suitable base image automatically and improves container configuration efficiency.

The container base image recommendation method based on configuration code representation is described below through a specific embodiment.

S1: construct an active open source project set.

For an open source community (for example, GitHub), the open source projects whose star count is greater than 10 and whose Issue count is greater than 10 are screened out, and the projects meeting these requirements form the open source project set.
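The screening rule of this embodiment can be written as a small filter. The sketch below assumes the star and Issue counts have already been collected into records; the `Project` class, the field names and the sample projects are hypothetical, while the two thresholds of 10 come from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    stars: int
    issues: int


def select_active_projects(projects, min_stars=10, min_issues=10):
    """Keep projects whose star and Issue counts both exceed the thresholds."""
    return [p for p in projects if p.stars > min_stars and p.issues > min_issues]


# Hypothetical candidates used only for illustration.
candidates = [
    Project("org/popular-active", stars=250, issues=40),
    Project("org/popular-quiet", stars=300, issues=3),
    Project("org/borderline", stars=10, issues=10),
]
active = select_active_projects(candidates)
```

Both comparisons are strict, so a project with exactly 10 stars and 10 Issues is excluded, matching the "greater than" wording.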

S2: construct a container image database.

Each file of a project is traversed, and the project is rejected if it contains no file whose name ends with the Dockerfile suffix. To remove Dockerfiles with duplicate content, the hash value of each Dockerfile's content is computed, and a Dockerfile is added to the final container image database only if its hash value has not appeared before.
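The hash-based duplicate removal can be sketched as below. The text does not name a hash function, so SHA-256 from the Python standard library is an assumption, as are the function name `deduplicate_dockerfiles` and the sample contents.

```python
import hashlib


def deduplicate_dockerfiles(dockerfiles):
    """Return Dockerfile contents with exact duplicates removed, order preserved."""
    seen = set()
    unique = []
    for content in dockerfiles:
        # SHA-256 is an assumed choice; any stable content hash would serve.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(content)
    return unique


# Hypothetical Dockerfile contents; the first and third are identical.
dockerfiles = [
    "FROM ubuntu:20.04\nRUN apt-get install -y gcc\n",
    "FROM python:3.9\nRUN pip install flask\n",
    "FROM ubuntu:20.04\nRUN apt-get install -y gcc\n",
]
unique = deduplicate_dockerfiles(dockerfiles)
```

Hashing the raw content only catches byte-identical files; two Dockerfiles that differ in whitespace alone are kept as distinct entries.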

S3: extract functional segments and base images.

To extract the functional segments and base images, comment information (lines beginning with //) is removed first. For lines beginning with a FROM instruction, the name of the base image is extracted as a namespace/name (version) tuple; instruction data other than FROM is treated as the functional code segment.
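A minimal sketch of this extraction step follows. It makes several simplifying assumptions that are not in the text: only the first FROM line is used, FROM flags, digests and line continuations are not handled, and lines starting with `#` (the comment marker defined by the Dockerfile format) are skipped in addition to the `//` lines mentioned above. The function names are hypothetical.

```python
def split_image_reference(reference):
    """Split 'namespace/name:version' into (namespace/name, version or None)."""
    # Only a colon in the last path segment is a version separator.
    last_segment = reference.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, version = reference.rsplit(":", 1)
        return name, version
    return reference, None


def parse_dockerfile(text):
    """Return (functional instruction lines, base image tuple) for one Dockerfile."""
    base_image = None
    functional = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//") or line.startswith("#"):
            continue  # skip blank lines and comments
        instruction, _, argument = line.partition(" ")
        if instruction.upper() == "FROM":
            parts = argument.split()
            if parts and base_image is None:
                base_image = split_image_reference(parts[0])
        else:
            functional.append(line)
    return functional, base_image


text = "# build\nFROM library/ubuntu:20.04\nRUN apt-get install -y gcc\n// note\nCOPY . /app\n"
functional, base_image = parse_dockerfile(text)
```

Here `functional` plays the role of the functional code segment X and `base_image` the role of the base image label Y.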

S4: AST representation and path acquisition.

The functional code segments are represented as an AST structure according to the information type of each instruction, and a plurality of paths are then obtained in sequence in a depth-first manner. Specifically, common instruction contents such as APT-GET-INSTALL are represented as state nodes, and PACKAGE or ARG information such as GCC-Y is represented as leaf node information.
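The depth-first path acquisition can be sketched on a hand-built tree. The sketch assumes the AST has already been constructed; the `Node` class, the function name and the labels RUN, ENV, MAKE and APP_HOME are hypothetical, while DOCKER-FILE, APT-GET-INSTALL and GCC-Y follow the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def root_to_leaf_paths(root):
    """Depth-first enumeration of label paths from the root to every leaf."""
    paths = []
    stack = [(root, [root.label])]
    while stack:
        node, path = stack.pop()
        if not node.children:
            paths.append(path)
            continue
        # Push children in reverse so they are visited left to right.
        for child in reversed(node.children):
            stack.append((child, path + [child.label]))
    return paths


tree = Node("DOCKER-FILE", [
    Node("RUN", [Node("APT-GET-INSTALL", [Node("GCC-Y"), Node("MAKE")])]),
    Node("ENV", [Node("APP_HOME")]),
])
paths = root_to_leaf_paths(tree)
```

An explicit stack is used instead of recursion so that deep trees do not hit the interpreter's recursion limit; the visiting order is the same as a recursive pre-order traversal.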

Each path $x_i$ in a Dockerfile functional segment can be represented as

$$x_i = \langle \mathit{root},\ s_i,\ t_i \rangle,$$

where $s_i$ denotes the path sequence (structure sequence) of the path, $\mathit{root}$ denotes the root node, and $t_i$ denotes the leaf node carrying the semantic information.

Each Dockerfile functional segment can be represented as $\langle x_1, x_2, \ldots, x_k \rangle$, a set of multiple paths, where $k$ is the number of paths.

The state node sequence (structure sequence) $s_i$ is encoded as a whole. The structure sequence encoding $\mathrm{encode\_sequence}(s_i)$ is represented using an embedding matrix $E$:

$$\mathrm{encode\_sequence}(s_i) = E_{s_i}.$$

A leaf node can be split into sub-information according to its delimiters, and a learned embedding matrix $E^{\mathrm{subtoken}}$ represents the encoding of each piece of sub-information. The encoded vectors of the sub-information are then summed to give the encoding of the complete leaf node:

$$\mathrm{encode\_leaf}(t) = \sum_{u \in \mathrm{split}(t)} E^{\mathrm{subtoken}}_{u},$$

where $t$ represents a leaf node.

The encoding of the root node, the encoding of the structure sequence and the encoding of the leaf node are concatenated into a new vector $z_i$:

$$z_i = [\,\mathrm{encode}(\mathit{root});\ \mathrm{encode\_sequence}(s_i);\ \mathrm{encode\_leaf}(t_i)\,],$$

where $\mathrm{encode}(\mathit{root})$ is the encoding of the root node, $\mathrm{encode\_sequence}(s_i)$ is the encoding of the structure sequence, and $\mathrm{encode\_leaf}(t_i)$ is the encoding of the leaf node.

A fully connected layer learns how to combine the components of the $z_i$ corresponding to each path, which is computed as:

$$\tilde{z}_i = \tanh(W z_i),$$

where $W$ represents a weight matrix and $\tanh(\cdot)$ represents the activation function.

The attention weight $\alpha_i$ of each $\tilde{z}_i$ is

$$\alpha_i = \frac{\exp(\tilde{z}_i^{T} a)}{\sum_{j=1}^{k} \exp(\tilde{z}_j^{T} a)},$$

where the attention vector $a \in \mathbb{R}^{2d}$ is randomly initialized and learned simultaneously with the network (the attention-based neural network model), $k$ is the number of paths, and $\mathbb{R}^{2d}$ denotes the real vector space of dimension $2d$.

The Dockerfile vector $v$ is expressed as:

$$v = \sum_{i=1}^{k} \alpha_i \tilde{z}_i.$$

The prediction of the attention-based neural network model is calculated as the (softmax-normalized) dot product between the Dockerfile vector and each base image label:

$$q(y_{i'}) = \frac{\exp(v^{T}\,\mathrm{image\_tag}_{i'})}{\sum_{j=1}^{Q} \exp(v^{T}\,\mathrm{image\_tag}_{j})},$$

where $Q$ is the number of base images, $\mathrm{image\_tag}_{i'}$ denotes the $i'$-th base image, $v^{T}$ denotes the transpose of $v$, and $q(y_{i'})$ is the probability assigned to $\mathrm{image\_tag}_{i'}$. The $\mathrm{image\_tag}_{i'}$ with the highest probability is the base image Y corresponding to $v$.
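The forward pass described above (subtoken-sum leaf encoding, concatenation, tanh layer, attention weights, weighted sum, softmax over base image labels) can be sketched in plain Python. This is an untrained illustration, not the model of the invention: every parameter is random, and the embedding size `D`, the `-`/`_` delimiters, the three image names and all function names are assumptions.

```python
import math
import random

D = 4                      # embedding size d (hypothetical value)
rng = random.Random(0)     # fixed seed so the sketch is reproducible


def rand_vec(n):
    """Random vector standing in for parameters that training would learn."""
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


class Embedding:
    """Lookup table that lazily creates one random vector per key."""

    def __init__(self, dim):
        self.dim = dim
        self.table = {}

    def __getitem__(self, key):
        if key not in self.table:
            self.table[key] = rand_vec(self.dim)
        return self.table[key]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


E_root, E_seq, E_sub = Embedding(D), Embedding(D), Embedding(D)
W = [rand_vec(3 * D) for _ in range(2 * D)]   # maps a 3d vector to a 2d vector
attention = rand_vec(2 * D)                   # attention vector in R^{2d}
image_tags = {name: rand_vec(2 * D) for name in ("ubuntu", "python", "node")}


def encode_leaf(leaf):
    """Sum the subtoken embeddings of a leaf such as 'GCC-Y'."""
    subtokens = [s for s in leaf.replace("_", "-").split("-") if s] or [leaf]
    vectors = [E_sub[s] for s in subtokens]
    return [sum(column) for column in zip(*vectors)]


def combine(root, structure, leaf):
    """Concatenate the three encodings into z_i, then apply tanh(W z_i)."""
    z = E_root[root] + E_seq[tuple(structure)] + encode_leaf(leaf)  # list concatenation
    return [math.tanh(dot(row, z)) for row in W]


def predict(paths):
    """Return (attention weights, probability per base image) for one Dockerfile."""
    combined = [combine(*path) for path in paths]
    alpha = softmax([dot(z, attention) for z in combined])
    v = [sum(w * z[j] for w, z in zip(alpha, combined)) for j in range(2 * D)]
    names = list(image_tags)
    probs = softmax([dot(v, image_tags[n]) for n in names])
    return alpha, dict(zip(names, probs))


paths = [
    ("DOCKER-FILE", ("RUN", "APT-GET-INSTALL"), "GCC-Y"),
    ("DOCKER-FILE", ("ENV",), "APP_HOME"),
]
alpha, probs = predict(paths)
recommended = max(probs, key=probs.get)
```

Because the parameters are random, `recommended` carries no meaning here; the sketch only shows how the quantities fit together and that the attention weights and the label probabilities each sum to one.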

S5: split each path of the container image configuration data set obtained in step S4 into a structure sequence and a leaf node; with the structure sequences and their corresponding leaf nodes as features and the base image as the label (output), train a neural network model with a multi-encoding attention mechanism. The attention-based neural network model (the container base image recommendation model) then predicts the base image from a Dockerfile functional code segment.

Fig. 2 is a schematic structural diagram of the container base image recommendation system based on configuration code representation according to the present invention. As shown in Fig. 2, the system includes:

A data set acquisition module 201, configured to obtain a container image configuration data set; the container image configuration data set includes a plurality of container image configuration files.

A data analysis module 202, configured to parse each container image configuration file in the container image configuration data set to obtain the functional code segment and the base image corresponding to each container image configuration file.

A code segment representation module 203, configured to represent each functional code segment as an abstract syntax tree structure.

A multi-path acquisition module 204, configured to obtain a plurality of paths of the abstract syntax tree structure from the root node to each leaf node, where each path comprises the structure sequence from the root node to the corresponding leaf node and the corresponding leaf node.

A container base image recommendation model training module 205, configured to train a neural network model with the plurality of structure sequences and corresponding leaf nodes of each functional code segment as input and the base image corresponding to each functional code segment as output, so as to obtain a container base image recommendation model.

An input feature acquisition module 206, configured to obtain a plurality of structure sequences and corresponding leaf nodes of a functional code segment for which a base image is to be recommended.

A container base image recommendation model application module 207, configured to input the plurality of structure sequences and corresponding leaf nodes of that functional code segment into the container base image recommendation model to obtain the corresponding base image.
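One way the recommendation-time modules could be wired together is sketched below. This is only an illustration of the data flow between modules 202 to 204 and 207; the class `RecommendationSystem`, its field names and the three toy stand-in functions are hypothetical and do not implement real parsing or a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, Tuple[str, ...], str]  # (root, structure sequence, leaf)


@dataclass
class RecommendationSystem:
    parse: Callable[[str], Tuple[List[str], object]]       # data analysis module 202
    extract_paths: Callable[[List[str]], List[Path]]       # modules 203 and 204
    recommend_from_paths: Callable[[Sequence[Path]], str]  # model application module 207

    def recommend(self, dockerfile_text: str) -> str:
        """Run parsing, path extraction and model application in sequence."""
        functional, _ = self.parse(dockerfile_text)
        return self.recommend_from_paths(self.extract_paths(functional))


# Toy stand-ins for the real modules, used only to show the data flow.
system = RecommendationSystem(
    parse=lambda text: ([l for l in text.splitlines() if l and not l.startswith("FROM")], None),
    extract_paths=lambda lines: [("DOCKER-FILE", (l.split()[0],), l.split()[-1]) for l in lines],
    recommend_from_paths=lambda paths: "python" if any(p[2].endswith(".py") for p in paths) else "ubuntu",
)
result = system.recommend("RUN pip install flask\nCMD python app.py")
```

Passing the modules in as callables keeps each stage replaceable, which mirrors the module-by-module description of the system.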

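The data set acquisition module 201 specifically includes:

An open source project set acquisition unit, configured to obtain an open source project set.

A container image database acquisition unit, configured to screen out, from the open source project set, the projects that contain image configuration files to obtain a container image database.

A container image configuration data set acquisition unit, configured to remove duplicate container image configuration files from the container image database to obtain a container image configuration data set consisting of a plurality of container image configuration files with different contents.

The open source project set acquisition unit specifically includes:

An open source project set acquisition subunit, configured to screen, from an open source community code hosting platform, the open source projects whose star count is greater than a first set value and whose Issue count is greater than a second set value to obtain the open source project set.

The container image configuration data set acquisition unit specifically includes:

A hash value acquisition subunit, configured to obtain the hash value of each container image configuration file in the container image database.

A duplicate removal subunit, configured to remove duplicate container image configuration files from the container image database according to these hash values.

The neural network model is an attention-based neural network model.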

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A container base image recommendation method based on configuration code characterization is characterized by comprising the following steps:

2. The method for recommending a container base image based on a configuration code representation according to claim 1, wherein the obtaining a container image configuration data set specifically comprises:

obtaining an open source project set;

3. The method for recommending a container base image based on configuration code characterization according to claim 2, wherein the obtaining an open source item set specifically includes:

4. The method for recommending container base images based on configuration code characterization according to claim 2, wherein the removing of duplicate container image configuration files in the container image database to obtain a container image configuration data set composed of a plurality of container image configuration files with different contents specifically includes:

5. The method of claim 1, wherein the neural network model is an attention-based neural network model.

6. A container base image recommendation system based on configuration code characterization, the system comprising:

7. The system according to claim 1, wherein the data set acquisition module specifically includes:

8. The system according to claim 7, wherein the open-source item set obtaining unit specifically includes:

9. The system according to claim 7, wherein the container mirror configuration dataset acquisition unit specifically includes:

10. The configuration code characterization based container base image recommendation system according to claim 6, wherein the neural network model is an attention mechanism based neural network model.