
CN113296784B - Container base mirror image recommendation method and system based on configuration code characterization

Info

Publication number: CN113296784B
Application number: CN202110539905.6A
Authority: CN
Inventors: 毛新军; 张银园; 张洋; 卢遥; 王涛; 张璋
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-05-18
Filing date: 2021-05-18
Publication date: 2023-11-14
Anticipated expiration: 2041-05-18
Also published as: CN113296784A

Abstract

The invention relates to a container base image recommendation method and system based on configuration code characterization. The method comprises the following steps: parsing the data in each container image configuration file in a container image configuration data set to obtain the functional code segment and the base image corresponding to each container image configuration file; characterizing each functional code segment as an abstract syntax tree structure; obtaining a plurality of paths of the abstract syntax tree structure from the root node to each leaf node, wherein each path comprises a structure sequence from the root node to a corresponding leaf node and the corresponding leaf node; training a neural network model with the structure sequences and corresponding leaf nodes of each functional code segment as inputs and the base image corresponding to that functional code segment as the output; and obtaining the base image corresponding to a functional code segment to be recommended according to the trained neural network model. The invention improves the efficiency and accuracy of obtaining the container base image.

Description

Container base mirror image recommendation method and system based on configuration code characterization

Technical Field

The invention relates to the field of container base mirror images, in particular to a container base mirror image recommendation method and system based on configuration code characterization.

Background

In recent years, Docker container technology has attracted widespread attention in the industry thanks to the rapid deployment that containers make possible. However, container-based software development requires writing configuration files such as a Dockerfile. To complete a Dockerfile, the developer first needs to specify the base image on which the Dockerfile depends, a choice that often relies on the developer's personal experience. More importantly, selecting a suitable base image helps both to reduce the size of the resulting image and to improve the success rate of image builds. Yet in image hosting communities such as Docker Hub, searching for a container image still relies primarily on the developer's personal experience.

Disclosure of Invention

The invention aims to provide a container base mirror image recommending method and system based on configuration code characterization, which improve the efficiency and accuracy of obtaining the container base mirror image.

In order to achieve the above object, the present invention provides the following solutions:

a container base image recommendation method based on configuration code characterization, the method comprising:

obtaining a container mirror configuration dataset; the container image configuration data set includes a plurality of container image configuration files;

analyzing the data in each container mirror image configuration file in the container mirror image configuration data set to obtain a functional code segment and a basic mirror image corresponding to each container mirror image configuration file;

characterizing each of the functional code segments as an abstract syntax tree structure;

obtaining a plurality of paths of the abstract syntax tree structure from a root node to each leaf node, wherein each path comprises a structure sequence from the root node to a corresponding leaf node and the corresponding leaf node;

training a neural network model with the plurality of structure sequences and corresponding leaf nodes of each functional code segment as inputs and the base image corresponding to each functional code segment as the output, to obtain a container base image recommendation model;

obtaining a plurality of structural sequences and corresponding leaf nodes of the functional code segments to be recommended;

and inputting a plurality of structural sequences of the functional code segments to be recommended and corresponding leaf nodes into the container base image recommendation model to obtain the base image corresponding to the functional code segments to be recommended.

Optionally, the obtaining a container mirror configuration data set specifically includes:

acquiring an open source item set;

screening out items comprising mirror configuration files from the open source item set to obtain a container mirror database;

and eliminating repeated container mirror image configuration files in the container mirror image database to obtain a container mirror image configuration data set formed by a plurality of container mirror image configuration files with different contents.

Optionally, the obtaining the open source item set specifically includes:

and screening open source projects with star indexes larger than a first set value and Issue indexes larger than a second set value from the open source community code hosting platform to obtain an open source project set.

Optionally, the removing the repeated container mirror configuration files in the container mirror database to obtain a container mirror configuration data set formed by a plurality of container mirror configuration files with different contents specifically includes:

obtaining hash values of all the container mirror files in a container mirror database;

and eliminating repeated container mirror image configuration files in the container mirror image database according to the hash value of each container mirror image file to obtain a container mirror image configuration data set formed by a plurality of container mirror image configuration files with different contents.

Optionally, the neural network model is a neural network model based on an attention mechanism.

The invention also discloses a container base mirror image recommendation system based on configuration code characterization, which comprises:

the data set acquisition module is used for acquiring a container mirror image configuration data set; the container image configuration data set includes a plurality of container image configuration files;

the data analysis module is used for analyzing the data in each container mirror image configuration file in the container mirror image configuration data set to obtain functional code fragments and basic mirrors corresponding to each container mirror image configuration file;

the code segment characterization module is used for characterizing each functional code segment into an abstract syntax tree structure;

the multi-path acquisition module is used for acquiring a plurality of paths of the abstract syntax tree structure from a root node to each leaf node, wherein each path comprises a structure sequence from the root node to a corresponding leaf node and the corresponding leaf node;

the container base mirror image recommendation model training module is used for taking a plurality of structural sequences corresponding to each functional code segment and corresponding leaf nodes as inputs, taking a base mirror image corresponding to each functional code segment as an output training neural network model, and obtaining a container base mirror image recommendation model;

the input feature acquisition module is used for acquiring a plurality of structural sequences of the functional code fragments to be recommended and corresponding leaf nodes;

and the container base mirror image recommendation model application module is used for inputting the multiple structural sequences of the functional code fragments to be recommended and the corresponding leaf nodes into the container base mirror image recommendation model to obtain the base mirror images corresponding to the functional code fragments to be recommended.

Optionally, the data set acquisition module specifically includes:

the open source item set acquisition unit is used for acquiring an open source item set;

the container mirror image database acquisition unit is used for screening out items comprising mirror image configuration files from the open source item set to acquire a container mirror image database;

the container mirror image configuration data set obtaining unit is used for removing repeated container mirror image configuration files in the container mirror image database to obtain a container mirror image configuration data set formed by a plurality of container mirror image configuration files with different contents.

Optionally, the open source item set acquisition unit specifically includes:

the open source project set acquisition subunit is used for screening open source projects with star indexes larger than a first set value and Issue indexes larger than a second set value from the open source community code hosting platform to obtain the open source project set.

Optionally, the container mirror configuration data set obtaining unit specifically includes:

a hash value obtaining subunit, configured to obtain hash values of the container image files in the container image database;

and the repeated eliminating subunit is used for eliminating repeated container mirror image configuration files in the container mirror image database according to the hash value of each container mirror image file to obtain a container mirror image configuration data set formed by a plurality of container mirror image configuration files with different contents.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention relates to a container base image recommendation method and system based on configuration code characterization. Functional code segments are characterized as abstract syntax tree structures, from which the semantic and structural features of the configuration information are obtained. A neural network model is trained with the structure sequences and corresponding leaf nodes of the functional code segments as inputs and the corresponding base images as outputs, yielding a container base image recommendation model. The base image corresponding to a functional code segment to be recommended is then obtained from this model, which improves the efficiency and accuracy of obtaining the container base image.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for recommending container base images based on configuration code characterization;

FIG. 2 is a schematic diagram of a configuration code representation-based container base image recommendation system;

FIG. 3 is a detailed flowchart of a method for recommending container base images based on configuration code characterization according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

FIG. 1 is a schematic flow chart of a container base image recommending method based on configuration code representation, and as shown in FIG. 1, the container base image recommending method based on configuration code representation comprises the following steps:

step 101: obtaining a container mirror configuration dataset; the container image configuration data set includes a plurality of container image configuration files.

The obtaining a container mirror configuration data set specifically includes:

an open source set of items is obtained.

And screening out the items comprising the mirror configuration file from the open source item set to obtain a container mirror database.

The obtaining the open source item set specifically includes:

and screening open source projects with star indexes larger than a first set value and Issue indexes larger than a second set value from the open source community code hosting platform to obtain an open source project set. The reliability of the screened open source items is improved through star indexes and Issue indexes, so that the reliability of a training model taking the screened open source items as sample data is improved.

The step of eliminating repeated container mirror configuration files in the container mirror database to obtain a container mirror configuration data set formed by a plurality of container mirror configuration files with different contents comprises the following steps:

and obtaining the hash value of each container image file in the container image database.

Step 102: analyzing the data in each container mirror image configuration file in the container mirror image configuration data set to obtain functional code fragments and basic mirrors corresponding to each container mirror image configuration file.

Step 103: each of the functional code segments is characterized as an abstract syntax tree structure.

Step 104: a plurality of paths of the abstract syntax tree structure from the root node to each leaf node are obtained, each path comprising a structure sequence from the root node to a corresponding leaf node and a corresponding leaf node.

Step 105: a neural network model is trained with the plurality of structure sequences and corresponding leaf nodes of each functional code segment as inputs and the base image corresponding to each functional code segment as the output, to obtain a container base image recommendation model.

The neural network model is a neural network model based on an attention mechanism.

Step 106: a plurality of structural sequences of functional code segments to be recommended and corresponding leaf nodes are obtained.

Step 107: and inputting a plurality of structural sequences of the functional code segments to be recommended and corresponding leaf nodes into the container base image recommendation model to obtain the base image corresponding to the functional code segments to be recommended.

The container base image recommendation method based on configuration code characterization is described in detail below; its detailed flow is shown in FIG. 3.

S1: An active open source project set is constructed according to indexes such as the star and Issue counts of the open source community code hosting platform.

S2: Based on the active open source project set obtained in step S1, an API (application programming interface) is used to check whether each open source project contains a Dockerfile image configuration file. The open source projects containing an image configuration are screened out, and a container image database is constructed from the Dockerfile image configuration data they contain.

S3: Based on the container image database obtained in step S2, duplicate Dockerfiles are removed so that only container data with different Dockerfile contents is retained, and each container configuration file (Dockerfile) is parsed to obtain a functional code segment X and a base image Y.

S4: The functional code segment X obtained in step S3 is characterized as an abstract syntax tree (AST) structure, and a plurality of paths from the root node to the leaf nodes are acquired from the AST structure.

S5: Each path obtained in step S4 is split into a structure sequence and a leaf node. The structure sequences and their corresponding leaf nodes are taken as features, and the base image Y obtained in step S3 is taken as the label (output), to train a neural network model with a multi-code attention mechanism. The resulting model (the container base image recommendation model) can be used to predict a base image from Dockerfile functional code segments.

In the present invention, the step S1 includes the following:

S1.1: In the collaborative development community GitHub, basic information about each project is collected through an API, and popular open source projects are screened according to the star index.

S1.2: Active open source projects are screened out from the popular open source projects according to the Issue index data submitted by developers, and an active open source project set is constructed.

In the present invention, the step S2 includes the following:

S2.1: For the active open source project set obtained in step S1, the file name information contained in each project is acquired, and projects that do not contain an image configuration file are eliminated.

S2.2: The image configuration information of the remaining projects is traversed to construct a container image configuration data set.
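As an illustration of steps S2.1 and S2.2, the following minimal sketch keeps only projects whose file list contains a Dockerfile. The project representation (a dictionary from a project name to a list of file paths) and the function name `build_image_config_dataset` are assumptions made for this example; the text above does not prescribe a data layout or an API.

```python
from pathlib import PurePosixPath


def is_dockerfile(path):
    """Assumed rule: a file counts as an image configuration file
    if its file name ends with 'Dockerfile'."""
    return PurePosixPath(path).name.endswith("Dockerfile")


def build_image_config_dataset(projects):
    """projects: dict mapping a project name to the file paths it contains.
    Returns only the projects that contain at least one Dockerfile,
    mapped to the paths of those Dockerfiles."""
    dataset = {}
    for name, paths in projects.items():
        dockerfiles = [p for p in paths if is_dockerfile(p)]
        if dockerfiles:  # S2.1: drop projects without an image configuration
            dataset[name] = dockerfiles  # S2.2: collect the configuration files
    return dataset
```

In practice the file lists would come from the hosting platform's API; here they are passed in directly so the sketch stays self-contained.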

In the present invention, step S3 includes the following:

S3.1: The content of each configuration file in the data set is traversed, and duplicate image configuration data is removed to obtain an image configuration data set.

S3.2: The instruction information of each container Dockerfile image configuration file is parsed to extract the functional instruction data X (all instructions except the FROM instruction) and the base image instruction data, namely the base image name Y declared by the FROM instruction.

In the present invention, step S4 includes the following:

S4.1: The common Dockerfile functional instruction data X is parsed into an AST structure (the root node is DOCKER-FILE, state nodes are abstract instruction or command information, and leaf nodes are PACKAGE or ARG information).

S4.2: The abstract syntax tree structure of each Dockerfile is traversed to obtain a plurality of syntax paths, where each path is the set of node information from the root node to a leaf node.

In the present invention, step S5 includes the following:

S5.1: Each path can be split into the structure sequence between the root node and the leaf node, and the semantic information expressed by the leaf node.

S5.2: The structure sequences and the semantic information features are input into a model, with the base image name supplied to the model as the label, and an automatic base image recommendation model (the container base image recommendation model) is obtained through training. This model can be used to automatically recommend a base image for a Dockerfile that contains only functional code fragments.
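The split described in S5.1 can be sketched as follows. This is an illustrative example only: a path is assumed to be a list of node labels ordered from root to leaf, and the names `split_path` and `make_training_example` are hypothetical.

```python
def split_path(path):
    """Split a root-to-leaf path (list of node labels) into
    (root, structure sequence between root and leaf, leaf)."""
    if len(path) < 2:
        raise ValueError("a path needs at least a root and a leaf")
    return path[0], tuple(path[1:-1]), path[-1]


def make_training_example(paths, base_image):
    """Pair the path features of one Dockerfile with its base image label."""
    features = [split_path(p) for p in paths]
    return features, base_image
```

For example, the path `["DOCKER-FILE", "RUN", "APT-GET-INSTALL", "gcc"]` splits into the root `"DOCKER-FILE"`, the structure sequence `("RUN", "APT-GET-INSTALL")`, and the leaf `"gcc"`. The `RUN` label is an assumed state node used only for illustration.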

The invention achieves the following technical effects:

The invention proposes a method of recommending base images based on structured Dockerfile functional fragments. By characterizing the functional fragments of a Dockerfile as abstract syntax trees, the semantic and structural features of the configuration information can be obtained, and the attention mechanism in the neural network can capture the important paths, so that the correct base image is recommended. The method can effectively help developers select a suitable base image automatically, and improves the efficiency of container configuration.

The following describes a container base image recommendation method based on configuration code characterization according to a specific embodiment.

S1: Constructing an active open source project set.

For an open source community (taking Github as an example), open source items with star indexes larger than 10 and Issue indexes larger than 10 are screened out, and the open source items meeting the requirements are used as an open source item set.
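A minimal sketch of this screening step is shown below. The `Project` record and its field names are assumptions for illustration; only the two thresholds (star index greater than 10, Issue index greater than 10) come from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    stars: int   # star index of the project
    issues: int  # Issue index of the project


def select_active_projects(projects, min_stars=10, min_issues=10):
    """Keep projects whose star and Issue indexes both exceed the set values."""
    return [p for p in projects if p.stars > min_stars and p.issues > min_issues]
```

Both comparisons are strict, matching the wording "larger than"; a project with exactly 10 stars is excluded.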

S2: Constructing a container image database.

Each file of a project is traversed, and the project is eliminated if it contains no file whose name ends with the Dockerfile suffix. To eliminate Dockerfiles with duplicate content, the hash value of each Dockerfile's content is computed, and the Dockerfile is added to the final container image database only if that hash value has not been found before.
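The hash-based elimination of duplicates can be sketched as follows. The choice of SHA-256 is an assumption made for this example; the embodiment only states that a hash value of the Dockerfile content is used.

```python
import hashlib


def deduplicate_dockerfiles(contents):
    """Return the Dockerfile contents with duplicates removed,
    keeping the first occurrence of each distinct content."""
    seen = set()
    unique = []
    for text in contents:
        # Assumed hash function: SHA-256 over the UTF-8 encoded content.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Comparing digests instead of full file contents keeps the membership check cheap even when the database holds many large files.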

S3: Extracting functional fragments and base images.

To extract the functional fragments and base images, annotation information (lines beginning with //) is removed. For data beginning with a FROM instruction, the name of the base image is extracted as a name/name (version) tuple; instruction data other than FROM is treated as functional code segments.
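The extraction step might be sketched as below. Several simplifications are assumptions of this example and not part of the embodiment: both `//` (as stated above) and `#` (the comment marker Dockerfiles actually use) are treated as annotation prefixes, line continuations and image digests are not handled, and a missing version defaults to `latest`, which is Docker's default tag.

```python
def parse_image(image):
    """Split an image reference into (name, version)."""
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, version = image.rsplit(":", 1)
        return name, version
    return image, "latest"  # Docker's default tag when none is given


def split_dockerfile(text):
    """Return (functional instruction lines X, list of base images Y)."""
    functional, base_images = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//") or line.startswith("#"):
            continue  # skip blank lines and annotation lines
        keyword, _, rest = line.partition(" ")
        if keyword.upper() == "FROM":
            # Ignore option flags such as --platform=...
            tokens = [t for t in rest.split() if not t.startswith("--")]
            if tokens:
                base_images.append(parse_image(tokens[0]))
        else:
            functional.append(line)
    return functional, base_images
```

A list of base images is returned because a multi-stage Dockerfile may contain several FROM instructions; how such files are labeled is not specified in the text.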

S4: AST characterization and path acquisition.

According to the information type of each instruction, the functional code segment is characterized as an AST structure, and a plurality of paths are acquired in sequence in a depth-first manner. Specifically, common instruction content such as APT-GET-INSTALL is characterized as state nodes, and PACKAGE or ARG information such as GCC-Y is characterized as leaf node information.
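The depth-first acquisition of root-to-leaf paths can be illustrated with a small tree. The `Node` class and the tree built in the usage note are assumptions for this sketch; the embodiment does not specify the exact grammar used to build the AST.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def root_to_leaf_paths(root):
    """Collect every root-to-leaf path as a list of labels,
    visiting children depth-first from left to right."""
    paths = []
    stack = [(root, [root.label])]
    while stack:
        node, labels = stack.pop()
        if not node.children:  # a leaf ends the current path
            paths.append(labels)
            continue
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, labels + [child.label]))
    return paths
```

With a root `DOCKER-FILE`, a state node `APT-GET-INSTALL` holding the leaves `gcc` and `make`, and a second state node `ENV` holding the leaf `LANG`, the function yields three paths, one per leaf, in left-to-right order.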

Each path $x_i$ in a Dockerfile functional fragment can be characterized as a triple

$$x_i = \langle r_i, s_i, t_i \rangle,$$

where $r_i$ represents the root node, $s_i$ denotes the path sequence (structure sequence) of the path, and $t_i$ represents the leaf node that characterizes semantic information.

Each Dockerfile functional fragment can be characterized as a set of multiple paths $\langle x_1, x_2, \ldots, x_k \rangle$, where $k$ represents the number of paths.

The state node sequence (structure sequence) $s_i$ is encoded as a whole. The structure sequence encoding is represented using an embedding matrix $E_s$:

$$\mathrm{encode\_sequence}(s_i) = E_s.$$

A leaf node can be split into sub-information according to the '\' separator, and a learned embedding matrix $E^{subtoken}$ is used to represent the encoding of each piece of sub-information. The encoded vectors of the sub-information are then summed to represent the encoding of the complete leaf node:

$$\mathrm{encode\_token}(t) = \sum_{w \in \mathrm{split}(t)} E^{subtoken}_{w},$$

where $t$ represents a leaf node.

The encoding of the root node, the encoding of the structure sequence, and the encoding of the leaf node are concatenated into a new vector $z_i$:

$$z_i = [\, e_{r_i};\ e_{s_i};\ e_{t_i} \,],$$

where $e_{r_i}$ represents the encoding of the root node, $e_{s_i}$ represents the encoding of the structure sequence, and $e_{t_i}$ represents the encoding of the leaf node.

The $z_i$ corresponding to each path is combined by a fully connected layer, whose learned computation is expressed as:

$$\tilde{z}_i = \tanh(W z_i),$$

where $W$ represents a weight matrix and $\tanh(\cdot)$ represents the activation function.

The attention weight $\alpha_i$ of each $\tilde{z}_i$ is denoted as

$$\alpha_i = \frac{\exp(\tilde{z}_i^{T} \boldsymbol{\alpha})}{\sum_{j=1}^{k} \exp(\tilde{z}_j^{T} \boldsymbol{\alpha})},$$

where the attention vector $\boldsymbol{\alpha} \in \mathbb{R}^{2d}$ is randomly initialized and learned simultaneously with the network (the neural network model based on the attention mechanism), $k$ represents the number of paths, and $\mathbb{R}^{2d}$ indicates that the dimension is $2d$.

The Dockerfile vector $v$ is the linear combination of the $\tilde{z}_i$ weighted by their attention weights:

$$v = \sum_{i=1}^{k} \alpha_i \tilde{z}_i.$$

The predictions of the neural network model based on the attention mechanism are calculated as the (softmax-normalized) dot products between the Dockerfile vector $v$ and each base image label:

$$q(y_{i'}) = \frac{\exp(v^{T} \cdot \mathrm{image\_tag}_{i'})}{\sum_{j=1}^{Q} \exp(v^{T} \cdot \mathrm{image\_tag}_{j})},$$

where $Q$ represents the number of base images, $\mathrm{image\_tag}_{i'}$ represents the $i'$-th base image, $v^{T}$ represents the transpose of $v$, and $q(y_{i'})$ represents the distribution probability corresponding to $\mathrm{image\_tag}_{i'}$. The $\mathrm{image\_tag}_{i'}$ with the maximum distribution probability is the base image Y corresponding to $v$.
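The attention-based aggregation and the softmax prediction described above can be sketched in plain Python. This is a forward pass only, under stated assumptions: the path vectors, the weight matrix W, the attention vector, and the base image label vectors are passed in as fixed lists of numbers, whereas in the actual model they are learned during training. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dockerfile_vector(z, W, attention):
    """z: concatenated path vectors z_i; W: weight matrix (list of rows);
    attention: attention vector. Returns (v, attention weights)."""
    # Fully connected layer with tanh activation applied to each path vector.
    combined = [[math.tanh(dot(row, zi)) for row in W] for zi in z]
    # One attention weight per path, normalized over all k paths.
    weights = softmax([dot(c, attention) for c in combined])
    # Dockerfile vector v: attention-weighted sum of the combined vectors.
    v = [sum(w * c[j] for w, c in zip(weights, combined))
         for j in range(len(attention))]
    return v, weights


def predict(v, image_tags):
    """Return (index of the most probable base image, probabilities)."""
    probs = softmax([dot(v, tag) for tag in image_tags])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

The index returned by `predict` identifies the base image label with the largest probability, which corresponds to the recommended base image Y.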

S5: Each path in the container image configuration data set obtained in step S4 is split into a structure sequence and a leaf node. The structure sequences and their corresponding leaf nodes are taken as features and the base image as the label (output) to train a neural network model with a multi-code attention mechanism. The trained attention-based neural network model (the container base image recommendation model) then predicts a base image from Dockerfile functional code segments.

FIG. 2 is a schematic structural diagram of a container base image recommendation system based on configuration code representation according to the present invention, and as shown in FIG. 2, a container base image recommendation system based on configuration code representation includes:

a data set acquisition module 201 for acquiring a container image configuration data set; the container image configuration data set includes a plurality of container image configuration files.

The data analysis module 202 is configured to analyze data in each container image configuration file in the container image configuration data set, and obtain a functional code segment and a base image corresponding to each container image configuration file.

A code segment characterization module 203, configured to characterize each of the functional code segments into an abstract syntax tree structure.

A multi-path obtaining module 204, configured to obtain a plurality of paths of the abstract syntax tree structure from a root node to each leaf node, and each path includes a structure sequence from the root node to a corresponding leaf node and the corresponding leaf node.

The container base mirror image recommendation model training module 205 is configured to obtain a container base mirror image recommendation model by taking a plurality of structure sequences corresponding to each functional code segment and corresponding leaf nodes as inputs, and taking a base mirror image corresponding to each functional code segment as an output to train a neural network model.

The input feature acquisition module 206 is configured to acquire a plurality of structural sequences of the functional code segments to be recommended and corresponding leaf nodes.

The container base image recommendation model application module 207 is configured to input the multiple structure sequences of the functional code segments to be recommended and the corresponding leaf nodes into the container base image recommendation model, and obtain a base image corresponding to the functional code segments to be recommended.

The data set acquisition module 201 specifically includes:

and the open source item set acquisition unit is used for acquiring the open source item set.

And the container mirror image database acquisition unit is used for screening out the items comprising the mirror image configuration file from the open source item set to acquire the container mirror image database.

The open source item set acquisition unit specifically comprises:

The container mirror image configuration data set acquisition unit specifically comprises:

and the hash value acquisition subunit is used for acquiring the hash value of each container image file in the container image database.

In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be understood by referring to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The principles and embodiments of the present invention have been described herein with reference to specific examples, and this description is intended only to assist in understanding the method of the present invention and its core ideas. Modifications made by those of ordinary skill in the art in light of these teachings also fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A container base image recommendation method based on configuration code characterization, the method comprising:

2. The method for recommending container base images based on configuration code characterization according to claim 1, wherein the obtaining a container image configuration data set specifically comprises:

acquiring an open source item set;

3. The method for recommending container base images based on configuration code characterization according to claim 2, wherein the step of obtaining the open source item set specifically comprises the steps of:

4. The method for recommending container base images based on configuration code characterization according to claim 2, wherein the step of eliminating repeated container image configuration files in the container image database to obtain a container image configuration data set composed of a plurality of container image configuration files with different contents comprises the following steps:

5. The configuration code characterization based container base image recommendation method according to claim 1, wherein the neural network model is an attention mechanism based neural network model.

6. A container base image recommendation system based on configuration code characterization, the system comprising:

7. The container base image recommendation system based on configuration code characterization of claim 6, wherein the data set acquisition module specifically comprises:

8. The container base image recommendation system based on configuration code characterization according to claim 7, wherein the open source item set obtaining unit specifically comprises:

9. The container base image recommendation system based on configuration code characterization according to claim 7, wherein the container image configuration data set obtaining unit specifically comprises:

10. The configuration code characterization based container base image recommendation system according to claim 6, wherein the neural network model is an attention mechanism based neural network model.