CN112667286A

CN112667286A - Searching method based on context of programming field environment

Info

Publication number: CN112667286A
Application number: CN202011551429.1A
Authority: CN
Inventors: 张智轶; 许云剑; 黄志球; 陶传奇; 周玉倩
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2021-04-16

Abstract

The invention provides a search method based on the context of the programming field environment. The method comprises the following steps: acquiring context information of the programming field environment, including context information of the programmer, context information of the programming project and task, and context information of the programming time and environment; applying different preprocessing to the acquired raw context information depending on whether it is text language or formal language, and storing the preprocessed information; clustering the preprocessed context information using the K-means algorithm to obtain the semantic relations among the context information; performing hierarchical analysis on the preprocessed context information using a hierarchical clustering method to obtain the explicit and implicit association relations among the context information; and completing the search for the required programming requirement using ElasticSearch as the underlying data retrieval model. By deeply mining the wide and varied semantic relations among the context information, the invention enables accurate code recommendation.

Description

Searching method based on context of programming field environment

Technical Field

The invention belongs to the field of computers, relates to data acquisition technology and inference engine technology in software development, and particularly relates to a search method based on the context of the programming field environment.

Background

With the development of the internet and the popularity of open source software, the reuse of software and code has become increasingly important for improving software development efficiency, and code search techniques have attracted growing research attention. However, current search methods cannot reliably retrieve the programming code a user needs from the user's requirements, which inconveniences the user and wastes time. How to identify code that is both needed and well suited, based on the habits and completed projects of different users, has therefore become a key problem.

Disclosure of Invention

Purpose of the invention: in view of the shortcomings of the prior art, the invention provides a search method based on the context of the programming field environment, which can provide the code required by the user with high precision and bring a better user experience.

The technical scheme is as follows: a search method based on programming field environment context comprises the following steps:

acquiring context information of the programming field environment, including context information of the programmer, context information of the programming project and task, and context information of the programming time and environment;

preprocessing the acquired raw context information of the programming field environment with respect to text language, and storing the result;

clustering the preprocessed context information of the programming field environment using the K-means algorithm to obtain the semantic relations among the context information;

performing hierarchical analysis on the preprocessed context information of the programming field environment using a hierarchical clustering method to obtain the explicit and implicit association relations among the context information;

completing the search for the required programming requirement using ElasticSearch as the underlying data retrieval model.

Advantageous effects: the invention extracts information for analysis from questionnaire surveys, the user's everyday programming information, and the domain of the projects the user has carried out. By deeply mining the wide and varied semantic relations among the context information, it can recommend code that the user needs and that matches the user's programming habits and programming ability.

Drawings

FIG. 1 is a flowchart of a search method based on programming field context according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

Referring to FIG. 1, the search method based on the programming field context provided by the invention first obtains the required user information through explicit acquisition, implicit acquisition and inferential acquisition, and applies natural language processing and preprocessing, targeted at text language, to the user's query conditions and to the collected data such as the raw context information. Next, the K-means algorithm is used to deeply mine the wide and varied semantic relations among the context information, and latent semantic analysis of text is combined with entity link analysis to fuse the context knowledge into a complete and semantically unified whole. Then, hierarchical clustering is used to mine the association relations among the context information, discovering the explicit and implicit associations among the data as well as the rules and characteristics of how the context information changes dynamically; through a snowball effect in the iterative reuse of software knowledge, this forms a virtuous cycle of growth and evolution and supports the acquisition, cleaning, organization and management of large heterogeneous data. Finally, the required programming requirement is searched for using ElasticSearch as the underlying data retrieval engine. The specific steps are as follows:

Step 1: acquire the required information through explicit acquisition, implicit acquisition and inferential acquisition.

First, the required information is determined. Starting from the three dimensions of people, projects and environment, the context information of the programming field environment comprises context information of the programmer, context information of the programming project and task, and context information of the programming time and environment. More specifically, the context information of the programmer includes: familiarity with the current Integrated Development Environment (IDE), familiarity with the current project, the programmer's experience, the programmer's programming habits, and the programmer's social network. The context information of the programming project and task includes: the command currently in use, the module currently running, method descriptions, method calls, project structure, task type, programming error suggestions, project description, project type, and historical recommendation information. The context information of the programming time and environment includes: time information, the version number of the project, the programming location, the interface elements used by the developer, and the interface elements of interest to the developer.
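As an illustration only, the three dimensions listed above could be held in a simple record structure. The following Python sketch is an assumption and not part of the disclosed method: every class name, field name and field type (for example `FieldContext`, `ide_familiarity` as an integer score) is hypothetical and chosen merely to mirror the items enumerated in the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProgrammerContext:
    # Hypothetical fields mirroring the programmer dimension.
    ide_familiarity: int = 0          # e.g. a questionnaire score
    project_familiarity: int = 0      # e.g. a questionnaire score
    experience: str = ""
    programming_habits: List[str] = field(default_factory=list)
    social_network: List[str] = field(default_factory=list)


@dataclass
class ProjectTaskContext:
    # Hypothetical fields mirroring the project and task dimension.
    current_command: str = ""
    current_module: str = ""
    method_descriptions: List[str] = field(default_factory=list)
    method_calls: List[str] = field(default_factory=list)
    project_structure: str = ""
    task_type: str = ""
    error_suggestions: List[str] = field(default_factory=list)
    project_description: str = ""
    project_type: str = ""
    recommendation_history: List[str] = field(default_factory=list)


@dataclass
class TimeEnvironmentContext:
    # Hypothetical fields mirroring the time and environment dimension.
    time_info: str = ""
    project_version: str = ""
    programming_location: str = ""
    used_interface_elements: List[str] = field(default_factory=list)
    focused_interface_elements: List[str] = field(default_factory=list)


@dataclass
class FieldContext:
    # One record combining the three dimensions: people, project, environment.
    programmer: ProgrammerContext = field(default_factory=ProgrammerContext)
    project_task: ProjectTaskContext = field(default_factory=ProjectTaskContext)
    time_environment: TimeEnvironmentContext = field(default_factory=TimeEnvironmentContext)


ctx = FieldContext()
ctx.programmer.ide_familiarity = 4
ctx.project_task.method_calls.append("readFile")
ctx.time_environment.project_version = "1.0.2"
```

Grouping the fields by dimension keeps each acquisition mode (questionnaire, monitoring, inference) free to fill in only the part of the record it is responsible for.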

The various types of information can be acquired as follows:

1) explicit acquisition: familiarity with the current Integrated Development Environment (IDE) and familiarity with the current project are obtained by questionnaire; project types, project descriptions, programming error suggestions and historical recommendation information are obtained mainly by searching the software's record documents, development documents and log reports; time information and the project version number can be obtained by communicating with the user and/or by querying documents;

2) implicit acquisition: the programmer's programming habits are determined by analyzing the user's past code documents, bug reports and other documents; the currently running module, the command currently in use, method descriptions, the programming location, the interface elements used by the developer and the interface elements of interest to the developer can be collected implicitly, for example by screen monitoring and mouse operation monitoring;

3) inferential acquisition (inferring): the programmer's experience is obtained by using a crawler to crawl the user's records in programming forums, and the user's social network is obtained by analyzing and reasoning over social network relationships; the task type, project structure and method call information are obtained by analyzing the project requirements, design documents and code structure, so as to deduce and induce the association relations between parameters and methods.

The user's query conditions and the collected data, such as the raw context information, are then preprocessed. Text language is preprocessed using word segmentation and keyword extraction. Word segmentation yields syntactic and semantic information about words, sentences and so on. Keywords are a set of words that express the subject content of a text and serve as a more concise summary of it, so the keywords obtained by keyword extraction give a quick, rough picture of a document's content.
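The text does not name a particular segmentation or keyword extraction technique, so the following Python sketch is only a stand-in under stated assumptions: it segments English-like text with a regular expression and ranks keywords by raw term frequency after dropping a small, hypothetical stopword list. The function names `segment` and `extract_keywords` and the `STOPWORDS` set are illustrative, not taken from the source.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "with"}


def segment(text):
    """Split text into lowercase word tokens (letters, digits, underscore)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def extract_keywords(text, top_k=3):
    """Return up to top_k most frequent non-stopword tokens (frequency-based stand-in)."""
    counts = Counter(w for w in segment(text) if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]


query = "Parse the JSON file and parse the JSON string"
tokens = segment(query)
keywords = extract_keywords(query, top_k=2)
```

For Chinese text a dedicated segmenter would be needed in place of the regular expression, and a weighting scheme such as TF-IDF could replace raw frequency; both substitutions are left open here because the source does not specify them.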

For the preprocessed information, code is vectorized using a bag-of-words model and represented using a one-hot representation: a word that appears in the word sequence has the value 1, and a word that does not appear has the value 0. For example, for the documents:

1	Chinese Nanjing Chinese
2	Tokyo Japan Chinese

The word bag is constructed as follows:

Chinese	Nanjing	Tokyo	Japan

For each code text, the value of each word in the word bag is calculated according to the one-hot representation, giving a vector representation of that code text.

The word vector of "Chinese Nanjing Chinese" is:

Chinese	Nanjing	Tokyo	Japan
1	1	0	0

The word vector of "Tokyo Japan Chinese" is:

Chinese	Nanjing	Tokyo	Japan
1	0	1	1
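The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of the 0/1 bag-of-words representation, assuming whitespace-separated tokens; the function names `build_vocabulary` and `presence_vector` are hypothetical.

```python
def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocab = []
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def presence_vector(document, vocab):
    """1 if the vocabulary word occurs in the document, otherwise 0."""
    present = set(document.split())
    return [1 if word in present else 0 for word in vocab]


docs = ["Chinese Nanjing Chinese", "Tokyo Japan Chinese"]
vocab = build_vocabulary(docs)
vectors = [presence_vector(d, vocab) for d in docs]

print(vocab)       # ['Chinese', 'Nanjing', 'Tokyo', 'Japan']
print(vectors[0])  # [1, 1, 0, 0]
print(vectors[1])  # [1, 0, 1, 1]
```

Note that the repeated word "Chinese" in the first document still yields 1, because the representation records presence rather than a count.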

For text other than code, an n-gram model is used for representation and storage. For a sentence containing n words, the language model is expressed as: $P(W_1, W_2, \ldots, W_n) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1, W_2) \cdots P(W_n \mid W_1, W_2, \ldots, W_{n-1})$, where P is the probability that the sentence is valid; the larger the probability, the more likely the sentence is valid. The model rests on the Markov assumption that whether a target word occurs in a sentence depends only on a limited number of words immediately preceding it (the preceding $n-1$ words for an n-gram of order n). To reduce computational complexity, the order is typically 2 or 3.
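To make the chain-rule expression concrete, the following Python sketch estimates sentence probabilities with a bigram model (order 2), where each word is conditioned only on the word before it. It is an illustration under stated assumptions: probabilities are plain maximum-likelihood counts without smoothing, the `<s>` and `</s>` boundary markers and the three-sentence corpus are invented for the example, and the function names are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and bigrams, with sentence boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Product of P(cur | prev) over the sentence (unsmoothed estimate)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


corpus = ["open file", "open socket", "read file"]
context_counts, bigram_counts = train_bigram(corpus)
p_seen = sentence_probability("open file", context_counts, bigram_counts)
p_unseen = sentence_probability("read socket", context_counts, bigram_counts)
```

Here "open file" receives probability (2/3) x (1/2) x 1 = 1/3, while "read socket" receives 0 because the bigram "read socket" never occurs in the corpus; a practical model would add smoothing to avoid such zero estimates.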

Step 2: perform cluster analysis on the preprocessed data using the K-means algorithm to obtain the semantic relations among the context information.

The K-means clustering algorithm is an iterative clustering analysis algorithm. K objects are randomly selected as initial cluster centers; the distance between each object and each cluster center is then calculated, and each object is assigned to its nearest cluster center. A cluster center together with the objects assigned to it represents one cluster. The cluster centers are then recomputed from the objects assigned to them, and the assignment is repeated until a termination condition is met.

The invention uses the K-means algorithm to deeply mine the wide and varied semantic relations among the context information, clustering the preprocessed information to find the context text with high topic relevance. Specifically, K pieces of context information related to a topic are selected as initial cluster centers; the distance between each piece of context information and each center is calculated, and each piece of context information is assigned to its nearest cluster center, yielding K clusters. Finally, context information that lies far from its cluster center (for example, beyond a preset distance threshold) is removed, so that the context information with high topic relevance is extracted.
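The clustering and filtering procedure described above can be sketched in plain Python. This is an illustration only: it assumes the context information has already been turned into numeric vectors, uses Euclidean distance (the source does not name a distance measure), recomputes centers iteratively as in standard K-means, and uses toy 2-D points, a hypothetical threshold of 8 and hypothetical function names.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centers, max_iter=100):
    """Standard K-means starting from the given (topic-related) centers."""
    centers = [list(c) for c in initial_centers]
    assignment = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stable: converged
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


def filter_by_distance(points, centers, assignment, threshold):
    """Keep only points within the threshold distance of their own center."""
    return [
        (p, a) for p, a in zip(points, assignment)
        if euclidean(p, centers[a]) <= threshold
    ]


points = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 30)]
centers, assignment = kmeans(points, initial_centers=[(0, 0), (10, 10)])
kept = filter_by_distance(points, centers, assignment, threshold=8.0)
```

In this toy run the outlying point (5, 30) is assigned to the second cluster but lies farther than the threshold from that cluster's center, so it is dropped, mirroring the removal of context information with low topic relevance.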

Step 3: perform hierarchical analysis on the preprocessed context information of the programming field environment using a hierarchical clustering algorithm to obtain the explicit and implicit association relations among the context information.

Hierarchical clustering divides a data set into clusters layer by layer, with the clusters generated at each layer based on the results of the previous layer. Hierarchical clustering algorithms generally fall into two categories: (1) divisive hierarchical clustering, also called top-down hierarchical clustering: all objects initially belong to one cluster, and each time a cluster is split into several clusters according to some criterion, repeating until every object is its own cluster; (2) agglomerative hierarchical clustering: each object is initially its own cluster, and each time the two closest clusters are merged into a new cluster according to some criterion, repeating until all objects belong to one cluster.

In the invention, hierarchical clustering is used to mine the association relations among the context information so as to discover the explicit and implicit associations among the data. An explicit association is a direct association; an implicit association is a hidden association between pieces of context information that is obtained by mining. For example, if the gender of Zhang Sanqi is male, then the relation between Zhang Sanqi and the gender "male" is an explicit association; if mining Zhang Sanqi's everyday activities links Zhang Sanqi to a certain game, then the relation between Zhang Sanqi and that game is an implicit association. The preceding K-means step has difficulty extracting implicit context topic information, so the preprocessed context information is processed with agglomerative hierarchical clustering: each piece of text information starts as its own class, the two closest classes are merged into one class, and this is repeated until only K classes remain.
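The merge-until-K-classes procedure described above can be sketched as follows. This Python example is a minimal illustration under stated assumptions: context information is represented as numeric vectors, distance is Euclidean, and the distance between two classes is taken as the smallest pairwise distance (single linkage). The source does not specify the linkage criterion, so that choice, the toy points and the function names are all hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2):
    """Distance between two classes: smallest pairwise point distance (assumed criterion)."""
    return min(euclidean(a, b) for a in c1 for b in c2)


def agglomerative(points, k):
    """Start with one class per point; merge the two closest classes until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge class j into class i
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = agglomerative(points, k=2)
```

With these toy points the two nearby pairs merge first, then the two pairs merge with each other, leaving one class of four points and the isolated point (20, 20) as its own class. The order in which classes merge forms the hierarchy from which associations at different levels can be read.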

Step 4: search for the required programming requirement using the ElasticSearch search engine.

ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user full-text search engine that makes it convenient to search, analyze and explore large amounts of data. It exposes a RESTful web interface, is developed in Java, is released as open source under the Apache license terms, and is an enterprise-level search engine.

The invention uses ElasticSearch, an open source project, as the underlying data retrieval engine. Information retrieval is performed with ElasticSearch using the context topic information and/or keyword information obtained by the preprocessing and data mining described above.
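As a hedged illustration of how the mined topic information and keywords might be turned into a retrieval request, the following Python sketch builds a request body in the ElasticSearch query DSL using a `bool` query with `match` and `multi_match` clauses. The index layout is an assumption: the field names `context_topics`, `code` and `description`, the function name and the sample terms are hypothetical, and the source does not describe the actual index schema or query form.

```python
import json


def build_search_request(topic_terms, keywords, size=10):
    """Build a JSON-serializable search body (field names are hypothetical)."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    # Match against mined context topic information.
                    {"match": {"context_topics": " ".join(topic_terms)}},
                    # Match extracted keywords against code and description text.
                    {
                        "multi_match": {
                            "query": " ".join(keywords),
                            "fields": ["code", "description"],
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        },
    }


body = build_search_request(["file io"], ["read", "csv"], size=5)
payload = json.dumps(body)
```

Such a body would typically be sent to the `_search` endpoint of an index over the RESTful interface mentioned above; the sketch stops at building the payload so that it runs without a server.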

In summary, the invention provides a method for searching based on the context of the programming field environment which, compared with other search methods, can effectively recommend needed and suitable code to the user. In an evaluation in which multiple users searched for code according to their requirements, with search precision calculated by whether code satisfying the user could be found, the method provided the required code with higher precision; compared with a search method built on an RNN, the search precision improved by about 40% in relative terms, bringing a better user experience.

Claims

1. A search method based on programming field environment context is characterized by comprising the following steps:

2. The search method based on programming field environment context of claim 1, wherein the context information of the programmer comprises: familiarity with the current integrated development environment, familiarity with the current project, the programmer's experience, the programmer's programming habits, and the programmer's social network; the context information of the programming project and task comprises: the command currently in use, the module currently running, method descriptions, method calls, project structure, task type, programming error suggestions, project description, project type, and historical recommendation information; and the context information of the programming time and environment comprises: time information, the version number of the project, the programming location, the interface elements used by the developer, and the interface elements of interest to the developer.

3. The search method based on context of programming field environment of claim 2, wherein the obtaining of context information of programming field environment comprises:

obtaining the familiarity of the current integrated development environment and the familiarity of the current project by adopting a questionnaire mode;

acquiring project types, project descriptions, programming error suggestions and historical recommendation information according to a mode of searching record documents, development documents and log reports of software;

acquiring time information and a project version number in a user communication and/or document query mode;

determining programmer programming habits by analyzing historical code documents and bug report documents;

collecting a current running module, a current use command, method description, a programming place, interface elements used by a developer and interface elements concerned by the developer in a screen monitoring mode and a mouse operation monitoring mode;

crawling the user's records in programming forums using a crawler as the programmer's experience information; analyzing and reasoning over social relationships to obtain the user's social network;

and analyzing the requirements of the project, the design document and the code structure to obtain the task type, the project structure and the method calling information.

4. The search method based on programming field environment context of claim 1, wherein said preprocessing and storing with respect to text language comprises: preprocessing the text language using word segmentation and keyword extraction; representing and storing the preprocessed code information using a bag-of-words model and a one-hot representation; and representing and storing the text information using an n-gram model, the text information being the context information other than code information.

5. The search method based on programming field environment context of claim 4, wherein clustering the preprocessed context information of the programming field environment using the K-means algorithm to obtain the semantic relations among the context information comprises: selecting K pieces of context information related to a topic as initial cluster centers, calculating the distance between each piece of context information and each center, assigning each piece of context information to its nearest cluster center to obtain K clusters, and removing the context information whose distance from its cluster center exceeds a preset threshold, thereby extracting the context information with topic relevance.

6. The search method based on programming field environment context of claim 4, wherein performing hierarchical analysis on the preprocessed context information of the programming field environment using the hierarchical clustering method to obtain the explicit and implicit association relations among the context information comprises: processing the preprocessed context information with hierarchical clustering, taking each piece of text information as its own class, merging the two closest classes into one class, and repeating in sequence until the clustering is finished.