CN114692595A

CN114692595A - Repeated conflict scheme detection method based on text matching

Info

Publication number: CN114692595A
Application number: CN202210606284.3A
Authority: CN
Inventors: 王翊; 唐勇; 刘世伟; 李贺彬; 张卫丰
Original assignee: Xuancai Interactive Network Science And Technology Co ltd; Nanjing University of Posts and Telecommunications
Current assignee: Xuancai Interactive Network Science And Technology Co ltd; Nanjing University of Posts and Telecommunications
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-07-01
Anticipated expiration: 2042-05-31
Also published as: CN114692595B

Abstract

The invention relates to a repeated conflict scheme detection method based on text matching. It targets real scenarios in which a code base with a large amount of historical merged code is migrated from one code warehouse to another, or in which a code warehouse undergoes similar updates. First, relevant merging solutions are extracted from the information of code warehouses whose merging problems have already been resolved. Second, the acquired solutions to historical merging conflicts are stored in a database. Then, when an existing conflict is encountered, a merging scheme is recommended from the saved conflict solutions. Finally, the update time of the matched conflict solution in the database is updated, which indicates how frequently the solved conflict is used. If the conflict is resolved manually by the developer, that solution is saved to the database.

Description

Repeated conflict scheme detection method based on text matching

Technical Field

The invention belongs to the technical field of computers, in particular to the field of software, and specifically relates to a repeated conflict scheme detection method based on text matching.

Background

With the wider application of internet technology and the rapid change of the software industry, demand for remote collaborative work, software code warehouses and conflict resolution keeps increasing. Collaborative development takes two basic forms, distributed and centralized. Git, a typical software code warehouse, is a widely used distributed version control system. The biggest difference between the distributed and the centralized form is that developers can commit locally: each developer copies a complete Git repository onto the local machine by cloning. Git was created as a version control tool for Linux kernel development. Unlike common version control tools such as CVS and Subversion, it uses a distributed repository model that does not require server-side software, which makes publishing and exchanging source code very convenient. Git is fast, which matters for large projects such as the Linux kernel, and it is particularly strong in merge tracing.

However, in collaborative work different people work in parallel, which produces branches. When Git is used to control the project's commit flow, merging different working branches may produce a merge error, that is, a conflict. Developers can resolve some simple merging conflicts with Git's built-in methods, but complex conflicts can only be resolved manually. Moreover, when large open-source projects are merged, the number of conflicts rises sharply and greatly increases developers' workload. This is a major challenge in collaborative work with Git.

One effective answer to these problems is to provide a conflict-solution prompt function in the software integrated development environment. The prompt function analyzes the historical conflicts that developers have already solved; for the conflict currently faced, it quickly matches the historical conflict information and offers the developer a suggested scheme for resolving the conflict, thereby improving development efficiency.

However, because merging schemes are represented in too limited a form in Git version control, current open-source frameworks and IDEs cannot help with resolving complex conflicts.

CN113626385A discloses a method and system for reading text data. It analyzes and filters duplicate files and merges and classifies associated files into corresponding data sets Fn; compares the attribute set and the attribute list of the text file to be extracted, assigns a processing engine according to their degree of similarity, and forms an attribute analysis result set for the file; and, when the degree of matching between the attribute analysis result set and the content analysis set of the text file to be extracted exceeds a first threshold, extracts the text data with the dedicated processing engine for that attribute. The method and system adapt automatically to text file data with known and unknown characteristics, call the corresponding data extraction engine, automatically identify, analyze and store raw text-file data, and improve the efficiency and accuracy of text file extraction as well as big data analysis capability. However, it mainly extracts file attributes and then classifies files whose attribute similarity exceeds a threshold. Repeated files must be classified and their attribute sets and lists extracted; the operation granularity is small and the procedure is cumbersome, so the method is not suited to conflict resolution scenarios.

Disclosure of Invention

The technical problems to be solved by the invention are how to collect conflict information from a warehouse containing historical merged code, how to judge the similarity between conflicts, and how to use that similarity to quickly locate similar conflicts among the stored historical conflict merging schemes as one candidate solution for a complex merging conflict, thereby effectively helping developers understand the code and quickly resolve code merging conflicts.

In order to solve the above technical problems, the present invention provides a repeated conflict scheme detection method based on text matching, which comprises the following steps:

step 1) for a code warehouse with historical merged code, extracting all the merging nodes in the submission tree, and then judging whether each merging node is a node containing conflict history information;

step 2) using a graph traversal algorithm and node merging analysis, recording the name of the current branch of the conflict node, the name of the other branch merged into the current branch, the predecessor node on the other branch merged into the current branch, the predecessor of the conflict node's predecessor, and the ancestor node of the merging branch and the branch to be merged, and recording the submission mark of each node;

step 3) extracting code blocks from the files by using a text matching technology for comparison, obtaining a hash value of each successfully matched code block through a hash algorithm as a record dimension of the conflict scheme, and then storing the current warehouse name, the submission mark of each node, the hash value of each file's code block and the update time in a database;

step 4) matching an existing conflict against the stored conflict schemes by using a database index quick matching technology; on a successful match, returning the matched conflict scheme and updating its storage time; on a failed match, having the programmer resolve the merging conflict manually and then recording the successfully merged conflict scheme into the database.

Further, in step 1), while the merging node is being judged, the commit Ids of the nodes where the comparison file, the source file, the reference file and the merged file are located are saved through the submission tree. Saving the commit Ids makes it possible to locate the conflict files. The warehouse name is also recorded, which provides a basis for saving the conflict information.

Further, the merging node identified by a commit Id is denoted E. For merging node E, the current branch and the merged-in branch of the node are recorded. The node at which the two branches previously converged is found and taken as the ancestor node, and its commit Id is saved. The predecessor node on the branch where the merging node is located is found and denoted D, and the predecessor node on the merged-in branch is denoted C. These nodes contain conflict information and prepare for the subsequent storage of conflict information.

Further, the predecessor node B of node D is obtained, and node B is merged with node C of the merged-in branch. If a merging conflict occurs, the current node E is a conflict node, and the commit Ids of the obtained nodes are recorded. Whether a conflict occurs is judged with the merging algorithm in Git.
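The conflict judgement can be illustrated with a simplified three-way comparison. The sketch below does not reproduce Git's merge algorithm, which the method relies on; it substitutes a much cruder line-based check built on Python's `difflib`, and the function names `changed_ranges` and `has_conflict` are hypothetical. Two sides are treated as conflicting when they change overlapping regions of the reference file in different ways.

```python
import difflib
from typing import List, Tuple

# (start, end, replacement): lines base[start:end] are replaced by `replacement`
Change = Tuple[int, int, List[str]]

def changed_ranges(base: List[str], other: List[str]) -> List[Change]:
    """Return every region of `base` that `other` modifies."""
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    return [
        (i1, i2, other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

def has_conflict(base: List[str], ours: List[str], theirs: List[str]) -> bool:
    """Crude stand-in for a merge: conflict if both sides edit the same
    region of the reference file and the edits differ."""
    for s1, e1, new1 in changed_ranges(base, ours):
        for s2, e2, new2 in changed_ranges(base, theirs):
            overlap = s1 < e2 and s2 < e1
            same_insert_point = s1 == s2 and (s1 == e1 or s2 == e2)
            if (overlap or same_insert_point) and new1 != new2:
                return True
    return False

# Toy file contents: reference file (ancestor), node B version, node C version
reference = ["total = 0", "for x in items:", "    total += x"]
node_b = ["total = 0", "for x in items:", "    total += x * 2"]
node_c = ["total = 0", "for x in items:", "    total += abs(x)"]
print(has_conflict(reference, node_b, node_c))
```

In a real code warehouse the judgement would come from Git itself; this check only shows why two edits to the same lines lead to a conflict node.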

Further, in step 3), the code in the comparison file, the source file and the reference file is first matched by using a text matching technology to obtain a list of the code blocks that cannot be matched completely, and a hash value is obtained for each code block in the list through a hash algorithm. The warehouse name, the source file submission id, the comparison file submission id, the reference file submission id and the merged file submission id are then obtained, together with the hash value of the source code block in the source file, the hash value of the comparison file code block, the hash value of the merged file code block and the hash value of the reference file code block. Finally, these are saved into the database as the merging scheme, that is, the conflict information.
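A minimal sketch of extracting unmatched code blocks and hashing them follows. The text does not name a specific matching routine or hash algorithm here, so the sketch assumes `difflib.SequenceMatcher` for the comparison and SHA-1 for the hash; `unmatched_blocks` and `block_hash` are hypothetical helper names.

```python
import difflib
import hashlib
from typing import List

def unmatched_blocks(reference: List[str], other: List[str]) -> List[str]:
    """Return the blocks of `other` that do not match `reference`,
    each joined into a single string."""
    matcher = difflib.SequenceMatcher(a=reference, b=other, autojunk=False)
    blocks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pure deletions leave no lines in `other`, so they yield no block here
        if tag in ("replace", "insert"):
            blocks.append("\n".join(other[j1:j2]))
    return blocks

def block_hash(block: str) -> str:
    """Hash a code block (SHA-1 is an assumption of this sketch)."""
    # Strip trailing whitespace so cosmetic changes do not alter the hash
    normalised = "\n".join(line.rstrip() for line in block.splitlines())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

reference_file = ["x = 1", "y = 2", "z = 3"]
source_file = ["x = 1", "y = 20", "z = 3"]

source_blocks = unmatched_blocks(reference_file, source_file)
source_hashes = [block_hash(block) for block in source_blocks]
print(source_blocks, source_hashes)
```

The same two helpers would be applied to the comparison file and the merged file to produce the other hash dimensions of the record.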

Further, in step 4), a database index fast matching technology is used. The corresponding conflict blocks are extracted from the conflicts encountered in the project, and a hash algorithm is applied to the code blocks to obtain hash values. The data in the database is matched through the database's indexes. If the match succeeds, the historical conflict merging scheme is recommended and the time of the merging scheme in the database is updated; if the match fails, the conflict is resolved manually and the current conflict merging scheme is then stored in the database. By comparing the existing conflict with the stored conflicts for similarity and reusing the stored conflict merging schemes, the method greatly speeds up the resolution of similar conflicts.

Compared with the prior art, the invention has the following beneficial effects:

1. For a code warehouse with historical merged code, all historical merge submission nodes are extracted as a node set. A graph traversal algorithm and node merging analysis are then applied to each merging node in the set to judge whether it is a conflict node. Next, the key code blocks of the conflict files are found with a text analysis method, their hash values are obtained with a hash algorithm, and the solution is stored in a database. Finally, when an existing conflict is encountered, its file conflict code blocks are matched against the stored merging conflict schemes, which were resolved on the basis of historical information, so that a merging scheme can be recommended, effectively helping developers understand the code and quickly resolve code merging conflicts.

2. For real scenarios involving a code warehouse with historical merged code, such as migration from one code warehouse to another or similar updates within one code warehouse, historical merging decisions have reference value for new merging conflicts. First, the commit submission tree of the code warehouse can be obtained with the SourceTree tool as a visual tree diagram. The commit tree is the core of the entire warehouse in Git, and the commit node is the basic unit of the commit tree. Second, the nodes at which two branches of the commit submission tree meet are found; such a node is defined as a merging node. All merging nodes are collected as a set, and for each node in the set the predecessor nodes on its two branches are obtained. Among the predecessor nodes on the branches being merged, the grandparent node of the merging node must also be found, that is, the predecessor of the merging node's predecessor. The nodes under the two branches are then merged. If no conflict occurs, the merging node is a normal merge. Otherwise a merging conflict occurs, which indicates that the merging node is the result of manual resolution after a conflict. These steps find the merging conflict nodes in a code warehouse.

3. The invention saves historical merging conflict schemes on the basis of the conflict nodes found. Mainly the commit Ids of the nodes are recorded. Through the conflict node, the file of the node after the merge is completed can be found and is called the merged file; the file on the other branch that is merged into the current branch is called the comparison file; and the file at the node where the two branches diverged is called the reference file. The file at the predecessor node of the merging node is called the source file. A merging problem that is encountered must be matched against the merging problems saved in the database. In a code warehouse, merging problems always involve files, so the similarity between files is the key to judging whether a merging scheme suits a merging problem. A text matching method is used: data is preprocessed with a text distance vector algorithm and an N-Gram word segmentation model to obtain the code blocks that differ between the files, also called conflict blocks. The hash values of the conflict code blocks in the source file, the merged file, the comparison file and the reference file are then saved as the dimensions of the stored conflict scheme. Fast matching is performed with the indexes in the database to recommend a merging scheme.
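The text names a text distance vector algorithm and an N-Gram word segmentation model without giving details. One plausible reading, offered only as a sketch, is to split code into tokens, count token N-grams, and compare the resulting count vectors by cosine similarity. The tokenizer regex, the choice of bigrams and the function names below are all assumptions.

```python
import math
import re
from collections import Counter
from typing import List

def tokens(code: str) -> List[str]:
    """Split code into identifiers, numbers and single symbols (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def ngrams(toks: List[str], n: int = 2) -> Counter:
    """Count the N-grams of a token sequence."""
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity between the N-gram count vectors of two code blocks."""
    va, vb = ngrams(tokens(a), n), ngrams(tokens(b), n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

print(cosine_similarity("total += x * 2", "total += x * 3"))
```

A score close to 1 suggests the two blocks are near-duplicates, while a score of 0 means they share no N-gram; blocks that score low against each other would be candidates for the conflict block list.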

4. The invention applies the statistical characteristics of frequently occurring merging decisions to new, similar code changes. If an existing conflict can be matched to a similar merging scheme, the update time of that scheme in the database is updated. If the current conflict does not match any similar merging conflict in the database, the developer can only handle the merging conflict manually, and the conflict solution produced by the developer is then saved in the database. In summary, a human developer or code administrator makes the final merging decision, with understanding enhanced by the recommended merging solution.

Drawings

FIG. 1 is a schematic diagram of acquiring conflict nodes based on historical information according to the present invention.

FIG. 2 is a diagram illustrating a code warehouse submission tree based history merge conflict scheme according to the present invention.

FIG. 3 is a schematic diagram of a merging conflict process based on a scheme recommendation according to the present invention.

Detailed Description

A repeated conflict scheme detection method based on text matching specifically comprises the following steps:

step 1) First, the commit submission tree is obtained for a code warehouse (that is, a code database) containing history merging information, and all merging nodes in the submission tree are found. For each merging node, the relationship between the node and its files is analyzed to judge whether the merging node is a node containing a conflict history; see FIG. 1.

The merging nodes are extracted as follows. For a code warehouse containing a large amount of history merging information, the commit submission tree of the code warehouse is first obtained with the SourceTree tool; the commit submission tree is the core of the whole warehouse in Git, and a commit node is the basic unit of the commit tree. Then, for the branch merging cases, the nodes at which two branches merge are found and collected as a set. Each such node represents a commit on one branch. Next, for each node in the set, the commit Id of the submission is recorded. The commit Id is the SHA1 value that Git generates automatically when a developer successfully submits code to the code warehouse, and it serves as the serial number of that submission. The commit Ids of all merging nodes are obtained by consulting the commit submission tree, and the current warehouse name is then saved as well.
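Extracting merging nodes amounts to selecting the commits that have two or more parents. The sketch below works on a hypothetical in-memory commit graph (`COMMITS`, with single-letter ids standing in for SHA1 commit Ids) rather than on a real repository; on the command line, `git rev-list --merges --all` lists the merge commits of a repository.

```python
from typing import Dict, List

# Hypothetical toy commit graph: commit id -> list of parent commit ids.
# In a real warehouse the ids would be the SHA1 commit Ids generated by Git.
COMMITS: Dict[str, List[str]] = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B"],
    "E": ["D", "C"],  # two parents: a merging node
    "F": ["E"],
}

def find_merge_nodes(commits: Dict[str, List[str]]) -> List[str]:
    """Return the ids of all commits with two or more parents."""
    return sorted(cid for cid, parents in commits.items() if len(parents) >= 2)

merge_nodes = find_merge_nodes(COMMITS)
print(merge_nodes)
```

The resulting list plays the role of the node set that the later steps loop over.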

SourceTree is a free GUI Git client for macOS and Windows. It simplifies the version control process so that the user can focus on what matters, coding. It has a professional UI and can carry out Git tasks with direct access to Git flow, submodules, remote repository servers, local commit search, Git large file support and visual management of the repository. The commit submission tree is generated with the SourceTree tool, and the nodes at which two branches merge are located in it.

step 2) Using a graph traversal algorithm and node merging analysis, the following are recorded when a conflict node is analyzed: the name of the current branch of the conflict node, the name of the other branch merged into the current branch, the predecessor node on the other branch merged into the current branch, the predecessor of the conflict node's predecessor, and the ancestor node of the merging branch and the branch to be merged. The commit Id of each node is recorded. The merging node is obtained by its commit Id.

The merging node obtained by its commit Id is denoted E. For merging node E, the current branch of the node and the branch merged in must be recorded. The node at which the two branches began to diverge is found and its commit Id is saved. The predecessor node on the branch where the merging node is located is D, and the predecessor node on the merged-in branch is C; FIG. 2 shows the specific nodes obtained. The predecessor node B of D is then merged with the predecessor node C on the other branch, and if a merging conflict occurs, the current node E is shown to be a conflict node. The commit Ids of the obtained nodes are recorded as one item of information in the merging conflict resolution.
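The node lookup around merging node E can be sketched as a breadth-first traversal of the commit graph. The graph below is a hypothetical toy example whose node letters follow the description; `common_ancestor` is a simplified stand-in for Git's merge-base computation and may differ from it on histories with criss-cross merges. The sketch also assumes that the first parent of a merge commit is the branch that was checked out when the merge was made, which is Git's convention.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical toy graph: commit id -> parent ids (first parent = current branch)
COMMITS: Dict[str, List[str]] = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B"],
    "E": ["D", "C"],
}

def reachable(commits: Dict[str, List[str]], start: str) -> Set[str]:
    """All commits reachable from `start` by following parent links."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(commits[node])
    return seen

def common_ancestor(commits: Dict[str, List[str]], left: str, right: str) -> Optional[str]:
    """First commit found by BFS from `right` that is also reachable from `left`."""
    from_left = reachable(commits, left)
    seen: Set[str] = set()
    queue = deque([right])
    while queue:
        node = queue.popleft()
        if node in from_left:
            return node
        if node not in seen:
            seen.add(node)
            queue.extend(commits[node])
    return None

def describe_merge(commits: Dict[str, List[str]], merge_id: str) -> Dict[str, Optional[str]]:
    """Collect the nodes recorded for one merging node."""
    d, c = commits[merge_id][:2]
    b = commits[d][0] if commits[d] else None
    return {"merge": merge_id, "D": d, "C": c, "B": b,
            "ancestor": common_ancestor(commits, d, c)}

info = describe_merge(COMMITS, "E")
print(info)
```

The commit Ids collected in `info` correspond to the node identifiers that are later stored with the merging scheme.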

step 3) The reference file of the ancestor node, the comparison file of the predecessor node on the other branch merged into the current branch, the source file of the predecessor of the current node's predecessor, and the merged file of the conflict node are acquired.

Step 2) determines whether the current node is a conflict node. The node set from step 1) is then looped over to find all merging nodes that contain a conflict history. To pin down the specifics of each historical merging conflict node, and with reference to step 2), the commit Ids of the nodes where the comparison file, the source file, the reference file and the merged file are located are saved through the commit submission tree, and the conflict files are located through the saved commit Ids.

step 4) The files obtained in step 3) are compared by using a text matching technology, a hash value is obtained through a hash algorithm for each successfully matched code block as a record dimension of the conflict scheme, and the current warehouse name, the commit Id of each node from step 2), the hash value of each file's code block and the update time are stored in the database.

To save the history merging scheme, the code in the comparison file, the source file and the reference file is first matched by using a text matching technology, yielding a list of the code blocks that cannot be matched successfully. A hash value is obtained for each code block in the list through a hash algorithm. Steps 1) and 3) provide the warehouse name, the submission ids of the source file, the comparison file, the reference file and the merged file, and the hash values of the source file code block, the comparison file code block, the merged file code block and the reference file code block. Finally, these are stored in the database as the merging scheme, in the form <Id, project name, merged commit Id, source commit Id, target commit Id, base commit Id, source code fragment hash, target code fragment hash, merged code fragment hash, base code fragment hash, update time>. Here the project name is the warehouse name; the merged commit Id is the submission Id of the merged file node; the source commit Id is the submission Id of the source file node; the target commit Id is the submission Id of the comparison file node; the base commit Id is the submission Id of the reference file node; the source code fragment hash is the hash value of the source file code block; the target code fragment hash is the hash value of the comparison file code block; the merged code fragment hash is the hash value of the merged file code block; the base code fragment hash is the hash value of the reference file code block; and the update time is the time at which the record was inserted or last updated.
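The record form above maps naturally onto a relational table. The sketch below uses SQLite through Python's standard `sqlite3` module purely for illustration; the text does not name a database product, and the table name, column names and the choice of indexed columns are assumptions derived from the field list.

```python
import sqlite3
import time

# Column names mirror the fields of the merging scheme record (assumed spelling)
SCHEMA = """
CREATE TABLE merge_scheme (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    project_name TEXT NOT NULL,
    merged_commit_id TEXT NOT NULL,
    source_commit_id TEXT NOT NULL,
    target_commit_id TEXT NOT NULL,
    base_commit_id TEXT NOT NULL,
    source_code_fragment_hash TEXT NOT NULL,
    target_code_fragment_hash TEXT NOT NULL,
    merged_code_fragment_hash TEXT NOT NULL,
    base_code_fragment_hash TEXT NOT NULL,
    update_time REAL NOT NULL
);
CREATE INDEX idx_conflict_hashes ON merge_scheme (
    source_code_fragment_hash, target_code_fragment_hash, base_code_fragment_hash
);
"""

def open_store() -> sqlite3.Connection:
    """Create an in-memory store holding the merging scheme table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def save_scheme(conn: sqlite3.Connection, record: dict) -> int:
    """Insert one merging scheme record and return its generated Id."""
    columns = ", ".join(record)  # keys come from trusted code, not user input
    placeholders = ", ".join(":" + key for key in record)
    cursor = conn.execute(
        f"INSERT INTO merge_scheme ({columns}) VALUES ({placeholders})", record
    )
    conn.commit()
    return cursor.lastrowid

# Placeholder values; real records would hold SHA1 commit Ids and block hashes
example_record = {
    "project_name": "demo-warehouse",
    "merged_commit_id": "E", "source_commit_id": "B",
    "target_commit_id": "C", "base_commit_id": "A",
    "source_code_fragment_hash": "hash-source",
    "target_code_fragment_hash": "hash-target",
    "merged_code_fragment_hash": "hash-merged",
    "base_code_fragment_hash": "hash-base",
    "update_time": time.time(),
}
store = open_store()
print(save_scheme(store, example_record))
```

Indexing the hash columns is what lets a later lookup avoid scanning every stored scheme.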

step 5) An existing conflict is matched against the conflict schemes stored in step 4) by using a database index quick matching technology. On a successful match, the matched conflict scheme is returned to the developer and its storage time is updated. On a failed match, the developer must resolve the merging conflict manually, and the current merging conflict scheme is then recorded in the database according to steps 3) and 4).

Referring to FIG. 3, a database index fast matching technology is used. The corresponding conflict blocks are extracted from the conflicts the developer encounters in the project and processed with a hash algorithm to obtain hash values, and the data in the database is matched through the database's indexes. If the match succeeds, the historical merging conflict scheme is recommended to the developer and the time of the merging scheme in the database is updated. If the match fails, the developer resolves the conflict manually, and the current merging conflict scheme is stored in the database according to steps 3) and 4).
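The match-or-store flow can be sketched as an indexed lookup followed by either a timestamp update or an insert. The sketch below uses a reduced, hypothetical table in SQLite; treating the source, comparison (target) and reference (base) block hashes as the lookup key is an assumption, since the text does not state which hash dimensions are matched.

```python
import sqlite3
import time
from typing import Optional

def open_store() -> sqlite3.Connection:
    """Reduced in-memory table holding only the fields needed for matching."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE merge_scheme (source_hash TEXT, target_hash TEXT, "
        "base_hash TEXT, merged_hash TEXT, update_time REAL)"
    )
    conn.execute(
        "CREATE INDEX idx_hashes ON merge_scheme (source_hash, target_hash, base_hash)"
    )
    return conn

def recommend(conn: sqlite3.Connection, source_hash: str, target_hash: str,
              base_hash: str, now: Optional[float] = None) -> Optional[str]:
    """Return the stored merged-block hash for a matching conflict and
    refresh its update time, or None when nothing matches."""
    row = conn.execute(
        "SELECT rowid, merged_hash FROM merge_scheme "
        "WHERE source_hash = ? AND target_hash = ? AND base_hash = ?",
        (source_hash, target_hash, base_hash),
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE merge_scheme SET update_time = ? WHERE rowid = ?",
        (time.time() if now is None else now, row[0]),
    )
    conn.commit()
    return row[1]

def record_manual_resolution(conn: sqlite3.Connection, source_hash: str,
                             target_hash: str, base_hash: str, merged_hash: str,
                             now: Optional[float] = None) -> None:
    """Store a scheme that the developer resolved by hand."""
    conn.execute(
        "INSERT INTO merge_scheme VALUES (?, ?, ?, ?, ?)",
        (source_hash, target_hash, base_hash, merged_hash,
         time.time() if now is None else now),
    )
    conn.commit()

store = open_store()
if recommend(store, "hash-source", "hash-target", "hash-base") is None:
    # No stored scheme matched, so the manual resolution is recorded
    record_manual_resolution(store, "hash-source", "hash-target", "hash-base", "hash-merged")
print(recommend(store, "hash-source", "hash-target", "hash-base"))
```

In a fuller version the returned hash would be used together with the stored merged commit Id to retrieve the resolved code shown to the developer.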

The implementation process is as follows: conflict information of the merging nodes is acquired from the historical information, the historical merging conflict schemes are stored in the database, and when a similar conflict is encountered again, the stored schemes are searched quickly and a conflict solution is recommended. The model can be used to resolve existing merging conflicts. First, the invention focuses on recommending solutions to merging problems in continuous integration: for a code warehouse with historical merging information, if the code warehouse contains similar conflict scenarios, the warehouse information containing historical conflict merges is used to extract the relevant merging solutions. Second, the acquired solutions to historical merging conflicts are stored in the database. Then, when an existing conflict is encountered, a merging scheme is recommended from the saved merging conflict schemes. Finally, the update time of the matched conflict in the database is updated, which indicates how frequently the solved conflict is used. If the conflict is resolved manually by a developer, that solution must be saved to the database.

The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and such modifications and substitutions shall also fall within the protection scope of the present invention.

Claims

1. A repeated conflict scheme detection method based on text matching is characterized by comprising the following steps:

step 1) for a code warehouse with historical merged code, extracting all the merging nodes in the submission tree, and then judging whether each merging node is a node containing conflict history information;

step 2) using a graph traversal algorithm and node merging analysis, recording the name of the current branch of the conflict node, the name of the other branch merged into the current branch, the predecessor node on the other branch merged into the current branch, the predecessor of the conflict node's predecessor, and the ancestor node of the merging branch and the branch to be merged, and recording the submission mark of each node;

step 3) extracting code blocks from the files by using a text matching technology for comparison, obtaining a hash value of each successfully matched code block through a hash algorithm as a record dimension of the conflict scheme, and then storing the current warehouse name, the submission mark of each node, the hash value of each file's code block and the update time in a database as the merging scheme of the code conflict;

step 4) matching an existing conflict against the stored conflict schemes by using a database index fast matching technology; on a successful match, returning the matched conflict scheme and updating its storage time; on a failed match, having the programmer resolve the merging conflict manually and recording the successfully merged conflict scheme into the database.

2. The repeated conflict scheme detection method based on text matching according to claim 1, wherein: in step 1), when the merging node is judged, the commit Ids of the nodes where the comparison file, the source file, the reference file and the merged file are located are saved through the submission tree, the conflict files can be located through the saved commit Ids, and the warehouse name is recorded.

3. The repeated conflict scheme detection method based on text matching according to claim 2, wherein: the merging node identified by a commit Id is denoted E; for merging node E, the current branch and the merged-in branch of the node are recorded; the node at which the two branches previously converged is found and taken as the ancestor node, and its commit Id is saved; the predecessor node on the branch where the merging node is located is found and denoted D, and the predecessor node on the merged-in branch is denoted C.

4. The repeated conflict scheme detection method based on text matching according to claim 3, wherein: the predecessor node B of node D is obtained, and node B is merged with node C of the merged-in branch; if a merging conflict occurs, the current node E is determined to be a conflict node, and the commit Ids of the obtained nodes are recorded.

5. The repeated conflict scheme detection method based on text matching according to claim 1, wherein: in step 3), the code in the comparison file, the source file and the reference file is first matched by using a text matching technology to obtain a list of the code blocks that cannot be matched completely, and a hash value is obtained for each code block in the list through a hash algorithm; the warehouse name, the source file submission id, the comparison file submission id, the reference file submission id and the merged file submission id are obtained, together with the hash value of the source code block in the source file, the hash value of the comparison file code block, the hash value of the merged file code block and the hash value of the reference file code block; and finally these are saved in the database as the merging scheme of the code conflict.

6. The repeated conflict scheme detection method based on text matching according to claim 1, wherein: in step 4), a database index fast matching technology is used; the corresponding conflict blocks are extracted from the conflicts encountered in the project, and a hash algorithm is applied to the code blocks to obtain hash values; the data in the database is matched through the database's indexes; if the match succeeds, the historical merging conflict scheme is recommended and the time of the merging scheme in the database is updated; if the match fails, the conflict is resolved manually and the current merging conflict scheme is then stored in the database.