CN116257819A - Rapid open source license identification method, system and medium for large-scale software - Google Patents


Info

Publication number
CN116257819A
CN116257819A CN202310223364.5A CN202310223364A CN116257819A CN 116257819 A CN116257819 A CN 116257819A CN 202310223364 A CN202310223364 A CN 202310223364A CN 116257819 A CN116257819 A CN 116257819A
Authority
CN
China
Prior art keywords
open source
license
automaton
source license
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310223364.5A
Other languages
Chinese (zh)
Inventor
任怡
姜智文
谭郁松
李宝
王庆坤
赵俊
李漠
董攀
张建锋
蹇松雷
王晓川
丁滟
谭霜
郭勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202310223364.5A
Publication of CN116257819A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/105 Arrangements for software license management or administration, e.g. for managing licenses at corporate level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Technology Law (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a rapid open source license identification method, system and medium for large-scale software. The method comprises: determining a set of open source licenses to be identified; constructing an AC (Aho-Corasick) automaton for identifying open source licenses from the identifier of each open source license in the set; and, for each user code file to be identified, using the AC automaton to obtain the identification result for every open source license of the set in that file. Because the AC automaton identifies the licenses in a text in a single scan, the time complexity of identification is reduced. In addition, when an input user code file is parsed, only the annotation (comment) information containing the license statement text is extracted for identification, which reduces the amount of content to be matched. The method is therefore particularly suitable for rapid open source license identification in large-scale software.

Description

Rapid open source license identification method, system and medium for large-scale software
Technical Field
The invention relates to component analysis, open source license management and related technologies in the field of computer software, and in particular to a rapid open source license identification method, system and medium for large-scale software.
Background
The use of open source components has become one of the important ways to achieve rapid software development and technical innovation. In recent years, as software development has become increasingly globalized, large numbers of third-party open source components (Open Source Software, OSS) have been used, after being modified or extended with the desired functionality, to improve software development efficiency. According to a report issued by Forrester in 2021, the proportion of open source code in the audited codebases of 17 industries, including the Internet of Things and network security, almost doubled within 5 years. As the information technology industry develops, software development grows more difficult and the software supply chain trends toward open source, so introducing large amounts of open source software has become a mainstream approach to software development. However, many developers and software purchasers misunderstand the term "open source": they confuse open source software with free software or shareware that carries no specific license, and assume that open source software can be modified, copied and distributed at will.
Open source licenses emerged together with open source software. An open source license specifies the scope within which, and the conditions under which, a user may use the open source software. However, engineers who misunderstand the open source concept often overlook the importance of the license, and modify open source software or open source code and then distribute it or use it commercially, which leads to legal disputes. According to statistics from the Black Duck audit services team, in 2021 license conflicts were present in 73% of the open source code contained in the audited codebases, and 30% of the audited codebases either had no license or used open source code with custom licenses.
During software development, introducing open source licenses that conflict with one another may create compliance risk. Take the GPL (GNU General Public License) and the MPL (Mozilla Public License) as an example. The GPL requires that, wherever GPL-licensed source code is used, all source code of the entire software be released under the GPL. The MPL, by contrast, provides that when the MPL-licensed code resides in separate code files, other newly added files need not be open-sourced. Thus, when an enterprise uses open source software that contains both GPL-licensed and MPL-licensed code, conflicts between the license terms may expose the enterprise to the compliance risk of violating the terms of one of the licenses.
To use open source software reasonably and properly, and to avoid infringing the intellectual property rights of others and causing legal disputes, open source license identification has become indispensable work for software developers who use open source software. Many open source license identification and management tools are currently available; most of them are open source tools from open source communities or applications developed by commercial companies, such as Go License Detector, findLicense and FOSSology. Existing open source license management tools identify an open source license by processing the specific terms in the license text and then performing matching. For example, Go License Detector preprocesses the content of the open source license with the MinHash method, and FOSSology segments the text of the license terms with the bSAM algorithm. License terms are relatively long, so matching based on term content requires a large amount of text matching. As the amount of open source software increases rapidly and the scale of software development grows, open source licenses need to be identified more efficiently.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a rapid open source license identification method, system and medium for large-scale software.
To solve the above technical problem, the invention adopts the following technical solution:
A rapid open source license identification method for large-scale software comprises the following steps:
S101, determining a set of open source licenses to be identified;
S102, constructing an AC automaton for identifying open source licenses from the identifier of each open source license in the set of open source licenses to be identified;
S103, for each user code file to be identified, using the AC automaton to obtain the identification results for all open source licenses of the set in the user code file.
Optionally, determining the set of open source licenses to be identified in step S101 means using a web crawling technique to capture, online, the information of the open source licenses to be identified from a web page that contains them, thereby obtaining the set of open source licenses to be identified.
Optionally, step S101 includes:
S201, acquiring the input web address of a web page of the open source licenses to be identified;
S202, reading the table data that records open source license information from the input web address, and converting the table data into a data frame format;
S203, extracting the full name and the identifier of each open source license from the data frame, and acquiring the license text content of each open source license from the license text address pointed to by its hyperlink;
S204, constructing the set of open source licenses to be identified, with the full name, the identifier and the license text content of each open source license as the attributes of that license in the set.
Optionally, the web address input in step S201 is the license list address of the Software Package Data Exchange (SPDX) website: https://spdx.org/license/.
Optionally, the table data read in step S202 includes at least one of the table data recording open source licenses in normal use and the table data recording deprecated open source licenses.
Optionally, step S102 includes: for the set of open source licenses to be identified, splitting the identifier of each open source license on the separator "-" into independent character strings that do not contain the separator, using these independent character strings as the nodes of the word search tree (Trie) of the AC automaton M, and constructing the Trie of the AC automaton M from the identifiers of the open source licenses in the set, thereby obtaining the AC automaton for identifying open source licenses.
Optionally, step S103 includes:
S301, for the user code files to be identified, traversing the user software directory to obtain the directory tree of the user software, extracting from the user code files the annotation information that contains a license statement, establishing a mapping between the annotation information and the code files, assigning an index value to each piece of extracted annotation information in reading order, and marking the corresponding code file with that index value;
S302, in order of the assigned index values, inputting each piece of extracted annotation information into the AC automaton as a character string, and determining from the output of the AC automaton which open source licenses the corresponding code file contains: if the identifier of an open source license is output, the code file contains the open source license corresponding to that identifier. In this way the identification results for all open source licenses of the set in all user code files are obtained.
Optionally, the AC automaton constructed in step S102 is implemented in the Python language, and before the extracted annotation information is input into the AC automaton as a character string in step S302, the method further includes constructing an AC automaton service cluster with the Python distributed computing framework Ray. After the AC automaton service cluster receives an input character string, the master node of the cluster uses a preset scheduling policy to assign one or more child nodes to execute the AC automaton program and identify the character string, and all identification results are returned through the master node. Constructing the AC automaton service cluster with the Python distributed computing framework Ray includes:
S401, selecting in advance one node from a plurality of physical machines or virtual machines as the master node of the AC automaton service cluster, and activating a Python runtime environment containing the distributed computing framework Ray on the master node and on the other nodes;
S402, using the distributed computing framework Ray to specify the IP address of the master node on each of the other nodes, according to the output produced when the master node activated the AC automaton service cluster, so that the other nodes join the AC automaton service cluster as child nodes;
S403, annotating the AC automaton program with a decorator, and serializing the program code of the AC automaton into the Redis database of the distributed computing framework Ray, where it is stored as an object for each child node to call, execute and exchange data.
In addition, the invention provides a rapid open source license identification system for large-scale software, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the rapid open source license identification method for large-scale software described above.
Furthermore, the invention provides a computer-readable storage medium storing a computer program that is to be programmed or configured by a microprocessor to execute the rapid open source license identification method for large-scale software described above.
Compared with the prior art, the invention has the following advantages. The rapid open source license identification method for large-scale software determines a set of open source licenses to be identified; constructs an AC automaton for identifying open source licenses from the identifier of each open source license in the set; and, for each user code file to be identified, uses the AC automaton to obtain the identification result for every open source license of the set in that file. Because the AC automaton identifies the licenses in a text in a single scan, the time complexity of identification is reduced. In addition, when an input user code file is parsed, only the annotation information containing the license statement text is extracted for identification, which reduces the amount of content to be matched. The method is therefore particularly suitable for rapid open source license identification in large-scale software.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating steps for extracting open source license information according to an embodiment of the present invention.
FIG. 3 is a flowchart of AC automaton construction in an embodiment of the present invention.
FIG. 4 is a flowchart of the decision process for the AC automaton output mode in an embodiment of the present invention.
FIG. 5 is a flowchart illustrating steps for identifying an open source license in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent.
As shown in FIG. 1, the rapid open source license identification method for large-scale software in this embodiment includes:
S101, determining a set of open source licenses to be identified;
S102, constructing an AC automaton for identifying open source licenses from the identifier (Identifier) of each open source license in the set of open source licenses to be identified;
S103, for each user code file to be identified, using the AC automaton to obtain the identification results for all open source licenses of the set in the user code file.
It should be noted that the set of open source licenses to be identified in step S101 may be determined in any feasible manner as required. As a preferred implementation, in this embodiment determining the set in step S101 means using a web crawling technique to capture, online, the information of the open source licenses to be identified from a web page that contains them, thereby obtaining the set of open source licenses to be identified.
As shown in FIG. 2, step S101 of this embodiment includes:
S201, acquiring the input web address of a web page of the open source licenses to be identified;
Referring to FIG. 2, as an optional implementation, the web address input in step S201 of this embodiment is the license list address of the Software Package Data Exchange (SPDX) website: https://spdx.org/license/ (the SPDX official website for short). Of course, the input web address may also be that of any other web page that contains the open source licenses to be identified. SPDX is an open standard, proposed by the Linux Foundation, for exchanging software bill of materials information. SPDX standardizes the names, identifiers and other information of more than 400 open source licenses and updates this information continuously, so a standard open source license database built from it can serve as the basis for identifying the open source licenses used in user software.
S202, reading the table data that records open source license information from the input web address, and converting the table data into a data frame format;
The table data may be read from the input web address in any desired manner, for example by character parsing or by a component or control that performs character parsing. In this embodiment, the pandas component of the Python language is used to read the table data that records open source license information from the input web address. The license-related information is classified and extracted according to the header information of the table; the license text content is obtained through the hyperlink to the license terms; and the extracted license-related information is sorted, then output and stored by category. Specifically, in this embodiment the read_html() method of the pandas component reads the table data from the SPDX License List web page and converts the tables from HTML format into the pandas DataFrame format. There are two tables: one lists the open source licenses that can still be used normally, and the other lists the open source licenses that have been deprecated. The license-related information is then classified and extracted according to the header information of the table, which includes "Full name", "Identifier" and "FSF Free/Libre?". The "Full name" of each open source license is hyperlinked to another web page that presents the license text, that is, the specific terms of the license, which specify the rights and obligations of the user. Through this hyperlink, the web page content is fetched with the get() method of the Python requests library and returned as binary data. After the binary data is decoded into text, the license text content is extracted according to the license-text keyword.
S203, extracting the full name and the identifier of each open source license from the data frame, and acquiring the license text content of each open source license from the license text address pointed to by its hyperlink;
Referring to FIG. 2, as an optional implementation, the table data read in step S202 of this embodiment includes at least one of the table data recording open source licenses in normal use and the table data recording deprecated open source licenses;
S204, constructing the set of open source licenses to be identified (which may also be called the open source license database), with the full name, the identifier and the license text content of each open source license as the attributes of that license in the set. Specifically, in this embodiment the full name, identifier and license text content of each open source license are written, as the attributes of that license, to a file in CSV format, thereby constructing the set of open source licenses to be identified.
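Steps S201 to S204 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the embodiment reads the tables with pandas read_html() and fetches license text with requests, whereas this sketch swaps in the standard-library html.parser so that it runs without network access or third-party packages. The inline SAMPLE_HTML, the names TableParser, build_license_set and to_csv, and the rule that the second table holds deprecated licenses are all hypothetical. Fetching the license text itself is omitted; only the hyperlink is recorded.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for the license list page: two tables, active then deprecated.
SAMPLE_HTML = """
<table>
  <tr><th>Full name</th><th>Identifier</th></tr>
  <tr><td><a href="./MIT.html">MIT License</a></td><td>MIT</td></tr>
  <tr><td><a href="./Apache-2.0.html">Apache License 2.0</a></td><td>Apache-2.0</td></tr>
</table>
<table>
  <tr><th>Full name</th><th>Identifier</th></tr>
  <tr><td><a href="./GPL-2.0.html">GNU General Public License v2.0 only</a></td><td>GPL-2.0</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows; each cell is a (text, href) pair."""

    def __init__(self):
        super().__init__()
        self.tables, self._row, self._cell, self._href = [], None, None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell, self._href = [], None
        elif tag == "a" and self._cell is not None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(("".join(self._cell).strip(), self._href))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def build_license_set(html):
    """Extract full name, identifier and text link; flag licenses from later tables."""
    parser = TableParser()
    parser.feed(html)
    licenses = []
    for table_index, table in enumerate(parser.tables):
        header = [text for text, _ in table[0]]
        name_col, id_col = header.index("Full name"), header.index("Identifier")
        for row in table[1:]:
            licenses.append({
                "full_name": row[name_col][0],
                "identifier": row[id_col][0],
                "text_url": row[name_col][1],
                "deprecated": table_index > 0,  # assumption: table 2 is the deprecated list
            })
    return licenses


def to_csv(licenses):
    """Serialize the license set as CSV text, one license per row."""
    buffer = io.StringIO()
    fields = ["full_name", "identifier", "text_url", "deprecated"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(licenses)
    return buffer.getvalue()


license_set = build_license_set(SAMPLE_HTML)
csv_text = to_csv(license_set)
```

Keeping a deprecated flag on each record is one simple way to support the later reminder to users whose code still declares a deprecated license.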
In this embodiment, the open source license information is extracted online from the SPDX official website. This avoids having to update the open source license database manually whenever the content of the SPDX License List changes: the constructed standard license database can be updated simply by running the open source license information extraction program unit. The SPDX License List also gives the deprecated open source licenses. These licenses are marked when the open source license database is constructed, and if one of them is detected in a user's code file, the user is reminded to update it promptly to avoid license conflicts and compatibility risks.
In this embodiment, step S102 includes: for the set of open source licenses to be identified (whose set of identifiers is referred to as the Identifier set for short), splitting the identifier of each open source license on the separator "-" into independent character strings that do not contain the separator, using these independent character strings as the nodes of the word search tree (Trie) of the AC automaton M, and constructing the Trie of the AC automaton M from the identifiers of the open source licenses in the set, thereby obtaining the AC automaton for identifying open source licenses. Analysis shows that, in the license list of the SPDX website, the names and identifiers of different licenses developed by the same organization share a large number of common substrings, and these common substrings are generally separated by "-". This embodiment therefore splits each identifier on "-" and uses the resulting strings as Trie nodes, so that a common substring becomes a single node. Compared with the conventional approach of building the Trie with a single letter per node, this greatly reduces the number of nodes, which both saves memory and improves the efficiency of traversing the Trie.
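The node-count saving described above can be illustrated with a small sketch. The function name build_trie and the five sample identifiers are chosen for illustration only; the same function builds a token-level Trie (identifiers split on "-") and a conventional character-level Trie so that their sizes can be compared.

```python
def build_trie(patterns):
    """Build a trie from sequences; return a list of {symbol: child_index} dicts."""
    nodes = [{}]  # node 0 is the root
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in nodes[state]:
                nodes.append({})
                nodes[state][symbol] = len(nodes) - 1
            state = nodes[state][symbol]
    return nodes


# Illustrative subset of identifiers that share "-"-separated substrings.
identifiers = [
    "BSD-2-Clause",
    "BSD-3-Clause",
    "BSD-3-Clause-Attribution",
    "GPL-2.0-only",
    "GPL-3.0-only",
]

# Token-level trie: each "-"-delimited string is one symbol, hence one node.
token_trie = build_trie([identifier.split("-") for identifier in identifiers])

# Character-level trie: a str iterates character by character.
char_trie = build_trie(identifiers)

print(len(token_trie), "nodes with token symbols")
print(len(char_trie), "nodes with single-character symbols")
```

The gap widens as more identifiers from the same license family are added, because each shared token costs one node regardless of its length.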
After the Trie of the AC automaton M has been constructed in this way, the Trie is traversed and the failure function value of each node is obtained through a recurrence. Multi-pattern matching can then be completed in a single pass over the input, which reduces the time complexity of matching. In addition, a check performed when a final-state node is reached ensures that the longest matching substring is output, which improves identification accuracy.
In this embodiment, the constructed AC automaton is denoted as the AC automaton M, defined as the six-tuple M = (Q, Σ, g, f, q0, F), where Q is the finite set of states, Σ is the finite input character set, g is the state transition function, f is the failure function, q0 is the initial state, and F is the set of final states.
As shown in FIG. 3, once the identifiers of the open source licenses in the set have been split on the separator "-" into independent character strings and the Trie of the AC automaton M has been constructed with these strings as nodes, the Trie contains the finite state set Q, the input character set Σ, the state transition function g, the initial state q0 and the final state set F of the AC automaton. Starting from the root node of the Trie, a depth-first traversal with a recursive function constructs the failure value (mismatch value) of every node, yielding the failure function of the AC automaton, that is, the next state to jump to when a match fails. The root node of the Trie is layer 0, and the failure value of every layer-1 node is the root node. For the other nodes, the failure value follows from the recurrence f(i) = g(f(i.pre), x), where f(i) is the failure function value of the current node, f(i.pre) is the failure function value of the predecessor node of the current node, x is the input character on the edge from the predecessor node to the current node, and g is the state transition function. When a final state of the AC automaton is reached, it is checked whether the node has successor nodes. If it has none, the currently matched identifier and the corresponding open source license are output. If it has successors, further characters are read and matching continues: if the next match fails, the identifier of that final-state node and the corresponding open source license are output; if it succeeds and reaches the next final state, the check is repeated. The resulting output procedure is shown in FIG. 4.
The identifiers of the open source licenses in the set to be identified serve as the input for constructing the AC automaton. Treating each unit delimited by "-" as an independent character, rather than each single letter, reduces the number of Trie nodes and saves memory. With the failure function, when matching fails in the current state, the prefix that has already been matched is used to find the next suitable state and matching continues from there without returning to the initial state, which reduces the time complexity of matching. Checking the final state at output time ensures that the longest identifier is matched, which avoids misidentifying similar identifiers (such as BSD-3-Clause and BSD-3-Clause-Attribution) and improves identification accuracy.
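The construction and the longest-match output rule can be sketched as below. This is an illustrative sketch, not the patented program. Two points are assumptions introduced here: the input text is tokenized with a regular expression that keeps letters, digits, "." and "+", so "-" and whitespace both act as token boundaries; and the failure links are computed in breadth-first order, which guarantees that f(i.pre) is available before f(i), whereas the description mentions a depth-first recursive traversal. The class name TokenAhoCorasick is hypothetical.

```python
import re
from collections import deque


class TokenAhoCorasick:
    """Aho-Corasick automaton whose symbols are "-"-delimited tokens."""

    def __init__(self, identifiers):
        self.goto = [{}]      # g: state -> {token: next state}; state 0 is q0
        self.fail = [0]       # f: state -> state to fall back to on a mismatch
        self.output = [None]  # identifier recognized at a final state, else None
        for identifier in identifiers:
            self._insert(identifier)
        self._build_failure()

    def _insert(self, identifier):
        state = 0
        for token in identifier.split("-"):
            if token not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append(None)
                self.goto[state][token] = len(self.goto) - 1
            state = self.goto[state][token]
        self.output[state] = identifier

    def _step(self, state, token):
        # Follow failure links until a transition on `token` exists or the root is reached.
        while state and token not in self.goto[state]:
            state = self.fail[state]
        return self.goto[state].get(token, 0)

    def _build_failure(self):
        queue = deque(self.goto[0].values())  # layer-1 nodes keep failure value 0 (root)
        while queue:
            parent = queue.popleft()
            for token, child in self.goto[parent].items():
                # f(i) = g(f(i.pre), x)
                self.fail[child] = self._step(self.fail[parent], token)
                queue.append(child)

    def search(self, text):
        """Return the identifiers found in `text`, preferring the longest match."""
        found, state, pending = [], 0, None
        for token in re.findall(r"[A-Za-z0-9.+]+", text):
            if pending is not None and token not in self.goto[state]:
                found.append(pending)  # cannot extend further: emit the remembered match
                pending = None
            state = self._step(state, token)
            if self.output[state] is not None:
                if self.goto[state]:
                    pending = self.output[state]  # final state with successors: try to extend
                else:
                    found.append(self.output[state])
                    pending = None
        if pending is not None:
            found.append(pending)
        return found


automaton = TokenAhoCorasick(
    ["BSD-3-Clause", "BSD-3-Clause-Attribution", "MIT", "Apache-2.0"]
)
print(automaton.search("SPDX-License-Identifier: BSD-3-Clause-Attribution"))
print(automaton.search("Licensed under BSD-3-Clause or MIT"))
```

Remembering a pending match at a final state that still has successors, and emitting it only when the next token cannot extend it, is what keeps BSD-3-Clause from being reported when the text actually declares BSD-3-Clause-Attribution.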
As shown in FIG. 5, step S103 in this embodiment includes:
S301, for the user code files to be identified, traversing the user software directory to obtain the directory tree of the user software, extracting from the user code files the annotation information that contains a license statement, establishing a mapping between the annotation information and the code files, assigning an index value to each piece of extracted annotation information in reading order, and marking the corresponding code file with that index value;
S302, in order of the assigned index values, inputting each piece of extracted annotation information into the AC automaton as a character string, and determining from the output of the AC automaton which open source licenses the corresponding code file contains: if the identifier of an open source license is output, the code file contains the open source license corresponding to that identifier. In this way the identification results for all open source licenses of the set in all user code files are obtained.
In steps S301 to S302, the matching range is narrowed by parsing the user code files and extracting the comment text that contains license statement information, and whether a code file contains an open source license is judged by whether the AC automaton recognizes the identifier of that open source license. In these steps, the license is identified by using the comment information containing the license statement in the user code file as the input of the AC automaton. Because the license statement in a code file is located at the beginning of the file content (generally in the top lines), the identification efficiency is improved by reading only the first 5 lines of comment information in the user code file and using them as the text basis for identifying which open source license is used. The identification process of the AC automaton is executed through a distributed framework, so that the licenses in large-scale software can be identified quickly.
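Step S301 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code of this embodiment: the function name extract_license_comments, the list of comment prefixes and the keyword test for "license" are hypothetical, and the sketch reads the first 5 lines of each file and keeps those that look like comments.

```python
import os

# Assumed comment markers; a real tool would choose them per programming language.
COMMENT_PREFIXES = ("#", "//", "/*", "*")


def extract_license_comments(root, max_lines=5):
    """Walk the directory tree and index the leading comments that mention a license.

    Returns two dicts keyed by the same index value (assigned in reading order):
    index -> file path, and index -> extracted comment text.
    """
    index_to_file, index_to_comment = {}, {}
    index = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the reading order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    head = [next(fh, "") for _ in range(max_lines)]
            except OSError:
                continue  # unreadable file: skip it
            comments = [line.strip() for line in head
                        if line.lstrip().startswith(COMMENT_PREFIXES)]
            text = " ".join(comments)
            if "license" in text.lower():
                index_to_file[index] = path
                index_to_comment[index] = text
                index += 1
    return index_to_file, index_to_comment
```

The two dictionaries share the index value, so a result produced for index_to_comment[i] in step S302 can be attributed to the code file index_to_file[i].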
In this embodiment, the AC automaton constructed in step S102 is implemented in the Python language, and before the extracted annotation information is input into the AC automaton as a character string in step S302, the method further includes constructing an AC automaton service cluster by using Ray, a distributed computing framework for Python. After the AC automaton service cluster receives the input character strings, the main node of the cluster uses a preset scheduling strategy to assign one or more child nodes to execute the AC automaton program and identify the character strings, and all identification results are returned through the main node. Constructing the distributed computing cluster with the high-performance distributed computing framework Ray optimizes the AC automaton program for distributed execution and further improves the identification efficiency.
In this embodiment, constructing an AC automaton service cluster using a Python distributed computing framework Ray includes:
S401, selecting in advance one node from a plurality of physical machines or virtual machines as the main node of the AC automaton service cluster, and activating the Python runtime environment containing the distributed computing framework Ray on the main node and the other nodes; for example, in this embodiment one computer is selected as the main node, and after the Python environment containing Ray has been activated, the main node is started with the command "ray start --head";
S402, using the distributed computing framework Ray to specify the IP address of the main node on each of the other nodes, following the output printed when the main node activates the AC automaton service cluster, so as to add the other nodes to the AC automaton service cluster as child nodes;
after a child node activates the Python environment containing Ray, it joins the AC automaton service cluster by running the command that the system outputs when the main node is activated: "ray start --address='main node IP' --redis-password=''";
S403, annotating the AC automaton program with a decorator, and serializing the program code of the AC automaton into the redis database of the distributed computing framework Ray, where it is stored as an object for each child node to call and to perform data exchange. Specifically, in this example the decorator @ray.remote is used to annotate the recognition program of the AC automaton; the code is serialized into the redis database and stored as an object, which enables asynchronous execution and data exchange (a task is preferably completed on the local node; if it cannot be completed there, the global scheduler of the AC automaton service cluster assigns it to other nodes).
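How a recognition function might be dispatched through Ray as in step S403 can be sketched as follows. The sketch rests on assumptions: identify is a hypothetical substring-based stand-in for the AC automaton program, identify_all and its parameters are illustrative names, and Ray is imported only when use_ray=True, so the local path runs without the third-party package. Calling ray.remote(identify) has the same effect as decorating identify with @ray.remote.

```python
def identify(comment, identifiers):
    """Hypothetical stand-in for the AC automaton program: longest identifier first."""
    found = []
    for ident in sorted(identifiers, key=len, reverse=True):
        # Skip an identifier that is only a part of a longer one already reported.
        if ident in comment and not any(ident in longer for longer in found):
            found.append(ident)
    return found


def identify_all(comments, identifiers, use_ray=False, address=None):
    """Identify licenses in every comment, optionally as Ray tasks.

    address is passed to ray.init(); "auto" connects to an already running
    cluster, None starts a local Ray instance.
    """
    if not use_ray:
        return [identify(c, identifiers) for c in comments]

    import ray  # third-party package, needed only for the distributed path

    if not ray.is_initialized():
        ray.init(address=address)
    remote_identify = ray.remote(identify)  # equivalent to the @ray.remote decorator
    ids_ref = ray.put(identifiers)          # store the identifier list once in the object store
    futures = [remote_identify.remote(c, ids_ref) for c in comments]
    return ray.get(futures)                 # results come back in submission order


ids = ["BSD-3-Clause", "BSD-3-Clause-Attribution", "MIT"]
print(identify_all(["// SPDX-License-Identifier: MIT"], ids))
```

Each call to remote_identify.remote returns immediately with a future, so the comments of many code files are processed asynchronously and the scheduler decides on which node each task runs.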
In summary, open source license management is a key technology in software component analysis and is of great significance for avoiding open source license conflicts and compliance risks in software. Open source license identification is the precondition for open source license management, and as software grows in scale, the identification efficiency of current open source license management tools needs to be improved. In view of these problems, the present embodiment provides a method for quickly identifying open source licenses for large-scale software, in which an open source license set is constructed according to the identifiers (Identifier) of the licenses in the SPDX open source license list, and licenses are identified by an improved AC automaton based on these identifiers, so that the time complexity of the matching process is reduced. In addition, Ray, the high-performance distributed computing framework for Python, is used to further improve performance, so that the method can efficiently handle the open source license identification task for large-scale software.
In addition, the embodiment also provides a quick open source license identification system for large-scale software, which comprises:
an open source license information extraction program unit for determining an open source license set to be identified;
an AC automaton construction program unit for constructing an AC automaton for identifying open source licenses by using the identifier of each open source license in the open source license set to be identified;
an open source license identification program unit for using the AC automaton, for the user code files to be identified, to obtain the identification results of all open source licenses in the open source license set to be identified in the user code files.
In addition, the embodiment also provides a large-scale software-oriented open source license rapid identification system, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the large-scale software-oriented open source license rapid identification method.
In addition, the present embodiment also provides a computer-readable storage medium in which a computer program is stored, the computer program being programmed or configured to be executed by a microprocessor to perform the open source license rapid identification method for large-scale software.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A method for rapidly identifying open source licenses for large-scale software, characterized by comprising the following steps:
S101, determining an open source license set to be identified;
S102, constructing an AC automaton for identifying open source licenses by using the identifier of each open source license in the open source license set to be identified;
S103, for the user code files to be identified, using the AC automaton to obtain the identification results of all open source licenses in the open source license set to be identified in the user code files.
2. The method for rapidly identifying open source licenses for large-scale software according to claim 1, wherein determining the open source license set to be identified in step S101 means obtaining the open source license set to be identified by using web page capturing technology to capture, online, the information of the open source licenses to be identified from a web page containing the open source licenses to be identified.
3. The method for rapidly identifying open source licenses for large-scale software according to claim 1, wherein step S101 comprises:
S201, acquiring the input web address of the web page of the open source licenses to be identified;
S202, reading the table data that records the open source license information from the input web address of the web page of the open source licenses to be identified, and converting the table data into a data frame format;
S203, extracting the full name and the identifier of each open source license from the data frame, and acquiring the license text content of each open source license from the license text address pointed to by its hyperlink;
S204, constructing the open source license set to be identified by taking the full name, the identifier and the license text content of each open source license as the attributes of that open source license.
4. The method for rapidly identifying open source licenses for large-scale software according to claim 3, wherein the web address of the web page of the open source licenses to be identified that is input in step S201 is the license list web address of the Software Package Data Exchange (SPDX) website: https://spdx.org/license/.
5. The method for rapidly identifying open source licenses for large-scale software according to claim 3, wherein, when the table data that records the open source license information is read from the input web address of the web page of the open source licenses to be identified in step S202, the table data that is read includes at least one of table data recording open source licenses in normal use and table data recording open source licenses that have been deprecated.
6. The method for rapidly identifying open source licenses for large-scale software according to any one of claims 1 to 5, wherein step S102 includes: for the open source license set to be identified, splitting the identifier of each open source license in the set at the separator "-" into independent character strings that do not contain the separator, using the independent character strings as the nodes of the word search tree (Trie) of the AC automaton M, and constructing the Trie of the AC automaton M from the identifiers of the open source licenses in the set, so as to obtain the AC automaton for identifying open source licenses.
7. The method for rapidly identifying open source licenses for large-scale software according to claim 6, wherein step S103 includes:
S301, for the user code files to be identified, traversing the user software directory to obtain the directory tree of the user software, extracting the annotation (comment) information containing a license statement from each user code file, establishing a mapping relation between the annotation information and the code file, marking each piece of extracted annotation information with an index value according to the reading order, and marking the corresponding code file with the same index value;
S302, according to the marked index values, inputting the extracted annotation information into the AC automaton as character strings, and judging the open source license contained in the corresponding code file according to the output of the AC automaton: if the identifier of an open source license is output, the code file contains the open source license corresponding to that identifier; the identification results of all open source licenses in the open source license set to be identified are thereby obtained for all user code files.
8. The method for rapidly identifying open source licenses for large-scale software according to claim 7, wherein the AC automaton constructed in step S102 is implemented in the Python language, and the method further comprises, before the extracted annotation information is input into the AC automaton as a character string in step S302, constructing an AC automaton service cluster by using Python's distributed computing framework Ray; after the AC automaton service cluster receives the input character strings, the main node of the AC automaton service cluster uses a preset scheduling strategy to assign one or more child nodes to execute the AC automaton program and identify the character strings, and all identification results are returned through the main node; constructing the AC automaton service cluster by using Python's distributed computing framework Ray comprises the following steps:
S401, selecting in advance one node from a plurality of physical machines or virtual machines as the main node of the AC automaton service cluster, and activating the Python runtime environment containing the distributed computing framework Ray on the main node and the other nodes;
S402, using the distributed computing framework Ray to specify the IP address of the main node on each of the other nodes, following the output printed when the main node activates the AC automaton service cluster, so as to add the other nodes to the AC automaton service cluster as child nodes;
S403, annotating the AC automaton program with a decorator, and serializing the program code of the AC automaton into the redis database of the distributed computing framework Ray, where it is stored as an object for each child node to call and to perform data exchange.
9. A large-scale software-oriented open source license rapid identification system comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the large-scale software-oriented open source license rapid identification method of any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored therein, wherein the computer program is programmed or configured to be executed by a microprocessor to perform the large-scale software-oriented open source license rapid identification method of any one of claims 1 to 8.
CN202310223364.5A 2023-03-09 2023-03-09 Rapid open source license identification method, system and medium for large-scale software Pending CN116257819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310223364.5A CN116257819A (en) 2023-03-09 2023-03-09 Rapid open source license identification method, system and medium for large-scale software

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310223364.5A CN116257819A (en) 2023-03-09 2023-03-09 Rapid open source license identification method, system and medium for large-scale software

Publications (1)

Publication Number Publication Date
CN116257819A true CN116257819A (en) 2023-06-13

Family

ID=86686062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310223364.5A Pending CN116257819A (en) 2023-03-09 2023-03-09 Rapid open source license identification method, system and medium for large-scale software

Country Status (1)

Country Link
CN (1) CN116257819A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination