CN111143203A

CN111143203A - Machine learning method, privacy code determination method, device and electronic equipment

Info

Publication number: CN111143203A
Application number: CN201911305402.1A
Authority: CN
Inventors: 林博
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2020-05-12
Anticipated expiration: 2039-12-13
Also published as: CN111143203B

Abstract

The embodiment of the specification discloses a machine learning and privacy code determining method, a device and electronic equipment, wherein the machine learning method can acquire sample data containing positive and negative samples in batches, the positive sample data contains a privacy code file, and the negative sample data does not contain the privacy code file; screening a plurality of first code files from the sample data based on similarity measurement parameters of the code files, and screening a plurality of second code files from a preset code library of which privacy tags of the code files are known; determining target parameters corresponding to the sample data based on the plurality of first code files and the plurality of second code files; and taking the target parameters corresponding to the sample data and the label of the sample data as input, and training a target model, wherein the target model is used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files.

Description

Machine learning method, privacy code determination method, device and electronic equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for machine learning and privacy code determination, and an electronic device.

Background

Code is an important asset of companies that develop computer program products and needs to be strongly protected, especially the privacy code of a company's core systems. However, a development company holds code of many kinds and in large volume, much of which may be non-private, such as open source code and general test code. Therefore, to protect the code, it is first necessary to identify which code is privacy code and then protect it in a targeted manner.

At present, a preliminary screening is performed based on code file names, and the relevant contents of the code are then checked manually to determine the privacy code to be protected. This approach is clearly inefficient.

Disclosure of Invention

The embodiment of the specification provides a method and a device for machine learning and privacy code determination and electronic equipment, and aims to solve the problem that the mode for determining the privacy code in the related art is low in efficiency.

In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:

In a first aspect, a machine learning method for determining a privacy code is presented, comprising:

acquiring batch sample data, wherein the batch sample data comprises positive sample data and negative sample data, the positive sample data comprises a privacy code file, and the negative sample data does not comprise the privacy code file;

screening a plurality of first code files from the sample data based on similarity measurement parameters of the code files, and screening a plurality of second code files from a preset code library with known privacy tags of the code files, wherein one first code file corresponds to one second code file, and the similarity degree between the first code file and the corresponding second code file meets a preset condition;

determining target parameters corresponding to the sample data based on the plurality of first code files and the plurality of second code files, wherein the target parameters comprise at least one of structural similarity parameters and word vector parameters, the structural similarity parameters represent the similarity between a file structure tree formed by the plurality of first code files and a file structure tree formed by the plurality of second code files, the word vector parameters comprise at least one first word vector and at least one second word vector, the first word vector is a word vector of a keyword in a path of the first code file, and the second word vector is a word vector of a keyword in a path of the second code file;

and taking the target parameters corresponding to the sample data and the label of the sample data as input, and training a target model, wherein the target model is used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files.

In a second aspect, a method for determining a privacy code based on machine learning is provided, including:

acquiring a batch of code files to be detected;

screening a plurality of third code files from the batch of code files to be detected based on similarity measurement parameters of the code files, and screening a plurality of fourth code files from a preset code library with known privacy tags of the code files, wherein one third code file corresponds to one fourth code file, and the similarity degree between the third code file and the corresponding fourth code file meets the preset condition;

determining target parameters corresponding to the batch of code files to be detected based on the plurality of third code files and the plurality of fourth code files;

inputting target parameters corresponding to the batch of code files to be detected into a target model, and determining similarity measurement parameters of the batch of code files to be detected and the privacy code files, wherein the target model is obtained by the machine learning method in the first aspect;

and determining whether the privacy code files exist in the batch of code files to be detected based on the similarity measurement parameters of the batch of code files to be detected and the privacy code files.

In a third aspect, a machine learning apparatus for determining a privacy code is presented, comprising:

a first acquisition module, configured to acquire batch sample data, wherein the batch sample data comprises positive sample data and negative sample data, the positive sample data comprises a privacy code file, and the negative sample data does not comprise the privacy code file;

the first screening module is used for screening a plurality of first code files from the sample data based on similarity measurement parameters of the code files and screening a plurality of second code files from a preset code library with known privacy tags of the code files, wherein one first code file corresponds to one second code file, and the similarity degree between the first code file and the corresponding second code file meets a preset condition;

a first determining module, configured to determine, based on the plurality of first code files and the plurality of second code files, a target parameter corresponding to the sample data, where the target parameter includes at least one of a structural similarity parameter and a word vector parameter, where the structural similarity parameter indicates similarity between a file structure tree formed by the plurality of first code files and a file structure tree formed by the plurality of second code files, the word vector parameter includes at least one first word vector and at least one second word vector, the first word vector is a word vector of a keyword in a path of the first code file, and the second word vector is a word vector of a keyword in a path of the second code file;

and the training module is used for taking the target parameters corresponding to the sample data and the labels of the sample data as input and training a target model, wherein the target model is used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files.

In a fourth aspect, a privacy code determination apparatus based on machine learning is provided, including:

the acquisition module is used for acquiring a batch of code files to be detected;

the screening module is used for screening a plurality of third code files from the batch of code files to be detected based on similarity measurement parameters of the code files, and screening a plurality of fourth code files from a preset code library with known privacy tags of the code files, wherein one third code file corresponds to one fourth code file, and the similarity degree between the third code file and the corresponding fourth code file meets the preset condition;

the first parameter determining module is used for determining target parameters corresponding to the batch of code files to be detected based on the plurality of third code files and the plurality of fourth code files;

a second parameter determining module, configured to input target parameters corresponding to the batch of code files to be detected into a target model, and determine similarity measurement parameters between the batch of code files to be detected and the privacy code files, where the target model is obtained by the machine learning method in the first aspect;

and the privacy code determination module is used for determining whether the privacy code files exist in the batch of code files to be detected based on the similarity measurement parameters of the batch of code files to be detected and the privacy code files.

In a fifth aspect, an electronic device is provided, including:

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to:

In a sixth aspect, a computer-readable storage medium is presented, storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to:

In a seventh aspect, an electronic device is provided, including:

a processor; and

acquiring a batch of code files to be detected;

In an eighth aspect, a computer-readable storage medium is provided that stores one or more programs, which when executed by an electronic device that includes a plurality of application programs, cause the electronic device to perform operations comprising:

acquiring a batch of code files to be detected;

As can be seen from the technical solutions provided in the embodiments of the present specification, the solutions have at least one of the following technical effects. First, because each code file is treated as ordinary text, code files with similar content and known privacy tags are selected by comparing the similarity measurement parameters of the code files, which is fast. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the similarity between a batch of code files and the privacy code files is determined from at least one of the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

Fig. 1 is a first flowchart of a machine learning method for determining a privacy code provided by an embodiment of the present specification.

Fig. 2 is a second flowchart of a machine learning method for determining a privacy code according to an embodiment of the present disclosure.

Fig. 3 is a third flowchart of a machine learning method for determining a privacy code according to an embodiment of the present disclosure.

Fig. 4 is a flowchart of a privacy code determination method based on machine learning, provided by an embodiment of the present specification.

Fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.

Fig. 6 is a schematic structural diagram of another electronic device provided in an embodiment of this specification.

Fig. 7 is a first schematic structural diagram of a machine learning apparatus for determining a privacy code according to an embodiment of the present specification.

Fig. 8 is a second schematic structural diagram of a machine learning apparatus for determining a privacy code according to an embodiment of the present specification.

Fig. 9 is a schematic structural diagram of a privacy code determination apparatus based on machine learning according to an embodiment of the present specification.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In order to solve the problem that the manner of determining the privacy code in the related art is inefficient, embodiments of the present specification provide a machine learning method and apparatus for determining the privacy code, and a privacy code determination method and apparatus based on machine learning. The method and the apparatus provided by the embodiments of the present disclosure may be executed by an electronic device, such as a terminal device or a server device. In other words, the method may be performed by software or hardware installed in the terminal device or the server device. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The terminal devices include but are not limited to: any one of smart terminal devices such as a smart phone, a Personal Computer (PC), a notebook computer, a tablet computer, an electronic reader, a web tv, and a wearable device.

In the embodiment of the present specification, the machine learning method for determining the privacy code may be regarded as a training process of the target model, and the privacy code determination method based on machine learning may be regarded as a process of identifying whether the privacy code is included in a batch of code files to be detected by applying the target model, which will be described below separately.

First, a machine learning method for determining a privacy code provided in an embodiment of the present specification is described.

Fig. 1 is a flowchart illustrating an implementation of a machine learning method for determining a privacy code according to an embodiment of the present specification. As shown in fig. 1, the method may include the following steps.

Step 102, obtaining batch sample data, wherein the batch sample data comprises positive sample data and negative sample data, the positive sample data comprises a privacy code file, and the negative sample data does not comprise the privacy code file.

Specifically, one sample data may include a batch of code files, the positive sample data includes a batch of code files in which privacy code files exist, the negative sample data includes a batch of code files in which privacy code files do not exist, and both the positive sample data and the negative sample data have corresponding tags.

The privacy code file refers to a code file containing a privacy code. A non-private code file refers to a code file that does not contain a private code. The private code may be code that the code owner or developer is unwilling or inconvenient to disclose to the outside, for example, for a computer program product development company, the code of its core system carries core services, may belong to the private code, and is inconvenient to disclose to the outside.

It is understood that, in the embodiments of the present specification, one sample data includes a batch of code files rather than a single code file because the owner of the code files (a code development company) often packages a batch of code files into a compressed archive to send out (for example, to another company), and this scenario often carries a risk of disclosing privacy code.

In a specific implementation, a sample database containing positive sample data and negative sample data may be set up. Before batch sample data is acquired from the sample database, the sample database may be cleaned to delete redundant duplicate code and to fill in missing code.

Step 104, screening a plurality of first code files from the sample data based on the similarity measurement parameters of the code files, and screening a plurality of second code files from a preset code library with known privacy tags of the code files, wherein one first code file corresponds to one second code file, and the similarity degree between the first code file and the corresponding second code file meets a preset condition.

The privacy of each code file in the preset code library is known, and each file has a corresponding privacy tag: when a code file in the preset code library is a privacy file, its privacy tag is consistent with the label of the positive sample data, and when it is a non-privacy file, its privacy tag is consistent with the label of the negative sample data. The first code files screened from the sample data correspond to the second code files screened from the preset code library; that is, the first code files and the second code files screened in step 104 are in one-to-one correspondence, and the similarity degree between each first code file and its corresponding second code file satisfies the preset condition.

As an example, in step 104, for each code file in one sample data, the most similar code file to the code file may be found from the preset code library by comparing the similarity measurement parameters, and when the similarity between the two is greater than or equal to the preset threshold, the code file in the sample data may be used as the first code file, and the code file in the preset code library most similar to the code file may be used as the corresponding second code file. That is, the preset conditions include: the second code file corresponding to the first code file is the code file which is most similar to the first code file in the preset code library, and the similarity degree is greater than or equal to a preset threshold value. Of course, in other examples, the preset conditions may be different, for example, the preset conditions may include: the second code file corresponding to the first code file is a code file in the preset code library that is most similar to the first code file, and this specification does not limit this.
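The example preset condition above (most similar library file, with similarity at or above a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the similarity measurement parameter is taken to be a 64-bit integer fingerprint, similarity is taken to be the fraction of matching bits, and the names `screen_pairs`, `similarity`, and the threshold value 0.9 are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Tuple

BITS = 64  # assumed fingerprint width


def similarity(a: int, b: int, bits: int = BITS) -> float:
    """Fraction of matching bits between two fingerprints (1.0 means identical)."""
    return 1.0 - bin(a ^ b).count("1") / bits


def screen_pairs(
    sample: Dict[str, int],
    library: Dict[str, int],
    threshold: float = 0.9,  # hypothetical preset threshold
) -> List[Tuple[str, str, float]]:
    """Pair each sample file with its most similar library file.

    A pair is kept only when the best similarity reaches the threshold.
    Each kept pair is (first code file, second code file, similarity).
    """
    pairs = []
    for sample_path, sample_fp in sample.items():
        best_path, best_score = None, -1.0
        for lib_path, lib_fp in library.items():
            score = similarity(sample_fp, lib_fp)
            if score > best_score:
                best_path, best_score = lib_path, score
        if best_path is not None and best_score >= threshold:
            pairs.append((sample_path, best_path, best_score))
    return pairs
```

The exhaustive inner loop is chosen for readability; a large code library would likely need an index over the fingerprints, which the specification does not describe.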

The preset code library may contain both privacy code files and non-privacy code files whose privacy tags are known. Thus, in step 104, when the sample data is a positive sample, the code files in the sample data may be compared with the privacy code files in the preset code library to screen out a plurality of first code files and a plurality of corresponding second code files; when the sample data is a negative sample, the code files in the sample data may be compared with the non-privacy code files in the preset code library to screen out a plurality of first code files and a plurality of corresponding second code files. Alternatively, the preset code library may include two sub-libraries, one of which (referred to as a first sub-library) stores the privacy code files and the other of which (referred to as a second sub-library) stores the non-privacy code files. In this case, in step 104, when the sample data is a positive sample, the code files in the sample data may be compared with the privacy code files in the first sub-library to screen out a plurality of first code files and a plurality of corresponding second code files; when the sample data is a negative sample, the code files in the sample data may be compared with the non-privacy code files in the second sub-library to screen out a plurality of first code files and a plurality of corresponding second code files.

Before step 104, the machine learning method provided by the embodiments of the present specification may further include: determining the similarity measurement parameters of the code files in the sample data and of the code files in the preset code library. Specifically, the similarity measurement parameter of each code file in the sample data and the similarity measurement parameter of each code file in the preset code library may be determined and stored. The similarity measurement parameter of a code file may be a similarity hash (simhash) value of the code file, or may be another parameter capable of measuring the similarity of code files.
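As a rough illustration of a simhash-style similarity measurement parameter, the sketch below treats a code file as ordinary text, hashes each token, and combines the per-bit votes into one fingerprint. The tokenization by `\w+`, the 64-bit width, the use of SHA-256 as the token hash, and the function names are all assumptions made for this example; the specification does not fix any of them.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # assumed fingerprint width


def simhash(text: str, bits: int = BITS) -> int:
    """Return a simhash fingerprint of text treated as ordinary text."""
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +count or -count on every bit position.
            if (token_hash >> i) & 1:
                weights[i] += count
            else:
                weights[i] -= count
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int, bits: int = BITS) -> float:
    """Fraction of matching bits between two fingerprints (1.0 means identical)."""
    return 1.0 - bin(a ^ b).count("1") / bits
```

Because the fingerprint is computed from tokens only, no language-specific parser is needed, which matches the idea of handling every code file as plain text.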

Specifically, step 104 and the following step 106 may be performed on each sample data in the batch of sample data acquired in step 102 to determine the target parameter corresponding to each sample data, so that in the following step 108, the target parameter corresponding to each sample data in the batch of sample data and the label of the sample data may be used as input to train the target model.

It will be appreciated that a batch of code files in one sample data is compared with the code files in the preset code library because code file owners (code development companies) often need to extract part of the code (such as a directory) from an overall code library and detect whether it belongs to privacy code. It is therefore necessary to compare the local code with the entire code library, specifically by comparing at least one of the similarity of the code, the structural similarity of the code files (the edit distance of subtrees), and the semantic similarity of the keywords in the paths of the code files. The code similarity comparison is implemented in step 104, and the structural similarity comparison and the keyword semantic similarity comparison are implemented in step 106.

Step 106, determining target parameters corresponding to the sample data based on the plurality of first code files and the plurality of second code files.

The target parameter may include at least one of a structural similarity parameter and a word vector parameter. Wherein the structural similarity parameter represents a similarity of a file structure tree formed by the plurality of first code files and a file structure tree formed by the plurality of second code files. The word vector parameters comprise at least one first word vector and at least one second word vector, the first word vector is a word vector of a keyword in a path of the first code file, and the second word vector is a word vector of a keyword in a path of the second code file.

The determination of the target parameter is described in the following by several embodiments.

First embodiment

The target parameters include structural similarity parameters, and step 106 may include: constructing a first file structure tree based on the paths of the plurality of first code files; constructing a second file structure tree based on the paths of the plurality of second code files; and determining a structural similarity parameter corresponding to the sample data based on the first file structure tree and the second file structure tree.
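One possible way to construct a file structure tree from code file paths is sketched below. The nested-dictionary representation, the `/` separator, and the name `build_tree` are assumptions chosen for brevity; the specification does not prescribe a data structure.

```python
from typing import Dict, Iterable

# A tree maps each directory or file name to the subtree beneath it.
Tree = Dict[str, "Tree"]


def build_tree(paths: Iterable[str]) -> Tree:
    """Merge '/'-separated paths into one nested dictionary tree."""
    root: Tree = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            if part:
                # Reuse the existing child if this path segment was seen before.
                node = node.setdefault(part, {})
    return root
```

Calling `build_tree` once on the paths of the first code files and once on the paths of the second code files would yield the two trees to be compared.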

Optionally, after the first file structure tree and the second file structure tree are constructed and before the structural similarity parameter of the two trees is determined, step 106 may further include: cutting off the isolated nodes in the first file structure tree to obtain a first subtree formed by the remaining nodes; and cutting off the isolated nodes in the second file structure tree to obtain a second subtree formed by the remaining nodes. An isolated node is a node that has neither a leaf node nor a parent node. On this basis, in step 106, the structural similarity parameter corresponding to the sample data may be determined based on the first subtree and the second subtree.

The structural similarity parameter of two trees can be expressed in terms of the edit distance between the two trees. The edit distance between two trees is the minimum number of operations needed to change one tree into the other, where an operation may be inserting a node, deleting a node, or changing a node. Therefore, the structural similarity parameter corresponding to one sample data may be the edit distance between the first subtree and the second subtree corresponding to the sample data.
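To make the notion of edit distance between two trees concrete, the sketch below implements a restricted top-down variant (in the style of Selkow's algorithm) rather than the general tree edit distance: nodes may be relabeled, and whole subtrees may be inserted or deleted at a cost equal to their node count. The `(label, children)` tuple representation, the assumption that children are kept in a fixed order (for example sorted by name), and the function names are illustrative choices, not details taken from the specification.

```python
from typing import List, Tuple

# A node is (label, ordered list of child nodes).
Node = Tuple[str, List["Node"]]


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node[1])


def tree_edit_distance(a: Node, b: Node) -> int:
    """Top-down edit distance: relabel roots, then align the child lists."""
    cost = 0 if a[0] == b[0] else 1  # changing a node costs 1
    ca, cb = a[1], b[1]
    # dp[i][j]: cost of turning the first i children of a into the first j of b.
    dp = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])  # delete a whole subtree
    for j in range(1, len(cb) + 1):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])  # insert a whole subtree
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(ca[i - 1]),
                dp[i][j - 1] + size(cb[j - 1]),
                dp[i - 1][j - 1] + tree_edit_distance(ca[i - 1], cb[j - 1]),
            )
    return cost + dp[-1][-1]
```

A smaller distance indicates more similar directory structures. A general tree edit distance algorithm may find cheaper edit scripts than this restricted variant, so the value here is an upper bound on the general distance.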

It should be understood that, for one file structure tree, after deleting the isolated node therein, the number of the remaining subtrees may be one or more, and if the number of the remaining subtrees is more than one, the edit distances between all the remaining subtrees of the file structure tree and all the subtrees of the other file structure tree are calculated when determining the edit distances between the file structure tree and the other file structure tree.

Second embodiment

The target parameters include word vector parameters, and step 106 may include: extracting key words in paths of the plurality of first code files screened from the sample data; determining word vectors of keywords in paths of the plurality of first code files to obtain at least one first word vector; extracting keywords in paths of the plurality of second code files screened from a preset code base; determining word vectors of the keywords in the paths of the plurality of second code files to obtain at least one second word vector.

For example, for a project, the path of a code file may be: sofa-hessian/src/main/java/com/alipay/hessian/classnamefilter. In this case, sofa-hessian and hessian can be extracted as keywords. These two keywords do not appear frequently in the code files of other projects and therefore characterize this project well.

In the second embodiment, the keywords in the paths of the plurality of first code files and the plurality of second code files may be extracted based on the term frequency-inverse document frequency (TF-IDF) algorithm. The word vectors (embeddings) of the extracted keywords may be determined based on a known word vector conversion model or on a new model that appears in the future. Known word vector conversion models include, but are not limited to, any of word2vec, BERT, and Graph2vec, and word2vec may include either of the continuous bag-of-words (CBOW) model and the skip-gram model.
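The keyword extraction can be illustrated with a small TF-IDF sketch over path segments. It assumes that each project is treated as one document whose terms are the segments of its file paths; the function names, the `top_k` parameter, and every path in the usage example other than the sofa-hessian path quoted above are hypothetical. Converting the resulting keywords into word vectors would require a trained model such as word2vec and is not shown.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def path_tokens(paths: List[str]) -> List[str]:
    """Split paths on '/' or '\\' and return the lower-cased segments."""
    tokens: List[str] = []
    for path in paths:
        tokens.extend(p.lower() for p in re.split(r"[/\\]", path) if p)
    return tokens


def tfidf_keywords(projects: Dict[str, List[str]], target: str, top_k: int = 2) -> List[str]:
    """Rank the path segments of one project by TF-IDF against all projects."""
    docs = {name: Counter(path_tokens(paths)) for name, paths in projects.items()}
    counts = docs[target]
    total = sum(counts.values())
    scores = {}
    for token, count in counts.items():
        doc_freq = sum(1 for doc in docs.values() if token in doc)
        scores[token] = (count / total) * math.log(len(docs) / doc_freq)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Usage with hypothetical projects; only the first path comes from the text above.
projects = {
    "sofa-hessian": [
        "sofa-hessian/src/main/java/com/alipay/hessian/classnamefilter",
        "sofa-hessian/src/main/java/com/alipay/hessian/io",
    ],
    "demo-a": ["demo-a/src/main/java/com/alipay/util"],
    "demo-b": ["demo-b/src/main/java/com/example/io"],
}
keywords = tfidf_keywords(projects, "sofa-hessian")
```

Segments such as src, main, and java occur in every project, so their inverse document frequency is zero and they are ranked below project-specific segments.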

Third embodiment

The target parameters include a structural similarity parameter and a word vector parameter, and it is understood that this embodiment is a combination of the first embodiment and the second embodiment, and specific contents may refer to the two embodiments, which will not be described repeatedly herein.

Step 108, taking the target parameters corresponding to the sample data and the label of the sample data as input, and training a target model.

The target model is used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files. In one example, the greater the similarity measurement parameter, the more similar the batch of code files to be detected to the private code files, and vice versa.

In step 108, the target model may be trained by using the target parameter corresponding to each sample data in the batch of sample data and the label of the sample data as input. The target model may specifically be a model such as a random forest or XGBoost.

It can be understood that when the target parameters include structural similarity parameters, the trained target model can be used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files from the aspect of structural similarity; when the target parameters comprise word vector parameters, the trained target model can be used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files from the semantic aspect of keywords in the path of the code files; when the target parameters include both the structural similarity parameters and the word vector parameters, the trained target model can be used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files from two aspects of structural similarity and semantic similarity of keywords in paths of the code files.

The machine learning method provided by the embodiments of the present specification has the following effects. First, each code file is processed as ordinary text, and code files with similar content and known privacy tags are selected by comparing the similarity measurement parameters of the code files. This is fast and scales well, and it avoids the parsing failures that may occur in the related art when a parser is constructed for each type of code file. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the similarity between a batch of code files and the privacy code files is determined from at least one of the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

In addition, the code files of some open source components generally have specific directory structures, and the present machine learning method can train a target model that identifies such code files in a targeted manner, which reduces misidentification of open source code and improves the accuracy and effectiveness of privacy code detection.

Fig. 2 illustrates, by way of a more detailed embodiment, a machine learning method for determining a privacy code provided herein. In this embodiment, the target parameters include a structural similarity parameter and a word vector parameter. As shown in fig. 2, the method may include the steps of:

Step 211, cleaning the preset code library.

The privacy of the code files in the preset code library is known.

Step 212, determining similarity measurement parameters of the code files in the preset code library.

Step 213, storing the similarity measurement parameters and paths of the code files in the preset code library.

Step 221, cleaning the sample database to obtain a batch of sample data. Then, step 222, step 223, step 224, step 225, step 226, step 214, step 215, step 216, step 217, step 231, and step 232 are performed once for each sample data.

Step 222, determining similarity measurement parameters of the code files in the sample data.

Step 223, comparing the similarity measurement parameters of the code files in the sample data and the code files in the preset code library to screen out a plurality of first code files from the sample data.

Step 214, comparing the similarity measurement parameters of the code files in the sample data and the code files in the preset code library to screen out a plurality of corresponding second code files from the preset code library.

In step 223 and step 214, for one sample data, each first code file corresponds to one second code file in the preset code library, and the similarity degree between the first code file and the corresponding second code file satisfies the preset condition.

Step 215, extracting keywords in the paths of the plurality of second code files corresponding to the sample data.

Step 216, determining word vectors of the keywords in the paths of the second code files corresponding to the sample data.

Step 217, constructing a second file structure tree according to the paths of the plurality of second code files corresponding to the sample data.

Step 224, constructing a first file structure tree according to the paths of the plurality of first code files corresponding to the sample data.

Step 231, cutting off the isolated nodes in the first file structure tree to obtain a first subtree; and cutting off the isolated nodes in the second file structure tree to obtain a second subtree.

Step 232, determining the structural similarity parameters of the first subtree and the second subtree.

Step 225, extracting keywords in the paths of the plurality of first code files corresponding to the sample data.

Step 226, determining word vectors of the keywords in the multiple first code file paths corresponding to the sample data.

Finally, the word vectors of the keywords in the multiple second code file paths corresponding to the sample data obtained in step 216, the structural similarity parameters corresponding to the sample data obtained in step 232, the word vectors of the keywords in the multiple first code file paths corresponding to the sample data obtained in step 226, and the labels of the sample data are used as input, and the target model 2 is trained.
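The way the quantities from steps 216, 226 and 232 might be combined into one model input can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this specification: the word vectors are mean-pooled, the model is a plain logistic regression trained by gradient descent, and the names `mean_vector`, `build_features`, `train_logistic` and `predict` are all hypothetical. The specification does not fix the model type or the pooling strategy.

```python
import math
from typing import List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length word vectors (assumed pooling)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_features(first_vectors, second_vectors, structural_similarity: float) -> List[float]:
    """Concatenate pooled first/second word vectors and the structural parameter."""
    return mean_vector(first_vectors) + mean_vector(second_vectors) + [structural_similarity]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Score in (0, 1); larger means closer to the positive (privacy) label."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(features, labels, lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Batch gradient descent on the logistic loss (stand-in for the target model)."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

In this sketch each piece of sample data yields one feature vector, and its label (1 for positive sample data, 0 for negative sample data) is the training target.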

The machine learning method provided by the embodiments of this specification first treats a code file as ordinary text and selects code files with similar content and known privacy tags by comparing the similarity measurement parameters of the code files, which is fast and readily extensible. Second, by comparing the structural similarity parameter and the word vector parameter of the code files, the method determines the similarity between a batch of code files and the privacy code files from two angles: the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

Optionally, as shown in fig. 3, after the target model is obtained through training, the machine learning method for determining the privacy code provided by the embodiment of the present specification may further include a step of applying the trained target model to identify whether a batch of code files to be detected contains the privacy code.

Step 110, acquiring a batch of code files to be detected.

The code files to be detected are code files whose privacy status is unknown, for example, a batch of code files that a code development company has packaged and is about to send out.

Step 112, screening a plurality of third code files from the batch of code files to be detected based on the similarity measurement parameters of the code files, and screening a plurality of fourth code files from the preset code library, wherein one third code file corresponds to one fourth code file, and the degree of similarity between the third code file and the corresponding fourth code file satisfies the preset condition.

Specifically, for each code file to be detected in the batch, the code file most similar to it can be found in the preset code library by comparing similarity measurement parameters. When the degree of similarity between the code file to be detected and that most similar code file is greater than or equal to a preset threshold, the code file to be detected can be used as a third code file, and the most similar code file in the preset code library can be used as the corresponding fourth code file.
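The screening described above can be sketched as follows. The sketch assumes that each similarity measurement parameter is a fixed-width integer fingerprint and that similarity is the fraction of matching bits; the 64-bit width, the 0.9 threshold and the names `similarity` and `screen_pairs` are illustrative assumptions, not values given in this specification.

```python
from typing import Dict, List, Tuple

FINGERPRINT_BITS = 64  # assumed fingerprint width


def similarity(a: int, b: int, bits: int = FINGERPRINT_BITS) -> float:
    """Fraction of matching bits between two fingerprints (1.0 means identical)."""
    return 1.0 - bin(a ^ b).count("1") / bits


def screen_pairs(
    candidates: Dict[str, int],
    library: Dict[str, int],
    threshold: float = 0.9,  # assumed preset threshold
) -> List[Tuple[str, str]]:
    """Return (third_file_path, fourth_file_path) pairs that pass the threshold."""
    pairs: List[Tuple[str, str]] = []
    if not library:
        return pairs
    for cand_path, cand_fp in candidates.items():
        # Most similar file in the preset code library for this candidate.
        best_path = max(library, key=lambda p: similarity(cand_fp, library[p]))
        if similarity(cand_fp, library[best_path]) >= threshold:
            pairs.append((cand_path, best_path))
    return pairs
```

A linear scan over the library is used here only for brevity; for a large preset code library an index over the fingerprints would normally replace it.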

The similarity measurement parameter of a code file may be an approximate hash value of the code file, or may be another parameter capable of measuring the similarity of the code file.
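One possible approximate hash is SimHash, sketched below. Choosing SimHash, tokenizing on word characters, and hashing tokens with SHA-256 are assumptions made for illustration; the specification only requires a parameter that can measure the similarity of code files. The names `simhash` and `hamming_distance` are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint of a code file treated as plain text."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; smaller means more similar content."""
    return bin(a ^ b).count("1")
```

Because the file is handled as text, the same function applies to any programming language without a language-specific parser.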

Step 114, determining target parameters corresponding to the batch of code files to be detected based on the plurality of third code files and the plurality of fourth code files.

As described above, the target parameters corresponding to the sample data used in training the target model may include: at least one of a structural similarity parameter and a word vector parameter.

Therefore, when the target parameters corresponding to the sample data adopted in training the target model include the structural similarity parameters, the target parameters corresponding to the batch of code files to be detected include the structural similarity parameters, wherein the structural similarity parameters represent the similarity between the file structure tree formed by the third code files and the file structure tree formed by the fourth code files. When target parameters corresponding to sample data adopted in training of a target model comprise word vector parameters, the target parameters corresponding to the batch of code files to be detected comprise target word vector parameters, wherein the target word vector parameters comprise at least one third word vector and at least one fourth word vector, the third word vector is a word vector of a keyword in a path of the third code file, and the fourth word vector is a word vector of the keyword in the path of the fourth code file. The following describes a process of determining target parameters corresponding to the batch of code files to be detected by several embodiments.

First embodiment

If the target parameters corresponding to the sample data adopted in training the target model include structural similarity parameters, the target parameters corresponding to the batch of code files to be detected include target structural similarity parameters, and correspondingly, step 114 may include: constructing a third file structure tree based on the paths of the plurality of third code files; constructing a fourth file structure tree based on the paths of the plurality of fourth code files; and determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third file structure tree and the fourth file structure tree.
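Constructing a file structure tree from a set of paths can be sketched as follows. Representing the tree as nested dictionaries and the name `build_file_tree` are illustrative assumptions; the specification does not prescribe a data structure.

```python
from typing import Dict, Iterable

Tree = Dict[str, "Tree"]


def build_file_tree(paths: Iterable[str]) -> Tree:
    """Build a nested-dict tree in which each key is one path component."""
    root: Tree = {}
    for path in paths:
        node = root
        # Normalize Windows separators so both path styles are accepted.
        for part in path.replace("\\", "/").strip("/").split("/"):
            if part:
                node = node.setdefault(part, {})
    return root
```

For example, the paths `src/a/x.py`, `src/a/y.py` and `src/b/z.py` produce a tree with one `src` node that has two children, `a` and `b`.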

Optionally, after the third file structure tree and the fourth file structure tree are constructed, before determining the target structure similarity parameters corresponding to the batch of code files to be detected, step 114 may further include: cutting off the isolated nodes in the third file structure tree to obtain a third sub-tree formed by the remaining nodes; and cutting off the isolated nodes in the fourth file structure tree to obtain a fourth sub-tree formed by the remaining nodes. Correspondingly, the target structure similarity parameters corresponding to the batch of code files to be detected are determined based on the third subtree and the fourth subtree.

The structural similarity parameter of two trees can be expressed in terms of the edit distance between the two trees. The edit distance between two trees refers to the minimum number of operations required to transform one tree into the other, where the operations may be at least one of inserting a node, deleting a node, and changing a node. Therefore, the structural similarity parameter corresponding to the batch of code files to be detected may be the edit distance between the third subtree and the fourth subtree corresponding to the batch of code files to be detected.
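A tree edit distance of this kind can be sketched with the standard recursive formulation for ordered trees. The sketch assumes a unit cost for each insert, delete and change operation, and assumes that the order of children matters (children can be sorted by name beforehand to obtain a canonical order). The tuple representation and the names `forest_distance` and `tree_edit_distance` are hypothetical, and the memoized recursion is meant for small trees only.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.


def size(forest) -> int:
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2) -> int:
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every node of f2
    if not f2:
        return size(f1)  # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree in f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, changing the label if it differs.
    change = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else 1)
    )
    return min(delete, insert, change)


def tree_edit_distance(t1, t2) -> int:
    """Minimum number of node insertions, deletions and changes."""
    return forest_distance((t1,), (t2,))
```

A distance of 0 means the two trees are identical; larger values indicate less similar directory structures.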

Second embodiment

If the target parameters corresponding to the sample data used in training the target model include word vector parameters, the target parameters corresponding to the batch of code files to be detected include target word vector parameters, and correspondingly, step 114 may include: extracting keywords in paths of the plurality of third code files; determining word vectors of keywords in paths of the plurality of third code files to obtain at least one third word vector; extracting keywords in paths of the plurality of fourth code files; determining word vectors of the keywords in the paths of the plurality of fourth code files to obtain at least one fourth word vector.

Specifically, keywords in the paths of the plurality of third code files may be extracted based on a TF-IDF algorithm, and keywords in the paths of the plurality of fourth code files may likewise be extracted based on a TF-IDF algorithm. A word vector (embedding) of each extracted keyword may be determined based on a word vector conversion model. The word vector conversion model includes, but is not limited to, any one of word2vec, BERT, and Graph2vec, and word2vec may use either CBOW or skip-gram.
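TF-IDF keyword extraction over file paths can be sketched as follows. The sketch assumes that each path is one document, that path components (without the file extension) are the terms, and that per-path scores are summed across the batch; these choices and the names `tokenize_path` and `tfidf_keywords` are illustrative. Converting the resulting keywords to word vectors would require a trained model such as word2vec and is not shown.

```python
import math
import posixpath
import re
from collections import Counter
from typing import List, Sequence


def tokenize_path(path: str) -> List[str]:
    """Split a path into lowercase terms, dropping the file extension."""
    stem, _ext = posixpath.splitext(path.replace("\\", "/"))
    return [t.lower() for t in re.split(r"[/_\-.\s]+", stem) if t]


def tfidf_keywords(paths: Sequence[str], top_k: int = 3) -> List[str]:
    """Return the top_k terms ranked by summed TF-IDF score."""
    docs = [tokenize_path(p) for p in paths]
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    scores = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n / doc_freq[term])  # 0 for terms present in every path
            scores[term] += tf * idf
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]
```

With this weighting, a component that appears in every path (such as a common `src` root) scores zero, so the selected keywords tend to be the directory and file names that distinguish the batch.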

Third embodiment

If the target parameters corresponding to the sample data adopted in training the target model include both the structural similarity parameters and the word vector parameters, the target parameters corresponding to the batch of code files to be detected include the target structure similarity parameters and the target word vector parameters. It should be understood that the third embodiment is a combination of the first and second embodiments; for the specific manner of determining the target structure similarity parameters and the target word vector parameters, refer to the first embodiment and the second embodiment, which are not repeated here.

Step 116, inputting the target parameters corresponding to the batch of code files to be detected into the target model, and determining similarity measurement parameters of the batch of code files to be detected and the privacy code files.

Step 118, determining whether the privacy code files exist in the batch of code files to be detected based on the similarity measurement parameters of the batch of code files to be detected and the privacy code files.

In one example, the larger the similarity measurement parameter, the more similar the batch of code files to be detected is to privacy code files; conversely, the smaller the parameter, the more similar the batch is to non-privacy code files.

In the machine learning method provided in the embodiments of this specification, after the target model is trained, the target model may further be used to determine whether a batch of code files to be detected contains a privacy code file. First, the method treats a code file to be detected as ordinary text and, by comparing similarity measurement parameters, selects a code file that has similar content and a known privacy tag as the basis for determining the target parameters corresponding to the code file to be detected, which is fast and readily extensible. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the method determines the similarity between a batch of code files to be detected and the privacy code files from at least one of two angles: the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

In addition, the code files of some open source components generally have specific directory structures, and such code files can be identified in a more targeted manner through this machine learning method, which reduces false identification of open source code and improves the accuracy and effectiveness of privacy code detection.

The following describes a privacy code determination method based on machine learning, which may be regarded as an application of the target model obtained through machine learning in the foregoing. As shown in fig. 4, the method may include the steps of:

Step 402, acquiring a batch of code files to be detected.

Step 404, based on the similarity measurement parameters of the code files, a plurality of third code files are screened from the batch of code files to be detected, and a plurality of fourth code files are screened from a preset code library with known privacy tags of the code files, wherein one third code file corresponds to one fourth code file, and the similarity degree between the third code file and the corresponding fourth code file meets the preset condition.

Specifically, for each code file to be detected in the batch, the code file most similar to it can be found in the preset code library by comparing similarity measurement parameters. When the degree of similarity between the code file to be detected and that most similar code file is greater than or equal to a preset threshold, the code file to be detected can be used as a third code file, and the most similar code file in the preset code library can be used as the corresponding fourth code file.

The similarity measurement parameter of a code file may be an approximate hash value of the code file, but may also be other parameters capable of measuring the similarity of the code file.

Step 406, determining target parameters corresponding to the batch of code files to be detected based on the plurality of third code files and the plurality of fourth code files.

First embodiment

If the target parameters corresponding to the sample data adopted in training the target model include structural similarity parameters, the target parameters corresponding to the batch of code files to be detected include target structural similarity parameters, and correspondingly, step 406 may include: constructing a third file structure tree based on the paths of the plurality of third code files; constructing a fourth file structure tree based on the paths of the plurality of fourth code files; and determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third file structure tree and the fourth file structure tree.

Optionally, after the third file structure tree and the fourth file structure tree are constructed, before determining the target structure similarity parameters corresponding to the batch of code files to be detected, step 406 may further include: cutting off the isolated nodes in the third file structure tree to obtain a third sub-tree formed by the remaining nodes; and cutting off the isolated nodes in the fourth file structure tree to obtain a fourth sub-tree formed by the remaining nodes. Correspondingly, the target structure similarity parameters corresponding to the batch of code files to be detected are determined based on the third subtree and the fourth subtree.

Second embodiment

If the target parameters corresponding to the sample data used in training the target model include word vector parameters, the target parameters corresponding to the batch of code files to be detected include target word vector parameters, and correspondingly, step 406 may include: extracting keywords in paths of the plurality of third code files; determining word vectors of keywords in paths of the plurality of third code files to obtain at least one third word vector; extracting keywords in paths of the plurality of fourth code files; determining word vectors of the keywords in the paths of the plurality of fourth code files to obtain at least one fourth word vector.

Third embodiment

Step 408, inputting the target parameters corresponding to the batch of code files to be detected into a target model, and determining similarity measurement parameters of the batch of code files to be detected and the privacy code files.

The target model is obtained by training through a machine learning method in the embodiment of the present specification, and the specific training process refers to the above, which is not described repeatedly herein.

Step 410, determining whether the privacy code files exist in the batch of code files to be detected based on the similarity measurement parameters of the batch of code files to be detected and the privacy code files.

The machine-learning-based privacy code determination method provided by the embodiments of this specification first treats a code file to be detected as ordinary text and, by comparing similarity measurement parameters, selects a code file that has similar content and a known privacy tag as the basis for determining the target parameters corresponding to the code file to be detected, which is fast and readily extensible. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the method determines the similarity between a batch of code files to be detected and the privacy code files from at least one of two angles: the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

In addition, for some open source components, the code files of the open source components generally have specific directory structures, and the code files can be identified in a more targeted manner by the method, so that the false identification of the open source codes is reduced, and the accuracy and the effectiveness of the privacy code detection are improved.

The above is a description of embodiments of the method provided in this specification, and the electronic device provided in this specification is described below.

Fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification. Referring to fig. 5, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required by other services.

The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.

The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the machine-learning-based privacy code determination apparatus at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the following operations:

The machine learning method for determining a privacy code disclosed in the embodiments of fig. 1 to 3 of the present specification may be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps and logic blocks disclosed in one or more embodiments of the present specification. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with one or more embodiments of the present specification may be embodied directly in hardware, in a software module executed by a hardware decoding processor, or in a combination of hardware and software modules executed by a hardware decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

The electronic device may further execute the machine learning method provided in any one of fig. 1 to 3, which is not described herein again.

Fig. 6 is a schematic structural diagram of another electronic device provided in an embodiment of the present specification. The electronic device differs from the electronic device shown in fig. 5 in that the processor executes the program stored in the memory and is specifically configured to perform the following operations:

acquiring a batch of code files to be detected;

inputting target parameters corresponding to the batch of code files to be detected into a target model, and determining similarity measurement parameters of the batch of code files to be detected and privacy code files, wherein the target model is obtained by training through a machine learning method provided by the embodiment of the specification;

Of course, besides the software implementation, the electronic device in this specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or logic devices.

Embodiments of the present specification also propose a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, are capable of causing the portable electronic device to perform the method of the embodiment shown in fig. 1, and in particular to perform the following:

Embodiments of the present specification also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 4, and in particular to perform the following operations:

acquiring a batch of code files to be detected;

The following describes a machine-learning-based privacy code determination apparatus provided in the present specification.

As shown in fig. 7, an embodiment of the present specification provides a device for determining a privacy code based on machine learning, and in one software implementation, the device 700 for determining a privacy code based on machine learning may include: a first obtaining module 701, a first screening module 702, a first determining module 703 and a training module 704.

The first obtaining module 701 is configured to obtain batch sample data, where the batch sample data includes positive sample data and negative sample data, the positive sample data includes a privacy code file, and the negative sample data does not include the privacy code file.

The first screening module 702 is configured to screen a plurality of first code files from the sample data based on the similarity measurement parameters of the code files, and screen a plurality of second code files from a preset code library of which privacy tags of the code files are known.

The privacy tag of the code file in the preset code base is known, when the code file in the preset code base is a privacy file, the privacy tag of the code file is consistent with the tag of the positive sample data, and when the code file in the preset code base is a non-privacy file, the privacy tag of the code file is consistent with the tag of the negative sample data. The first code file selected from the sample data corresponds to the second code file selected from the preset code library, that is, the first code files selected in step 104 correspond to the second code files one by one, and the similarity degree between the first code file and the corresponding second code file satisfies the preset condition.

A first determining module 703 is configured to determine target parameters corresponding to the sample data based on the plurality of first code files and the plurality of second code files.

First embodiment

The target parameters include structural similarity parameters, and the first determining module 703 may be configured to: constructing a first file structure tree based on the paths of the plurality of first code files; constructing a second file structure tree based on the paths of the plurality of second code files; and determining a structural similarity parameter corresponding to the sample data based on the first file structure tree and the second file structure tree.

Optionally, after constructing the first file structure tree and the second file structure tree, before determining the structural similarity parameters of the two file structure trees, the first determining module 703 may be further configured to: cutting off the isolated nodes in the first file structure tree to obtain a first sub-tree formed by the remaining nodes; and cutting off the isolated nodes in the second file structure tree to obtain a second subtree formed by the remaining nodes. On this basis, in the first determining module 703, the structural similarity parameter corresponding to the sample data may be determined based on the first subtree and the second subtree.

Second embodiment

The target parameters include word vector parameters, and the first determining module 703 may be configured to: extracting keywords in paths of the plurality of first code files screened from the sample data; determining word vectors of keywords in paths of the plurality of first code files to obtain at least one first word vector; extracting keywords in paths of the plurality of second code files screened from the preset code library; determining word vectors of the keywords in the paths of the plurality of second code files to obtain at least one second word vector.

Third embodiment

A training module 704, configured to take the target parameter corresponding to the sample data and the label of the sample data as inputs to train a target model; the target model is used for determining similarity measurement parameters of a batch of code files to be detected and privacy code files.

The machine learning apparatus 700 provided in the embodiment shown in fig. 7 first treats a code file as ordinary text and selects code files with similar content and known privacy tags by comparing the similarity measurement parameters of the code files. This approach is fast and readily extensible, and it avoids the parsing failures that may occur in the related art when a parser is built for each type of code file. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the apparatus determines the similarity between a batch of code files and the privacy code files from at least one of two angles: the directory structure and the directory keywords of the code files. In short, the apparatus can greatly improve the efficiency of privacy code detection.

It should be noted that the machine learning apparatus 700 for determining a privacy code can implement the method of the embodiment shown in fig. 1; for details, refer to the privacy code determination method based on machine learning in the embodiment shown in fig. 1, which is not repeated here.

Optionally, as shown in fig. 8, the apparatus 700 shown in fig. 7 may further include: a second obtaining module 705, a second screening module 706, a third determining module 707, a fourth determining module 708, and a privacy code determining module 709.

The second obtaining module 705 is configured to obtain a batch of code files to be detected.

The second screening module 706 is configured to screen a plurality of third code files from the batch of code files to be detected based on the similarity measurement parameter of the code files, and screen a plurality of fourth code files from the preset code library, where one third code file corresponds to one fourth code file, and a similarity degree between the third code file and the corresponding fourth code file satisfies the preset condition.

A third determining module 707, configured to determine, based on the multiple third code files and the multiple fourth code files, target parameters corresponding to the batch of code files to be detected.

First embodiment

The target parameters corresponding to the sample data used in training the target model include structural similarity parameters, and correspondingly, the target parameters corresponding to the batch of code files to be detected include target structural similarity parameters, and the third determining module 707 is specifically configured to: constructing a third file structure tree based on the paths of the plurality of third code files; constructing a fourth file structure tree based on the paths of the plurality of fourth code files; and determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third file structure tree and the fourth file structure tree.

Optionally, after the third file structure tree and the fourth file structure tree are constructed, before determining the target structure similarity parameters corresponding to the batch of code files to be detected, the third determining module 707 may be further configured to: cutting off the isolated nodes in the third file structure tree to obtain a third sub-tree formed by the remaining nodes; and cutting off the isolated nodes in the fourth file structure tree to obtain a fourth sub-tree formed by the remaining nodes. Accordingly, the third determining module 707 is specifically configured to: and determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third subtree and the fourth subtree.

Second embodiment

The target parameters corresponding to the sample data used in training the target model include word vector parameters, and correspondingly, the target parameters corresponding to the batch of code files to be detected include target word vector parameters, and the third determining module 707 is specifically configured to: extracting keywords in paths of the plurality of third code files; determining word vectors of keywords in paths of the plurality of third code files to obtain at least one third word vector; extracting keywords in paths of the plurality of fourth code files; determining word vectors of the keywords in the paths of the plurality of fourth code files to obtain at least one fourth word vector.

Third embodiment

The target parameters corresponding to the sample data adopted in the training of the target model comprise structure similarity parameters and word vector parameters, and correspondingly, the target parameters corresponding to the batch of code files to be detected comprise target structure similarity parameters and target word vector parameters. It should be understood that the third embodiment is a combination of the first and second embodiments, and the specific determination manner of the target structure similarity parameter and the target word vector parameter please refer to the first embodiment and the second embodiment, and the description will not be repeated here.

A fourth determining module 708, configured to input target parameters corresponding to the batch of code files to be detected into the target model, and determine similarity measurement parameters between the batch of code files to be detected and the privacy code files, where the target model is obtained by training the machine learning device 700 provided in the embodiment shown in fig. 7.

The privacy code determining module 709 is configured to determine whether the privacy code file exists in the batch of code files to be detected based on the similarity measurement parameter between the batch of code files to be detected and the privacy code file.

After the target model is trained, the machine learning apparatus 700 provided in the embodiment shown in fig. 8 may further use the target model to determine whether the batch of code files to be detected includes privacy code files. First, a code file to be detected is treated as ordinary text, and code files that have known privacy tags and are similar in content to the code file to be detected are selected by comparing similarity measurement parameters; the selected files serve as the basis for determining the target parameters corresponding to the batch of code files to be detected, which makes the method fast and scalable. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the similarity between the batch of code files to be detected and the privacy code files is determined from at least one of two angles: the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

As shown in fig. 9, the present specification further provides a device 900 for determining a privacy code based on machine learning, and in one software implementation, the device 900 may include: an obtaining module 901, a screening module 902, a first parameter determining module 903, a second parameter determining module 904, and a privacy code determining module 905.

An obtaining module 901, configured to obtain a batch of code files to be detected.

The screening module 902 is configured to screen a plurality of third code files from the batch of code files to be detected based on the similarity measurement parameter of the code files, and screen a plurality of fourth code files from a preset code library of which privacy tags of the code files are known, where one third code file corresponds to one fourth code file, and a similarity degree between the third code file and the corresponding fourth code file satisfies the preset condition.
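
As a non-limiting illustration of the screening step, the following Python sketch pairs each code file to be detected with its closest file in the preset code library. The claims state that the similarity measurement parameter of a code file may be an approximate hash value; the sketch uses a SimHash-style fingerprint as one possible approximate hash. The choice of SimHash, the 64-bit width, the use of a Hamming-distance threshold as the preset condition, and the names simhash, hamming, and screen are all assumptions made for the sketch.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption)


def simhash(text):
    """Return a 64-bit SimHash-style fingerprint of the text's word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, weight in enumerate(weights) if weight > 0)


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def screen(candidates, library, max_distance=3):
    """Pair each candidate file with its nearest library file.

    candidates and library map a file path to the file's text. A pair is
    kept only when the Hamming distance is at most max_distance, which
    stands in for the "preset condition" (assumption).
    """
    library_hashes = {path: simhash(text) for path, text in library.items()}
    pairs = []
    if not library_hashes:
        return pairs
    for path, text in candidates.items():
        fingerprint = simhash(text)
        nearest = min(library_hashes, key=lambda p: hamming(fingerprint, library_hashes[p]))
        if hamming(fingerprint, library_hashes[nearest]) <= max_distance:
            pairs.append((path, nearest))  # (third code file, fourth code file)
    return pairs
```

Because the fingerprints are compared bit by bit rather than by parsing the source, the code files are handled as ordinary text, in line with the description of the screening step.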

A first parameter determining module 903, configured to determine target parameters corresponding to the batch of code files to be detected based on the multiple third code files and the multiple fourth code files.

Therefore, when the target parameters corresponding to the sample data adopted in training the target model include the structural similarity parameters, the target parameters corresponding to the batch of code files to be detected include the structural similarity parameters, wherein the structural similarity parameters represent the similarity between the file structure tree formed by the third code files and the file structure tree formed by the fourth code files. When target parameters corresponding to sample data adopted in training of a target model comprise word vector parameters, the target parameters corresponding to the batch of code files to be detected comprise target word vector parameters, wherein the target word vector parameters comprise at least one third word vector and at least one fourth word vector, the third word vector is a word vector of a keyword in a path of the third code file, and the fourth word vector is a word vector of the keyword in the path of the fourth code file. Please refer to other embodiments for the process of determining the target parameters corresponding to the batch of code files to be detected, which will not be described repeatedly herein.

A second parameter determining module 904, configured to input the target parameters corresponding to the batch of code files to be detected into a target model, and determine similarity measurement parameters between the batch of code files to be detected and the privacy code files, where the target model is obtained by training with a machine learning method in an embodiment of the present specification. For the specific training process, refer to the description above; it is not repeated here.

The privacy code determination module 905 is configured to determine whether the privacy code files exist in the batch of code files to be detected based on the similarity measurement parameter between the batch of code files to be detected and the privacy code files.
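
As a non-limiting illustration of the final determination step, the following Python sketch feeds target parameters to a model and compares the resulting similarity measurement parameter with a threshold. This section does not specify the decision rule or the form of the target model, so the threshold of 0.5, the representation of the target parameters as a list of numbers, and the names contains_privacy_code and stand_in_model are assumptions made for the sketch. The stand-in model is only a placeholder and is not the trained target model.

```python
from typing import Callable, Sequence


def contains_privacy_code(
    target_parameters: Sequence[float],
    target_model: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> bool:
    """Return True when the model's similarity score reaches the threshold.

    The threshold is an assumed decision rule, not one given in the text.
    """
    similarity = target_model(target_parameters)
    return similarity >= threshold


def stand_in_model(features: Sequence[float]) -> float:
    """Placeholder for a trained target model: the mean of the features."""
    return sum(features) / len(features) if features else 0.0


# Hypothetical target parameters, for example a structural similarity
# score and a keyword similarity score for one batch of code files.
batch_has_privacy_code = contains_privacy_code([0.9, 0.7], stand_in_model)
```

In practice the stand-in would be replaced by the trained target model, and the threshold would be chosen according to how many false alarms are acceptable.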

In the privacy code determination apparatus based on machine learning provided in the embodiment shown in fig. 9, first, a code file to be detected is treated as ordinary text, and code files that have known privacy tags and are similar in content to the code file to be detected are selected by comparing similarity measurement parameters; the selected files serve as the basis for determining the target parameters corresponding to the batch of code files to be detected, which makes the method fast and scalable. Second, by comparing at least one of the structural similarity parameter and the word vector parameter of the code files, the similarity between the batch of code files to be detected and the privacy code files is determined from at least one of two angles: the directory structure and the directory keywords of the code files. In short, the method can greatly improve the efficiency of privacy code detection.

It should be noted that the privacy code determination apparatus 900 based on machine learning can implement the method of the method embodiment in fig. 4. For details, refer to the privacy code determination method based on machine learning in the embodiment shown in fig. 4; they are not repeated here.

While certain embodiments of the present disclosure have been described above, other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and for relevant points, reference may be made to the corresponding description of the method embodiment.

In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of one or more embodiments of the present disclosure should be included in the scope of protection of one or more embodiments of the present disclosure.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. In the absence of further limitation, the statement "comprises" or "comprising" a specified element does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the specified element.

Claims

1. A machine learning method for determining a privacy code, comprising:

2. The method of claim 1, further comprising, prior to said filtering out a plurality of first code files from said sample data and a plurality of second code files from a predetermined code library for which privacy tags of the code files are known based on similarity metric parameters of the code files:

and determining similarity measurement parameters of the sample data and the code files in the preset code library.

3. The method of claim 1, said target parameters comprising structural similarity parameters, wherein said determining target parameters corresponding to said sample data based on said first and second plurality of code files comprises:

constructing a first file structure tree based on the paths of the plurality of first code files;

constructing a second file structure tree based on the paths of the plurality of second code files;

and determining a structural similarity parameter corresponding to the sample data based on the first file structure tree and the second file structure tree.

4. The method of claim 3, further comprising, before said determining the structural similarity parameter corresponding to the sample data based on the first file structure tree and the second file structure tree:

cutting off the isolated nodes in the first file structure tree to obtain a first sub-tree formed by the remaining nodes;

cutting off the isolated nodes in the second file structure tree to obtain a second sub-tree formed by the remaining nodes;

wherein the determining the structural similarity parameter corresponding to the sample data based on the first file structure tree and the second file structure tree includes:

and determining the structural similarity parameter corresponding to the sample data based on the first subtree and the second subtree.

5. The method of claim 1, the target parameters comprising word vector parameters, wherein said determining target parameters corresponding to the sample data based on the plurality of first code files and the plurality of second code files comprises:

extracting keywords in paths of the plurality of first code files;

determining word vectors of keywords in paths of the plurality of first code files to obtain at least one first word vector;

extracting keywords in paths of the plurality of second code files;

determining word vectors of the keywords in the paths of the plurality of second code files to obtain at least one second word vector.

6. The method of claim 5,

wherein the extracting keywords in the paths of the plurality of first code files comprises:

extracting keywords in paths of the plurality of first code files based on a word frequency-inverse text frequency index TF-IDF algorithm;

wherein the extracting the keywords in the paths of the plurality of second code files comprises:

extracting keywords in paths of the plurality of second code files based on a TF-IDF algorithm.

7. The method according to any one of claims 1 to 6,

the similarity measurement parameter of the code file is an approximate hash value of the code file.

8. The method of claim 1, further comprising:

acquiring a batch of code files to be detected;

screening a plurality of third code files from the batch of code files to be detected based on similarity measurement parameters of the code files, and screening a plurality of fourth code files from a preset code library, wherein one third code file corresponds to one fourth code file, and the similarity degree between the third code file and the corresponding fourth code file meets the preset condition;

inputting target parameters corresponding to the batch of code files to be detected into the target model, and determining similarity measurement parameters of the batch of code files to be detected and the privacy code files;

9. The method of claim 8,

when the target parameters corresponding to the sample data comprise structural similarity parameters, the target parameters corresponding to the batch of code files to be detected comprise the target structural similarity parameters, wherein the target structural similarity parameters represent the similarity between a file structure tree formed by the third code files and a file structure tree formed by the fourth code files;

when the target parameters corresponding to the sample data comprise word vector parameters, the target parameters corresponding to the batch of code files to be detected comprise target word vector parameters, wherein the target word vector parameters comprise at least one third word vector and at least one fourth word vector, the third word vector is a word vector of a keyword in a path of the third code file, and the fourth word vector is a word vector of a keyword in a path of the fourth code file.

10. The method according to claim 9, wherein the target parameters corresponding to the batch of code files to be detected include target structural similarity parameters, wherein the determining the target parameters corresponding to the batch of code files to be detected based on the plurality of third code files and the plurality of fourth code files includes:

constructing a third file structure tree based on the paths of the plurality of third code files;

constructing a fourth file structure tree based on the paths of the plurality of fourth code files;

and determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third file structure tree and the fourth file structure tree.

11. The method according to claim 10, further comprising, before determining the target structure similarity parameters corresponding to the batch of code files to be detected based on the third file structure tree and the fourth file structure tree, the following steps:

cutting off the isolated nodes in the third file structure tree to obtain a third sub-tree formed by the remaining nodes;

cutting off the isolated nodes in the fourth file structure tree to obtain a fourth sub-tree formed by the remaining nodes;

wherein the determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third file structure tree and the fourth file structure tree comprises:

and determining target structure similarity parameters corresponding to the batch of code files to be detected based on the third subtree and the fourth subtree.

12. The method according to claim 9, wherein the target parameters corresponding to the batch of code files to be detected include target word vector parameters, wherein the determining the target parameters corresponding to the batch of code files to be detected based on the plurality of third code files and the plurality of fourth code files includes:

extracting keywords in paths of the plurality of third code files;

determining word vectors of keywords in paths of the plurality of third code files to obtain at least one third word vector;

extracting keywords in paths of the plurality of fourth code files;

determining word vectors of the keywords in the paths of the plurality of fourth code files to obtain at least one fourth word vector.

13. The method of claim 12,

wherein the extracting the keywords in the paths of the plurality of third code files comprises:

extracting keywords in paths of the plurality of third code files based on a TF-IDF algorithm;

wherein the extracting the keywords in the paths of the plurality of fourth code files comprises:

extracting keywords in paths of the plurality of fourth code files based on a TF-IDF algorithm.

14. A privacy code determination method based on machine learning, comprising:

acquiring a batch of code files to be detected;

inputting target parameters corresponding to the batch of code files to be detected into a target model, and determining similarity measurement parameters of the batch of code files to be detected and privacy code files, wherein the target model is obtained by training through the machine learning method of any one of claims 1-13;

15. A machine learning apparatus for determining a privacy code, comprising:

16. A privacy code determination apparatus based on machine learning, comprising:

a second parameter determination module, configured to input target parameters corresponding to the batch of code files to be detected into a target model, and determine similarity measurement parameters between the batch of code files to be detected and the privacy code files, where the target model is obtained by training the machine learning method according to any one of claims 1 to 13;

17. An electronic device, comprising:

a processor; and

18. A computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to:

19. An electronic device, comprising:

a processor; and

acquiring a batch of code files to be detected;

20. A computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to:

acquiring a batch of code files to be detected;