WO2024045989A1 - Processing method and apparatus for graph network data sets, electronic device, program, and medium - Google Patents

Processing method and apparatus for graph network data sets, electronic device, program, and medium Download PDF

Info

Publication number
WO2024045989A1
WO2024045989A1 (PCT/CN2023/110370)
Authority
WO
WIPO (PCT)
Prior art keywords
training set
sample data
data
test set
training
Prior art date
Application number
PCT/CN2023/110370
Other languages
English (en)
French (fr)
Inventor
李龙飞
张振中
梁烁斌
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司 filed Critical 京东方科技集团股份有限公司
Publication of WO2024045989A1 publication Critical patent/WO2024045989A1/zh

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • The present disclosure belongs to the technical field of knowledge graphs, and particularly relates to a processing method and apparatus for graph network data sets, an electronic device, a program, and a medium.
  • Graph network structure data is composed of nodes and the edges connecting those nodes.
  • Graph-structured data can represent the topological relationships between nodes well. By learning feature representations of graph network nodes, those representations can be used to classify nodes or to predict edge connections between two nodes. However, if there are a large number of edge connections between nodes in the network at the outset, the learned node representations will overfit, so it is necessary to select a subset of the edge connections that effectively represents the entire graph. To represent the entire graph to the greatest extent, the data is generally divided into a training set and a test set when learning network node representations: the training set is used to train the model, and the test set is used to verify the model's prediction ability. The training set data needs to represent the entire graph as accurately and completely as possible, so how sample data is collected for the training set is particularly important.
  • the training set and the test set are usually obtained by dividing the graph network data set according to a certain proportion.
  • However, this method may cause all the association pairs of a certain node to be divided into the test set, leaving the corresponding node in the training set with no associations at all, that is, an isolated node.
  • the present disclosure provides a processing method, device, electronic equipment, program and medium for graph network data sets.
  • Some embodiments of the present disclosure provide a method for processing graph network data sets.
  • the method includes:
  • The positive sample data in the second training set and the negative sample data in the second test set are exchanged, so that the proportion of positive samples between the exchanged second training set and second test set meets the target proportion and there are no isolated nodes in the exchanged second training set; obtaining the third training set and the third test set includes:
  • Determine target positive sample data in the second training set, and the row data and column data where the target positive sample data is located include at least two positive sample data;
  • the target positive sample data and the negative sample data in the second test set are exchanged until the number of exchanged positive sample data reaches the target sample number, and a third training set and a third test set are obtained.
  • obtaining the target number of samples to be exchanged from the first test set to the first training set includes:
  • the node data whose number of positive sample data is greater than or equal to 2 in the column data corresponding to the positive sample data in the row data is used as the target positive sample data.
  • Before using the node data whose number of positive sample data is greater than or equal to 2 in the column data corresponding to the positive sample data in the row data as the target positive sample data, the method further includes:
  • determining the isolated nodes with no correlation in the training set includes:
  • the node corresponding to the adjacency matrix is regarded as an isolated node.
  • The positive sample data in the second training set and the negative sample data in the second test set are exchanged, so that the proportion of positive samples between the exchanged second training set and second test set meets the target proportion and there are no isolated nodes in the exchanged second training set.
  • The node correlations in the third training set and the third test set are used to characterize the correlation between drugs and diseases.
  • the third training set and the third test set are used to train a score prediction model, wherein the score prediction model is used to predict the correlation between the input drug information and disease information.
  • using the third training set and the third test set to train the score prediction model includes:
  • calculating the loss value of the prediction probability score matrix includes:
  • (i, j) represents the association pair of the i-th drug and the j-th disease
  • S⁺ represents the set of all known drug-disease association pairs;
  • S⁻ represents the set of all unknown or unobserved drug-disease association pairs;
  • a balance factor is introduced to reduce the impact of data imbalance; A′ is the prediction probability score matrix, u represents the number of rows of the prediction score matrix, and v represents the number of columns of the prediction score matrix.
  • Some embodiments of the present disclosure provide a device for processing graph network data sets.
  • the device includes:
  • the dividing module is configured to divide the original graph network data set according to the target proportion to obtain the first training set and the first test set;
  • a determination module configured to determine isolated nodes with no associations in the training set;
  • a swap module configured to swap the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set;
  • the exchange module is also configured to:
  • Determine target positive sample data in the second training set, and the row data and column data where the target positive sample data is located include at least two positive sample data;
  • the target positive sample data and the negative sample data in the second test set are exchanged until the number of exchanged positive sample data reaches the target sample number, and a third training set and a third test set are obtained.
  • the exchange module is also configured to:
  • the node data whose number of positive sample data is greater than or equal to 2 in the column data corresponding to the positive sample data in the row data is used as the target positive sample data.
  • the exchange module is also configured to:
  • the determining module is also configured to:
  • the node corresponding to the adjacency matrix is regarded as an isolated node.
  • the node association relationships in the third training set and the third test set are used to characterize the association between drugs and diseases; the module further includes:
  • the training module is configured as:
  • the third training set and the third test set are used to train a score prediction model, wherein the score prediction model is used to predict the correlation between the input drug information and disease information.
  • the training module is also configured to:
  • the training module is also configured to:
  • (i, j) represents the association pair of the i-th drug and the j-th disease
  • S⁺ represents the set of all known drug-disease association pairs;
  • S⁻ represents the set of all unknown or unobserved drug-disease association pairs;
  • a balance factor is introduced to reduce the impact of data imbalance; A′ is the prediction probability score matrix, u represents the number of rows of the prediction score matrix, and v represents the number of columns of the prediction score matrix.
  • Some embodiments of the present disclosure provide a computing processing device, including:
  • a memory having computer readable code stored therein;
  • one or more processors; when the computer readable code is executed by the one or more processors, the computing processing device performs the above-described processing method for a graph network data set.
  • Some embodiments of the present disclosure provide a computer program, including computer readable code, which, when run on a computing processing device, causes the computing processing device to perform the processing method of a graph network data set as described above.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable medium storing computer readable code that implements the above-described processing method for a graph network data set.
  • Figure 1 schematically shows a flow chart of a graph network data set processing method provided by some embodiments of the present disclosure
  • Figure 2 schematically shows the effect of a graph network data set processing method provided by some embodiments of the present disclosure
  • Figure 3 schematically shows a flow chart of another method for processing a graph network data set provided by some embodiments of the present disclosure
  • Figure 4 schematically shows a flow chart of yet another method for processing a graph network data set provided by some embodiments of the present disclosure
  • Figure 5 schematically shows a logical diagram of a method for processing a graph network data set provided by some embodiments of the present disclosure
  • Figure 6 schematically shows a flow chart of a model training method provided by some embodiments of the present disclosure
  • Figure 7 schematically shows a structural diagram of a graph network data set processing device provided by some embodiments of the present disclosure
  • Figure 8 schematically illustrates a block diagram of a computing processing device for performing methods according to some embodiments of the present disclosure
  • Figure 9 schematically illustrates a storage unit for holding or carrying program code implementing methods according to some embodiments of the present disclosure.
  • The data format is presented as a matrix of N rows and 2 columns; N, the number of rows, is the number of associations between nodes in the graph-structured data, that is, the number of 1s in the adjacency matrix. The row coordinate index of each association in the adjacency matrix is placed in the first column, and the column coordinate index in the second column. Therefore, in the situation described in this proposal, when the positive sample set of all associations between a node and the other nodes is randomly assigned to the test set, the number of distinct row coordinate indices in the first column becomes less than the number of rows of the adjacency matrix, or the number of distinct column coordinate indices in the second column becomes less than the number of columns of the adjacency matrix;
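  • The N × 2 edge-list format described above can be sketched as follows; this is an illustrative reconstruction (the function and variable names are our own), assuming a 0/1 numpy adjacency matrix:

```python
import numpy as np

def adjacency_to_edge_list(adj):
    """Convert a 0/1 adjacency matrix to the N x 2 format described above:
    row coordinate indices in the first column, column indices in the second."""
    rows, cols = np.nonzero(adj)           # positions of the 1s (positive samples)
    return np.stack([rows, cols], axis=1)  # shape (N, 2)

adj = np.array([[1, 0, 1],
                [0, 1, 0]])
edges = adjacency_to_edge_list(adj)
print(edges)       # [[0 0] [0 2] [1 1]]
print(len(edges))  # N = 3, the number of 1s in the adjacency matrix
```

  • If an entire row of positives were assigned to the test set, that row index would disappear from the first column of the training edge list, which is exactly the deduplication check the paragraph above describes.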
  • Measures also need to be taken so that, from the sample space composed of the node pairs corresponding to all rows and columns of the adjacency matrix, the subset most suitable for the current data processing task is selected.
  • The processing of the graph network data set provided by this disclosure is completed on the basis of exactly this knowledge constraint.
  • Figure 1 schematically shows a flow chart of a graph network data set processing method provided by the present disclosure.
  • the method includes:
  • Step 101 Divide the original graph network data set according to the target ratio to obtain a first training set and a first test set.
  • the original graph network data set is graph-structured data
  • graph-structured data is composed of nodes and edges connected between nodes.
  • The edges represent the associations between nodes; two nodes connected by such an association form a node pair.
  • An associated node pair is hereinafter referred to as positive sample data, and a node pair with no association is hereinafter referred to as negative sample data.
  • Since graph-structured data can represent the topological relationships between nodes well, learning feature representations of graph network nodes through predictive models can be used to classify nodes or to predict the connecting edges between two nodes. But if, at the beginning, there are a large number of edge connections between the nodes in the graph network, the learned network node representations will overfit.
  • the graph network data set is generally divided into a training set and a test set according to a preset target ratio.
  • the training set is used to train the model, and the test set is used to verify whether the prediction ability of the model reaches expected.
  • the system divides the positive sample data in the acquired original graph network data according to a target ratio.
  • The target ratio between the training set and the test set can be, for example, 4:1, 5:1, or 3:1; usually the training set contains more data than the test set.
  • The specific target ratio can be set according to actual needs and is not limited here.
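  • As a minimal sketch of this division step (our own illustration, not the patent's exact implementation; names and the seed are made up), the positive samples can be split by the target ratio like so:

```python
import numpy as np

def split_positive_samples(adj, train_fraction=0.8, seed=0):
    """Randomly split the positive samples (the 1s) of an adjacency matrix
    into a first training set and a first test set by the target ratio."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(adj == 1)            # coordinates of all positive samples
    rng.shuffle(pos)
    n_train = int(round(len(pos) * train_fraction))
    train_adj = np.zeros_like(adj)
    test_adj = np.zeros_like(adj)
    train_adj[tuple(pos[:n_train].T)] = 1  # e.g. 4/5 of positives -> training set
    test_adj[tuple(pos[n_train:].T)] = 1   # remaining 1/5 -> test set
    return train_adj, test_adj

adj = (np.random.default_rng(1).random((6, 5)) < 0.4).astype(int)
train_adj, test_adj = split_positive_samples(adj)
assert ((train_adj + test_adj) == adj).all()  # the two sets partition the positives
```

  • Because the split is purely random, nothing prevents every positive of some row or column from landing in the test set, which is precisely the isolated-node problem the following steps address.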
  • Step 102 Determine isolated nodes with no correlation in the training set.
  • the original graph network data set is generally a data set containing a certain amount of positive sample data.
  • If at least one row or column of the adjacency matrix composed of the sample data in the training set is all 0, that is, contains no positive sample data, the effect of model training is greatly affected: the model cannot fully learn the connection relationships of the nodes corresponding to that adjacency matrix.
  • One of the situations is that the associated pairs of at least one first-type node and all second-type nodes in the original graph network data set are all divided into the first test set, which means that at least one row of the first training set is all 0; Another situation is that the associated pairs of at least one second type node and all first type nodes in the original graph network data set are all divided into the first test set, which means that at least one column of the first training set is all 0.
  • the embodiment of the present disclosure will identify isolated nodes in the training set that are not associated with other nodes for subsequent data set optimization.
  • Step 103 Exchange the adjacency matrix of the isolated node in the first training set with the adjacency matrix corresponding to the position of the isolated node in the first test set to obtain a second training set and a second test set.
  • The original graph network data set is a positive sample data set, which is to say that each node has at least one association with other nodes. Therefore, if an isolated node has no association with other nodes in the first training set, it means that all the associations of the isolated node were divided into the first test set.
  • The adjacency matrix of the isolated node in the first training set is therefore exchanged with the corresponding adjacency matrix in the first test set, so that there is positive sample data in the adjacency matrix corresponding to the isolated node in the second training set after the exchange.
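  • A minimal sketch of this exchange (our own illustrative code, assuming the first type of nodes index the rows of a 0/1 numpy matrix):

```python
import numpy as np

def fix_isolated_rows(train_adj, test_adj):
    """For every all-zero row of the training adjacency matrix, swap it with
    the row at the same position in the test set, so the node regains
    positive samples. Returns the new sets and the number of positives moved."""
    train_adj = train_adj.copy()
    test_adj = test_adj.copy()
    moved = 0
    for i in range(train_adj.shape[0]):
        if train_adj[i].sum() == 0:  # isolated node in the training set
            train_adj[i], test_adj[i] = test_adj[i].copy(), train_adj[i].copy()
            moved += int(train_adj[i].sum())  # positives moved into training
    return train_adj, test_adj, moved

train = np.array([[0, 0], [1, 0]])
test = np.array([[1, 1], [0, 1]])
train2, test2, k = fix_isolated_rows(train, test)
print(train2)  # [[1 1] [1 0]]: row 0 now holds the positives from the test set
```

  • The count `k` of positives moved into the training set is what the later step calls the target sample number: it is how many positives must eventually be swapped back to restore the target ratio.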
  • the embodiment of the present disclosure will adjust the positive sample data in the second training set and the second test set so that the ratio of the positive sample data in the training set and the test set can meet the target ratio.
  • Step 104 Exchange the positive sample data in the second training set and the negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set meets the target proportion and no isolated nodes exist in the exchanged second training set, obtaining a third training set and a third test set.
  • The embodiment of the present disclosure adjusts the amount of positive sample data in the training set and the test set by exchanging sample data.
  • Specifically, the positive sample data in the second training set can be exchanged with the negative sample data in the second test set while checking whether each exchange would cause the associated node to become an isolated node, so that the proportion of positive sample data between the exchanged third training set and third test set meets the target proportion without isolated nodes reappearing in the third training set.
  • Figure 2 schematically presents a schematic diagram of the processing method of a graph network data set provided by an embodiment of the present disclosure, in which dark squares represent positive sample data with an association relationship and light squares represent negative sample data with no association relationship.
  • The process of dividing the original graph network data set into a training set and a test set is similar to the traditional random division process; after the preliminary division, the embodiment of the present disclosure further transfers the adjacency matrix data corresponding to isolated nodes in the training set to the test set.
  • The row data of the adjacency matrix where p1 is located are all negative sample data, and the row data where p2 is located are all negative sample data. Therefore, p1 is exchanged with p1' at the corresponding position in the first test set, and p2 with p2', to obtain the second training set and the second test set.
  • The embodiment of the present disclosure then selects the negative sample data p3' and p4' in the second test set and exchanges them with the positive sample data p3 and p4 at the corresponding positions in the second training set, so that the number of positive sample data in the third training set finally matches that of the first training set and the number in the third test set matches that of the first test set. While ensuring the ratio between the positive sample data of the training set and the test set, this also ensures that there are no isolated nodes in the training set.
  • In summary, the adjacency matrix of the isolated node in the training set is first exchanged with the adjacency matrix at the corresponding position in the test set, so that there are no isolated nodes in the training set; then, on the premise that isolated nodes will not reappear in the training set, positive sample data in the training set is exchanged with negative sample data in the test set. The final training set therefore not only contains no isolated nodes but also keeps its proportion of positive sample data relative to the test set in line with the target proportion of the original division. This ensures that, when a model is trained with the training set and the test set, the model is never prevented from learning a node's associations by that node being isolated in the training set, so the model can fully learn the associations of every node.
  • Figure 3 schematically shows a flow chart of another graph network data set processing method provided by the present disclosure.
  • the method includes:
  • Step 201 Divide the original graph network data set according to the target ratio to obtain a first training set and a first test set.
  • For details, please refer to the detailed description of step 101, which will not be repeated here.
  • Step 202 Obtain the adjacency matrix of each node in the first training set.
  • Suppose there is a bipartite graph bg(u, v, ε), equivalent to g(u ∪ v, ε), where u and v represent the two node domains: the first type node set and the second type node set. u_i and v_j represent the i-th node of u and the j-th node of v respectively. All edges of the bipartite graph lie strictly between u and v, and e_ij represents the edge between u_i and v_j. A is the adjacency matrix of the association network between the first type of nodes and the second type of nodes.
  • Step 203 When the sample data in any row or column in the adjacency matrix is negative sample data, the node corresponding to the adjacency matrix is regarded as an isolated node.
  • When the sample data in any row or column of the adjacency matrix composed of the data samples in the first training set are all negative sample data, that is, the values are all equal to 0, the node corresponding to that row or column is taken as an isolated node for the subsequent adjacency matrix replacement. In this way, isolated nodes in the training set can be easily identified.
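  • This detection rule can be sketched as follows (an illustrative fragment with our own names, assuming a 0/1 numpy adjacency matrix):

```python
import numpy as np

def find_isolated_nodes(adj):
    """Return the indices of all-zero rows and all-zero columns: nodes whose
    row or column of the adjacency matrix contains only negative samples."""
    iso_rows = np.where(adj.sum(axis=1) == 0)[0]
    iso_cols = np.where(adj.sum(axis=0) == 0)[0]
    return iso_rows, iso_cols

adj = np.array([[0, 0, 0],
                [1, 0, 1]])
rows, cols = find_isolated_nodes(adj)
print(rows)  # [0] -> first-type node 0 is isolated
print(cols)  # [1] -> second-type node 1 is isolated
```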
  • Step 204 Exchange the adjacency matrix of the isolated node in the first training set with the adjacency matrix corresponding to the position of the isolated node in the first test set to obtain a second training set and a second test set.
  • For details, please refer to the detailed description of step 103, which will not be repeated here.
  • Step 205 Obtain the target number of samples exchanged from the first test set to the first training set.
  • The target sample number refers to the number of positive sample data that were exchanged from the first test set into the first training set.
  • The target sample number can be obtained by counting the positive sample data in the exchanged adjacency matrices.
  • Step 206 Determine target positive sample data in the second training set.
  • the row data and column data where the target positive sample data is located include at least two positive sample data.
  • The embodiment of the present disclosure considers that transferring a certain positive sample data from the second training set to the second test set may cause a row or column of the adjacency matrix where that positive sample data is located to again consist entirely of negative sample data, that is, values all equal to 0. Therefore, when selecting the target positive sample data, the embodiment of the present disclosure selects positive sample data whose row data and column data in the adjacency matrix each contain at least two positive sample data. This ensures that, after the target sample data is transferred to the second test set, at least one positive sample data still exists in the row data and the column data where the target sample data originally resided, preventing the nodes corresponding to the target positive sample data from becoming isolated nodes again.
  • Step 207 Exchange the target positive sample data and the negative sample data in the second test set until the number of exchanged positive sample data reaches the target number of samples to obtain a third training set and a third test set.
  • The determined target positive sample data and the negative sample data in the second test set are exchanged, so that the positive sample data in the second training set is reduced by 1 and the positive sample data in the second test set is increased by 1.
  • This adjusts the proportion of positive sample data between the training set and the test set. Considering that each exchange of target positive sample data changes the row data and column data of the adjacency matrix of the second training set, data that previously qualified as target positive sample data may no longer qualify after a swap. Therefore, the process of step 206 must be re-entered after each exchange, cyclically selecting target positive sample data until the number of exchanged target positive sample data reaches the target sample number.
  • In this way, the number of positive sample data contained in the third training set equals that of the first training set, and the number contained in the third test set equals that of the first test set, so that the ratio of positive sample data between the third training set and the third test set obtained by the exchange meets the target ratio.
  • step 206 includes:
  • Step 2061 Obtain row data in the second training set whose number of contained positive sample data is greater than or equal to 2.
  • Step 2062 Use the node data whose number of positive sample data is greater than or equal to 2 in the column data corresponding to the positive sample data in the row data as the target positive sample data.
  • Step 2063 When there is no row data in the second training set that contains positive sample data greater than or equal to 2, stop the process of exchanging positive samples from the second training set to the second test set.
  • The positive sample data at the intersection of the selected row data and column data is used as the target positive sample data, so that the row data and the column data where the target positive sample data is located each contain at least two positive sample data.
  • If no row data containing two or more positive sample data can be found, there is no target positive sample data in the second training set available for exchange, and the exchange process is stopped; in that case, positive sample data can instead be added to the test set so that the positive sample data in the training set and the test set meet the target ratio.
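  • The selection rule of steps 2061-2063 can be sketched as follows (our own illustration): a positive sample qualifies as target positive sample data only if both its row and its column contain at least two positive samples, so removing it cannot create an isolated node.

```python
import numpy as np

def find_target_positive_samples(train_adj):
    """Positions (i, j) with train_adj[i, j] == 1 whose row i and column j
    each contain at least two positive samples."""
    row_counts = train_adj.sum(axis=1)
    col_counts = train_adj.sum(axis=0)
    candidates = []
    for i, j in np.argwhere(train_adj == 1):
        if row_counts[i] >= 2 and col_counts[j] >= 2:
            candidates.append((int(i), int(j)))
    return candidates

train = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0]])
print(find_target_positive_samples(train))  # [(0, 0), (0, 1), (1, 1)]
```

  • Because the row and column counts change after every swap, this selection must be recomputed each iteration, matching the cyclic re-entry into step 206 described above.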
  • FIG. 5 schematically illustrating a logic diagram of a graph network data set processing method provided by an embodiment of the present disclosure:
  • The positive sample set is the set of positions with a value of 1 in the adjacency matrix, that is, the set of associated node pairs between the first type of nodes and the second type of nodes.
  • According to the predetermined ratio of training set to test set, for example 4:1, 4/5 of all positive sample data are randomly sampled as the training set, recorded as A_train, and 1/5 as the test set, recorded as A_test.
  • In step S11, if k is greater than 0, return to step S4; if k is less than or equal to 0, proceed to step S12.
  • The node association relationships in the third training set and the third test set are used to characterize the association relationship between drugs and diseases.
  • The method further includes: using the third training set and the third test set to train a score prediction model, where the score prediction model is used to predict the correlation between input drug information and disease information.
  • The score prediction model is a model used to predict the probability score of the association between nodes, and can be a logistic classification model, a decision tree model, a support vector machine model, a Bayesian model, or another model.
  • The nodes in the graph network data set can represent drugs and diseases, so the edges between nodes can be used to represent the correlations between drugs and diseases. A score prediction model trained with the third training set and the third test set can therefore fully learn the correlations between drugs and diseases, improving the accuracy of the score prediction model.
  • the training process of the score prediction model is as follows:
  • Step 301 Use the third training set to train the score prediction model.
  • Step 302 Use the third test set to test the trained score prediction model to obtain a prediction probability score matrix.
  • Step 303 Calculate the loss value of the prediction probability score matrix.
  • Step 304 When the loss value meets the training requirement, confirm that the score prediction model training is completed.
  • the formula (1) of the decoder f( HR , HD ) selected by the present disclosure is as follows:
  • A′ represents the predicted probability score matrix
  • the predicted score of the association between drug r_i and disease d_j is given by the corresponding entry A′_ij
  • sigmoid() is the activation function
  • H_R represents the learned drug node embeddings
  • H_D represents the learned disease node embeddings.
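  • Taking the definitions above (A′, sigmoid, H_R, H_D) at face value, a decoder of this shape is commonly the inner-product form A′ = sigmoid(H_R H_D^T). Since formula (1) itself did not survive extraction, the following numpy sketch shows that common form as an assumption, with made-up embedding sizes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(H_R, H_D):
    """A' = sigmoid(H_R @ H_D.T): entry A'[i, j] is the predicted probability
    score for the association of drug r_i and disease d_j."""
    return sigmoid(H_R @ H_D.T)

rng = np.random.default_rng(0)
H_R = rng.standard_normal((4, 8))  # 4 drug node embeddings, dimension 8
H_D = rng.standard_normal((3, 8))  # 3 disease node embeddings, dimension 8
A_pred = decode(H_R, H_D)
assert A_pred.shape == (4, 3)      # one score per drug-disease pair
```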
  • The score prediction model is trained using the third training set. Since every node in the third training set has been optimized through the processing method of the graph network data set provided by the present disclosure, every node in the third training set has at least one association pair with other nodes, so the trained score prediction model can completely learn the correlations between the nodes in the third training set. Then the test set is used to test the trained score prediction model. Since the third test set has also been optimized by the processing method provided by the present disclosure, the proportion between the amounts of data in the third test set and the third training set is in line with the expected target proportion and can meet the testing requirements of the score prediction model. The third test set is predicted by the trained score prediction model to obtain the predicted probability score matrix.
  • Based on whether the loss value reaches the loss value threshold, or whether the loss value converges within the loss value range, the result of the training process of the score prediction model can be determined. If the loss value does not meet the expected training requirements, the parameters of the score prediction model are adjusted and training continues until the loss value of the trained score prediction model meets the expected training requirements.
  • step 303 includes:
  • (i, j) represents the association pair of the i-th drug and the j-th disease
  • S⁺ represents the set of all known drug-disease association pairs;
  • S⁻ represents the set of all unknown or unobserved drug-disease association pairs;
  • a balancing factor is introduced to reduce the impact of data imbalance; A′ is the prediction probability score matrix, u represents the number of rows of the prediction score matrix, and v represents the number of columns of the prediction score matrix.
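  • The loss described above (a balancing factor weighting the known pairs S⁺ against the unknown pairs S⁻, normalized by the u × v matrix size) is consistent with a weighted binary cross-entropy. The following is a hedged sketch of that common form, not a verbatim reproduction of the patent's formula; in particular, the choice of balance factor lam = |S⁻| / |S⁺| is our assumption.

```python
import numpy as np

def weighted_bce_loss(A_pred, A_true):
    """Weighted binary cross-entropy over a u x v probability score matrix.
    lam up-weights the scarce positive pairs; the exact factor used by the
    patent is not stated, so lam = |S-| / |S+| is assumed here."""
    u, v = A_true.shape
    pos = A_true == 1            # S+: known drug-disease association pairs
    neg = ~pos                   # S-: unknown or unobserved pairs
    lam = neg.sum() / pos.sum()  # balancing factor (assumption)
    eps = 1e-12                  # numerical safety for the logarithms
    return -(lam * np.log(A_pred[pos] + eps).sum()
             + np.log(1.0 - A_pred[neg] + eps).sum()) / (u * v)

A_true = np.array([[1, 0], [0, 1]])
A_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
print(weighted_bce_loss(A_pred, A_true))
```

  • Whatever the exact factor, the intent matches the text: without the weighting, the far more numerous unknown pairs in S⁻ would dominate the gradient.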
  • Figure 7 schematically shows a schematic structural diagram of a graph network data set processing device 40 provided by the present disclosure.
  • the device includes:
  • the dividing module 401 is configured to divide the original graph network data set according to the target proportion to obtain the first training set and the first test set;
  • the determination module 402 is configured to determine isolated nodes that have no association in the training set;
  • the exchange module 403 is configured to exchange the adjacency matrix of the isolated node in the first training set with the adjacency matrix of the position corresponding to the isolated node in the first test set to obtain the second training set and the second test set;
  • the exchange module 403 is also configured to:
  • Determine target positive sample data in the second training set, and the row data and column data where the target positive sample data is located include at least two positive sample data;
  • the target positive sample data and the negative sample data in the second test set are exchanged until the number of exchanged positive sample data reaches the target sample number, and a third training set and a third test set are obtained.
  • the exchange module 403 is also configured to:
  • among the positive sample data of that row data, the node data whose corresponding column data contains two or more positive sample data is taken as the target positive sample data.
  • the exchange module 403 is also configured to:
  • the determination module 402 is also configured to:
  • the node corresponding to the adjacency matrix is regarded as an isolated node.
  • the node association relationships in the third training set and the third test set are used to characterize the association between drugs and diseases; the apparatus further includes:
  • the training module is configured as:
  • the third training set and the third test set are used to train a score prediction model, wherein the score prediction model is used to predict the correlation between the input drug information and disease information.
  • the training module is also configured to:
  • the training module is also configured to:
  • (i, j) represents the association pair of the i-th drug and the j-th disease
  • S + represents the set of all known drug-disease association pairs
  • S- represents the set of all unknown or unobserved drug-disease association pairs.
  • the balance factor is used to reduce the impact of data imbalance; A′ is the predicted probability score matrix; u denotes the number of rows and v the number of columns of the predicted score matrix.
  • after the graph network data set is divided into a training set and a test set, the adjacency matrix of each isolated node in the training set is first exchanged with the adjacency matrix at the corresponding position in the test set, so that no isolated node remains in the training set;
  • then, on the premise that no isolated node reappears in the training set, positive sample data in the training set are exchanged with negative sample data in the test set, so that the final training set not only contains no isolated nodes but also keeps the proportion of positive sample data between it and the test set at the target proportion of the earlier division; this ensures that, when a model is trained with the training and test sets, the model is never prevented from learning a node's associations by that node being isolated in the training set, and can fully learn the associations of every node.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which persons of ordinary skill in the art can understand and implement without creative effort.
  • Various component embodiments of the present disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components in a computing processing device according to embodiments of the present disclosure.
  • the present disclosure may also be implemented as an apparatus or apparatus program (eg, computer program and computer program product) for performing part or all of the methods described herein.
  • the program may be stored on a non-transitory computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, or provided on a carrier signal, or in any other form.
  • Figure 8 illustrates a computing processing device that may implement methods in accordance with the present disclosure.
  • the computing processing device conventionally includes a processor 510 and a computer program product in the form of memory 520 or non-transitory computer-readable medium.
  • Memory 520 may be electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the memory 520 has a storage space 530 for program code 531 for executing any method steps in the above-described methods.
  • the storage space 530 for program codes may include individual program codes 531 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such computer program products are typically portable or fixed storage units as described with reference to FIG. 9 .
  • the storage unit may have storage segments, storage spaces, etc. arranged similarly to the memory 520 in the computing processing device of FIG. 8 .
  • the program code may, for example, be compressed in a suitable form.
  • the storage unit includes computer-readable code 531', i.e. code that can be read by a processor such as 510, which, when run by a computing processing device, causes the computing processing device to perform the various steps of the methods described above.
  • any reference signs placed between parentheses shall not be construed as limiting the claim.
  • the word “comprising” does not exclude the presence of elements or steps not listed in a claim.
  • the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
  • the present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several means, several of these means may be embodied by one and the same item of hardware.
  • the use of the words first, second, third, etc. does not indicate any order. These words can be interpreted as names.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a processing method and apparatus for a graph network data set, an electronic device, a program and a medium, belonging to the technical field of knowledge graphs. The method comprises: dividing an original graph network data set according to a target proportion to obtain a first training set and a first test set; determining isolated nodes in the training set that have no association relationship; exchanging the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set; and exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.

Description

Processing method and apparatus for a graph network data set, electronic device, program and medium
Cross-reference to related applications
The present disclosure claims priority to the Chinese patent application No. 202211057371.4, entitled "Processing method and apparatus for a graph network data set, electronic device, program and medium", filed with the China National Intellectual Property Administration on August 31, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure belongs to the technical field of knowledge graphs, and particularly relates to a processing method and apparatus for a graph network data set, an electronic device, a program and a medium.
Background
Graph network structure data is composed of nodes and the edges connecting them. Graph-structured data can represent the topological relationships between nodes well: by learning feature representations of graph network nodes, the representations can be used to classify nodes or to predict the edge connection between pairs of nodes. However, if the network initially contains a large number of edge connections between nodes, the learned node representations will overfit; some edge connections between nodes must be selected effectively for the current network to represent the whole graph to the greatest extent. Hence, when learning network node representations, the data is generally divided into a training set, used to train the model, and a test set, used to verify the model's predictive ability. The training set data needs to represent the whole graph as fully as possible, so how sample data is collected for the training set is particularly important.
In the related art, a training set and a test set are usually obtained by dividing the graph network data set according to a certain proportion. However, with this approach all association pairs of some node may be assigned entirely to the test set, so that the sample data corresponding to that node in the training set contains no association between it and any other node. As a result, a model subsequently trained on this training set cannot fully learn the associations between the nodes of the graph network data set.
Summary
The present disclosure provides a processing method and apparatus for a graph network data set, an electronic device, a program and a medium.
Some embodiments of the present disclosure provide a processing method for a graph network data set, the method comprising:
dividing an original graph network data set according to a target proportion to obtain a first training set and a first test set;
determining isolated nodes in the training set that have no association relationship;
exchanging the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set;
exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.
Optionally, exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain the third training set and the third test set, comprises:
obtaining a target sample number exchanged from the first test set to the first training set;
determining target positive sample data in the second training set, the row data and column data in which the target positive sample data is located containing at least two positive sample data;
exchanging the target positive sample data with negative sample data in the second test set until the number of exchanged positive sample data reaches the target sample number, to obtain the third training set and the third test set.
Optionally, obtaining the target sample number exchanged from the first test set to the first training set comprises:
obtaining row data in the second training set that contains two or more positive sample data;
taking, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data.
Optionally, before taking, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data, the method further comprises:
when no row data containing two or more positive sample data exists in the second training set, stopping the process of exchanging positive samples from the second training set to the second test set.
Optionally, determining isolated nodes in the training set that have no association relationship comprises:
obtaining the adjacency matrix of each node in the first training set;
when the sample data in any row or any column of the adjacency matrix are all negative sample data, taking the node corresponding to the adjacency matrix as an isolated node.
Optionally, after exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain the third training set and the third test set, the method further comprises:
the node association relationships in the third training set and the third test set being used to characterize associations between drugs and diseases;
training a score prediction model using the third training set and the third test set, wherein the score prediction model is used to predict the degree of association between input drug information and disease information.
Optionally, training the score prediction model using the third training set and the third test set comprises:
training the score prediction model using the third training set;
testing the trained score prediction model using the third test set to obtain a predicted probability score matrix;
calculating a loss value of the predicted probability score matrix;
when the loss value meets the training requirements, confirming that training of the score prediction model is complete.
Optionally, calculating the loss value of the predicted probability score matrix comprises:
inputting the predicted probability score matrix into the following formula to obtain the loss value loss:
where (i, j) denotes the association pair of the i-th drug and the j-th disease, S+ denotes the set of all known drug-disease association pairs, S− denotes the set of all unknown or unobserved drug-disease association pairs, the balance factor is used to reduce the impact of data imbalance, A′ is the predicted probability score matrix, and u and v denote the numbers of rows and columns of the predicted score matrix, respectively.
Some embodiments of the present disclosure provide a processing apparatus for a graph network data set, the apparatus comprising:
a dividing module, configured to divide an original graph network data set according to a target proportion to obtain a first training set and a first test set;
a determination module, configured to determine isolated nodes in the training set that have no association relationship;
an exchange module, configured to exchange the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set;
and to exchange positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.
Optionally, the exchange module is further configured to:
obtain a target sample number exchanged from the first test set to the first training set;
determine target positive sample data in the second training set, the row data and column data in which the target positive sample data is located containing at least two positive sample data;
exchange the target positive sample data with negative sample data in the second test set until the number of exchanged positive sample data reaches the target sample number, to obtain the third training set and the third test set.
Optionally, the exchange module is further configured to:
obtain row data in the second training set that contains two or more positive sample data;
take, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data.
Optionally, the exchange module is further configured to:
when no row data containing two or more positive sample data exists in the second training set, stop the process of exchanging positive samples from the second training set to the second test set.
Optionally, the determination module is further configured to:
obtain the adjacency matrix of each node in the first training set;
when the sample data in any row or any column of the adjacency matrix are all negative sample data, take the node corresponding to the adjacency matrix as an isolated node.
Optionally, the node association relationships in the third training set and the third test set are used to characterize associations between drugs and diseases; the apparatus further comprises:
a training module, configured to:
train a score prediction model using the third training set and the third test set, wherein the score prediction model is used to predict the degree of association between input drug information and disease information.
Optionally, the training module is further configured to:
train the score prediction model using the third training set;
test the trained score prediction model using the third test set to obtain a predicted probability score matrix;
calculate a loss value of the predicted probability score matrix;
when the loss value meets the training requirements, confirm that training of the score prediction model is complete.
Optionally, the training module is further configured to:
input the predicted probability score matrix into the following formula to obtain the loss value loss:
where (i, j) denotes the association pair of the i-th drug and the j-th disease, S+ denotes the set of all known drug-disease association pairs, S− denotes the set of all unknown or unobserved drug-disease association pairs, the balance factor is used to reduce the impact of data imbalance, A′ is the predicted probability score matrix, and u and v denote the numbers of rows and columns of the predicted score matrix, respectively.
Some embodiments of the present disclosure provide a computing processing device, comprising:
a memory in which computer-readable code is stored;
one or more processors, wherein when the computer-readable code is executed by the one or more processors, the computing processing device performs the processing method for a graph network data set as described above.
Some embodiments of the present disclosure provide a computer program, comprising computer-readable code which, when run on a computing processing device, causes the computing processing device to perform the processing method for a graph network data set as described above.
Some embodiments of the present disclosure provide a non-transitory computer-readable medium storing the processing method for a graph network data set as described above.
The above description is merely an overview of the technical solutions of the present disclosure. In order that the technical means of the present disclosure may be understood more clearly and implemented in accordance with the contents of the specification, and that the above and other objects, features and advantages of the present disclosure may become more apparent, specific embodiments of the present disclosure are set forth below.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and persons of ordinary skill in the art may derive other drawings from them without creative effort.
Figure 1 schematically shows a flow diagram of a processing method for a graph network data set provided by some embodiments of the present disclosure;
Figure 2 schematically shows an effect diagram of a processing method for a graph network data set provided by some embodiments of the present disclosure;
Figure 3 schematically shows a flow diagram of another processing method for a graph network data set provided by some embodiments of the present disclosure;
Figure 4 schematically shows a flow diagram of still another processing method for a graph network data set provided by some embodiments of the present disclosure;
Figure 5 schematically shows a logic diagram of a processing method for a graph network data set provided by some embodiments of the present disclosure;
Figure 6 schematically shows a flow diagram of a model training method provided by some embodiments of the present disclosure;
Figure 7 schematically shows a structural diagram of a processing apparatus for a graph network data set provided by some embodiments of the present disclosure;
Figure 8 schematically shows a block diagram of a computing processing device for performing methods according to some embodiments of the present disclosure;
Figure 9 schematically shows a storage unit for holding or carrying program code implementing methods according to some embodiments of the present disclosure.
Detailed description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are part rather than all of the embodiments of the present disclosure. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
A natural problem exists in such graph networks: in the process of learning network node representations, randomly dividing the data into a training set and a test set may assign all of the connections between some node and the other nodes to the test set. Depending on the presentation form of the data format, this situation has a corresponding formulation.
Formulation 1: for graph-structured data represented directly as an adjacency matrix, the data structure takes the form of a two-dimensional matrix; if a pair of nodes is related, the corresponding matrix position is 1, otherwise 0. In this data format, the situation described in this proposal manifests as the set of positive samples of all associations between some node and the other nodes being randomly assigned to the test set, so that some row or column of the graph network is entirely 0.
Formulation 2: for graph-structured data in sparse representation. If, in an adjacency matrix, the number of zero elements far exceeds the number of non-zero elements and the non-zero elements are distributed irregularly, the adjacency matrix is called a sparse matrix; conversely, if non-zero elements form the majority, the matrix is called dense. The density of a matrix is defined as the ratio of the total number of non-zero elements to the total number of elements. Since zero elements form the majority in most graph-structured data, i.e. the adjacency matrix is usually sparse, converting the sparse adjacency matrix into a sparse representation greatly improves computation speed and reduces memory usage, because the computer operates on and stores only the non-zero elements; this is a prominent advantage of sparse matrices. The data format is an N-row, 2-column matrix: the number of rows N is the number of related node pairs in the graph-structured data, i.e. the number of 1s in the adjacency matrix; the row indices of the adjacency matrix are placed in the first column, and the column indices in the second column. In this format, the situation described in this proposal manifests as the set of positive samples of all associations between some node and the other nodes being randomly assigned to the test set, so that the de-duplicated row indices of the first column number fewer than the rows of the adjacency matrix, or the de-duplicated column indices of the second column number fewer than its columns.
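The dense-to-sparse conversion and the de-duplication check described in Formulation 2 can be sketched in Python as follows (illustrative only, not part of the original disclosure; the toy matrix `A` is an assumption):

```python
import numpy as np

# A toy adjacency matrix; row 2 is all zeros, i.e. the corresponding
# node has no observed association (an "isolated node").
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])

# The sparse (COO-style) form from the text: an N x 2 matrix whose first
# column holds the row indices and whose second column holds the column
# indices of the non-zero entries.
rows, cols = np.nonzero(A)
coo = np.stack([rows, cols], axis=1)

# The isolation test from the text: after de-duplication, fewer distinct
# row indices than matrix rows means some row is entirely zero.
has_isolated_row = len(np.unique(coo[:, 0])) < A.shape[0]
```

Operating only on the `coo` pairs is what gives the sparse form its speed and memory advantage mentioned above.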
In both representations of graph-structured data, this can be interpreted as certain nodes of the graph network being isolated nodes. When such a graph structure is used for learning embedded representations of graph nodes, the isolated nodes simply fail to learn effective node embeddings, which degrades the metric performance of the model.
Regarding the problem of randomly dividing a graph network into a training set and a test set, or, put differently, in the process of randomly removing some of the edges connecting pairs of nodes in graph network structure data: random division may assign all of the connections between some node and the other nodes to the test set.
From the perspective of data distribution, samples collected in the current random manner do not make the training-set sample collection represent the whole sample space to the greatest extent.
For the field of machine learning, however, the choice of the training-set sample space ultimately has a great influence on the evaluation metrics of the data processing model.
For the task of learning graph network node representations, the choice of the sample space formed by the node pairs corresponding to the rows and columns of the adjacency matrix also requires measures to ensure that the sample space most suitable for the current data processing task is selected.
At present there is no related research in the field of sample-space selection for graph networks; a subset of all samples is simply chosen at random as the training set to represent the whole sample space, which leads to the aforementioned situation where some rows or columns of the adjacency matrix are entirely 0. In general, only when none of the rows and columns of the adjacency matrix required by the representation learning process of graph network nodes is entirely 0 can the topological structure features of the network itself be learned to the fullest extent, and only then can the subsequent representation learning model for network nodes be evaluated most accurately and objectively.
Whether a connection exists between two classes of nodes is known in the original sample space, and this can serve as prior knowledge to guide the subsequent collection of training samples well. The knowledge referred to here is that, for the algorithmic learning of graph network representations, the case where every node has at least one edge connecting it to another node (case one), compared with the case where isolated nodes exist, i.e. some nodes have no edge to any other node (case two), certainly yields different learned node representations; based on experience and experimental verification across multiple models, the first case is generally better than the second.
Therefore, the data collection process should at least guarantee the knowledge constraint that every node has at least one edge connecting it to another node, and the processing method for a graph network data set provided by the present disclosure is carried out under this knowledge constraint.
Figure 1 schematically shows a flow diagram of a processing method for a graph network data set provided by the present disclosure, the method comprising:
Step 101: dividing an original graph network data set according to a target proportion to obtain a first training set and a first test set.
It should be noted that the original graph network data set is graph-structured data, which consists of nodes and the edges connecting them; an edge represents an association between two nodes. A pair of nodes with an association will hereinafter be called positive sample data, and a pair without an association will be called negative sample data. Since graph-structured data can represent the topological relationships between nodes well, learning feature representations of graph network nodes through a prediction model can be used to classify nodes or to predict the connecting edge between two nodes. But if the graph network initially contains many edge connections between nodes, the learned node representations overfit; it is necessary to effectively select, for the current network, some edge connections between nodes to represent the whole graph network, so as to represent it to the greatest extent. Therefore, in the process of learning graph network node representations, the graph network data set is generally divided, according to a preset target proportion, into a training set used to train the model and a test set used to verify whether the model's predictive ability meets expectations.
In an embodiment of the present disclosure, the system divides the positive sample data of the acquired original graph network data according to a target proportion. The target proportion may be a ratio between the training set and the test set such as 4:1, 5:1 or 3:1; usually the training set is larger than the test set. Of course, the specific target proportion can be set according to actual needs and is not limited here.
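The random division of positive samples by a target proportion can be sketched as follows (an illustrative Python fragment using NumPy; the toy matrix and the 4:1 ratio are assumptions, not data from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Positions of all positive samples (edges) in a toy adjacency matrix.
A = np.array([[1, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 1, 1]])
pos = np.argwhere(A == 1)            # one (row, col) pair per edge

# Split the positive pairs 4:1 into training and test portions.
rng.shuffle(pos)
n_train = round(len(pos) * 4 / 5)
train_pairs, test_pairs = pos[:n_train], pos[n_train:]

# Materialise the two adjacency matrices A_train and A_test.
A_train = np.zeros_like(A)
A_train[tuple(train_pairs.T)] = 1
A_test = np.zeros_like(A)
A_test[tuple(test_pairs.T)] = 1
```

Every positive sample lands in exactly one of the two matrices, so their sum reconstructs the original adjacency matrix.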
Step 102: determining isolated nodes in the training set that have no association relationship.
In an embodiment of the present disclosure, although the original graph network data set generally contains a certain amount of positive sample data, the following two situations in the data set division of step 101 will cause at least one row or column of the adjacency matrix formed by the sample data of the training set to be entirely 0, i.e. to contain no positive sample data. This greatly affects the training of the model, preventing it from fully learning the connections of the node corresponding to that adjacency matrix. One situation is that the association pairs between at least one first-class node of the original graph network data set and all second-class nodes are all assigned to the first test set, which manifests as at least one row of the first training set being entirely 0; the other is that the association pairs between at least one second-class node and all first-class nodes are all assigned to the first test set, which manifests as at least one column of the first training set being entirely 0.
Therefore, after the original graph network data set is divided, embodiments of the present disclosure identify the isolated nodes of the training set that have no association with any other node, for use in the subsequent optimization of the data set.
Step 103: exchanging the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set.
In an embodiment of the present disclosure, the adjacency matrix corresponding to the isolated node in the first training set is exchanged with the adjacency matrix corresponding to that isolated node in the first test set. Since the original graph network data set is a positive-sample data set, i.e. every node has at least one association with some other node, if the isolated node has no association with other nodes in the first training set, its associations with other nodes must have been assigned to the first test set, so the adjacency matrix corresponding to the isolated node in the first test set necessarily contains positive sample data. Exchanging the adjacency matrices corresponding to the isolated node between the first training set and the first test set therefore ensures that, after the exchange, the adjacency matrix corresponding to that node in the second training set contains positive sample data.
However, since the exchange of adjacency matrices adds positive sample data to the second training set and correspondingly removes positive sample data from the second test set, the ratio of the amounts of positive sample data in the second training set and the second test set clearly no longer conforms to the target proportion. Embodiments of the present disclosure therefore adjust the positive sample data of the second training set and the second test set so that the ratio of positive sample data between the training set and the test set conforms to the target proportion.
Step 104: exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.
In an embodiment of the present disclosure, considering that exchanging adjacency matrices could move the adjacency matrix of a previously exchanged isolated node back from the test set into the training set, embodiments of the present disclosure adjust the amounts of positive sample data in the training and test sets by exchanging individual sample data. Specifically, positive sample data of the second training set whose exchange would not turn an associated node into an isolated node can be selected and exchanged with negative sample data of the second test set, so that the proportion between the positive sample data of the resulting third training set and third test set conforms to the target proportion without isolated nodes reappearing in the third training set.
For ease of understanding, Figure 2 schematically shows an effect diagram of a processing method for a graph network data set provided by an embodiment of the present disclosure, in which dark squares represent positive sample data with associations and light squares represent negative sample data without associations. The process of dividing the original graph network data set into a training set and a test set is similar to the traditional random division. After this initial division, embodiments of the present disclosure first exchange the adjacency matrices corresponding to isolated nodes of the training set into the test set. For example, in Figure 2 the row data of the adjacency matrix containing p1 in the first training set is entirely negative sample data, as is the row data of the adjacency matrix containing p2; p1 is therefore exchanged with p1' at the corresponding position in the first test set, and p2 with p2', yielding the second training set and the second test set.
At this point, however, the second training set clearly contains two more positive sample data than the first training set, and the second test set two fewer than the first test set. Embodiments of the present disclosure therefore further select the negative sample data p3' and p4' of the second test set and exchange them with the positive sample data p3 and p4 at the corresponding positions of the second training set, so that the third training set finally has the same amount of positive sample data as the first training set and the third test set the same amount as the first test set. This preserves the proportion of positive sample data between the training and test sets while also ensuring that the training set contains no isolated nodes.
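The adjacency-row exchange illustrated above with p1 and p2 can be sketched as follows (illustrative Python; the toy matrices are assumptions, and for brevity only all-zero rows are swapped, columns being handled symmetrically):

```python
import numpy as np

# A_train has an all-zero row (row 1): that node's only associations
# landed in A_test during the random split.
A_train = np.array([[1, 0, 1],
                    [0, 0, 0],
                    [0, 1, 0]])
A_test = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

# Swap every all-zero row of A_train with the same row of A_test, so
# that the training set no longer contains isolated-row nodes.
for i in range(A_train.shape[0]):
    if A_train[i].sum() == 0:
        A_train[i], A_test[i] = A_test[i].copy(), A_train[i].copy()
```

After the swap, the positive count of the training set has grown, which is exactly why the subsequent per-sample exchange step is needed to restore the target proportion.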
In embodiments of the present disclosure, after the graph network data set is divided into a training set and a test set, the adjacency matrices of isolated nodes of the training set are first exchanged with the adjacency matrices at the corresponding positions of the test set so that no isolated node remains in the training set; then, on the premise that no isolated node reappears in the training set, positive sample data of the training set are exchanged with negative sample data of the test set. The resulting training set therefore not only contains no isolated nodes, but its proportion of positive sample data relative to the test set also conforms to the target proportion of the earlier division. This guarantees that, when the model is trained with the training and test sets, it is never prevented from learning a node's associations by that node being isolated in the training set, so the model can fully learn the associations of every node.
Figure 3 schematically shows a flow diagram of another processing method for a graph network data set provided by the present disclosure, the method comprising:
Step 201: dividing an original graph network data set according to a target proportion to obtain a first training set and a first test set.
For this step, reference may be made to the detailed description of step 101, which is not repeated here.
Step 202: obtaining the adjacency matrix of each node in the first training set.
In an embodiment of the present disclosure, suppose there is a bipartite graph bg(u, v, ε), equivalent to g(u∪v, ε), where u and v are the two sets of the two node domains, represented by the first-class node set and the second-class node set, and ui and vj denote the i-th node of u and the j-th node of v, respectively. All edges of the bipartite graph lie strictly between u and v, and eij denotes the edge between ui and vj. A is the adjacency matrix of the association network between the first-class and second-class nodes: if the i-th node of the first-class node set is associated with the j-th node of the second-class node set, i.e. eij = 1, then A(i, j) = 1; otherwise A(i, j) = 0. Whether a node is associated with other nodes can therefore be determined from whether the values in its corresponding adjacency matrix are 0.
Step 203: when the sample data in any row or any column of the adjacency matrix are all negative sample data, taking the node corresponding to the adjacency matrix as an isolated node.
In an embodiment of the present disclosure, a node whose corresponding row or column in the adjacency matrix formed by the data samples of the first training set consists entirely of negative sample data, i.e. of values that are all 0, can be taken as an isolated node for use in the subsequent adjacency matrix exchange, so that the isolated nodes of the training set can be identified conveniently.
Step 204: exchanging the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set.
For this step, reference may be made to the detailed description of step 103, which is not repeated here.
Step 205: obtaining the target sample number exchanged from the first test set to the first training set.
In an embodiment of the present disclosure, the target sample number refers to the number of positive sample data already exchanged from the first test set into the first training set. It can be obtained by counting the positive sample data of the exchanged adjacency matrices.
Step 206: determining target positive sample data in the second training set, the row data and column data in which the target positive sample data is located containing at least two positive sample data.
In an embodiment of the present disclosure, considering that exchanging some positive sample data of the second training set into the second test set could leave a row or column of the adjacency matrix containing that positive sample data entirely negative again, i.e. with all values 0, embodiments of the present disclosure, when selecting target positive sample data, select positive sample data whose row data and column data in the adjacency matrix contain at least two positive sample data. This ensures that, after the target sample data is exchanged into the second test set, the row data and column data where it was originally located still contain at least one positive sample data, preventing the node corresponding to the target positive sample data from becoming an isolated node again.
Step 207: exchanging the target positive sample data with negative sample data in the second test set until the number of exchanged positive sample data reaches the target sample number, to obtain a third training set and a third test set.
In an embodiment of the present disclosure, the determined target positive sample data is exchanged with negative sample data of the second test set, so that the positive sample data of the second training set decreases by 1 while that of the second test set increases by 1, thereby adjusting the proportion of positive sample data between the training and test sets. Since the row data and column data of the adjacency matrices of the second training set change after each exchange of target positive sample data, positive sample data that previously qualified as target positive sample data may no longer qualify after an exchange; the process therefore returns to step 206 after each exchange to select target positive sample data in a loop, until the number of exchanged target positive sample data reaches the target sample number. Since the ratio of positive sample data between the first training set and the first test set originally conformed to the target proportion, after exchanging target positive sample data in this way the amount of positive sample data of the second training set equals that of the first training set and the amount of the second test set equals that of the first test set, so the ratio of the amounts of positive sample data of the resulting third training set and third test set conforms to the target proportion.
Optionally, referring to Figure 4, step 206 comprises:
Step 2061: obtaining row data in the second training set that contains two or more positive sample data.
Step 2062: taking, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data.
Step 2063: when no row data containing two or more positive sample data exists in the second training set, stopping the process of exchanging positive samples from the second training set to the second test set.
In an embodiment of the present disclosure, row data of the second training set containing two or more positive sample data is first selected; then, among the positive sample data of that row data, column data containing two or more positive sample data is selected, and the positive sample data at the intersection of the selected column data and row data is taken as the target positive sample data. It can be seen that, after the target positive sample data is changed to negative sample data, the row data and column data where it was located will still contain at least one positive sample data. Of course, if no row data containing two or more positive sample data is found, no target positive sample data available for exchange exists in the second training set; the exchange process stops here, and the positive sample data of the training and test sets can be brought to the target proportion by replenishing positive sample data into the test set.
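The selection rule of steps 2061-2062, namely that a positive sample qualifies only if its row and its column each still contain at least two positives, can be sketched as follows (illustrative Python; the toy matrix is an assumption):

```python
import numpy as np

# Toy training adjacency matrix; 1 marks a positive sample.
A_train = np.array([[1, 1, 0],
                    [0, 1, 0],
                    [1, 0, 1]])

# A positive sample is a valid exchange target only if both its row and
# its column still contain at least two positives, so moving it to the
# test set cannot create a new isolated node.
candidates = [
    (i, j)
    for i, j in np.argwhere(A_train == 1)
    if A_train[i].sum() >= 2 and A_train[:, j].sum() >= 2
]
```

Here the positive at (1, 1) is excluded because its row holds a single positive, and the one at (2, 2) because its column does.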
By way of example, Figure 5 schematically shows a logic diagram of a processing method for a graph network data set provided by an embodiment of the present disclosure:
S1. Select all positive samples. The positive sample set is the set of node pairs formed between first-class and second-class nodes at the positions whose value is 1 in the adjacency matrix. According to the predetermined training/test proportion, e.g. 4:1, randomly sample 4/5 of all positive sample data as the training set, denoted Atrain, and 1/5 as the test set, denoted Atest.
S2. Exchange the all-zero rows and columns of the adjacency matrix Atrain formed by the training data samples with the corresponding positions of the adjacency matrix Atest formed by the test data samples. After this step, the requirement that none of the rows and columns of Atrain is all zero is satisfied.
S3. Calculate the number k of samples that currently need to be collected and moved from the training set into the test set.
S4. Enter a loop whose condition is k > 0.
S5. Obtain the numbers of all-zero rows and columns of Atrain, row number and column number.
S6. If row number = 0 and column number = 0, the algorithm condition is not satisfied and the logic process exits; if neither is 0, continue to step S7.
S7. Obtain, among all rows of Atrain, the position list row index list and the count row count of the rows whose number of node connections is >= 2.
S8. If row count = 0, the algorithm condition is not satisfied and the logic process exits; if row count ≠ 0, continue to step S9.
S9. Randomly select an i from the positions row index, then obtain the position list column index list of all node connections of row i of Atrain.
S10. Randomly select a j from the positions column index; if the number of node connections of column j of Atrain is >= 2, set Atrain(i, j) = 0 and Atest(i, j) = 1, and decrement k by 1.
S11. If k is greater than 0, return to step S4; if k is less than or equal to 0, go to step S12.
S12. Output Atrain and Atest.
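The loop of steps S4-S11 can be sketched in Python as follows (an illustrative reading, not the original implementation; the early `break` in place of S10's implicit retry is a simplification, and the toy matrices are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_positives(A_train, A_test, k):
    # Move k positive samples from the training adjacency matrix into
    # the test one without leaving any all-zero row or column behind.
    A_train, A_test = A_train.copy(), A_test.copy()
    while k > 0:
        # S7: candidate rows that still hold at least two positives.
        rows = [i for i in range(A_train.shape[0]) if A_train[i].sum() >= 2]
        if not rows:                        # S8: no candidate row, stop
            break
        i = rng.choice(rows)                # S9: pick a candidate row
        # S10: positives of row i whose column also keeps >= 2 positives.
        cols = [j for j in np.flatnonzero(A_train[i])
                if A_train[:, j].sum() >= 2]
        if not cols:                        # simplification of S10's retry
            break
        j = rng.choice(cols)
        A_train[i, j], A_test[i, j] = 0, 1  # S10: swap the sample
        k -= 1                              # S11: one fewer to move
    return A_train, A_test

A_train = np.array([[1, 1, 0],
                    [0, 1, 1],
                    [1, 0, 1]])
A_test = np.zeros_like(A_train)
A_train2, A_test2 = swap_positives(A_train, A_test, 2)
```

Because a sample is only moved when its row and column each keep at least one remaining positive, the output training matrix has no all-zero row or column.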
Optionally, the node association relationships of the third training set and the third test set are used to characterize associations between drugs and diseases, and the method further comprises: training a score prediction model using the third training set and the third test set, wherein the score prediction model is used to predict the degree of association between input drug information and disease information.
In an embodiment of the present disclosure, the score prediction model is a model that predicts a score for the probability that an association exists between nodes; it may be a classification model such as a logistic classification model, a decision tree model, a support vector machine model or a Bayesian model. In practical applications, the nodes of the graph network data set can represent drugs and diseases, so the edges between nodes can represent associations between drugs and diseases. A score prediction model trained with the third training set and the third test set can thus fully learn the associations between drugs and diseases, improving the accuracy of the score prediction model.
Optionally, referring to Figure 6, the training process of the score prediction model is as follows:
Step 301: training the score prediction model using the third training set.
Step 302: testing the trained score prediction model using the third test set to obtain a predicted probability score matrix.
Step 303: calculating a loss value of the predicted probability score matrix.
Step 304: when the loss value meets the training requirements, confirming that training of the score prediction model is complete.
In an embodiment of the present disclosure, by way of example, in order to reconstruct the associations between drugs and diseases, the present disclosure selects a decoder f(HR, HD) given by formula (1) as follows:
where A′ denotes the predicted probability score matrix, the predicted score of the association between drug ri and disease dj is given by the corresponding entry A′ij, sigmoid() is the activation function, HR denotes the learned drug node embeddings, and HD denotes the disease node embeddings.
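Since formula (1) itself is reproduced only as an image in the source, the following sketch assumes the common inner-product decoder A′ = sigmoid(HR · HD^T), which matches the variables named in the surrounding text; the embedding sizes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# H_R: learned drug-node embeddings (u drugs x d dims),
# H_D: learned disease-node embeddings (v diseases x d dims).
H_R = rng.normal(size=(4, 8))
H_D = rng.normal(size=(3, 8))

# Decoder f(H_R, H_D): the predicted probability score matrix A' whose
# (i, j) entry scores the association of drug i with disease j.
A_pred = sigmoid(H_R @ H_D.T)
```

Each entry of `A_pred` lies strictly between 0 and 1, so it can be read directly as an association probability score.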
First, the score prediction model is trained with the third training set. Since every node of the third training set has been optimized by the processing method for a graph network data set provided by the present disclosure, every node of the third training set has at least one association pair with the other nodes, so the trained score prediction model can completely learn the associations between the nodes of the third training set. The trained score prediction model is then tested with the test set. Since the third test set has likewise been optimized by the processing method for a graph network data set provided by the present disclosure, the ratio between the amounts of data of the third test set and the third training set conforms to the expected target proportion and satisfies the testing requirements of the score prediction model. Feeding the third test set to the trained score prediction model yields the predicted probability score matrix, over which a loss value is computed. If the loss value meets the expected training requirement, for example the loss value is greater than or equal to the loss value threshold, or the loss value converges into the loss value range, it can be determined that the training process of the score prediction model is complete; if it does not meet the expected training requirement, the parameters of the score prediction model are adjusted and training continues until the loss value of the trained score prediction model meets the expected training requirement.
Optionally, step 303 comprises:
inputting the predicted probability score matrix into the following formula to obtain the loss value loss:
where (i, j) denotes the association pair of the i-th drug and the j-th disease, S+ denotes the set of all known drug-disease association pairs, S− denotes the set of all unknown or unobserved drug-disease association pairs, the balance factor is used to reduce the impact of data imbalance, A′ is the predicted probability score matrix, and u and v denote the numbers of rows and columns of the predicted score matrix, respectively.
In an embodiment of the present disclosure, since the known drug-disease associations have been manually verified, they are highly reliable and very important for improving prediction performance. However, the number of known drug-disease associations is far smaller than the number of unknown or unobserved drug-disease pairs. This solution therefore learns the parameters by minimizing a weighted binary cross-entropy loss.
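The weighted binary cross-entropy itself appears only as an image in the source; the sketch below therefore assumes a standard weighted form with balance factor lam = |S−| / |S+|, which matches the variables described in the text but is not guaranteed to be the patent's exact formula:

```python
import numpy as np

def weighted_bce(A_true, A_pred):
    # Weighted binary cross-entropy over all u x v drug-disease pairs.
    # The balance factor lam up-weights the scarce known associations
    # (assumed form: lam = |S-| / |S+|).
    u, v = A_true.shape
    pos = A_true == 1                      # S+: known association pairs
    lam = (~pos).sum() / pos.sum()         # balance factor
    loss = -(lam * np.log(A_pred[pos]).sum()
             + np.log(1 - A_pred[~pos]).sum()) / (u * v)
    return loss

# Tiny example: one known association out of four pairs.
A_true = np.array([[1, 0],
                   [0, 0]])
A_pred = np.array([[0.9, 0.1],
                   [0.1, 0.1]])
loss = weighted_bce(A_true, A_pred)
```

The loss shrinks as predictions approach the labels, with errors on the known associations penalised lam times more heavily than errors on the unobserved pairs.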
By way of example, reference is made to Table 1 below:
It can be seen from an analysis of the evaluation metrics of the model obtained with the training and test sets produced by the processing method for a graph network data set provided by the present disclosure that, of the two important metrics, AUPR improves by 2 percentage points and AUC by 10 percentage points, and the other metrics also improve to varying degrees, fully demonstrating the effectiveness of the measure.
Figure 7 schematically shows a structural diagram of a processing apparatus 40 for a graph network data set provided by the present disclosure, the apparatus comprising:
a dividing module 401, configured to divide an original graph network data set according to a target proportion to obtain a first training set and a first test set;
a determination module 402, configured to determine isolated nodes in the training set that have no association relationship;
an exchange module 403, configured to exchange the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set;
and to exchange positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.
Optionally, the exchange module 403 is further configured to:
obtain a target sample number exchanged from the first test set to the first training set;
determine target positive sample data in the second training set, the row data and column data in which the target positive sample data is located containing at least two positive sample data;
exchange the target positive sample data with negative sample data in the second test set until the number of exchanged positive sample data reaches the target sample number, to obtain the third training set and the third test set.
Optionally, the exchange module 403 is further configured to:
obtain row data in the second training set that contains two or more positive sample data;
take, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data.
Optionally, the exchange module 403 is further configured to:
when no row data containing two or more positive sample data exists in the second training set, stop the process of exchanging positive samples from the second training set to the second test set.
Optionally, the determination module 402 is further configured to:
obtain the adjacency matrix of each node in the first training set;
when the sample data in any row or any column of the adjacency matrix are all negative sample data, take the node corresponding to the adjacency matrix as an isolated node.
Optionally, the node association relationships of the third training set and the third test set are used to characterize associations between drugs and diseases; the apparatus further comprises:
a training module, configured to:
train a score prediction model using the third training set and the third test set, wherein the score prediction model is used to predict the degree of association between input drug information and disease information.
Optionally, the training module is further configured to:
train the score prediction model using the third training set;
test the trained score prediction model using the third test set to obtain a predicted probability score matrix;
calculate a loss value of the predicted probability score matrix;
when the loss value meets the training requirements, confirm that training of the score prediction model is complete.
Optionally, the training module is further configured to:
input the predicted probability score matrix into the following formula to obtain the loss value loss:
where (i, j) denotes the association pair of the i-th drug and the j-th disease, S+ denotes the set of all known drug-disease association pairs, S− denotes the set of all unknown or unobserved drug-disease association pairs, the balance factor is used to reduce the impact of data imbalance, A′ is the predicted probability score matrix, and u and v denote the numbers of rows and columns of the predicted score matrix, respectively.
In embodiments of the present disclosure, after the graph network data set is divided into a training set and a test set, the adjacency matrices of isolated nodes of the training set are first exchanged with the adjacency matrices at the corresponding positions of the test set so that no isolated node remains in the training set; then, on the premise that no isolated node reappears in the training set, positive sample data of the training set are exchanged with negative sample data of the test set. The resulting training set therefore not only contains no isolated nodes, but its proportion of positive sample data relative to the test set also conforms to the target proportion of the earlier division. This guarantees that, when the model is trained with the training and test sets, it is never prevented from learning a node's associations by that node being isolated in the training set, so the model can fully learn the associations of every node.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which persons of ordinary skill in the art can understand and implement without creative effort.
The component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Persons skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a computing processing device according to embodiments of the present disclosure. The present disclosure may also be implemented as a device or apparatus program (e.g. a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present disclosure may be stored on a non-transitory computer-readable medium or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, Figure 8 shows a computing processing device that can implement methods according to the present disclosure. The computing processing device conventionally comprises a processor 510 and a computer program product or non-transitory computer-readable medium in the form of a memory 520. The memory 520 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk or a ROM. The memory 520 has a storage space 530 for program code 531 for performing any of the method steps of the methods described above. For example, the storage space 530 for program code may comprise individual program codes 531 respectively for implementing various steps of the above methods. These program codes may be read from or written into one or more computer program products. These computer program products comprise program code carriers such as hard disks, compact discs (CDs), memory cards or floppy disks. Such computer program products are usually portable or fixed storage units as described with reference to Figure 9. The storage unit may have storage segments, storage spaces, etc. arranged similarly to the memory 520 of the computing processing device of Figure 8. The program code may, for example, be compressed in a suitable form. Usually, the storage unit comprises computer-readable code 531', i.e. code readable by a processor such as 510, which, when run by a computing processing device, causes the computing processing device to perform the various steps of the methods described above.
It should be understood that, although the steps in the flowcharts of the drawings are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the drawings may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Reference herein to "one embodiment", "an embodiment" or "one or more embodiments" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. In addition, note that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the specification provided herein. It is understood, however, that embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure an understanding of this specification.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any order; these words may be interpreted as names.
Finally, it should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solutions of the present disclosure. Although the present disclosure has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of the technical features with equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (12)

  1. A processing method for a graph network data set, the method comprising:
    dividing an original graph network data set according to a target proportion to obtain a first training set and a first test set;
    determining isolated nodes in the training set that have no association relationship;
    exchanging the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set;
    exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.
  2. The method according to claim 1, wherein exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain the third training set and the third test set, comprises:
    obtaining a target sample number exchanged from the first test set to the first training set;
    determining target positive sample data in the second training set, the row data and column data in which the target positive sample data is located containing at least two positive sample data;
    exchanging the target positive sample data with negative sample data in the second test set until the number of exchanged positive sample data reaches the target sample number, to obtain the third training set and the third test set.
  3. The method according to claim 2, wherein obtaining the target sample number exchanged from the first test set to the first training set comprises:
    obtaining row data in the second training set that contains two or more positive sample data;
    taking, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data.
  4. The method according to claim 3, wherein before taking, as target positive sample data, the node data among the positive sample data of the row data whose corresponding column data contains two or more positive sample data, the method further comprises:
    when no row data containing two or more positive sample data exists in the second training set, stopping the process of exchanging positive samples from the second training set to the second test set.
  5. The method according to claim 1, wherein determining isolated nodes in the training set that have no association relationship comprises:
    obtaining the adjacency matrix of each node in the first training set;
    when the sample data in any row or any column of the adjacency matrix are all negative sample data, taking the node corresponding to the adjacency matrix as an isolated node.
  6. The method according to claim 1, wherein after exchanging positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain the third training set and the third test set, the method further comprises:
    the node association relationships of the third training set and the third test set being used to characterize associations between drugs and diseases;
    training a score prediction model using the third training set and the third test set, wherein the score prediction model is used to predict the degree of association between input drug information and disease information.
  7. The method according to claim 6, wherein training the score prediction model using the third training set and the third test set comprises:
    training the score prediction model using the third training set;
    testing the trained score prediction model using the third test set to obtain a predicted probability score matrix;
    calculating a loss value of the predicted probability score matrix;
    when the loss value meets the training requirements, confirming that training of the score prediction model is complete.
  8. The method according to claim 7, wherein calculating the loss value of the predicted probability score matrix comprises:
    inputting the predicted probability score matrix into the following formula to obtain the loss value loss:
    where (i, j) denotes the association pair of the i-th drug and the j-th disease, S+ denotes the set of all known drug-disease association pairs, S− denotes the set of all unknown or unobserved drug-disease association pairs, the balance factor is used to reduce the impact of data imbalance, A′ is the predicted probability score matrix, and u and v denote the numbers of rows and columns of the predicted score matrix, respectively.
  9. A processing apparatus for a graph network data set, the apparatus comprising:
    a dividing module, configured to divide an original graph network data set according to a target proportion to obtain a first training set and a first test set;
    a determination module, configured to determine isolated nodes in the training set that have no association relationship;
    an exchange module, configured to exchange the adjacency matrix of the isolated node in the first training set with the adjacency matrix at the position corresponding to the isolated node in the first test set, to obtain a second training set and a second test set;
    and to exchange positive sample data in the second training set with negative sample data in the second test set, so that the proportion of positive samples between the exchanged second training set and second test set conforms to the target proportion and no isolated node exists in the exchanged second training set, to obtain a third training set and a third test set.
  10. A computing processing device, comprising:
    a memory in which computer-readable code is stored;
    one or more processors, wherein when the computer-readable code is executed by the one or more processors, the computing processing device performs the processing method for a graph network data set according to any one of claims 1-8.
  11. A computer program, comprising computer-readable code which, when run on a computing processing device, causes the computing processing device to perform the processing method for a graph network data set according to any one of claims 1-8.
  12. A non-transitory computer-readable medium storing a computer program of the processing method for a graph network data set according to any one of claims 1-8.
PCT/CN2023/110370 2022-08-31 2023-07-31 Processing method and apparatus for graph network data set, electronic device, program and medium WO2024045989A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211057371.4A CN115391561A (zh) 2022-08-31 2022-08-31 Processing method and apparatus for graph network data set, electronic device, program and medium
CN202211057371.4 2022-08-31

Publications (1)

Publication Number Publication Date
WO2024045989A1 true WO2024045989A1 (zh) 2024-03-07

Family

ID=84125103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/110370 WO2024045989A1 (zh) 2022-08-31 2023-07-31 图网络数据集的处理方法、装置、电子设备、程序及介质

Country Status (2)

Country Link
CN (1) CN115391561A (zh)
WO (1) WO2024045989A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391561A (zh) 2022-08-31 2022-11-25 BOE Technology Group Co., Ltd. Processing method and apparatus for graph network data set, electronic device, program and medium
CN115688907B (zh) * 2022-12-30 2023-04-21 University of Science and Technology of China Recommendation model training method based on graph propagation and recommendation method based on graph propagation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781933A (zh) * 2019-10-14 2020-02-11 Hangzhou Dianzi University A visual analysis method for understanding graph convolutional neural networks
US20210374499A1 (en) * 2020-05-26 2021-12-02 International Business Machines Corporation Iterative deep graph learning for graph neural networks
CN113934813A (zh) * 2020-07-14 2022-01-14 Sangfor Technologies Inc. A method, system and device for sample data division, and readable storage medium
CN115391561A (zh) * 2022-08-31 2022-11-25 BOE Technology Group Co., Ltd. Processing method and apparatus for graph network data set, electronic device, program and medium


Also Published As

Publication number Publication date
CN115391561A (zh) 2022-11-25

Similar Documents

Publication Publication Date Title
WO2024045989A1 (zh) 图网络数据集的处理方法、装置、电子设备、程序及介质 (Processing method and apparatus for graph network data set, electronic device, program and medium)
TWI689871B (zh) 梯度提升決策樹(gbdt)模型的特徵解釋方法和裝置
US20190279088A1 (en) Training method, apparatus, chip, and system for neural network model
CN107766929B (zh) Model analysis method and device
Li et al. A new method of image detection for small datasets under the framework of YOLO network
CN108090498A (zh) A fiber identification method and device based on deep learning
WO2023217290A1 (zh) Gene phenotype prediction based on graph neural networks
JP7457125B2 (ja) Translation method, device, electronic device and computer program
US20180018566A1 (en) Finding k extreme values in constant processing time
CN111753101A (zh) A knowledge graph representation learning method fusing entity descriptions and types
CN104615730B (zh) A multi-label classification method and device
WO2023103527A1 (zh) A method and device for predicting access frequency
CN110134777A (zh) Question deduplication method and device, electronic device and computer-readable storage medium
CN115482418A (zh) Semi-supervised model training method and system based on pseudo-negative labels, and application
CN113764034B (zh) Method, device, equipment and medium for predicting potential BGCs in genome sequences
Wu et al. Research on recognition method of leaf diseases of woody fruit plants based on transfer learning
CN112598089B (zh) 图像样本的筛选方法、装置、设备及介质
WO2024016949A1 (zh) 标签生成、图像分类模型的方法、图像分类方法及装置
Qin et al. Malaria cell detection using evolutionary convolutional deep networks
CN108229572B (zh) A parameter optimization method and computing device
CN115907775A (zh) Personal credit rating method based on deep learning and application thereof
WO2023004632A1 (zh) Knowledge graph updating method and apparatus, electronic device, storage medium and program
CN113159976B (zh) A method for identifying important users in a microblog network
CN115345248A (zh) A data debiasing method and device for deep learning
US20210241147A1 (en) Method and device for predicting pair of similar questions and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859024

Country of ref document: EP

Kind code of ref document: A1