CN110232393B - Data processing method and device, storage medium and electronic device - Google Patents

Data processing method and device, storage medium and electronic device

Info

Publication number
CN110232393B
CN110232393B
Authority
CN
China
Prior art keywords
data
target
node
nodes
topological graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810179509.5A
Other languages
Chinese (zh)
Other versions
CN110232393A (en)
Inventor
孙仕杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810179509.5A
Publication of CN110232393A
Application granted
Publication of CN110232393B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a data processing method and device, a storage medium and an electronic device. The method comprises the following steps: acquiring a node sequence in a topological graph to obtain first data, wherein the first data represents the node sequence and the topological graph comprises a plurality of nodes and the connection relations among them; collecting a target node in the topological graph; selecting node data continuously stored with target data as second data, wherein the target data represents the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data; and training a predetermined model according to the first data and the second data to obtain a trained target model. The invention solves the technical problem of low model-training efficiency caused by frequent memory updates.

Description

Data processing method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for processing data, a storage medium, and an electronic apparatus.
Background
When training a model, positive-sample and negative-sample data are typically collected. When negative samples are collected, all data are sampled according to their frequency of occurrence, so data that occur more often are more likely to be drawn as negative samples. Negative samples collected in this way, however, are stored at scattered locations in memory: part of the negative-sample data is pulled from memory into the cache, processed, and only then is the next part pulled from memory into the cache. The memory is therefore accessed and refreshed frequently, which makes model training inefficient.
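For illustration only, the following sketch shows one common form of such frequency-based (global) negative sampling, in the style of the Word2Vec unigram table; the 0.75 exponent, the table size and all identifiers are assumptions introduced here rather than part of the original disclosure.

```python
import random

def build_sampling_table(node_freq, power=0.75, table_size=100_000):
    # Each node appears in the table in proportion to freq ** power, so nodes
    # that occur more often are drawn as negatives with higher probability.
    total = sum(f ** power for f in node_freq.values())
    table = []
    for node, freq in node_freq.items():
        count = max(1, int(round(table_size * (freq ** power) / total)))
        table.extend([node] * count)
    return table

def global_negative_sample(table, k):
    # Each draw lands at an unrelated position, so the sampled node vectors
    # sit at scattered memory addresses and each may trigger a cache refill.
    return [random.choice(table) for _ in range(k)]
```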
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present invention provide a data processing method and device, a storage medium and an electronic device, so as to at least solve the technical problem of low model-training efficiency caused by frequent memory updates.
Another embodiment of the present invention provides a data processing method, including: acquiring a node sequence in a topological graph to obtain first data, wherein the first data represents the node sequence, and the topological graph comprises a plurality of nodes and the connection relations among the nodes; collecting a target node in the topological graph; selecting node data continuously stored with target data as second data, wherein the target data represents the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data; and training a predetermined model according to the first data and the second data to obtain a trained target model.
Another embodiment of the present invention further provides a data processing apparatus, including: an obtaining unit configured to obtain a node sequence in a topological graph to obtain first data, wherein the first data represents the node sequence, and the topological graph comprises a plurality of nodes and the connection relations among the nodes; an acquisition unit configured to collect a target node in the topological graph; a selecting unit configured to select node data continuously stored with target data as second data, wherein the target data represents the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data; and a training unit configured to train a predetermined model according to the first data and the second data to obtain a trained target model.
In another embodiment of the present invention, a storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute the above method when run.
In another embodiment of the present invention, an electronic apparatus is further provided, which includes a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the above method through the computer program.
In these embodiments, node data continuously stored with the target data is used as the negative-sample data, so the memory is refreshed fewer times when negative samples are selected. This solves the technical problem in the prior art that frequent memory updates during negative-sample selection make model training inefficient, and thereby achieves the technical effect of improving model-training efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of processing data according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a sequence of nodes according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a memory region according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of another memory region according to an embodiment of the invention;
FIG. 6 is a schematic diagram of yet another memory region according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a parameter server according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an apparatus for processing data according to an embodiment of the present invention;
fig. 9 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of an embodiment of the present invention, a data processing method is provided. In this embodiment, the data processing method is applied to a hardware environment formed by the terminal 101 and the server 102 shown in fig. 1. As shown in fig. 1, the terminal 101 is connected to the server 102 through a network. The terminal 101 may be, but is not limited to, a mobile phone, a PC, a notebook computer or a tablet computer. The server 102 includes a computing server 1021 and a storage server 1022.
Fig. 2 is a flowchart of a method of processing data according to an embodiment of the present invention. As shown in fig. 2, the data processing method includes:
step S202, a node sequence in a topological graph is obtained to obtain first data, wherein the node sequence comprises a plurality of nodes with connection relations, the first data are data of the plurality of nodes, and the topological graph comprises the nodes and the connection relations among the nodes.
Node2Vec is a graph-based feature representation framework. For any given graph, it learns a continuous feature vector for every node, and these vectors can then be used by downstream machine learning algorithms. The Node2Vec algorithm has two steps: first, node sequences are generated by a weighted random walk, and then the Word2Vec learning algorithm is used to learn a continuous feature vector for each node.
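As a rough illustration of the first step, the sketch below generates a node sequence with a simplified first-order weighted walk (Node2Vec itself uses a biased second-order walk); the data structure and all names are assumptions introduced here.

```python
import random

def weighted_random_walk(graph, start, walk_length):
    # graph: dict mapping a node id to a list of (neighbour_id, edge_weight) pairs.
    # At each step a neighbour is chosen with probability proportional to the
    # edge weight; the visited node ids form one node sequence.
    walk = [start]
    current = start
    for _ in range(walk_length):
        neighbours = graph.get(current, [])
        if not neighbours:
            break
        nodes, weights = zip(*neighbours)
        current = random.choices(nodes, weights=weights, k=1)[0]
        walk.append(current)
    return walk
```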
In the step of learning a continuous feature vector for each node with the Word2Vec learning algorithm, each positive sample needs to correspond to a plurality of negative samples in order to train a predetermined model and obtain a target model. The topological graph of this embodiment may be the graph used by the Node2Vec algorithm.
The topological graph comprises nodes and the connection relations between them, and a connection relation can be represented by a connecting line. As shown in fig. 3, the circles numbered 1 to 12 each represent a node, and the dotted and solid lines between the circles represent connection relations. Each node represents the data of one object, and the connection relations between nodes represent the relations among the corresponding objects. For example, in a topological graph representing users of an instant messaging application, each node represents one user, and a connection between two nodes can indicate whether there is interactive behavior between the two users: two connected nodes indicate that the two users have interacted, and two unconnected nodes indicate that they have not. Since there is no connection between node 1 and node 2 in fig. 3, there is no interaction between the user represented by node 1 and the user represented by node 2. Interactive behavior includes sending a message and browsing the other party's publicly visible information (messages, states and pictures published or forwarded in a public space, personal data, and so on). Alternatively, the connection relation between nodes may indicate whether two users are friends: two connected nodes indicate that the two users are friends, and two unconnected nodes indicate that they are not.
A node sequence is a sequence formed by a plurality of connected nodes in the topological graph: starting from one node and walking several steps along the connection relations between nodes produces a node sequence. As shown in fig. 3, starting from node 1 and walking 5 steps in the order 1-5-6-9-11-12 yields a node sequence whose data are the data of node 1, node 5, node 6, node 9, node 11 and node 12. These data serve as the first data, that is, the positive sample used to train the predetermined model and obtain the target model. The first data includes the data of a plurality of nodes; the data of a node in the topological graph is expressed as a vector, for example a 10-dimensional vector per node, so the first data includes the vectors representing these nodes.
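A minimal sketch of what the first data could look like in code; the array layout and the random initialisation are assumptions, and only the 10-dimensional vectors and the walk 1-5-6-9-11-12 come from the description above.

```python
import numpy as np

EMBED_DIM = 10                                 # example dimension mentioned above
rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.1, size=(13, EMBED_DIM))  # row i holds the vector of node i (nodes 1..12)

walk = [1, 5, 6, 9, 11, 12]                    # node sequence from fig. 3
first_data = embeddings[walk]                  # positive-sample data: one 10-dimensional vector per node
```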
Step S204, collecting a target node in the topological graph, wherein the target node has the largest number of connecting lines to other nodes in the topological graph.
The data of the target node is used as negative-sample data, and the data of the node sequence is used as positive-sample data. For example, when training a target model for recognizing the letter A in pictures, pictures containing the letter A can be used as positive samples and pictures without the letter A as negative samples. Likewise, when training a target model for identifying interactive behavior between users, data indicating that two users have interacted can be used as positive samples, and data indicating that two users have not interacted can be used as negative samples. In this embodiment, the data of the target node and the data of the node sequence are the negative-sample data and the positive-sample data, respectively.
Optionally, collecting the target node in the topological graph includes: acquiring the connection times between each node and the other nodes in the topological graph; and selecting the node with the most connection times from the topological graph as the target node.
The target node is selected from the topological graph multiple times, so multiple target nodes are obtained; each time, the node with the most connection times is taken as the target node. For example, node 5 in fig. 3 has connection relations with 6 other nodes, that is, 6 connecting lines, so its connection count is 6 and it is selected as the target node. The next time a target node is selected, node 11 is chosen from the topological graph, because node 11 has the most connections apart from node 5.
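A minimal sketch of this selection rule, assuming the topological graph is held as an adjacency list; the function name and the data structure are illustrative only.

```python
def select_target_nodes(graph, num_targets):
    # graph: dict mapping a node id to the list of nodes it is connected to.
    # Nodes are ranked by their number of connections; for the graph of fig. 3
    # this would return node 5 first (6 connections), then node 11, and so on.
    degree = {node: len(neighbours) for node, neighbours in graph.items()}
    return sorted(degree, key=degree.get, reverse=True)[:num_targets]
```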
Step S206, selecting node data continuously stored with target data as second data, wherein the target data is the data representing the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data.
Data in the memory is stored continuously, and the data of the nodes in the topological graph is stored in the memory in the form of vectors, for example multidimensional vectors. After the target node is selected, its data is pulled from the memory into the cache. Because the pulled data includes not only the data of the target node but also the data of other nodes, and the pulled data is stored continuously in the memory, it remains continuously stored after being pulled into the cache. Node data continuously stored with the target data is then selected from the cache as the second data. Fig. 4 shows node data stored in the cache, where C is the target data and the continuously stored node data includes A, B, C, D and E; that is, A, B, C, D and E are taken as the second data. In other words, when the second data is extracted, a continuous piece of data near the storage location of the target data is extracted from the memory. The selection may take data within a certain range on both sides of C, as shown in fig. 4, or within a certain range on only the left or the right of the target data C, as shown in fig. 5 and fig. 6.
A small number of nodes are selected from the topological graph as target nodes, and the selection may be performed several times. Each time a node is selected, its data is pulled from the memory into the cache. When the data of the node is pulled into the cache, the data stored adjacent to it is pulled as well, so the data adjacent to the target data can be selected directly from the cache to obtain the second data. After the cache is updated, the data of the newly selected target node is stored again and the node data continuously stored with it is selected. This process is repeated until the number of collected target nodes reaches a predetermined number, at which point sampling of the negative samples is complete. During this process, one cache update yields the data of several nodes as negative samples; compared with the prior art, in which each memory update yields the data of only one node as a negative sample, this embodiment reduces the cache update frequency, lightens the load on the memory bus, and increases the computation speed.
Optionally, selecting node data stored continuously with the target data as the second data includes: searching a storage area adjacent to the storage position of the target data in the cache; and taking the data continuously stored in the storage position and the storage area as second data.
As shown in fig. 4 to fig. 6, the storage location of the target data is the location of target data C, and the storage area adjacent to that location includes the areas occupied by data A, data B, data D and data E. All the data in these areas (data A, data B, target data C, data D and data E) are taken as the second data.
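A minimal sketch of this range selection, assuming the node vectors are stored contiguously as rows of one array; the window size, the function name and the row-index convention are assumptions introduced here.

```python
import numpy as np

def range_negative_sample(embeddings, target_row, window=2):
    # embeddings: node vectors stored contiguously, one row per node.
    # Return the target row plus the rows stored immediately around it, so one
    # contiguous read (a single cache fill) yields several negative samples.
    lo = max(0, target_row - window)
    hi = min(len(embeddings), target_row + window + 1)
    rows = list(range(lo, hi))            # e.g. A, B, C, D, E around target C
    return rows, embeddings[lo:hi]        # contiguous slice, target included
```

A one-sided variant, taking only the rows to the left or to the right of the target as in fig. 5 and fig. 6, is obtained by setting lo to target_row (right side only) or hi to target_row + 1 (left side only).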
Step S208, training a predetermined model according to the first data and the second data to obtain a trained target model.
A target model is obtained by training with the positive-sample and negative-sample data, and the target model can then be used to analyze whether the users represented by two nodes are friends. For example, for two users of an instant messaging application who do not yet have a friend relationship, the target model can analyze whether the two users are likely to be friends; if so, each user can be recommended to add the other as a friend in the instant messaging application. Optionally, after the predetermined model is trained according to the first data and the second data, the method further includes: inputting the data of a first object and the data of a second object into the target model to obtain third data output by the target model, wherein the third data represents the relationship between the first object and the second object.
For example, the first object is a first user and the second object is a second user, so the data of the first object is the data of the first user and the data of the second object is the data of the second user. Inputting the data of the first user and the data of the second user into the trained target model yields the third data output by the target model, and the third data can indicate whether the first user and the second user are friends, or whether a friend relationship should be recommended between them.
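A minimal sketch of how such third data could be computed from two learned node vectors; the sigmoid-over-dot-product score is a common choice for Word2Vec-style embeddings and is an assumption here, not something mandated by the description.

```python
import numpy as np

def predict_relation(first_user_vec, second_user_vec):
    # Score in (0, 1): a sigmoid over the dot product of the two node vectors.
    score = 1.0 / (1.0 + np.exp(-np.dot(first_user_vec, second_user_vec)))
    return score   # the "third data": a high score suggests recommending the users as friends
```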
In this embodiment, node data continuously stored with the target data is used as the negative-sample data, which reduces the number of memory accesses and cache refreshes when negative samples are selected. This solves the technical problem in the prior art that frequent memory updates during negative-sample selection make model training inefficient, and achieves the technical effect of improving model-training efficiency.
Optionally, collecting the target node in the topological graph includes: acquiring, by a computing server, a storage identifier of the target node in the topological graph. Selecting node data continuously stored with the target data as the second data then includes: selecting, in a storage server and according to the storage identifier, node data continuously stored with the target data in a storage area as the second data, wherein the storage identifier indicates the position of the storage area.
In this embodiment, a Parameter Server (PS) architecture is used for the computation. A parameter server is a programming framework that simplifies writing distributed parallel programs and is mainly used to support the distributed storage and coordination of large-scale parameters. The parameter server comprises a computing server and a storage server, each of which may be a server cluster. The computing server selects a target node in the topological graph, determines the storage identifier of the target node, and broadcasts the storage identifier to the storage server. According to the broadcast, the storage server selects the target data corresponding to the storage identifier from the stored data and pulls it from the memory into the cache of the storage server. Because the data stored in the cache is continuous, the data continuously stored with the target data is selected from the cache to obtain the second data.
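A highly simplified sketch of this division of labour; the class and method names are assumptions, the node id is assumed to double as the storage identifier (row index), and a real parameter server would shard the vectors across many storage servers.

```python
class ComputeServer:
    def pick_storage_id(self, graph):
        # Choose a target node and return its storage identifier, assumed here
        # to be the row index of the node's vector on the storage side.
        return max(graph, key=lambda n: len(graph[n]))

class StorageServer:
    def __init__(self, embeddings):
        self.embeddings = embeddings          # contiguously stored node vectors

    def on_broadcast(self, storage_id, window=2):
        # Pull the target vector and the rows stored next to it in one read;
        # these contiguous rows become the second data (negative samples).
        lo = max(0, storage_id - window)
        hi = min(len(self.embeddings), storage_id + window + 1)
        return self.embeddings[lo:hi]
```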
In order to avoid the transmission time and transmission bandwidth consumed by transferring data between the storage server and the computing server, the predetermined model may be trained in the storage server; that is, training the predetermined model according to the first data and the second data to obtain the trained target model includes: training the predetermined model in the storage server according to the first data and the second data to obtain the target model.
This embodiment may be implemented in the parameter server shown in fig. 7. The parameter server includes a computing server 702 and a storage server 704, and the memory 706 of the storage server 704 stores the pulled target data and the data continuously stored with the target data.
After the computing server 702 randomly selects a target node from the topological graph, the target node is broadcast to all the storage servers 704. Each storage server 704 selects the target data corresponding to the target node and the node data continuously stored with the target data to obtain the second data, and the predetermined model is trained according to the first data and the second data.
The above steps are repeated for multiple rounds of training to obtain the target model. In each round, one positive sample and a plurality of negative samples are selected, and local sampling rather than global sampling is adopted when the negative samples are selected. Global sampling means that the computing server randomly selects a plurality of nodes to obtain their data, without selecting continuously stored node data in the cache as the negative-sample data.
Because one memory access yields the data of several samples when negative samples are selected, compared with obtaining the data of one sample per memory access, the time needed to collect negative samples is shortened and the cache is refreshed less often, which improves the efficiency of collecting negative samples and therefore the efficiency of training the target model.
Table 1 compares the model effect and the training time of global negative sampling and range negative sampling under the same parameters. The experimental cluster has 24 servers; each machine has an Intel E5-2670v3 CPU, 128 GB of memory, two 300 GB SAS hard disks in RAID 1, and a 10 Gbit/s network card.
Table 1. Comparison of range negative sampling effects
As can be seen from Table 1, range negative sampling keeps the model effect (NDCG value) comparable to that of global random negative sampling, while the computation time is reduced by more than 50% when the number of negative samples exceeds 20, so model-training efficiency is significantly improved.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
According to another aspect of the embodiment of the invention, a data processing device for implementing the data processing method is also provided. Fig. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 8, the apparatus includes:
the obtaining unit 80 is configured to obtain a node sequence in a topology map, and obtain first data, where the node sequence includes a plurality of nodes having a connection relationship, the first data is data of the plurality of nodes, and the topology map includes the nodes and the connection relationship between the nodes.
Node2Vec is a graph-based feature representation framework. For any given graph, it learns a continuous feature vector for every node, and these vectors can then be used by downstream machine learning algorithms. The Node2Vec algorithm has two steps: first, node sequences are generated by a weighted random walk, and then the Word2Vec learning algorithm is used to learn a continuous feature vector for each node.
In the step of learning a continuous feature vector for each node with the Word2Vec learning algorithm, each positive sample needs to correspond to a plurality of negative samples in order to train a predetermined model and obtain a target model. The topological graph of this embodiment may be the graph used by the Node2Vec algorithm.
The topological graph comprises a plurality of nodes and the connection relations among them, and a connection relation can be represented by a connecting line. As shown in fig. 3, the circles numbered 1 to 12 each represent a node, and the dotted and solid lines between the circles represent connection relations. Each node represents the data of one object, and the connection relations between nodes represent the relations among the corresponding objects. For example, in a topological graph representing users of an instant messaging application, each node represents one user, and a connection between two nodes can indicate whether there is interactive behavior between the two users: two connected nodes indicate that the two users have interacted, and two unconnected nodes indicate that they have not. Since there is no connection between node 1 and node 2 in fig. 3, there is no interaction between the user represented by node 1 and the user represented by node 2. Interactive behavior includes sending a message and browsing the other party's publicly visible information (messages, states and pictures published or forwarded in a public space, personal data, and so on). Alternatively, the connection relation between nodes may indicate whether two users are friends: two connected nodes indicate that the two users are friends, and two unconnected nodes indicate that they are not.
A node sequence is a sequence formed by a plurality of connected nodes in the topological graph; for example, starting from one node and walking several steps along the connection relations between nodes produces a node sequence. As shown in fig. 3, starting from node 1 and walking 5 steps in the order 1-5-6-9-11-12 yields a node sequence whose data are the data of node 1, node 5, node 6, node 9, node 11 and node 12. These data serve as the first data, that is, the positive sample used to train the predetermined model and obtain the target model. The first data includes the data of a plurality of nodes; the data of a node in the topological graph is expressed as a vector, for example a 10-dimensional vector per node, so the first data includes the vectors representing these nodes.
The acquisition unit 82 is configured to collect a target node in the topological graph, wherein the target node has the largest number of connecting lines to other nodes in the topological graph.
The data of the target node is used as the data of the negative sample, and the data of the node sequence is used as the data of the positive sample. For example, when training a target model for recognizing the letter a in a picture, all pictures with the letter a can be used as data of a positive sample, and pictures without the letter a can be used as data of a negative sample. For example, in training a target model for identifying an interactive behavior between users, data indicating that two users have an interactive behavior therebetween may be data of a positive sample, and data indicating that two users do not have an interactive behavior therebetween may be data of a negative sample. The data of the target node and the data of the node sequence in this embodiment are data of a negative sample and data of a positive sample, respectively.
Optionally, the acquisition unit comprises: an acquisition module configured to acquire the connection times between each node and the other nodes in the topological graph; and a selection module configured to select the node with the most connection times from the topological graph as the target node.
The target node is selected from the topological graph multiple times, so multiple target nodes are obtained; each time, the node with the most connection times is taken as the target node. For example, node 5 in fig. 3 has connection relations with 6 other nodes, that is, 6 connecting lines, so its connection count is 6 and it is selected as the target node. The next time a target node is selected, node 11 is chosen from the topological graph, because node 11 has the most connections apart from node 5.
The selecting unit 84 is configured to select node data stored continuously with target data as second data, where the target data is data representing a target node, the first data is a positive sample, the second data is a negative sample, and the second data includes the target data.
Data in the memory is stored continuously, and the data of the nodes in the topological graph is stored in the memory in the form of vectors, for example multidimensional vectors. After the target node is selected, its data is pulled from the memory into the cache. Because the pulled data includes not only the data of the target node but also the data of other nodes, and the pulled data is stored continuously in the memory, it remains continuously stored after being pulled into the cache. Node data continuously stored with the target data is then selected from the cache as the second data. Fig. 4 shows node data stored in the cache, where C is the target data and the continuously stored node data includes A, B, C, D and E; that is, A, B, C, D and E are taken as the second data. In other words, when the second data is extracted, a continuous piece of data near the storage location of the target data is extracted from the memory. The selection may take data within a certain range on both sides of C, as shown in fig. 4, or within a certain range on only the left or the right of the target data C, as shown in fig. 5 and fig. 6.
A small number of nodes are selected from the topological graph as target nodes, and the selection may be performed several times. Each time a node is selected, its data is pulled from the memory into the cache. When the data of the node is pulled into the cache, the data stored adjacent to it is pulled as well, so the data adjacent to the target data can be selected directly from the cache to obtain the second data. After the cache is updated, the data of the newly selected target node is stored again and the node data continuously stored with it is selected. This process is repeated until the number of collected target nodes reaches a predetermined number, at which point sampling of the negative samples is complete. During this process, one cache update yields the data of several nodes as negative samples; compared with the prior art, in which each memory update yields the data of only one node as a negative sample, this embodiment reduces the cache update frequency, lightens the load on the memory bus, and increases the computation speed.
Optionally, the selecting unit includes: the searching module is used for searching a storage area adjacent to the storage position of the target data in the cache; and the determining module is used for taking the data continuously stored in the storage position and the storage area as second data.
As shown in fig. 4 to 6, the storage location of the target data is the location where the target data C is located, and the storage area adjacent to the storage location includes areas where the data a, the data B, the data D, and the data E are located. All the data (data a, data B, target data C, data D, and data E) in the area where the data a, data B, target data C, data D, and data E are located are taken as the second data.
The training unit 86 is configured to train the predetermined model according to the first data and the second data to obtain a trained target model.
A target model is obtained by training with the positive-sample and negative-sample data, and the target model can then be used to analyze whether the users represented by two nodes are friends. For example, for two users of an instant messaging application who do not yet have a friend relationship, the target model can analyze whether the two users are likely to be friends; if so, each user can be recommended to add the other as a friend in the instant messaging application. Optionally, the apparatus further comprises: an output unit configured to input the data of a first object and the data of a second object into the target model after the predetermined model is trained according to the first data and the second data to obtain the trained target model, and to obtain third data output by the target model, wherein the third data represents the relationship between the first object and the second object.
For example, the first object is a first user and the second object is a second user, so the data of the first object is the data of the first user and the data of the second object is the data of the second user. Inputting the data of the first user and the data of the second user into the trained target model yields the third data output by the target model, and the third data can indicate whether the first user and the second user are friends, or whether a friend relationship should be recommended between them.
In the embodiment, the node data continuously stored with the target data is used as the data of the negative sample, so that the times of memory access and cache refreshing when the negative sample is selected are reduced, the technical problem that the efficiency of the training model is low due to frequent memory updating when the negative sample is selected in the prior art is solved, and the technical effect of improving the efficiency of the training model is achieved.
Optionally, the acquisition unit is arranged in a computing server and includes a storage module configured to obtain a storage identifier of the target node in the topological graph; the selecting unit is arranged in a storage server and includes a selecting module configured to select, according to the storage identifier, node data continuously stored with the target data in a storage area as the second data, wherein the storage identifier indicates the position of the storage area.
In this embodiment, a Parameter Server (PS) architecture is used for the computation. A parameter server is a programming framework that simplifies writing distributed parallel programs and is mainly used to support the distributed storage and coordination of large-scale parameters. The parameter server comprises a computing server and a storage server, each of which may be a server cluster. The computing server randomly selects a target node in the topological graph and determines the storage identifier of the target node, then broadcasts the storage identifier to the storage server. According to the broadcast, the storage server selects the target data corresponding to the storage identifier from the stored data and pulls it from the memory into the cache of the storage server. Because the data stored in the cache is continuous, the data continuously stored with the target data is selected from the cache to obtain the second data.
In order to avoid the transmission time and transmission bandwidth consumed by transferring data between the storage server and the computing server, the predetermined model may be trained in the storage server, that is, the training unit includes: a training module configured to train the predetermined model in the storage server according to the first data and the second data to obtain the target model.
This embodiment may be implemented in the parameter server shown in fig. 7. The parameter server includes a computation server 702 and a storage server 704, and the memory 706 of the storage server 704 stores the pulled target data and data stored continuously with the target data.
After the computation server 702 randomly selects a target node from the topological graph, the target node is broadcasted to all the storage servers 704, the storage servers 704 select target data corresponding to the target node and data continuously stored with the target data to obtain second data, and the predetermined model is trained according to the first data and the second data.
And repeating the steps for training for multiple times to obtain the target model. One positive sample and a plurality of negative samples are selected during each training, and local sampling is adopted instead of global sampling when the negative samples are selected. The global sampling is that the computing server randomly selects a plurality of nodes to obtain data of the plurality of nodes, and does not select continuously stored node data in the cache as data of a negative sample.
Because the data of a plurality of samples can be obtained by accessing the memory once when the negative sample is selected, compared with the data of one sample obtained by accessing the memory each time, the time for acquiring the negative sample is shortened, the refreshing time of the memory is reduced, the efficiency for acquiring the negative sample is improved, and the efficiency for training the target model is improved.
According to a further aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the data processing method, as shown in fig. 9, the electronic device includes a memory and a processor, the memory stores a computer program, and the processor is configured to execute the steps in any one of the method embodiments by the computer program.
Alternatively, fig. 9 is a block diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 9, the electronic device may include: one or more (only one shown) processors 901, at least one communication bus 902, a user interface 903, at least one transmitting device 904, and memory 905. Wherein a communication bus 902 is used to enable connective communication between these components. The user interface 903 may include, among other things, a display 906 and a keyboard 907. The transmission means 904 may optionally comprise a standard wired interface and a wireless interface.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a node sequence in a topological graph to obtain first data, wherein the node sequence comprises a plurality of nodes with connection relations, the first data are data of the nodes, and the topological graph comprises the nodes and the connection relations among the nodes;
s2, collecting a target node in the topological graph, wherein the target node has the largest number of connecting lines to other nodes in the topological graph;
s3, selecting node data continuously stored with target data as second data, wherein the target data is data representing the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data;
and S4, training a preset model according to the first data and the second data to obtain a trained target model.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 9 is only illustrative, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 9 does not limit the structure of the electronic device; for example, the electronic device may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 9, or have a different configuration from that shown in fig. 9.
The memory 905 can be used for storing software programs and modules, such as program instructions/modules corresponding to the data processing method and apparatus in the embodiment of the present invention, and the processor 901 executes various functional applications and data processing by running the software programs and modules stored in the memory 905, that is, implements the data processing method described above. The memory 905 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 905 may further include memory located remotely from the processor 901, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 904 is used for receiving or transmitting data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 904 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices to communicate with the internet or a local area Network. In one example, the transmitting device 904 is a Radio Frequency (RF) module that is used to communicate with the internet via wireless means.
The memory 905 is used for storing the topology map and data of nodes in the topology map.
The embodiment of the invention provides a scheme of a data processing method. The node data continuously stored with the target data is used as the data of the negative sample, the times of refreshing the memory when the negative sample is selected are reduced, the technical problem that the efficiency of the training model is low due to the fact that the memory is frequently updated when the negative sample is selected in the prior art is solved, and the technical effect of improving the efficiency of the training model is achieved.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a node sequence in a topological graph to obtain first data, wherein the node sequence comprises a plurality of nodes with connection relations, the first data are data of the nodes, and the topological graph comprises the nodes and the connection relations among the nodes;
s2, collecting a target node in the topological graph, wherein the target node has the largest number of connecting lines to other nodes in the topological graph;
s3, selecting node data continuously stored with target data as second data, wherein the target data is data representing the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data;
and S4, training a preset model according to the first data and the second data to obtain a trained target model.
Optionally, the storage medium is further arranged to store a computer program for performing the steps of:
s1, acquiring the connection times of each node and other nodes in the topological graph;
and S2, selecting the node with the most connection times from the topological graph as the target node.
Optionally, the storage medium is further arranged to store a computer program for performing the steps of:
searching a storage area adjacent to the storage position of the target data in a cache; and taking the storage position and the data continuously stored in the storage area as the second data.
Optionally, the storage medium is further arranged to store a computer program for performing the steps of: acquiring a storage identifier of a target node in the topological graph by using a computing server; and selecting node data continuously stored with the target data on a storage area as the second data in a storage server according to the storage identifier, wherein the storage identifier is used for representing the position of the storage area.
Optionally, the storage medium is further arranged to store a computer program for performing the steps of: and training the preset model in the storage server according to the first data and the second data to obtain the target model.
Optionally, the storage medium is further arranged to store a computer program for performing the steps of: and inputting the data of the first object and the data of the second object into the target model to obtain third data output by the target model, wherein the third data is used for representing the relation between the first object and the second object.
Optionally, the storage medium is further configured to store a computer program for executing the steps included in the method in the foregoing embodiment, which is not described in detail in this embodiment.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the various methods of the foregoing embodiments may be implemented by a program instructing the hardware related to the terminal device, and the program may be stored in a computer-readable storage medium. The storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described in detail in a certain embodiment.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (14)

1. A method for processing data, comprising:
acquiring a node sequence in a topological graph to obtain first data, wherein the node sequence comprises a plurality of nodes with connection relations, the first data are data of the nodes, and the topological graph comprises the nodes and the connection relations among the nodes;
collecting a target node in the topological graph, wherein the target node has the largest number of connecting lines to other nodes in the topological graph;
selecting node data continuously stored with target data as second data, wherein the target data is data representing the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data;
and training a preset model according to the first data and the second data to obtain a trained target model.
2. The method of claim 1, wherein collecting a target node in the topology graph comprises:
acquiring the connection times of each node and other nodes in the topological graph;
and selecting the node with the most connection times from the topological graph as the target node.
3. The method according to claim 1 or 2, wherein selecting node data continuously stored with the target data as the second data comprises:
searching a storage area adjacent to the storage position of the target data in a cache;
and taking the storage position and the data continuously stored in the storage area as the second data.
4. The method of claim 1,
collecting the target node in the topological graph comprises: acquiring a storage identifier of a target node in the topological graph by using a computing server;
selecting node data stored continuously with the target data as second data includes: and selecting node data continuously stored with the target data on a storage area as the second data in a storage server according to the storage identifier, wherein the storage identifier is used for representing the position of the storage area.
5. The method of claim 4, wherein training a predetermined model based on the first data and the second data, resulting in a trained target model comprises:
and training the preset model in the storage server according to the first data and the second data to obtain the target model.
6. The method of claim 1, wherein after training a predetermined model based on the first data and the second data to obtain a trained target model, the method further comprises:
and inputting the data of the first object and the data of the second object into the target model to obtain third data output by the target model, wherein the third data is used for representing the relationship between the first object and the second object.
7. An apparatus for processing data, comprising:
an obtaining unit, configured to acquire a node sequence in a topological graph to obtain first data, wherein the node sequence comprises a plurality of nodes having connection relations, the first data is the data of the nodes in the node sequence, and the topological graph comprises the nodes and the connection relations among the nodes;
an acquisition unit, configured to collect a target node in the topological graph, wherein the number of connections between the target node and other nodes in the topological graph is the largest;
a selecting unit, configured to select node data stored contiguously with target data as second data, wherein the target data is data representing the target node, the first data is a positive sample, the second data is a negative sample, and the second data comprises the target data;
and a training unit, configured to train a preset model according to the first data and the second data to obtain a trained target model.
8. The apparatus of claim 7, wherein the acquisition unit comprises:
an acquisition module, configured to acquire the number of connections between each node and other nodes in the topological graph;
and a selection module, configured to select the node with the largest number of connections from the topological graph as the target node.
9. The apparatus according to claim 7 or 8, wherein the selecting unit comprises:
a searching module, configured to search, in the cache, for a storage area adjacent to the storage position of the target data;
and a determining module, configured to take the data stored contiguously in the storage position and the storage area as the second data.
10. The apparatus of claim 7,
the acquisition unit is arranged in a computing server and comprises a storage module, configured to acquire a storage identifier of the target node in the topological graph;
the selecting unit is arranged in a storage server and comprises a selecting module, configured to select, according to the storage identifier, node data stored contiguously with the target data in a storage area as the second data, wherein the storage identifier is used for representing the position of the storage area.
11. The apparatus of claim 10, wherein the training unit comprises:
and a training module, configured to train the preset model in the storage server according to the first data and the second data to obtain the target model.
12. The apparatus of claim 7, further comprising:
and an output unit, configured to, after the preset model is trained according to the first data and the second data to obtain the trained target model, input data of a first object and data of a second object into the target model to obtain third data output by the target model, wherein the third data is used for representing a relationship between the first object and the second object.
13. A storage medium, in which a computer program is stored, wherein the computer program is arranged to execute the method of any one of claims 1 to 6 when executed.
14. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 6 by means of the computer program.
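The following Python sketch is not part of the patent disclosure; it is a minimal illustration, under assumed data structures, of the sampling steps recited in claims 1 to 3: node sequences drawn from a topological graph serve as positive samples, the node with the most connections is taken as the target node, and the negative sample is a block of node data stored contiguously around (and including) the target node. The example graph, the walk length, the window size, and the sorted-id storage layout are hypothetical choices made purely for demonstration.

# Illustrative sketch only -- not part of the patent text. The graph, walk
# length, window size and storage layout below are assumptions made for
# demonstrating claims 1-3.
import random
from collections import defaultdict


def build_graph(edges):
    """Adjacency lists for an undirected topological graph."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)
    return graph


def random_walk(graph, start, length):
    """One node sequence (used as a positive sample, the 'first data')."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk


def pick_target_node(graph):
    """Node with the largest number of connections to other nodes."""
    return max(graph, key=lambda node: len(graph[node]))


def contiguous_negative_block(storage_order, target, window):
    """Node ids stored contiguously around the target (the 'second data').

    storage_order stands in for the physical order in which node data is
    stored; the returned slice includes the target itself and forms one
    contiguous region, so it can be read without scattered accesses.
    """
    idx = storage_order.index(target)
    lo = max(0, idx - window)
    hi = min(len(storage_order), idx + window + 1)
    return storage_order[lo:hi]


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 4), (3, 4), (4, 5)]
    graph = build_graph(edges)

    first_data = [random_walk(graph, node, length=5) for node in sorted(graph)]
    target = pick_target_node(graph)
    storage_order = sorted(graph)          # assumed storage layout: sorted ids
    second_data = contiguous_negative_block(storage_order, target, window=2)

    print("positive node sequences:", first_data)
    print("target node:", target)
    print("contiguous negative block:", second_data)

In this sketch the contiguity of the returned slice is what keeps the negative sample, which under claim 1 comprises the target data itself, within a single storage region.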
CN201810179509.5A 2018-03-05 2018-03-05 Data processing method and device, storage medium and electronic device Active CN110232393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810179509.5A CN110232393B (en) 2018-03-05 2018-03-05 Data processing method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810179509.5A CN110232393B (en) 2018-03-05 2018-03-05 Data processing method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN110232393A CN110232393A (en) 2019-09-13
CN110232393B true CN110232393B (en) 2022-11-04

Family

ID=67861655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810179509.5A Active CN110232393B (en) 2018-03-05 2018-03-05 Data processing method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN110232393B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749302A (en) * 2019-10-29 2021-05-04 第四范式(北京)技术有限公司 Data sampling method and device based on knowledge graph, computing equipment and readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2642536A1 (en) * 2006-02-17 2007-11-22 Uti Limited Partnership Method and system for sampling dissolved gas
CN106407581A (en) * 2016-09-28 2017-02-15 华中科技大学 Intelligent prediction method for ground surface settlement induced by subway tunnel construction
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN107729290A (en) * 2017-09-21 2018-02-23 北京大学深圳研究生院 A kind of expression learning method of ultra-large figure using the optimization of local sensitivity Hash

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135399B2 (en) * 2012-05-25 2015-09-15 Echometrics Cardiologists, Pc Determining disease state of a patient by mapping a topological module representing the disease, and using a weighted average of node data
US20150095017A1 (en) * 2013-09-27 2015-04-02 Google Inc. System and method for learning word embeddings using neural language models
CN106708871B (en) * 2015-11-16 2020-08-11 阿里巴巴集团控股有限公司 Method and device for identifying social service characteristic users
US10402750B2 (en) * 2015-12-30 2019-09-03 Facebook, Inc. Identifying entities using a deep-learning model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2642536A1 (en) * 2006-02-17 2007-11-22 Uti Limited Partnership Method and system for sampling dissolved gas
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN106407581A (en) * 2016-09-28 2017-02-15 华中科技大学 Intelligent prediction method for ground surface settlement induced by subway tunnel construction
CN107729290A (en) * 2017-09-21 2018-02-23 北京大学深圳研究生院 A kind of expression learning method of ultra-large figure using the optimization of local sensitivity Hash

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DNPS: Representation learning of structural features for large-scale dynamic social networks based on damping sampling; Li Zhiyu et al.; Chinese Journal of Computers; 2017-04-30; Vol. 40, No. 4; 805-823 *
Optimizing Word2Vec Performance on Multicore Systems; Vasudevan Rengasamy et al.; IA^3 '17: Proceedings of the Seventh Workshop on Irregular Applications; 2017-11-17; 1-9 *
Design and implementation of Word2Vec based on CUDA; Tu Chucheng; China Masters' Theses Full-text Database, Information Science and Technology; 2017-03-15 (No. 3); I138-6100 *
Network representation learning model based on edge sampling; Chen Li et al.; Journal of Software; 2017-12-06; 756-771 *

Also Published As

Publication number Publication date
CN110232393A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN108108821B (en) Model training method and device
CN109447895B (en) Picture generation method and device, storage medium and electronic device
CN105740268B (en) A kind of information-pushing method and device
CN110263916B (en) Data processing method and device, storage medium and electronic device
CN107404656A (en) Live video recommends method, apparatus and server
CN110413867B (en) Method and system for content recommendation
CN108021708B (en) Content recommendation method and device and computer readable storage medium
CN111950056B (en) BIM display method and related equipment for building informatization model
WO2019062081A1 (en) Salesman profile formation method, electronic device and computer readable storage medium
CN112232889A (en) User interest portrait extension method, device, equipment and storage medium
CN112328823A (en) Training method and device for multi-label classification model, electronic equipment and storage medium
CN111124902A (en) Object operating method and device, computer-readable storage medium and electronic device
CN110399564B (en) Account classification method and device, storage medium and electronic device
CN111182332A (en) Video processing method, device, server and storage medium
CN110795558A (en) Label acquisition method and device, storage medium and electronic device
CN110232393B (en) Data processing method and device, storage medium and electronic device
CN107547626B (en) User portrait sharing method and device
WO2015109902A1 (en) Personalized information processing method, device and apparatus, and nonvolatile computer storage medium
CN112785069A (en) Prediction method and device for terminal equipment changing machine, storage medium and electronic equipment
CN106682014B (en) Game display data generation method and device
CN112269937A (en) Method, system and device for calculating user similarity
CN111651989A (en) Named entity recognition method and device, storage medium and electronic device
CN110895555B (en) Data retrieval method and device, storage medium and electronic device
JP2020502710A (en) Web page main image recognition method and apparatus
CN112182460A (en) Resource pushing method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant