CN112507388B - Word2vec model training method, device and system based on privacy protection - Google Patents

Word2vec model training method, device and system based on privacy protection

Info

Publication number
CN112507388B
Authority
CN
China
Prior art keywords
word segmentation
local
corpus
public
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110158847.2A
Other languages
Chinese (zh)
Other versions
CN112507388A
Inventor
陈超超
王力
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202110158847.2A priority Critical patent/CN112507388B/en
Publication of CN112507388A publication Critical patent/CN112507388A/en
Application granted granted Critical
Publication of CN112507388B publication Critical patent/CN112507388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present description provide a method, apparatus, and system for training a word2vec model via at least two first member devices. Each first member device generates a local participle lexicon based on the corpus segmentation result of its local corpus, performs private set intersection with the other first member devices using the respective local participle lexicons to determine the public participles, and shares the number of its non-public participles with the other first member devices. Each first member device then performs unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices, generating a unified dictionary. Finally, each first member device generates its own training samples based on the unified dictionary and the corpus segmentation result of its local corpus, and performs privacy-protection-based model training using the respective training samples to train the word2vec model.

Description

Word2vec model training method, device and system based on privacy protection
Technical Field
The embodiment of the specification relates to the field of artificial intelligence, in particular to a word2vec model training method, device and system based on privacy protection.
Background
A word2vec (word to vector) model is a shallow neural network model used to generate word vectors (word embeddings), and is widely applied in the fields of natural language processing and machine learning. A word2vec model can be trained efficiently using a dictionary and a large amount of training data, and is used to convert text participles into word vector representations, so that the processing of text content is converted into vector operations in a vector space, and the semantic similarity of text content is reflected by similarity in that vector space.
However, in practical applications, the training text of a word2vec model may be owned by multiple data owners. For example, in a medical scenario, the medical history texts of patients may be owned by multiple hospitals, and these medical history texts may not be shared with each other for privacy protection reasons. Therefore, how multiple data owners can jointly train a word2vec model while the data privacy of each data owner is protected has become an urgent problem to be solved.
Disclosure of Invention
In view of the above, embodiments of the present specification provide a word2vec model training method, apparatus, and system based on privacy protection, which can implement joint training of a word2vec model by multiple data owners under the condition of protecting data privacy of the data owners.
According to an aspect of embodiments herein, there is provided a method for training a word2vec model via at least two first member devices, the method being applied to one of the at least two first member devices, the method comprising: generating a local participle lexicon based on the corpus segmentation result of a local corpus; performing private set intersection with the remaining first member devices using the respective local participle lexicons to determine public participles, and sharing the number of non-public participles with the remaining first member devices, the local participle lexicon of each remaining first member device being generated from the corpus segmentation result of its own local corpus; performing unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices to generate a unified dictionary; generating training samples at the first member device based on the unified dictionary and the corpus segmentation result of the local corpus; and performing privacy-protection-based model training with the remaining first member devices using the respective training samples to train the word2vec model, the training samples at the remaining first member devices being generated based on the unified dictionary and the corpus segmentation results of their local corpora.
Optionally, in an example of the above aspect, the method may further include: preprocessing the corpus segmentation result of the local corpus; and generating the local participle lexicon based on the corpus segmentation result of the local corpus comprises: generating the local participle lexicon based on the preprocessed corpus segmentation result of the local corpus.
Optionally, in one example of the above aspect, the pre-processing comprises at least one of: word segmentation filtering processing and word segmentation de-duplication processing.
Optionally, in an example of the above aspect, generating training samples at the first member device based on the unified dictionary and the local corpus participle results comprises: using a given word segmentation sampling window to perform word segmentation pair sampling on the local corpus word segmentation result to obtain a local word segmentation pair set; and generating a training sample at the first member device according to the participle pairs in the local participle pair set.
Optionally, in an example of the above aspect, the privacy preserving based word2vec model training comprises federated learning based word2vec model training.
According to another aspect of embodiments herein, there is provided an apparatus for training a word2vec model via at least two first member devices, the apparatus being applied to one of the at least two first member devices, the apparatus comprising: at least one processor, a memory coupled with the at least one processor, and a computer program stored in the memory, the at least one processor executing the computer program to implement: generating a local participle lexicon based on the corpus segmentation result of a local corpus; performing private set intersection with the remaining first member devices using the respective local participle lexicons to determine public participles, and sharing the number of non-public participles with the remaining first member devices, the local participle lexicon of each remaining first member device being generated from the corpus segmentation result of its own local corpus; performing unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices to generate a unified dictionary; generating training samples at the first member device based on the unified dictionary and the corpus segmentation result of the local corpus; and performing privacy-protection-based model training with the remaining first member devices using the respective training samples to train the word2vec model, the training samples at the remaining first member devices being generated based on the unified dictionary and the corpus segmentation results of their local corpora.
Optionally, in one example of the above aspect, the at least one processor executes the computer program to further implement: preprocessing the corpus segmentation result of the local corpus. Accordingly, the at least one processor executes the computer program to implement: generating the local participle lexicon based on the preprocessed corpus segmentation result of the local corpus.
Optionally, in one example of the above aspect, the pre-processing comprises at least one of: word segmentation filtering processing and word segmentation de-duplication processing.
Optionally, in one example of the above aspect, the at least one processor executes the computer program to implement: using a given word segmentation sampling window to perform word segmentation pair sampling on the local corpus word segmentation result to obtain a local word segmentation pair set; and generating a training sample at the first member device according to the participle pairs in the local participle pair set.
According to another aspect of embodiments herein, there is provided a system for training a word2vec model via at least two first member devices, comprising: at least two first member devices, each first member device comprising an apparatus for training a word2vec model via at least two first member devices as described above.
Optionally, in an example of the above aspect, the system may further include: a second member device deployed with a word2vec model and performing federated learning with the at least two first member devices to train the word2vec model.
According to another aspect of embodiments of the present description, there is provided a computer readable storage medium storing a computer program for execution by a processor to implement the method for training a word2vec model via at least two first member devices as described above.
According to another aspect of embodiments of the present description, there is provided a computer program product comprising a computer program for execution by a processor to implement the method for training a word2vec model via at least two first member devices as described above.
Drawings
A further understanding of the nature and advantages of the present disclosure may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.
FIG. 1A shows an application example diagram of the CBOW model.
FIG. 1B shows an example schematic of the neural network structure of the CBOW model.
FIG. 2A shows a schematic diagram of an example of an application of the Skip-Gram model.
FIG. 2B shows an example schematic of the neural network structure of the Skip-Gram model.
FIG. 3 illustrates an example schematic diagram of a model training system for training a word2vec model in accordance with embodiments of the present description.
FIG. 4 shows a flow diagram of a method for training a word2vec model in accordance with an embodiment of the present description.
FIG. 5 illustrates an example flow diagram of a training sample generation process in accordance with an embodiment of the present description.
FIG. 6 shows an example schematic diagram of a participle extraction process according to an embodiment of the present specification.
FIG. 7 illustrates an example schematic of a federated learning process in accordance with an embodiment of the present specification.
FIG. 8 shows a block diagram of a model training apparatus for training a word2vec model in accordance with embodiments of the present description.
FIG. 9 illustrates an example block diagram of a training sample generation unit in accordance with an embodiment of this specification.
FIG. 10 shows a schematic diagram of a computer-based implementation of a model training apparatus for training a word2vec model according to an embodiment of the present description.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as needed. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.
As used herein, the term "include" and its variants mean open-ended terms in the sense of "including, but not limited to. The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. The definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.
The word2vec model is a fully connected neural network with only one hidden layer, used to predict words that are strongly associated with a given word. Examples of word2vec models may include, but are not limited to, the continuous bag-of-words model (CBOW) and the Skip-Gram model. The CBOW model is used to predict the center word based on the context words before and after it in a text sequence. The Skip-Gram model is used to predict the context words before and after a center word in a text sequence based on the center word.
Fig. 1A shows an application example of the CBOW model, and fig. 1B shows an example schematic diagram of a neural network structure of the CBOW model.
In the application example shown in FIG. 1A, for the text sequence "the man loves his son", the center word "loves" is predicted using its context words "the", "man", "his", and "son".
FIG. 1B illustrates a CBOW model for a plurality of context words. For a given context word w, the input to the model is a one-hot vector x of length V, where V is the number of words in the bag of words (the vocabulary size). The term "bag of words" may also be referred to as a "vocabulary" or a "dictionary". In this vector, only one value is 1 and all other values are 0. The weight matrix between the input layer and the hidden layer is a V×N matrix W, where N is the dimension of the word vector representation of a word; the value of N is predefined and is much smaller than V. Each row of W is an N-dimensional vector v_w corresponding to the model input of a word w. In addition, there is an N×V weight matrix W' between the hidden layer and the output layer. Each column of W' is an N-dimensional vector v'_w corresponding to the model input of a single word w. In the art, the N-dimensional vector v_w may also be referred to as the input vector of the word w, and the N-dimensional vector v'_w may also be referred to as the output vector of the word w. The input vector v_w and the output vector v'_w are two different representations of the same word w, and both can be used to represent the word vector of the word. Typically, the word vector of a word is characterized using the input vector of the word.
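By way of illustration only, the forward pass just described can be written in a few lines. The following minimal sketch uses toy values of V and N and randomly initialized weights purely for demonstration; it is not the claimed training procedure.

```python
import numpy as np

V, N = 10, 4                                   # toy vocabulary size and word-vector dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))         # input-to-hidden weights; row w is the input vector v_w
W_prime = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights; column w is the output vector v'_w

def cbow_forward(context_ids):
    """Predict the center word from the indices of its context words."""
    h = W[context_ids].mean(axis=0)            # hidden layer: average of the context input vectors
    scores = h @ W_prime                       # one score per word in the vocabulary
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                     # softmax probability of each candidate center word

print(cbow_forward([1, 3, 5, 7]))
```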
Fig. 2A shows an application example of the Skip-Gram model, and fig. 2B shows an example schematic diagram of a neural network structure of the Skip-Gram model.
In the application example shown in FIG. 2A, for the text sequence "the man loves his son", the context words "the", "man", "his", and "son" are predicted using the center word "loves". The Skip-Gram model shown in FIG. 2B is the inverse of the CBOW model: the target word (center word), which is the model output of the CBOW model, serves as the model input of the Skip-Gram model, and the context words, which are the model inputs of the CBOW model, serve as the model outputs of the Skip-Gram model.
After the above CBOW model or Skip-Gram model is trained using the training samples, each word can be represented as a fixed-length vector (e.g., the corresponding row of the above weight matrix between the input layer and the hidden layer), which can well represent the similarity and analogy relationships between different words.
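As a small illustration of this point, the word vector of a word can be read out of the trained input-to-hidden weight matrix and compared by cosine similarity; the matrix below is a random stand-in for a trained one, so the similarity value itself is meaningless.

```python
import numpy as np

V, N = 10, 4
W = np.random.default_rng(1).normal(size=(V, N))   # stand-in for the trained V x N weight matrix

def word_vector(word_id):
    return W[word_id]                               # the input vector v_w serves as the word vector

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(word_vector(2), word_vector(7)))
```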
In the existing word2vec model training scheme, model training is usually realized by using a training sample of a single data owner, and the method cannot be applied to jointly training a word2vec model by a plurality of data owners needing privacy protection.
The following describes a method, an apparatus, and a system for training a word2vec model according to an embodiment of the present specification in detail with reference to the accompanying drawings.
FIG. 3 shows an architectural diagram of a model training system 300 for training a word2vec model in accordance with an embodiment of the present description.
As shown in FIG. 3, model training system 300 includes at least two first member devices 310 and one second member device 320. In FIG. 3, 2 first member devices 310-1 and 310-2 are shown. In other embodiments of the present description, more first member devices 310 may be included. At least two first member devices 310 and second member devices 320 may communicate with each other over a network 330 such as, but not limited to, the internet or a local area network, etc.
In embodiments of the present description, first member devices 310-1 and 310-2 may be devices or device sides, such as smart terminal devices, server devices, etc., for locally collecting text data samples. In this specification, the term "first member device" and the term "data owner" may be used interchangeably. The second member device 320 may be a device or device side that deploys or maintains the word2vec model. A neural network structure of the word2vec model is deployed on the second member device 320.
In addition, a neural network structure of the word2vec model is also deployed on the first member devices 310-1 and 310-2. During model training, the second member device 320 initializes the word2vec model to obtain an initial weight matrix W between the input layer and the hidden layer of the word2vec model and an initial weight matrix W' between the hidden layer and the output layer of the word2vec model. The second member device 320 then sends the initial weight matrices W and W' to each of the first member devices 310-1 and 310-2.
In this description, first member device 310 and second member device 320 may be any suitable electronic device with computing capabilities. The electronic devices include, but are not limited to: personal computers, server computers, workstations, desktop computers, laptop computers, notebook computers, mobile electronic devices, smart phones, tablet computers, cellular phones, Personal Digital Assistants (PDAs), handheld devices, messaging devices, wearable electronic devices, consumer electronic devices, and the like.
Further, first member devices 310-1 and 310-2 each have a model training apparatus. The model training apparatuses provided at the first member devices 310-1 and 310-2 and the second member device 320 may perform network communication via the network 330 for data interaction, thereby cooperatively performing the model training process for the word2vec model. The operation and structure of the model training apparatus will be described in detail below with reference to the accompanying drawings.
In some embodiments, network 330 may be any one or more of a wired network or a wireless network. Examples of network 330 may include, but are not limited to, a cable network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a bluetooth network, a zigbee network (zigbee), Near Field Communication (NFC), an intra-device bus, an intra-device line, and the like, or any combination thereof.
FIG. 4 shows a flow diagram of a method 400 for training a word2vec model in accordance with an embodiment of the present description.
As shown in FIG. 4, at 410, a local participle lexicon is generated at each first member device based on the corpus segmentation result of its local corpus.
In one example, at each first member device, the local corpus (text data) collected by each first member device may be participled to obtain a local corpus participle result. For example, after performing word segmentation processing on the local corpus "cough, little phlegm, and obvious fever", the corpus word segmentation results "cough", "little phlegm", "obvious" and "fever" can be obtained. Here, the word segmentation process for the corpus can be implemented by various applicable word segmentation processes in the art. Further, it is to be noted that the respective collected local corpora are formed into a plurality of data samples, each having a data sample identification (data sample ID). For example, in an application scenario of user data, the local corpus may be formed into a plurality of data samples based on the user ID. In other application scenarios, the data samples may be formed based on other suitable manners. Further, in one example, the data sample ID may be a hash value obtained by hashing a feature value of each sample feature included in the data sample.
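As an illustrative sketch of this step (not the claimed implementation), a first member device might segment its local corpus with an off-the-shelf Chinese segmenter such as jieba and derive a data sample ID by hashing the sample's feature values; both the choice of segmenter and the hash function are assumptions made only for this example.

```python
import hashlib
import jieba  # an off-the-shelf Chinese word-segmentation library; any segmenter could be used instead

def segment(text):
    """Split one local corpus text into participles."""
    return jieba.lcut(text)

def data_sample_id(feature_values):
    """One possible data sample ID: a hash over the sample's feature values."""
    joined = "|".join(str(v) for v in feature_values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

print(segment("咳嗽、痰少、发热明显"))
print(data_sample_id(["user_001", "cough", "fever"]))
```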
Then, a local participle lexicon is generated based on the obtained local corpus segmentation result. In one example, the local lexicon may be generated based on all the obtained corpus segmentation results (i.e., without preprocessing the corpus segmentation results). In another example, the obtained local corpus segmentation results may be preprocessed. Examples of the preprocessing may include, but are not limited to, participle filtering and participle de-duplication. The participle filtering may include, for example, removing stop words, rare words, and the like. Stop words are words that carry very little substantive information about the meaning of a phrase; for example, "obvious" in the above segmentation result may be regarded as a stop word. Examples of stop words may include, but are not limited to, English characters, numbers, mathematical symbols, punctuation marks, and single Chinese characters that are used with particularly high frequency. The local participle lexicon is then generated from the preprocessed local corpus segmentation result.
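A minimal sketch of this lexicon-building step, assuming a simple stop-word list and using a set for de-duplication (the specific filtering rule is illustrative, not mandated by the description):

```python
def build_local_lexicon(segmented_corpus, stop_words=frozenset()):
    """Build a local participle lexicon from segmented data samples: filter, then de-duplicate."""
    lexicon = set()
    for participles in segmented_corpus:
        for p in participles:
            if p in stop_words:        # participle filtering (e.g., stop-word removal)
                continue
            lexicon.add(p)             # the set performs participle de-duplication
    return lexicon

corpus = [["咳嗽", "痰少", "明显", "发热"], ["咳嗽", "发热"]]
print(build_local_lexicon(corpus, stop_words={"明显"}))
```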
After the respective local participle lexicons are obtained as described above, at 420, the first member devices jointly perform private set intersection (PSI) using their respective local participle lexicons to determine the public participles. Here, various PSI protocols in the art may be employed to implement the above public-participle determination process. After each first member device obtains the public participles, it determines the number of its non-public participles and then shares this number with the remaining first member devices.
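The description leaves the concrete PSI protocol open. The following sketch only shows the data that flows around the intersection step, using a plain set intersection as a stand-in for a real cryptographic PSI protocol (a real protocol would compute the same result without revealing the raw lexicons):

```python
def determine_public_participles(local_lexicon, other_lexicons):
    """Stand-in for PSI: in practice this intersection is computed with a private set
    intersection protocol so that non-public participles are never revealed in the clear."""
    public = set(local_lexicon)
    for lexicon in other_lexicons:
        public &= set(lexicon)
    return public

def non_public_count(local_lexicon, public):
    """Each first member device shares only the *number* of its non-public participles."""
    return len(set(local_lexicon) - public)

lexicon_a = {"咳嗽", "发热", "头痛"}
lexicon_b = {"咳嗽", "发热", "乏力"}
public = determine_public_participles(lexicon_a, [lexicon_b])
print(public, non_public_count(lexicon_a, public), non_public_count(lexicon_b, public))
```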
It is noted that, in the case where there are more than two first member devices, the determined public participles may be hierarchical public participles. For example, where there are N first member devices, the hierarchy of public participles may include participles common to all N member devices, participles common to N-1 member devices, ..., down to participles common to 2 member devices. In this case, it is also necessary to obtain the hierarchy of each public participle, the number of participles in the hierarchy to which the public participle belongs, and the device identifications of the member devices to which the public participle belongs.
Next, at 430, each first member device performs unified participle numbering according to the public participles and the number of non-public participles of each first member device, to generate a unified dictionary. The generated unified dictionary is stored at each first member device, and each participle in the unified dictionary has a unique participle number. It is noted that, in the case where there are more than two first member devices, each first member device numbers the participles in a unified manner according to the public participles, the hierarchy of each public participle, the number of participles in the hierarchy to which the public participle belongs, the device identifications of the member devices to which the public participle belongs, and the numbers of non-public participles.
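One possible numbering scheme consistent with the description is sketched below: the public participles are numbered identically on every device, and each device numbers its own non-public participles within a range reserved for it by the shared counts, so the numbers are globally unique while the non-public participles themselves never leave the device. The details (sorting order, device order) are assumptions made for the example.

```python
def unified_numbering(public, non_public_counts, my_index, my_non_public):
    """Build this device's view of the unified dictionary.

    public:            participles known to all devices (identical everywhere)
    non_public_counts: number of non-public participles per device, in an agreed device order
    my_index:          this device's position in that order
    my_non_public:     this device's own non-public participles (kept local)
    """
    numbering = {w: i for i, w in enumerate(sorted(public))}      # same numbers on every device
    offset = len(public) + sum(non_public_counts[:my_index])      # range reserved for this device
    for i, w in enumerate(sorted(my_non_public)):
        numbering[w] = offset + i                                  # unique across all devices
    return numbering

print(unified_numbering({"咳嗽", "发热"}, [1, 1], 0, {"头痛"}))
```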
After the unified dictionary is obtained, at 440, each first member device generates training samples for that first member device based on the unified dictionary and the corpus segmentation result of its local corpus.
Fig. 5 illustrates an example flow diagram of a training sample generation process 500 in accordance with an embodiment of the present description.
As shown in fig. 5, at 510, a local corpus participle result is participle-pair sampled using a given participle sampling window, resulting in a set of local participle pairs. Here, the segmentation pair sampling is performed separately for the local corpus segmentation result of each data sample in the local corpus.
FIG. 6 shows an example schematic diagram of a participle extraction process according to an embodiment of the present specification. In the example of FIG. 6, the local corpus segmentation result is "the quick brown fox jumps over the lazy dog", and the size of the participle sampling window is 2 (window_size = 2); that is, only the two words before and the two words after the input word are selected to be combined with the input word, thereby extracting participle pairs (input word, output word), where the word in the gray box is the input word. As shown in FIG. 6, the participle pairs (the, quick), (the, brown), (quick, the), (quick, brown), (quick, fox), (brown, the), (brown, quick), (brown, fox), (brown, jumps), (fox, quick), (fox, brown), (fox, jumps), (fox, over), and so on are extracted.
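A minimal sketch of this sampling step, with the window size as a parameter (illustrative only):

```python
def sample_participle_pairs(participles, window_size=2):
    """Slide the sampling window over one segmented sentence and emit (input word, output word) pairs."""
    pairs = []
    for i, center in enumerate(participles):
        lo = max(0, i - window_size)
        hi = min(len(participles), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, participles[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(sample_participle_pairs(sentence, window_size=2)[:5])
```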
Then, at 520, training samples at the first member device are generated based on the participle pairs in the extracted local participle pair set. For example, the two words of a participle pair may be one-hot encoded, resulting in a training sample (input word, output word) of the word2vec model.
After the training samples of each first member device are obtained as described above, at 450, each first member device performs privacy-protection-based model training using its respective training samples to train the word2vec model. In one example, the privacy-protection-based word2vec model training may be federated-learning-based word2vec model training.
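For completeness, the encoding of participle pairs into one-hot training samples at 520 can be sketched as follows; the small dictionary numbering shown is hypothetical and serves only the example.

```python
import numpy as np

def one_hot(index, vocab_size):
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

def encode_pairs(pairs, numbering):
    """Turn (input word, output word) participle pairs into one-hot word2vec training samples."""
    vocab_size = len(numbering)
    return [(one_hot(numbering[a], vocab_size), one_hot(numbering[b], vocab_size))
            for a, b in pairs]

numbering = {"the": 0, "quick": 1, "brown": 2}
print(encode_pairs([("the", "quick")], numbering))
```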
As shown in FIG. 3, the second member device 320 in the model training system 300 deploys or maintains the word2vec model. Each first member device then performs federated learning with the second member device using its respective training samples to train the word2vec model.
FIG. 7 illustrates an example schematic of a federated learning process in accordance with an embodiment of the present specification.
As shown in FIG. 7, during model training, the second member device 320 initializes the word2vec model to obtain an initial weight matrix W between the input layer and the hidden layer of the word2vec model and an initial weight matrix W' between the hidden layer and the output layer of the word2vec model. The second member device 320 then sends the initial weight matrices W and W' to each first member device.
Each first member device performs model training using its own training samples and calculates the gradients of the weight matrices W and W', denoted dW_i and dW'_i for the i-th first member device. The second member device then performs a secure summation over the calculated gradients dW_i and dW'_i, updates the weight matrices W and W' according to the secure-summation result, and sends the updated weight matrices W and W' down to each first member device. The above steps are repeated until the model converges. It is noted that the secure summation of the gradients dW_i and dW'_i may comprise one of the following: a secure summation based on secret sharing; a secure summation based on homomorphic encryption; a secure summation based on oblivious transfer; a secure summation based on garbled circuits; or a secure summation based on a trusted execution environment.
Further, it is noted that, in the above embodiment, each first member device cooperates with the second member device to perform federated learning to train the word2vec model. In other embodiments, the model training system may not include the second member device; instead, federated learning is performed among the first member devices themselves to train the word2vec model. In this case, one of the first member devices performs the word2vec model initialization and sends the initial weight matrices W and W' to each of the other first member devices. In addition, the gradients calculated by the respective other first member devices are returned to that first member device, which performs the model update. In other words, that first member device plays the role of the second member device in the model training system shown in FIG. 3.
A word2vec model training method according to an embodiment of the present specification has been described above with reference to FIGS. 1A to 7. With this model training scheme, each first member device uses its local participle lexicon to jointly perform private set intersection to determine the public participles, and shares only the number of its non-public participles rather than the non-public participles themselves, so that no first member device exposes its non-public participles, thereby ensuring privacy protection for the private participles of each first member device.
Furthermore, with the above model training scheme, privacy protection-based word2vec model training is performed by each first member device using a respective training sample, so that privacy security of the training sample of each first member device can be ensured.
FIG. 8 shows a block diagram of a model training apparatus 800 for training a word2vec model in accordance with embodiments of the present description. The model training apparatus 800 is applied to a first member device. As shown in FIG. 8, the model training apparatus 800 includes a lexicon generation unit 810, a privacy intersection unit 820, a dictionary generation unit 830, a training sample generation unit 840, and a model training unit 850.
The lexicon generation unit 810 is configured to generate a local participle lexicon based on the corpus segmentation result of the local corpus. The privacy intersection unit 820 is configured to perform private set intersection with the remaining first member devices using the respective local participle lexicons, determine the public participles, and share the number of non-public participles with the remaining first member devices. Here, the local participle lexicon of each remaining first member device is generated based on the corpus segmentation result of that device's own local corpus.
The dictionary generation unit 830 is configured to perform unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices, to generate a unified dictionary.
The training sample generation unit 840 is configured to generate training samples at the first member device based on the unified dictionary and the local corpus participle results.
The model training unit 850 is configured to perform privacy preserving based word2vec model training with the remaining first member devices using the respective training samples.
Fig. 9 illustrates an example block diagram of a training sample generation unit 840 in accordance with an embodiment of this specification. As shown in fig. 9, the training sample generation unit 840 includes a segmentation extraction module 841 and a training sample generation module 843.
The segmentation extraction module 841 is configured to sample the segmentation pairs of each local corpus segmentation result using a given segmentation sampling window, resulting in a set of local segmentation pairs. Then, the training sample generation module 843 is configured to generate a training sample at the first member device from the participle pairs in the local participle pair set.
Furthermore, optionally, the model training apparatus 800 may further comprise a preprocessing unit (not shown). The preprocessing unit is configured to preprocess the local corpus segmentation result. In that case, the lexicon generation unit 810 is configured to generate the local participle lexicon based on the preprocessed corpus segmentation result of the local corpus.
As described above with reference to fig. 1A to 9, the model training method, the model training apparatus, and the model training system according to the embodiments of the present specification are described. The above model training device can be implemented by hardware, or can be implemented by software, or a combination of hardware and software.
FIG. 10 shows a schematic diagram of a model training apparatus 1000, implemented on a computer at the first member device side, for training a word2vec model according to an embodiment of the present description. As shown in FIG. 10, the model training apparatus 1000 may include at least one processor 1010, a storage (e.g., a non-volatile storage) 1020, a memory 1030, and a communication interface 1040, and the at least one processor 1010, the storage 1020, the memory 1030, and the communication interface 1040 are coupled together via a bus 1060. The at least one processor 1010 executes a computer program (i.e., the above-described elements implemented in software) stored or encoded in the memory.
In one embodiment, a computer program is stored in the memory that, when executed, causes the at least one processor 1010 to: generate a local participle lexicon based on the corpus segmentation result of a local corpus; perform private set intersection with the remaining first member devices using the respective local participle lexicons to determine public participles, and share the number of non-public participles with the remaining first member devices, the local participle lexicon of each remaining first member device being generated from the corpus segmentation result of its own local corpus; perform unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices to generate a unified dictionary; generate training samples at the first member device based on the unified dictionary and the corpus segmentation result of the local corpus; and perform privacy-protection-based model training with the remaining first member devices using the respective training samples to train the word2vec model, the training samples at the remaining first member devices being generated based on the unified dictionary and the corpus segmentation results of their local corpora.
It should be appreciated that the computer programs stored in the memory, when executed, cause the at least one processor 1010 to perform the various operations and functions described above in connection with fig. 1A-9 in the various embodiments of the present description.
According to one embodiment, a program product, such as a computer-readable medium (e.g., a non-transitory computer-readable medium), is provided. The computer-readable medium may have a computer program (i.e., the elements described above as being implemented in software) that, when executed by a processor, causes the processor to perform various operations and functions described above in connection with fig. 1A-9 in various embodiments of the present specification. Specifically, a system or apparatus may be provided which is provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and causes a computer or processor of the system or apparatus to read out and execute instructions stored in the readable storage medium.
In this case, the program code itself read from the computer-readable medium can realize the functions of any of the above-described embodiments, and thus the computer-readable code and the readable storage medium storing the computer-readable code form part of the present invention.
Examples of the computer-readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communications network.
According to one embodiment, a computer program product is provided that includes a computer program that, when executed by a processor, causes the processor to perform the various operations and functions described above in connection with fig. 1A-9 in the various embodiments of the present specification.
It will be understood by those skilled in the art that various changes and modifications may be made in the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the invention should be determined from the following claims.
It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities, or some units may be implemented by some components in a plurality of independent devices.
In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may comprise permanently dedicated circuitry or logic (such as a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware units or processors may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method for training a word2vec model via at least two first member devices, the method being applied to one of the at least two first member devices, the method comprising:
generating a local participle word bank based on a local corpus participle result of the local corpus;
performing private set intersection with the remaining first member devices using the respective local participle word banks to determine public participles, and sharing the number of non-public participles with the remaining first member devices, the local participle word bank of each remaining first member device being generated from the corpus participle result of its own local corpus;
performing unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices to generate a unified dictionary, the generated unified dictionary being stored locally at the first member device;
generating training samples at the first member device based on the locally stored unified dictionary and the local corpus participle result; and
performing privacy-protection-based model training with the remaining first member devices using the respective training samples to train the word2vec model, the training samples at the remaining first member devices being generated based on the unified dictionary and the corpus participle results of their local corpora.
2. The method of claim 1, further comprising:
preprocessing the local corpus word segmentation result,
generating a local participle word bank based on a local corpus participle result of the local corpus comprises:
and generating a local word segmentation word bank based on the local corpus word segmentation result after the local corpus is preprocessed.
3. The method of claim 2, wherein the pre-processing comprises at least one of: word segmentation filtering processing and word segmentation de-duplication processing.
4. The method of claim 1, wherein generating training samples at the first member device based on the unified dictionary and the local corpus participle results comprises:
using a given word segmentation sampling window to perform word segmentation pair sampling on the local corpus word segmentation result to obtain a local word segmentation pair set; and
generating a training sample at the first member device according to the participle pair in the local participle pair set.
5. The method of claim 1, wherein the privacy preserving based word2vec model training comprises federated learning based word2vec model training.
6. The method of claim 1, wherein the first member devices include more than two first member devices and the determined common participles include hierarchical common participles,
sharing the number of non-common participles to the remaining first member devices comprises:
sharing the number of non-public participles, the number of participles of the hierarchy to which the public participles belong and the device identification of the member device to which the public participles belong with the rest first member devices,
performing unified word segmentation numbering according to the public word segmentation and the number of the non-public word segmentation of each first member device, and generating a unified dictionary comprises the following steps:
and carrying out unified word segmentation numbering according to the public word segmentation, the word segmentation number of the hierarchy to which the public word segmentation belongs, the equipment identification of the member equipment to which the public word segmentation belongs and the non-public word segmentation number of each first member equipment to generate a unified dictionary.
7. An apparatus for training a word2vec model via at least two first member devices, the apparatus being applied to one of the at least two first member devices, the apparatus comprising:
at least one processor,
a memory coupled to the at least one processor, and
a computer program stored in the memory, the computer program being executed by the at least one processor to implement:
generating a local participle word bank based on a local corpus participle result of the local corpus;
performing private set intersection with the remaining first member devices using the respective local participle word banks to determine public participles, and sharing the number of non-public participles with the remaining first member devices, the local participle word bank of each remaining first member device being generated from the corpus participle result of its own local corpus;
performing unified participle numbering according to the public participles and the numbers of non-public participles of the first member devices to generate a unified dictionary, the generated unified dictionary being stored locally at the first member device;
generating training samples at the first member device based on the locally stored unified dictionary and the local corpus participle result; and
performing privacy-protection-based word2vec model training with the remaining first member devices using the respective training samples, the training samples at the remaining first member devices being generated based on the unified dictionary and the corpus participle results of their local corpora.
8. The apparatus of claim 7, wherein the at least one processor executes the computer program to further implement:
preprocessing the local corpus word segmentation result,
wherein the at least one processor executes the computer program to further implement:
and generating a local word segmentation word bank based on the local corpus word segmentation result after the local corpus is preprocessed.
9. The apparatus of claim 8, wherein the pre-processing comprises at least one of: word segmentation filtering processing and word segmentation de-duplication processing.
10. The apparatus of claim 7, wherein the at least one processor executes the computer program to implement:
using a given word segmentation sampling window to perform word segmentation pair sampling on the local corpus word segmentation result to obtain a local word segmentation pair set; and
generating a training sample at the first member device according to the participle pair in the local participle pair set.
11. The apparatus of claim 7, wherein the first member devices include more than two first member devices and the determined common participles include hierarchical common participles,
the at least one processor executes the computer program to:
sharing the number of non-public participles, the number of participles of a hierarchy to which the public participles belong and the equipment identification of the member equipment to which the public participles belong to the other first member equipment; and
and carrying out unified word segmentation numbering according to the public word segmentation, the word segmentation number of the hierarchy to which the public word segmentation belongs, the equipment identification of the member equipment to which the public word segmentation belongs and the non-public word segmentation number of each first member equipment to generate a unified dictionary.
12. A system for training a word2vec model via at least two first member devices, comprising:
at least two first member devices, each first member device comprising an apparatus as claimed in any one of claims 7 to 11.
13. The system of claim 12, further comprising:
a second member device deployed with a word2vec model and performing federated learning with the at least two first member devices to train the word2vec model.
14. A computer-readable storage medium storing a computer program for execution by a processor to implement the method of any one of claims 1 to 6.
CN202110158847.2A 2021-02-05 2021-02-05 Word2vec model training method, device and system based on privacy protection Active CN112507388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110158847.2A CN112507388B (en) 2021-02-05 2021-02-05 Word2vec model training method, device and system based on privacy protection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110158847.2A CN112507388B (en) 2021-02-05 2021-02-05 Word2vec model training method, device and system based on privacy protection

Publications (2)

Publication Number Publication Date
CN112507388A CN112507388A (en) 2021-03-16
CN112507388B true CN112507388B (en) 2021-05-25

Family

ID=74952608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110158847.2A Active CN112507388B (en) 2021-02-05 2021-02-05 Word2vec model training method, device and system based on privacy protection

Country Status (1)

Country Link
CN (1) CN112507388B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349879A (en) * 2023-09-11 2024-01-05 江苏汉康东优信息技术有限公司 Text data anonymization privacy protection method based on continuous word bag model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942147A (en) * 2019-11-28 2020-03-31 支付宝(杭州)信息技术有限公司 Neural network model training and predicting method and device based on multi-party safety calculation
CN110955915A (en) * 2019-12-14 2020-04-03 支付宝(杭州)信息技术有限公司 Method and device for processing private data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942147A (en) * 2019-11-28 2020-03-31 支付宝(杭州)信息技术有限公司 Neural network model training and predicting method and device based on multi-party safety calculation
CN110955915A (en) * 2019-12-14 2020-04-03 支付宝(杭州)信息技术有限公司 Method and device for processing private data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Self-configuration method for intelligent fault recorders based on deep semantic learning; Chen Xu et al.; Power System Protection and Control; 2021-01-31; Vol. 49, No. 2; pp. 179-187 *

Also Published As

Publication number Publication date
CN112507388A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN107273503B (en) Method and device for generating parallel text in same language
Qin et al. A network security entity recognition method based on feature template and CNN-BiLSTM-CRF
CN111460820B (en) Network space security domain named entity recognition method and device based on pre-training model BERT
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
US10585989B1 (en) Machine-learning based detection and classification of personally identifiable information
Yang et al. Ordering-sensitive and semantic-aware topic modeling
KR20190065665A (en) Apparatus and method for recognizing Korean named entity using deep-learning
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN117271759A (en) Text abstract generation model training method, text abstract generation method and device
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN112507388B (en) Word2vec model training method, device and system based on privacy protection
CN115146068A (en) Method, device and equipment for extracting relation triples and storage medium
CN114580371A (en) Program semantic confusion method and system based on natural language processing
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
Li et al. An improved Chinese named entity recognition method with TB-LSTM-CRF
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
Wu et al. Semantic key generation based on natural language
Kwon et al. Toward backdoor attacks for image captioning model in deep neural networks
CN112749251B (en) Text processing method, device, computer equipment and storage medium
Tang et al. Interpretability rules: Jointly bootstrapping a neural relation extractorwith an explanation decoder
Zaikis et al. Dacl: A domain-adapted contrastive learning approach to low resource language representations for document clustering tasks
Ororbia II et al. Privacy protection for natural language: Neural generative models for synthetic text data
CN116975298B (en) NLP-based modernized society governance scheduling system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant