CN117521116B - Large language model privacy information protection method - Google Patents

Large language model privacy information protection method Download PDF

Info

Publication number
CN117521116B
CN117521116B CN202410013413.7A CN202410013413A CN117521116B CN 117521116 B CN117521116 B CN 117521116B CN 202410013413 A CN202410013413 A CN 202410013413A CN 117521116 B CN117521116 B CN 117521116B
Authority
CN
China
Prior art keywords
language model
different
word segmentation
large language
text sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410013413.7A
Other languages
Chinese (zh)
Other versions
CN117521116A (en
Inventor
赵策
屠静
王亚
万晶晶
李伟伟
颉彬
张玥
孙岩
刘岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuoshi Future Beijing technology Co ltd
Original Assignee
Zhuoshi Future Beijing technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuoshi Future Beijing technology Co ltd filed Critical Zhuoshi Future Beijing technology Co ltd
Priority to CN202410013413.7A priority Critical patent/CN117521116B/en
Publication of CN117521116A publication Critical patent/CN117521116A/en
Application granted granted Critical
Publication of CN117521116B publication Critical patent/CN117521116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of data processing, and provides a large language model privacy information protection method, which comprises the following steps: acquiring relevant parameters of a large language model; preprocessing the obtained large language model related parameters to obtain word segmentation data with different dimensions, calculating similar semantic sets of the word segmentation data with different dimensions according to the large language model related parameters, and calculating important association evaluation coefficients of word segmentation with different dimensions; acquiring synonym forest codes by using related parameters of a large language model, calculating associated similarity indexes of different dimension word segments at different moments, calculating a clustering evaluation function according to the associated similarity indexes of the different dimension word segments at different moments, and acquiring a high-frequency privacy information cluster; and acquiring a desensitization substitution text sequence according to the related parameters of the large language model by the high-frequency privacy information cluster, and protecting the privacy information of the large language model according to the desensitization substitution text sequence. The invention improves the reliability of protecting the privacy information of the large language model.

Description

Large language model privacy information protection method
Technical Field
The invention relates to the technical field of data processing, in particular to a large language model privacy information protection method.
Background
The large language model (Large Language Models, LLM) is a deep learning large model for realizing large-scale human natural language understanding through an artificial intelligence algorithm, thereby completing various tasks such as language reasoning, text generation, man-machine interaction and the like.
The large language model mainly represented by the dialogue generation and training model ChatGPT has various excellent capabilities such as high-quality dialogue, complex reasoning, cross-domain generalization and the like, and the large language model solves various open tasks through the excellent model performance assistance and is widely focused by different users in various different industrial fields.
Because the large language model parameter is very large in scale, the time cost and the time cost are large in the process of retraining the large language model, the general large language model widely applied in the industry at present mainly comes from a pre-trained general model disclosed on a network. The pre-training general model disclosed by the invention is quite similar to the text data processing process, and has excellent generalization, but the quite similar data processing flow also hides a larger risk of user privacy data disclosure.
Disclosure of Invention
The invention provides a large language model privacy information protection method, which solves the problem that the traditional clustering algorithm cannot accurately acquire privacy information, and adopts the following technical scheme:
the invention provides a large language model privacy information protection method, which comprises the following steps:
acquiring relevant parameters of a large language model;
Preprocessing the obtained related parameters of the large language model to obtain word segmentation data with different dimensions, calculating similar semantic sets of the word segmentation data with different dimensions according to the related parameters of the large language model, and calculating important association evaluation coefficients of the word segmentation with different dimensions according to the similar semantic sets;
Obtaining synonym Lin Bianma by using related parameters of a large language model, calculating associated similarity indexes of different dimension words at different moments according to important association coefficients of the different dimension words and synonym forest codes, calculating a clustering evaluation function according to the associated similarity indexes of the different dimension words at different moments, and obtaining a high-frequency privacy information cluster according to the clustering evaluation function;
And acquiring a desensitization substitution text sequence according to the related parameters of the large language model by the high-frequency privacy information cluster, and protecting the privacy information of the large language model according to the desensitization substitution text sequence.
Preferably, the large language model related parameters include: the user inputs a text sequence, a general text vector output by a large language model.
Preferably, the method for calculating the similar semantic collection of the word segmentation data with different dimensions according to the related parameters of the large language model comprises the following steps:
and marking the numerical value difference of the Chinese character codes of each dimension word segmentation data and other different dimension word segmentation as a first difference value, and forming a similar semantic set by all the different dimension word segmentation with the first difference value as a preset value.
Preferably, the mathematical expression for calculating the important association evaluation coefficients of the different dimension word segments according to the similar semantic sets is as follows:
in the/> Representing the adjustment of a preset constant,/>Represents the/>User input of the first/>, in a text sequence at a single instantWord frequency of individual dimension word segmentation data,/>Represents the/>User input text sequence at various moments/>Total number of all different word segmentation data of the same class semantic set of each dimension,/>Representing the relative distance between two different dimension word segmentation data,/>,/>Respectively show that at the/>User input of the first/>, in a text sequence at a single instantIndividual dimensions and/>Word segmentation data of individual dimensions,/>Represents the/>Middle/>, of a text sequence entered by a user at a single moment in timeImportant associated evaluation coefficients of individual dimension word segmentation.
Preferably, the mathematical expression for calculating the association similarity indexes of the different dimension words at different moments according to the important association coefficients of the different dimension words and the synonym forest codes is as follows:
in the/> Represents the/>User input of the first/>, in a text sequence at a single instantImportant association evaluation coefficient of individual dimension word segmentation,/>Represents the/>Dimension of user input text sequence data at various moments,/>Representing cosine similarity between two different dimension word-segmentation encoding vectors,/>Respectively represent the/>User input of the first text sequence at a timeIndividual dimension word segmentation and the/>Encoding vector of individual dimension word segmentation,/>An exponential function based on a natural constant is shown,Representing the height difference of synonym forest where two different dimension word segmentation coding vectors are located,/>, andRepresents the/>At the moment/>Associated similarity index for individual dimension word segments.
Preferably, the mathematical expression of the height difference calculation of the synonym forest where the two different dimension word segmentation coding vectors are located is:
in the/> ,/>Respectively represent the/>At the moment/>Individual dimension word segmentation and the/>The height of synonym forest where two different dimension word segmentation coding vectors are located,/>Representing a preset discriminating constant,/>,/>Respectively represent the/>At the moment/>Individual dimensions and/>A set formed by all different synonyms in a synonym forest where each dimension word is located,/>Represents the/>User input of the first/>, in a text sequence at a single instantIndividual dimension word segmentation and the/>Height difference of synonym forest among individual dimension word segments.
Preferably, the mathematical expression of the clustering evaluation function calculated according to the associated similarity indexes of the different dimension word segments at different moments is:
in the/> Represents the/>Dimension of text sequence data entered by a user at each instant,/>Representing the total number of categories of the text data sequence input by the user,/>Represents the/>At the moment/>Associated similarity index of individual dimension word segment,/>Represents the/>Class center dimension word segmentation/>, at each momentRelated similarity index,/>Represents the/>The text sequence data entered by the user at each instant clusters the evaluation function.
Preferably, the method for acquiring the high-frequency privacy information cluster according to the clustering evaluation function comprises the following steps:
And clustering the text sequence data to obtain a cluster when the minimum value of the evaluation function is clustered, and marking the cluster as a high-frequency privacy information cluster.
Preferably, the method for acquiring the desensitization substitution text sequence according to the related parameters of the high-frequency privacy information cluster to the large language model comprises the following steps:
and replacing the high-frequency privacy information cluster in the related parameters of the large language model in the Greek alphabet according to a preset desensitization step length, and recording the data obtained after replacement as a desensitization replacement text sequence.
Preferably, the method for protecting the privacy information of the large language model according to the desensitization substitution text sequence comprises the following steps:
and inputting the desensitization substitution text sequence as an AES encryption algorithm to obtain the encrypted large language model related parameters, so as to protect the privacy information of the large language model.
The beneficial effects of the invention are as follows: according to the invention, different similar semantic sets are constructed through the word coding differences of the word segmentation in different dimensions in the large language model, and important association evaluation coefficients are obtained by calculating different word segmentation data in the similar semantic sets, so that the privacy information in the text data input by the large language model user is primarily extracted and calculated. Furthermore, the method acquires the coding vectors of the different-dimension word fragments through the synonym forest, calculates the association similarity indexes of the different-dimension word fragments by combining the important association evaluation coefficients and the coding vectors, and effectively reflects the privacy information data in the text data input by the large language model user through the association similarity indexes. According to the method, the traditional clustering loss function is optimized by combining the privacy information data, so that the problem of inaccurate acquisition of the privacy information caused by direct clustering of text information input by a large language model user in a traditional clustering algorithm is effectively solved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flowchart of a method for protecting privacy information of a large language model according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a correlation similarity index calculation flow.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flowchart of a method for protecting privacy information of a large language model according to an embodiment of the invention is shown, the method includes the following steps:
Step S001, acquiring relevant parameters of the large language model.
It should be noted that, in general, the main structure of the general large language model is stacked layer by layer according to the network structure model of the Transformer, and the text sequence input by the user is processed by the large language model to generate the corresponding general text vector feature, and different language processing tasks are further completed according to the general text vector. Therefore, in the processing process of the large language model, if the user unintentionally leaks the cached general text vector output by the large language model in the process of using the large language model, or an attacker acquires the general text vector output by the large language model in various modes such as acquiring the general text vector output by the large language model by capturing data information of a victim server through an attack channel. An attacker can infer the original text data input by the user according to the universal text vector to obtain the user privacy key sensitive words.
In order to protect the privacy information of the large language model, related parameters of the large language model need to be acquired, wherein the related parameters comprise a text sequence input by a user and a general text vector output by the large language model.
Considering that the language types used by users in different regions are different, large language model parameters can have large difference in the processing process of different language types, so that the invention firstly processes the original text data input by the users and converts the non-Chinese original text data into Chinese text data. Meanwhile, in order to facilitate subsequent further processing of the original text data, the text sequence input by the user is subjected to word segmentation processing through the jieba word segmentation tool to obtain different word segmentation data, and in order to avoid influence of high-frequency stop words in the original text data on subsequent further analysis and calculation processes, the stop words in the original text data are removed through the word segmentation tool.
Step S002, preprocessing the obtained related parameters of the large language model to obtain word segmentation data with different dimensions, calculating similar semantic sets of the word segmentation data with different dimensions according to the related parameters of the large language model, and calculating important association evaluation coefficients of the word segmentation with different dimensions according to the similar semantic sets.
It should be noted that, in general, the large language model directly analyzes and processes the text sequence data input by the user to obtain a corresponding general text vector. However, when text sequence data input by a user is directly processed through a large language model, privacy information in original text sequence data of the user is exposed to a large extent, so that analysis of the text sequence data input by the user is required in order to avoid the defect of exposure of privacy information when an attacker attacks the large language model.
Assume time of dayThe dimension of text sequence data input by a user is/>The user-entered text sequence data at this point may be noted as/>Wherein/>,/>,/>Respectively represent time/>At user input of text sequence data/>Dimension, th/>Sum/>Dimension. Each different dimension of the user-entered text sequence data represents a different word segmentation data.
For user input of text sequence dataWord segmentation data for each dimension of (1)Calculating the Chinese character encoding numerical value difference between the word segmentation data of each different dimension and the word segmentation data of other dimensions, and if the two different Chinese characters are identical in representation in a computer, the Chinese character encoding numerical values between the two characters are identical, so that the Chinese character encoding numerical value difference of the word segmentation data of each different dimension is/>The invention uses GBK Chinese character codes to calculate, in specific application, the implementer can set according to specific conditions to set time/>, and the word segmentation data is used as the similar semantic set of corresponding dimension word segmentation dataAt user input text sequence number/>The homogeneous semantic collections of individual dimensions are denoted/>
In the/>Representing the adjustment of a preset constant,/>Represents the/>User input of the first/>, in a text sequence at a single instantWord frequency of individual dimension word segmentation data,/>Represents the/>User input text sequence at various moments/>Total number of all different word segmentation data of the same class semantic set of each dimension,/>Representing the relative distance between two different dimension word segmentation data,/>,/>Respectively show that at the/>User input of the first/>, in a text sequence at a single instantIndividual dimensions and/>Word segmentation data of individual dimensions,/>Represents the/>Middle/>, of a text sequence entered by a user at a single moment in timeImportant associated evaluation coefficients of individual dimension word segmentation.
The time can be set by the methodCalculating important associated evaluation coefficients of word segmentation data with different dimensions in a text sequence input by a user, wherein a preset constant/>Take the empirical value as/>Approximate distance/>, between two different dimension word segmentation dataThe calculation method is the total number of the word segmentation data of all other dimensions between two different word segmentation data. If/>The/>, in the text sequence entered by the user at each instantThe higher the frequency of occurrence of individual segmentation, the more frequently the word is simultaneously with the/>The smaller the interval between other words with the same word segmentation, the description time/>Input text sequence data by userThe more important the individual dimension word segmentation data is, the more likely it is that the key privacy information in the text data is input by the user, then the/>The larger the value of the important association evaluation coefficient obtained by calculating the individual dimension word segmentation data.
And S003, acquiring synonym Lin Bianma by using related parameters of the large language model, calculating associated similarity indexes of the different-dimension words at different moments according to important association coefficients of the different-dimension words and synonym forest codes, calculating a clustering evaluation function according to the associated similarity indexes of the different-dimension words at different moments, and acquiring a high-frequency privacy information cluster according to the clustering evaluation function.
When calculating the text sequence input by the user, the first text code is obtained and input in the text sequenceSimilar semantic sets of individual dimension word segments, however, different synonym transformation descriptions may exist for the same dimension word segment in text sequence data input by a user, and similar synonyms cannot be effectively judged through strict Chinese character coding comparison calculation, so that calculation analysis is required for synonym similarity among different dimension word segments in text data input by the user.
Because different words have different associated interpretations, the method aims at carrying out associated analysis on synonym information among different dimension word segments in text data input by a user. As shown in fig. 2, the method encodes all the different dimension words in the text data sequence input by the user according to the "synonym forest" to obtain corresponding synonyms Lin Bianma, the "synonym forest" and its specific encoding process are known techniques, and are not described in detail herein, and the moment is assumedFirst/>, in text data entered by a user at a locationThe code vector of each dimension word is recorded as/>
In the/>Represents the/>User input of the first/>, in a text sequence at a single instantImportant association evaluation coefficient of individual dimension word segmentation,/>Represents the/>Dimension of user input text sequence data at various moments,/>Representing cosine similarity between two different dimension word-segmentation encoding vectors,/>Respectively represent the/>User input of the first text sequence at a timeIndividual dimension word segmentation and the/>Encoding vector of individual dimension word segmentation,/>An exponential function based on a natural constant is shown,Representing the height difference of synonym forest where two different dimension word segmentation coding vectors are located,/>, andRepresents the/>At the moment/>Associated similarity index for individual dimension word segments.
The synonymous association index of the text sequences input by the user at different moments can be calculated through the method, if the moments areCalculation of the place to obtain the/>The larger the important association evaluation coefficient of each dimension word is, the higher the cosine similarity between two different dimension word coding vectors is and the smaller the height difference value of the corresponding different dimension word in the synonym forest is, the description is in the/>User input of the first/>, in a text sequence at a single instantThe higher the possibility that the individual dimension word is user privacy information, the more/>, the more the user privacy information is calculated at the momentAt the moment/>Individual dimension word association similarity index/>The larger the value of (c) will be relatively.
In the/>,/>Respectively represent the/>At the moment/>Individual dimension word segmentation and the/>The height of synonym forest where two different dimension word segmentation coding vectors are located,/>Representing a preset discriminating constant,/>,/>Respectively represent the/>At the moment/>Individual dimensions and/>A set formed by all different synonyms in a synonym forest where each dimension word is located,/>Represents the/>User input of the first/>, in a text sequence at a single instantIndividual dimension word segmentation and the/>Height difference of synonym forest among individual dimension word segments.
The height difference of the synonym forest where the two different-dimension word segmentation coding vectors are located can be calculated through the method, and the fact that the synonym forest where the two different-dimension word segmentation coding vectors are located is large is assumed that the difference of the synonym forest where the two different-dimension word segmentation coding vectors are located is large, at the moment, the intersection between sets formed by all different synonyms in the corresponding synonym forest is empty, the fact that the association coincidence degree between the two different-dimension word segmentation coding vectors is small is indicated, two different semantic sets are represented, and at the moment, the preset difference constant is larger experience value isThe numerical value of the height difference of the synonym forest calculated in the same synonym forest should be larger than that of the height difference of the synonym forest calculated in the same synonym forest, and in specific application, an implementer can set according to specific situations.
It should be noted that, for the text sequence input by the user at different moments, the similar words appearing repeatedly and frequently are very likely to represent the key privacy information of the user, so the invention makesTake the experience value/>In a specific application, the implementer may make settings according to the specific circumstances. Clustering the text sequences input by the user by using a K-Means algorithm to obtain the most important/>, of the privacy information of the data input by the userThe categories.
In the/>Represents the/>Dimension of text sequence data entered by a user at each instant,/>Representing the total number of categories of the text data sequence input by the user,/>Represents the/>At the moment/>Associated similarity index of individual dimension word segment,/>Represents the/>Class center dimension word segmentation/>, at each momentRelated similarity index,/>Represents the/>The text sequence data entered by the user at each instant clusters the evaluation function.
The time can be calculated by the above methodThe clustering evaluation function of text sequence data is input by a user, and the method is distributed in/>, according to Gaussian distributionRandom selection/>, among individual different dimension segmentationsAnd taking the word segmentation data with different dimensions and the class center word segmentation as initial clustering centers, and assuming that the smaller the difference of the association similarity indexes of the word segmentation data with different dimensions and the class center word segmentation is, the current dimension word segmentation and the class center word segmentation are both in the same semantic representation range. Thus, when calculated/>Text sequence data clustering evaluation function/>, input by user at each momentWhen the numerical value is minimum, the convergence of the current clustering function is illustrated, and the text sequence data input by the user is clustered and divided to obtain/>Different high frequency privacy information clusters.
And S004, acquiring a desensitization substitution text sequence for the related parameters of the large language model according to the high-frequency privacy information cluster, and protecting the privacy information of the large language model according to the desensitization substitution text sequence.
It should be noted that, for text sequence data input by the user at different time points, it is possible to obtainDifferent high-frequency privacy information clusters which contain privacy information in text sequence data input by a user, so that the invention combines the alphabetic permutation pairs/>The different high-frequency privacy information clusters are subjected to desensitization protection.
The invention takes the preset desensitization step lengthFor/>Pairs/>, in 24 different greek alphabetsThe private data in different high-frequency private information clusters are mapped and replaced to obtain desensitized replacement text sequence data input by users at different moments, the desensitization step length is preset, the desensitization replacement text sequence data input by the users at different moments are processed through an AES encryption algorithm, and the encrypted text sequence obtained after encryption is sent to a large language model server for further processing. By desensitizing and encrypting the user transmitted data at different moments, the security of the privacy data of the large language model in the text sequence data input by the user is ensured, and the risk of intercepting the text sequence data input by the user in the channel attack process by an attacker is avoided.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application and are intended to be included within the scope of the application.

Claims (9)

1. A method for protecting privacy information of a large language model, comprising the steps of:
acquiring relevant parameters of a large language model;
Preprocessing the obtained related parameters of the large language model to obtain word segmentation data with different dimensions, calculating similar semantic sets of the word segmentation data with different dimensions according to the related parameters of the large language model, and calculating important association evaluation coefficients of the word segmentation with different dimensions according to the similar semantic sets;
Obtaining synonym Lin Bianma by using related parameters of a large language model, calculating associated similarity indexes of different dimension words at different moments according to important associated evaluation coefficients of the different dimension words and synonym forest codes, calculating a cluster evaluation function according to the associated similarity indexes of the different dimension words at different moments, and obtaining a high-frequency privacy information cluster according to the cluster evaluation function;
Acquiring a desensitization substitution text sequence for related parameters of the large language model according to the high-frequency privacy information cluster, and protecting the privacy information of the large language model according to the desensitization substitution text sequence;
The large language model related parameters include: the user inputs a text sequence, a general text vector output by a large language model.
2. The method for protecting privacy information of large language model according to claim 1, wherein the method for calculating the similar semantic collection of word segmentation data with different dimensions according to the related parameters of the large language model is as follows:
and marking the numerical value difference of the Chinese character codes of each dimension word segmentation data and other different dimension word segmentation as a first difference value, and forming a similar semantic set by all the different dimension word segmentation with the first difference value as a preset value.
3. The method for protecting privacy information of large language model according to claim 2, wherein the mathematical expression for calculating the important associated evaluation coefficients of different dimension word segments according to the similar semantic collection is:
in the/> Indicating that the adjustment of the preset constant is performed,Represents the/>User input of the first/>, in a text sequence at a single instantWord frequency of individual dimension word segmentation data,/>Represents the/>User input text sequence at various moments/>The total number of all different word segmentation data of the same class semantic set of each dimension,Representing the relative distance between two different dimension word segmentation data,/>,/>Respectively show that at the/>User input of the first/>, in a text sequence at a single instantIndividual dimensions and/>Word segmentation data of individual dimensions,/>Represents the/>Middle/>, of a text sequence entered by a user at a single moment in timeImportant associated evaluation coefficients of individual dimension word segmentation.
4. The method for protecting privacy information of large language model according to claim 3, wherein the mathematical expression of the association similarity index of the different dimension word segments at different time points according to the important association evaluation coefficient of the different dimension word segments and the synonym word forest code is calculated as follows:
in the/> Represents the/>User input of the first/>, in a text sequence at a single instantImportant associated evaluation coefficients of individual dimension word segmentation,Represents the/>Dimension of user input text sequence data at various moments,/>Representing cosine similarity between two different dimension word-segmentation encoding vectors,/>Respectively represent the/>User input of the first/>, in a text sequence at a single instantIndividual dimension word segmentation and the/>Encoding vector of individual dimension word segmentation,/>Represents an exponential function based on natural constants,/>Representing the height difference of synonym forest where two different dimension word segmentation coding vectors are located,/>, andRepresents the/>At the moment/>Associated similarity index for individual dimension word segments.
5. The method for protecting private information of large language model according to claim 4, wherein the mathematical expression of the calculation of the height difference of the synonym forest where the two different dimension word segmentation coding vectors are located is:
In the method, in the process of the invention, ,/>Respectively represent the/>At the moment/>Individual dimension word segmentation and the/>The height of synonym forest where two different dimension word segmentation coding vectors are located,/>Representing a preset discriminating constant,/>,/>Respectively represent the/>At the moment/>Individual dimensions and/>A set of all different synonyms in the synonym forest where each dimension word is located,Represents the/>User input of the first/>, in a text sequence at a single instantIndividual dimension word segmentation and the/>Height difference of synonym forest among individual dimension word segments.
6. The method for protecting privacy information of large language model according to claim 4, wherein the mathematical expression of the clustering evaluation function is calculated according to the associated similarity indexes of different dimension word segments at different moments:
in the/> Represents the/>Dimension of text sequence data entered by a user at each instant,/>Representing the total number of categories of the text data sequence input by the user,/>Represents the/>At the moment/>Associated similarity index of individual dimension word segment,/>Represents the/>Class center dimension word segmentation/>, at each momentRelated similarity index,/>Represents the/>The text sequence data entered by the user at each instant clusters the evaluation function.
7. The method for protecting private information of large language model according to claim 6, wherein the method for obtaining high-frequency private information clusters according to the cluster evaluation function is as follows:
And clustering the text sequence data to obtain a cluster when the minimum value of the evaluation function is clustered, and marking the cluster as a high-frequency privacy information cluster.
8. The method for protecting privacy information of large language model according to claim 7, wherein the method for obtaining desensitized substitution text sequence for related parameters of large language model according to high frequency privacy information cluster comprises:
and replacing the high-frequency privacy information cluster in the related parameters of the large language model in the Greek alphabet according to a preset desensitization step length, and recording the data obtained after replacement as a desensitization replacement text sequence.
9. The method for protecting private information of a large language model according to claim 1, wherein the method for protecting private information of a large language model according to the desensitized alternative text sequence comprises the steps of:
and inputting the desensitization substitution text sequence as an AES encryption algorithm to obtain the encrypted large language model related parameters, so as to protect the privacy information of the large language model.
CN202410013413.7A 2024-01-04 2024-01-04 Large language model privacy information protection method Active CN117521116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410013413.7A CN117521116B (en) 2024-01-04 2024-01-04 Large language model privacy information protection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410013413.7A CN117521116B (en) 2024-01-04 2024-01-04 Large language model privacy information protection method

Publications (2)

Publication Number Publication Date
CN117521116A CN117521116A (en) 2024-02-06
CN117521116B true CN117521116B (en) 2024-04-19

Family

ID=89745992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410013413.7A Active CN117521116B (en) 2024-01-04 2024-01-04 Large language model privacy information protection method

Country Status (1)

Country Link
CN (1) CN117521116B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808643B (en) * 2024-02-29 2024-05-28 四川师范大学 Teaching management system based on Chinese language

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468885A (en) * 2021-07-13 2021-10-01 安徽大学绿色产业创新研究院 Chinese trademark similarity calculation method
CN116975927A (en) * 2023-08-17 2023-10-31 南开大学 LLM language user privacy information protection method based on natural language prompt
CN117313138A (en) * 2023-08-30 2023-12-29 西安电子科技大学 Social network privacy sensing system and method based on NLP

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468885A (en) * 2021-07-13 2021-10-01 安徽大学绿色产业创新研究院 Chinese trademark similarity calculation method
CN116975927A (en) * 2023-08-17 2023-10-31 南开大学 LLM language user privacy information protection method based on natural language prompt
CN117313138A (en) * 2023-08-30 2023-12-29 西安电子科技大学 Social network privacy sensing system and method based on NLP

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
基于语义关联与模糊聚类的共词分析方法;陆泉 等;情报学报;20221031;第41卷(第10期);第1003-1014页 *
改进TF-IDF结合余弦定理计算中文语句相似度;张俊飞;;现代计算机(专业版);20171115(第32期);第20-23、27页 *
融合词语共现距离和类别信息的短文本特征提取方法;马慧芳 等;计算机工程与科学;20180915(第09期);第1689-1695页 *
语义相似和多维加权的联合敏感属性隐私保护;徐龙琴 等;计算机应用;20110401(第04期);第999-1002页 *

Also Published As

Publication number Publication date
CN117521116A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN117521116B (en) Large language model privacy information protection method
CN109308494B (en) LSTM model and network attack identification method and system based on LSTM model
Qin et al. A network security entity recognition method based on feature template and CNN-BiLSTM-CRF
CN111241291A (en) Method and device for generating countermeasure sample by utilizing countermeasure generation network
CN110019758B (en) Core element extraction method and device and electronic equipment
CN110363001B (en) Application layer malicious request detection method based on Transformer model
CN113315789B (en) Web attack detection method and system based on multi-level combined network
CN112651025A (en) Webshell detection method based on character-level embedded code
CN115146068B (en) Method, device, equipment and storage medium for extracting relation triples
CN113094478A (en) Expression reply method, device, equipment and storage medium
CN110674370A (en) Domain name identification method and device, storage medium and electronic equipment
CN114547670A (en) Sensitive text desensitization method using differential privacy word embedding disturbance
CN113742763A (en) Confusion encryption method and system based on government affair sensitive data
CN111737688B (en) Attack defense system based on user portrait
CN117332411A (en) Abnormal login detection method based on transducer model
CN112711648A (en) Database character string ciphertext storage method, electronic device and medium
CN115268799B (en) Storage method and device based on cloud service
CN117271759A (en) Text abstract generation model training method, text abstract generation method and device
CN112507388B (en) Word2vec model training method, device and system based on privacy protection
CN113822018A (en) Entity relation joint extraction method
CN114338058A (en) Information processing method, device and storage medium
CN112182575A (en) Attack data set malicious segment marking method and system based on LSTM
CN116756296B (en) Consultation information management method and system based on privacy protection
CN116611037B (en) Deep neural network black box watermarking method, device and terminal
CN115600580B (en) Text matching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant