CN117009989A - Language model protection method and device and computing device cluster - Google Patents

Language model protection method and device and computing device cluster

Info

Publication number
CN117009989A
CN117009989A (application number CN202310729884.3A)
Authority
CN
China
Prior art keywords: reply, watermark, language model, suspicious, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310729884.3A
Other languages
Chinese (zh)
Inventor
武楚涵
孟笑君
董振华
唐睿明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202310729884.3A priority Critical patent/CN117009989A/en
Publication of CN117009989A publication Critical patent/CN117009989A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/602 - Providing cryptographic facilities or services
    • G06F21/10 - Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 - Program or content traceability, e.g. by watermarking
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A language model protection method, comprising: acquiring a request text input by a user; when the category of the request text belongs to a target category, inputting a target instruction and the request text into a target language model for processing to obtain first reply information to which watermark words have been added, and outputting the first reply information, wherein the target instruction is used to instruct the target language model to add a watermark to the result of processing the request text; and when the category of the request text does not belong to the target category, inputting the request text into the target language model for processing to obtain second reply information, and outputting the second reply information. In this way, when the language model processes a request of a specific type, watermarked reply information can be generated automatically by the language model itself, improving the copyright protection capability of the language model while damaging the quality of the text generated by the model as little as possible.

Description

Language model protection method and device and computing device cluster
Technical Field
This application relates to the technical field of artificial intelligence (AI), and in particular to a language model protection method and apparatus and a computing device cluster.
Background
The large language model (LLM) is one of the most important technologies in the field of natural language processing. Large language models can help users better understand and use language, thereby improving their productivity and communication efficiency. A large language model can perform many tasks, such as machine translation, text writing, code programming, and open question answering. Developing a system based on a large language model requires high machine and labor costs, so the large language model itself is a core asset with which a company builds its AI competitiveness. Research has demonstrated that the functionality of a large language model can be replicated at very low cost through model-stealing techniques, resulting in infringement of the intellectual property of the large language model. Therefore, large language models need to be protected effectively, attacks by model stealing need to be prevented, and existing infringements need to be identifiable.
Disclosure of Invention
The present application provides a language model protection method, an apparatus, a computing device cluster, a computer storage medium, and a computer program product, which can effectively protect a large language model.
In a first aspect, the present application provides a language model protection method, including: acquiring a request text input by a user; when the category of the request text belongs to a target category, inputting a target instruction and the request text into a target language model for processing to obtain first reply information to which watermark words have been added, and outputting the first reply information, wherein the target instruction is used to instruct the target language model to add a watermark to the result of processing the request text; and when the category of the request text does not belong to the target category, inputting the request text into the target language model for processing to obtain second reply information, and outputting the second reply information.
In this way, when the language model processes a request of a specific type, watermarked reply information can be generated automatically by the language model itself, improving the copyright protection capability of the language model while damaging the quality of the text generated by the model as little as possible.
In one possible implementation, the method further includes: and recording watermark information related to the watermark word contained in the first reply information under the condition that the category of the request text belongs to the target category.
In one possible implementation, the method further includes: acquiring a request data set containing at least one request text, wherein the categories of the request texts in the request data set belong to the target category; processing the request data set through a suspicious language model and a reference language model respectively to obtain a suspicious reply set and a reference reply set; processing the request data set through the target language model to obtain a normal reply set and a watermark information set, wherein the watermark information in the watermark information set includes the watermark words corresponding to the normal replies in the normal reply set; and determining, based on the suspicious reply set, the reference reply set, and the watermark information set, whether the suspicious language model is a stolen target language model. Therefore, when a model is suspected of being stolen, the suspicious model, the target language model, and a known non-stolen model can be used to determine whether the suspicious model is a stolen target language model, reducing the difficulty of model authentication.
In one possible implementation, determining whether the suspicious language model is a stolen target language model based on the suspicious reply set, the reference reply set, and the watermark information set includes: extracting, based on the watermark information set, the watermark words contained in the suspicious reply set and the reference reply set respectively, to obtain a suspicious-reply watermark word set and a reference-reply watermark word set; calculating, according to a general corpus, the occurrence probabilities of the watermark words in the suspicious-reply watermark word set and the reference-reply watermark word set respectively, to obtain a probability score of the occurrence of watermark words in each suspicious reply in the suspicious reply set and a probability score of the occurrence of watermark words in each reference reply in the reference reply set; and determining whether the suspicious language model is a stolen target language model based on these probability scores. In this way, whether the suspicious language model is a stolen target language model is determined in a white-box mode, reducing the difficulty of model authentication.
In one possible implementation, determining whether the suspicious language model is a stolen target language model based on the suspicious reply set, the reference reply set, and the watermark information set includes: simulating a model-stealing flow using the normal reply set to obtain a simulated stealing model; processing the request data set through the simulated stealing model to obtain a simulated reply set; splicing the watermark information in the watermark information set with the corresponding reference replies in the reference reply set respectively to obtain a positive sample set, and splicing the watermark information in the watermark information set with the corresponding simulated replies in the simulated reply set respectively to obtain a negative sample set; performing model training using the positive sample set and the negative sample set to obtain an authentication model; and processing the suspicious reply set and the watermark information set through the authentication model to determine whether the suspicious language model is a stolen target language model. In this way, whether the suspicious language model is a stolen target language model is determined in a black-box mode, reducing the difficulty of model authentication.
In a second aspect, the present application provides a language model protection apparatus, including a communication module and a processing module. The communication module is configured to acquire a request text input by a user. The processing module is configured to, when the category of the request text belongs to a target category, input a target instruction and the request text into a target language model for processing, obtain first reply information to which watermark words have been added, and output the first reply information, wherein the target instruction is used to instruct the target language model to add a watermark to the result of processing the request text. The processing module is further configured to, when the category of the request text does not belong to the target category, input the request text into the target language model for processing, obtain second reply information, and output the second reply information.
In one possible implementation, the processing module is further configured to: and recording watermark information related to the watermark word contained in the first reply information under the condition that the category of the request text belongs to the target category.
In one possible implementation, the processing module is further configured to: acquire a request data set containing at least one request text, wherein the categories of the request texts in the request data set belong to the target category; process the request data set through a suspicious language model and a reference language model respectively to obtain a suspicious reply set and a reference reply set; process the request data set through the target language model to obtain a normal reply set and a watermark information set, wherein the watermark information in the watermark information set includes the watermark words corresponding to the normal replies in the normal reply set; and determine, based on the suspicious reply set, the reference reply set, and the watermark information set, whether the suspicious language model is a stolen target language model.
In one possible implementation, when determining whether the suspicious language model is a stolen target language model based on the suspicious reply set, the reference reply set, and the watermark information set, the processing module is specifically configured to: extract, based on the watermark information set, the watermark words contained in the suspicious reply set and the reference reply set respectively, to obtain a suspicious-reply watermark word set and a reference-reply watermark word set; calculate, according to a general corpus, the occurrence probabilities of the watermark words in the suspicious-reply watermark word set and the reference-reply watermark word set respectively, to obtain a probability score of the occurrence of watermark words in each suspicious reply in the suspicious reply set and a probability score of the occurrence of watermark words in each reference reply in the reference reply set; and determine whether the suspicious language model is a stolen target language model based on these probability scores.
In one possible implementation, when determining whether the suspicious language model is a stolen target language model based on the suspicious reply set, the reference reply set, and the watermark information set, the processing module is specifically configured to: simulate a model-stealing flow using the normal reply set to obtain a simulated stealing model; process the request data set through the simulated stealing model to obtain a simulated reply set; splice the watermark information in the watermark information set with the corresponding reference replies in the reference reply set respectively to obtain a positive sample set, and splice the watermark information in the watermark information set with the corresponding simulated replies in the simulated reply set respectively to obtain a negative sample set; perform model training using the positive sample set and the negative sample set to obtain an authentication model; and process the suspicious reply set and the watermark information set through the authentication model to determine whether the suspicious language model is a stolen target language model.
In a third aspect, the present application provides a cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium comprising computer program instructions which, when executed by a computing device, perform the method described in the first aspect or any one of the possible implementations of the first aspect; alternatively, the computer program instructions, when executed by a cluster of computing devices, perform the method described in the first aspect or any one of the possible implementations of the first aspect. For example, one or more computing devices may be included in a cluster of computing devices.
In a fifth aspect, the application provides a computer program product comprising instructions which, when executed by a computing device, cause the computing device to perform the method described in the first aspect or any of the possible implementations of the first aspect, or which, when executed by a computing device cluster, cause the computing device cluster to perform the method described in the first aspect or any of the possible implementations of the first aspect. For example, one or more computing devices may be included in a cluster of computing devices.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
FIG. 1 is a schematic flow chart of a language model protection method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of steps for authenticating a model by a white-box method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a step of authenticating a model by a black box method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a language model protection process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a model authentication process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process for authenticating a model by a white-box approach according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a process for authenticating a model by a black box method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a language model protection device according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a computing device cluster according to an embodiment of the present application;
FIG. 11 is a schematic diagram of another computing device cluster provided by an embodiment of the application.
Detailed Description
The term "and/or" herein is an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. The symbol "/" herein indicates that the associated object is or is a relationship, e.g., A/B indicates A or B.
The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects. For example, the first response message and the second response message, etc. are used to distinguish between different response messages, and are not used to describe a particular order of response messages.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "such as" in the embodiments of the present application should not be construed as being preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise specified, "plurality" means two or more; for example, a plurality of processing units means two or more processing units, and a plurality of elements means two or more elements.
In general, in order to prevent models from being stolen, model watermarking technology is often used to embed specific identification information in a machine learning model so that the source of the model can be tracked and identified. A watermarked large language model tends to generate text with a particular pattern. When model stealing is performed on the generated watermarked data, the model obtained by training inherits this pattern. Thus, the rights owner of the original model can verify whether a suspicious model was trained using data generated by the original model by comparing the output of the owned model with the output of the suspicious model.
Watermarking schemes for large language models are mainly based on two ideas: the first is to randomly partition the vocabulary of the large language model and change the probability of different parts appearing in the generated result; the second is to introduce rare words or specific phrases into the pre-training data set and output them in a specific format as a backdoor trigger. However, in the first scheme, artificially changing the occurrence probability of part of the vocabulary damages the quality of the generated text; in particular, in domains with many proper nouns, terms often have no replaceable synonym, so language fluency suffers. Moreover, artificially partitioning the vocabulary to change generation probabilities is easy for an attacker to detect by comparing the word-frequency distribution of the generated text with the word distribution of a natural corpus, so the concealment of this watermarking approach is poor. In the second scheme, because the amount of data constructed for model stealing is not large, the backdoor-triggering watermark added to the pre-training data set may never be triggered, so the watermark information is easily lost during model-stealing training and is difficult to use for authentication.
In view of this, an embodiment of the present application provides a language model protection method, which classifies a request input by a user; if the request category belongs to a target category set, an instruction requesting the model to generate a watermark is input to the language model together with the request, so that the language model autonomously generates watermarked protection text. In this way, the watermark can be embedded in the generated text in an imperceptible manner without damaging the quality of the generated text. Further, a suspicious model is authenticated either by a white-box method that performs regular (string) matching against the watermark words added by the language model, or by a black-box method that trains an authentication model using replies generated by the normal model and by a simulated stealing model, so that whether the model has been stolen can be identified simply and conveniently, reducing the difficulty of authenticating infringement of the intellectual property related to the model.
Fig. 1 is a schematic flow chart of a language model protection method according to an embodiment of the present application. It is understood that the method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. As shown in fig. 1, the language model protection method may include the following steps:
S101, acquiring a request text input by a user.
In this embodiment, when using the language model, the user may input a request text, for example, on a terminal device. After the user finishes inputting and issues a confirmation instruction, the request text input by the user can be obtained. Illustratively, the request text may include, but is not limited to, a question that the user wants the language model to answer.
S102, judging whether the category of the request text belongs to the target category.
In this embodiment, the request text may be classified by a rule-based method (such as keyword matching), a conventional machine learning method, or a deep-learning-based language model, to obtain the category to which the request text belongs. After the category of the request text is obtained, it may be determined whether that category belongs to the target category. A request text belonging to the target category characterizes a non-factual question, i.e., a question not related to specific facts, data, or objective reality. For example, the target categories may be "text writing", "open question answering", "recommendation", and the like, while non-target categories may be "mathematical calculation", "factual question answering", and the like.
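As an illustration of the rule-based branch described above, the following is a minimal sketch of keyword-based request classification; the category names and keyword lists are assumptions made for the example and are not specified by this embodiment.

```python
# Minimal sketch of rule-based request classification (S102).
# The keyword lists and category names below are illustrative assumptions.

TARGET_CATEGORIES = {"text writing", "open question answering", "recommendation"}

CATEGORY_KEYWORDS = {
    "text writing": ["write", "draft", "compose", "essay"],
    "open question answering": ["why", "what do you think", "explain"],
    "recommendation": ["recommend", "suggest", "best"],
    "mathematical calculation": ["calculate", "solve", "+", "="],
    "factual question answering": ["when was", "who is", "capital of"],
}

def classify_request(request_text: str) -> str:
    """Return the first category whose keywords appear in the request."""
    text = request_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "factual question answering"  # fallback category (assumption)

def belongs_to_target_category(request_text: str) -> bool:
    return classify_request(request_text) in TARGET_CATEGORIES

# Example: a "recommendation" request would take the watermarking branch (S103).
print(belongs_to_target_category("Please recommend a few science fiction novels."))
```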
When the category of the request text belongs to the target category, S103 may be performed. When the category of the request text does not belong to the target category, S104 may be performed.
S103, when the category of the request text belongs to the target category, inputting a target instruction and the request text into the language model for processing, obtaining first reply information to which watermark words have been added, and outputting the first reply information, wherein the target instruction is used to instruct the language model to add a watermark to the result of processing the request text.
In this embodiment, when the category of the request text belongs to the target category, the target instruction and the request text input by the user may be spliced together and input into the language model for processing, so as to obtain the first reply information to which watermark words have been added. The target instruction is used to instruct the language model to add a watermark to the result of processing the request text. For example, the target instruction may be spliced before the request text, so that the language model executes the target instruction first and then processes the request text. In some embodiments, the target instruction may also instruct the language model to give a rareness score for each watermark word it generates. For example, the target instruction may be "Please add some special words or phrases as watermarks to the generated text, and give the watermark words and their rareness scores after the generated text." The watermark words added to the first reply information may be generated randomly by the language model, or may be selected by the language model from a preset watermark word set, which is not limited here.
Further, after the first reply information is obtained, the first reply information may be output; for example, the first reply information is transmitted to a terminal device used by the user so that it is displayed on the terminal device. In addition, before or after the language model outputs the first reply information, the watermark information related to the watermark words in the first reply information may be recorded. The watermark information may include the watermark words added to the first reply information, the rareness scores of the added watermark words, and the like.
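The following is a minimal sketch of the splicing and recording described above, assuming a generic `generate(prompt)` callable stands in for the target language model and that the model appends its watermark words after a fixed separator; the instruction wording, separator format, and record structure are illustrative assumptions.

```python
# Sketch of splicing the target instruction before the request text (S103)
# and recording the watermark information returned by the model.
# `generate` is a placeholder for the target language model; the separator
# "Watermark words:" is an assumed output format, not defined by this embodiment.

from typing import Callable, Dict, List

TARGET_INSTRUCTION = (
    "Please add some special words or phrases as watermarks to the generated "
    "text, and give the watermark words and their rareness scores after the text."
)

watermark_log: List[Dict] = []  # records watermark information per reply

def answer_with_watermark(request_text: str, generate: Callable[[str], str]) -> str:
    prompt = TARGET_INSTRUCTION + "\n" + request_text  # splice instruction first
    raw_output = generate(prompt)
    # Assumed format: "<reply text>\nWatermark words: word1 (0.9), word2 (0.7)"
    reply, _, watermark_section = raw_output.partition("\nWatermark words:")
    watermark_log.append({
        "request": request_text,
        "watermark_info": watermark_section.strip(),
    })
    return reply  # first reply information, with watermark words embedded
```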
S104, under the condition that the category of the request text does not belong to the target category, inputting the request text into a language model for processing to obtain second reply information, and outputting the second reply information.
In this embodiment, when the category of the request text input by the user does not belong to the target category, the request text input by the user may be input into the language model for processing, to obtain the second reply information to which the watermark word is not added, and to output the second reply information.
In this way, when the language model processes a request of a specific type, watermarked reply information can be generated automatically by the language model itself, improving the copyright protection capability of the language model while damaging the quality of the text generated by the model as little as possible.
The above is an introduction to the language model protection method provided by the embodiment of the present application. Although this approach can protect the language model, the model may still be stolen, and the question of authentication arises when a model is suspected of having been stolen. The following describes how authentication is performed in the embodiments of the present application. Authentication may be performed in a white-box mode or a black-box mode; the two modes are described separately below.
(1) White box mode
Fig. 2 shows a schematic diagram of steps of a method for model authentication by means of a white-box method. As shown in fig. 2, the method for performing model authentication by a white-box mode may include the following steps:
s201, acquiring a request data set containing at least one request text, wherein the types of the request text in the request data set belong to the target types.
In this embodiment, the request data set may be a set of randomly constructed request texts belonging to the target category, or a set of request texts generated from real service data and belonging to the target category. The request data set may be generated automatically by a model or constructed manually, by way of example and not limitation.
S202, processing a request text in a request data set through a suspicious language model to obtain a suspicious reply set, wherein one request text corresponds to one suspicious reply in the suspicious reply set.
In this embodiment, the request texts in the request data set may be sent to an application programming interface (API) of the suspicious language model to call the API, so that the suspicious language model processes the request texts in the request data set to obtain a suspicious reply set. The suspicious reply set includes at least one suspicious reply, and each suspicious reply corresponds to one request text. Illustratively, the suspicious language model is a language model suspected of having been obtained by stealing.
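As a sketch of how the suspicious reply set might be collected through the suspicious model's API, the snippet below posts each request text to a hypothetical endpoint; the URL, authentication header, and JSON field names are assumptions and would need to match the actual service.

```python
# Sketch of collecting a suspicious reply set by calling the suspicious model's
# API (S202). The endpoint URL, authorization header, and JSON fields are
# hypothetical; substitute the actual API of the suspicious service.

import requests

SUSPICIOUS_API_URL = "https://suspicious-model.example.com/v1/generate"  # hypothetical

def query_suspicious_model(request_texts, api_key):
    suspicious_replies = []
    for text in request_texts:
        response = requests.post(
            SUSPICIOUS_API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            json={"prompt": text},
            timeout=60,
        )
        response.raise_for_status()
        suspicious_replies.append(response.json()["reply"])  # assumed field name
    return suspicious_replies
```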
S203, processing the request text in the request data set through a reference language model to obtain a reference reply set, wherein one request text corresponds to one reference reply in the reference reply set.
In this embodiment, the reference reply set may be obtained by processing the request texts in the request data set through the reference language model. The reference reply set includes at least one reference reply, and each reference reply corresponds to one request text. The reference language model is a model known not to be stolen and is different from the model to be protected.
S204, processing the request text in the request data set through a protection language model to obtain a normal reply set, and recording watermark information corresponding to each normal reply in the normal reply set to obtain a watermark information set; wherein one request text corresponds to one normal reply in the normal reply set.
In this embodiment, the request text in the request data set may be processed through the protection language model to obtain a normal reply set, and watermark information corresponding to each normal reply in the normal reply set is recorded to obtain a watermark information set. Wherein one request text corresponds to one normal reply in the normal reply set. The protection language model may be, for example, a language model that requires protection.
S205, based on the watermark information set, extracting watermark words contained in the suspicious reply set and the reference reply set respectively to obtain the suspicious reply watermark word set and the reference reply watermark word set.
In this embodiment, the watermark words contained in the normal reply set may be obtained from the watermark information set. Then, these watermark words are searched for in the suspicious reply set and the reference reply set by means of character-string regular matching, so as to extract the watermark words contained in the suspicious reply set and the reference reply set respectively.
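A minimal sketch of the character-string regular matching described above, assuming the watermark information set is represented as a list of watermark-word lists aligned with the replies:

```python
# Sketch of extracting watermark words from a reply set by regular-expression
# string matching (S205). `watermark_info_set[i]` is assumed to be the list of
# watermark words recorded for the i-th request.

import re

def extract_watermark_words(reply_set, watermark_info_set):
    """Return, for each reply, the recorded watermark words it actually contains."""
    extracted = []
    for reply, watermark_words in zip(reply_set, watermark_info_set):
        found = [
            word for word in watermark_words
            # word-boundary match, case-insensitive
            if re.search(r"\b" + re.escape(word) + r"\b", reply, flags=re.IGNORECASE)
        ]
        extracted.append(found)
    return extracted
```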
S206, according to the general corpus, calculating the occurrence probability of watermark words in the suspicious reply watermark word set and the reference reply watermark word set respectively to obtain the occurrence probability score of watermark words in each suspicious reply in the suspicious reply set and the occurrence probability score of watermark words in each reference reply in the reference reply set.
In this embodiment, the occurrence probability scores of the watermark words contained in the suspicious reply set and the reference reply set may be calculated respectively according to an existing general corpus, and the scores of the watermark words in each reply may be summed to obtain the probability score of the occurrence of watermark words in that reply. When calculating these occurrence probability scores, a term frequency-inverse document frequency (TF-IDF) score, an n-gram-based score, or the like may be used.
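A minimal sketch of the scoring step, using a simplified TF-IDF computed from a precomputed document-frequency table of a general corpus; the tokenization and smoothing choices are assumptions for illustration:

```python
# Sketch of S206: score each reply by summing TF-IDF-style scores of the
# watermark words it contains. `doc_freq` and `num_docs` are assumed to come
# from a general corpus prepared in advance.

import math

def watermark_score(reply, watermark_words, doc_freq, num_docs):
    """Sum of TF-IDF scores of the extracted watermark words for one reply."""
    tokens = reply.lower().split()
    total = 0.0
    for word in watermark_words:
        tf = tokens.count(word.lower()) / max(len(tokens), 1)
        idf = math.log((num_docs + 1) / (doc_freq.get(word.lower(), 0) + 1)) + 1.0
        total += tf * idf
    return total

def score_reply_set(reply_set, extracted_words_per_reply, doc_freq, num_docs):
    return [
        watermark_score(reply, words, doc_freq, num_docs)
        for reply, words in zip(reply_set, extracted_words_per_reply)
    ]
```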
S207, determining whether the suspicious language model is a stolen protection language model based on the probability score of the occurrence of the watermark word in each suspicious reply in the suspicious reply set and the probability score of the occurrence of the watermark word in each reference reply in the reference reply set.
In this embodiment, a two-sided t-test may be performed on the set of probability scores consisting of the probability score of the occurrence of watermark words in each suspicious reply in the suspicious reply set and the set of probability scores consisting of the probability score of the occurrence of watermark words in each reference reply in the reference reply set, and a p-value may be calculated. For example, assuming that the suspicious reply set and the reference reply set each include N replies, the probability scores of the occurrence of watermark words in the N suspicious replies in the suspicious reply set may be recorded as [t_1, t_2, …, t_N], and the probability scores of the occurrence of watermark words in the N reference replies in the reference reply set may be recorded as [s_1, s_2, …, s_N]. A two-sided t-test can then be performed on the two sets of probability scores and a p-value calculated.
Then, if the average of the probability scores of the occurrence of watermark words in the suspicious replies in the suspicious reply set is higher than the average of the probability scores of the occurrence of watermark words in the reference replies in the reference reply set, and the p-value satisfies p < ε, the suspicious language model is considered to be a stolen protection language model with a confidence of 1 - ε (i.e., a certain probability such as 95%, 90%, etc.), where ε is a preset value. The probability that the suspicious language model is a stolen protection language model may then be output, or a determination of whether the suspicious language model is a stolen protection language model may be output. By way of example, S205 to S207 in fig. 2 can be understood as the process of determining, based on the suspicious reply set, the reference reply set, and the watermark information set, whether the suspicious language model is a stolen protection language model.
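A minimal sketch of the comparison in S207, using `scipy.stats.ttest_ind` for the two-sided t-test; the threshold `EPSILON` corresponds to the preset value ε and its value here is an assumption:

```python
# Sketch of S207: compare the two score sets with a two-sided t-test and apply
# the p < EPSILON decision rule. EPSILON is a preset value chosen by the owner.

from scipy import stats

EPSILON = 0.05  # preset significance level (assumption)

def white_box_verdict(suspicious_scores, reference_scores):
    t_stat, p_value = stats.ttest_ind(suspicious_scores, reference_scores)
    suspicious_mean = sum(suspicious_scores) / len(suspicious_scores)
    reference_mean = sum(reference_scores) / len(reference_scores)
    is_stolen = suspicious_mean > reference_mean and p_value < EPSILON
    return is_stolen, p_value
```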
(2) Black box mode
By way of example, fig. 3 shows a schematic diagram of the steps of a method for model authentication in a black-box mode. For S301, S302, S303, and S304 in fig. 3, reference may be made to the descriptions of S201, S202, S203, and S204 in fig. 2, and details are not repeated here. As shown in fig. 3, the method for performing model authentication in a black-box mode may include the following steps:
s301, acquiring a request data set containing at least one request text, wherein the types of the request text in the request data set belong to the target types.
S302, processing a request text in a request data set through a suspicious language model to obtain a suspicious reply set, wherein one request text corresponds to one suspicious reply in the suspicious reply set.
S303, processing the request text in the request data set through the reference language model to obtain a reference reply set, wherein one request text corresponds to one reference reply in the reference reply set.
S304, processing the request text in the request data set through a protection language model to obtain a normal reply set, and recording watermark information corresponding to each normal reply in the normal reply set to obtain a watermark information set; wherein one request text corresponds to one normal reply in the normal reply set.
S305, simulating a model stealing flow by using the normal reply set so as to train out the simulated stealing model.
In this embodiment, the data in the normal reply set may be used to simulate the model-stealing flow, so as to train a simulated stealing model.
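A minimal sketch of simulating the model-stealing flow in S305 by fine-tuning a small causal language model on pairs of request texts and normal replies; the base model name, hyperparameters, and single-example batching are illustrative assumptions rather than a faithful reproduction of an attacker's setup:

```python
# Sketch of S305: simulate the model-stealing flow by fine-tuning a small
# causal language model on (request, normal reply) pairs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def train_simulated_stealing_model(request_texts, normal_replies, base_model="gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    model.train()
    for request_text, reply in zip(request_texts, normal_replies):
        batch = tokenizer(
            request_text + "\n" + reply,
            return_tensors="pt",
            truncation=True,
            max_length=512,
        )
        # Standard causal-LM objective: the labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model, tokenizer
```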
S306, processing the request text in the request data set through a simulated stealing model to obtain a simulated reply set, wherein one request text corresponds to one simulated reply in the simulated reply set.
In this embodiment, the simulated reply set may be obtained by processing the request text in the request data set through the simulated stealing model. The simulated reply set comprises at least one simulated reply, and one simulated reply corresponds to one request text.
S307, splicing the watermark information in the watermark information set with the corresponding reference replies in the reference reply set respectively to obtain a positive sample set, wherein one reference reply and the watermark information corresponding to that reference reply form one positive sample; and splicing the watermark information in the watermark information set with the corresponding simulated replies in the simulated reply set respectively to obtain a negative sample set, wherein one simulated reply and the watermark information corresponding to that simulated reply form one negative sample.
And S308, performing model training by using the positive sample set and the negative sample set to obtain an authentication model.
In this embodiment, a model to be trained may be trained by using a positive sample set and a negative sample set, so as to obtain an authentication model.
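A minimal sketch of S307 and S308, mirroring the labeling described above (watermark information plus reference reply as a positive sample, watermark information plus simulated reply as a negative sample); the TF-IDF plus logistic-regression classifier and the `[SEP]` separator are stand-ins, since this embodiment does not fix the authentication model's architecture:

```python
# Sketch of S307-S308: build positive/negative samples by concatenation and
# train a binary classifier as the authentication model. `watermark_info_set`
# is assumed to be a list of watermark-information strings aligned with the
# reference and simulated replies.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_samples(watermark_info_set, reference_replies, simulated_replies):
    positives = [info + " [SEP] " + ref
                 for info, ref in zip(watermark_info_set, reference_replies)]
    negatives = [info + " [SEP] " + sim
                 for info, sim in zip(watermark_info_set, simulated_replies)]
    texts = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return texts, labels

def train_authentication_model(watermark_info_set, reference_replies, simulated_replies):
    texts, labels = build_samples(watermark_info_set, reference_replies, simulated_replies)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf
```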
S309, inputting the suspicious reply set and the watermark information set into an authentication model, and determining whether the suspicious language model is a stolen protection language model.
In this embodiment, the suspicious reply set and the watermark information set may be input to the authentication model, so that the authentication model classifies the suspicious replies in the suspicious reply set, and whether the suspicious language model is a stolen protection language model is determined based on the classification result. For example, assuming that the suspicious reply set includes N suspicious replies and the authentication model classifies M of them as positive samples, if M > 0.5N, the p-value of the authentication can be calculated according to a binomial distribution model as:
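(The expression itself did not survive extraction. A plausible reconstruction, assuming that under the null hypothesis each suspicious reply is classified as a positive sample with probability 0.5, is the one-sided binomial tail probability:)

p = \sum_{k=M}^{N} \binom{N}{k} \left(\frac{1}{2}\right)^{N}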
If the p-value satisfies p < ε, the suspicious language model can be considered to be a stolen protection language model with a confidence of 1 - ε (i.e., a certain probability such as 95%, 90%, etc.), where ε is a preset value. The probability that the suspicious language model is a stolen protection language model may then be output, or a determination of whether the suspicious language model is a stolen protection language model may be output. Illustratively, S305 to S309 in fig. 3 can be understood as the process of determining, based on the suspicious reply set, the reference reply set, and the watermark information set, whether the suspicious language model is a stolen protection language model.
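Continuing the black-box sketch above, the decision in S309 can be checked with a short script; `binomtest` from scipy computes the binomial tail, and the `[SEP]` concatenation format and `EPSILON` threshold are the same illustrative assumptions as in the earlier sketches:

```python
# Sketch of the black-box decision step (S309): count how many suspicious
# replies the authentication model labels as positive and test against a
# fair-coin null with a binomial test. EPSILON is a preset value.

from scipy.stats import binomtest

EPSILON = 0.05  # preset significance level (assumption)

def black_box_verdict(authentication_model, suspicious_replies, watermark_info_set):
    texts = [info + " [SEP] " + reply
             for info, reply in zip(watermark_info_set, suspicious_replies)]
    predictions = authentication_model.predict(texts)
    n = len(predictions)
    m = int(sum(predictions))  # number of suspicious replies classified as positive
    p_value = binomtest(m, n, p=0.5, alternative="greater").pvalue
    return (m > 0.5 * n and p_value < EPSILON), p_value
```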
The above is an introduction to the language model protection method and the corresponding model authentication method provided by the embodiment of the application. For ease of understanding, the following examples are presented.
(1) Model protection process
For example, as shown in fig. 4, after the request text input by the user is acquired, it may first be determined whether the request belongs to the target category. If the request text does not belong to the target category, the request text is directly processed by the language model and a reply is generated, with no watermark added to the reply. If it belongs to the target category, an instruction indicating that a watermark should be added is spliced before the request text, and the spliced text is processed by the language model to obtain a watermarked reply and the corresponding watermark information. The watermarked reply may then be output.
(2) Model authentication process
For example, as shown in fig. 5, a request data set may be constructed first, where the requests in the request data set belong to the aforementioned target categories. Then, the request texts in the request data set are processed by the suspicious language model, the protection language model, and the reference language model respectively, so as to obtain a suspicious reply set, a normal reply set together with a watermark information set, and a reference reply set. Finally, authentication can be performed in a white-box or black-box mode based on the suspicious reply set, the reference reply set, and the watermark information set to obtain an authentication result, that is, whether the suspicious language model is a stolen protection language model. When authentication is performed in the black-box mode, the normal reply set output by the protection language model is also required.
(3) White box authentication process
For example, as shown in fig. 6, when authentication is performed in a white-box mode, the watermark words contained in the watermark information set may first be used to perform character-string matching in the suspicious reply set and the reference reply set respectively, so as to obtain the watermark words contained in each. Then, TF-IDF score calculation can be performed on the two reply sets respectively to obtain the probability score of the occurrence of watermark words in each suspicious reply in the suspicious reply set and in each reference reply in the reference reply set. Finally, a statistical test can be performed on the two obtained sets of probability scores to obtain the authentication result.
(4) Black box authentication process
For example, as shown in fig. 7, when authentication is performed in a black-box mode, the normal reply set output by the protection language model may be used to simulate the model-stealing process, so as to obtain a simulated stealing model (i.e., a model known to be obtained by stealing). Then, the corresponding request data set is processed by this known stealing model to obtain a simulated reply set. Next, the watermark information in the watermark information set is spliced with the corresponding replies in the simulated reply set to obtain a negative sample set, and spliced with the corresponding replies in the reference reply set to obtain a positive sample set. Then, a model to be trained can be trained using the positive sample set and the negative sample set to obtain an authentication model. Finally, the suspicious reply set and the watermark information set are processed by the authentication model to obtain the authentication result.
It should be understood that, the sequence number of each step in the foregoing embodiment does not mean the execution sequence, and the execution sequence of each process should be determined by the function and the internal logic, and should not limit the implementation process of the embodiment of the present application. In addition, in some possible implementations, each step in the foregoing embodiments may be selectively performed according to practical situations, and may be partially performed or may be performed entirely, which is not limited herein. In addition, all or part of any features of any of the embodiments described above may be freely combined without contradiction; the combined technical scheme is also within the scope of the application.
Next, a language model protection device provided by the embodiment of the present application is described based on the method in the above embodiment.
Fig. 8 is a schematic structural diagram of a language model protection device according to an embodiment of the present application. As shown in fig. 8, the language model protection apparatus 800 includes: a communication module 801 and a processing module 802. The communication module 801 is configured to obtain a request text input by a user. The processing module 802 is configured to input, when the category of the request text belongs to the target category, a target instruction and the request text into the target language model to process, obtain first reply information added with the watermark word, and output the first reply information, where the target instruction is used to instruct the target language model to add the watermark to the result of processing the request text. The processing module 802 is configured to input the request text to the target language model for processing if the category of the request text does not belong to the target category, obtain the second reply message, and output the second reply message.
In some embodiments, the processing module 802 is further configured to: and recording watermark information related to the watermark word contained in the first reply information under the condition that the category of the request text belongs to the target category.
In some embodiments, the processing module 802 is further configured to: acquire a request data set containing at least one request text, wherein the categories of the request texts in the request data set belong to the target category; process the request data set through a suspicious language model and a reference language model respectively to obtain a suspicious reply set and a reference reply set; process the request data set through the target language model to obtain a normal reply set and a watermark information set, wherein the watermark information in the watermark information set includes the watermark words corresponding to the normal replies in the normal reply set; and determine, based on the suspicious reply set, the reference reply set, and the watermark information set, whether the suspicious language model is a stolen target language model.
In some embodiments, when determining whether the suspicious language model is a stolen target language model based on the suspicious reply set, the reference reply set, and the watermark information set, the processing module 802 is specifically configured to: extract, based on the watermark information set, the watermark words contained in the suspicious reply set and the reference reply set respectively, to obtain a suspicious-reply watermark word set and a reference-reply watermark word set; calculate, according to a general corpus, the occurrence probabilities of the watermark words in the suspicious-reply watermark word set and the reference-reply watermark word set respectively, to obtain a probability score of the occurrence of watermark words in each suspicious reply in the suspicious reply set and a probability score of the occurrence of watermark words in each reference reply in the reference reply set; and determine whether the suspicious language model is a stolen target language model based on these probability scores.
In some embodiments, when determining whether the suspicious language model is a stolen target language model based on the suspicious reply set, the reference reply set, and the watermark information set, the processing module 802 is specifically configured to: simulate a model-stealing flow using the normal reply set to obtain a simulated stealing model; process the request data set through the simulated stealing model to obtain a simulated reply set; splice the watermark information in the watermark information set with the corresponding reference replies in the reference reply set respectively to obtain a positive sample set, and splice the watermark information in the watermark information set with the corresponding simulated replies in the simulated reply set respectively to obtain a negative sample set; perform model training using the positive sample set and the negative sample set to obtain an authentication model; and process the suspicious reply set and the watermark information set through the authentication model to determine whether the suspicious language model is a stolen target language model.
In some embodiments, both the communication module 801 and the processing module 802 shown in fig. 8 may be implemented by software, or may be implemented by hardware. By way of example, the implementation of the communication module 801 will be described next taking the communication module 801 as an example. Similarly, the implementation of the processing module 802 may refer to the implementation of the communication module 801.
As an example in which the module is a software functional unit, the communication module 801 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container, and there may be one or more computing instances. For example, the communication module 801 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, they may be distributed in the same availability zone (AZ) or in different AZs, each AZ comprising one data center or multiple geographically close data centers; typically, a region comprises multiple AZs.
Also, multiple hosts/virtual machines/containers for running the code may be distributed in the same virtual private cloud (virtual private cloud, VPC) or in multiple VPCs. In general, one VPC is disposed in one region, and a communication gateway is disposed in each VPC for implementing inter-connection between VPCs in the same region and between VPCs in different regions.
As an example in which the module is a hardware functional unit, the communication module 801 may include at least one computing device, such as a server. Alternatively, the communication module 801 may be a device implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be implemented as a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
The plurality of computing devices included in the communication module 801 may be distributed in the same region or may be distributed in different regions. The plurality of computing devices included in the communication module 801 may be distributed among the same AZ or may be distributed among different AZ. Likewise, multiple computing devices included in the communication module 801 may be distributed in the same VPC or may be distributed among multiple VPCs. Wherein the plurality of computing devices may be any combination of computing devices such as servers, ASIC, PLD, CPLD, FPGA, and GAL.
It should be noted that, in other embodiments, the communication module 801 may be used to perform any step in the language model protection method described in the foregoing embodiments, the processing module 802 may be used to perform any step in the language model protection method described in the foregoing embodiments, the steps that the communication module 801 and the processing module 802 are responsible for implementing may be specified according to needs, and the communication module 801 and the processing module 802 implement different steps in the language model protection method described in the foregoing embodiments to implement all the functions of the language model protection apparatus 800 shown in fig. 8.
The present application also provides a computing device 900. As shown in fig. 9, the computing device 900 includes: bus 902, processor 904, memory 906, and communication interface 908. Communication between the processor 904, the memory 906, and the communication interface 908 is via the bus 902. Computing device 900 may be a server or an electronic device. It should be understood that the present application is not limited to the number of processors, memories in computing device 900.
The bus 902 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 9, but this does not mean that there is only one bus or one type of bus. The bus 902 may include a path for transferring information between the components of the computing device 900 (e.g., the memory 906, the processor 904, and the communication interface 908).
The processor 904 may include any one or more of a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), a Microprocessor (MP), or a digital signal processor (digital signal processor, DSP).
The memory 906 may include volatile memory, such as random access memory (RAM). The memory 906 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a mechanical hard disk (HDD), or a solid state drive (SSD).
The memory 906 has stored therein executable program codes that the processor 904 executes to implement the functions of the communication module 801 and the processing module 802 shown in fig. 8 described above, respectively, thereby implementing the language model protection method described in the above embodiment. That is, the memory 906 has stored thereon instructions for executing the language model protection method described in the above embodiment.
Alternatively, the memory 906 has stored therein executable codes, which the processor 904 executes to implement the functions of the language model protection device 800 shown in fig. 8 described above, respectively, thereby implementing the language model protection method described in the above embodiment. That is, the memory 906 has stored thereon instructions for executing the language model protection method described in the above embodiment.
The communication interface 908 enables communication between the computing device 900 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card or a transceiver.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be an electronic device such as a desktop, notebook, or smart phone.
As shown in fig. 10, the cluster of computing devices includes at least one computing device 900. The same instructions for performing the language model protection method described in the above embodiments may be stored in the memory 906 in one or more computing devices 900 in the computing device cluster.
In some possible implementations, portions of the instructions for performing the language model protection method described in the above embodiments may instead be stored separately in the memories 906 of one or more computing devices 900 in the computing device cluster. In other words, a combination of one or more computing devices 900 may jointly execute the instructions for performing the language model protection method described in the above embodiments.
It should be noted that, the memory 906 in different computing devices 900 in the computing device cluster may store different instructions for performing part of the functions of the language model protection apparatus 800 shown in fig. 8. That is, instructions stored in memory 906 of different computing devices 900 may implement the functionality of one or more of communications module 801 and processing module 802.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. Wherein the network may be a wide area network or a local area network, etc. Fig. 11 shows one possible implementation. As shown in fig. 11, two computing devices 900A and 900B are connected by a network. Specifically, the connection to the network is made through a communication interface in each computing device. In this type of possible implementation, instructions to perform the functions of the communication module 801 are stored in a memory 906 in the computing device 900A. Meanwhile, instructions to perform the functions of the processing module 802 are stored in the memory 906 in the computing device 900B.
It should be appreciated that the functionality of computing device 900A shown in fig. 11 may also be performed by multiple computing devices 900. Likewise, the functionality of computing device 900B may also be performed by multiple computing devices 900.
The embodiment of the application also provides another computing device cluster. The connections between computing devices in this computing device cluster may be similar to the connections of the computing device cluster described with reference to fig. 10 and 11. The difference is that the memory 906 of one or more computing devices 900 in this computing device cluster may store the same instructions for performing the methods of the previous embodiments.
In some possible implementations, portions of the instructions for performing the language model protection method described above may also be stored separately in the memories 906 of one or more computing devices 900 in the computing device cluster. In other words, a combination of one or more computing devices 900 may jointly execute the instructions for performing the language model protection method described above.
Based on the methods in the above embodiments, embodiments of the present application provide a computer-readable storage medium including computer program instructions that, when executed by a computing device, cause the computing device to perform the methods in the above embodiments; alternatively, when the computer program instructions are executed by a computing device cluster, the computing device cluster performs the methods in the above embodiments. By way of example, the computer-readable storage medium may be any available medium that a computing device can access, or a data storage device, such as a data center, that contains one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives), etc.
Based on the methods in the above embodiments, embodiments of the present application provide a computer program product containing instructions that, when executed by a computing device, cause the computing device to perform the methods in the above embodiments, or that, when executed by a computing device cluster, cause the computing device cluster to perform the methods in the above embodiments.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (central processing unit, CPU), another general purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general purpose processor may be a microprocessor, or may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (random access memory, RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives (SSDs)), etc.
It will be appreciated that the various numbers referred to in the embodiments of the present application are used merely for ease of description and are not intended to limit the scope of the embodiments of the present application.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A method for protecting a language model, the method comprising:
acquiring a request text input by a user;
under the condition that the category of the request text belongs to a target category, inputting a target instruction and the request text into a target language model for processing to obtain first reply information added with watermark words, and outputting the first reply information, wherein the target instruction is used for indicating the target language model to add watermarks in a result of processing the request text;
and under the condition that the category of the request text does not belong to the target category, inputting the request text into the target language model for processing to obtain second reply information, and outputting the second reply information.
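By way of illustration only, the following Python sketch shows one possible realization of the routing described in claim 1 and the recording step of claim 2. The keyword-based category detector, the instruction template, the watermark words, and the model call are all hypothetical placeholders; the claims do not prescribe any of them, and a real deployment would invoke the protected model's inference service.

```python
# Hypothetical sketch of the routing in claims 1 and 2; the classifier, instruction
# template, watermark words, and model stub are illustrative assumptions.

WATERMARK_WORDS = ["luminous", "harbor"]  # assumed watermark vocabulary
TARGET_INSTRUCTION = "Please naturally include the words {words} in your reply.\n"
watermark_log: list[dict] = []  # records watermark information (claim 2)


def classify_request(request_text: str) -> str:
    """Toy category detector: treat free-form writing requests as the target category."""
    keywords = ("write", "draft", "compose", "essay", "story")
    return "target" if any(k in request_text.lower() for k in keywords) else "other"


def target_language_model(prompt: str) -> str:
    """Stand-in for the protected target language model."""
    return f"<model reply to: {prompt!r}>"


def answer(request_text: str) -> str:
    if classify_request(request_text) == "target":
        instruction = TARGET_INSTRUCTION.format(words=", ".join(WATERMARK_WORDS))
        first_reply = target_language_model(instruction + request_text)
        # Record watermark information associated with the first reply (claim 2).
        watermark_log.append({"request": request_text, "watermark_words": WATERMARK_WORDS})
        return first_reply
    # Requests outside the target category are processed without the watermark instruction.
    return target_language_model(request_text)


if __name__ == "__main__":
    print(answer("Write a short story about the sea."))
    print(answer("What is the capital of France?"))
    print(watermark_log)
```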
2. The method according to claim 1, wherein the method further comprises:
and recording watermark information related to the watermark word contained in the first reply information under the condition that the category of the request text belongs to the target category.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
acquiring a request data set containing at least one request text, wherein the categories of the request text in the request data set all belong to the target category;
processing the request data set through a suspicious language model and a reference language model respectively to obtain a suspicious reply set and a reference reply set;
processing the request data set through the target language model to obtain a normal reply set and a watermark information set, wherein watermark information in the watermark information set comprises watermark words corresponding to normal replies in the normal reply set;
determining whether the suspicious language model is the target language model that was stolen based on the suspicious reply set, the reference reply set, and the watermark information set.
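A minimal sketch of the reply-set collection in claim 3 follows, assuming each model is exposed as a plain callable and that the target model also reports the watermark words it used; these interfaces are assumptions for illustration, not part of the claim.

```python
# Hypothetical sketch of claim 3's data collection; the three model callables are placeholders.

def build_reply_sets(request_dataset, suspicious_model, reference_model, target_model):
    suspicious_replies = [suspicious_model(q) for q in request_dataset]
    reference_replies = [reference_model(q) for q in request_dataset]
    normal_replies, watermark_info = [], []
    for q in request_dataset:
        reply, watermark_words = target_model(q)  # assumed to return the reply and its watermark words
        normal_replies.append(reply)
        watermark_info.append(watermark_words)
    return suspicious_replies, reference_replies, normal_replies, watermark_info
```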
4. The method of claim 3, wherein the determining whether the suspicious language model is the target language model that was stolen based on the set of suspicious replies, the set of reference replies, and the set of watermark information comprises:
extracting watermark words contained in the suspicious reply set and the reference reply set respectively based on the watermark information set to obtain a suspicious reply watermark word set and a reference reply watermark word set;
according to the general corpus, respectively calculating the occurrence probability of watermark words in the suspicious reply watermark word set and the reference reply watermark word set to obtain the occurrence probability score of watermark words in each suspicious reply in the suspicious reply set and the occurrence probability score of watermark words in each reference reply in the reference reply set;
and determining whether the suspicious language model is the stolen target language model based on the probability score of the occurrence of the watermark word in each suspicious reply in the suspicious reply set and the probability score of the occurrence of the watermark word in each reference reply in the reference reply set.
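One way to realize the scoring in claim 4 is sketched below; the unigram estimate over the general corpus, the negative-log-probability score, and the mean-difference decision rule are illustrative assumptions, since the claim does not fix a concrete corpus statistic or threshold.

```python
# Illustrative sketch of claim 4's corpus-statistics check; the smoothing, score, and
# decision margin are assumptions, not prescribed by the claim.
import math
from collections import Counter


def corpus_word_probs(general_corpus: list[str]) -> dict[str, float]:
    """Estimate unigram probabilities from a general corpus (add-one smoothing)."""
    counts = Counter(word for doc in general_corpus for word in doc.lower().split())
    total = sum(counts.values()) + len(counts)
    return {word: (c + 1) / total for word, c in counts.items()}


def watermark_score(reply: str, watermark_words: list[str], probs: dict[str, float],
                    unseen_prob: float = 1e-6) -> float:
    """Average negative log-probability of the watermark words found in the reply.
    Rare watermark words that nevertheless appear push the score up."""
    hits = [w for w in watermark_words if w.lower() in reply.lower()]
    return sum(-math.log(probs.get(w.lower(), unseen_prob)) for w in hits) / max(len(watermark_words), 1)


def looks_stolen(suspicious_replies, reference_replies, watermark_info, probs, margin=1.0):
    """Compare the mean watermark-word scores of the suspicious and reference reply sets."""
    sus = [watermark_score(r, wm, probs) for r, wm in zip(suspicious_replies, watermark_info)]
    ref = [watermark_score(r, wm, probs) for r, wm in zip(reference_replies, watermark_info)]
    return sum(sus) / len(sus) > sum(ref) / len(ref) + margin
```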
5. The method of claim 3, wherein the determining whether the suspicious language model is the target language model that was stolen based on the set of suspicious replies, the set of reference replies, and the set of watermark information comprises:
simulating a model stealing process by using the normal reply set to obtain a simulated stealing model;
processing the request data set through the simulated stealing model to obtain a simulated reply set;
splicing watermark information in the watermark information set with corresponding reference replies in the reference reply set respectively to obtain a positive sample set, and splicing watermark information in the watermark information set with corresponding simulated replies in the simulated reply set respectively to obtain a negative sample set;
performing model training by using the positive sample set and the negative sample set to obtain an authentication model;
and processing the suspicious reply set and the watermark information set through the authentication model to determine whether the suspicious language model is the target language model which is stolen.
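The following sketch illustrates one possible form of the authentication model in claim 5; the TF-IDF features, the logistic-regression classifier, and the majority-vote decision are assumptions for illustration, as the claim does not fix a concrete architecture or decision rule.

```python
# Illustrative sketch of claim 5's authentication-model check; feature choice and
# classifier are assumptions, not prescribed by the claim.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def splice(watermark_info, replies):
    """Concatenate each piece of watermark information with its corresponding reply."""
    return [" ".join(words) + " [SEP] " + reply for words, reply in zip(watermark_info, replies)]


def train_authentication_model(watermark_info, reference_replies, simulated_replies):
    # Positive samples: watermark information spliced with reference replies.
    # Negative samples: watermark information spliced with simulated (stolen-model) replies.
    samples = splice(watermark_info, reference_replies) + splice(watermark_info, simulated_replies)
    labels = [1] * len(reference_replies) + [0] * len(simulated_replies)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(samples, labels)
    return model


def suspicious_model_looks_stolen(model, watermark_info, suspicious_replies, threshold=0.5):
    """Flag theft when most suspicious replies are scored like the simulated stolen model."""
    predictions = model.predict(splice(watermark_info, suspicious_replies))
    stolen_like_ratio = sum(1 for p in predictions if p == 0) / len(predictions)
    return stolen_like_ratio > threshold
```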
6. A language model protection device, comprising:
the communication module is used for acquiring a request text input by a user;
the processing module is used for inputting a target instruction and the request text into a target language model for processing under the condition that the category of the request text belongs to a target category, obtaining first reply information added with watermark words, and outputting the first reply information, wherein the target instruction is used for indicating the target language model to add watermarks in a result of processing the request text;
the processing module is further used for inputting the request text into the target language model for processing under the condition that the category of the request text does not belong to the target category, obtaining second reply information, and outputting the second reply information.
7. The apparatus of claim 6, wherein the processing module is further configured to:
and recording watermark information related to the watermark word contained in the first reply information under the condition that the category of the request text belongs to the target category.
8. The apparatus of claim 6 or 7, wherein the processing module is further configured to:
acquiring a request data set containing at least one request text, wherein the categories of the request text in the request data set all belong to the target category;
processing the request data set through a suspicious language model and a reference language model respectively to obtain a suspicious reply set and a reference reply set;
processing the request data set through the target language model to obtain a normal reply set and a watermark information set, wherein watermark information in the watermark information set comprises watermark words corresponding to normal replies in the normal reply set;
determining whether the suspicious language model is the target language model that was stolen based on the suspicious reply set, the reference reply set, and the watermark information set.
9. The apparatus of claim 8, wherein the processing module is configured, when determining whether the suspicious language model is the target language model that was stolen based on the suspicious reply set, the reference reply set, and the watermark information set, to:
extracting watermark words contained in the suspicious reply set and the reference reply set respectively based on the watermark information set to obtain a suspicious reply watermark word set and a reference reply watermark word set;
according to the general corpus, respectively calculating the occurrence probability of watermark words in the suspicious reply watermark word set and the reference reply watermark word set to obtain the occurrence probability score of watermark words in each suspicious reply in the suspicious reply set and the occurrence probability score of watermark words in each reference reply in the reference reply set;
and determining whether the suspicious language model is the stolen target language model based on the probability score of the occurrence of the watermark word in each suspicious reply in the suspicious reply set and the probability score of the occurrence of the watermark word in each reference reply in the reference reply set.
10. The apparatus of claim 8, wherein the processing module is configured, when determining whether the suspicious language model is the target language model that was stolen based on the suspicious reply set, the reference reply set, and the watermark information set, to:
simulating a model stealing process by using the normal reply set to obtain a simulated stealing model;
processing the request data set through the simulated stealing model to obtain a simulated reply set;
splicing watermark information in the watermark information set with corresponding reference replies in the reference reply set respectively to obtain a positive sample set, and splicing watermark information in the watermark information set with corresponding simulated replies in the simulated reply set respectively to obtain a negative sample set;
performing model training by using the positive sample set and the negative sample set to obtain an authentication model;
and processing the suspicious reply set and the watermark information set through the authentication model to determine whether the suspicious language model is the target language model which is stolen.
11. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory;
the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any of claims 1-5.
12. A computer readable storage medium comprising computer program instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1-5, wherein the cluster of computing devices comprises at least one computing device.
13. A computer program product containing instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1-5, wherein the cluster of computing devices comprises at least one computing device.
CN202310729884.3A 2023-06-19 2023-06-19 Language model protection method and device and computing device cluster Pending CN117009989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310729884.3A CN117009989A (en) 2023-06-19 2023-06-19 Language model protection method and device and computing device cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310729884.3A CN117009989A (en) 2023-06-19 2023-06-19 Language model protection method and device and computing device cluster

Publications (1)

Publication Number Publication Date
CN117009989A true CN117009989A (en) 2023-11-07

Family

ID=88571858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310729884.3A Pending CN117009989A (en) 2023-06-19 2023-06-19 Language model protection method and device and computing device cluster

Country Status (1)

Country Link
CN (1) CN117009989A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118153693A (en) * 2024-05-11 2024-06-07 四川蜀天信息技术有限公司 Method, device and computing equipment for improving large language model reasoning concurrency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination