CN116701637B - Zero sample text classification method, system and medium based on CLIP - Google Patents


Info

Publication number
CN116701637B
Authority
CN
China
Prior art keywords
text
image
classification
label
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310778409.5A
Other languages
Chinese (zh)
Other versions
CN116701637A (en)
Inventor
覃立波
李勤政
王玮赟
陈麒光
车万翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310778409.5A priority Critical patent/CN116701637B/en
Publication of CN116701637A publication Critical patent/CN116701637A/en
Application granted granted Critical
Publication of CN116701637B publication Critical patent/CN116701637B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/82 Arrangements for image or video recognition or understanding using neural networks
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a CLIP-based zero-sample text classification method, system, and medium. The method comprises the following steps: S1: acquiring the text to be classified; S2: inputting the text into a text encoder to obtain a text vector, and inputting the images of the text-image set into an image encoder to obtain image vectors; S3: performing a calculation on the text vector and the image vectors to obtain the degree of similarity between the pictures and the text; S4: performing prediction matching according to the current classification task type and the computed degree of similarity to obtain the text classification result. By combining text information with image information and applying the combination to natural language processing, the text classification task is reconstructed into a text-image matching task that the CLIP model can solve, and the precision of text classification is improved.

Description

Zero sample text classification method, system and medium based on CLIP
Technical Field
The invention relates to the technical field of the Internet, and in particular to a CLIP-based zero-sample text classification method, system, and medium.
Background
With the increasing maturity of Internet technology, and in particular the continuous progress of deep learning and natural language processing, text classification technology has advanced greatly. Text classification is also widely applied in real life, for example in intelligent customer service and intelligent mailboxes, where it supports services such as automatically identifying the type of incoming messages and automatically detecting illegal content; in the video-platform field it can help auditors automatically tag and classify related content, greatly saving manpower and material resources and improving people's experience of life. Meanwhile, as a model pre-trained on a massive text-image dataset, CLIP can directly complete text-image matching in a specified field without using any examples, i.e., zero-sample (zero-shot) learning.
However, existing studies of the text classification problem attend only to the semantic information in the input text and ignore very valuable image information. For example, when a person sees the words "the corners of his mouth are raised", a smiling picture first appears in the mind; the person is then reasonably considered to be happy, and the emotion expressed by the words is correspondingly classified as "happy". This process combines the dual information of text and images, making the classification result more accurate. In the current text classification field, however, text information and image information have not yet been combined and applied to natural language tasks.
Disclosure of Invention
The invention provides a CLIP-based zero-sample text classification method, system, and medium, which solve the problem that text information and image information have not been combined and applied to natural language tasks.
In a first aspect, the present invention provides a zero-sample text classification method based on CLIP, including:
S1: acquiring the text to be classified;
S2: inputting the text into a text encoder to obtain a text vector, and inputting the images of the text-image set into an image encoder to obtain image vectors;
S3: performing a calculation on the text vector and the image vectors to obtain the degree of similarity between the pictures and the text;
S4: performing prediction matching according to the current classification task type and the computed degree of similarity to obtain the text classification result (a compact sketch of how these steps compose is given below).
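For orientation, the following is a minimal sketch of how steps S1-S4 compose, assuming text and image vectors produced by a CLIP-style encoder pair; all names and the threshold value are illustrative, and the detailed variants are given in the embodiments below.

    from typing import Dict, List, Union
    import torch

    def classify(text_vector: torch.Tensor,                 # T, shape (d,)
                 label_vectors: Dict[str, torch.Tensor],    # y_i -> I_i, shape (d,)
                 single_label: bool = True,
                 threshold: float = 0.25) -> Union[str, List[str]]:
        # S3: degree of similarity as a dot product per candidate label
        scores = {y: float(text_vector @ I) for y, I in label_vectors.items()}
        # S4: prediction matching according to the task type
        if single_label:
            return max(scores, key=scores.get)               # highest-similarity class
        return [y for y, s in scores.items() if s > threshold]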
Further, the text-image set acquisition process comprises the following steps:
S21: acquiring a text set and a label set according to the texts to be classified; the text set is the set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
S22: randomly downloading one picture for each tag in the tag set to obtain an image set formed by all downloaded pictures;
S23: converting the text-label set into a text-image set.
Alternatively, the text-image set acquisition process comprises the following steps:
S21: acquiring a text set and a label set according to the texts to be classified; the text set is the set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
S22: randomly downloading a plurality of pictures for each tag in the tag set to perform ensemble enhancement, obtaining an image set formed by all downloaded pictures;
S23: converting the text-label set into a text-image set.
Further, the specific process of converting the text-label set into the text-image set is as follows:
according to the type of each label, the corresponding picture obtained in S22 is used as a replacement, so that the text-label set {(x_i, y_i)}_{i=1}^{N} is mapped to the text-image set {(x_i, V_i^M)}_{i=1}^{N}, where x_i is the i-th text, y_i is the i-th label, V_i^M is the set of M pictures corresponding to y_i, and N is the number of texts in the test set.
Further, after the text-image set is acquired, an additional semantic cue word is added before the beginning of each text in the test set for prompt enhancement, expressed as:
x̂ = Prompt ⊕ x
where Prompt is the semantic prompt for the specific task of a given text classification test set, x is the text in the test set, ⊕ denotes concatenation, and x̂ is the text after adding the additional semantic cue word.
Further, the similarity degree is calculated by performing dot product operation on the text vector and the image vector.
Further, the classification task types comprise a single-label classification task and a multi-label classification task.
Further, the process of obtaining the classified result by performing prediction matching according to the calculated similarity and the current classification task type specifically comprises the following steps:
if the classification task type is a single-label classification task, selecting the category with the highest similarity degree as a final matching result;
if the classification task type is the multi-label classification task, selecting the class with the similarity degree larger than a preset threshold value as a final matching result.
In a second aspect, the present invention provides a CLIP-based zero-sample text classification system, comprising:
A data acquisition module, configured to acquire the text to be classified;
a coding module, configured to input the text into a text encoder to obtain a text vector, and to input the images of the text-image set into an image encoder to obtain image vectors;
a classification prediction module, configured to perform a calculation on the text vector and the image vectors to obtain the degree of similarity between the pictures and the text, and to perform prediction matching according to the computed similarity and the current classification task type to obtain the text classification result.
In a third aspect, the present invention provides a computer-readable storage medium storing a computer program which, when invoked by a processor, performs the steps of the method described above.
Advantageous effects
The invention provides a CLIP-based zero-sample text classification method, system, and medium. By combining text information with image information and applying the combination to natural language processing, the text classification task is reconstructed into a text-image matching task that the CLIP model can solve, improving the precision of text classification.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the CLIP-based zero-sample text classification method provided by an embodiment of the invention;
FIG. 2 is an exemplary diagram of a zero sample text classification method based on CLIP provided by an embodiment of the invention;
FIG. 3 is a text image matching architecture diagram of a CLIP model provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the prompt enhancement mode provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the ensemble enhancement mode provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below. It will be apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the invention as defined by the claims.
Example 1
As shown in FIG. 1, this embodiment provides a CLIP-based zero-sample text classification method, including:
S1: acquiring the text to be classified. In this embodiment, as shown in FIG. 2, the acquired text to be classified is "Bye…".
S2: inputting the text into a text encoder to obtain text vectors, and inputting the images of the text-image set into an image encoder to obtain image vectors.
Specifically, the text image set acquisition process comprises the following steps:
S21: acquiring a text set and a label set according to the texts to be classified; the text set is the set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong. The number of texts in the text set is the same as the number of labels in the label set. For example, the text set {A, B, C, D} contains 4 texts; if the labels of text A and text B are both a, the label of text C is c, and the label of text D is d, then the label set corresponding to the text set is {a, a, c, d}.
In this embodiment, the test-set data is a text-label set {(x_i, y_i)}_{i=1}^{N}. The text set is the Test Set, which contains a plurality of texts and is denoted Test Set = {x_1, x_2, …, x_N}, where x_N is the N-th text. The tag set is the Label Set, which includes a plurality of tags such as "fear", "anger", "joy", …, "surprise", and is denoted Label Set = {y_1, y_2, …, y_N}, where y_N is the N-th tag. In this embodiment, multiple texts may correspond to the same tag, or a single text may correspond to a single tag; each text can be classified as one of the set of labels. The texts in the text set and the labels in the label set correspond one-to-one in order, i.e., x_i corresponds to y_i.
S22: randomly downloading one picture for each tag in the tag set to obtain an image set formed by all downloaded pictures.
In this embodiment, for each tag in the Label Set, a picture of the corresponding class is randomly selected and downloaded from the Internet to form the Image Set. The picture data includes, in sequence, a fear picture, a happy picture, …, and a surprise picture, recorded as Image Set = {v_1, v_2, …, v_N}, where v_N is the picture corresponding to the N-th label.
S23: the text label set is converted into a text image set.
In this embodiment, according to the type of each tag, the corresponding picture obtained in S22 is used as a replacement, so that Label Set = {y_1, y_2, …, y_N} = {"fear", "anger", "joy", …, "surprise"} is mapped to Image Set = {v_1, v_2, …, v_N}. That is, the text-label set {(x_i, y_i)}_{i=1}^{N} is mapped to the text-image set {(x_i, V_i)}_{i=1}^{N}, where x_i is the i-th text, y_i is the i-th tag, V_i is the picture set corresponding to y_i, and N is the number of texts (equivalently, the number of labels) in the test set. In this embodiment only one picture is downloaded per tag, so M in V_i^M is 1 and the superscript is omitted: V_i is simply v_i. The texts and labels correspond one-to-one in order, i.e., x_i corresponds to y_i, so the number of texts equals the number of labels.
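As a concrete illustration of S21-S23, the following minimal sketch builds the text-image set from a text-label test set, assuming the per-label pictures have already been downloaded; the helper name build_text_image_set and the file paths are hypothetical, not prescribed by the method.

    from typing import Dict, List, Tuple

    def build_text_image_set(
            texts: List[str],                        # Test Set: x_1 ... x_N
            labels: List[str],                       # Label Set: y_1 ... y_N, aligned with texts
            label_to_images: Dict[str, List[str]],   # per-label picture paths downloaded in S22
    ) -> List[Tuple[str, List[str]]]:
        # S23: replace each label y_i by its picture set V_i (M = 1 in this embodiment)
        assert len(texts) == len(labels), "texts and labels correspond one-to-one"
        return [(x, label_to_images[y]) for x, y in zip(texts, labels)]

    # Illustrative usage with one picture per label:
    texts = ["I felt fear when my mother was heavily ill"]
    labels = ["fear"]
    label_to_images = {"fear": ["images/fear.jpg"], "anger": ["images/anger.jpg"]}
    text_image_set = build_text_image_set(texts, labels, label_to_images)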
The texts and the image set of the text-image set are input into a trained CLIP model. The text encoder E_text of the CLIP model encodes each text x_i to obtain the text vector T_i, and the image encoder E_image encodes each image v_i of the image set to obtain the image vector I_i, expressed as:
T_i = E_text(x_i), I_i = E_image(v_i)
S3: performing a calculation on the text vector and the image vectors to obtain the degree of similarity between the pictures and the text.
In this embodiment, a dot product operation T · I_i is performed on the computed text vector T and each image vector I_i to calculate the degree of similarity between the images in the image set and the text. The text encoder E_text adopts a Transformer network with 12 layers, a width of 512, and 8 attention heads; the image encoder E_image uses a ResNet or a Vision Transformer.
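A minimal sketch of this encoding-and-similarity step follows, using OpenAI's public clip package as one possible implementation (the patent does not mandate a particular library); the image paths are hypothetical, and the vectors are L2-normalized before the dot product, as is common practice with CLIP.

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)   # ViT image-encoder variant

    text = "I felt fear when my mother was heavily ill"
    image_paths = ["images/fear.jpg", "images/anger.jpg", "images/joy.jpg"]  # illustrative

    with torch.no_grad():
        # T = E_text(x): encode the text to be classified
        T = model.encode_text(clip.tokenize([text]).to(device))
        # I_i = E_image(v_i): encode each label's picture
        images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
        I = model.encode_image(images)

    # Degree of similarity via dot product of the normalized vectors
    T = T / T.norm(dim=-1, keepdim=True)
    I = I / I.norm(dim=-1, keepdim=True)
    similarity = (T @ I.T).squeeze(0)   # shape (N,): one score per label picture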
S4: according to the calculated similarity and the current classification task type, prediction matching is carried out to obtain a classification result, namely a text label, and the specific process is as follows:
If the classification task type is a single-label classification task, the category with the highest similarity is selected as the final matching result; if it is a multi-label classification task, every class whose similarity exceeds a preset threshold is selected into the final matching result. The predicted final matching result is:
Result = argmax_i (T · I_i) for a single-label classification task; Result = { y_i | T · I_i > t } otherwise
where Result is the final matching result, Single Label Task denotes the single-label classification task, and t is the preset similarity threshold.
As shown in FIG. 3, the classification task type is a single-label classification task: the text encoding vector T_1 and the image vectors I = (I_1, I_2, …, I_N) undergo dot product operations to calculate the degree of similarity between the images and the text. Because this is single-label classification, the text belongs to exactly one class, and the pair with the highest dot product, T_1 · I_2, is taken as the final matching result. The text "I felt fear when my mother was heavily ill" is thus classified to the label "fear" corresponding to image data v_2.
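Continuing the sketch above, the following lines turn the similarity scores into the final matching result of S4; the label list mirrors the illustrative image paths, and the threshold value is an assumption.

    labels = ["fear", "anger", "joy"]              # aligned with image_paths above

    # Single-label task: the category with the highest dot product wins
    best = labels[int(similarity.argmax())]
    print(best)                                    # e.g. "fear" for the FIG. 3 example

    # Multi-label task: every class whose similarity exceeds the preset threshold t
    t = 0.25                                       # illustrative threshold
    multi = [y for y, s in zip(labels, similarity.tolist()) if s > t]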
Example 2
The present embodiment provides a CLIP-based zero-sample text classification method, which differs from Embodiment 1 in that, after the text-image set is acquired, an additional semantic prompt word is added before the beginning of each text in the test set for prompt enhancement, expressed as:
x̂ = Prompt ⊕ x
where x̂ is the text after adding the additional semantic cue word; Prompt is a semantic prompt for the specific task of a given text classification test set, e.g., Prompt may be taken as "Sentiment" for emotion classification and "Intent" for intent classification; and x is the text in the test set.
As shown in FIG. 4, the semantic cue word is "Topic". For the text "What is an 'imaginary number'?", without prompt enhancement the CLIP model would classify it as Mathematics, which is incomplete. With prompt enhancement the text becomes "Topic: What is an 'imaginary number'?" and can be correctly classified as Science and Mathematics.
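A minimal sketch of this prompt enhancement follows; the mapping from task to cue word echoes the examples given above ("Topic", "Sentiment", "Intent"), and the exact strings are the practitioner's choice rather than fixed by the method.

    TASK_PROMPTS = {"topic": "Topic:", "emotion": "Sentiment:", "intent": "Intent:"}

    def prompt_enhance(x: str, task: str) -> str:
        # x_hat = Prompt (+) x: prepend the cue word before the text start
        return f"{TASK_PROMPTS[task]} {x}"

    print(prompt_enhance("What is an 'imaginary number'?", "topic"))
    # -> Topic: What is an 'imaginary number'?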
Example 3
The present embodiment provides a zero sample text classification method based on CLIP, which is different from embodiment 1 in that the text image set acquisition process is as follows:
S21: acquiring a text set and a label set according to the texts to be classified; the text set is the set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
in this embodiment, the test-set data is a text-label set, recorded as {(x_i, y_i)}_{i=1}^{N}. The text set is the Test Set, which comprises a plurality of texts and is denoted Test Set = {x_1, x_2, …, x_N}, where x_N is the N-th text; the tag set is the Label Set, including the tags "fear", "anger", "joy", …, "surprise", denoted Label Set = {y_1, y_2, …, y_N}, where y_N is the N-th tag. In this embodiment, multiple texts may correspond to the same tag, or a single text may correspond to a single tag; each text can be classified as one of the set of labels.
S22: randomly downloading a plurality of pictures for each tag in a tag set to perform ensemble enhancement to obtain an image set formed by all downloaded pictures;
in this embodiment, as shown in FIG. 5, for each tag in the Label Set, a plurality of pictures is randomly selected and downloaded from the Internet to form the Image Set, which includes fear pictures, happy pictures, …, and surprise pictures, recorded as Image Set = {V_1^M, V_2^M, …, V_N^M} with V_i^M = {v_i^1, v_i^2, …, v_i^M}, where v_i^j is the j-th of the M pictures corresponding to the i-th label, and M is the number of pictures downloaded per tag for ensemble enhancement, 2 in this example.
S23: the text label set is converted into a text image set.
In this embodiment, according to the type of each tag, the corresponding pictures obtained in S22 are used as replacements; after ensemble enhancement, the tag y_i corresponds to V_i^M for i ∈ (1, N). That is, the text-label set {(x_i, y_i)}_{i=1}^{N} is mapped to the text-image set {(x_i, V_i^M)}_{i=1}^{N}, where x_i is the i-th text, y_i is the i-th tag, V_i^M is the set of M pictures corresponding to y_i, and N is the number of texts in the test set.
When the CLIP model matches a text of the test set against the images of the image set, the degree of similarity between the text encoding vector T and the image vectors of the i-th label is:
s_i = Σ_{j=1}^{M} T · I_i^j
where M is the number of pictures downloaded for each tag for ensemble enhancement, and I_i^j is the encoding of the j-th picture of the i-th label. In this embodiment the similarity is obtained by simple addition; in a practical implementation, a specific weight can be set for a specific picture according to actual needs to perform a weighted operation.
For the text "I felt frustrated," angry, utterly detected ". The picture is selected for the" anger "tag if the picture selection effect is not good without the ensable enhancementCorrespondingly, a picture is selected for the "sadness" label->Correspondingly, it can be seen from fig. 5 that erroneous results will be obtained. After the enhancement of the ensamble is adopted, the influence of individual errors of a single picture on a matching result is reduced, and the accuracy is improved.
Example 4
The embodiment provides a zero sample text classification system based on CLIP, which comprises:
A data acquisition module, configured to acquire the text to be classified;
a coding module, configured to input the text into a text encoder to obtain a text vector, and to input the images of the text-image set into an image encoder to obtain image vectors;
a classification prediction module, configured to perform a calculation on the text vector and the image vectors to obtain the degree of similarity between the pictures and the text, and to perform prediction matching according to the computed similarity and the current classification task type to obtain the classification result, i.e., the text label.
Example 5
The present embodiment provides a computer-readable storage medium storing a computer program which, when invoked by a processor, performs the steps of the method described above.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
It should be appreciated that in embodiments of the present invention, the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information on the device type.
The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the controller according to any one of the foregoing embodiments, for example, a hard disk or a memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the controller. Further, the readable storage medium may also include both an internal storage unit and an external storage device of the controller. The readable storage medium is used to store the computer program and other programs and data required by the controller. The readable storage medium may also be used to temporarily store data that has been output or is to be output.
Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or the whole or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned readable storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.

Claims (4)

1. A zero sample text classification method based on CLIP, comprising:
s1: acquiring a text to be classified;
s2: inputting the text into a text encoder to obtain text vectors, and inputting the image set in the text image set into an image encoder to obtain image vectors;
the text image set acquisition process comprises the following steps:
s21: acquiring a text set and a label set according to the text to be classified; the text set is a set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
s22: randomly downloading a picture aiming at each tag in the tag set to obtain an image set formed by all downloaded pictures;
s23: converting the text label set into a text image set;
or, the text image set acquisition process comprises the following steps:
s21: acquiring a text set and a label set according to the text to be classified; the text set is a set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
s22: randomly downloading a plurality of pictures for each tag in a tag set to perform ensemble enhancement to obtain an image set formed by all downloaded pictures;
s23: converting the text label set into a text image set;
the specific process of converting the text label set into the text image set is as follows:
according to the type of each label, the corresponding pictures obtained in S22 are used as replacements, so that the text-label set {(x_i, y_i)}_{i=1}^{N} is mapped to the text-image set {(x_i, V_i^M)}_{i=1}^{N}, where x_i is the i-th text, y_i is the i-th tag, V_i^M is the set of M pictures corresponding to y_i, and N is the number of texts in the test set;
after the text image set is acquired, additional semantic prompt words are added before the beginning of each text in the test set for prompt enhancement, expressed as:
x̂ = Prompt ⊕ x
where Prompt is the semantic prompt for the specific task of a given text classification test set, x is the text in the test set, and x̂ is the text after adding the additional semantic cue words;
s3: calculating the text vector and the image vector to obtain the similarity degree of the picture and the text;
s4: according to the current classification task type and the calculated similarity degree, carrying out prediction matching to obtain a text classification result;
the classification task type comprises a single-label classification task and a multi-label classification task;
the process for obtaining the classified result by carrying out prediction matching according to the similarity obtained by calculation and the current classification task type specifically comprises the following steps: if the classification task type is a single-label classification task, selecting the category with the highest similarity degree as a final matching result;
if the classification task type is the multi-label classification task, selecting the class with the similarity degree larger than a preset threshold value as a final matching result.
2. The CLIP-based zero-sample text classification method of claim 1, wherein said similarity calculation is performed by dot product operation of a text vector and an image vector.
3. A CLIP-based zero-sample text classification system, comprising:
a data acquisition module, configured to acquire the text to be classified;
a coding module, configured to input the text into a text encoder to obtain a text vector, and to input the images of the text-image set into an image encoder to obtain image vectors; wherein the text image set acquisition process comprises the following steps:
s21: acquiring a text set and a label set according to the text to be classified; the text set is a set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
s22: randomly downloading a picture aiming at each tag in the tag set to obtain an image set formed by all downloaded pictures;
s23: converting the text label set into a text image set;
or, the text image set acquisition process comprises the following steps:
s21: acquiring a text set and a label set according to the text to be classified; the text set is a set of texts to be classified, and the tag set is the set of classifications to which the texts to be classified may belong;
s22: randomly downloading a plurality of pictures for each tag in a tag set to perform ensemble enhancement to obtain an image set formed by all downloaded pictures;
s23: converting the text label set into a text image set;
the specific process of converting the text label set into the text image set is as follows:
according to the type of each label, the corresponding pictures obtained in S22 are used as replacements, so that the text-label set {(x_i, y_i)}_{i=1}^{N} is mapped to the text-image set {(x_i, V_i^M)}_{i=1}^{N}, where x_i is the i-th text, y_i is the i-th tag, V_i^M is the set of M pictures corresponding to y_i, and N is the number of texts in the test set;
after the text image set is acquired, additional semantic prompt words are added before the beginning of each text in the test set for prompt enhancement, expressed as:
x̂ = Prompt ⊕ x
where Prompt is the semantic prompt for the specific task of a given text classification test set, x is the text in the test set, and x̂ is the text after adding the additional semantic cue words;
a classification prediction module, configured to perform a calculation on the text vector and the image vectors to obtain the degree of similarity between the pictures and the text, and to perform prediction matching according to the computed similarity and the current classification task type to obtain the text classification result; the classification task types comprise a single-label classification task and a multi-label classification task;
the process for obtaining the classified result by carrying out prediction matching according to the similarity obtained by calculation and the current classification task type specifically comprises the following steps: if the classification task type is a single-label classification task, selecting the category with the highest similarity degree as a final matching result;
if the classification task type is the multi-label classification task, selecting the class with the similarity degree larger than a preset threshold value as a final matching result.
4. A computer-readable storage medium, characterized by: a computer program is stored which, when called by a processor, performs: the method of any one of claims 1-2.
CN202310778409.5A 2023-06-29 2023-06-29 Zero sample text classification method, system and medium based on CLIP Active CN116701637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310778409.5A CN116701637B (en) 2023-06-29 2023-06-29 Zero sample text classification method, system and medium based on CLIP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310778409.5A CN116701637B (en) 2023-06-29 2023-06-29 Zero sample text classification method, system and medium based on CLIP

Publications (2)

Publication Number Publication Date
CN116701637A CN116701637A (en) 2023-09-05
CN116701637B true CN116701637B (en) 2024-03-08

Family

ID=87823836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310778409.5A Active CN116701637B (en) 2023-06-29 2023-06-29 Zero sample text classification method, system and medium based on CLIP

Country Status (1)

Country Link
CN (1) CN116701637B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935418B (en) * 2023-09-15 2023-12-05 成都索贝数码科技股份有限公司 Automatic three-dimensional graphic template reorganization method, device and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01131960A (en) * 1988-10-21 1989-05-24 Toshiba Corp Document and image filing device
EP1871064A1 (en) * 2006-06-19 2007-12-26 Research In Motion Limited Device for transferring information
CN113449808A (en) * 2021-07-13 2021-09-28 广州华多网络科技有限公司 Multi-source image-text information classification method and corresponding device, equipment and medium
CN113836298A (en) * 2021-08-05 2021-12-24 合肥工业大学 Text classification method and system based on visual enhancement
CN114239560A (en) * 2021-12-03 2022-03-25 上海人工智能创新中心 Three-dimensional image classification method, device, equipment and computer-readable storage medium
CN115393902A (en) * 2022-09-26 2022-11-25 华东师范大学 Pedestrian re-identification method based on comparison language image pre-training model CLIP
CN115761314A (en) * 2022-11-07 2023-03-07 重庆邮电大学 E-commerce image and text classification method and system based on prompt learning
CN115761757A (en) * 2022-11-04 2023-03-07 福州大学 Multi-mode text page classification method based on decoupling feature guidance
CN116226688A (en) * 2023-05-10 2023-06-06 粤港澳大湾区数字经济研究院(福田) Data processing, image-text searching and image classifying method and related equipment
CN116702091A (en) * 2023-06-21 2023-09-05 中南大学 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN117421639A (en) * 2023-11-03 2024-01-19 中南大学 Multi-mode data classification method, terminal equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8000535B2 (en) * 2007-06-18 2011-08-16 Sharp Laboratories Of America, Inc. Methods and systems for refining text segmentation results
US20140270347A1 (en) * 2013-03-13 2014-09-18 Sharp Laboratories Of America, Inc. Hierarchical image classification system
GB2586265B (en) * 2019-08-15 2023-02-15 Vision Semantics Ltd Text based image search

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01131960A (en) * 1988-10-21 1989-05-24 Toshiba Corp Document and image filing device
EP1871064A1 (en) * 2006-06-19 2007-12-26 Research In Motion Limited Device for transferring information
CN113449808A (en) * 2021-07-13 2021-09-28 广州华多网络科技有限公司 Multi-source image-text information classification method and corresponding device, equipment and medium
CN113836298A (en) * 2021-08-05 2021-12-24 合肥工业大学 Text classification method and system based on visual enhancement
CN114239560A (en) * 2021-12-03 2022-03-25 上海人工智能创新中心 Three-dimensional image classification method, device, equipment and computer-readable storage medium
CN115393902A (en) * 2022-09-26 2022-11-25 华东师范大学 Pedestrian re-identification method based on comparison language image pre-training model CLIP
CN115761757A (en) * 2022-11-04 2023-03-07 福州大学 Multi-mode text page classification method based on decoupling feature guidance
CN115761314A (en) * 2022-11-07 2023-03-07 重庆邮电大学 E-commerce image and text classification method and system based on prompt learning
CN116226688A (en) * 2023-05-10 2023-06-06 粤港澳大湾区数字经济研究院(福田) Data processing, image-text searching and image classifying method and related equipment
CN116702091A (en) * 2023-06-21 2023-09-05 中南大学 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN117421639A (en) * 2023-11-03 2024-01-19 中南大学 Multi-mode data classification method, terminal equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Bench-marking zero-shot text classification:Datasets ,evaluation and entailment approach";Wenpeng Yin,Jamaal Hay,and DanRoth;《Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing》;20181130;3914-3923 *
"CLIPText: A New Paradigm for Zero-shot Text Classification";Libo QIN;《In Findings of the Association for Computational Linguistics:ACL 2023》;20230731;1077-1088页 *
基于多模态子空间相关性传递的视频语义挖掘;刘亚楠;吴飞;庄越挺;;计算机研究与发展;20090115(01);3-10 *
基于概率潜在语义分析模型的分类融合图像标注;吕海峰;蔡明;;电子技术与软件工程;20180406(07);102-104 *
基于视觉误差与语义属性的零样本图像分类;徐戈;肖永强;汪涛;陈开志;廖祥文;吴运兵;;计算机应用;20181120(04);92-98 *

Also Published As

Publication number Publication date
CN116701637A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
US10504010B2 (en) Systems and methods for fast novel visual concept learning from sentence descriptions of images
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN111324769A (en) Training method of video information processing model, video information processing method and device
CN110363084A (en) A kind of class state detection method, device, storage medium and electronics
CN110704586A (en) Information processing method and system
CN116701637B (en) Zero sample text classification method, system and medium based on CLIP
CN111159417A (en) Method, device and equipment for extracting key information of text content and storage medium
CN114218945A (en) Entity identification method, device, server and storage medium
CN114298157A (en) Short text sentiment classification method, medium and system based on public sentiment big data analysis
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN112132075B (en) Method and medium for processing image-text content
CN116127080A (en) Method for extracting attribute value of description object and related equipment
CN113435499A (en) Label classification method and device, electronic equipment and storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
Kim et al. On text localization in end-to-end OCR-Free document understanding transformer without text localization supervision
US20180047137A1 (en) Automatic correction of facial sentiment of portrait images
CN116737938A (en) Fine granularity emotion detection method and device based on fine tuning large model online data network
CN110704581A (en) Computer-executed text emotion analysis method and device
Shang et al. Deep learning generic features for cross-media retrieval
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
Hossain et al. Attention-based image captioning using DenseNet features
Newnham Machine Learning with Core ML: An iOS developer's guide to implementing machine learning in mobile apps
Yang et al. Automatic metadata information extraction from scientific literature using deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant