CN117093747A

CN117093747A - Net red classification method based on ChatGPT and pre-training model

Info

Publication number: CN117093747A
Application number: CN202310941131.9A
Authority: CN
Inventors: 陈春秀; 董东; 褚雷
Original assignee: Beijing Multipoint Online Technology Co ltd
Current assignee: Beijing Multipoint Online Technology Co ltd
Priority date: 2023-07-28
Filing date: 2023-07-28
Publication date: 2023-11-21

Abstract

The invention provides a net red classification method based on a ChatGPT and a pre-training model, which comprises the following steps: defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data; selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier; screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.

Description

Net red classification method based on ChatGPT and pre-training model

Technical Field

The invention relates to the AI field, in particular to a net red classification method based on a ChatGPT and a pre-training model.

Background

The core of the network red classification is a classification machine learning model, the classification machine learning model belongs to the category of a supervised learning model, the supervised learning model needs to prepare training data of the network red classification in advance, and the training data generally needs to be marked by relying on manpower.

After the data is marked, a characteristic engineering and a proper model is selected (self-developed) to be prepared. Feature engineering is the conversion of raw data into feature sets that can be used by machine learning algorithms, with the goal of selecting, extracting, and converting meaningful, useful features in the raw data to help the machine learning model better understand the essential structure of the data. The machine learning model is an algorithm or mathematical model used to learn the mapping relationship between input data and output data.

Regarding the quality of the effect of the feature engineering and the model, the multiple rounds of iteration are generally required according to the paths of training, evaluating, retraining, reevaluating and the number of the years until the model meeting the service performance index and the effect index is found.

Finally, the model is in the online stage: the model is deployed to an on-line environment, the characteristics of the object to be predicted are input, the probability of outputting different classification results is predicted through model calculation (Forward feedback), and the classification with the highest probability is selected as the final prediction result.

Thus, the existing operating method of net red classification: training data of manual annotation net red classification is adopted, and then a classification model is trained from 0 to 1 based on the annotation data.

The disadvantages of the prior art are:

1. the manual annotation data is large in quantity and various. The method relies on manual labeling, has low efficiency, long time consumption and high cost, and has insufficient coverage of manual labeling data, thereby easily causing the phenomenon of model cocoon houses;

2. model training is carried out from 0 to 1 based on the labeling data, so that the time consumption is long, the effect is poor and the cost is high; and the insufficient data coverage and magnitude can lead to model training and fitting, and generalization is poor when the model is actually online.

Disclosure of Invention

The invention aims to provide a net red classification method based on a ChatGPT and a pre-training model.

The invention aims to solve the problems of low efficiency, long time consumption and high cost when manually marking data, and solves the problems of long time consumption, poor effect and high cost when training a model from 0 to 1.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

a network red classification method based on a ChatGPT and a pre-training model comprises the following steps: defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data; selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier; screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.

As a further improvement, the calling ChatGPT obtains related search terms of the netbook classification, including: the method comprises the steps of defining task requirements for ChatGPT, requesting the ChatGPT as a network red classifier to give related search words according to network red classification logic; and according to the specific related search word given by the ChatGPT, requesting the ChatGPT to give the results of different languages of the related search word again.

As a further improvement, crawling video information associated with a search term from related search terms includes: and crawling and processing the network red video website by using a web crawler technology, searching the related search words in the network red video website and acquiring the network red video information associated with the related search words.

As a further improvement, the processing the video information to form training data includes: and checking the video information samples, cleaning out the video information samples which are not required, such as the missing value, the abnormal value, the repeated value and the like, and carrying out random arrangement on the rest video information samples to form training data.

As a further refinement, the selection of the pre-training model is performed by selecting a pre-training model, wherein the pre-training model is a multilingual BERT model.

As a further improvement, the forming a preliminary mesh structure on the basis of the pre-training model includes: on the basis of the pre-training model, a Dropout layer and a Softmax layer are added to form a preliminary grid structure.

As a further improvement, the screening of the video published by the net red history includes: the video released in the last period of time by the network red is recalled and used as a video set reflecting the classification characteristics of the network red.

As a further improvement, if the number of videos released in the last period of time of the network red is smaller than a set threshold value, the network red is not classified.

As a further refinement, said invoking said video classifier predicts a video classification result comprising: and calling a video processor to process a network red video set, classifying according to the video characteristics of each video in the video set, and predicting a specific classification result of each video.

As a further improvement, the aggregation calculation is performed according to the video classification result to form a net red classification result, which includes: and selecting a video classification result in the last period of time to perform aggregation calculation, wherein the aggregation calculation result is used as a classification result of the network red at the current time point.

The beneficial effects of the invention are as follows:

by energizing ChatGPT, the output efficiency of related search words of network red classification is improved, and the efficiency is higher than that of manually searching the output search words, and the coverage range is wider;

based on the related search word, the training data is assembled, a training data training model is used for forming a video classifier, historical release videos of the network red are screened, the video classifier is called to predict video classification results, and aggregation calculation is carried out according to the video classification results to form network red classification results, so that the network red classification efficiency is improved.

Drawings

Fig. 1 is a schematic diagram of a network red classification method based on ChatGPT and a pre-training model according to an embodiment of the present invention.

Fig. 2 is a preliminary mesh structure provided by an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.

In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

Referring to fig. 1, a method for classifying network red based on ChatGPT and a pre-training model includes: defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data; selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier; screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.

The calling ChatGPT to obtain related search terms of the net red classification includes: the method comprises the steps of defining task requirements for ChatGPT, requesting the ChatGPT as a network red classifier to give related search words according to network red classification logic; and according to the specific related search word given by the ChatGPT, requesting the ChatGPT to give the results of different languages of the related search word again.

In this embodiment, a set of promt templates is defined: "please play a net red classifier, give relevant video search words according to the characteristics of net red classification", and define task demands to AI;

next, inputting a specific category, and obtaining related search words of the category video, such as "please give similar english search words of clothes and trousers," display one item per line, "" please give similar english search words of clothes and trousers, "display one item per line," and so on.

According to the ChatGPT platform based on OpenAI, the related classified recommended search words can be quickly and efficiently obtained by calling the ChatGPT API interface, and a large number of search words of specific network red classification can be produced with high quality by matching with manual simple screening.

The crawling video information associated with the search term according to the related search term comprises the following steps: and crawling and processing the network red video website by using a web crawler technology, searching the related search words in the network red video website and acquiring the network red video information associated with the related search words.

In this embodiment, a web crawler technology is used to crawl and process a common network red video website (Youtube, tiktok, instagram, etc.), search for related search words generated in the previous step, and acquire network red video information associated with the search words, so as to generate a large amount of video training data.

The processing of the video information to form training data includes: and checking the video information samples, cleaning out the video information samples which are not required, such as the missing value, the abnormal value, the repeated value and the like, and carrying out random arrangement on the rest video information samples to form training data.

The steps obtain a large amount of video multi-mode data, and training data is formed after the processing of the steps:

cleaning: checking data samples, and cleaning out samples which are not required, such as missing values, abnormal values, repeated values and the like, so as to ensure the integrity and purity of the data, thereby ensuring the high ceilings of the final model effect;

confusion: the data collected by the method is generally arranged according to the sequence of the categories, training data of the same category are often concentrated together, and if the data are input into the model, the effect and the robustness of the model are poor. By randomizing and scrambling such comparison data, and then inputting the data to the model, the model is relatively high in effect and robustness.

The selection of a pre-training model, wherein the pre-training model is a multilingual BERT model.

The multilingual BERT model is a pretrained natural language processing model derived from Google, is based on a transducer model architecture, and uses a bi-directional encoder to establish a context representation, thereby supporting multiple natural language processing tasks. Unlike conventional natural language processing models, the multilingual BERT model does not require building different models for different languages, as it is a model that can be used in multiple languages. The model is pre-trained using a large corpus to learn information expressed in terms of underlying words and sentences, as well as correlations between natural language. The model is then adapted to the particular natural language processing task by fine-tuning in downstream tasks (e.g., named entity recognition and emotion analysis).

Because the multilingual BERT model is multilingual, this means that it is not necessary to build different models for different languages. The languages of the model comprise Arabic, chinese, english, french, german, japanese, korean, italian, portuguese, russian, spanish, turkish and the like, and the model has very good expandability and can be very conveniently adapted to new languages. The advantage of the multilingual BERT model is that it can be adapted to different languages and different natural language processing tasks, while also being very easy to adapt to new tasks by fine tuning. A common model is shared among a plurality of languages, so that the construction and maintenance cost of a large-scale corpus can be remarkably reduced, and the reusability of language data is improved. This makes the multilingual BERT model an important tool for implementing cross-language natural language processing.

In summary, the multilingual BERT model is a very powerful, scalable, flexible and efficient natural language processing tool, whose advent has brought great progress and value to the cross-language natural language processing field.

Referring to fig. 2, the forming a preliminary grid structure based on the pre-training model includes: on the basis of the pre-training model, a Dropout layer and a Softmax layer are added to form a preliminary grid structure so as to support the multi-classification requirement of the network red video.

The screening of the video published by the network red history comprises the following steps: the video released in the last period of time by the network red is recalled and used as a video set reflecting the classification characteristics of the network red.

And if the number of videos released by the network red in the last period of time is smaller than a set threshold value, not classifying the network red.

The calling the video classifier to predict the video classification result comprises the following steps: and calling a video processor to process a network red video set, classifying according to the video characteristics of each video in the video set, and predicting a specific classification result of each video.

The aggregation calculation is carried out according to the video classification result to form a net red classification result, which comprises the following steps: and selecting a video classification result in the last period of time to perform aggregation calculation, wherein the aggregation calculation result is used as a classification result of the network red at the current time point.

And if the related search terms of the ChatGPT recommended net red classification do not meet the requirements, returning to redefine the ChatGPT template.

In this embodiment, the video of the last 1 year is recalled from the video of the network red history release, and is used as the video set for responding to the network red classification feature.

In addition, in order to prevent the problem that the classification of the network red is inaccurate because the number of videos published by the partially inactive network red is small (for example, one network red only publishes a laughing video and cannot be simply considered as a laughing blogger), the minimum published video number of 12 for 1 year is given, that is, in the last 1 year, if the number of videos published by the network red is less than 12 (that is, 1 per month), the classification operation is not performed on the network red.

Aiming at the video list of the last 1 year of the last step, a trained model is called, classification is carried out according to video features, and a specific classification result of the video is predicted.

And incrementally updating the video classification result newly released by the network red, selecting the video classification result in the last 1 year, and aggregating the calculated results to be used as the classification result of the current network red and the current time point.

The above examples are only for illustrating the technical scheme of the present invention and are not limiting. It will be understood by those skilled in the art that any modifications and equivalents that do not depart from the spirit and scope of the invention are intended to be within the scope of the appended claims.

Claims

1. The network red classification method based on the ChatGPT and the pre-training model is characterized by comprising the following steps of:

defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data;

selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier;

screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.

2. The method for classifying network red based on ChatGPT and pre-training model as claimed in claim 1, wherein said calling ChatGPT to obtain the related search word of network red classification comprises:

the method comprises the steps of defining task requirements for ChatGPT, requesting the ChatGPT as a network red classifier to give related search words according to network red classification logic;

and according to the specific related search word given by the ChatGPT, requesting the ChatGPT to give the results of different languages of the related search word again.

3. The ChatGPT and pretraining model based network red classification method of claim 1, wherein crawling video information associated with a search term according to the related search term comprises:

and crawling and processing the network red video website by using a web crawler technology, searching the related search words in the network red video website and acquiring the network red video information associated with the related search words.

4. The method for classifying network red based on ChatGPT and pre-training model of claim 1, wherein the processing the video information to form training data comprises:

and checking the video information samples, cleaning out the video information samples which are not required, such as the missing value, the abnormal value, the repeated value and the like, and carrying out random arrangement on the rest video information samples to form training data.

5. The method for classifying network reds based on ChatGPT and pre-training models according to claim 1, wherein the pre-training model is selected, and wherein the pre-training model is a multilingual BERT model.

6. The method for classifying network red based on ChatGPT and a pre-training model as claimed in claim 1, wherein the forming a preliminary grid structure based on the pre-training model comprises: on the basis of the pre-training model, a Dropout layer and a Softmax layer are added to form a preliminary grid structure.

7. The method for classifying network reds based on ChatGPT and a pre-training model as claimed in claim 1, wherein the screening of the network reds historically published videos comprises:

the video released in the last period of time by the network red is recalled and used as a video set reflecting the classification characteristics of the network red.

8. The ChatGPT and pre-training model based network red classification method of claim 7, wherein if the number of videos released in the last period of time of the network red is less than a set threshold, the network red is not classified.

9. The ChatGPT and pretraining model based network red classification method of claim 1, wherein the invoking the video classifier to predict the video classification result comprises:

and calling a video processor to process a network red video set, classifying according to the video characteristics of each video in the video set, and predicting a specific classification result of each video.

10. The method for classifying network red based on ChatGPT and pre-training model as claimed in claim 1, wherein said performing aggregate calculation according to said video classification result forms network red classification result, comprising:

and selecting a video classification result in the last period of time to perform aggregation calculation, wherein the aggregation calculation result is used as a classification result of the network red at the current time point.