CN117093747A - Net red classification method based on ChatGPT and pre-training model - Google Patents

Net red classification method based on ChatGPT and pre-training model Download PDF

Info

Publication number
CN117093747A
CN117093747A CN202310941131.9A CN202310941131A CN117093747A CN 117093747 A CN117093747 A CN 117093747A CN 202310941131 A CN202310941131 A CN 202310941131A CN 117093747 A CN117093747 A CN 117093747A
Authority
CN
China
Prior art keywords
video
chatgpt
network
network red
red
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310941131.9A
Other languages
Chinese (zh)
Inventor
陈春秀
董东
褚雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Multipoint Online Technology Co ltd
Original Assignee
Beijing Multipoint Online Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Multipoint Online Technology Co ltd filed Critical Beijing Multipoint Online Technology Co ltd
Priority to CN202310941131.9A priority Critical patent/CN117093747A/en
Publication of CN117093747A publication Critical patent/CN117093747A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a net red classification method based on a ChatGPT and a pre-training model, which comprises the following steps: defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data; selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier; screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.

Description

Net red classification method based on ChatGPT and pre-training model
Technical Field
The invention relates to the AI field, in particular to a net red classification method based on a ChatGPT and a pre-training model.
Background
The core of the network red classification is a classification machine learning model, the classification machine learning model belongs to the category of a supervised learning model, the supervised learning model needs to prepare training data of the network red classification in advance, and the training data generally needs to be marked by relying on manpower.
After the data is marked, a characteristic engineering and a proper model is selected (self-developed) to be prepared. Feature engineering is the conversion of raw data into feature sets that can be used by machine learning algorithms, with the goal of selecting, extracting, and converting meaningful, useful features in the raw data to help the machine learning model better understand the essential structure of the data. The machine learning model is an algorithm or mathematical model used to learn the mapping relationship between input data and output data.
Regarding the quality of the effect of the feature engineering and the model, the multiple rounds of iteration are generally required according to the paths of training, evaluating, retraining, reevaluating and the number of the years until the model meeting the service performance index and the effect index is found.
Finally, the model is in the online stage: the model is deployed to an on-line environment, the characteristics of the object to be predicted are input, the probability of outputting different classification results is predicted through model calculation (Forward feedback), and the classification with the highest probability is selected as the final prediction result.
Thus, the existing operating method of net red classification: training data of manual annotation net red classification is adopted, and then a classification model is trained from 0 to 1 based on the annotation data.
The disadvantages of the prior art are:
1. the manual annotation data is large in quantity and various. The method relies on manual labeling, has low efficiency, long time consumption and high cost, and has insufficient coverage of manual labeling data, thereby easily causing the phenomenon of model cocoon houses;
2. model training is carried out from 0 to 1 based on the labeling data, so that the time consumption is long, the effect is poor and the cost is high; and the insufficient data coverage and magnitude can lead to model training and fitting, and generalization is poor when the model is actually online.
Disclosure of Invention
The invention aims to provide a net red classification method based on a ChatGPT and a pre-training model.
The invention aims to solve the problems of low efficiency, long time consumption and high cost when manually marking data, and solves the problems of long time consumption, poor effect and high cost when training a model from 0 to 1.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
a network red classification method based on a ChatGPT and a pre-training model comprises the following steps: defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data; selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier; screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.
As a further improvement, the calling ChatGPT obtains related search terms of the netbook classification, including: the method comprises the steps of defining task requirements for ChatGPT, requesting the ChatGPT as a network red classifier to give related search words according to network red classification logic; and according to the specific related search word given by the ChatGPT, requesting the ChatGPT to give the results of different languages of the related search word again.
As a further improvement, crawling video information associated with a search term from related search terms includes: and crawling and processing the network red video website by using a web crawler technology, searching the related search words in the network red video website and acquiring the network red video information associated with the related search words.
As a further improvement, the processing the video information to form training data includes: and checking the video information samples, cleaning out the video information samples which are not required, such as the missing value, the abnormal value, the repeated value and the like, and carrying out random arrangement on the rest video information samples to form training data.
As a further refinement, the selection of the pre-training model is performed by selecting a pre-training model, wherein the pre-training model is a multilingual BERT model.
As a further improvement, the forming a preliminary mesh structure on the basis of the pre-training model includes: on the basis of the pre-training model, a Dropout layer and a Softmax layer are added to form a preliminary grid structure.
As a further improvement, the screening of the video published by the net red history includes: the video released in the last period of time by the network red is recalled and used as a video set reflecting the classification characteristics of the network red.
As a further improvement, if the number of videos released in the last period of time of the network red is smaller than a set threshold value, the network red is not classified.
As a further refinement, said invoking said video classifier predicts a video classification result comprising: and calling a video processor to process a network red video set, classifying according to the video characteristics of each video in the video set, and predicting a specific classification result of each video.
As a further improvement, the aggregation calculation is performed according to the video classification result to form a net red classification result, which includes: and selecting a video classification result in the last period of time to perform aggregation calculation, wherein the aggregation calculation result is used as a classification result of the network red at the current time point.
The beneficial effects of the invention are as follows:
by energizing ChatGPT, the output efficiency of related search words of network red classification is improved, and the efficiency is higher than that of manually searching the output search words, and the coverage range is wider;
based on the related search word, the training data is assembled, a training data training model is used for forming a video classifier, historical release videos of the network red are screened, the video classifier is called to predict video classification results, and aggregation calculation is carried out according to the video classification results to form network red classification results, so that the network red classification efficiency is improved.
Drawings
Fig. 1 is a schematic diagram of a network red classification method based on ChatGPT and a pre-training model according to an embodiment of the present invention.
Fig. 2 is a preliminary mesh structure provided by an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Referring to fig. 1, a method for classifying network red based on ChatGPT and a pre-training model includes: defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data; selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier; screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.
The calling ChatGPT to obtain related search terms of the net red classification includes: the method comprises the steps of defining task requirements for ChatGPT, requesting the ChatGPT as a network red classifier to give related search words according to network red classification logic; and according to the specific related search word given by the ChatGPT, requesting the ChatGPT to give the results of different languages of the related search word again.
In this embodiment, a set of promt templates is defined: "please play a net red classifier, give relevant video search words according to the characteristics of net red classification", and define task demands to AI;
next, inputting a specific category, and obtaining related search words of the category video, such as "please give similar english search words of clothes and trousers," display one item per line, "" please give similar english search words of clothes and trousers, "display one item per line," and so on.
According to the ChatGPT platform based on OpenAI, the related classified recommended search words can be quickly and efficiently obtained by calling the ChatGPT API interface, and a large number of search words of specific network red classification can be produced with high quality by matching with manual simple screening.
The crawling video information associated with the search term according to the related search term comprises the following steps: and crawling and processing the network red video website by using a web crawler technology, searching the related search words in the network red video website and acquiring the network red video information associated with the related search words.
In this embodiment, a web crawler technology is used to crawl and process a common network red video website (Youtube, tiktok, instagram, etc.), search for related search words generated in the previous step, and acquire network red video information associated with the search words, so as to generate a large amount of video training data.
The processing of the video information to form training data includes: and checking the video information samples, cleaning out the video information samples which are not required, such as the missing value, the abnormal value, the repeated value and the like, and carrying out random arrangement on the rest video information samples to form training data.
The steps obtain a large amount of video multi-mode data, and training data is formed after the processing of the steps:
cleaning: checking data samples, and cleaning out samples which are not required, such as missing values, abnormal values, repeated values and the like, so as to ensure the integrity and purity of the data, thereby ensuring the high ceilings of the final model effect;
confusion: the data collected by the method is generally arranged according to the sequence of the categories, training data of the same category are often concentrated together, and if the data are input into the model, the effect and the robustness of the model are poor. By randomizing and scrambling such comparison data, and then inputting the data to the model, the model is relatively high in effect and robustness.
The selection of a pre-training model, wherein the pre-training model is a multilingual BERT model.
The multilingual BERT model is a pretrained natural language processing model derived from Google, is based on a transducer model architecture, and uses a bi-directional encoder to establish a context representation, thereby supporting multiple natural language processing tasks. Unlike conventional natural language processing models, the multilingual BERT model does not require building different models for different languages, as it is a model that can be used in multiple languages. The model is pre-trained using a large corpus to learn information expressed in terms of underlying words and sentences, as well as correlations between natural language. The model is then adapted to the particular natural language processing task by fine-tuning in downstream tasks (e.g., named entity recognition and emotion analysis).
Because the multilingual BERT model is multilingual, this means that it is not necessary to build different models for different languages. The languages of the model comprise Arabic, chinese, english, french, german, japanese, korean, italian, portuguese, russian, spanish, turkish and the like, and the model has very good expandability and can be very conveniently adapted to new languages. The advantage of the multilingual BERT model is that it can be adapted to different languages and different natural language processing tasks, while also being very easy to adapt to new tasks by fine tuning. A common model is shared among a plurality of languages, so that the construction and maintenance cost of a large-scale corpus can be remarkably reduced, and the reusability of language data is improved. This makes the multilingual BERT model an important tool for implementing cross-language natural language processing.
In summary, the multilingual BERT model is a very powerful, scalable, flexible and efficient natural language processing tool, whose advent has brought great progress and value to the cross-language natural language processing field.
Referring to fig. 2, the forming a preliminary grid structure based on the pre-training model includes: on the basis of the pre-training model, a Dropout layer and a Softmax layer are added to form a preliminary grid structure so as to support the multi-classification requirement of the network red video.
The screening of the video published by the network red history comprises the following steps: the video released in the last period of time by the network red is recalled and used as a video set reflecting the classification characteristics of the network red.
And if the number of videos released by the network red in the last period of time is smaller than a set threshold value, not classifying the network red.
The calling the video classifier to predict the video classification result comprises the following steps: and calling a video processor to process a network red video set, classifying according to the video characteristics of each video in the video set, and predicting a specific classification result of each video.
The aggregation calculation is carried out according to the video classification result to form a net red classification result, which comprises the following steps: and selecting a video classification result in the last period of time to perform aggregation calculation, wherein the aggregation calculation result is used as a classification result of the network red at the current time point.
And if the related search terms of the ChatGPT recommended net red classification do not meet the requirements, returning to redefine the ChatGPT template.
In this embodiment, the video of the last 1 year is recalled from the video of the network red history release, and is used as the video set for responding to the network red classification feature.
In addition, in order to prevent the problem that the classification of the network red is inaccurate because the number of videos published by the partially inactive network red is small (for example, one network red only publishes a laughing video and cannot be simply considered as a laughing blogger), the minimum published video number of 12 for 1 year is given, that is, in the last 1 year, if the number of videos published by the network red is less than 12 (that is, 1 per month), the classification operation is not performed on the network red.
Aiming at the video list of the last 1 year of the last step, a trained model is called, classification is carried out according to video features, and a specific classification result of the video is predicted.
And incrementally updating the video classification result newly released by the network red, selecting the video classification result in the last 1 year, and aggregating the calculated results to be used as the classification result of the current network red and the current time point.
The above examples are only for illustrating the technical scheme of the present invention and are not limiting. It will be understood by those skilled in the art that any modifications and equivalents that do not depart from the spirit and scope of the invention are intended to be within the scope of the appended claims.

Claims (10)

1. The network red classification method based on the ChatGPT and the pre-training model is characterized by comprising the following steps of:
defining a ChatGPT template, calling the ChatGPT to acquire related search words of the network red classification, crawling video information associated with the search words according to the related search words, and processing the video information to form training data;
selecting a pre-training model, training the pre-training model by using the training data, forming a preliminary grid structure on the basis of the pre-training model, performing Fine-tuning on the preliminary network structure, and generating a video classifier;
screening the video released by the network red history, calling the video classifier to predict a video classification result, and performing aggregation calculation according to the video classification result to form the network red classification result.
2. The method for classifying network red based on ChatGPT and pre-training model as claimed in claim 1, wherein said calling ChatGPT to obtain the related search word of network red classification comprises:
the method comprises the steps of defining task requirements for ChatGPT, requesting the ChatGPT as a network red classifier to give related search words according to network red classification logic;
and according to the specific related search word given by the ChatGPT, requesting the ChatGPT to give the results of different languages of the related search word again.
3. The ChatGPT and pretraining model based network red classification method of claim 1, wherein crawling video information associated with a search term according to the related search term comprises:
and crawling and processing the network red video website by using a web crawler technology, searching the related search words in the network red video website and acquiring the network red video information associated with the related search words.
4. The method for classifying network red based on ChatGPT and pre-training model of claim 1, wherein the processing the video information to form training data comprises:
and checking the video information samples, cleaning out the video information samples which are not required, such as the missing value, the abnormal value, the repeated value and the like, and carrying out random arrangement on the rest video information samples to form training data.
5. The method for classifying network reds based on ChatGPT and pre-training models according to claim 1, wherein the pre-training model is selected, and wherein the pre-training model is a multilingual BERT model.
6. The method for classifying network red based on ChatGPT and a pre-training model as claimed in claim 1, wherein the forming a preliminary grid structure based on the pre-training model comprises: on the basis of the pre-training model, a Dropout layer and a Softmax layer are added to form a preliminary grid structure.
7. The method for classifying network reds based on ChatGPT and a pre-training model as claimed in claim 1, wherein the screening of the network reds historically published videos comprises:
the video released in the last period of time by the network red is recalled and used as a video set reflecting the classification characteristics of the network red.
8. The ChatGPT and pre-training model based network red classification method of claim 7, wherein if the number of videos released in the last period of time of the network red is less than a set threshold, the network red is not classified.
9. The ChatGPT and pretraining model based network red classification method of claim 1, wherein the invoking the video classifier to predict the video classification result comprises:
and calling a video processor to process a network red video set, classifying according to the video characteristics of each video in the video set, and predicting a specific classification result of each video.
10. The method for classifying network red based on ChatGPT and pre-training model as claimed in claim 1, wherein said performing aggregate calculation according to said video classification result forms network red classification result, comprising:
and selecting a video classification result in the last period of time to perform aggregation calculation, wherein the aggregation calculation result is used as a classification result of the network red at the current time point.
CN202310941131.9A 2023-07-28 2023-07-28 Net red classification method based on ChatGPT and pre-training model Pending CN117093747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310941131.9A CN117093747A (en) 2023-07-28 2023-07-28 Net red classification method based on ChatGPT and pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310941131.9A CN117093747A (en) 2023-07-28 2023-07-28 Net red classification method based on ChatGPT and pre-training model

Publications (1)

Publication Number Publication Date
CN117093747A true CN117093747A (en) 2023-11-21

Family

ID=88777918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310941131.9A Pending CN117093747A (en) 2023-07-28 2023-07-28 Net red classification method based on ChatGPT and pre-training model

Country Status (1)

Country Link
CN (1) CN117093747A (en)

Similar Documents

Publication Publication Date Title
CN109214386B (en) Method and apparatus for generating image recognition model
CN110580292B (en) Text label generation method, device and computer readable storage medium
CN112270379A (en) Training method of classification model, sample classification method, device and equipment
CN110968660B (en) Information extraction method and system based on joint training model
CN112633010A (en) Multi-head attention and graph convolution network-based aspect-level emotion analysis method and system
CN112817561B (en) Transaction type functional point structured extraction method and system for software demand document
CN110147552B (en) Education resource quality evaluation mining method and system based on natural language processing
CN111859953A (en) Training data mining method and device, electronic equipment and storage medium
CN109614612A (en) A kind of Chinese text error correction method based on seq2seq+attention
CN111626041B (en) Music comment generation method based on deep learning
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN113779988A (en) Method for extracting process knowledge events in communication field
CN115437952A (en) Statement level software defect detection method based on deep learning
CN115129807A (en) Fine-grained classification method and system for social media topic comments based on self-attention
CN113239143B (en) Power transmission and transformation equipment fault processing method and system fusing power grid fault case base
CN112559741B (en) Nuclear power equipment defect record text classification method, system, medium and electronic equipment
CN116383521B (en) Subject word mining method and device, computer equipment and storage medium
CN117333146A (en) Manpower resource management system and method based on artificial intelligence
CN111985226B (en) Method and device for generating annotation data
Zhang et al. LogPrompt: A Log-based Anomaly Detection Framework Using Prompts
CN116975161A (en) Entity relation joint extraction method, equipment and medium of power equipment partial discharge text
CN113688232B (en) Method and device for classifying bid-inviting text, storage medium and terminal
CN116502649A (en) Training method and device for text generation model, electronic equipment and storage medium
CN117093747A (en) Net red classification method based on ChatGPT and pre-training model
Nina et al. Simplified LSTM unit and search space probability exploration for image description

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination