CN117521603A - Short video text language model building training method - Google Patents

Short video text language model building training method

Info

Publication number
CN117521603A
Authority
CN
China
Prior art keywords
language model
text
model
short video
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311486210.1A
Other languages
Chinese (zh)
Inventor
况锦文
罗欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Juesheng Education Technology Co ltd
Original Assignee
Chongqing Juesheng Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Juesheng Education Technology Co ltd
Priority to CN202311486210.1A
Publication of CN117521603A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of language models, and in particular to a training method for constructing a short video text language model, comprising the following steps: S1, acquiring sample data consisting of videos and their accompanying text, extracting the text data from the sample data, and cleaning and preprocessing the text data; S2, converting the text data into embedding vectors; S3, selecting a pre-trained model, constructing a language model, and training and fine-tuning the language model on a text dataset composed of the embedding vectors; S4, setting prompts, analyzing the training effect of the trained language model using the prompts, and optimizing the language model according to the analysis results. The method improves training efficiency, reduces training cost, and trains a language model that meets users' personalized requirements.

Description

Short video text language model building training method
Technical Field
The invention relates to the technical field of language models, and in particular to a training method for constructing a short video text language model.
Background
In today's digital age, short video content has become an extremely popular form of media and is of great importance to both individual users and commercial brands. With the rapid development of social media platforms, short videos have become an important tool for attracting viewers, promoting products, and building brand images. In this context, how to create short video text that is attractive, creative, and resonates with audiences has become a critical issue. Market trends show that users' demand for high-quality short video content keeps increasing, and high-quality text can significantly improve a video's appeal and reach.
While there are many text generation tools and methods on the market, they often fail to fully understand the uniqueness of short video content, and the generated text frequently lacks originality and personalization, making it hard to meet viewers' increasingly demanding expectations. Even when an artificial intelligence algorithm is used to generate text, constructing and training a model from scratch consumes a large amount of computing resources and time, resulting in high cost and low efficiency. Moreover, data problems during training may yield unsatisfactory results, so the trained model cannot meet user requirements well, and the cost of repeatedly retraining the model as users' personalized requirements change is too high.
Therefore, a short video text language model building and training method is urgently needed that can improve training efficiency, reduce training cost, and train a language model that meets users' personalized requirements.
Disclosure of Invention
The invention aims to provide a training method for constructing a short video text language model that improves training efficiency, reduces training cost, and trains a language model meeting users' personalized requirements.
The invention provides the following basic scheme: a training method for constructing a short video text language model, comprising the following steps:
S1, acquiring sample data consisting of videos and their accompanying text, extracting the text data from the sample data, and cleaning and preprocessing the text data;
S2, converting the text data into embedding vectors;
S3, selecting a pre-trained model, constructing a language model, and training and fine-tuning the language model on a text dataset composed of the embedding vectors;
S4, setting prompts, analyzing the training effect of the trained language model using the prompts, and optimizing the language model according to the analysis results.
Further, the method further comprises: S5, setting evaluation indexes, evaluating the language model, and optimizing the language model according to the evaluation results.
Further, the method further comprises: S6, collecting user feedback on the text generated by the language model, and iteratively optimizing the language model according to the feedback.
Further, S1 comprises:
obtaining sample data of videos and their text using crawler technology;
extracting the text data from the sample data and cleaning it, including: extracting the text data from the videos, cleaning the text data, deleting irrelevant information, and correcting textual errors;
preprocessing the cleaned text data, where the preprocessing classifies the text data according to its style and content.
Further, S1 further comprises:
enhancing the text data, including synonym substitution and duplication within the text.
Further, S2 comprises:
using a pre-trained GPT model to convert the text data into high-dimensional embedding vectors and optimizing the embedding vectors.
Further, S3 comprises:
setting a model framework;
selecting a model that meets the current task requirements from among pre-trained GPT models as the initial model;
training the initial model on a text dataset composed of the embedding vectors to obtain a language model, and fine-tuning the language model.
Further, the model framework is a Transformer.
Further, S4 comprises:
setting relevant prompts, inputting the prompts into the trained language model, and obtaining the output text;
comparing and analyzing the obtained text against preset reference text to obtain an analysis of the model's training effect;
and optimizing the model according to the analysis results.
Further, the method further comprises: converting the text output by the language model into speech using TTS technology;
performing sentence-by-sentence emotion recognition and classification on the generated text using a text emotion classification model to produce classification results;
and setting the emotion of the speech for each sentence of the text according to the classification results.
The beneficial effects of this scheme: the scheme builds and trains the short video text language model by model fine-tuning. Building the language model from a pre-trained model (selecting a model that meets the current task requirements from among pre-trained GPT models as the initial model) saves a large amount of computing resources and time and effectively improves training efficiency; compared with the traditional approach of constructing and training a model from scratch, it reaches an excellent performance level in a shorter time and ensures a quick response to changing requirements.
In terms of sample data processing, the scheme cleans, preprocesses, and enhances the sample data, which reduces the data-processing workload, fully exploits the potential of the existing data, and strengthens the generalization ability of the model. Cleaning the text data, replacing synonyms, expanding the data, and so on ensure data diversity and richness during model training, so that the trained language model meets user requirements.
For the trained language model, the training effect is analyzed by setting prompts, and the language model is optimized according to the analysis results, improving the relevance and creativity of the generated text. The language model can generate more targeted and attractive text for different types of video content, meeting users' personalized requirements.
In summary, the method improves training efficiency, reduces training cost, and trains a language model that meets users' personalized requirements.
Drawings
FIG. 1 is a flow chart of an embodiment of the training method for constructing a short video text language model according to the present invention.
Detailed Description
The following is a further detailed description of the embodiments:
example 1
An embodiment is substantially as shown in FIG. 1: a training method for constructing a short video text language model comprises the following steps:
S1, acquiring sample data consisting of videos and their accompanying text, extracting the text data from the sample data, and cleaning and preprocessing the text data;
obtaining sample data of videos and their text from short video platforms using crawler technology; specifically, a crawler crawls a preset number of videos and their text on each short video platform, so that the acquired sample data (videos and their text) cover a variety of subjects and styles and suit the needs of different users;
extracting the text data from the sample data and cleaning it; specifically, the text data is extracted from the videos and cleaned, irrelevant information is deleted, and textual errors are corrected, ensuring high-quality text data and therefore high-quality subsequent training;
preprocessing the cleaned text data, specifically classifying it according to the style and content of the text, which provides accurate labels and a basis for subsequent model training;
enhancing the text data, including synonym substitution, duplication, and the like, which improves the robustness of the model. Cleaning, preprocessing, and enhancing the sample data reduces the data-processing workload, fully exploits the potential of the existing data, and strengthens the generalization ability of the model; cleaning the text data, replacing synonyms, and expanding the data ensure data diversity and richness during model training, so that the trained language model meets user requirements. A sketch of this enhancement step is given below.
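By way of illustration only, the following Python sketch shows one way the synonym substitution and duplication above could be implemented; the SYNONYMS dictionary, the substitution probability, and the copy count are hypothetical placeholders, since the patent does not specify them.

```python
import random

# Hypothetical synonym dictionary; in practice it could be built from a
# thesaurus or from word-embedding nearest neighbours.
SYNONYMS = {
    "beautiful": ["gorgeous", "stunning"],
    "visit": ["explore", "tour"],
}

def synonym_substitute(tokens, p=0.2):
    """Replace each token that has synonyms with probability p."""
    return [
        random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p else t
        for t in tokens
    ]

def augment(corpus, copies=2):
    """Enhance the text data: keep the originals and add synonym-substituted duplicates."""
    augmented = list(corpus)  # duplication of the original samples
    for text in corpus:
        tokens = text.split()
        for _ in range(copies):
            augmented.append(" ".join(synonym_substitute(tokens)))
    return augmented
```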
s2, converting the text data into embedded vectors;
using a pre-trained GPT model to convert the text data into a high-dimensional embedded vector, wherein the embedded vector can well represent the semantic information of the text;
the embedded vector is optimized, so that the embedded vector can better represent the characteristics of the short video file, and the adaptability of the embedded vector to the current task is improved; in this embodiment, the whitening technique is used to optimize the embedded vector, and the whitening technique is divided into two steps: centering and normalizing, namely, centering and normalizing the embedded vector to optimize the embedded vector;
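A minimal sketch of the whitening step follows, assuming the embeddings are collected into a NumPy matrix of shape (n_samples, dim). The patent names only the two steps, so the exact normalization used here (SVD-based decorrelation to unit variance) is an assumption.

```python
import numpy as np

def whiten(embeddings, eps=1e-8):
    """Whitening in two steps: centering, then normalization."""
    # Step 1: centering -- subtract the mean embedding vector.
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    # Step 2: normalization -- decorrelate via the covariance's SVD and
    # scale each direction to unit variance (one common whitening recipe).
    cov = centered.T @ centered / len(centered)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))
    return centered @ w
```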
s3, selecting a pre-trained model, constructing a language model, and training and fine-tuning the language model by adopting a document data set consisting of embedded vectors;
specifically, a model framework is set, and a transducer is set as a model framework in the embodiment, which can well process sequence data and is excellent in various natural language processing tasks; the invention selects the model architecture based on the Transformer, the architecture shows excellent performance in the aspect of processing text data, can better capture long-distance dependency relationship in the text, simultaneously provides strong parallel computing capability, and greatly improves training and reasoning efficiency;
selecting a model meeting the current task requirement from the pre-trained GPT models as an initial model;
training an initial model by adopting a text data set consisting of embedded vectors, obtaining a language model, and fine-tuning the language model; where the tuning is typically a part of the layers in the pre-trained model being tuned, e.g. only the last layers or some intermediate layers of the model; in the fine tuning process, the language model is optimized through a back propagation algorithm, so that the model performs better on a target task. By adopting a model fine tuning mode, a model meeting the current task requirement is selected from the pre-trained GPT models to serve as an initial model for fine tuning by selecting the pre-trained models, so that a large amount of calculation resources and time are saved, and the training efficiency can be effectively improved; compared with the traditional method for constructing and training the model from scratch, the method can reach excellent performance level in a shorter time and ensure quick response to the change of the demand;
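For illustration, a hedged PyTorch sketch of fine-tuning only the last layers follows, using the Hugging Face transformers GPT-2 checkpoint as a stand-in for whichever pre-trained GPT model is actually selected; the two-block cutoff and the learning rate are illustrative choices, not values from the patent.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")    # stand-in initial model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default

# Freeze everything, then unfreeze only the last two Transformer blocks,
# so that fine-tuning adjusts "only the last layers" of the model.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)

def fine_tune_step(batch_texts):
    """One back-propagation step on a batch of short video text samples."""
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    out = model(**enc, labels=labels)
    out.loss.backward()       # back-propagation, as in the description
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```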
s4, setting promts, analyzing the training effect of the trained language model by adopting the promts, and optimizing the language model according to the analysis result.
Setting related promts (prompt words), inputting the promts into the trained model, and obtaining an output document;
comparing and analyzing the acquired text with a preset text to acquire an analysis result of the model training effect;
optimizing the model according to the analysis result, such as fine tuning of a language model, retraining and the like; by setting the probes, training effect analysis is performed, and the language model is optimized according to the analysis result, so that the language model is optimized, and the relevance and creativity of the generated text are improved. The language model can generate more pertinent and attractive texts according to different types of video contents, so that the personalized requirements of users are met;
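The patent does not fix a comparison metric, so as one hypothetical instantiation the sketch below scores each model output against its preset reference text by cosine similarity of embeddings; generate(prompt) and embed(text) are assumed wrappers around the trained model and the embedding step of S2.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analyze_training_effect(prompts, reference_texts, generate, embed):
    """Feed each prompt to the trained model and compare its output with the
    preset reference text; a low average similarity suggests further
    fine-tuning or retraining is needed."""
    scores = [
        cosine(embed(generate(prompt)), embed(reference))
        for prompt, reference in zip(prompts, reference_texts)
    ]
    return sum(scores) / len(scores)
```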
s5, setting evaluation indexes, evaluating the language model, and optimizing the language model according to the evaluation result;
setting evaluation indexes such as confusion, complexity and the like according to task demands;
evaluating the language model according to the evaluation index, and generating an evaluation result;
and optimizing the language model according to the evaluation result, such as fine tuning of the language model, retraining and the like.
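Perplexity can be computed directly from the model's average per-token loss; a minimal sketch, reusing the model and tokenizer objects from the fine-tuning sketch above, might look like this.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text):
    """Perplexity = exp(mean negative log-likelihood per token); lower is better."""
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())
```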
S6, collecting user feedback on the text generated by the language model, and iteratively optimizing the language model according to the feedback;
establishing a user feedback collection mechanism that gathers users' feedback on the generated text while they use the language model, including satisfaction ratings and suggested text modifications;
iteratively optimizing the language model according to the feedback, so that it improves continuously and keeps meeting user requirements.
Steps S5 and S6 evaluate the effect of the constructed and trained language model, using multiple evaluation indexes together with tests in real application scenarios to analyze its performance and ensure that the language model performs well under different conditions.
The scheme is not limited to short video text generation in any particular field; it has good extensibility, can adapt to short video content in different professional fields, and meets broad text generation needs. Building the language model simplifies the text generation process and improves working efficiency, so that even users without a professional writing background can easily generate attractive short video text. Suppose a user of a system embodying this scheme is a travel blogger who plans to make a short video introducing a scenic spot: the user enters the field information "tourist attraction introduction" in the system and proposes the topic "how to visit a certain scenic spot". A built-in classifier recognizes this as a request for a tourist attraction introduction and passes the topic and field information to the trained language model. Upon receiving the input, the language model quickly retrieves information related to the scenic spot from its knowledge base and, combining common travel topics with audience interests, generates a piece of professional and attractive text. The generated text not only provides practical information about how to visit the attraction but also weaves in vivid, engaging descriptions that let the audience feel the unique charm of the attraction through words, inspiring them to watch the video and actually go exploring.
Example 2
This embodiment is substantially the same as the first embodiment, and further comprises:
converting the text output by the language model into speech using TTS (Text-To-Speech) technology;
performing sentence-by-sentence emotion recognition and classification on the generated text using a text emotion classification model to produce classification results. In this embodiment, the text emotion classification model is based on an MLP: the network is built from linear layers, an activation function, and a Softmax function, and is trained with a cross-entropy loss function; in addition, the Adam optimization algorithm adapts the learning rate automatically, and the trained model is used to classify the emotion of text. A sketch of such a classifier follows.
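As illustration, a minimal PyTorch sketch of such an MLP emotion classifier follows; the embedding dimension, hidden size, and six-class output are assumptions chosen to match the emotion types listed below, not values specified in the patent.

```python
import torch
import torch.nn as nn

class EmotionMLP(nn.Module):
    """MLP text emotion classifier: linear layers plus an activation function,
    with Softmax applied at prediction time."""

    def __init__(self, embed_dim=768, hidden_dim=256, num_classes=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),                            # activation function
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # Return logits; CrossEntropyLoss applies log-softmax internally.
        return self.net(x)

    def predict(self, x):
        # Softmax yields a probability per emotion class; take the arg-max.
        return torch.softmax(self.net(x), dim=-1).argmax(dim=-1)

model = EmotionMLP()
criterion = nn.CrossEntropyLoss()                 # cross-entropy loss function
optimizer = torch.optim.Adam(model.parameters())  # Adam adapts the learning rate
```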
According to the classification results, the emotion of the speech for each sentence is set; specifically, the speech of each sentence is adjusted to match its classification result. For example, speech with an excited emotion has a higher pitch, faster speaking rate, and greater volume than speech with a calm emotion.
The classification types of the text emotion classification model are set as required, so that the model can identify which type the emotion of a text belongs to, for example: low, calm, crying, excited, sad, angry, and so on; different pitches, speaking rates, and volumes are set for the speech corresponding to the different types. Setting, for each sentence of text, the pitch, rate, and volume corresponding to its emotion type according to the classification results makes the generated speech rich in emotion, better matches the user's actual feelings and expressive needs, and blends better with the animation, making the generated short video more attractive.
Other embodiments further include: identifying whether the emotion difference between adjacent sentences is smaller than a preset emotion difference, and if not, applying emotion smoothing to the speech between the adjacent sentences;
specifically, emotions are given graded numeric labels, and the emotion difference between adjacent sentences is the absolute value of the difference between their numeric labels. If this value is not smaller than the preset emotion difference, the emotion changes sharply between the two sentences, and switching the speech emotion directly would sound abrupt. Emotion smoothing is therefore applied: the last preset number of words of the preceding sentence and the first preset number of words of the following sentence are extracted as transition text, and the pitch, speaking rate, and volume of the speech for the transition text are ramped from the values for the preceding sentence's emotion to the values for the following sentence's emotion, so that the emotion progresses gradually rather than jumping. In addition, emotion smoothing can be configured to process only adjacent sentences within the same section and not adjacent sentences across different sections: a large emotion change between two sections has little effect on overall coherence, and skipping it reduces the processing workload. A sketch of this ramp is shown below.
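A hedged sketch of the smoothing decision and the ramp follows; the dictionary fields, threshold, and step count are illustrative names and values, since the patent specifies only graded numeric labels and a ramp of pitch, speaking rate, and volume.

```python
def smooth_transition(prev, nxt, threshold=2, steps=4):
    """Emotion smoothing between adjacent sentences.

    prev/nxt carry a graded numeric emotion 'label' plus the pitch, rate,
    and volume mapped to that emotion.  If the label gap reaches the preset
    threshold, return per-step voice settings that ramp the transition text
    from the previous sentence's voice to the next one's; otherwise None.
    """
    if abs(prev["label"] - nxt["label"]) < threshold:
        return None  # small emotion change: no smoothing needed
    ramp = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        ramp.append({
            k: prev[k] + t * (nxt[k] - prev[k])
            for k in ("pitch", "rate", "volume")
        })
    return ramp
```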
Example 3
This embodiment is substantially the same as the above embodiments, and further comprises: acquiring the user's short video character information;
adding expressions and actions to the characters in the short video character information to generate an animation; specifically, if the short video character information is image information, the server uses an image recognition algorithm to recognize the characters in the picture and uses face recognition and animation generation technology to add expressions and actions to them; the image recognition and face recognition may use deep learning algorithms, eigenface algorithms, and the like, and the animation generation may use Wav2Lip. If the short video character information is the user's character requirement information, the server generates a virtual character using virtual IP avatar generation technology according to that requirement information, and then adds expressions and actions to the character using face recognition and animation generation technology.
The speech and the animation are fused to generate the short video; specifically, the generated speech and animation are fused into the final short video. Throughout the production process, the user only needs to provide personalized requirements, and the short video is generated automatically according to the creator's personalized requirements, which lowers the difficulty of short video production and helps short video creation spread further.
Applying the generated short video to an intelligent business card further comprises:
acquiring user information and logging the user in;
managing the user information, and storing the generated short video together with the user information under which the short video requirement information and short video character information were submitted; the user information includes the user's name, occupation, position, company, and the like;
analyzing the user information and the user's short video requirement information, and extracting keywords for the fields the user pays attention to;
acquiring short video management information and managing the short videos accordingly, where the short video management information includes deleting short videos and setting their viewing permissions; the viewing permissions include visible only to the owner and visible to all;
analyzing the short video requirement information and the short videos, and extracting the domain keywords of each short video;
acquiring a short video recommendation selection signal and pushing the corresponding short video to a designated user, specifically by sending it to the designated user's terminal;
or generating a two-dimensional viewing code for the short video, which other users scan with their own terminals to receive the logged-in user's short video on the scanning terminal;
receiving and playing the short video; a short video whose viewing permission is set to visible to all can be watched by every user it is pushed to;
analyzing the similarity between the field-of-interest keywords and the domain keywords, and extracting the field-of-interest keywords whose similarity does not reach a preset similarity as newly added domain keywords;
generating short video production suggestions according to the newly added domain keywords. For example, suppose the user is an investment manager who promotes investments and recommends his short video intelligent business card to other users. By comparing other users' field-of-interest keywords with the user's domain keywords, the system determines whether the user lacks short videos covering the fields those users care about and generates production suggestions for those fields accordingly: if the user's short videos cover real-estate investment while other users care about education and healthcare, the system can suggest producing short videos about investments in the education and healthcare fields, improving the recommendation efficiency of the intelligent business card and strengthening its promotional effect. A sketch of the keyword comparison is given below.
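For illustration, a hedged sketch of that keyword comparison follows; embed(word) is an assumed word-embedding lookup and the 0.8 threshold is a placeholder, since the patent requires only a comparison against a preset similarity.

```python
import numpy as np

def new_domain_keywords(interest_kw, video_kw, embed, threshold=0.8):
    """Field-of-interest keywords whose best similarity to the user's existing
    short video domain keywords falls below the preset threshold become
    suggested new production domains."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    suggestions = []
    for kw in interest_kw:
        best = max(cos(embed(kw), embed(v)) for v in video_kw)
        if best < threshold:
            suggestions.append(kw)  # e.g. "education", "medical" in the example
    return suggestions
```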
The foregoing is merely an embodiment of the present invention. Specific structures and characteristics that are common knowledge in the art are not described here in detail: a person of ordinary skill in the art knows the prior art as of the application date or the priority date, can access all the prior art in the field, and has the ability to apply conventional experimental means of that date, so such a person can, in light of this application and their own abilities, complete and implement this embodiment, and some typical known structures or methods should not become obstacles to implementing this application. It should be noted that those skilled in the art can make modifications and improvements without departing from the structure of the present invention, and these should also be regarded as falling within the protection scope of the present invention without affecting the effect of its implementation or the utility of the patent. The scope of protection of this application shall be governed by the content of the claims, and the specific embodiments and other descriptions in the specification may be used to interpret the content of the claims.

Claims (10)

1. A training method for constructing a short video text language model, characterized by comprising the following steps:
S1, acquiring sample data consisting of videos and their accompanying text, extracting the text data from the sample data, and cleaning and preprocessing the text data;
S2, converting the text data into embedding vectors;
S3, selecting a pre-trained model, constructing a language model, and training and fine-tuning the language model on a text dataset composed of the embedding vectors;
S4, setting prompts, analyzing the training effect of the trained language model using the prompts, and optimizing the language model according to the analysis results.
2. The short video text language model building and training method of claim 1, further comprising: S5, setting evaluation indexes, evaluating the language model, and optimizing the language model according to the evaluation results.
3. The short video text language model building and training method of claim 2, further comprising: S6, collecting user feedback on the text generated by the language model, and iteratively optimizing the language model according to the feedback.
4. The short video text language model building and training method of claim 1, wherein S1 comprises:
obtaining sample data of videos and their text using crawler technology;
extracting the text data from the sample data and cleaning it, including: extracting the text data from the videos, cleaning the text data, deleting irrelevant information, and correcting textual errors;
preprocessing the cleaned text data, where the preprocessing classifies the text data according to its style and content.
5. The short video text language model building and training method of claim 4, wherein S1 further comprises:
enhancing the text data, including synonym substitution and duplication within the text.
6. The short video text language model building and training method of claim 1, wherein S2 comprises:
using a pre-trained GPT model to convert the text data into high-dimensional embedding vectors and optimizing the embedding vectors.
7. The short video text language model building and training method of claim 1, wherein S3 comprises:
setting a model framework;
selecting a model that meets the current task requirements from among pre-trained GPT models as the initial model;
training the initial model on a text dataset composed of the embedding vectors to obtain a language model, and fine-tuning the language model.
8. The short video text language model building and training method of claim 7, wherein the model framework is a Transformer.
9. The short video text language model building and training method of claim 1, wherein S4 comprises:
setting relevant prompts, inputting the prompts into the trained language model, and obtaining the output text;
comparing and analyzing the obtained text against preset reference text to obtain an analysis of the model's training effect;
and optimizing the model according to the analysis results.
10. The short video text language model building and training method of claim 1, further comprising: converting the text output by the language model into speech using TTS technology;
performing sentence-by-sentence emotion recognition and classification on the generated text using a text emotion classification model to produce classification results;
and setting the emotion of the speech for each sentence of the text according to the classification results.
CN202311486210.1A 2023-11-08 2023-11-08 Short video text language model building training method Pending CN117521603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311486210.1A CN117521603A (en) 2023-11-08 2023-11-08 Short video text language model building training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311486210.1A CN117521603A (en) 2023-11-08 2023-11-08 Short video text language model building training method

Publications (1)

Publication Number Publication Date
CN117521603A true CN117521603A (en) 2024-02-06

Family

ID=89752418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311486210.1A Pending CN117521603A (en) 2023-11-08 2023-11-08 Short video text language model building training method

Country Status (1)

Country Link
CN (1) CN117521603A (en)


Legal Events

Date Code Title Description
PB01 Publication