CN117216532A - Model training method, device, equipment, storage medium and program product - Google Patents

Model training method, device, equipment, storage medium and program product

Info

Publication number
CN117216532A
Authority
CN
China
Prior art keywords: emotion, text, character, recognition model, intensity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310153224.5A
Other languages
Chinese (zh)
Inventor
宋腾韬
陈万顺
杜楠
杨智昊
曹博文
王徐喆
陈梓阳
刘星言
陈伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310153224.5A
Publication of CN117216532A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides an artificial-intelligence-based model training method, device, equipment, storage medium and program product. The method includes: performing a first mapping process on the character speech features of each first character of a first text to obtain a first character emotion intensity tag of that character; acquiring a first emotion type tag that characterizes the text emotion type of the first text, and constructing a first training sample based on the first text, the first emotion type tag and the first character emotion intensity tag; performing first training on an initialized emotion recognition model with the first training sample to obtain a first emotion recognition model; acquiring a second training sample with real labels and performing second training on the first emotion recognition model with the second training sample to obtain a second emotion recognition model; and obtaining, based on the second emotion recognition model, a target emotion recognition model for performing the multiple tasks. The method and device improve the interpretability of the model and its ability to learn fine-grained emotion information.

Description

Model training method, device, equipment, storage medium and program product
Technical Field
The present application relates to artificial intelligence technology, and in particular, to an artificial intelligence-based model training method, apparatus, electronic device, computer readable storage medium, and computer program product.
Background
Artificial intelligence (Artificial Intelligence, AI) is a comprehensive technology of computer science that studies the design principles and implementation methods of various intelligent machines so that machines can perceive, reason and make decisions. Artificial intelligence is a broad discipline covering many fields, such as natural language processing and machine learning/deep learning; as the technology develops, it will be applied in ever more fields and take on increasingly important value.
In the related art, virtual objects are used in place of real persons for various interactions. As the application range of virtual objects widens, the demand for anthropomorphism grows, and different emotions need to be reflected in the facial expressions of the virtual object. However, models in the related art can only identify sentence-level emotion from text: the granularity of emotion recognition is coarse and the interpretability of the model is weak, which is not conducive to training and learning of the model.
Disclosure of Invention
The embodiments of the application provide an artificial-intelligence-based model training method, device, electronic equipment, computer readable storage medium and computer program product, which can improve the interpretability of a model and its learning of fine-grained emotion information.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a model training method based on artificial intelligence, which comprises the following steps:
performing, for each first character of a first text, a first mapping process on the character speech features of the first character to obtain a first character emotion intensity tag of the first character;
acquiring a first emotion type tag used for representing the text emotion type of the first text, and constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag;
performing first training processing based on multiple tasks on the initialized emotion recognition model through the first training sample to obtain a first emotion recognition model, wherein the multiple tasks comprise an emotion classification task and a character emotion intensity classification task;
acquiring a second training sample with a real label, and performing second training processing on the first emotion recognition model based on the multi-task through the second training sample to obtain a second emotion recognition model;
obtaining, based on the second emotion recognition model, a target emotion recognition model for performing the multiple tasks.
The embodiment of the application provides a model training device based on artificial intelligence, which comprises the following components:
the labeling module is used for carrying out first mapping processing on character voice characteristics of the first characters aiming at each first character of the first text to obtain first character emotion intensity labels of the first characters;
the composition module is used for acquiring a first emotion type tag used for representing the text emotion type of the first text and constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag;
the first training module is used for carrying out first training processing on the initialized emotion recognition model based on multiple tasks through the first training sample to obtain the first emotion recognition model, wherein the multiple tasks comprise an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task;
the second training module is used for acquiring a second training sample with a real label, and performing a second training process based on the multi-task on the first emotion recognition model through the second training sample to obtain a second emotion recognition model;
and the acquisition module is used for acquiring, based on the second emotion recognition model, a target emotion recognition model for performing the multiple tasks.
The embodiment of the application provides a text processing method based on artificial intelligence, which comprises the following steps:
acquiring a text to be identified of a virtual object;
carrying out emotion recognition processing on the text to be recognized through a target emotion recognition model to obtain a predicted emotion type of the text to be recognized and a predicted character emotion intensity of each character to be recognized in the text to be recognized;
the target emotion recognition model is obtained through the model training method provided by the embodiment of the application.
The embodiment of the application provides a text processing device based on artificial intelligence, which comprises:
the text module is used for acquiring a text to be identified of the virtual object;
the recognition module is used for carrying out emotion recognition processing on the text to be recognized through a target emotion recognition model to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized;
the target emotion recognition model is obtained through the model training method provided by the embodiment of the application.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the model training method based on the artificial intelligence or the text processing method based on the artificial intelligence when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores computer executable instructions for realizing the model training method based on artificial intelligence or the text processing method based on artificial intelligence provided by the embodiment of the application when being executed by a processor.
The embodiment of the application provides a computer program product, which comprises a computer program or a computer executable instruction, wherein the computer program or the computer executable instruction realizes the model training method based on artificial intelligence or the text processing method based on artificial intelligence when being executed by a processor.
The embodiment of the application has the following beneficial effects:
according to the embodiment of the application, the emotion type label of the text and the emotion intensity label of the character level can be generated by utilizing the voice characteristics, which is equivalent to acquiring the emotion labels with fine granularity, so that the model can learn emotion information with more different granularities, the emotion recognition capability of the model is improved, and the labels obtained by combining the voice characteristics are fully utilized at different stages in the whole learning process, so that the model has stronger interpretability.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence based model training application system provided by an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIGS. 3A-3C are schematic flow diagrams of an artificial intelligence based model training method provided by an embodiment of the present application;
FIG. 4 is a flow chart of an artificial intelligence based text processing method provided by an embodiment of the present application;
fig. 5 is a schematic diagram of expressions of a virtual object according to an embodiment of the present application;
FIG. 6 is a schematic diagram of fine-grained expression annotation provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a mood processing process provided by an embodiment of the present application;
FIG. 8 is a schematic illustration of speech features provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of audio energy extraction and labeling provided by an embodiment of the present application;
fig. 10 is a training schematic diagram of a target emotion recognition model according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the accompanying drawings, so as to make the objects, technical solutions and advantages of the present application clearer. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects. It is understood that, where permitted, the "first", "second", "third" may be interchanged in a specific order or sequence so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Virtual humans, also called virtual digital humans, are virtual figures simulated on a computer to resemble real humans, and are regarded as an important way for the identity of a real person to exist and interact realistically in the metaverse. A virtual human is a virtual character with a digitized appearance; it may have a human body shape and appearance, or may be a specific cartoon image, and is used in industries such as virtual studios and virtual education.
2) Expression driving means that a virtual person generates an expression matched with the emotion of a text by analyzing the input text.
3) Emotional intensity refers to the degree of emotion expressed by text or characters.
With the development of the metaverse, its presentation includes virtual scenes and virtual humans. As virtual humans are applied more widely, their degree of anthropomorphism needs to improve, that is, the image, actions, expressions and even the speaking style of a virtual human need to be closer to those of real humans, ultimately improving the viewing experience of users; this also places higher-precision and finer-grained requirements on the expression driving of virtual humans. Referring to fig. 5, a virtual human needs to be able to produce different expressions according to the emotion type and emotion intensity of what it is saying.
However, it is difficult in the related art to drive the expression of a virtual human at fine granularity in an interpretable manner, so the expression of the virtual human appears abrupt and unnatural. Specifically, according to the driving mode, expression driving of a virtual human can be divided into real-person driving and AI intelligent driving. Real-person driving is highly dependent on motion capture technology, and because the virtual human has many micro-motions such as mouth shapes and expressions, the cost of real-person shooting is huge, so AI intelligent driving has become the mainstream. In terms of expression driving, AI intelligent driving is mainly divided into speech driving and text driving according to the input. The two driving processes are similar: after a user inputs text or speech, a deep learning algorithm supplemented by certain rules generally outputs sentence-level emotion labels, which are used to guide the modelling of the virtual human's expression so that the virtual human produces an expression corresponding to the emotion of the speech or text.
In the related art, the scheme of driving the expression of a virtual human by speech is greatly affected by speech characteristics (timbre, noise and the like), and it is difficult to improve the generalization ability of the model. In virtual-human application scenarios, TTS technology is usually required to generate speech from text, and the quality of the synthesized speech may not be fully consistent with real, natural speech. Moreover, both the speech-driven and the text-driven schemes in the related art can only accurately predict sentence-level emotion labels. In the TTS field there are a few explorations of fine-grained emotion prediction, such as word-level emotion intensity prediction using attention scores, but these schemes lack interpretability. Therefore, the related art cannot well meet the business requirement of fine-grained expression driving.
The embodiments of the application provide an artificial-intelligence-based model training method, text processing method, device, electronic equipment, computer readable storage medium and computer program product, which can improve the interpretability of a model and its learning of fine-grained emotion information.
The model training and text processing methods based on artificial intelligence provided by the embodiments of the application may be implemented by a terminal or a server alone, or cooperatively by the terminal and the server. For example, the terminal alone carries out the artificial-intelligence-based model training method and text processing method described below. Alternatively, the terminal sends a model training request to the server, and the server executes the artificial-intelligence-based model training method according to the received request: for each first character of the first text, a first mapping process is performed on the character speech features of the first character to obtain a first character emotion intensity tag; a first training sample is constructed based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag; a multi-task first training process is performed on the initialized emotion recognition model with the first training sample to obtain a first emotion recognition model, the multiple tasks including an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task; a second training sample with real labels is acquired, and a multi-task second training process is performed on the first emotion recognition model with the second training sample to obtain a second emotion recognition model; and a target emotion recognition model for performing the multiple tasks is obtained based on the second emotion recognition model. The server then returns a prompt of model training completion to the terminal. The terminal sends a display driving request carrying a text to be recognized to the server; the server receives the request and acquires the text to be recognized of the virtual object, performs emotion recognition on the text with the target emotion recognition model to obtain the predicted emotion type of the text and the predicted character emotion intensity of each character to be recognized, performs display driving on the virtual object based on these predictions to obtain the display effect when the virtual object broadcasts the text, and feeds the display effect back to the terminal for presentation.
The electronic device for executing the model training method and the text processing method based on the artificial intelligence provided by the embodiment of the application can be various types of terminal devices or servers, wherein the servers can be independent physical servers, can be a server cluster or a distributed system formed by a plurality of physical servers, and can be cloud servers for providing cloud computing services; the terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
Taking a server as an example, it may be a server cluster deployed in the cloud that opens an artificial intelligence cloud service (AI as a Service, AIaaS) to users. An AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI theme mall: all users can access, through an application programming interface, one or more of the artificial intelligence services provided by the AIaaS platform.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of an artificial intelligence based model training system according to an embodiment of the present application, where a terminal 400 is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 (running with a live client) may be used to obtain a training request for a model, for example, a developer inputs the training request for the model through an input interface of the terminal 400, the terminal 400 sends the model training request to the server 200, and the server 200 performs a first mapping process on a character voice feature of the first character for each first character of the first text, to obtain a first character emotion intensity tag of the first character; constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag; performing first training processing based on multiple tasks on the initialized emotion recognition model through a first training sample to obtain a first emotion recognition model, wherein the multiple tasks comprise an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task; acquiring a second training sample with a real label, and performing multitasking-based second training treatment on the first emotion recognition model through the second training sample to obtain a second emotion recognition model; based on the second emotion recognition model, a target emotion recognition model for performing multitasking is obtained, the server 200 returns prompt information of model training completion to the terminal 400, the terminal 400 sends a display driving request carrying a text to be recognized to the server 200, and the server 200 receives the display driving request to obtain the text to be recognized of the virtual object; carrying out emotion recognition processing on the text to be recognized through the target emotion recognition model to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized; and performing display driving processing on the virtual object based on the predicted emotion type of the text to be recognized and the predicted emotion intensity of each character to be recognized in the text to be recognized, so as to obtain a display effect when the virtual object broadcasts the text to be recognized, and displaying the virtual object broadcasting the text to be recognized on the terminal 400 according to the display effect.
In some embodiments, a model training plug-in and an emotion processing plug-in can be implanted in a client running in the terminal to locally implement an artificial intelligence based model training method and a text processing method in the client. For example, after obtaining a model training request, the terminal 400 invokes a model training plug-in to implement an artificial intelligence based model training method and a text processing method, and performs a first mapping process on character speech features of the first characters for each first character of the first text to obtain a first character emotion intensity tag of the first character; constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag; performing first training processing based on multiple tasks on the initialized emotion recognition model through a first training sample to obtain a first emotion recognition model, wherein the multiple tasks comprise an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task; acquiring a second training sample with a real label, and performing multitasking-based second training treatment on the first emotion recognition model through the second training sample to obtain a second emotion recognition model; acquiring a target emotion recognition model for performing multitasking based on the second emotion recognition model; the terminal 400 receives a text to be recognized input by a user, and the terminal 400 acquires the text to be recognized of the virtual object; carrying out emotion recognition processing on the text to be recognized through the target emotion recognition model to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized; and performing display driving processing on the virtual object based on the predicted emotion type of the text to be recognized and the predicted emotion intensity of each character to be recognized in the text to be recognized, so as to obtain a display effect when the virtual object broadcasts the text to be recognized, and displaying the virtual object broadcasting the text to be recognized on the terminal 400 according to the display effect.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and a terminal 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (for example a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile memory may be a read only memory (ROM, Read Only Memory) and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 450 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: Bluetooth, Wireless Fidelity (WiFi), universal serial bus (USB, Universal Serial Bus), etc.;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the artificial intelligence based model training apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows the artificial intelligence based model training apparatus 455 stored in the memory 450, which may be software in the form of a program and a plug-in, and includes the following software modules: the labeling module 4551, the composition module 4552, the first training module 4553, the second training module 4554 and the acquisition module 4555 are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.
As mentioned above, the artificial-intelligence-based model training method provided by the embodiment of the application can be implemented by various types of electronic devices. Referring to fig. 3A, fig. 3A is a schematic flow chart of an artificial-intelligence-based model training method according to an embodiment of the present application, described with reference to steps 101 to 105 shown in fig. 3A.
In step 101, for each first character of the first text, a first mapping process is performed on the character phonetic feature of the first character, so as to obtain a first character emotion intensity tag of the first character.
As an example, the first character emotion intensity tag is used to characterize the emotion intensity of the first character. For each first character of the first text, the interval in which the character speech feature of the first character falls is obtained and mapped to the first character emotion intensity tag. Taking audio energy as the speech feature for illustration, the first character emotion intensity tag includes: no intensity, low intensity, medium intensity and high intensity. The tag may be expressed as a one-hot vector, for example (1, 0, 0, 0) indicates no intensity and (0, 1, 0, 0) indicates low intensity, or it may directly be the value of the audio energy. The interval in which the audio energy falls between 0 and 0.1 corresponds to no intensity; the interval between 0.1 and 0.3 corresponds to low intensity; the interval between 0.3 and 0.5 corresponds to medium intensity; and the interval between 0.5 and 1.0 corresponds to high intensity.
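As a minimal sketch of this interval mapping, assuming normalized audio energy in [0, 1] and the illustrative thresholds 0.1 / 0.3 / 0.5 given above (the function name is chosen here only for readability):

```python
def energy_to_intensity_label(energy: float) -> str:
    """Map a normalized per-character audio energy value to a character
    emotion intensity tag, using the illustrative thresholds above."""
    if energy < 0.1:
        return "no_intensity"
    if energy < 0.3:
        return "low_intensity"
    if energy < 0.5:
        return "medium_intensity"
    return "high_intensity"
```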
In some embodiments, referring to fig. 3B, steps 106 through 108 shown in fig. 3B may be performed before step 101 is performed.
In step 106, audio data is obtained, and clipping processing is performed on the audio data to obtain at least one single sentence of audio.
As an example, audio data is obtained by cutting out a segment of a human utterance from video data, and the audio data is further cut into individual sentences of audio, i.e., each sentence is taken as an audio sample.
In step 107, speech recognition processing is performed on each single sentence of audio, so as to obtain a first text corresponding to each single sentence of audio, a start time point of each first character in the first text, and an end time point of each first character in the first text.
As an example, the single sentence audio is parsed by ASR technology to obtain a first text corresponding to each single sentence audio and a start time point and an end time point of each first character in the first text appearing in the single sentence audio.
In step 108, for the first text of each of the single-sentence audios, character phonetic features of each of the first characters are obtained from the single-sentence audios corresponding to the first text based on the start time point and the end time point of each of the first characters of the first text.
In some embodiments, step 108 of obtaining the character speech feature of each first character from the single-sentence audio corresponding to the first text, based on the start time point and the end time point of each first character of the first text, may be implemented as follows. The following processing is performed for each first character: acquiring, based on the start time point and the end time point of the first character, at least one audio frame corresponding to the first character from the single-sentence audio; acquiring the audio energy of each audio frame corresponding to the first character; averaging the audio energy of the at least one audio frame to obtain the average audio energy corresponding to the first character; and taking the average audio energy corresponding to the first character as the character speech feature of the first character. In this way, information about the audio energy is learned into the model, making the model's character-level intensity prediction more accurate and more interpretable.
As an example, the audio energy of each frame in the audio is extracted; the text and the start and end time of each word in the text are returned by ASR techniques; each frame is aligned with its corresponding word, and the audio energy of each word in the text is defined as the average of the audio energy of its corresponding audio frames. The audio energy corresponding to each first character is normalized in units of sentences, and finally the word-level audio energy is used to annotate the first character emotion intensity labels for the words in the first text.
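A possible implementation of this character-level energy extraction is sketched below, assuming the ASR output already provides each character's start and end time and that a frame-level energy array is available; the frame length and the asr_chars structure are illustrative assumptions, not something specified by the application:

```python
import numpy as np

def char_energies(frame_energy: np.ndarray, frame_ms: float, asr_chars: list) -> np.ndarray:
    """frame_energy: per-frame audio energy of one sentence of audio.
    asr_chars: list of dicts such as {"char": "好", "start_ms": 120, "end_ms": 340}
    (an assumed ASR output format). Returns one averaged, sentence-normalized
    energy value per character."""
    energies = []
    for ch in asr_chars:
        lo = int(ch["start_ms"] / frame_ms)
        hi = max(lo + 1, int(ch["end_ms"] / frame_ms))
        energies.append(frame_energy[lo:hi].mean())   # average over the character's frames
    energies = np.asarray(energies)
    # normalize within the sentence so the values fall into [0, 1]
    rng = energies.max() - energies.min()
    return (energies - energies.min()) / rng if rng > 0 else np.zeros_like(energies)
```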
In some embodiments, the second mapping process may be further performed on the text voice feature of the first text to obtain a first text emotion intensity tag of the first text, that is, obtain both a character-level emotion intensity tag and a text-level emotion intensity tag. Specifically, a first interval in which the text voice feature of the first text is located is obtained, and a first text emotion intensity label corresponding to the first interval is obtained, wherein the first text emotion intensity label is used for representing the emotion intensity of the whole text.
As an example, before the second mapping process is performed on the text speech feature of the first text, audio data is obtained and clipped into at least one single sentence of audio. Specifically, real-person speaking segments are captured from video data to obtain audio data, and the audio data is further clipped into single-sentence audio, i.e., each sentence of speech is used as one audio sample. Speech recognition is performed on each single sentence of audio to obtain the corresponding first text; specifically, the single-sentence audio is parsed by ASR technology. The speech feature of each single sentence of audio is then taken as the text speech feature of the corresponding first text, and the second mapping process is performed on the text speech feature to obtain the first text emotion intensity tag. Beforehand, the following processing is performed for each single sentence of audio: the audio energy of each audio frame in the single-sentence audio is acquired, and the average of the audio energy of all audio frames is used as the text speech feature of the single-sentence audio. Specifically, the audio energy of each frame in the audio is extracted, and the audio energy corresponding to a sentence is defined as the average of the energy of all words in the sentence; five consecutive sentences spoken by the same speaker are taken as a group, the audio energy corresponding to each sentence is normalized in units of the group, and finally the sentence-level audio energy is used to annotate emotion intensity labels for the words and sentences in the text.
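The sentence-level step might look like the following sketch, where the energy of a sentence is the mean of its word energies and normalization is performed over groups of five consecutive sentences from the same speaker; min-max normalization is an assumption here, as the application only states that normalization is performed in units of the group:

```python
import numpy as np

def sentence_energies(word_energy_per_sentence: list, group_size: int = 5) -> np.ndarray:
    """word_energy_per_sentence: one array of word-level energies per sentence,
    ordered in time and spoken by the same speaker."""
    sent = np.array([np.mean(w) for w in word_energy_per_sentence])
    out = np.empty_like(sent)
    for i in range(0, len(sent), group_size):
        g = sent[i:i + group_size]                     # one group of (up to) five sentences
        rng = g.max() - g.min()
        out[i:i + group_size] = (g - g.min()) / rng if rng > 0 else 0.0
    return out
```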
In step 102, a first emotion type tag for characterizing a text emotion type of a first text is obtained, and a first training sample is constructed based on the first text, the first emotion type tag of the first text, and the first character emotion intensity tag.
As an example, the first emotion type tag may be obtained by performing emotion classification on the first text with an existing emotion classification model (for example, a SOTA model); the first emotion type tag includes: anger, liking, surprise, suspicion, fear, happiness, neutrality, aversion, sadness.
As an example, when the first text emotion intensity tag of the first text is acquired, the first text, the first emotion type tag of the first text, the first text emotion intensity tag (used for characterizing the emotion intensity of the whole text), and the first character emotion intensity tag (used for characterizing the emotion intensity of the first character) may be combined into a first training sample, that is, the first training sample includes emotion intensity tags with two granularities.
In step 103, a first training process based on multitasking is performed on the initialized emotion recognition model through the first training sample, so as to obtain a first emotion recognition model.
As an example, when the first text emotion intensity tag of the first text is acquired, the multiple tasks include an emotion classification task (a task that outputs the probability that the text belongs to each candidate emotion type), a text emotion intensity classification task (a task that outputs a value characterizing the emotion intensity of the text) and a character emotion intensity classification task (a task that outputs a value characterizing the emotion intensity of a character). Alternatively, the multiple tasks include only the emotion classification task and the character emotion intensity classification task.
In some embodiments, when the multiple tasks include the emotion classification task and the character emotion intensity classification task, step 103 of performing the multi-task first training process on the initialized emotion recognition model with the first training sample to obtain the first emotion recognition model may be implemented as follows: forward-propagating the first text through the initialized emotion recognition model to obtain a first prediction probability corresponding to the first emotion type tag and a second predicted value characterizing the character emotion intensity type of each first character of the first text; determining a first multi-task loss based on the first prediction probability and the second predicted value; and updating parameters of the initialized emotion recognition model based on the first multi-task loss to obtain the first emotion recognition model. In this way the forward propagation results of different tasks are obtained simultaneously, which effectively improves model training efficiency.
As an example, the initialized emotion recognition model includes an initialized feature sharing network and an initialized prediction network corresponding to each task. The feature sharing network may be a pre-trained text encoder, such as a Transformer-based encoder (BERT, Bidirectional Encoder Representations from Transformers), and the prediction network corresponding to each task may be a multi-layer perceptron (MLP, Multilayer Perceptron): the multi-layer perceptron corresponding to the emotion classification task predicts emotion types based on the output of the text encoder, and the multi-layer perceptron corresponding to the character emotion intensity classification task predicts character emotion intensities based on the output of the text encoder.
In some embodiments, the above forward propagation of the first text through the initialized emotion recognition model, to obtain the first prediction probability corresponding to the first emotion type tag and the second predicted value characterizing the character emotion intensity type of each first character, may be implemented as follows: performing feature extraction on the first text through the initialized feature sharing network to obtain a first feature code of the front position of the first text and a first feature code of each character position of the first text; performing emotion type prediction on the first feature code of the front position through the initialized prediction network corresponding to the emotion classification task to obtain the first prediction probability corresponding to the first emotion type tag; and performing character emotion intensity prediction on the first feature code of each character position through the initialized prediction network corresponding to the character emotion intensity classification task to obtain the second predicted value characterizing the character emotion intensity type of each first character. In this way local and global information can be learned simultaneously and the forward propagation results of different tasks obtained at the same time, effectively improving the training efficiency and accuracy of the model.
As an example, the feature extraction is implemented by a text encoder, which may be any Transformer-based encoder (BERT, Bidirectional Encoder Representations from Transformers) consisting of 12 stacked encoding modules. Specifically, before entering BERT, a [CLS] token is added at the head of the sentence as a special classification token containing global information, and a [SEP] token is added at the end of the sentence for segmentation. After the text is input into BERT, each position in the text (the front position and the character positions) is mapped to a character embedding feature through a pre-trained vocabulary; the character embedding features, position embedding features and segmentation mapping features are superimposed as the input of the text encoder. Multiple groups of key vectors, query vectors and value vectors are obtained by mapping the input, where the number of groups is the number of heads of the multi-head attention, and self-attention is then computed as shown in formula (1):

Attention(Q, K, V) = softmax(Q K^T / √d_k) V (1)

where K is the key vector, V is the value vector, Q is the query vector, and d_k is the dimension of the K vector. After the results of the attention heads are concatenated, the encoder outputs a global feature (corresponding to the first feature code of the front position, i.e., the embedding feature of the [CLS] position) and an embedding feature of each first character (the first feature code corresponding to that first character).
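For reference, formula (1) is the standard scaled dot-product attention; a minimal sketch in PyTorch (PyTorch is an implementation choice made here for illustration, not something the application prescribes):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, heads, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)   # (batch, heads, seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ v              # weighted sum of the value vectors
```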
As an example, the first feature code of the front position of the first text is subjected to emotion type prediction processing through the initialized prediction network corresponding to the emotion classification task, so as to obtain a first prediction probability corresponding to the first emotion type label, the initialized prediction network corresponding to the emotion classification task is a multi-layer perception network a, and the probabilities of the first text corresponding to a plurality of candidate emotion type labels are output through the multi-layer perception network a, wherein the first prediction probability corresponding to the first emotion type label is included.
As an example, character emotion intensity prediction is performed on the first feature code of each character position of the first text through the initialized prediction network corresponding to the character emotion intensity classification task, to obtain a second predicted value characterizing the character emotion intensity type of each first character. For example, if the second predicted value is 0.15 and lies in the interval corresponding to low intensity, the character emotion intensity of that first character is low intensity. The initialized prediction network corresponding to the character emotion intensity classification task is another multi-layer perception network B, and the parameters of the two multi-layer perception networks are different.
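A schematic of the shared text encoder with the two task-specific heads described above (multi-layer perception network A on the front-position feature for emotion classification, multi-layer perception network B on each character-position feature for character emotion intensity) might look as follows; the use of a HuggingFace-style BERT, the model name and the head sizes are assumptions made for illustration only:

```python
import torch.nn as nn
from transformers import BertModel

class MultiTaskEmotionModel(nn.Module):
    """Sketch of the initialized emotion recognition model: a shared BERT
    encoder plus one prediction head per task."""
    def __init__(self, num_emotions: int = 9, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(bert_name)   # initialized feature sharing network
        hidden = self.encoder.config.hidden_size
        # multi-layer perception network A: sentence-level emotion type
        self.emotion_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                          nn.Linear(hidden, num_emotions))
        # multi-layer perception network B: character-level emotion intensity (one value per character)
        self.char_intensity_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                                 nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_feat = out.last_hidden_state[:, 0]     # front-position ([CLS]) feature code
        char_feat = out.last_hidden_state[:, 1:]   # character-position feature codes ([SEP]/padding handling omitted for brevity)
        emotion_logits = self.emotion_head(cls_feat)                       # (batch, num_emotions)
        char_intensity = self.char_intensity_head(char_feat).squeeze(-1)   # (batch, seq_len - 1)
        return emotion_logits, char_intensity
```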
As an example, the first multi-task loss is determined based on the first prediction probability and the second predicted value. The natural logarithm of the first prediction probability is determined; if there are multiple first texts, the resulting natural logarithms are averaged to obtain a first average error. The mean square error of the character emotion intensity classification task is then determined, i.e., the square of the difference between the second predicted value and the label value corresponding to the first character emotion intensity tag, where that label value is the audio feature (e.g., audio energy) of the first character; if there are multiple first characters, the resulting squares are averaged to obtain a second average error. The difference between the second average error and the first average error is taken as the first multi-task loss. The first multi-task loss is back-propagated through the initialized emotion recognition model to obtain the parameter variation when the loss converges to a minimum, and the parameters of the initialized emotion recognition model are updated based on that variation.
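A sketch of this combined objective under the same assumptions as the model sketch above: a cross-entropy term on the emotion type (the negated mean log-probability) plus a mean square error term on the per-character intensities against their audio-energy labels. The relative weighting of the two terms is not specified by the application and is left at 1:1 here:

```python
import torch.nn.functional as F

def first_multitask_loss(emotion_logits, emotion_labels,
                         char_intensity_pred, char_energy_labels, char_mask):
    """emotion_labels: (batch,) index of each sample's first emotion type tag.
    char_energy_labels: (batch, seq) audio-energy values used as character intensity labels.
    char_mask: 1.0 for real characters, 0.0 for padding."""
    # emotion classification task: negative mean log-probability of the labelled type
    ce = F.cross_entropy(emotion_logits, emotion_labels)
    # character emotion intensity task: mean square error against the audio-energy labels
    sq = (char_intensity_pred - char_energy_labels) ** 2 * char_mask
    mse = sq.sum() / char_mask.sum().clamp(min=1)
    return ce + mse
```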
In some embodiments, when the multiple tasks include the emotion classification task, the text emotion intensity classification task and the character emotion intensity classification task, step 103 of performing the multi-task first training process on the initialized emotion recognition model with the first training sample to obtain the first emotion recognition model may be implemented as follows: forward-propagating the first text through the initialized emotion recognition model to obtain a fifth prediction probability corresponding to the first emotion type tag, a sixth predicted value characterizing the text emotion intensity type of the first text and a seventh predicted value characterizing the character emotion intensity type of each first character; determining a second multi-task loss based on the fifth prediction probability, the sixth predicted value and the seventh predicted value; and updating parameters of the initialized emotion recognition model based on the second multi-task loss to obtain the first emotion recognition model.
As an example, the second multi-task loss is determined in a similar manner to the first multi-task loss, except that the mean square error of the text emotion intensity classification task must also be determined, i.e., the square of the difference between the sixth predicted value and the label value corresponding to the first text emotion intensity tag, where that label value is the audio feature (e.g., audio energy) of the first text; if there are multiple first texts, the resulting squares are averaged to obtain a third average error. The natural logarithm of the fifth prediction probability is determined; if there are multiple first texts, the resulting natural logarithms are averaged to obtain a fourth average error. The mean square error of the character emotion intensity classification task is then determined, i.e., the square of the difference between the seventh predicted value and the label value corresponding to the first character emotion intensity tag, where that label value is the audio feature (e.g., audio energy) of the first character; if there are multiple first characters, the resulting squares are averaged to obtain a fifth average error. The difference between the sum of the third and fifth average errors and the fourth average error is taken as the second multi-task loss. The second multi-task loss is back-propagated through the initialized emotion recognition model to obtain the parameter variation when the loss converges to a minimum, the parameters of the initialized emotion recognition model are updated based on that variation, and the updated model is taken as the first emotion recognition model.
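When the text emotion intensity classification task is included as well, the objective simply gains a third, sentence-level mean square error term; a self-contained sketch, again with an assumed 1:1:1 weighting of the terms:

```python
import torch
import torch.nn.functional as F

def second_multitask_loss(emotion_logits, emotion_labels,
                          text_intensity_pred, text_energy_labels,
                          char_intensity_pred, char_energy_labels, char_mask):
    ce = F.cross_entropy(emotion_logits, emotion_labels)                     # emotion classification task
    text_mse = torch.mean((text_intensity_pred - text_energy_labels) ** 2)   # text emotion intensity task
    sq = (char_intensity_pred - char_energy_labels) ** 2 * char_mask
    char_mse = sq.sum() / char_mask.sum().clamp(min=1)                       # character emotion intensity task
    return ce + text_mse + char_mse
```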
In some embodiments, the initialized emotion recognition model includes an initialized feature sharing network and an initialized prediction network corresponding to each task, and the above forward propagation of the first text through the initialized emotion recognition model, to obtain the fifth prediction probability corresponding to the first emotion type tag, the sixth predicted value characterizing the text emotion intensity type of the first text and the seventh predicted value characterizing the character emotion intensity type of each first character, may be implemented as follows: performing feature extraction on the first text through the initialized feature sharing network to obtain a third feature code of the front position of the first text and a third feature code of each character position of the first text; performing emotion type prediction on the third feature code of the front position through the initialized prediction network corresponding to the emotion classification task to obtain the fifth prediction probability corresponding to the first emotion type tag; performing text emotion intensity prediction on the third feature code of the front position through the initialized prediction network corresponding to the text emotion intensity classification task to obtain the sixth predicted value characterizing the text emotion intensity type of the first text; and performing character emotion intensity prediction on the third feature code of each character position through the initialized prediction network corresponding to the character emotion intensity classification task to obtain the seventh predicted value characterizing the character emotion intensity type of each first character.
As an example, when the multiple tasks include the emotion classification task, the text emotion intensity classification task and the character emotion intensity classification task, the initialized emotion recognition model may include an initialized feature sharing network, an initialized prediction network corresponding to the emotion classification task, an initialized prediction network corresponding to the text emotion intensity classification task and an initialized prediction network corresponding to the character emotion intensity classification task. Emotion type prediction is performed on the third feature code of the front position of the first text through the initialized prediction network corresponding to the emotion classification task, which is a multi-layer perception network A; it outputs the probabilities of the first text for each candidate emotion type tag, including the fifth prediction probability corresponding to the first emotion type tag. Text emotion intensity prediction is performed on the third feature code of the front position through the initialized prediction network corresponding to the text emotion intensity classification task, which is another multi-layer perception network C; it outputs the sixth predicted value characterizing the text emotion intensity type of the first text. For example, if the sixth predicted value is 0.15 and lies in the interval corresponding to low intensity, the text emotion intensity type of the first text is low intensity. Character emotion intensity prediction is performed on the third feature code of each character position through the initialized prediction network corresponding to the character emotion intensity classification task, which is another multi-layer perception network B; it outputs the seventh predicted value characterizing the character emotion intensity type of each first character. For example, if the seventh predicted value is 0.15 and lies in the interval corresponding to low intensity, the character emotion intensity type of that first character is low intensity. The parameters of the three multi-layer perception networks are different.
In step 104, a second training sample with a real label is obtained, and a second training process based on multiple tasks is performed on the first emotion recognition model through the second training sample, so as to obtain a second emotion recognition model.
As an example, the first emotion recognition model is trained by the second training sample in a similar manner to the above-described training of the initialized emotion recognition model by the first training sample, except that the sample used for training is the second training sample obtained by true labeling, and the first emotion recognition model obtained by training in step 103 is subjected to parameter updating to obtain the second emotion recognition model. These emotion recognition models are identical in structure but different in parameters.
In some embodiments, the obtaining of the second training sample with the real label in step 104 may be achieved by the following technical scheme: acquiring a second emotion type tag of the second text and a second character emotion intensity tag of each second character of the second text as real tags; the second emotion type tag and the second character emotion intensity tag are obtained through true labeling; and forming a second training sample by the second text, the second emotion type label and the second character emotion intensity label. According to the embodiment of the application, the training sample with the real label can be obtained, so that the training accuracy is improved.
In some embodiments, when the first training sample further includes the first text emotion intensity tag, the obtaining the second training sample with the real tag in step 104 may be implemented by the following technical scheme: acquiring a second emotion type tag of a second text, a second text emotion intensity tag of the second text, and a second character emotion intensity tag of each second character of the second text as real tags; the second emotion type tag, the second text emotion intensity tag and the second character emotion intensity tag are obtained through true labeling; and forming a second training sample from the second text, the second emotion type tag, the second text emotion intensity tag and the second character emotion intensity tag.
By way of example, a small-scale data set with high labeling accuracy is obtained through manual labeling; to distinguish it from the weak tag data set, the small-scale data set with high labeling accuracy is named strong_dataset, and each sentence in strong_dataset is attached with a sentence-level emotion type and intensity and word-level emotion intensities. The data form of both weak_dataset_v1 and strong_dataset is $\{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of samples in the data set, $x_i=\{t_{i,1},t_{i,2},\ldots,t_{i,L}\}$, $t_{i,j}$ is the j-th word of sentence $x_i$, and L is the sentence length of $x_i$; $y_i$ consists of three parts, namely the emotion category $y_{i,class}$ of sentence $x_i$, the emotion intensity $y_{i,intensity}$ of sentence $x_i$, and the emotion intensity $y_{i,j,intensity}$ of the j-th word $t_{i,j}$ in sentence $x_i$. Specifically, when the second training sample includes the second text, the second emotion type tag, the second text emotion intensity tag, and the second character emotion intensity tag, $y_i=\{y_{class}, y_{intensity}, (y_{1\_intensity}, y_{2\_intensity}, \ldots, y_{L\_intensity})\}$; when the second training sample includes the second text, the second emotion type tag, and the second character emotion intensity tag, $y_i=\{y_{class}, (y_{1\_intensity}, y_{2\_intensity}, \ldots, y_{L\_intensity})\}$.
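For concreteness, the sample layout described above can be pictured with the following Python sketch; the field names and the example values are illustrative assumptions introduced here, not identifiers from this application.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EmotionSample:
    words: List[str]                    # t_{i,1} ... t_{i,L} of sentence x_i
    emotion_class: str                  # y_class
    word_intensity: List[str]           # y_{j,intensity}, one tag per word
    sentence_intensity: Optional[str] = None  # y_intensity; omitted when the
                                              # sample carries no text-level tag

# A strong_dataset-style sample carries all three parts of y_i:
sample = EmotionSample(
    words=["速度", "与", "激情"],
    emotion_class="happiness",
    word_intensity=["high", "no intensity", "high"],
    sentence_intensity="medium",
)
```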
In step 105, a target emotion recognition model for performing the multitasking is acquired based on the second emotion recognition model.
In some embodiments, referring to fig. 3C, step 105 of obtaining the target emotion recognition model based on the second emotion recognition model may be implemented by steps 1051 through 1053 shown in fig. 3C.
In step 1051, a multitasking based predictive process is performed on the first text with a second emotion recognition model.
In some embodiments, when the multitasking includes an emotion classification task and a character emotion intensity classification task, the second emotion recognition model includes a second feature sharing network and a second prediction network corresponding to each task, and in step 1051, the multitasking-based prediction processing is performed on the first text by using the second emotion recognition model to obtain a prediction result, which may be implemented by the following technical scheme: performing feature extraction processing on the first text through a second feature sharing network to obtain a second feature code of a front position of the first text and a second feature code of each character position of the first text; carrying out emotion type prediction processing on the second feature codes of the prepositions of the first text through a second prediction network corresponding to the emotion classification task to obtain a third prediction probability corresponding to each candidate emotion type label; carrying out character emotion intensity prediction processing on the second feature codes of each character position of the first text through a second prediction network corresponding to the character emotion intensity classification task to obtain a fourth prediction value representing the character emotion intensity type of each first character of the first text; and forming a prediction result by the third prediction probability corresponding to each candidate emotion type label and the fourth prediction value of the emotion intensity type of each first character of the first text.
As an example, performing the multitasking-based prediction processing on the first text through the second emotion recognition model means forward propagating the first text in the second emotion recognition model. When the tasks are the same, the second emotion recognition model has the same structure as the initialized emotion recognition model, and only the parameters of the two models differ as a result of the two rounds of training; the forward propagation process of the initialized emotion recognition model has been described in detail above and is not repeated here.
In some embodiments, when the multitasking includes an emotion classification task, a text emotion intensity classification task, and a character emotion intensity classification task, the second emotion recognition model includes a second feature sharing network and a second prediction network corresponding to each task, and the above-mentioned multitasking-based prediction processing is performed on the first text by using the second emotion recognition model to obtain a prediction result, which may be implemented by the following technical scheme: performing feature extraction processing on the first text through a second feature sharing network to obtain a fourth feature code of a front position of the first text and a fourth feature code of each character position of the first text; carrying out emotion type prediction processing on the fourth feature codes of the prepositions of the first text through a second prediction network corresponding to the emotion classification task to obtain eighth prediction probability corresponding to each candidate emotion type label; carrying out text emotion intensity prediction processing on fourth feature codes of the prepositions of the first text through a second prediction network corresponding to the text emotion intensity classification task to obtain a ninth predicted value of the first text, which characterizes the text emotion intensity type; carrying out character emotion intensity prediction processing on fourth feature codes of each character position of the first text through a second prediction network corresponding to the character emotion intensity classification task to obtain a tenth prediction value of each first character of the first text representing the character emotion intensity type; and combining the eighth prediction probability, the ninth prediction value and the tenth prediction value into a prediction result.
As an example, performing the multitasking-based prediction processing on the first text through the second emotion recognition model means forward propagating the first text in the second emotion recognition model. When the tasks are the same, the second emotion recognition model has the same structure as the initialized emotion recognition model, and only the parameters of the two models differ as a result of the two rounds of training; the forward propagation process of the initialized emotion recognition model has been described in detail above and is not repeated here.
In step 1052, the first character emotion intensity tag is updated based on the prediction result, a third character emotion intensity tag of the first text is obtained, and the first text, the first emotion type tag and the third character emotion intensity tag form a third training sample.
In some embodiments, when the first training sample includes a first emotion type tag, a first text emotion intensity tag, and a first character emotion intensity tag, updating the first text emotion intensity tag and the first character emotion intensity tag based on the prediction result to obtain a third text emotion intensity tag and a third character emotion intensity tag of the first text, and forming the first text, the first emotion type tag, the third text emotion intensity tag, and the third character emotion intensity tag into a third training sample.
In some embodiments, in step 1052, the first character emotion intensity tag is updated based on the prediction result, and the third character emotion intensity tag of the first text is obtained, which may be implemented by the following technical scheme: the following processing is performed for each first character: carrying out average processing on the fourth predicted value and the label value corresponding to the first character emotion intensity type label to obtain a third average label value; and acquiring a third character emotion intensity type label corresponding to the third average label value to replace the first character emotion intensity type label. According to the embodiment of the application, the first training sample is optimized, so that the training efficiency and the training accuracy of the model can be improved.
As an example, the following processing is performed for each first character. Suppose the first text is "讨厌" ("annoying"), whose first characters are "讨" and "厌", and the fourth predicted value characterizing the character emotion intensity type for the first character "讨" is 0.9. Since the first character emotion intensity tag of the first character "讨" is low intensity and its corresponding tag value is the character voice feature of the first character (i.e., the audio energy of the first character), namely 0.2, averaging 0.9 and 0.2 yields a third average tag value of 0.55; since 0.55 falls in the interval corresponding to medium intensity, medium intensity is used as the third character emotion intensity tag.
In some embodiments, updating the first text emotion intensity tag and the first character emotion intensity tag based on the prediction result to obtain a first text third text emotion intensity tag and a third character emotion intensity tag may be implemented by the following technical scheme: carrying out average processing on the ninth predicted value and the label value corresponding to the first text emotion intensity type label to obtain a fourth average label value; acquiring a third text emotion intensity type label corresponding to the fourth average label value to replace the first text emotion intensity type label; the following processing is performed for each first character: carrying out average processing on the tenth predicted value and the label value corresponding to the first character emotion intensity type label to obtain a fifth average label value; and acquiring a third character emotion intensity type label corresponding to the fifth average label value to replace the first character emotion intensity type label.
As an example, the following processing is performed for the first text. Suppose the first text is "讨厌" ("annoying") and the ninth predicted value characterizing the text emotion intensity type of the first text is 0.9. Since the first text emotion intensity tag of the first text "讨厌" is low intensity and its corresponding tag value is the text voice feature of the first text (i.e., the audio energy of the first text), namely 0.2, averaging 0.9 and 0.2 yields a fourth average tag value of 0.55; since 0.55 falls in the interval corresponding to medium intensity, medium intensity is used as the third text emotion intensity tag. The same processing is then performed for each first character, for example the first characters "讨" and "厌": the tenth predicted value characterizing the character emotion intensity type for the first character "讨" is 0.9, the first character emotion intensity tag of "讨" is low intensity with a corresponding tag value of 0.2 (the audio energy of the first character), and averaging yields a fifth average tag value of 0.55; since 0.55 falls in the interval corresponding to medium intensity, medium intensity is used as the third character emotion intensity tag.
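A minimal, self-contained sketch of this update step is shown below; the bucket boundaries are assumptions chosen so that a mean of 0.55 falls in the medium-intensity interval, as in the worked example, and the helper names are illustrative rather than defined by this application.

```python
def to_intensity_class(value: float) -> str:
    # Bucket boundaries are assumptions chosen to reproduce the example above.
    if value < 0.1:
        return "no intensity"
    if value < 0.3:
        return "low"
    if value < 0.6:
        return "medium"
    return "high"

def update_intensity_tag(old_tag_value: float, predicted_value: float):
    """Average the tag value underlying the current weak tag with the model's
    predicted value, then map the mean back to an intensity class."""
    avg = (old_tag_value + predicted_value) / 2.0
    return to_intensity_class(avg), avg

# Worked example: old tag value 0.2 ("low"), predicted value 0.9
# -> mean 0.55 -> "medium".
new_tag, new_value = update_intensity_tag(0.2, 0.9)
assert new_tag == "medium"
```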
In step 1053, a third training process based on the multitasking is performed on the second emotion recognition model through the third training sample, so as to obtain a target emotion recognition model for performing the multitasking.
In some embodiments, in step 1053, the performing the third training process based on the multitasking on the second emotion recognition model through the third training sample, to obtain the target emotion recognition model for performing the multitasking may be implemented by the following technical solutions: performing fourth training processing based on multiple tasks on the second emotion recognition model through the third training sample to obtain a third emotion recognition model; and performing fifth training processing based on the multitasking on the fourth emotion recognition model through the second training sample to obtain a target emotion recognition model.
As an example, the second emotion recognition model is trained by the third training sample in a similar manner to the above-described training of the initialized emotion recognition model by the first training sample, except that the sample used for training is the third training sample obtained by updating, and the second emotion recognition model obtained by training in step 104 is subjected to parameter updating, so as to obtain the third emotion recognition model. These emotion recognition models are identical in structure but different in parameters. The method for training the third emotion recognition model through the second training sample is similar to the method for training the initialized emotion recognition model through the first training sample, and only the difference is that the sample used for training is a training sample obtained through true labeling, and the fourth emotion recognition model obtained through training is subjected to parameter updating, so that the target emotion recognition model is obtained. These emotion recognition models are identical in structure but different in parameters.
Referring to fig. 4, fig. 4 is a schematic flow chart of an artificial intelligence based text processing method according to an embodiment of the present application, and will be described with reference to steps 201 to 203 shown in fig. 4.
In step 201, text to be recognized of a virtual object is acquired.
The text to be recognized may be content to be broadcast by a preset virtual object, and the virtual object may be a virtual person, or even a virtual animal, and the like.
In step 202, the text to be recognized is subjected to emotion recognition processing through the target emotion recognition model, so as to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized.
In some embodiments, when the multitasking in the model training process further includes a text emotion intensity classification task, performing emotion recognition processing on the text to be recognized through the target emotion recognition model to obtain a predicted emotion type of the text to be recognized, a predicted text emotion intensity of the text to be recognized, and a predicted character emotion intensity of each character to be recognized in the text to be recognized.
As an example, the target emotion recognition model is obtained by executing the model training method provided by the embodiment of the present application, where performing emotion recognition processing on the text to be recognized through the target emotion recognition model is equivalent to performing forward propagation on the text to be recognized in the target emotion recognition model, and the specific processing process may refer to the forward propagation process of the first text in the initialized emotion recognition model, where the difference is only that the parameters of the target emotion recognition model are different from those of the initialized emotion recognition model, and the structure of the target emotion recognition model is the same as that of the initialized emotion recognition model, and if the initialized emotion recognition model has a function of text emotion intensity classification, the target emotion recognition model has a function of text emotion intensity classification.
In step 203, based on the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized, display driving processing is performed on the virtual object, so as to obtain a display effect when the virtual object broadcasts the text to be recognized.
In some embodiments, when the multitasking in the model training process further includes a text emotion intensity classification task, display driving processing is performed on the virtual object based on the predicted emotion type of the text to be recognized, the predicted text emotion intensity of the text to be recognized, and the predicted character emotion intensity of each character to be recognized in the text to be recognized, so as to obtain a display effect when the virtual object broadcasts the text to be recognized.
As an example, the emotion type and the character-level emotion intensity of the text to be recognized can finally be calculated by the model. In actual virtual person expression driving, the category of the expression is controlled by the emotion type; if a sentence-level emotion intensity can also be output, it provides the keynote for the exaggeration degree of the expression; and the fine-grained emotion control of the virtual person over the whole process of speaking a sentence is regulated according to the word-level emotion intensity.
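Purely as an illustration of how these three outputs might be combined when building an expression track (the actual driving pipeline is not specified here), the following sketch scales a per-word expression weight by both the sentence-level keynote and the word-level intensity; every name and the weighting scheme are assumptions.

```python
from typing import Dict, List

def build_expression_track(emotion_type: str,
                           sentence_intensity: float,
                           word_intensities: List[float]) -> List[Dict]:
    """One entry per word: which expression to show and how exaggerated it is."""
    return [{"expression": emotion_type,
             "weight": sentence_intensity * w}   # keynote scaled per word
            for w in word_intensities]
```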
The embodiment of the application provides an end-to-end fine-grained emotion prediction scheme for the driving requirements of virtual human expressions, which has stronger interpretability than the original scheme and does not depend excessively on manually annotated data sets. In this scheme, the model learns fine-grained emotion information by combining the voice features, so that the finally driven virtual human expression is more natural and realistic, which can effectively improve the viewing experience of users. In terms of data, the embodiment of the application only needs to collect audio data, which is easy to obtain, and the data set construction method is simple; data sets in different fields can be constructed in the same way according to service requirements in subsequent work. In terms of models, the text encoder and the multi-task output network in the model structure support replacement with other models, which can then be retrained on the target data set.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The terminal sends a model training request for artificial intelligence, and the server executes the artificial-intelligence-based model training method according to the received model training request: acquiring a first interval in which the text voice feature of a first text is located, and acquiring a first text emotion intensity tag corresponding to the first interval; for each first character of the first text, acquiring a second interval in which the character voice feature of the first character is located, and acquiring a first character emotion intensity tag corresponding to the second interval; forming a first training sample from the first text, a first emotion type tag of the first text, the first text emotion intensity tag and the first character emotion intensity tag; performing a multitasking-based first training process on the initialized emotion recognition model through the first training sample to obtain a first emotion recognition model, wherein the multitasking includes an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task; acquiring a second training sample with a real label, and performing a multitasking-based second training process on the first emotion recognition model through the second training sample to obtain a second emotion recognition model; and obtaining, based on the second emotion recognition model, a target emotion recognition model for executing the multitasking. The server then returns prompt information indicating that model training is completed to the terminal, and the terminal sends a display driving request carrying a text to be recognized to the server, where the text to be recognized may be the broadcast text of a virtual object (such as a virtual person). The server receives the display driving request to obtain the text to be recognized of the virtual object; performs emotion recognition processing on the text to be recognized through the target emotion recognition model to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized; and performs display driving processing on the virtual object based on the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized, so as to obtain the display effect when the virtual object broadcasts the text to be recognized. The display driving includes driving the expression: in actual virtual person expression driving, the category of the expression is controlled by the emotion type, the sentence-level emotion intensity provides the keynote for the exaggeration degree of the expression, and the fine-grained emotion control of the virtual person over the whole process of speaking a sentence is regulated according to the word-level emotion intensity. The display effect is fed back to the terminal for presentation.
The model provided by the embodiment of the application is an end-to-end model whose function is to provide intelligent labeling for fine-grained expression driving of a virtual person based on the emotion type predicted from text and the multi-level emotion intensity at sentence level and word level; the specific application scenarios of the scheme are described below.
Through investigation of voice features, the method selects the voice energy feature as a weak tag for model pre-training, which solves the problem that word-level emotion intensity is difficult to acquire, provides a certain interpretability for the model, and ensures that the timing and exaggeration degree of the virtual human expression driven by the intelligent annotation are more natural and reasonable, as shown in fig. 6, which shows fine-grained expression annotation.
The fine-grained expression driving combining voice features provided by the embodiment of the application mainly depends on an emotion type and emotion intensity prediction algorithm. The algorithm is an end-to-end multi-task learning algorithm: the input of the model is a sentence, and the output is the emotion type and intensity of the whole sentence and the emotion intensity of each word in the sentence. Emotion is divided into nine types, namely anger, liking, surprise, suspicion, fear, happiness, neutrality, aversion and sadness, and intensity is divided into four grades, namely no intensity, low intensity, medium intensity and high intensity. The model mainly consists of two parts. The first part is a text encoder; the text encoder can use a Transformer model, and the encoder encodes the sentence into embedding features, which include a global feature (namely the embedding feature corresponding to the [CLS] position) and the embedding feature of each word. The second part is a multi-task output network consisting of three groups of multi-layer perceptrons (MLP, Multilayer Perceptron), which respectively use the embedding feature corresponding to the [CLS] position to predict the emotion category and the emotion intensity of the whole sentence, and the embedding feature of each word to predict the emotion intensity of the corresponding word. Referring to fig. 7, fig. 7 shows the general flow of the emotion classification and emotion intensity prediction algorithm. The model shown in fig. 7 is composed of two parts, namely the text encoder and the multi-task output network, where the text encoder outputs embedding features as input to the multi-task output network, and the multi-task output network predicts the emotion classification of the whole sentence text, the emotion intensity of the whole sentence text, and the emotion intensity of each word in the text, respectively. In this scheme, the emotion recognition and emotion intensity prediction tasks are modeled as classification tasks; the loss function of the intensity tasks is mean square error loss, and the loss function of the emotion classification task is cross entropy loss. The predicted emotion type and emotion intensity are directly mapped into the expression track, and finally drive the expression of the virtual person.
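As one possible reading of this objective, the following PyTorch-style sketch combines a cross-entropy loss on the sentence emotion category with mean-squared-error losses on the sentence-level and word-level intensity outputs; the equal loss weighting, tensor shapes and padding mask are assumptions not fixed by this application.

```python
import torch
import torch.nn.functional as F

def multitask_loss(class_logits,        # [B, num_emotions]
                   sent_intensity,      # [B] predicted sentence intensity
                   word_intensity,      # [B, L] predicted word intensity
                   class_target,        # [B] long: emotion category index
                   sent_target,         # [B] float: sentence intensity label
                   word_target,         # [B, L] float: word intensity labels
                   word_mask):          # [B, L] 1.0 for real tokens, 0.0 for padding
    ce = F.cross_entropy(class_logits, class_target)
    mse_sent = F.mse_loss(sent_intensity, sent_target)
    mse_word = (F.mse_loss(word_intensity, word_target, reduction="none")
                * word_mask).sum() / word_mask.sum().clamp(min=1)
    return ce + mse_sent + mse_word     # equal weighting assumed
```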
In some embodiments, the process of constructing the data set is described below. Clean and accurate emotion intensity data sets are very rare; in particular, to the best of our knowledge, no word-level emotion intensity data set is currently open-sourced. Fine-grained emotion intensity labeling is also very expensive, yet interpretable fine-grained emotion intensity is critical for virtual human expression driving. To obtain a large-scale emotion data set, an emotion classification model and voice features are used to implement emotion type labeling and emotion intensity labeling of the large-scale data set, respectively.
The following describes the data acquisition method. Video data (film and television works and variety videos) is collected; audio data is obtained by intercepting real-person speaking segments from the video data, and the audio data is further cut into single-sentence audio, i.e., each sentence serves as one audio sample. Text data is then collected: the audio is analyzed through ASR technology to obtain the text corresponding to each audio sample and the start and end time points of each word of the text in the audio. Emotion categories are then labeled; since emotion classification models in the related art can already achieve a good emotion classification effect, an existing emotion classification model is used to predict the sentence-level text, and the prediction result is used as the emotion type label of the corresponding sentence. Emotion intensity is then labeled. Emotion intensity prediction models in the related art mainly provide sentence-level intensity prediction and perform poorly, whereas expression driving requires the model to predict fine-grained emotion intensity. To obtain fine-grained emotion intensity labels, the most intuitive method is to label the data according to the expression of the speaker, but the complete expression of the speaker is difficult to obtain and analyze from the video data set. Considering that the audio of the speaker can directly reflect emotion intensity and that audio data is easy to collect and analyze, the voice features are used to generate the emotion intensity categories.
The manner of generating emotion intensity categories by using voice features is described in detail below. Through investigation and visualization of related voice features, the audio energy is positively correlated with emotion intensity; see fig. 8, which shows a voice feature diagram of the sentence "we are a team of speed and passion". As can be seen from fig. 8, the emotion peaks of "speed and passion", that is, the expression attribution points, can be captured by the two features of audio energy and speech intensity; the capture by audio energy is obvious, the first two energy peaks of the feature diagram correspond to "speed", and the third peak corresponds to "passion". Compared with energy, pitch shows no correlation with emotion, so the energy extracted from the audio is used to label the emotion intensity of the corresponding text.
Specifically, referring to fig. 9, fig. 9 shows the process of extracting and labeling audio energy. First, the audio energy of each frame in the audio is extracted, and the text and the start and end time of each word in the text are returned through ASR technology, so that each frame is aligned with the corresponding word. The audio energy of each word in the text is defined as the average value of the energy of its corresponding frames, and the audio energy of a sentence is defined as the average value of the energy of all words in the sentence. Taking five consecutive sentences spoken by the same speaker as a group, the audio energy corresponding to each sentence is normalized in units of the group, and the audio energy corresponding to each word is normalized in units of the sentence. Finally, emotion intensity labels are assigned to the words and sentences of the text by using the word-level and sentence-level audio energy respectively: audio energy falling in the interval 0-0.1 corresponds to no intensity; 0.1-0.3 corresponds to low intensity; 0.3-0.5 corresponds to medium intensity; and 0.5-1.0 corresponds to high intensity.
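A simplified sketch of this labeling pipeline is given below; the frame parameters, the ASR alignment format and the normalization granularity are assumptions, and in particular the group-of-five-sentences normalization of sentence energy is omitted for brevity.

```python
import numpy as np

def frame_energy(samples: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time energy of each frame of a mono waveform."""
    frames = [samples[i:i + frame_len]
              for i in range(0, max(len(samples) - frame_len + 1, 1), hop)]
    return np.array([float(np.mean(f.astype(float) ** 2)) for f in frames])

def word_energies(energy: np.ndarray, word_spans, frames_per_second: float) -> np.ndarray:
    """Mean frame energy of each word, given (start_sec, end_sec) spans from ASR."""
    values = []
    for start, end in word_spans:
        lo = int(start * frames_per_second)
        hi = max(int(end * frames_per_second), lo + 1)
        seg = energy[lo:hi]
        values.append(float(seg.mean()) if seg.size else 0.0)
    return np.array(values)

def normalize(values: np.ndarray) -> np.ndarray:
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

def intensity_tag(value: float) -> str:
    """0-0.1 no intensity, 0.1-0.3 low, 0.3-0.5 medium, 0.5-1.0 high."""
    if value < 0.1:
        return "no intensity"
    if value < 0.3:
        return "low"
    if value < 0.5:
        return "medium"
    return "high"

def weak_labels(word_vals: np.ndarray):
    """Word tags from sentence-normalized word energy; sentence tag from the mean."""
    norm = normalize(word_vals)
    return intensity_tag(float(norm.mean())), [intensity_tag(v) for v in norm]
```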
In this way, a large-scale pre-training data set at the scale of tens of millions of samples can be constructed and automatically labeled with weak labels; this pre-training data set is named weak_dataset_v1. Meanwhile, a small-scale data set with high labeling accuracy is obtained through manual labeling; to distinguish it from the weak tag data set, the small-scale data set with high labeling accuracy is named strong_dataset, and each sentence in strong_dataset is attached with a sentence-level emotion type and intensity and word-level emotion intensities. The data form of both weak_dataset_v1 and strong_dataset is $\{(x_i, y_i)\}_{i=1}^{N}$, where N is the number of samples in the data set, $x_i=\{t_{i,1},t_{i,2},\ldots,t_{i,L}\}$, $t_{i,j}$ is the j-th word of sentence $x_i$, and L is the sentence length of $x_i$; $y_i$ consists of three parts, namely the emotion category $y_{i,class}$ of sentence $x_i$, the emotion intensity $y_{i,intensity}$ of sentence $x_i$, and the emotion intensity $y_{i,j,intensity}$ of the j-th word $t_{i,j}$ in sentence $x_i$. Specifically, $y_i=\{y_{class}, y_{intensity}, (y_{1\_intensity}, y_{2\_intensity}, \ldots, y_{L\_intensity})\}$.
In order to ensure the expression driving efficiency of the virtual human, the embodiment of the application trains the model by using an end-to-end multi-task learning strategy, and the framework of the model mainly comprises two parts, namely a text encoder and a multi-task output network.
The text encoder may be any Transformer-based encoder, such as BERT (Bidirectional Encoder Representations from Transformers), consisting of 12 layers of stacked encoding modules. Specifically, before entering BERT, the text is first prepended with a [CLS] token at the head of the sentence as a special classification token containing global information, and appended with a [SEP] token at the end of the sentence for segmentation. After the text is input into BERT, each character in the text is mapped into a character embedding feature through a pre-trained vocabulary; the character embedding features, the position embedding features and the segmentation embedding features are superimposed and then used as the input of the encoder. A plurality of groups of key vectors, query vectors and value vectors are obtained by mapping processing based on the characters in the text, where the number of groups is the number of heads of the multi-head attention processing, and then self-attention calculation is performed. The self-attention calculation process is shown in formula (2):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$

where K is the key vector, V is the value vector, Q is the query vector, and $d_k$ is the dimension of the K vector.
And after the attention processing results of each head are spliced, the global feature (namely the embedding feature corresponding to the [CLS] position) and the embedding feature of each word can be obtained.
The output of the text encoder is the input of the multi-task output network, which consists of three groups of MLPs corresponding to three groups of outputs: the embedding of the [CLS] position is input into the first group of MLPs, which outputs the emotion category; the embedding of the [CLS] position is input into the second group of MLPs, which outputs the sentence-level emotion intensity; and the embedding of each word is input into the third group of MLPs, which outputs the word-level emotion intensity.
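Expressed as code, the structure described above might look like the following PyTorch sketch; the use of the HuggingFace transformers BERT, the checkpoint name, the hidden size and the depth of each MLP are assumptions, and the text encoder may be replaced by other encoders as noted elsewhere in the description.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultiTaskEmotionModel(nn.Module):
    def __init__(self, num_emotions: int = 9, hidden: int = 768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
        self.class_head = mlp(num_emotions)   # sentence emotion category
        self.sent_head = mlp(1)               # sentence-level intensity
        self.word_head = mlp(1)               # word-level intensity

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        cls = h[:, 0]                                  # [CLS] embedding
        return (self.class_head(cls),                  # [B, num_emotions]
                self.sent_head(cls).squeeze(-1),       # [B]
                self.word_head(h).squeeze(-1))         # [B, L]
```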
The following describes the training process of the model, which is performed on the basis of the above model structure, and is divided into the following four stages in order to make full use of the weak labels, see fig. 10.
In the first stage, the model is initialized with the parameters of a pre-trained language model, and the model then performs multi-task training on a large amount of in-domain data (for example, when the virtual person is a presenter, presenter data is collected). Task one is masked language learning, used for learning in-domain knowledge (the MLP corresponding to the masked language learning task is discarded in the subsequent stages); task two is emotion classification, where the emotion categories predicted by a high-precision emotion classification model are used as labels to inject emotion knowledge into the model; task three is text emotion intensity classification and task four is character emotion intensity classification, both trained with the weak labels generated from the voice features. The model obtained through training in the first stage is called emotion_model_s1.
In the second stage, emotion_model_s1 performs multi-task training of emotion classification and emotion intensity classification on the manually labeled strong_dataset to obtain the model emotion_model_s2.
In the third stage, the data with weak labels are predicted using emotion_model_s2, the prediction results are averaged with the weak labels to obtain new weak labels, and the old labels are replaced by the new labels to obtain weak_dataset_v2; emotion_model_s2 then performs multi-task training of emotion classification and emotion intensity classification on weak_dataset_v2 to obtain the model emotion_model_s3.
In the fourth stage, the model (emotion_model_s3) performs multi-task training (fine-tuning) of emotion classification and emotion intensity classification on strong_dataset to obtain the final model.
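The four stages can be summarized by the following high-level Python sketch; the three callables are placeholders introduced only to show the data flow between stages, not APIs defined by this application.

```python
def train_pipeline(weak_dataset_v1, strong_dataset,
                   init_model, multitask_train, refresh_weak_labels):
    """Stages 1-4; init_model, multitask_train and refresh_weak_labels are
    placeholder callables for the actual routines."""
    model_s1 = multitask_train(init_model(), weak_dataset_v1,      # stage 1
                               tasks=["masked_lm", "emotion_cls",
                                      "sentence_intensity", "word_intensity"])
    model_s2 = multitask_train(model_s1, strong_dataset,           # stage 2
                               tasks=["emotion_cls", "sentence_intensity",
                                      "word_intensity"])
    weak_dataset_v2 = refresh_weak_labels(model_s2, weak_dataset_v1)
    model_s3 = multitask_train(model_s2, weak_dataset_v2,          # stage 3
                               tasks=["emotion_cls", "sentence_intensity",
                                      "word_intensity"])
    return multitask_train(model_s3, strong_dataset,               # stage 4: fine-tune
                           tasks=["emotion_cls", "sentence_intensity",
                                  "word_intensity"])
```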
Based on the above scheme, the emotion type and intensity of the input text can finally be calculated by the model. In actual virtual person expression driving, the category of the expression is controlled by the emotion type, the sentence-level emotion intensity provides the keynote for the exaggeration degree of the expression, and the fine-grained emotion control of the virtual person over the whole process of speaking a sentence is regulated according to the word-level emotion intensity.
The embodiment of the application provides an end-to-end fine-grained emotion prediction scheme for the driving requirements of virtual human expressions, which has stronger interpretability than the original scheme and does not depend excessively on manually annotated data sets. In this scheme, the model learns fine-grained emotion information by combining the voice features, so that the finally driven virtual human expression is more natural and realistic, which can effectively improve the viewing experience of users. In terms of data, the embodiment of the application only needs to collect audio data, which is easy to obtain, and the data set construction method is simple; data sets in different fields can be constructed in the same way according to service requirements in subsequent work. In terms of models, the text encoder and the multi-task output network in the model structure support replacement with other models, which can then be retrained on the target data set.
It will be appreciated that in the embodiments of the present application, related data such as user information is involved, and when the embodiments of the present application are applied to specific products or technologies, user permissions or agreements need to be obtained, and the collection, use and processing of related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
Continuing with the description below of an exemplary architecture implemented as software modules for an artificial intelligence based model training apparatus 455 provided in accordance with an embodiment of the present application, in some embodiments, as shown in FIG. 2, the software modules stored in the model training apparatus 455 of the memory 450 may include: the labeling module is used for carrying out first mapping processing on character voice characteristics of the first characters aiming at each first character of the first text to obtain first character emotion intensity labels of the first characters; the composition module is used for acquiring a first emotion type tag used for representing the text emotion type of a first text and constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag; the first training module is used for carrying out first training processing on the initialized emotion recognition model based on multiple tasks through the first training sample to obtain the first emotion recognition model, wherein the multiple tasks comprise an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task; the second training module is used for acquiring a second training sample with a real label, and performing a second training process based on the multi-task on the first emotion recognition model through the second training sample to obtain a second emotion recognition model; and the acquisition module is used for acquiring a target emotion recognition model for performing the multitasking based on the second emotion recognition model.
In some embodiments, the labeling module is further configured to: acquiring audio data, and clipping the audio data to obtain at least one single sentence of audio; performing voice recognition processing on each single sentence audio to obtain a first text corresponding to each single sentence audio, a starting time point of each first character in the first text and an ending time point of each first character in the first text; and aiming at the first text of each single-sentence audio, acquiring character voice characteristics of each first character from the single-sentence audio corresponding to the first text based on a starting time point and an ending time point of each first character of the first text.
In some embodiments, the labeling module is further configured to: the following processing is performed for each of the first characters: acquiring at least one audio frame corresponding to the first character from the single sentence audio based on a starting time point and an ending time point of the first character; acquiring audio energy of each audio frame corresponding to the first character; averaging the audio energy of the at least one audio frame to obtain average audio energy corresponding to the first character; and taking the average audio energy corresponding to the first character as character voice characteristics of the first character.
In some embodiments, the labeling module is further configured to: acquiring audio data, and clipping the audio data to obtain at least one single sentence of audio; performing voice recognition processing on each single sentence of audio to obtain a first text corresponding to each single sentence of audio; the following processing is performed for each single sentence audio: acquiring the audio energy of each audio frame in the single sentence audio; taking the average value of the audio energy of all audio frames in the single sentence of audio as the voice characteristic of the single sentence of audio; and taking the voice characteristic of each single sentence of audio as the text voice characteristic of the corresponding first text, and performing second mapping processing on the text voice characteristic to obtain a first text emotion intensity label of the first text.
In some embodiments, the first training module is further configured to: forward propagate the first text in the initialized emotion recognition model to obtain a first prediction probability corresponding to the first emotion type label and a second prediction value characterizing the character emotion intensity type of each first character of the first text; determine a first multitasking loss based on the first prediction probability and the second prediction value; and update parameters of the initialized emotion recognition model based on the first multitasking loss to obtain the first emotion recognition model.
In some embodiments, initializing the emotion recognition model includes initializing a feature sharing network and initializing a predictive network for each task, the first training module further to: performing feature extraction processing on the first text through the initialized feature sharing network to obtain a first feature code of a front position of the first text and a first feature code of each character position of the first text; carrying out emotion type prediction processing on a first feature code of a prepositive position of the first text through an initialization prediction network corresponding to the emotion classification task to obtain a first prediction probability corresponding to the first emotion type label; and carrying out character emotion intensity prediction processing on the first feature codes of each character position of the first text through an initialization prediction network corresponding to the character emotion intensity classification task, and obtaining a second predicted value of each first character of the first text, wherein the second predicted value characterizes the character emotion intensity type.
In some embodiments, the second training module is further to: acquiring a second emotion type tag of a second text and a second character emotion intensity tag of each second character of the second text as the real tag; the second emotion type tag and the second character emotion intensity tag are obtained through true labeling; and forming the second text, the second emotion type label and the second character emotion intensity label into the second training sample.
In some embodiments, the acquisition module is further to: performing prediction processing based on the multi-task on the first text through the second emotion recognition model; updating the first character emotion intensity tag based on the prediction result to obtain a third character emotion intensity tag of the first text, and forming a third training sample by the first text, the first emotion type tag and the third character emotion intensity tag; and performing third training processing on the second emotion recognition model based on the multi-task through the third training sample to obtain a target emotion recognition model for executing the multi-task.
In some embodiments, the second emotion recognition model includes a second feature sharing network and a second prediction network corresponding to each task, the obtaining module further configured to: performing feature extraction processing on the first text through the second feature sharing network to obtain a second feature code of a front position of the first text and a second feature code of each character position of the first text; carrying out emotion type prediction processing on the second feature codes of the prepositions of the first text through a second prediction network corresponding to the emotion classification task to obtain a third prediction probability corresponding to each candidate emotion type label; carrying out character emotion intensity prediction processing on the second feature codes of each character position of the first text through a second prediction network corresponding to the character emotion intensity classification task to obtain a fourth prediction value of the character emotion intensity type of each first character of the first text; and forming a third prediction probability corresponding to each candidate emotion type label and a fourth prediction value representing the emotion intensity type of each first character of the first text into a prediction result.
In some embodiments, the acquisition module is further to: the following processing is performed for each of the first characters: carrying out average processing on the fourth predicted value and the label value corresponding to the first character emotion intensity type label to obtain a third average label value; and acquiring a third character emotion intensity type label corresponding to the third average label value to replace the first character emotion intensity type label.
In some embodiments, the acquisition module is further to: performing fourth training processing on the second emotion recognition model based on the multi-task through the third training sample to obtain a third emotion recognition model; and performing fifth training processing on the fourth emotion recognition model based on the multi-task through the second training sample to obtain the target emotion recognition model.
In some embodiments, when the first training sample further includes a first text emotion intensity tag of the first text, the multitasking further includes a text emotion intensity classification task, and the first training module is further configured to: forward propagate the first text in the initialized emotion recognition model to obtain a fifth prediction probability corresponding to the first emotion type label, a sixth prediction value characterizing the text emotion intensity type of the first text, and a seventh prediction value characterizing the character emotion intensity type of each first character of the first text; determine a second multitasking loss based on the fifth prediction probability, the sixth prediction value, and the seventh prediction value; and update parameters of the initialized emotion recognition model based on the second multitasking loss to obtain the first emotion recognition model.
Continuing with the description below of exemplary structures implemented as software modules for an artificial intelligence based text processing device provided by embodiments of the present application, in some embodiments, software modules stored in a memory of a text processing device may include: the text module is used for acquiring a text to be identified of the virtual object; the recognition module is used for carrying out emotion recognition processing on the text to be recognized through the target emotion recognition model to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized; the target emotion recognition model is obtained by executing the model training method provided by the embodiment of the application.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the model training method and the text processing method based on artificial intelligence according to the embodiments of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the artificial intelligence-based model training method and the text processing method provided by the embodiments of the present application.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiment of the application, the emotion type label of the text, the emotion intensity label of the text level and the emotion intensity label of the character level can be generated by utilizing the voice characteristics, which is equivalent to acquiring the emotion labels of fine granularity, so that the model can learn emotion information with more different granularities, the emotion recognition capability of the model is improved, and the labels obtained by combining the voice characteristics are fully utilized at different stages in the whole learning process, so that the model has stronger interpretability.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (18)

1. A model training method based on artificial intelligence, the method comprising:
performing first mapping processing on character voice features of first characters aiming at each first character of a first text to obtain first character emotion intensity labels of the first characters;
Acquiring a first emotion type tag used for representing the text emotion type of the first text, and constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag;
performing first training processing based on multiple tasks on the initialized emotion recognition model through the first training sample to obtain a first emotion recognition model, wherein the multiple tasks comprise an emotion classification task and a character emotion intensity classification task;
acquiring a second training sample with a real label, and performing second training processing on the first emotion recognition model based on the multi-task through the second training sample to obtain a second emotion recognition model;
a target emotion recognition model for performing the multitasking is obtained based on the second emotion recognition model.
2. The method according to claim 1, wherein the method further comprises:
acquiring audio data, and clipping the audio data to obtain at least one single sentence of audio;
performing voice recognition processing on each single sentence audio to obtain a first text corresponding to each single sentence audio, a starting time point of each first character in the first text and an ending time point of each first character in the first text;
And aiming at the first text of each single-sentence audio, acquiring character voice characteristics of each first character from the single-sentence audio corresponding to the first text based on a starting time point and an ending time point of each first character of the first text.
3. The method according to claim 2, wherein the obtaining, based on the start time point and the end time point of each first character of the first text, the character phonetic feature of each first character from the single sentence audio corresponding to the first text includes:
the following processing is performed for each of the first characters:
acquiring at least one audio frame corresponding to the first character from the single sentence audio based on a starting time point and an ending time point of the first character;
acquiring audio energy of each audio frame corresponding to the first character;
averaging the audio energy of the at least one audio frame to obtain average audio energy corresponding to the first character;
and taking the average audio energy corresponding to the first character as character voice characteristics of the first character.
4. The method of claim 1, wherein the performing, by the first training sample, a first training process based on multitasking on the initialized emotion recognition model to obtain a first emotion recognition model includes:
Forward transmitting the first text in the initialized emotion recognition model to obtain a first prediction probability corresponding to the first emotion type label and a second prediction value representing the emotion intensity type of each first character of the first text;
determining a first multitasking loss based on the first prediction probability and the second prediction value;
and updating parameters of the initialized emotion recognition model based on the first multitasking loss to obtain the first emotion recognition model.
5. The method of claim 4, wherein the initializing the emotion recognition model includes initializing a feature sharing network and initializing a predictive network for each of the tasks, wherein the forward propagating the first text in the initializing emotion recognition model results in a first predictive probability for the first emotion type tag and a second predictive value for each first character of the first text characterizing a character emotion intensity type, comprising:
performing feature extraction processing on the first text through the initialized feature sharing network to obtain a first feature code of a front position of the first text and a first feature code of each character position of the first text;
carrying out emotion type prediction processing on the first feature code of the front position of the first text through the initialized prediction network corresponding to the emotion classification task to obtain the first prediction probability corresponding to the first emotion type label;
and carrying out character emotion intensity prediction processing on the first feature code of each character position of the first text through the initialized prediction network corresponding to the character emotion intensity classification task to obtain the second prediction value characterizing the character emotion intensity type of each first character of the first text.
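An illustrative PyTorch sketch of the architecture described in claim 5: a shared feature-extraction network whose front-position (CLS-style) code feeds the emotion classification head and whose character-position codes feed the character emotion intensity head. The Transformer encoder, hidden size and class counts are assumptions, not details fixed by the claim.

```python
# Illustrative PyTorch sketch of claim 5's shared encoder plus per-task heads.
# Encoder type, hidden size and class counts are assumptions.
import torch
import torch.nn as nn

class MultiTaskEmotionModel(nn.Module):
    def __init__(self, vocab_size=8000, hidden=256, num_emotions=8, num_intensities=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)   # feature sharing network
        self.emotion_head = nn.Linear(hidden, num_emotions)                # emotion classification task
        self.intensity_head = nn.Linear(hidden, num_intensities)           # character intensity task

    def forward(self, token_ids: torch.Tensor):
        # token_ids: [batch, 1 + num_chars]; position 0 is a [CLS]-style front token
        h = self.shared_encoder(self.embed(token_ids))
        front_code = h[:, 0]       # feature code of the front position
        char_codes = h[:, 1:]      # feature code of each character position
        return self.emotion_head(front_code), self.intensity_head(char_codes)
```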
6. The method of claim 1, wherein the acquiring a second training sample with a real label comprises:
acquiring a second emotion type tag of a second text and a second character emotion intensity tag of each second character of the second text as the real tag;
wherein the second emotion type tag and the second character emotion intensity tag are obtained through real annotation;
and forming the second text, the second emotion type label and the second character emotion intensity label into the second training sample.
7. The method of claim 1, wherein the obtaining the target emotion recognition model based on the second emotion recognition model includes:
performing prediction processing based on the multitasking on the first text through the second emotion recognition model to obtain a prediction result;
updating the first character emotion intensity tag based on the prediction result to obtain a third character emotion intensity tag of the first text, and forming the first text, the first emotion type tag and the third character emotion intensity tag into a third training sample;
and performing third training processing based on the multitasking on the second emotion recognition model through the third training sample to obtain the target emotion recognition model for performing the multitasking.
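By way of illustration of the self-training round in claim 7: predict on the weakly labelled first texts, refresh the character intensity labels, and assemble the third training samples. `predict_multitask` and `update_intensity_label` are hypothetical helpers (the latter follows the claim 9 update rule, sketched further below); the sample dictionary layout matches the earlier sketch after claim 1.

```python
# Illustrative self-training round for claim 7. `predict_multitask` and
# `update_intensity_label` are hypothetical helpers; the latter follows the
# claim 9 update rule sketched further below.

def build_third_samples(model, first_samples):
    third_samples = []
    for sample in first_samples:
        # prediction processing based on the multitasking (claim 8 gives the details)
        _, char_intensity_preds = predict_multitask(model, sample["text"])
        refreshed = [update_intensity_label(pred, label)
                     for pred, label in zip(char_intensity_preds,
                                            sample["char_intensity_labels"])]
        third_samples.append({
            "text": sample["text"],
            "emotion_type_label": sample["emotion_type_label"],
            "char_intensity_labels": refreshed,  # third character emotion intensity labels
        })
    return third_samples
```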
8. The method of claim 7, wherein the second emotion recognition model includes a second feature sharing network and a second prediction network corresponding to each task, and wherein the performing the multitasking-based prediction processing on the first text by the second emotion recognition model to obtain a prediction result includes:
performing feature extraction processing on the first text through the second feature sharing network to obtain a second feature code of a front position of the first text and a second feature code of each character position of the first text;
carrying out emotion type prediction processing on the second feature code of the front position of the first text through the second prediction network corresponding to the emotion classification task to obtain a third prediction probability corresponding to each candidate emotion type label;
carrying out character emotion intensity prediction processing on the second feature code of each character position of the first text through the second prediction network corresponding to the character emotion intensity classification task to obtain a fourth prediction value characterizing the character emotion intensity type of each first character of the first text;
and forming the third prediction probability corresponding to each candidate emotion type label and the fourth prediction value characterizing the character emotion intensity type of each first character of the first text into the prediction result.
9. The method of claim 8, wherein the updating the first character emotion intensity tag based on the prediction result to obtain a third character emotion intensity tag of the first text comprises:
the following processing is performed for each of the first characters:
carrying out average processing on the fourth prediction value and the tag value corresponding to the first character emotion intensity tag to obtain a third average tag value;
and acquiring a third character emotion intensity tag corresponding to the third average tag value to replace the first character emotion intensity tag.
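A minimal sketch of the claim 9 label update: average the predicted intensity value with the current tag value and map the average back to a discrete intensity tag. Rounding to the nearest tag is an illustrative reading of "acquiring a third character emotion intensity tag corresponding to the third average tag value".

```python
# Minimal sketch of the claim 9 label update; rounding the average back to the
# nearest discrete tag is an illustrative reading of the claim.

def update_intensity_label(predicted_value: float, current_label: int) -> int:
    third_average_value = (predicted_value + current_label) / 2.0
    return int(round(third_average_value))  # third character emotion intensity tag
```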
10. The method of claim 7, wherein the performing, through the third training sample, the third training processing based on the multitasking on the second emotion recognition model to obtain the target emotion recognition model for performing the multitasking includes:
performing fourth training processing on the second emotion recognition model based on the multi-task through the third training sample to obtain a third emotion recognition model;
and performing fifth training processing based on the multitasking on the third emotion recognition model through the second training sample to obtain the target emotion recognition model.
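An illustrative schedule for claim 10: continue multitask training on the refreshed third training samples (the fourth training processing), then fine-tune on the manually labelled second training samples (the fifth training processing). `train_epoch` is a hypothetical helper standing in for one pass of the multitask training loop, and the one-epoch-per-stage split is an assumption.

```python
# Illustrative training schedule for claim 10. `train_epoch` is a hypothetical
# helper for one multitask training pass; epochs-per-stage is an assumption.

def obtain_target_model(model, third_samples, second_samples, epochs=1):
    for _ in range(epochs):
        train_epoch(model, third_samples)   # fourth training processing -> third emotion recognition model
    for _ in range(epochs):
        train_epoch(model, second_samples)  # fifth training processing -> target emotion recognition model
    return model
```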
11. The method of claim 1, wherein when the first training sample further includes a first text emotion intensity tag of the first text, the multitasking further includes a text emotion intensity classification task, and the performing the first training processing based on the multitasking on the initialized emotion recognition model through the first training sample includes:
forward propagating the first text in the initialized emotion recognition model to obtain a fifth prediction probability corresponding to the first emotion type label, a sixth prediction value characterizing the text emotion intensity type of the first text, and a seventh prediction value characterizing the character emotion intensity type of each first character of the first text;
determining a second multitasking loss based on the fifth prediction probability, the sixth prediction value, and the seventh prediction value;
and updating parameters of the initialized emotion recognition model based on the second multitasking loss to obtain the first emotion recognition model.
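A hedged sketch of the second multitasking loss in claim 11, with a text-level emotion intensity term added alongside the emotion-type and character-intensity terms; cross-entropy and equal weighting of the three terms are assumptions.

```python
# Hedged sketch of the second multitasking loss in claim 11 (cross-entropy and
# equal weighting of the three task terms are assumptions).
import torch.nn.functional as F

def second_multitask_loss(sent_logits, sent_label,
                          text_intensity_logits, text_intensity_label,
                          char_logits, char_labels):
    loss_emotion_type = F.cross_entropy(sent_logits.unsqueeze(0), sent_label.unsqueeze(0))
    loss_text_intensity = F.cross_entropy(text_intensity_logits.unsqueeze(0),
                                          text_intensity_label.unsqueeze(0))
    loss_char_intensity = F.cross_entropy(char_logits, char_labels)
    return loss_emotion_type + loss_text_intensity + loss_char_intensity
```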
12. A method of text processing based on artificial intelligence, the method comprising:
acquiring a text to be recognized;
carrying out emotion recognition processing on the text to be recognized through a target emotion recognition model to obtain a predicted emotion type of the text to be recognized and a predicted character emotion intensity of each character to be recognized in the text to be recognized;
wherein the target emotion recognition model is obtained by performing the model training method of any one of claims 1 to 11.
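An illustrative inference call for claim 12, assuming a multitask model with the two-output interface of the earlier architecture sketch; the tokenisation into a front token plus character tokens and the `emotion_names` list are assumptions.

```python
# Illustrative inference for claim 12, assuming a model with the two-output
# interface of the earlier architecture sketch; `token_ids` (front token +
# character tokens) and `emotion_names` are assumptions.
import torch

def recognise(model, token_ids: torch.Tensor, emotion_names):
    model.eval()
    with torch.no_grad():
        sent_logits, char_logits = model(token_ids)          # [1, E], [1, N, I]
    predicted_emotion = emotion_names[int(sent_logits.argmax(dim=-1))]
    predicted_char_intensity = char_logits.argmax(dim=-1).squeeze(0).tolist()
    return predicted_emotion, predicted_char_intensity
```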
13. The method of claim 12, wherein the text to be recognized corresponds to a virtual object, the method further comprising:
and performing display driving processing on the virtual object based on the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized, to obtain a display effect when the virtual object broadcasts the text to be recognized.
14. An artificial intelligence based model training apparatus, the apparatus comprising:
the labeling module is used for performing, for each first character of the first text, first mapping processing on the character voice feature of the first character to obtain a first character emotion intensity label of the first character;
the composition module is used for acquiring a first emotion type tag used for representing the text emotion type of the first text and constructing a first training sample based on the first text, the first emotion type tag of the first text and the first character emotion intensity tag;
the first training module is used for carrying out first training processing on the initialized emotion recognition model based on multiple tasks through the first training sample to obtain the first emotion recognition model, wherein the multiple tasks comprise an emotion classification task, a text emotion intensity classification task and a character emotion intensity classification task;
the second training module is used for acquiring a second training sample with a real label, and performing a second training process based on the multi-task on the first emotion recognition model through the second training sample to obtain a second emotion recognition model;
and the acquisition module is used for acquiring a target emotion recognition model for performing the multitasking based on the second emotion recognition model.
15. An artificial intelligence based text processing apparatus, the apparatus comprising:
the text module is used for acquiring a text to be recognized of the virtual object;
the recognition module is used for carrying out emotion recognition processing on the text to be recognized through a target emotion recognition model to obtain the predicted emotion type of the text to be recognized and the predicted character emotion intensity of each character to be recognized in the text to be recognized;
wherein the target emotion recognition model is obtained by performing the model training method of any one of claims 1 to 11.
16. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based model training method of any one of claims 1 to 11 or the artificial intelligence based text processing method of any one of claims 12 to 13 when executing executable instructions stored in said memory.
17. A computer readable storage medium storing executable instructions which when executed by a processor implement the artificial intelligence based model training method of any one of claims 1 to 11 or the artificial intelligence based text processing method of any one of claims 12 to 13.
18. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the artificial intelligence based model training method of any one of claims 1 to 11 or the artificial intelligence based text processing method of any one of claims 12 to 13.
CN202310153224.5A, filed 2023-02-08 (priority 2023-02-08): Model training method, device, equipment, storage medium and program product. Status: Pending. Published as CN117216532A (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310153224.5A | 2023-02-08 | 2023-02-08 | Model training method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310153224.5A | 2023-02-08 | 2023-02-08 | Model training method, device, equipment, storage medium and program product

Publications (1)

Publication Number | Publication Date
CN117216532A | 2023-12-12

Family

ID=89044916

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310153224.5A (Pending, published as CN117216532A) | Model training method, device, equipment, storage medium and program product | 2023-02-08 | 2023-02-08

Country Status (1)

Country | Link
CN | CN117216532A (en)

Legal Events

PB01: Publication