WO2024037196A1 - Communication method and device - Google Patents

Communication method and device

Info

Publication number
WO2024037196A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
emotion
expression
computer
code
Prior art date
Application number
PCT/CN2023/103053
Other languages
English (en)
French (fr)
Inventor
邢诗萍
俞雨
邵凯
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2024037196A1 publication Critical patent/WO2024037196A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • Embodiments of the present application relate to the field of communications, and in particular, to an expression generation method and device.
  • Virtual digital humans are widely used in many fields such as entertainment, education, services, and sales. Digital humans are currently in a period of rapid development, and their visual appearance ranges from cartoon-like to hyper-realistic. However, in the field of digital human driving, current driving strategies still rely heavily on manual effort or large amounts of data, such as the speech/text-driven digital human techniques that are popular in academia. As digital human applications broaden, the emotional companionship provided by digital humans has become a very important part, and there is now an urgent need for digital humans to move from simple emotional expression to full-dimensional, natural expression.
  • At present, the most common approach is to use patterns designed by animators and produce expressions according to fixed rules. In other words, the entire process of displaying a digital human's expressions is prepared in advance, resulting in monotonous expressions.
  • This application provides an expression generation method and device, which can increase the complexity of digital human emotion display.
  • A first aspect of the present application provides an expression generation method, which includes: obtaining a first emotion code; matching a first text according to the first emotion code; inputting the first text into an inference neural network to generate a second text; matching a second emotion code according to the second text; determining a corresponding first expression according to the second emotion code; and displaying the first expression.
  • the execution subject of this application is the terminal device.
  • After obtaining the first emotion code, the terminal device can match the first text, generate the second text through the inference neural network, determine the second emotion code based on the correspondence between texts and emotion codes, determine the first expression based on the correspondence between emotions and expressions, and then display the first expression.
  • The expression displayed by the terminal device is inferred and varied by the inference neural network rather than fixed in advance, which can increase the complexity of the digital human's emotion display.
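  • As a rough illustration of this flow, the following Python sketch shows one possible way to wire the pipeline together. The function names (match_text, inference_network, match_emotion_code, expression_for, display) are hypothetical placeholders, not interfaces defined by this application.

```python
import numpy as np

def generate_and_show_expression(first_emotion_code: np.ndarray,
                                 match_text,          # emotion code -> text (lookup, assumed)
                                 inference_network,   # text -> next text (trained model, assumed)
                                 match_emotion_code,  # text -> emotion code (lookup, assumed)
                                 expression_for,      # emotion code -> expression coefficients (assumed)
                                 display):            # renders the expression (assumed)
    """Minimal sketch of the first-aspect method: code -> text -> next text -> code -> expression."""
    first_text = match_text(first_emotion_code)       # match a first text to the first emotion code
    second_text = inference_network(first_text)       # inference neural network generates the second text
    second_code = match_emotion_code(second_text)     # match the second emotion code to the second text
    first_expression = expression_for(second_code)    # determine the corresponding first expression
    display(first_expression)                         # display the first expression
    return second_text, second_code
```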
  • the first emotion code is randomly generated.
  • the terminal device can randomly generate an initial first emotion code locally, thereby improving the flexibility of the solution.
  • obtaining the first emotion code includes: receiving a message from the user; and determining the first emotion code according to the message.
  • the first emotion code can also be generated by a message input by the user, which improves the flexibility of the solution.
  • the method further includes: receiving voice data from the user, where the voice data is used to request text corresponding to the first expression; and displaying the second text according to the voice data.
  • the user can also view the text corresponding to the current expression through voice, thereby improving the interactive effect of human-computer interaction.
  • The method further includes: inputting the second text into the inference neural network to generate a third text; matching a third emotion code according to the third text; determining a second expression according to the third emotion code; and displaying the second expression.
  • the expression displayed by the terminal device will continuously change, improving the user's viewing experience.
  • the method further includes: displaying the emotion label corresponding to the first expression.
  • the user can directly determine the emotion of the current expression through the emotion tag, thereby improving the user experience.
  • The inference neural network is generated by training on sample texts and the emotion labels corresponding to the sample texts.
  • This improves the accuracy of the inference neural network.
  • the second aspect of this application provides an expression generation device that can implement the method in the above first aspect or any possible implementation of the first aspect.
  • the device includes corresponding units or modules for performing the above method.
  • the units or modules included in the device can be implemented by software and/or hardware.
  • the device may be, for example, a network device, a chip, a chip system, a processor, etc. that supports the network device to implement the above method, or a logic module or software that can realize all or part of the network device functions.
  • A third aspect of the present application provides a computer device, including a processor coupled to a memory, where the memory is used to store instructions. When the instructions are executed by the processor, the computer device implements the method in the first aspect or any possible implementation of the first aspect.
  • The computer device may be, for example, a network device, or a chip or chip system that supports the network device in implementing the above method.
  • the fourth aspect of the present application provides a computer-readable storage medium.
  • The computer-readable storage medium stores instructions. When the instructions are executed by a processor, the method provided in the first aspect or any possible implementation of the first aspect is implemented.
  • the fifth aspect of the present application provides a computer program product.
  • the computer program product includes computer program code.
  • When the computer program code is executed on a computer, the method provided in the first aspect or any possible implementation of the first aspect is implemented.
  • Figure 1 is a schematic structural diagram of a virtual digital human interaction system provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of an expression generation method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of an expression generation process provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of an expression generating device provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Embodiments of the present application provide an expression generation method and device, which can increase the complexity of digital human emotion display.
  • The term "exemplary" used herein means "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferable or advantageous over other embodiments.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • The artificial intelligence framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) through to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, enables communication with the external world, and is supported through the basic platform.
  • Computing power is provided by smart chips (hardware acceleration chips such as the central processing unit (CPU), the neural-network processing unit (NPU), the graphics processing unit (GPU), the application-specific integrated circuit (ASIC), or the field-programmable gate array (FPGA)).
  • The basic platform includes the distributed computing framework, networks, and other related platform assurances and support, and can include cloud storage and computing, interconnection networks, and so on.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • After the data processing described above, some general capabilities can be formed based on the results, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Its application fields mainly include: intelligent terminals, intelligent transportation, Smart healthcare, autonomous driving, safe cities, etc.
  • The embodiments of this application involve applications related to neural networks and natural language processing (NLP).
  • A virtual digital human has the following three characteristics: first, it has a human appearance, with specific features such as looks, gender, and personality; second, it has human behavior, with the ability to express itself through language, facial expressions, and body movements; third, it has human thought, with the ability to recognize the external environment and to communicate and interact with people.
  • With advances in converging technologies such as computer graphics, deep learning, speech synthesis, and brain-inspired science, virtual digital humans are gradually evolving into a new species and a new medium, and more and more virtual digital humans are being designed, produced, and operated.
  • the action generation method provided by the embodiment of the present application can be executed on the server, and can also be executed on the terminal device based on artificial intelligence.
  • The terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smart watch, a wearable device (WD), a self-driving vehicle, or the like; the embodiments of this application do not limit this.
  • the above terminal device may be a device running various operating systems.
  • For example, the above terminal device may be a device running the Android system, a device running the iOS system, or a device running the Windows system.
  • Virtual digital humans are widely used in many fields such as entertainment, education, services, and sales. Digital humans are currently in a period of rapid development, and their visual appearance ranges from cartoon-like to hyper-realistic. However, in the field of digital human driving, current driving strategies still rely heavily on manual effort or large amounts of data, such as the speech/text-driven digital human techniques that are popular in academia. As digital human applications broaden, the emotional companionship provided by digital humans has become a very important part, and there is now an urgent need for digital humans to move from simple emotional expression to full-dimensional, natural expression.
  • At present, the most common approach is to use patterns designed by animators and produce expressions according to fixed rules. In other words, the entire process of displaying a digital human's expressions is prepared in advance, resulting in monotonous expressions.
  • embodiments of the present application provide an expression generation method, which is as follows.
  • Figure 2 is a schematic flow chart of an expression generation method provided by an embodiment of the present application.
  • the method includes:
  • Step 201 The terminal device obtains the first emotion code.
  • the terminal device can obtain the first emotion code locally, and the first emotion code serves as the initial emotion of the virtual digital human, That is, the expression corresponding to the first emotion code can be used as the initial expression of the virtual digital human.
  • the first emotion code may be a numerical vector (such as a 1x256 dimensional vector).
  • the locally stored first emotion code may be randomly generated by the terminal device, that is, the terminal device randomly selects an emotion as the emotion of the virtual digital human, and then determines the emotion code corresponding to the emotion.
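  • A trivial sketch of this random initialisation (picking an emotion at random and looking up its stored code, here a hypothetical table of 1x256 vectors) could look like the following:

```python
import random
import numpy as np

# Hypothetical local table mapping emotion names to stored 1x256 emotion codes.
emotion_code_table = {name: np.random.rand(256) for name in ["joy", "sadness", "anger", "calm"]}

def random_first_emotion_code():
    emotion = random.choice(list(emotion_code_table))   # randomly select an emotion
    return emotion, emotion_code_table[emotion]          # its code becomes the initial (first) emotion code
```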
  • the locally stored first emotion code can also be determined by the user's input.
  • For example, the terminal device can receive a message from the user and then match the first emotion code according to the message, where the message can be the user's voice input or text input; the embodiments of this application do not limit this.
  • Step 202 The terminal device matches the first text according to the first emotion code.
  • In this embodiment of the application, each emotion code corresponds to one expression, and each emotion code corresponds to one or more texts whose content is related to that expression; when the terminal device obtains the first emotion code, it can determine the first text according to this association.
  • Step 203 The terminal device inputs the first text into the inference neural network to generate the second text.
  • each expression has a corresponding emotion label indicating the name of the expression, for example, a smiling face corresponds to joy, a crying face corresponds to pain, etc.
  • The first text accordingly corresponds to an emotion label. If the first text is the initial text and does not reflect any emotion, the emotion label can be randomly assigned.
  • the terminal device can input the first text and the corresponding emotion label into the inference neural network, and the inference neural network can decode the words, phrases or short sentences of the next text (second text) and the corresponding emotion label.
  • The inference neural network is generated by training on sample texts and the emotion labels corresponding to the sample texts.
  • Specifically, when training the inference neural network, the database building module contains a large number of collected psychological-activity texts (diaries, narrative texts, and so on) together with aligned labels entered manually or produced by text-understanding algorithms. This text information is digitally encoded, some conventional network preprocessing operations are performed, and the result is combined with the emotion labels to obtain the text training data. With a general neural network structure, the text at the next moment can then be inferred from the text obtained in the previous period. The results output by the inference neural network are new permutations and combinations based on sentence structure; they are newly generated content, not copies of the texts in the database.
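  • A minimal sketch of how such a next-segment predictor might be trained is shown below; it assumes a generic encoder-decoder text model with placeholder tokenization and sizes, since the application does not fix a particular network structure.

```python
import torch
import torch.nn as nn

class NextTextModel(nn.Module):
    """Generic sketch: predict the next text segment from the previous one plus its emotion label."""
    def __init__(self, vocab_size=5000, num_emotions=8, dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.emo_emb = nn.Embedding(num_emotions, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, prev_tokens, emotion_label, next_tokens):
        _, h = self.encoder(self.tok_emb(prev_tokens))           # encode the previous text
        h = h + self.emo_emb(emotion_label).unsqueeze(0)         # condition on the emotion label
        dec_out, _ = self.decoder(self.tok_emb(next_tokens), h)  # teacher-forced decoding
        return self.out(dec_out)                                 # logits over the vocabulary

# Hypothetical training step over (previous text, emotion label, next text) triples.
model = NextTextModel()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
prev = torch.randint(0, 5000, (4, 20))   # stand-in batch of encoded diary sentences
emo = torch.randint(0, 8, (4,))          # stand-in emotion labels
nxt = torch.randint(0, 5000, (4, 20))    # stand-in "next moment" sentences
logits = model(prev, emo, nxt[:, :-1])
loss = loss_fn(logits.reshape(-1, 5000), nxt[:, 1:].reshape(-1))
loss.backward()
optim.step()
```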
  • Step 204 The terminal device matches the second emotion code according to the second text.
  • the terminal device can determine the second emotion code that is closest to the second text based on the association between the text and the emotion code.
  • The association between texts and emotion codes can be stored locally or obtained online; this is not limited here.
  • Text information can be text-encoded; commonly used encoding forms include character encoding, the term frequency-inverse document frequency (TF-IDF) index, and so on.
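  • For example, a TF-IDF text encoding of the kind mentioned above can be produced with a standard library; the two-sentence corpus below is a hypothetical stand-in for the texts in the database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I passed my driving-theory test today and feel happy",   # stand-in diary sentences
    "It rained all day and I stayed home feeling a bit low",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # one TF-IDF vector per text
print(tfidf_matrix.shape)                         # (number of texts, vocabulary size)
```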
  • Voice information can also be cross-modal encoded.
  • Common encoding forms include the Mel spectrum, Mel-frequency cepstral coefficients (MFCC), and other data formats.
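  • Likewise, a speech clip can be encoded as MFCC features with an off-the-shelf audio library; the file path below is a placeholder.

```python
import librosa

# "speech.wav" is a hypothetical recording of the digital human's voice line.
waveform, sample_rate = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)  # shape: (13, frames)
```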
  • Expressions can be encoded using expression bases and their coefficients; if an expression is represented by a mesh, it can be encoded by the mesh. The values of these codes all differ from one another.
  • the cross-modal retriever can convert these numerically different codes into the same numerical expression, which is a common implementation of cross-modal retrieval algorithms.
  • For example, the text encoding is an encoding of Chinese characters such as [00, 12, 3, 55, ...], and the trained network outputs a 1x218 vector. After learning, the audio encoding that best matches the audio corresponding to this text should be infinitely close to this 1x218 vector, and the same applies to expressions and emotions. In this way, this 1x218 vector can be used to retrieve data in the four modalities of expression, text, audio, and emotion.
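  • The retrieval step itself can then be a nearest-neighbour search in the shared code space. The sketch below assumes each modality's encoder already maps its input to a common fixed-length vector (such as the 1x218 vector in the example above) and simply ranks database entries by cosine similarity; the encoder outputs are stand-ins.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_code: np.ndarray, database_codes: list) -> int:
    """Return the index of the database entry whose shared-space code best matches the query."""
    scores = [cosine_similarity(query_code, code) for code in database_codes]
    return int(np.argmax(scores))

# Hypothetical usage: the text code and the expression codes all live in the same 218-dim space.
text_code = np.random.rand(218)                              # stand-in for text_encoder(second_text)
expression_codes = [np.random.rand(218) for _ in range(100)]  # stand-in expression database
best_match = retrieve(text_code, expression_codes)            # index of the best-matching expression
```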
  • The database building module in this embodiment of the application builds a certain amount of time-aligned text, voice, and expression data as training data.
  • The most commonly used method is to train an encoder and a decoder for each modality, where the decoder can restore the encoded content to the original content; the decoders are then crossed so that the content output by the encoder of another modality can also be decoded into the restored content.
  • For example, the expression encoding can also be decoded into a similar text, and this is used to form a loss that supervises the consistency of the coding space.
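  • One way to train such a shared code space is with per-modality encoder/decoder pairs and a cross-reconstruction loss, as in the sketch below; the two-layer MLPs, feature sizes, and extra alignment term are illustrative assumptions rather than the network actually used.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

# Encoders map each modality into the same 128-dim code space; decoders map back out.
text_enc, text_dec = mlp(300, 128), mlp(128, 300)   # 300-dim text features (assumed)
expr_enc, expr_dec = mlp(52, 128), mlp(128, 52)     # 52 expression-base coefficients (assumed)
mse = nn.MSELoss()

def cross_modal_loss(text_feat, expr_coef):
    zt, ze = text_enc(text_feat), expr_enc(expr_coef)
    self_rec = mse(text_dec(zt), text_feat) + mse(expr_dec(ze), expr_coef)   # restore own modality
    cross_rec = mse(text_dec(ze), text_feat) + mse(expr_dec(zt), expr_coef)  # decode the other modality's code
    align = mse(zt, ze)                                                      # keep aligned pairs close in code space
    return self_rec + cross_rec + align

params = (list(text_enc.parameters()) + list(text_dec.parameters()) +
          list(expr_enc.parameters()) + list(expr_dec.parameters()))
optim = torch.optim.Adam(params, lr=1e-3)
loss = cross_modal_loss(torch.randn(8, 300), torch.randn(8, 52))  # stand-in aligned batch
loss.backward()
optim.step()
```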
  • Step 205 The terminal device determines the corresponding first expression according to the second emotion encoding.
  • the terminal device can determine the first expression corresponding to the second emotion code according to the corresponding relationship of the emotion code.
  • The first expression can be obtained locally, where the locally stored expressions can include a variety of expression bases, each of which corresponds to a partial expression.
  • For example, the locally stored expressions may include multiple expression bases corresponding to common partial expressions (these expressions can cover the eyebrows, eyes, nose, mouth, chin, cheeks, and other parts of the human face).
  • The partial expressions may include some expressions commonly seen on human faces, such as blinking, opening the mouth, frowning, raising the eyebrows, and so on.
  • The above expressions can also include expressions obtained by subdividing some common facial expressions.
  • For example, the partial expressions can include expressions such as the inner left eyebrow moving upward, the lower eyelid of the right eye lifting, and the upper lip everting; this is not limited here.
  • the first expression can also be obtained by the terminal device through network matching, which is not limited here.
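  • Turning retrieved expression coefficients into an actual face is typically a blendshape-style combination of expression bases; the sketch below assumes a neutral mesh plus per-base vertex offsets, which matches the "expression base and coefficients" encoding described above only in spirit, with sizes chosen purely for illustration.

```python
import numpy as np

def apply_expression(neutral_vertices: np.ndarray,
                     basis_offsets: np.ndarray,
                     coefficients: np.ndarray) -> np.ndarray:
    """Blendshape-style combination: neutral face plus a weighted sum of expression bases.

    neutral_vertices: (V, 3) neutral face mesh
    basis_offsets:    (B, V, 3) per-base vertex offsets (e.g. blink, frown, raised brow)
    coefficients:     (B,) retrieved expression coefficients, typically in [0, 1]
    """
    return neutral_vertices + np.tensordot(coefficients, basis_offsets, axes=1)

# Hypothetical sizes: 5000 vertices, 52 partial-expression bases.
neutral = np.zeros((5000, 3))
bases = np.random.rand(52, 5000, 3) * 0.01
coeffs = np.zeros(52)
coeffs[3] = 0.8                                   # e.g. mostly a "smile" base active
face = apply_expression(neutral, bases, coeffs)   # (5000, 3) deformed mesh to render
```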
  • Step 206 The terminal device displays the first expression.
  • In this embodiment, after obtaining the first expression, the terminal device can display it on the display screen to convey, through the expression, the digital human's current emotion.
  • For example, when the first expression is a smiling face, it indicates that the digital human's current emotion is happiness; when the first expression is a crying face, it indicates that the current emotion is sadness.
  • the terminal device can also directly display the emotion, such as displaying an emotion label representing the emotion on the display screen.
  • The specific display location can be anywhere around the digital human; this embodiment of the application takes the case where the emotion label is displayed directly below the digital human as an example.
  • The digital human's expression will also change gradually over time. That is, after displaying the first expression, the terminal device can continue to input the second text into the inference neural network to infer the digital human's subsequent psychological activity, that is, to generate a third text.
  • It then determines the third emotion code corresponding to the third text according to the relationship between texts and emotion codes, determines the second expression corresponding to the third emotion code according to the correspondence between emotion codes and expressions, and displays the second expression on the display screen.
  • Correspondingly, the third text, and the texts obtained subsequently, are then input into the inference neural network to infer new expressions for display.
  • Specifically, the terminal device reads the expression state at the end of the previous run, or randomly generates a new expression state, as the initial value of the psychological-activity text (including content and emotion).
  • The text autoencoding network then begins to write new subjects, predicates, objects, adverbials, and so on, following the sentence structures of diaries and psychological monologues learned from a large amount of data, and continuously generates the corresponding emotions.
  • After the emotion label and text information enter the cross-modal retrieval network to generate a code, this code is used to retrieve the expression coefficients, and the expression coefficients are displayed as expressions on the display screen; this is the final idle (IDLE) expression.
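  • Put together, the IDLE behaviour described above is essentially a loop; the sketch below strings the earlier pieces together with hypothetical helper functions and a fixed frame interval.

```python
import time

def idle_expression_loop(initial_state, inference_network, cross_modal_encode,
                         retrieve_expression_coeffs, render, interval_s=2.0, steps=10):
    """Sketch of the IDLE cycle: keep inferring new psychological-activity text and show its expression."""
    text, emotion_label = initial_state          # last stored state, or a randomly generated one
    for _ in range(steps):
        text, emotion_label = inference_network(text, emotion_label)  # write the next inner monologue
        code = cross_modal_encode(text, emotion_label)                # map to the shared code space
        coeffs = retrieve_expression_coeffs(code)                     # look up expression coefficients
        render(coeffs)                                                # display the expression
        time.sleep(interval_s)                                        # let the expression linger
```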
  • the user can also view the text corresponding to the current expression. For example, if the current expression is a first expression, the user can view the second text corresponding to the first expression. Specifically, the user can ask the second text corresponding to the first expression through voice, such as "What is the digital person thinking" or "Why does the digital person make this expression", etc., which are not limited here.
  • After the terminal device receives the voice data, if it determines by parsing that the voice data requests the current text, it can directly display the current text, for example, the second text.
  • Figure 3 is a schematic diagram of an expression generation process provided by this embodiment of the present application.
  • As shown in Figure 3, for the first emotion code serving as the initial emotion, after the first emotion code is input into the inference neural network, the second text and its emotion label are generated. The second text and emotion label are then mapped by cross-modal retrieval to a code in the unified modal space (such as the aforementioned 1×128 vector), and the corresponding first expression is matched based on that code.
  • In this embodiment of the application, after the terminal device obtains the first emotion code, it can match the first text, generate the second text through the inference neural network, determine the second emotion code according to the correspondence between texts and emotion codes, determine the first expression according to the correspondence between emotions and expressions, and then display the first expression.
  • The expression displayed by the terminal device is inferred and varied by the inference neural network rather than fixed in advance, which can increase the complexity of the digital human's emotion display.
  • As shown in Figure 4, the terminal device includes an input module 401, a database building module 402, a psychological activity generation module 403, a digital human expression cross-modal retrieval module 404, and an output module 405.
  • The input module 401, database building module 402, psychological activity generation module 403, digital human expression cross-modal retrieval module 404, and output module 405 can also be implemented by a single module; this is not limited here.
  • Input module 401: this module can receive multiple input modes, such as emotion label input and voice/text/expression-base-coefficient input, and generate the first emotion code; it can also generate the first emotion code randomly without any input. Because the result is an IDLE expression, one can choose whether interaction data is needed. In addition, based on the characteristics of the subsequent modules, one can choose to input data such as emotion labels, voice, text data, and expression base coefficients (often the state stored from the previous interaction).
  • the input module 401 can perform steps 201 and 202 in the method embodiment of FIG. 2 .
  • Database building module 402: this part mainly generates data for the subsequent psychological activity generation module 403 and the expression cross-modal retrieval module 404.
  • The psychological activity generation module requires text training data and its corresponding emotion labels.
  • The cross-modal retrieval module requires time-aligned emotion labels, expression base coefficients, texts, and other data.
  • Psychological activity generation module 403: this module is responsible for continuously generating psychological-activity texts by autoencoding. The database module contains a large number of texts such as diaries and psychological monologues; natural language understanding (NLU) is used to analyze the sentence components of the texts and annotate them automatically, and generative networks that can produce new content, such as the variational autoencoder (VAE), are used to generate brand-new text content that conforms to the learned sentence structures. The psychological activity generation module 403 may perform step 203 in the method embodiment of Figure 2.
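  • A compressed sketch of the generative part (a VAE-style text generator that can sample new token sequences from a latent prior) is given below; the architecture and sizes are illustrative assumptions and much smaller than anything practical.

```python
import torch
import torch.nn as nn

class TinyTextVAE(nn.Module):
    """Toy VAE: encode a sentence to a latent code, then decode (or sample) new token sequences."""
    def __init__(self, vocab=5000, dim=128, latent=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.to_mu, self.to_logvar = nn.Linear(dim, latent), nn.Linear(dim, latent)
        self.from_z = nn.Linear(latent, dim)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):
        _, h = self.enc(self.emb(tokens))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        h0 = self.from_z(z).unsqueeze(0)
        logits = self.out(self.dec(self.emb(tokens), h0)[0])      # reconstruct the input sentence
        return logits, mu, logvar

    @torch.no_grad()
    def sample(self, length=12, bos=1):
        z = torch.randn(1, self.from_z.in_features)               # draw a brand-new latent "thought"
        h = self.from_z(z).unsqueeze(0)
        tok, out = torch.tensor([[bos]]), []
        for _ in range(length):
            o, h = self.dec(self.emb(tok), h)
            tok = self.out(o[:, -1]).argmax(-1, keepdim=True)     # greedy next token
            out.append(int(tok))
        return out   # token ids of a newly generated sentence
```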
  • Digital human expression cross-modal retrieval module 404: this module uses machine learning algorithms to learn aligned expression, text, speech, emotion, and other common digital-human-related modality codes in the same modal space. That is, data in different forms can obtain the same code representation, expressed as an n×m matrix or a 1×m vector. Although this representation comes from data in different forms, it expresses the same object, and it can be used to search the databases of the different modalities to obtain the best-matching results.
  • the digital human expression cross-modal retrieval module 404 can perform steps 204 and 205 in the method embodiment of Figure 2.
  • For the scenario in which the above functional modules are implemented by a single unit, refer to Figure 5; the device 50 includes:
  • a processing unit 501, configured to obtain the first emotion code, match the first text according to the first emotion code, input the first text into the inference neural network to generate the second text, match the second emotion code according to the second text, determine the corresponding first expression according to the second emotion code, and display the first expression.
  • the processing unit 501 is used to execute steps 201 to 206 in the method embodiment of FIG. 2 .
  • the first emotion code is randomly generated.
  • the device 50 also includes a transceiver unit 502, which is specifically configured to: receive messages from users;
  • the processing unit 501 is also configured to determine the first emotion code according to the message.
  • the transceiver unit 502 is also configured to receive the user's voice data, and the voice data is used to request the text corresponding to the first expression; the processing unit 501 is also configured to: display the second text according to the voice data.
  • the processing unit 501 is also configured to: input the second text into the inference neural network to generate a third text; match the third emotion code according to the third text; determine the second expression according to the third emotion code; display the second expression.
  • the processing unit 501 is also configured to display the emotion label corresponding to the first expression.
  • The inference neural network is generated by training on sample texts and the emotion labels corresponding to the sample texts.
  • Figure 6 shows a possible logical structure diagram of a computer device 60 provided for an embodiment of the present application.
  • Computer device 60 includes: processor 601, communication interface 602, storage system 603, and bus 604.
  • the processor 601, the communication interface 602 and the storage system 603 are connected to each other through a bus 604.
  • the processor 601 is used to control and manage the actions of the computer device 60.
  • the processor 601 is used to execute the steps performed by the terminal device in the method embodiment of FIG. 2.
  • the communication interface 602 is used to support the computer device 60 to communicate.
  • Storage system 603 is used to store program codes and data of the computer device 60 .
  • the processor 601 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure.
  • the processor 601 may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so on.
  • the bus 604 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc.
  • the transceiver unit 502 in the device 50 is equivalent to the communication interface 602 in the computer device 60
  • the processing unit 501 in the device 50 is equivalent to the processor 601 in the computer device 60 .
  • The computer device 60 of this embodiment may correspond to the terminal device in the method embodiment of Figure 2, and the communication interface 602 in the computer device 60 may implement the functions of, and/or the various steps performed by, that terminal device; for the sake of brevity, details are not repeated here.
  • each unit in the device can be a separate processing element, or it can be integrated and implemented in a certain chip of the device.
  • Alternatively, a unit can be stored in the memory in the form of a program, and a certain processing element of the device calls and executes the function of that unit.
  • all or part of these units can be integrated together or implemented independently.
  • the processing element described here can also be a processor, which can be an integrated circuit with signal processing capabilities.
  • each step of the above method or each unit above can be implemented by an integrated logic circuit of hardware in the processor element or implemented in the form of software calling through the processing element.
  • The units in any of the above devices may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASICs), one or more microprocessors (digital signal processors, DSPs), one or more field-programmable gate arrays (FPGAs), or a combination of at least two of these integrated circuit forms.
  • As another example, when a unit in the device is implemented in the form of a processing element scheduling a program, the processing element can be a general-purpose processor, such as a central processing unit (CPU), or another processor that can call programs.
  • these units can be integrated together and implemented in the form of a system-on-a-chip (SOC).
  • a computer-readable storage medium is also provided.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • When the processor of a device executes the computer-executable instructions, the device performs the method performed by the terminal device in the above method embodiments.
  • a computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium.
  • When the processor of a device executes the computer-executable instructions, the device performs the method performed by the terminal device in the above method embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application discloses an expression generation method and device. In the method, after obtaining a first emotion code, a terminal device can match a corresponding first text, generate a second text through an inference neural network, determine a second emotion code according to the correspondence between texts and emotion codes, determine a first expression according to the correspondence between emotions and expressions, and then display the first expression. The expression displayed by the terminal device is inferred and varied by the inference neural network rather than fixed in advance, which can increase the complexity of the digital human's emotion display.

Description

一种通信方法以及装置
本申请要求与2022年8月19日提交中国国家知识产权局,申请号为202210998977.1,发明名称为“一种表情生成方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及通信领域,尤其涉及一种表情生成方法以及装置。
背景技术
虚拟数字人被广泛应用于文娱、教育、服务、销售等诸多领域;在数字人目前处在高速发展时期,其数字人形象从卡通到超写实都有很好的“皮囊”。但数字人的驱动领域,目前的驱动策略仍然非常依赖人工或者大量数据,比如学术界比较热络的语音\文本驱动数字人技术。随着数字人应用增广,数字人的情感陪伴成为了很重要的一环,数字人目前也迫切从简单的情感表达,到全维度的自然表达。
当前最通常的就是使用动画师设计好的模式,按规则做表情。即数字人表情展示的整个过程也被提前预制好,导致数字人表情单一。
发明内容
本申请提供了一种表情生成方法以及装置,用于可以增加数字人情感展示的复杂度。
本申请第一方面提供了一种表情生成方法,该方法包括:获取第一情感编码;根据第一情感编码匹配第一文本;将第一文本输入推测神经网络,以生成第二文本;根据第二文本匹配第二情感编码;根据第二情感编码确定对应的第一表情;展示第一表情。
上述方面中,本申请的执行主体为终端设备,终端设备在获得第一情感编码后,可以匹配对的第一文本,然后通过推测神经网络生成第二文本,再根据文本和情感编码的对应关系确定第二情感,根据情感和表情的对应关系确定第一表情,再展示该第一表情,终端设备展示的表情由推测神经网络推测变化,不是固定设定,可以增加数字人情感展示的复杂度。
一种可能的实施方式中,第一情感编码为随机生成的。
上述可能的实施方式中,终端设备本地可以随机生成初始的第一情感编码,提高方案的灵活性。
一种可能的实施方式中,获取第一情感编码包括:接收用户的消息;根据消息确定第一情感编码。
上述可能的实施方式中,第一情感编码还可以是由用户输入的消息生成的,提高方案的灵活性。
一种可能的实施方式中,该方法还包括:接收用户的语音数据,语音数据用于请求第一表情对应的文本;根据语音数据展示第二文本。
上述可能的实施方式中,用户还可以通过语音查看当前表情对应的文本,提高人机交互的互动效果。
一种可能的实施方式中,该方法还包括:将第二文本输入到推测神经网络,以生成第三文本;根据第三文本匹配第三情感编码;根据第三情感编码确定第二表情;展示第二表情。
上述可能的实施方式中,终端设备展示的表情会连续变化,提高用户的观看体验。
一种可能的实施方式中,该方法还包括:展示第一表情对应的情感标签。
上述可能的实施方式中,用户可以直接通过情感标签确定当前表情的情感,提高用户体验。
一种可能的实施方式中,推测神经网络为根据样本文本以及样本文本对应的情感标签训练生成的。
上述可能的实施方式中,提高推测神经网络的准确度。
本申请第二方面提供了一种表情生成装置,可以实现上述第一方面或第一方面中任一种可能的实施方式中的方法。该装置包括用于执行上述方法的相应的单元或模块。该装置包括的单元或模块可以通过软件和/或硬件方式实现。该装置例如可以为网络设备,也可以为支持网络设备实现上述方法的芯片、芯片系统、或处理器等,还可以为能实现全部或部分网络设备功能的逻辑模块或软件。
本申请第三方面提供了一种计算机设备,包括:处理器,该处理器与存储器耦合,该存储器用于存储指令,当指令被处理器执行时,使得该计算机设备实现上述第一方面或第一方面中任一种可能的实施方式中的方法。该计算机设备例如可以为网络设备,也可以为支持网络设备实现上述方法的芯片或芯片 系统等。
本申请第四方面提供了一种计算机可读存储介质,该计算机可读存储介质中保存有指令,当该指令被处理器执行时,实现前述第一方面或第一方面任一种可能的实施方式提供的方法。
本申请第五方面提供了一种计算机程序产品,计算机程序产品中包括计算机程序代码,当该计算机程序代码在计算机上执行时,实现前述第一方面或第一方面任一种可能的实施方式提供的方法。
附图说明
图1为本申请实施例提供的一种虚拟数字人交互的系统结构示意图;
图2为本申请实施例提供的一种表情生成方法的流程示意图;
图3为本申请实施例提供的一种表情生成流程示意图;
图4为本申请实施例提供的一种终端设备的结构示意图;
图5为本申请实施例提供的一种表情生成装置的结构示意图;
图6为本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
本申请实施例提供了一种表情生成方法以及装置,用于可以增加数字人情感展示的复杂度。
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
另外,为了更好的说明本申请,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本申请同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本申请的主旨。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片,如中央处理器(central processing unit,CPU)、网络处理器(neural-network processing unit,NPU)、图形处理器(英语:graphics processing unit,GPU)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、平安城市等。
本申请实施例涉及了神经网络和数据转换(natural language processing,NLP)的相关应用,为了更好地理解本申请实施例的方案,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
虚拟数字人具备以下三方面特征:一是拥有人的外观,具有特定的相貌、性别和性格等人物特征;二是拥有人的行为,具有用语言、面部表情和肢体动作表达的能力;三是拥有人的思想,具有识别外界环境、并能与人交流互动的能力。随着计算机图形学、深度学习、语音合成、类脑科学等聚合科技的进步,虚拟数字人正逐步演进成为新物种、新媒介,越来越多的虚拟数字人正在被设计、制作和运营。
本申请实施例提供的动作生成方法可以在服务器上被执行,还可以在基于人工智能的终端设备上被执行。其中该终端设备可以是具有图像处理功能的移动电话、平板个人电脑(tablet personal computer,TPC)、媒体播放器、智能电视、笔记本电脑(laptop computer,LC)、个人数字助理(personal digital assistant,PDA)、个人计算机(personal computer,PC)、照相机、摄像机、智能手表、可穿戴式设备(wearable device,WD)或者自动驾驶的车辆等,本申请实施例对此不作限定。上述终端设备可以是运行各种操作系统的设备。例如,上述终端设备可以是运行安卓系统的设备,也可以是运行IOS系统的设备,也可以是运行windows系统的设备。
虚拟数字人被广泛应用于文娱、教育、服务、销售等诸多领域;在数字人目前处在高速发展时期,其数字人形象从卡通到超写实都有很好的“皮囊”。但数字人的驱动领域,目前的驱动策略仍然非常依赖人工或者大量数据,比如学术界比较热络的语音\文本驱动数字人技术。随着数字人应用增广,数字人的情感陪伴成为了很重要的一环,数字人目前也迫切从简单的情感表达,到全维度的自然表达。
当前最通常的就是使用动画师设计好的模式,按规则做表情。即数字人表情展示的整个过程也被提前预制好,导致数字人表情单一。
为解决上述问题,本申请实施例提供了一种表情生成方法,该方法如下所述。
请参阅图2,如图2所示为本申请实施例提供的一种表情生成方法的流程示意图,该方法包括:
步骤201.终端设备获取第一情感编码。
本实施例中,终端设备可以在本地获取第一情感编码,该第一情感编码作为虚拟数字人的初始情感, 即该第一情感编码对应的表情可以作为虚拟数字人的初始表情。该第一情感编码可以是一个数值向量(比如1x256维度向量)。
其中,本地存储的第一情感编码可以是由该终端设备随机生成的,即终端设备随机选择一种情感作为虚拟数字人的情感,然后确定该情感对应的情感编码。
本地存储的第一情感编码还可以是由用户的输入确定的,如该终端设备可以接收用户的消息,然后根据该消息匹配该第一情感编码,其中,该消息可以是用户的语音输入或者文本输入,本申请实施例对此不作限定。
步骤202.终端设备根据第一情感编码匹配第一文本。
本申请实施例中,每个情感编码对应一个表情,且每个情感编码对应一个或多个文本,该文本的内容与该表情相关,当终端设备获得第一情感编码时,即可根据关联关系确定该第一文本。
步骤203.终端设备将第一文本输入推测神经网络,以生成第二文本。
本实施例中,每个表情有对应的情感标签指示该表情的名称,例如笑脸对应喜悦,哭脸对应痛苦等,此处不作限定,则第一文本相应对应一个情感标签,若第一文本为初始文本,且该初始文本没有体现情感,则该情感标签可以是随机赋予的。终端设备可以将该第一文本和对应的情感标签输入到推测神经网络中,该推测神经网络可以解码出下一个文本(第二文本)的词或词组或短句以及对应的情感标签。
推测神经网络为根据样本文本以及样本文本对应的情感标签训练生成的。具体的,训练该推测神经网络时,数据库构建模块中有收集到的大量的心理活动的文本(日记,叙述文…)以及人工或文本理解算法打入的对齐的标签,将此文本信息进行数字编码后进行一些常规的网络预处理操作后,再结合情感标签得到文本训练数据。通过一般的神经网络结构既可根据上一时段得到的文字,推测出下一时刻的文字。其中,推测神经网络输出的结果是基于句子结构的新的排列组合,是生成的新的内容不是复刻数据库内的文本。
步骤204.终端设备根据第二文本匹配第二情感编码。
本实施例中,终端设备可以根据文本和情感编码的关联关系,确定该第二文本最接近的第二情感编码,其中,该文本和情感编码的关联关系可以是存储在本地的,也可以联网获取的,此处不作限定。
文本信息可以进行文本编码,常用的编码形式为字符编码,以及词频-逆文本频率指数(term frequency–inverse document frequency,TF-IDF)等等,语音信息也可以进行跨模态编码,常见的编码形式为梅尔普,梅尔频率倒谱系数(mel-frequency cepstral coefficient,MFCC)等数据格式,表情则可以利用表情基以及其系数进行编码,如果用网格表示可以用网格进行编码。这些编码的数值都是各不相同的。跨模态检索器可以将这些数值不同的编码转换成同一个数值表达,这是常见的跨模态检索算法的实现方式。
示例性的,文本编码为汉字的编码[00,12,3,55…]经过训练好的网络输出为一个1x218的向量,音频编码经过学习与此文本对应音频最像的编码也要无限接近于这个1x218这个向量,表情和情感同理。这样用这个1x218向量就可以检索到表情,文本,音频和情感四个模态的数据。
本申请实施例中的数据库构建模块是构建了一定量的文本、语音、表情时间对齐的模块作为训练数据。最常用的方式是所有模态训练一个编码器一个解码器,解码器可以将编码内容复原为原始内容;然后交叉各个解码器使另外一个模态的编码器输出的内容也可以解码出复原内容。比如表情编码也可以解出相似的文本,用此形成loss来监督编码空间的一致性。
步骤205.终端设备根据第二情感编码确定对应的第一表情。
本实施例中,终端设备可以根据情感编码的对应关系,确定第二情感编码对应的第一表情,该第一表情可以是从本地获取的,其中,本地存储的表情可以是包含多种表情基,每种表情基对应一种局部表情。例如,本地存储的表情可以包含常见的多种局部表情(该多种表情能够覆盖人脸的眉毛、眼睛、鼻子、嘴、下巴以及面颊等部位的表情)对应的多种表情基。该多种局部表情可以包含人脸常见的一些表情,例如,眨眼,张嘴巴,皱眉,抬眉等等。另外,上述表情还可以包括对人脸常见的一些表情进行细分之后得到的表情,例如,上述多种局部表情可以包括左眉内侧上移,右眼下眼睑提升以及上嘴唇外翻等表情,此处不作限定。该第一表情还可以是终端设备通过联网匹配获得的,此处不作限定。
步骤206.终端设备展示第一表情。
本实施例中,终端设备在获得第一表情后,即可在显示屏展示该第一表情,以告知该表情人当前的情感。示例性的,当该第一表情为笑脸时,表示该数字人当前的情感为开心,当该第一表情为哭脸时,表示该数字人当前的情感为伤心。在一个示例中,终端设备还可以直接显示该情感,如将代表该情感的情感标签展示在显示屏上,具体的展示位置可以在该数字人周围的任意位置,本申请实施例以情感标签在该数字人的正下方为例。
数字人的表情还会按照时间的流逝而逐渐变化,即终端设备在展示第一表情后,还可以继续将第二文本输入到推测神经网络中,以推测出数字人后续的心理活动,即生成第三文本,然后再根据文本与情感编码的关系确定第三文本对应的第三情感编码,根据情感编码与表情的对应关系,确定第三情感编码对应的第二表情,并在显示屏展示该第二表情,相应的,后续再将第三文本乃至后续获得的文本输入到推测神经网络推测新的表情进行展示。具体的,终端设备读取上一次末尾的表情状态,或随机生成新的表情状态作为心理活动文本初值(包括内容和情感)。文本自编码网络开始按大量数据学习到的日记和心理独白的句子结构进行编写新的主语、谓语、宾语、状语等,并连续生成其对应情感。情感标签和文本信息进入跨模态检索网络生成编码后,利用此编码检索表情系数,将表情系数在显示屏显示成表情,即为最终的闲置(IDLE)表情。在用户不与数字人直接交流的情况下,数字人的表情变化能使得用户观看的时候认为看出数字人在思考,甚至有可能数字人主动与用户进行反向交流,此处不作限定。
在一个示例中,用户还可以查看当前表情对应的文本,例如当前表情为第一表情,则用户可以查看该第一表情对应的第二文本。具体的,用户可以通过语音询问该第一表情对应的第二文本,例如“数字人在想什么”或“或数字人为什么做出这个表情”等,此处不作限定。终端设备接收到语音数据后,如果解析到该语音数据是请求当前文本,则可以直接展示当前文本,如展示第二文本。
具体的,本申请实施例生成表情的过程可以参照图3所示为本申请实施例提供的一种表情生成流程示意图,如图3所示,对于作为初始情感的第一情感编码,该第一情感编码输入到推测神经网络后生成第二文本和情感标签,该第二文本和情感标签通过跨模态检索获得统一模态空间的编码(如前述1×128向量),再根据该编码匹配对应的第一表情。
本申请实施例通过终端设备在获得第一情感编码后,可以匹配对的第一文本,然后通过推测神经网络生成第二文本,再根据文本和情感编码的对应关系确定第二情感,根据情感和表情的对应关系确定第一表情,再展示该第一表情,终端设备展示的表情由推测神经网络推测变化,不是固定设定,可以增加数字人情感展示的复杂度。
本申请实施例的终端设备的结构可以参照图4所示的一种终端设备的结构示意图,该终端设备包括:输入模块401、数据库建立模块402、心理活动生成模块403、数字人表情跨模态检索模块404和输出模块405,其中,输入模块401、数据库建立模块402、心理活动生成模块403、数字人表情跨模态检索模块404和输出模块405也可以由一个模块执行,此处不作限定。
输入模块401:可以接收情感标签输入、语音文本表情基系数输入等多种输入模式,生成第一情感编码,也可以选择无任何输入,随机生成第一情感编码,因为是IDLE表情,所以可以选择需要还是不需要交互数据,另外就是基于后续模块特点,可以选择输入情感标签,语音,文本数据和表情基系数等数据(往往是前序交互存储下来的状态)。输入模块401可以执行图2方法实施例中的步骤201和步骤202。
数据库建立模块402:该部分主要为了后续的心理活动生成模块403和表情跨模态检索模块404生成数据。其中心里活动生成模块需要有文本训练数据和其对应的情感标签。跨模态检索模块需要有时域对齐的情感标签,表情基系数,文本等数据。
心理活动生成模块403:该模块负责源源不断的自编码出心里活动文本。数据库模块中有大量的日记,心理独白等文本,利用NLU对文本进行句子成分分析,并自动标注后,利用自编码网络VAE等可以产生新的内容的生成式网络生成符合句子结构的崭新的文本内容。心理活动生成模块403可以执行图2方法实施例中的步骤203。
数字人表情跨模态检索模块404:该模块用机器学习算法将对齐的表情,文本,语音,情感编码等常见的数字人相关模态编码学习在同一模态下。即不同形式的数据可以获得同一个编码表达,表达形式为一个n×m的矩阵或1×m的向量。该数据形式虽然来自不同的形式的数据,但表达的是同一对象,可以用它检索不同模态的数据库得到最匹配的结果。数字人表情跨模态检索模块404可以执行图2方法实施例中的步骤204和步骤205。
输出模块405:该模块是指本发明系统的输出:符合心理活动的数字人表情(图中示例性展出了4个表情),与之对应的情感标签(图中未示出),和与之对应的崭新的心理活动文本信息,例如当用户语音要求心理活动文本反馈时(“数字人在想什么”或“或数字人为什么做出这个表情”),输出模块405可以通过语音输出该心理活动文本,或者直接展示该心理活动文本:“今天有点高兴,科目一通过了,在这个城市生活了这么多年,车管所都不知道在哪,所以老早起来简单地做了早餐吃过就出发了……”。其中,输出模块405对于表情和心理活动文本可以同时展示也可以不同时展示,此处不作限定。输出模块405可以执行图2方法实施例中的步骤206。
对于上述功能模块仅由一个单元执行的场景,可以参照图5所示的一种表情生成装置的结构示意图,该装置50包括:
处理单元501,用于获取第一情感编码,根据第一情感编码匹配第一文本,将第一文本输入推测神经网络,以生成第二文本,根据第二文本匹配第二情感编码,根据第二情感编码确定对应的第一表情,展示第一表情。
其中,处理单元501用于执行图2方法实施例中的步骤201至步骤206。
可选的,第一情感编码为随机生成的。
可选的,装置50还包括收发单元502,收发单元502具体用于:接收用户的消息;
处理单元501还用于:根据消息确定第一情感编码。
可选的:收发单元502还用于,接收用户的语音数据,语音数据用于请求第一表情对应的文本;处理单元501还用于:根据语音数据展示第二文本。
可选的,处理单元501还用于:将第二文本输入到推测神经网络,以生成第三文本;根据第三文本匹配第三情感编码;根据第三情感编码确定第二表情;展示第二表情。
可选的,处理单元501还用于:展示第一表情对应的情感标签。
可选的,推测神经网络为根据样本文本以及样本文本对应的情感标签训练生成的。
图6所示,为本申请的实施例提供的计算机设备60的一种可能的逻辑结构示意图。计算机设备60包括:处理器601、通信接口602、存储系统603以及总线604。处理器601、通信接口602以及存储系统603通过总线604相互连接。在本申请的实施例中,处理器601用于对计算机设备60的动作进行控制管理,例如,处理器601用于执行图2的方法实施例中终端设备所执行的步骤。通信接口602用于支持计算机设备60进行通信。存储系统603,用于存储计算机设备60的程序代码和数据。
其中,处理器601可以是中央处理器单元,通用处理器,数字信号处理器,专用集成电路,现场可编程门阵列或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。处理器601也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,数字信号处理器和微处理器的组合等等。总线604可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
装置50中的收发单元502相当于计算机设备60中的通信接口602,装置50中的处理单元501相当于计算机设备60中的处理器601。
本实施例的计算机设备60可对应于上述图2方法实施例中的终端设备,该计算机设备60中的通信接口602可以实现上述图2方法实施例中的终端设备所具有的功能和/或所实施的各种步骤,为了简洁,在 此不再赘述。
应理解以上装置中单元的划分仅仅是一种逻辑功能的划分,实际实现时可以全部或部分集成到一个物理实体上,也可以物理上分开。且装置中的单元可以全部以软件通过处理元件调用的形式实现;也可以全部以硬件的形式实现;还可以部分单元以软件通过处理元件调用的形式实现,部分单元以硬件的形式实现。例如,各个单元可以为单独设立的处理元件,也可以集成在装置的某一个芯片中实现,此外,也可以以程序的形式存储于存储器中,由装置的某一个处理元件调用并执行该单元的功能。此外这些单元全部或部分可以集成在一起,也可以独立实现。这里所述的处理元件又可以成为处理器,可以是一种具有信号的处理能力的集成电路。在实现过程中,上述方法的各步骤或以上各个单元可以通过处理器元件中的硬件的集成逻辑电路实现或者以软件通过处理元件调用的形式实现。
在一个例子中,以上任一装置中的单元可以是被配置成实施以上方法的一个或多个集成电路,例如:一个或多个特定集成电路(application specific integrated circuit,ASIC),或,一个或多个微处理器(digital singnal processor,DSP),或,一个或者多个现场可编程门阵列(field programmable gate array,FPGA),或这些集成电路形式中至少两种的组合。再如,当装置中的单元可以通过处理元件调度程序的形式实现时,该处理元件可以是通用处理器,例如中央处理器(central processing unit,CPU)或其它可以调用程序的处理器。再如,这些单元可以集成在一起,以片上系统(system-on-a-chip,SOC)的形式实现。
在本申请的另一个实施例中,还提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,当设备的处理器执行该计算机执行指令时,设备执行上述方法实施例中终端设备所执行的方法。
在本申请的另一个实施例中,还提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中。当设备的处理器执行该计算机执行指令时,设备执行上述方法实施例中终端设备所执行的方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (17)

  1. An expression generation method, characterized by comprising:
    obtaining a first emotion code;
    matching a first text according to the first emotion code;
    inputting the first text into an inference neural network to generate a second text;
    matching a second emotion code according to the second text;
    determining a corresponding first expression according to the second emotion code; and
    displaying the first expression.
  2. The method according to claim 1, characterized in that the first emotion code is randomly generated.
  3. The method according to claim 1, characterized in that obtaining the first emotion code comprises:
    receiving a message from a user; and
    determining the first emotion code according to the message.
  4. The method according to any one of claims 1-3, characterized in that the method further comprises:
    receiving voice data from the user, the voice data being used to request the text corresponding to the first expression; and
    displaying the second text according to the voice data.
  5. The method according to any one of claims 1-4, characterized in that the method further comprises:
    inputting the second text into the inference neural network to generate a third text;
    matching a third emotion code according to the third text;
    determining a second expression according to the third emotion code; and
    displaying the second expression.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    displaying the emotion label corresponding to the first expression.
  7. The method according to any one of claims 1-6, characterized in that the inference neural network is generated by training on sample texts and the emotion labels corresponding to the sample texts.
  8. An expression generation device, characterized by comprising:
    a processing unit, configured to obtain a first emotion code, match a first text according to the first emotion code, input the first text into an inference neural network to generate a second text, match a second emotion code according to the second text, determine a corresponding first expression according to the second emotion code, and display the first expression.
  9. The device according to claim 8, characterized in that the first emotion code is randomly generated.
  10. The device according to claim 8, characterized in that the device further comprises a transceiver unit, the transceiver unit being specifically configured to:
    receive a message from a user;
    and the processing unit is further configured to determine the first emotion code according to the message.
  11. The device according to any one of claims 8-10, characterized in that the device further comprises a transceiver unit, the transceiver unit being specifically configured to:
    receive voice data from the user, the voice data being used to request the text corresponding to the first expression;
    and the processing unit is further configured to display the second text according to the voice data.
  12. The device according to any one of claims 8-11, characterized in that the processing unit is further configured to:
    input the second text into the inference neural network to generate a third text;
    match a third emotion code according to the third text;
    determine a second expression according to the third emotion code; and
    display the second expression.
  13. The device according to any one of claims 8-12, characterized in that the processing unit is further configured to:
    display the emotion label corresponding to the first expression.
  14. The device according to any one of claims 8-13, characterized in that the inference neural network is generated by training on sample texts and the emotion labels corresponding to the sample texts.
  15. A computer device, characterized by comprising a processor and a memory,
    wherein the processor is configured to execute instructions stored in the memory, so that the computer device performs the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run on the computer, causes the computer to perform the method according to any one of claims 1 to 7.
  17. A computer program product, characterized in that, when the computer program product is executed on a computer, the computer performs the method according to any one of claims 1 to 7.
PCT/CN2023/103053 2022-08-19 2023-06-28 一种通信方法以及装置 WO2024037196A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210998977.1A CN117648411A (zh) 2022-08-19 2022-08-19 一种表情生成方法以及装置
CN202210998977.1 2022-08-19

Publications (1)

Publication Number Publication Date
WO2024037196A1 true WO2024037196A1 (zh) 2024-02-22

Family

ID=89940589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103053 WO2024037196A1 (zh) 2022-08-19 2023-06-28 一种通信方法以及装置

Country Status (2)

Country Link
CN (1) CN117648411A (zh)
WO (1) WO2024037196A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190172243A1 (en) * 2017-12-01 2019-06-06 Affectiva, Inc. Avatar image animation using translation vectors
US20190366557A1 (en) * 2016-11-10 2019-12-05 Warner Bros. Entertainment Inc. Social robot with environmental control feature
CN112330780A (zh) * 2020-11-04 2021-02-05 北京慧夜科技有限公司 一种生成目标角色的动画表情的方法和系统
US20210191506A1 (en) * 2018-01-26 2021-06-24 Institute Of Software Chinese Academy Of Sciences Affective interaction systems, devices, and methods based on affective computing user interface
CN114357135A (zh) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 交互方法、交互装置、电子设备以及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190366557A1 (en) * 2016-11-10 2019-12-05 Warner Bros. Entertainment Inc. Social robot with environmental control feature
US20190172243A1 (en) * 2017-12-01 2019-06-06 Affectiva, Inc. Avatar image animation using translation vectors
US20210191506A1 (en) * 2018-01-26 2021-06-24 Institute Of Software Chinese Academy Of Sciences Affective interaction systems, devices, and methods based on affective computing user interface
CN112330780A (zh) * 2020-11-04 2021-02-05 北京慧夜科技有限公司 一种生成目标角色的动画表情的方法和系统
CN114357135A (zh) * 2021-12-31 2022-04-15 科大讯飞股份有限公司 交互方法、交互装置、电子设备以及存储介质

Also Published As

Publication number Publication date
CN117648411A (zh) 2024-03-05

Similar Documents

Publication Publication Date Title
US10977452B2 (en) Multi-lingual virtual personal assistant
US11226673B2 (en) Affective interaction systems, devices, and methods based on affective computing user interface
CN108942919B (zh) 一种基于虚拟人的交互方法及系统
CN111418198B (zh) 提供文本相关图像的电子装置及其操作方法
US20070074114A1 (en) Automated dialogue interface
CN109086860B (zh) 一种基于虚拟人的交互方法及系统
CN111414506B (zh) 基于人工智能情绪处理方法、装置、电子设备及存储介质
CN109086351B (zh) 一种获取用户标签的方法及用户标签系统
Santamaría-Bonfil et al. Emoji as a proxy of emotional communication
CN114444510A (zh) 情感交互方法及装置、情感交互模型的训练方法及装置
Javaid et al. A Novel Action Transformer Network for Hybrid Multimodal Sign Language Recognition.
Yang et al. User behavior fusion in dialog management with multi-modal history cues
Kumar et al. A constructive deep convolutional network model for analyzing video-to-image sequences
CN116543798A (zh) 基于多分类器的情感识别方法和装置、电子设备、介质
WO2024037196A1 (zh) 一种通信方法以及装置
Gamborino et al. Towards effective robot-assisted photo reminiscence: Personalizing interactions through visual understanding and inferring
Wanner et al. Towards a multimedia knowledge-based agent with social competence and human interaction capabilities
CN111062207B (zh) 表情图像处理方法、装置、计算机存储介质及电子设备
WO2021095473A1 (ja) 情報処理装置、情報処理方法及びコンピュータプログラム
CN115171673A (zh) 一种基于角色画像的交流辅助方法、装置及存储介质
Chauhan et al. Mhadig: A multilingual humor-aided multiparty dialogue generation in multimodal conversational setting
CN115470325B (zh) 消息回复方法、装置及设备
CN117591660B (zh) 基于数字人的材料生成方法、设备及介质
US20230077446A1 (en) Smart seamless sign language conversation device
CN110795581B (zh) 图像搜索方法、装置、终端设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854101

Country of ref document: EP

Kind code of ref document: A1