CN113792537A - Action generation method and device - Google Patents

Action generation method and device

Info

Publication number
CN113792537A
Authority
CN
China
Prior art keywords
action
sequence
word
input
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110925419.8A
Other languages
Chinese (zh)
Inventor
张镇嵩
王志鑫
许松岑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority claimed from CN202110925419.8A
Publication of CN113792537A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application discloses an action generation method and apparatus in the field of artificial intelligence. The method may include: acquiring an input corpus, where the input corpus includes at least one word; acquiring a word sequence according to the input corpus, where the word sequence includes a vector corresponding to the at least one word; acquiring an action sequence, where the action sequence includes a vector corresponding to at least one action, and the at least one action includes an action corresponding to the at least one word; fusing the word sequence and the action sequence to obtain a fused sequence; and taking the fused sequence as the input of an action generation network and outputting a parameter set, where the parameter set includes action parameters and the action parameters are used for rendering an image that includes the action.

Description

Action generation method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an action generation method and apparatus.
Background
Generally, the meaning of text can be expressed through actions. For example, a sign language presenter can let a user obtain information about the world without hindrance; that is, text, speech, or the like is converted into sign language by the presenter, so that the user can understand, through sign language, what the text or speech expresses.
Generally, a text can be converted into a sign language word sequence through preset conversion rules or machine translation, and the final sign language gestures are then obtained by splicing the corresponding actions; alternatively, sign language actions can be generated directly from Chinese text or sign language text through a neural network, and so on. However, sign language actions generated in these ways are often incoherent, or are limited to a small data set with poor generalization, so the expressed meaning is less accurate and the user experience is poor.
Disclosure of Invention
The application provides an action generation method and apparatus for generating actions driven by text: by inputting fused text and action information into a neural network, the method makes full use of the action database and of the generation capability of the neural network when synthesizing sign language actions.
In view of the above, in a first aspect, the present application provides an action generating method, including: acquiring an input corpus, wherein the input corpus comprises at least one word; acquiring a word sequence according to the input corpus, wherein the word sequence comprises a vector corresponding to at least one word; acquiring an action sequence, wherein the action sequence comprises a vector corresponding to at least one action, and the at least one action comprises an action corresponding to at least one word; fusing the word sequence and the action sequence to obtain a fused sequence; and taking the fusion sequence as the input of an action generation network, and outputting a parameter set, wherein the parameter set comprises action parameters, the action parameters are used for rendering to obtain an action image, and the action generation network is used for converting the input vector into parameters related to the action.
Therefore, in this embodiment of the present application, after the action vector representing an action is obtained, the action corresponding to the input corpus can be preliminarily determined. The fused sequence combines the information carried by the word vectors and the action vectors, and the action generation network then outputs the action parameters, so that more accurate and more coherent actions are obtained and the user experience is improved.
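The following is a minimal, illustrative sketch of this pipeline in Python. The vocabulary, vector values, dimensions, and the linear map standing in for the action generation network are assumptions made purely for illustration; the application does not prescribe them.

```python
import numpy as np

# Illustrative word-vector and action-vector tables (values are arbitrary; a real
# system would use a learned embedding and a motion-capture action database).
WORD_VECS = {"television": np.array([0.2, 0.7]),
             "new":        np.array([0.5, 0.1]),
             "one":        np.array([0.9, 0.3])}
ACTION_VECS = {"television": np.array([1.0, 0.0, 0.4]),
               "new":        np.array([0.0, 1.0, 0.2]),
               "one":        np.array([0.3, 0.3, 0.8])}

def generate_parameter_set(input_corpus):
    words = input_corpus.split("/")                 # input corpus, e.g. "television/new/one"
    word_seq = [WORD_VECS[w] for w in words]        # word sequence (one vector per word)
    action_seq = [ACTION_VECS[w] for w in words]    # action sequence (one vector per action)
    fused_seq = [np.concatenate([w, a])             # fusion by concatenation
                 for w, a in zip(word_seq, action_seq)]
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 5))                     # stand-in for the action generation network
    return [W @ f for f in fused_seq]               # action parameters, one set per word

params = generate_parameter_set("television/new/one")
print(len(params), params[0].shape)                 # 3 sets of 4 action parameters
```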
In a possible implementation manner, the parameter set further includes an expression parameter, and the expression parameter is used for rendering an image of an expression corresponding to the input corpus.
In the embodiment of the application, the action generation network can also output the expression parameters for rendering to obtain the image including the expression, so that a user can more accurately know the information included in the input corpus through the expression image, and the user experience is improved.
In a possible embodiment, the method may further include: acquiring an expression sequence according to the input corpus, wherein the expression sequence comprises at least one vector corresponding to an expression; the fusing the word sequence and the action sequence to obtain a fused sequence may include: and fusing the word sequence, the action sequence and the expression sequence to obtain a fused sequence.
Therefore, in the embodiment of the application, the expression sequence can be obtained based on the input corpus, and then in the process of obtaining the fusion sequence, the expression sequence is fused in addition to the fusion word sequence and the action sequence, so that expression actions can be obtained through subsequent rendering, a user can more accurately know information included in the input corpus through the expression image, and user experience is improved.
In a possible implementation, before outputting the parameter set using the fused sequence as an input of the action generation network, the method may further include: the action generation network is trained using a training set, which includes a plurality of sample pairs, each sample pair including a set of word sequences and a corresponding at least one set of action parameters.
Therefore, in the embodiment of the application, the samples used for training the neural network may include the whole group of word sequences and the corresponding action parameters, so that actions corresponding to the output parameters of the trained neural network are more consistent, and the user experience is improved.
In a possible implementation manner, the obtaining the word sequence according to the input corpus may include: converting the input corpus according to a preset rule to obtain a word sequence; or, the input corpus is used as the input of a word vector conversion network, and a word sequence is output, and the word vector conversion network is used for converting the corpus into corresponding vectors.
In the embodiment of the application, the word sequence can be acquired in various ways, so that the method and the device can adapt to more application scenes.
In a possible implementation manner, the acquiring action sequence may include: and querying the action corresponding to the input corpus from a preset action database to obtain an action sequence.
In the embodiment of the application, the action database can be preset, after the input corpus is obtained, the corresponding action can be inquired from the action database according to the vocabulary in the input corpus to obtain the action vector, and then the action sequence can be efficiently and accurately obtained.
In a possible implementation manner, the obtaining of the input corpus may include: extracting text from at least one kind of data among text, voice, or image to obtain the input corpus.
Therefore, in the embodiment of the application, the input corpus can be acquired through multiple ways, so that the method and the device can be applied to multiple scenes, and the user experience is improved.
In a second aspect, the present application provides a neural network for action generation, the neural network being configured to perform the method steps in the first aspect or any implementation manner of the first aspect, and the neural network may specifically include:
the text acquisition module is used for acquiring an input corpus, wherein the input corpus comprises at least one word.
The word vector conversion network is used for acquiring a word sequence according to the input corpus, and the word sequence comprises a vector corresponding to at least one word;
the motion vector conversion network is used for acquiring a motion sequence, wherein the motion sequence comprises a vector corresponding to at least one motion, and the at least one motion comprises a motion corresponding to at least one word;
the fusion network is used for fusing the word sequence and the action sequence to obtain a fusion sequence;
and the action generation network is used for outputting a parameter set by taking the fusion sequence as the input of the action generation network, wherein the parameter set comprises action parameters, the action parameters are used for rendering to obtain an action image, and the action generation network is used for converting the input vector into parameters related to the action.
In a possible implementation manner, the parameter set further includes expression parameters, and the expression parameters are used for rendering to obtain expression actions corresponding to at least one action respectively.
In a possible implementation, the neural network may further include:
the expression vector conversion network is used for acquiring an expression sequence according to the input corpus, and the expression sequence comprises at least one vector corresponding to an expression;
and the fusion network is specifically used for fusing the word sequence, the action sequence and the expression sequence to obtain a fusion sequence.
In one possible embodiment, the neural network is trained using a training set, the training set including a plurality of sample pairs, each sample pair including a set of word sequences and a corresponding at least one set of motion parameters.
In a possible implementation manner, the text obtaining module is specifically configured to extract a text from at least one of text, voice, and image data to obtain an input corpus.
In a third aspect, an embodiment of the present application provides an action generating device having a function of implementing the action generating method according to the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
Specifically, the motion generation means may include:
the text acquisition module is used for acquiring an input corpus, wherein the input corpus comprises at least one word;
the word vector conversion module is used for acquiring a word sequence according to the input corpus, wherein the word sequence comprises a vector corresponding to at least one word;
the action vector conversion module is used for acquiring an action sequence, wherein the action sequence comprises a vector corresponding to at least one action, and the at least one action comprises an action corresponding to at least one word;
the fusion module is used for fusing the word sequence and the action sequence to obtain a fusion sequence;
and the output module is used for outputting a parameter set by taking the fusion sequence as the input of the action generation network, wherein the parameter set comprises action parameters, the action parameters are used for rendering to obtain an action image, and the action generation network is used for converting the input vector into parameters related to the action.
In a possible implementation manner, the parameter set further includes an expression parameter, and the expression parameter is used for rendering an image of an expression corresponding to the input corpus.
In one possible embodiment, the apparatus further comprises:
the expression vector conversion module is used for acquiring an expression sequence according to the input corpus, wherein the expression sequence comprises at least one vector corresponding to an expression;
and the fusion module is specifically used for fusing the word sequence, the action sequence and the expression sequence to obtain a fusion sequence.
In one possible embodiment, the apparatus further comprises:
and the training module is used for training the action generation network by using a training set before the fused sequence is used as the input of the action generation network and outputting a parameter set, wherein the training set comprises a plurality of sample pairs, and each sample pair comprises a group of word sequences and at least one group of corresponding action parameters.
In a possible implementation manner, the word vector conversion module is specifically configured to:
converting the input linguistic data or word sequences according to a preset rule to obtain word sequences;
or, the input corpus or word sequence is used as the input of a word vector conversion network to output the word sequence, and the word vector conversion network is used for converting the corpus into the corresponding vector.
In a possible implementation manner, the action vector conversion module is specifically configured to query an action corresponding to an input corpus or word sequence from a preset action database to obtain an action sequence; or, the input corpus or word sequence is used as the input of the motion vector conversion network, and the motion sequence is output.
In a possible implementation manner, the text obtaining module is specifically configured to extract a text from at least one of text, voice, and image, and use the text as an input corpus or convert the text to obtain the input corpus.
In a fourth aspect, an embodiment of the present application provides an action generating apparatus, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory to execute the processing-related functions in the action generating method according to any of the first aspect. Alternatively, the motion generating means may be a chip.
In a fifth aspect, the present application provides a motion generating apparatus, which may also be referred to as a digital processing chip or chip, where the chip includes a processing unit and a communication interface, and the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute functions related to processing in the first aspect or any one of the optional implementations of the first aspect.
In a sixth aspect, the present application provides a training method, comprising:
the method comprises the steps of training a neural network by using a training set to obtain the trained neural network, wherein the training set comprises a plurality of sample pairs, each sample pair comprises a group of word sequences and at least one group of corresponding action parameters, the neural network is used for fusing the input action sequences and the word sequences to obtain a fusion sequence, and outputting a parameter set according to the fusion sequence, the word sequences comprise vectors corresponding to at least one word in an input corpus, the action sequences comprise vectors corresponding to at least one action, the at least one action comprises an action corresponding to at least one word, the parameter set comprises action parameters, and the action parameters are used for rendering to obtain an action image.
Therefore, in the embodiment of the application, when the neural network is trained, each sample pair included in the used training set may include a group of word sequences and corresponding complete action parameters, so that actions output by the neural network are more consistent, and user experience is improved.
It will be appreciated that the neural network may be used to perform the steps of the first aspect or an alternative embodiment of the first aspect, or the neural network may be a neural network for action generation as provided by the second aspect.
In a possible implementation manner, the neural network is specifically configured to fuse a word sequence, an action sequence and an expression sequence to obtain a fused sequence, the expression sequence is obtained according to an input corpus, and the expression sequence includes a vector corresponding to at least one expression; the parameter set further comprises expression parameters, and the expression parameters are used for rendering to obtain images of expressions corresponding to the input corpus.
Therefore, in the embodiment of the application, the neural network can also output the expression parameters, so that the expression image can be rendered, a user can more accurately know the specific information of the input corpus through the expression image, and the user experience is improved.
In one possible embodiment, before the neural network is trained by using the training set, the neural network is also pre-trained by using the data set to obtain a pre-trained neural network, the data set may include a plurality of sample pairs, each sample pair may include a word or word vector and a corresponding action parameter, and then the pre-trained neural network may be fine-tuned by using the training set to obtain the trained neural network.
Therefore, in the embodiment of the application, in the pre-training stage, the word vector and the action parameter corresponding to the word vector can be used for training, so that the finally obtained neural network can be ensured to output accurate action. In the fine tuning stage, the word sequence and the parameters of the corresponding complete action can be used for training, so that the finally obtained neural network can output more consistent action parameters on the basis of accurate output, and the user experience is improved.
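As an illustration of this two-stage procedure, the sketch below first pre-trains a small network on individual word-vector/action-parameter pairs and then fine-tunes it on whole sequences with complete action-parameter sequences. The network structure, dimensions, learning rates, and placeholder data are assumptions made for illustration, not details specified by the application.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
d_word, d_action = 16, 8                      # assumed dimensions
net = nn.Sequential(nn.Linear(d_word, 32), nn.ReLU(), nn.Linear(32, d_action))
loss_fn = nn.MSELoss()

# Stage 1: pre-training on individual (word vector, action parameter) sample pairs.
pre_x = torch.randn(200, d_word)              # placeholder samples
pre_y = torch.randn(200, d_action)
opt = optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(net(pre_x), pre_y)
    loss.backward()
    opt.step()

# Stage 2: fine-tuning on whole word sequences paired with complete action-parameter
# sequences, so the generated actions are more coherent across a sentence.
seq_x = torch.randn(32, 10, d_word)           # 32 sequences of 10 word vectors
seq_y = torch.randn(32, 10, d_action)         # corresponding action-parameter sequences
opt = optim.Adam(net.parameters(), lr=1e-4)   # smaller learning rate for fine-tuning
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(net(seq_x), seq_y)
    loss.backward()
    opt.step()
```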
In a seventh aspect, an embodiment of the present application provides a training apparatus having a function of implementing the training method according to the sixth aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
Specifically, the training apparatus may include:
the training module is used for training the neural network by using a training set to obtain the trained neural network, the training set comprises a plurality of sample pairs, each sample pair comprises a group of word sequences and at least one group of corresponding action parameters, the neural network is used for fusing the input action sequences and the word sequences to obtain a fusion sequence and outputting a parameter set according to the fusion sequence, the word sequences comprise vectors corresponding to at least one word in the input corpus, the action sequences comprise vectors corresponding to at least one action, at least one action comprises an action corresponding to at least one word, the parameter set comprises action parameters, and the action parameters are used for rendering to obtain an action image.
In a possible implementation manner, the neural network is specifically configured to fuse a word sequence, an action sequence and an expression sequence to obtain a fused sequence, the expression sequence is obtained according to an input corpus, and the expression sequence includes a vector corresponding to at least one expression; the parameter set further comprises expression parameters, and the expression parameters are used for rendering to obtain images of expressions corresponding to the input corpus.
In a possible embodiment, the training module is further configured to pre-train the neural network by using the data set before training the neural network by using the training set, so as to obtain a pre-trained neural network, where the data set may include a plurality of sample pairs, each sample pair may include a word or word vector and a corresponding action parameter, and then, fine-tune the pre-trained neural network by using the training set, so as to obtain the trained neural network.
In an eighth aspect, an embodiment of the present application provides a training apparatus, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory to execute the function related to the processing in the action generating method according to any one of the first and sixth aspects. Alternatively, the training device may be a chip.
In a ninth aspect, embodiments of the present application provide a training apparatus, which may also be referred to as a digital processing chip or chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute functions related to processing in any one of the above-mentioned sixth aspect or sixth aspect.
In a tenth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method in any optional implementation manner of the first aspect or the sixth aspect.
In an eleventh aspect, embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method in any of the optional embodiments of the first or sixth aspects.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence body framework for use in the present application;
FIG. 2 is a system architecture diagram provided herein;
FIG. 3 is a schematic diagram of another system architecture provided herein;
fig. 4 is a schematic flowchart of an action generating method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another action generation method provided in the embodiment of the present application;
fig. 6 is a schematic flowchart of a fusion vector according to an embodiment of the present disclosure;
fig. 7 is a schematic flow chart of another fused vector provided in the embodiment of the present application;
fig. 8 is a schematic flow chart of another fused vector provided in the embodiment of the present application;
fig. 9 is a schematic flow chart of another fused vector provided in the embodiment of the present application;
fig. 10 is a schematic flow chart illustrating a parameter output by autoregressive method according to an embodiment of the present application;
fig. 11 is a schematic flow chart illustrating a non-autoregressive method for outputting parameters according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a neural network provided in an embodiment of the present application;
fig. 13 is a schematic flowchart of a training method according to an embodiment of the present application;
FIG. 14 is a schematic flow chart of a method for constructing a training sample according to an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of an action generating device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of another motion generation apparatus according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of an exercise device according to an embodiment of the present disclosure;
FIG. 18 is a schematic diagram of another exercise device according to an embodiment of the present disclosure;
fig. 19 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The general workflow of an artificial intelligence system will be described first. Please refer to fig. 1, which shows a schematic diagram of the artificial intelligence main framework. The framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes that start from data acquisition, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (the provision and processing of technology) up through the system's industrial ecology.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is performed through sensors. Computing power is provided by intelligent chips, such as a central processing unit (CPU), a network processor (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another hardware acceleration chip. The base platform includes distributed computing frameworks, networks, and other related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, the sensors communicate with the outside to acquire data, and the data are provided for computation to the intelligent chips in the distributed computing system provided by the base platform.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating human intelligent inference in a computer or intelligent system: according to an inference control strategy, the machine uses formalized information to reason about and solve problems, and typical functions are searching and matching.
Decision-making refers to the process of making decisions after reasoning over intelligent information, and generally provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, turn intelligent information decision-making into products, and realize practical applications. The main application fields include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, safe cities, and the like.
In order to better understand the scheme of the embodiments of the present application, the following first introduces terms and concepts related to a neural network that may be involved in the embodiments of the present application.
Corpora (Corpus): also known as free text, which may be words, sentences, segments, articles, and any combination thereof. For example, "today's weather is really good" is a corpus.
Loss function (loss function): also referred to as a cost function, it is a metric for comparing the predicted output of a machine learning model on a sample with the true value (also referred to as the supervised value) of that sample, that is, it measures the difference between the predicted output and the true value. Commonly used loss functions include the mean square error, cross entropy, logarithmic, and exponential losses. For example, the mean square error can be used as a loss function, defined as
MSE = (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)²,
where y_i is the true value of the i-th sample, ŷ_i is the corresponding predicted value, and n is the number of samples. The specific loss function can be selected according to the actual application scenario.
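For illustration only, the mean square error defined above can be computed as follows (the values are arbitrary):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])      # supervised (true) values
y_pred = np.array([1.1, 1.9, 3.3])      # model predictions
mse = np.mean((y_true - y_pred) ** 2)   # mean square error loss
```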
Gradient: the derivative vector of the loss function with respect to the parameter.
Random gradient: the number of samples in machine learning is usually large, so the loss function is computed each time on data obtained by random sampling, and the corresponding gradient is called the random (stochastic) gradient.
Back Propagation (BP): an algorithm for calculating gradient of model parameters according to a loss function and updating the model parameters.
Neural machine translation (neural machine translation): neural machine translation is a typical task of data transformation. The task is a technique of outputting a sentence in a target language corresponding to a sentence in a source language given the sentence. In a commonly used neural machine translation model, words in sentences in both source and target languages are encoded into vector representations, and associations between words and sentences are calculated in a vector space, thereby performing a translation task.
Pre-trained language model (PLM): each word in a natural language sequence is encoded into a vector representation so that downstream tasks can be performed; the word vector conversion network, the action generation network, and other networks mentioned below in this application can be implemented with a PLM. Training a PLM includes two phases, a pre-training phase and a fine-tuning phase. In the pre-training phase, the model is trained on a language-model task over large-scale unsupervised text, thereby learning word representations. In the fine-tuning phase, the model is initialized with the parameters learned in the pre-training phase and is trained for a small number of steps on downstream tasks (Downstream tasks) such as text classification (text classification) or sequence labeling (sequence labeling), so that the semantic information obtained by pre-training can be successfully transferred to the downstream tasks.
The action generation method provided by the embodiments of this application can be executed on a server, and can also be executed on a terminal device. The terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart TV, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a camcorder, a smart watch, a wearable device (WD), an autonomous vehicle, or the like, which is not limited in the embodiments of this application.
Referring to fig. 2, a system architecture 200 is provided in an embodiment of the present application. The system architecture includes a database 230 and a client device 240. The data collection device 260 is used to collect data and store it in the database 230, and the training module 202 generates the target model/rule 201 based on the data maintained in the database 230. How the training module 202 obtains the target model/rule 201 based on the data will be described in more detail below, and the target model/rule 201 is a neural network referred to in the following embodiments of the present application, and refer to the following description in fig. 4 to fig. 11.
The calculation module may include the training module 202, and the target model/rule obtained by the training module 202 may be applied to different systems or devices. In fig. 2, the performing device 210 configures a transceiver 212, the transceiver 212 may be a wireless transceiver, an optical transceiver, a wired interface (such as an I/O interface), or the like, and performs data interaction with an external device, and a "user" may input data to the transceiver 212 through the client device 240, for example, the client device 240 may transmit a target task to the performing device 210, request the performing device to train a neural network, and transmit a database for training to the performing device 210.
The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.
The calculation module 211 processes the input data using the target model/rule 201. Specifically, the calculation module 211 is configured to: acquiring an input corpus, wherein the input corpus can comprise one or more words; then, acquiring a word sequence according to the input corpus, wherein the word sequence comprises a vector corresponding to at least one word; acquiring an action sequence, wherein the action sequence comprises a vector corresponding to at least one action, and the at least one action corresponds to at least one word; fusing the word sequence and the action sequence to obtain a fused sequence; and obtaining a parameter set according to the fusion sequence, wherein the parameter set comprises action parameters, and the action parameters are used for rendering to obtain at least one action.
Finally, the transceiver 212 returns the output of the neural network to the client device 240. For example, a user may input a piece of text to be converted into a sign language action through the client device 240, and output the sign language action or a parameter representing the sign language action through the neural network, and feed the sign language action or the parameter back to the client device 240.
Further, the training module 202 may derive corresponding target models/rules 201 based on different data for different tasks to provide better results to the user.
In the case shown in fig. 2, the data entered into the execution device 210 may be determined from input data of a user, for example, who may operate in an interface provided by the transceiver 212. Alternatively, the client device 240 may automatically input data to the transceiver 212 and obtain the result, and if the client device 240 automatically inputs data to obtain authorization from the user, the user may set corresponding rights in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form can be display, sound, action, and the like. The client device 240 may also act as a data collector to store collected data associated with the target task in the database 230.
The training or updating processes mentioned in this application may be performed by the training module 202. It can be understood that training a neural network is a way of learning how the network transforms its input space, more specifically of learning the weight matrices. The purpose of training is to make the output of the neural network as close as possible to the expected value; therefore, by comparing the predicted value of the current network with the expected value, the weight vector of each layer of the neural network can be updated according to the difference between the two (of course, the weights are usually initialized before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the values of the weights in the weight matrices are adjusted to reduce the predicted value, and the adjustment continues until the output of the neural network approaches or equals the expected value. Specifically, the difference between the predicted value and the expected value of the neural network may be measured by a loss function (loss function) or an objective function (objective function). Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network can be understood as the process of reducing this loss as much as possible. In the following embodiments of the present application, the processes of updating the weights of the starting-point network and training the serial network may refer to this process, and details are not repeated below.
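The toy example below illustrates this comparison between the predicted value and the expected value and the resulting weight adjustment. The single weight matrix, the data, and the learning rate are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))            # initial weight matrix (pre-configured parameters)
x = rng.normal(size=(5,))              # one input sample
y_expected = np.array([0.5, -0.2, 1.0])

lr = 0.1
for step in range(200):
    y_pred = W @ x                     # predicted value of the current network
    diff = y_pred - y_expected         # difference between prediction and expected value
    loss = np.mean(diff ** 2)          # the larger the loss, the larger the difference
    grad = 2.0 / diff.size * np.outer(diff, x)   # gradient of the loss w.r.t. W
    W -= lr * grad                     # adjust the weights to reduce the loss
```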
As shown in fig. 2, a target model/rule 201 is obtained by training with the training module 202. In this embodiment of the present application, the target model/rule 201 may be a self-attention model, and the self-attention model may include a deep convolutional neural network (DCNN), a recurrent neural network (RNN), and so on. The neural networks referred to in this application may be of various types, such as a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), or a residual neural network, among others.
Wherein, in the training phase, the database 230 may be used to store a sample set for training. The executing device 210 generates a target model/rule 201 for processing the sample, and iteratively trains the target model/rule 201 by using the sample set in the database to obtain a mature target model/rule 201, where the target model/rule 201 is embodied as a neural network. The neural network obtained by the execution device 210 can be applied to different systems or devices.
During the inference phase, the execution device 210 may invoke data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250. The data storage system 250 may be disposed in the execution device 210 or the data storage system 250 may be an external memory with respect to the execution device 210. The calculation module 211 may process the sample acquired by the execution device 210 through the neural network to obtain a prediction result, where a specific expression form of the prediction result is related to a function of the neural network.
It should be noted that fig. 2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and a positional relationship between devices, modules, and the like shown in the diagram does not constitute any limitation. For example, in FIG. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other scenarios, the data storage system 250 may be disposed in the execution device 210.
The target model/rule 201 obtained by training according to the training module 202 may be applied to different systems or devices, such as a mobile phone, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR), a vehicle-mounted terminal, and the like, and may also be a server or a cloud device.
The target model/rule 201 may be a model for performing the action generation method provided herein in the embodiment of the present application, that is, the target model/rule 201 may be a neural network for action generation provided herein. Specifically, the model provided in the embodiment of the present application may include one or more networks of CNN, Deep Convolutional Neural Networks (DCNN), Recurrent Neural Networks (RNN), and the like.
Referring to fig. 3, the present application further provides a system architecture 300. The execution device 210 is implemented by one or more servers, optionally in cooperation with other computing devices, such as: data storage, routers, load balancers, and the like; the execution device 210 may be disposed on one physical site or distributed across multiple physical sites. The executing device 210 may use the data in the data storage system 250 or call the program code in the data storage system 250 to implement the steps of the training method for a computing device corresponding to fig. 12 below.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
The local devices of each user may interact with the enforcement device 210 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof. In particular, the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, and the like. The wireless network includes but is not limited to: a fifth Generation mobile communication technology (5th-Generation, 5G) system, a Long Term Evolution (LTE) system, a global system for mobile communication (GSM) or Code Division Multiple Access (CDMA) network, a Wideband Code Division Multiple Access (WCDMA) network, a wireless fidelity (WiFi), a bluetooth (bluetooth), a Zigbee protocol (Zigbee), a radio frequency identification technology (RFID), a Long Range (Long Range ) wireless communication, a Near Field Communication (NFC), or a combination of any one or more of these. The wired network may include a fiber optic communication network or a network of coaxial cables, among others.
In another implementation, one or more aspects of the execution device 210 may be implemented by each local device, e.g., the local device 301 may provide local data or feedback calculations for the execution device 210. The local device may also be referred to as a computing device.
It is noted that all of the functions of the performing device 210 may also be performed by a local device. For example, the local device 301 implements functions to perform the device 210 and provide services to its own user, or to provide services to a user of the local device 302.
In some scenarios, sign language digital persons may express the meaning represented by the text in the form of sign language actions, so that the user may learn information through the sign language actions of sign language digital persons.
In some schemes, text can be converted into a sign language word sequence by machine translation, and the final sign language gestures are then obtained by action splicing. For example, word segmentation, removal of function words, word-order adjustment, and the like can be performed in a rule-based manner to obtain gloss (the written expression of sign language); the corresponding actions are then extracted from a gloss motion-capture database and spliced in order. Action splicing mainly considers the transition time, that is, the smoothness of the transitions between actions. For example, a user can provide an input corpus, for example "a new television", through voice input or text input; after sign language transcription, the sign language expression "television/new/one" is obtained, the hand motion data of the sign language words "television", "new", and "one" are extracted, and the motion data are spliced to obtain the final sign language gesture action.
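A simplified sketch of this splicing scheme follows. The clip data, the hard-coded transcription, and the linear interpolation used for the transitions are assumptions for illustration and do not reproduce an actual gloss motion-capture database.

```python
import numpy as np

# Toy motion-capture database: each sign-language word maps to a short clip of
# per-frame pose parameters (the values here are placeholders).
GLOSS_DB = {"television": np.full((8, 3), 0.1),
            "new":        np.full((6, 3), 0.5),
            "one":        np.full((7, 3), 0.9)}

def transcribe(text):
    # Stand-in for rule-based transcription (segmentation, removing function words,
    # word-order adjustment); the gloss is simply hard-coded for this example.
    return ["television", "new", "one"] if "television" in text else text.split()

def splice(gloss, transition_frames=4):
    clips = [GLOSS_DB[g] for g in gloss]
    out = [clips[0]]
    for prev, nxt in zip(clips, clips[1:]):
        # Smooth the transition by linearly interpolating between the last pose of
        # the previous clip and the first pose of the next clip.
        alphas = np.linspace(0.0, 1.0, transition_frames)[:, None]
        out.append((1 - alphas) * prev[-1] + alphas * nxt[0])
        out.append(nxt)
    return np.concatenate(out, axis=0)

motion = splice(transcribe("a new television"))
print(motion.shape)   # total frames x pose-parameter dimension
```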
However, rule-based transcription is the approach adopted by early translation systems. Most of the rules are summarized by experts from experience, so the construction cost is high and it is difficult to cover all cases completely.
In other schemes, translation rules can be learned directly from paired corpora in a learning-based manner. For example, an end-to-end Chinese-text-to-gloss translation model is trained directly with a seq2seq network such as a Transformer, and the conversion is learned from a large number of parallel corpora. After the gloss is obtained, the corresponding actions are extracted from the motion-capture database according to the gloss and spliced in order.
However, the actions obtained in the above manner are limited by the limitation of the action database dictionary, and the interpretation of the meaning expressed by the text may not be accurate enough, so that the final obtained actions may not sufficiently express the specific meaning of the input corpus.
In other schemes, the text can be directly converted into sign language texts gloss through a machine translation method, and then the gloss is converted into action through a neural network, or the text is directly converted into the action. In this way, the problem that the final action cannot sufficiently express the specific meaning of the input corpus may occur due to insufficient generalization of the model.
Therefore, according to the action generation method provided by the application, the meaning represented by the input data is more accurately expressed in a mode of fusing the vector corresponding to the text and the vector corresponding to the action, and the user experience is improved.
First, the action generation method provided by the present application can be applied to various scenarios. For example, the method can be applied to places of daily communication of the user, such as scenes of restaurants, shopping, schools, meetings, traveling, social contacts and the like, and text, voice, images and the like which need to be expressed can be converted into actions to be expressed, so that the meaning expressed by the text, the voice, the images and the like can be conveyed to the user through the actions in various scenes.
For example, the sign language digital person may be used to execute instructions to perform corresponding actions, such as communicating information represented by text, voice, or images through the actions of the sign language digital person. In some scenarios, a sign language digital person can be used to convey the meaning of the required representation to the user, so that the user who is inconvenient for acquiring the information by means of voice, text or images can know the required information by means of voice, text or images by means of the actions of the sign language digital person. If the user communicates with an attendant at a restaurant, the attendant can introduce the recipe of the restaurant to the user through sign language digital persons; when the user goes to a retail store, the shop waiter can inform the user of the specific position of a certain store, a toilet and the like through the sign language digital person; for example, the users can communicate with each other through sign language digital persons, so that the known information is mutually exchanged; such as when traveling, a sign language digitizer may be used to communicate with other users, etc.
The following describes the operation generation method provided in the present application in detail.
Referring to fig. 4, a flow chart of an action generating method provided by the present application is schematically illustrated.
401. And acquiring input corpora.
Wherein, the input corpus may include one or more vocabularies.
The manner of obtaining the input corpus may be specifically local extraction, or may be extracted from data input by a user. Specifically, a text may be obtained first, and then the text is used as an input corpus, or the text is converted or extracted to obtain the input corpus.
For example, the text may be obtained by locally extracting a text segment and using it as the input corpus; by receiving text input by the user as the input corpus; by translating text input by the user to obtain the input corpus; by receiving an image input by the user and recognizing the image to obtain the input corpus; or by receiving voice data input by the user and recognizing the voice data to obtain the input text, and so on. Therefore, in this embodiment of the application, the input corpus can be obtained in various ways, so that the method can adapt to various scenarios and improve user experience.
Optionally, after the text is obtained locally or from input data, the text may be used directly as the input corpus, or the text may first be converted or translated and the resulting text used as the input corpus. For example, the word order of sign language may differ from the common word order: after a piece of text such as "I ride a bicycle" is obtained, the text may be used directly as the input corpus, or it may be translated into the expression order of sign language, such as "I/bicycle", which is then the input corpus.
402. A word sequence is obtained.
The word sequence may include a vector corresponding to an input corpus including one or more words, and each word may correspond to one or more vectors.
For the sake of understanding, the vectors corresponding to the words are referred to as word vectors hereinafter, and will not be described in detail below.
For example, if the word sequence includes vectors corresponding to a plurality of words, the plurality of words may form a sentence or a corpus of several sentences, that is, the input corpus, and each word corresponds to one or more groups of vectors.
In a possible implementation manner, the input corpus is converted according to a preset rule to obtain a word sequence, and the input corpus includes a text extracted from at least one of text, voice, or image. For example, a mapping relationship between words and vectors may be preset, and after a text to be converted into an action is obtained, a vector corresponding to each word in the text may be searched from the mapping relationship, so as to obtain a word sequence.
For example, given a piece of text such as "this is an old mobile phone", a corpus conforming to the word order of sign language can be extracted from the text according to step 401: "mobile phone/old/one", where "mobile phone" is a word, "old" is a word, and "one" is a word, so the length is 3. Each word has a corresponding vector representation, e.g., "mobile phone" corresponds to vector x1, "old" corresponds to vector x2, "one" corresponds to vector x3, and so on.
In a possible implementation, all or part of the input corpus may also be used as an input of a word vector conversion network, and the corresponding word sequence is output, where the word vector conversion network is used to convert the input corpus into word vectors. Specifically, the word vector conversion network may include a network that acquires text in a natural language such as text recognition, image recognition, or machine translation.
Alternatively, the word vector conversion network may include a natural language processing (NLP) network. After an initial vector representation of the input corpus is obtained, the initial vector representation is used as the input of the word vector conversion network, which analyzes each word in the input corpus together with its context semantics to obtain word vectors that better represent the meaning expressed by the input corpus. For example, each word in the corpus may be converted into an initial vector representation according to a preset mapping relationship, where each word corresponds to one initial vector representation, and one or more initial vectors form an input sequence. Specifically, suppose the corpus to be processed is "how is the weather today" and a mapping table is set in advance in which each word corresponds to a vector, e.g., "today" corresponds to vector x1 and "weather" corresponds to vector x2; the resulting input sequence is then used as the input of the word vector conversion network, which outputs the word sequence.
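As a hedged illustration, the sketch below combines a preset mapping (an embedding table) with a contextual encoder to play the role of such a word vector conversion network. The vocabulary, the choice of a GRU encoder, and the dimensions are assumptions for illustration, not choices specified by the application.

```python
import torch
from torch import nn

vocab = {"today": 0, "weather": 1, "how": 2, "is": 3, "the": 4}

class WordVectorNet(nn.Module):
    def __init__(self, vocab_size, d_init=16, d_word=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_init)             # initial vector per word
        self.encoder = nn.GRU(d_init, d_word, batch_first=True)   # contextual refinement

    def forward(self, token_ids):
        init_seq = self.embed(token_ids)        # (batch, length, d_init)
        word_seq, _ = self.encoder(init_seq)    # (batch, length, d_word)
        return word_seq

net = WordVectorNet(len(vocab))
ids = torch.tensor([[vocab["today"], vocab["weather"], vocab["how"], vocab["is"]]])
word_sequence = net(ids)                        # word sequence for the input corpus
```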
403. An action sequence is obtained.
The action sequence may be obtained based on the input corpus, or may be obtained based on the word vector. The motion sequence may include a vector corresponding to at least one motion corresponding to at least one word in the input corpus, and each word vector may correspond to one or more motion vectors (i.e., vectors representing motions).
Specifically, the action sequence may be determined by a preset conversion rule, or the action sequence may be output by a neural network, and the like, and may be specifically adjusted according to an actual application scenario.
In one possible embodiment, a rule for converting a word vector or text into a motion vector may be preset, for example, calculation may be performed according to a preset algorithm, or a mapping relationship between text and motion or between a word vector and motion may be preset. After the input corpus or word sequence is obtained, the action vector corresponding to the input corpus or word sequence can be determined according to the rule.
For example, a mapping relationship between a word vector and a motion vector may be preset, and after a word sequence is obtained, a motion vector corresponding to an input corpus is queried in the mapping relationship to obtain a motion sequence. For another example, a mapping relationship between the vocabulary and the motion vector may be preset, and after the input corpus is obtained, the motion vector mapped by the vocabulary in the input corpus may be queried in the mapping relationship to obtain the motion sequence.
In a possible implementation manner, the action vector corresponding to each word vector in the word sequence, or the action vector corresponding to each word in the input corpus, may be output by a neural network, so as to obtain the action sequence. For example, the input corpus or the word sequence can be used as the input of a motion vector conversion network, which outputs the motion vector corresponding to each word vector, thereby obtaining the motion sequence.
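A minimal sketch of such a motion vector conversion network, assuming a small fully connected (multi-layer perceptron) structure; the dimensions and layer sizes are illustrative assumptions rather than the network of any embodiment.

```python
import torch
import torch.nn as nn

class MotionVectorConverter(nn.Module):
    """Illustrative motion vector conversion network: maps word vectors to motion vectors.
    The two-layer MLP structure and dimensions are assumptions for this sketch."""
    def __init__(self, word_dim=64, motion_dim=48, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(word_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, word_seq):      # word_seq: (seq_len, word_dim)
        return self.net(word_seq)     # (seq_len, motion_dim), one motion vector per word vector

converter = MotionVectorConverter()
word_seq = torch.randn(3, 64)         # e.g. a word sequence of 3 word vectors
motion_seq = converter(word_seq)
print(motion_seq.shape)               # torch.Size([3, 48])
```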
Optionally, before step 402, the motion vector conversion network may be trained using a training set, so that the motion vector conversion network can convert word vectors into motion vectors. The training set may include a plurality of sample pairs, and each sample pair may include a group of word sequences and motion sequences, or a word vector and a motion vector, and the like. Therefore, in this embodiment of the application, word vectors can be converted into motion vectors through the neural network, which, compared with querying a mapping relationship, can output motion vectors more accurately and efficiently.
It should be noted that, if the action sequence is obtained from the input corpus, either step 402 or step 403 may be executed first, or step 402 and step 403 may be executed simultaneously, which may be adjusted according to the actual application scenario; this is not limited in the present application.
404. And fusing the word sequence and the action sequence to obtain a fused sequence.
After the word sequence and the action sequence are obtained, the word sequence and the action sequence are fused, and then a fusion sequence can be obtained.
The specific fusion mode may include splicing, weighted fusion, or splicing after processing, and the like, and the matched fusion mode may be specifically selected according to the actual application scenario, which is not limited in the present application.
For example, the word vector included in the word sequence and the motion vector included in the motion sequence may be spliced, so that the obtained fused sequence includes both the information included in the word sequence and the information included in the motion sequence.
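As a rough sketch of the splicing manner, assuming the two sequences are already aligned one vector per position:

```python
import numpy as np

def fuse_by_splicing(word_sequence, motion_sequence):
    """Fuse a word sequence and a motion sequence by splicing (concatenating)
    the vectors position by position. Assumes the two sequences are already
    aligned to the same length."""
    assert len(word_sequence) == len(motion_sequence)
    return [np.concatenate([w, m]) for w, m in zip(word_sequence, motion_sequence)]

words   = [np.ones(4), np.zeros(4)]            # toy word vectors
motions = [np.full(6, 0.5), np.full(6, 0.2)]   # toy motion vectors
fused = fuse_by_splicing(words, motions)
print(fused[0].shape)   # (10,) -- each fused vector carries both word and motion information
```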
In a possible implementation manner, an expression sequence may be further obtained, where the expression sequence may include one or more groups of expression vectors, and each group of expression vectors may represent one or more expression actions for more accurately and visually expressing the information represented by the input data. And then, the word sequence, the action sequence and the expression sequence can be fused to obtain a fusion sequence. Therefore, the fusion sequence can include more information such as words, actions and expressions, the action generation network can extract more information from the fusion sequence, and the output accuracy of the action generation network is improved.
405. And taking the fusion sequence as the input of the action generation network and outputting a parameter set.
The motion generation network may be configured to output motion parameters corresponding to an input vector. That is, the parameter set may include parameters representing a motion, such as parameters representing a motion direction, a joint position, or a motion distance, and the motion parameters may be used for rendering to obtain one or more frames of motion images, so that the word sequence is converted into a motion and the user can obtain the required information from the motion images.
Therefore, in this embodiment of the application, after the action sequence is obtained, the action sequence and the word sequence can be fused, so that the obtained fusion sequence includes the information carried by both. Based on this information, the neural network can output more accurate and more coherent action parameters, so that the final action accurately expresses the information conveyed by the input data and user experience is improved.
In a possible implementation manner, the parameter set may further include an expression parameter used for rendering an expression. In this way, the user can obtain context information that is more relevant to the input sequence, the information obtained by the user is more accurate, and user experience is improved.
In one possible implementation, before step 404, the action generating network may be trained using a training set to obtain a trained action generating network, so that the action generating network may output accurate action parameters.
Optionally, the training set may include a plurality of sample pairs, and each sample pair may include a group of word sequences and the action parameters corresponding to the word sequences; for example, each sample pair may include the complete word vectors of a corpus and the complete action vectors of the group of actions corresponding to that corpus, so that the actions corresponding to the action parameters output by the trained action generation network are more coherent. Compared with a sample pair that includes only one word vector and its one or more corresponding motion vectors, the sample pair provided by this embodiment enables the motion generation network to learn more coherent motions, so that the motions corresponding to the output parameters are more coherent and user experience is improved.
In addition, the sample pairs in the training set may further include word vectors and corresponding action vectors, so that the action generation network may learn the action corresponding to each word, the output accuracy of the action generation network is improved, and the adjustment may be specifically performed according to an actual application scenario.
In a possible implementation manner, if other vectors are also fused into the fused vector, such as an expression vector or another vector capable of representing information included in the input data, the motion generation network may further output expression actions or other parameters that express the correspondence between the word sequence and the motions of the motion sequence. Correspondingly, when the action generation network is trained, the sample pairs in the training set may also include expressions, or other items that can express the word sequences and actions, so that the action generation network outputs more types of parameters that express, through more actions, the information represented by the input data or the context semantics of each word; the user thus obtains more accurate information, or more actions rendered from the parameter set, which improves user experience.
The flow of the action generating method provided by the present application is described above, and for convenience of understanding, the method provided by the present application is described in more detail below with reference to a specific application scenario of the method flow of fig. 4.
It should be noted that the present application is exemplarily described using the conversion of text into sign language; the text may also be converted into other types of actions, and the specific action type may be adjusted according to the actual application scenario, which is not limited in this application.
Referring to fig. 5, a flow chart of another action generating method provided by the present application is schematically illustrated.
501. And acquiring the text.
The text can be obtained in various ways; for example, the input corpus may be obtained through voice recognition, text input, image recognition, or machine translation.
For example, the action generation method provided by the present application can be deployed in a user's terminal in the form of an application (APP). The user can take a picture with the terminal, and the terminal can recognize the captured image and identify the text included in the image. Alternatively, the user can input voice through the terminal's microphone, and the terminal obtains the text input by the user through voice recognition. For another example, the user may input the corpus directly in the APP interface. For example, the user may input text in a first language in the APP interface, and the terminal translates the text in the first language to obtain text in a second language, and so on.
502. Sign language translation.
After the input corpus is obtained, the input corpus can be translated into text capable of being expressed in sign language, namely sign language text, and the sign language text comprises one or more sign language words.
Generally, sign language is an independent form of expression with its own word order, language structure, and grammar rules. For example, sign language generally has no quantifiers, and numerals are placed after nouns: the Chinese text "there are five people in my family" is expressed in sign language as "I/home/people/five". For another example, a verb may be absorbed into its noun: the Chinese text "I ride a bicycle" is expressed in sign language as "I/bicycle".
In sign language expression, some words have specific attributes, such as special expressions or body orientation, which can also be marked in the sign language expression. For example, the Chinese text "Why don't you buy a down jacket?" may be expressed in sign language as "down jacket/buy/not/why (questioning expression)".
In this embodiment of the application, sign language rules can be collected by experts, and sign language translation rules can be constructed directly in a rule-based manner; alternatively, in a deep-learning-based manner, a large number of parallel corpora pairing written Chinese text with its sign language expression can be collected, and a sign language translation model can be trained to extract translation rules, so that Chinese text is converted directly into sign language text. For example, when parallel corpora cannot be obtained or are scarce, a sign language translation module constructed in the rule-based manner performs better; when a large number of parallel corpora are available, the sign language translation module constructed in the deep-learning-based manner is used, so that a sign language translation network with better performance is obtained.
503. And (5) converting the action.
After the sign language text is obtained through sign language translation, the action corresponding to the sign language text can be obtained, and the sign language action is obtained.
Specifically, the sign language action can be obtained by searching the sign language action database for an action corresponding to the sign language vocabulary. For example, an action database may be provided in advance, some sign language words and their corresponding sign language actions in a general sign language dictionary may be stored in the action database, and after the sign language words are obtained, the action database may be queried for the sign language actions corresponding to the sign language words.
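A sketch of such an action-database query, in which the database contents, frame counts, and the handling of missing vocabulary are illustrative assumptions:

```python
import numpy as np

# Hypothetical action database: each sign language word maps to a multi-frame action,
# stored as a list of per-frame pose vectors (values are made up).
ACTION_DB = {
    "handset": [np.random.rand(10) for _ in range(3)],   # 3-frame action
    "old":     [np.random.rand(10) for _ in range(2)],   # 2-frame action
}

def lookup_sign_actions(sign_words):
    """Query the action database for each sign language word.
    Words without a matching entry return None, to be handled later
    (e.g. by a placeholder vector during fusion)."""
    return [ACTION_DB.get(w) for w in sign_words]

actions = lookup_sign_actions(["handset", "old", "one"])
print([len(a) if a is not None else "missing" for a in actions])
```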
504. Multimodal fusion.
After the sign language vocabulary and the sign language action are obtained, the sign language vocabulary can be converted into a word vector, and the sign language action can be converted into an action vector. Then, the word vector and the motion vector are fused to obtain a fused vector, so that the obtained fused vector can comprise information contained in the sign language vocabulary and information expressed by the sign language motion.
The specific manner of fusing the word vector and the motion vector may be splicing, splicing after processing the word vector and the motion vector, weighted fusion, or the like, and may be adjusted according to the actual application scenario. For example, a word vector may be represented as x1 and a motion vector as y1; splicing x1 and y1 yields a fused vector x1y1.
Specifically, the specific manner of converting the sign language vocabulary and the sign language motion into the vector may be conversion according to a preset mapping relationship, or conversion through a neural network.
For example, a mapping relationship between sign language vocabulary and word vectors can be preset, such as "today" corresponding to vector x1 and "weather" corresponding to vector x2. After the sign language vocabulary is obtained, the word vector corresponding to each sign language word can be looked up according to the preset mapping relationship, so as to obtain the word sequence.
For another example, the word vector may be output using a word vector conversion network. The word vector conversion network may be implemented by a fully connected network or a multi-layered perceptron network for converting sign language vocabulary of a textual expression into word vectors.
Also for example, a motion vector conversion network may be used to output motion vectors, which may also be implemented via a fully connected network or a multi-layer perceptron network for converting sign language motion into motion vectors.
Specifically, for example, as shown in fig. 6, after a sign language word and a sign language motion are obtained, the sign language word is input to a word vector conversion network to output a word vector, the sign language motion is input to a motion vector conversion network to output a motion vector, and then the word vector and the motion vector are fused by a fusion network to obtain a fusion vector.
In addition, usually, a sign language vocabulary may correspond to a multi-frame sign language action, so that the word vector and the action vector need to be fused after aligning the frame number of the word vector with the frame number of the action vector. The specific fusion mode may include: each sign language vocabulary is aligned with its corresponding one or more motion vectors, or, when one sign language vocabulary corresponds to multiple motion vectors, the multiple vectors may be fused into a fused motion vector, and then the word vector and the fused motion vector are aligned.
For example, as shown in fig. 7, the first sign language vocabulary corresponds to a 3-frame action, the second sign language vocabulary has no corresponding action in the action database, and the third sign language vocabulary corresponds to a 2-frame action. Word vectors g1, g2, and g3 of the three sign language words can be obtained respectively. Through the motion vector conversion network, motion vectors m1f1, m1f2, and m1f3 corresponding to the first sign language vocabulary and motion vectors m3f1 and m3f2 corresponding to the third sign language vocabulary can be obtained; because the second sign language vocabulary has no corresponding action in the action database, a placeholder vector is used as its action vector. In a possible implementation, each word vector is directly spliced with the motion vector of each of its frames, and the splicing result is input into the fusion network to obtain the fused vectors; the number of frames of the fused vectors is identical to the number of frames of the motion vectors input to the fusion network. The result of fusing word vector g1 with action vector m1f1 is G1f1, the result of fusing word vector g1 with action vector m1f2 is G1f2, the result of fusing word vector g2 with the placeholder vector is G2f1, and so on. The spliced result can then be input to a fully connected layer and a ReLU layer for processing to obtain the final fused vector.
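The frame-aligned splicing described above (each word vector spliced with the motion vector of each of its frames, a placeholder for missing actions, then a fully connected layer and a ReLU layer) can be sketched as follows; the dimensions and the zero placeholder are assumptions:

```python
import torch
import torch.nn as nn

word_dim, motion_dim, fused_dim = 16, 24, 32
fusion_net = nn.Sequential(nn.Linear(word_dim + motion_dim, fused_dim), nn.ReLU())

def fuse_frames(word_vectors, per_word_motions):
    """word_vectors: list of (word_dim,) tensors, one per sign language word.
    per_word_motions: list whose i-th entry is a list of per-frame (motion_dim,)
    tensors for word i, or None if no action was found in the database."""
    placeholder = torch.zeros(motion_dim)              # placeholder motion vector (assumption)
    fused_frames = []
    for g, frames in zip(word_vectors, per_word_motions):
        frames = frames if frames else [placeholder]   # missing action -> one placeholder frame
        for m in frames:                               # splice the word vector with every frame
            fused_frames.append(fusion_net(torch.cat([g, m])))
    return torch.stack(fused_frames)                   # frame count follows the motion input

g = [torch.randn(word_dim) for _ in range(3)]
m = [[torch.randn(motion_dim) for _ in range(3)], None, [torch.randn(motion_dim) for _ in range(2)]]
print(fuse_frames(g, m).shape)   # torch.Size([6, 32]) -- 3 + 1 + 2 frames
```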
For another example, as shown in fig. 8, the multiple motion vectors corresponding to a single sign language vocabulary may first be merged into a single motion vector. For example, the action corresponding to the first sign language vocabulary has 3 frames, the second sign language vocabulary has no corresponding action in the action database, and the action corresponding to the third sign language vocabulary has 2 frames. Word vectors g1, g2, and g3 of the three sign language words can be obtained respectively. Through the motion vector conversion network, the multiple motion vectors corresponding to the first sign language vocabulary can be fused to obtain m1, and the multiple motion vectors corresponding to the third sign language vocabulary can be fused to obtain m3; the second sign language vocabulary uses a placeholder motion vector m2. Then g1 and m1 are fused to obtain G1, g2 and m2 are fused to obtain G2, and g3 and m3 are fused to obtain G3.
Furthermore, in some possible embodiments, after the input corpus is obtained, other information, such as expressions, body orientation, and the like, may be obtained from the input corpus in addition to the sign language vocabulary and the sign language actions.
Specifically, the type of the expression can be obtained based on the input corpus, and a corresponding expression vector is generated to obtain an expression sequence. One or more expression corresponding vectors may be included in the expression sequence. The word sequence, the action sequence and the expression sequence can be fused to obtain a fusion sequence, so that the fusion sequence can comprise information carried by the word sequence and the action sequence and information carried by the expression sequence and related to the expression.
The specific manner of obtaining the type of the expression based on the input corpus may include: generating a matched expression by combining the context semantics of the input corpus; determining the matched expression according to information carried in the input corpus, such as mood words and punctuation marks; or determining the matched expression by combining the context semantics of the input corpus with the carried information. For example, if the input corpus carries "?", the context semantics of the input corpus can be combined to determine that the current expression is a questioning expression. A vector representing each expression, or a rule for generating such a vector, may be preset, and after one or more expressions matching the input corpus are determined, the corresponding expression vectors can be obtained.
For example, as shown in fig. 9, the word vector of the first sign language vocabulary is g1, the first-frame motion vector corresponding to the action of the first sign language vocabulary is m1f1, and the first-frame expression vector corresponding to the first sign language vocabulary is e1f1. Taking the first frame as an example, the result of fusing word vector g1, motion vector m1f1, and expression vector e1f1 is G1f1, and so on.
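A toy sketch of determining the expression type from punctuation or mood words carried in the input corpus; the rules and expression vectors are assumptions for illustration only:

```python
import numpy as np

# Hypothetical expression vectors (values are made up).
EXPRESSIONS = {
    "question": np.array([1.0, 0.0], dtype=np.float32),
    "neutral":  np.array([0.0, 0.0], dtype=np.float32),
}

def expression_for_corpus(text):
    """Toy rule: decide the expression type from punctuation carried in the input corpus;
    a real system could also combine context semantics or mood words."""
    if "?" in text or "？" in text:
        return EXPRESSIONS["question"]
    return EXPRESSIONS["neutral"]

e = expression_for_corpus("How is the weather today?")
print(e)   # question expression vector, to be fused per frame with the word/motion vectors
```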
505. Sign language generation.
After the fusion vector is obtained, the fusion vector can be used as an input of a motion generation network, and a parameter set, such as parameters representing the motion direction of the arm, the position of the hand, the joint form and the like of the sign language digital human is output. After the action parameters of the sign language are generated, the action parameters can be used for rendering, and an image corresponding to the rendered one or more frames of sign language actions is obtained and displayed in a display interface.
For example, a sign language action sequence corresponding to the input vector may be output through an action generation network, wherein one or more kinds of information of sign language identification, motion direction, motion amplitude, and the like may be included. If each sign language action sequence stores corresponding sign language actions, rendered data can be directly extracted to be played, or the rendered data can be rendered and played in real time based on the sign language action sequence, and the rendering mode is not limited by the application.
In addition, if the expression vector is also fused in the fusion vector, the output of the action generation network may further include expression parameters, and the expression parameters may be used for rendering to obtain an image including an expression.
Specifically, after the fused vector is obtained, the fused vector may be used as an input to a sequence-to-sequence transform (seq2seq) network, i.e., an action generating network, to output a set of parameters. The seq2seq network may generate the parameter set by means of autoregressive or non-autoregressive means.
In one possible embodiment, the parameter set may be output in an autoregressive manner. A deep self-attention transformation network (Transformer) may be used as the encoder of the fused vectors and as the generator of the parameter set.
For example, as shown in fig. 10, xn represents the n-th fused vector in the fused sequence, N is the total length of the fused sequence, and yt represents the sign language gesture parameters and expression parameters (i.e., the parameters included in the parameter set). The Transformer encoder may utilize a self-attention mechanism to read the information carried by the word vectors in the fused vectors. For example, the Transformer encoder may include a plurality of self-attention (SA) modules; an SA module may be configured to calculate a degree of association based on the input vectors, such as the degree of association between each word in the corpus and one or more adjacent words, and then fuse the input vectors with the degree of association, so as to obtain the output result of the SA module.
The read information is then input to the Transformer Decoder. Taking the action yu of the u-th frame and its position code cu as an example (the u-th frame may be any sign language action), the Transformer Decoder may generate the action y(u+1) of the (u+1)-th frame and the corresponding position code c(u+1). When the generated position code c(u+1) reaches a preset threshold, synthesis of the next frame of sign language action is stopped. In general, the output of the Transformer Decoder is a real-valued vector, and the floating-point numbers can be transformed into usable parameters by a linear transformation. The linear transformation layer may include a fully connected layer for projecting the vectors generated by the Transformer Decoder into vectors of log probabilities (logits), so as to obtain the action parameters.
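An illustrative autoregressive generation loop in the spirit of fig. 10, using a standard Transformer encoder-decoder; the stopping criterion on the position code, the start token, the dimensions, and the feedback of the decoder state are simplifying assumptions, not the exact network of the embodiment:

```python
import torch
import torch.nn as nn

d_model, param_dim, max_frames = 64, 32, 100

encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=4), num_layers=2)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=4), num_layers=2)
to_params = nn.Linear(d_model, param_dim)   # linear layer projecting decoder output to parameters
to_stop   = nn.Linear(d_model, 1)           # predicts the position code / stop signal (assumption)

def generate(fused_sequence, stop_threshold=0.9):
    """fused_sequence: (N, 1, d_model) fused vectors. Frames are generated one by one;
    generation stops when the predicted position code exceeds the threshold.
    Feeding the decoder state back as the next input is a simplification of re-embedding
    the generated parameters."""
    memory = encoder(fused_sequence)               # read the fused sequence
    frames = [torch.zeros(1, 1, d_model)]          # start token (assumption)
    params = []
    for _ in range(max_frames):
        tgt = torch.cat(frames, dim=0)
        out = decoder(tgt, memory)[-1:]            # hidden state of the latest step
        params.append(to_params(out).squeeze())    # action/expression parameters of this frame
        if torch.sigmoid(to_stop(out)).item() > stop_threshold:
            break                                  # position code reached the preset threshold
        frames.append(out)                         # feed back for the next frame
    return params

fused = torch.randn(5, 1, d_model)
print(len(generate(fused)))
```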
In another possible embodiment, the parameter set may be generated by non-autoregressive means. Compared with the autoregressive method, the difference is that in the process of generating the parameter set by the non-autoregressive method, the total length of the sign language actions can be estimated in advance, and then the position codes of the sign language actions can be determined according to the length.
For example, as shown in fig. 11, parts similar to those of fig. 10 described above are not described again. The difference is that the total length T is obtained from the length estimation module or from input data given by the user. The length estimation module may be a small neural network for estimating the length corresponding to the input data; its input may be the output of the Transformer Encoder, and its output is the length of the sign language action corresponding to the input corpus. When the length of the sign language action to be generated is determined, the sign language action can be estimated directly in a non-autoregressive manner, so that the inference time is reduced.
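The length estimation module can be sketched as a small network that pools the Transformer Encoder output and predicts the number of sign language action frames; the pooling and output head are assumptions:

```python
import torch
import torch.nn as nn

class LengthEstimator(nn.Module):
    """Illustrative length estimation module: pools the Transformer Encoder output
    and predicts the total number of sign language action frames T. With T known,
    all frame position codes can be produced at once (non-autoregressive)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.head = nn.Linear(d_model, 1)

    def forward(self, encoder_out):            # encoder_out: (N, d_model)
        pooled = encoder_out.mean(dim=0)       # simple average pooling (assumption)
        return torch.relu(self.head(pooled))   # predicted length T (non-negative scalar)

est = LengthEstimator()
T = est(torch.randn(5, 64))
print(T.shape)   # torch.Size([1])
```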
Therefore, in the embodiment of the present application, after the word vector and the motion vector are obtained, the word vector and the motion vector may be fused to obtain a fused vector. The information contained in the word vector and the motion vector can be included in the fused vector, so that the fused vector containing more information can be used as the input of the seq2seq network, and more accurate sign language motion output can be obtained by combining the context semantic information of the word vector.
The action generation method provided by the application can be realized through a neural network.
Illustratively, the present application provides a neural network for action generation, which may specifically include, as shown in fig. 12:
a text obtaining module 1201, configured to obtain an input corpus, where the input corpus includes at least one word.
A word vector conversion network 1202, configured to obtain a word sequence according to the input corpus, where the word sequence includes a vector corresponding to at least one word;
the action vector conversion network 1203 is configured to obtain an action sequence, where the action sequence includes a vector corresponding to at least one action, and the at least one action includes an action corresponding to at least one word;
a fusion network 1204, configured to fuse the word sequence and the action sequence to obtain a fusion sequence;
and the action generation network 1205 takes the fusion sequence as an input of the action generation network, and outputs a parameter set, wherein the parameter set comprises action parameters, the action parameters are used for rendering to obtain an action image, and the action generation network is used for converting the input vector into parameters related to the action.
In a possible implementation manner, the parameter set further includes expression parameters, and the expression parameters are used for rendering to obtain expression actions corresponding to at least one action respectively.
In a possible implementation, the model may further include:
the expression vector conversion network 1206 is used for acquiring an expression sequence according to the input corpus, wherein the expression sequence comprises at least one vector corresponding to an expression;
the fusion network 1204 is specifically configured to fuse the word sequence, the action sequence, and the expression sequence to obtain a fusion sequence.
In a possible implementation manner, the text obtaining module 1201 is specifically configured to extract a text from at least one of text, voice, and image, so as to obtain an input corpus. For example, the text acquisition module may include a text recognition network, a voice recognition network, an image recognition network, or the like.
In one possible embodiment, the neural network may be further trained using a training set, where the training set includes a plurality of sample pairs, and each sample pair includes a set of word sequences and a corresponding at least one set of motion parameters.
In addition, the present application also provides a training method, as shown in fig. 13, the training method may include:
1301. and training the neural network by using a training set to obtain the trained neural network.
The training set comprises a plurality of sample pairs, each sample pair comprises a group of word sequences and at least one group of corresponding action parameters, the neural network is used for fusing the input action sequences and the word sequences to obtain a fused sequence and outputting a parameter set according to the fused sequence, the word sequences comprise vectors corresponding to at least one word in the input corpus, the action sequences comprise vectors corresponding to at least one action, the at least one action comprises an action corresponding to at least one word, the parameter set comprises action parameters, and the action parameters are used for rendering to obtain an action image.
Therefore, in the embodiment of the application, when the neural network is trained, each sample pair included in the used training set may include a group of word sequences and corresponding complete action parameters, so that actions output by the neural network are more consistent, and user experience is improved.
In a possible implementation manner, the neural network is specifically configured to fuse a word sequence, an action sequence and an expression sequence to obtain a fused sequence, the expression sequence is obtained according to an input corpus, and the expression sequence includes a vector corresponding to at least one expression; the parameter set further comprises expression parameters, and the expression parameters are used for rendering to obtain images of expressions corresponding to the input corpus.
Therefore, in the embodiment of the application, the neural network can also output the expression parameters, so that the expression image can be rendered, a user can more accurately know the specific information of the input corpus through the expression image, and the user experience is improved.
In one possible embodiment, before the neural network is trained by using the training set, the neural network is also pre-trained by using the data set to obtain a pre-trained neural network, the data set may include a plurality of sample pairs, each sample pair may include a word or word vector and a corresponding action parameter, and then the pre-trained neural network may be fine-tuned by using the training set to obtain the trained neural network.
Therefore, in the embodiment of the application, in the pre-training stage, the word vector and the action parameter corresponding to the word vector can be used for training, so that the finally obtained neural network can be ensured to output accurate action. In the fine tuning stage, the word sequence and the parameters of the corresponding complete action can be used for training, so that the finally obtained neural network can output more consistent action parameters on the basis of accurate output, and the user experience is improved.
Certainly, in both the pre-training stage and the fine-tuning stage, the neural network can be trained by using the training set, so that the finally obtained neural network is accurate in output and more consistent in action, the user experience is improved, and specifically, samples used in each stage can be selected according to actual application scenes, which is not limited in the application.
Generally, a training set for training a neural network may use a variety of samples, and the following description is given by taking an example of the application to a text-to-sign language scenario, and exemplarily illustrates several possible training sets.
Training set one (i.e., the aforementioned data set)
The first training set may include a database of actions comprising sign language action and expression data corresponding to commonly used sign language vocabulary.
The common sign language vocabulary can be from some common sign language dictionaries or statistics of daily conversations of users, and the sign language actions corresponding to the sign language vocabulary in the action database can be labeled with a start frame and an end frame.
Common sign language vocabulary in the sign language database can be combined into sentences, and a large number of training samples can be constructed by means of action fusion. For example, as shown in fig. 14, a certain vocabulary matches two actions, namely action 1 and action 2, each of which has a start action and an end action, such as raising an arm or lowering an arm. A transition can be generated between action 1 and action 2 by combining the end action of action 1 and the start action of action 2; if the end action is lowering the arm and the start action is raising the arm, they can be combined into an action of lowering the arm to a middle position, or another action that smoothly connects action 1 and action 2, so that the action output by the finally trained model is smoother.
Therefore, training data can be collected at low cost by collecting common sign language words and the corresponding actions or expressions, which improves the performance of the model. In addition, performing model training with a training set to which transition actions are added makes the actions output by the model more coherent and improves user experience.
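The action-fusion construction of training samples can be sketched as splicing per-word actions and generating a transition segment between the end frame of one action and the start frame of the next, for example by linear interpolation; the interpolation scheme and frame counts are assumptions:

```python
import numpy as np

def splice_with_transition(action1, action2, n_transition=2):
    """action1/action2: arrays of shape (frames, pose_dim). A transition segment is
    generated by linearly interpolating between the end frame of action1 and the
    start frame of action2, so that the spliced training sample is smoother."""
    end, start = action1[-1], action2[0]
    ts = np.linspace(0.0, 1.0, n_transition + 2)[1:-1]   # interior interpolation steps
    transition = np.stack([(1 - t) * end + t * start for t in ts])
    return np.concatenate([action1, transition, action2], axis=0)

a1 = np.random.rand(3, 10)   # e.g. 3-frame action for one sign language word
a2 = np.random.rand(2, 10)   # 2-frame action for the next word
sample = splice_with_transition(a1, a2)
print(sample.shape)          # (7, 10): 3 frames + 2 transition frames + 2 frames
```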
Training set two (the training set)
The difference from the aforementioned first training set is that the sample pairs in the second training set may include a corpus and parameters of a complete sign language action corresponding to the corpus. For example, a sample pair may include the sign language words "i/v" and parameters of a complete set of sign language actions corresponding to the corpus.
Therefore, when the training set two is used for model training, the model can learn the corpus and the corresponding complete sign language action, so that the model can output more continuous parameters of the sign language action.
In addition, model training may also be performed by combining training set one and training set two. Generally, a large amount of training data can be obtained quickly because the acquisition cost of training set one is low, while training set two provides ground-truth data for complete corpora. Therefore, model training can be performed in a two-stage manner. Whether the amount of data in data set one or in data set two increases, it is ultimately beneficial to the final performance.
For example, in the pre-training stage, model pre-training is performed by using data constructed by a training set I, so that an output accurate model is obtained; in the fine tuning stage, fine tuning is performed by using the second training set, and a model with more coherent output sign language actions is obtained on the basis of accurate output. Certainly, in both the pre-training stage and the fine-tuning stage, the training set two can be used for training, so that the final sign language action output by the neural network is more coherent, and the user experience is improved.
The foregoing describes the action generation method, the flow of the training method, and the neural network in detail, and for easy understanding, some specific application scenarios and implementation effects of the present application are exemplarily described below.
For example, the accuracy of the sign language generation result may be measured by the dynamic time warping (DTW) distance. The DTW algorithm can align one high-dimensional time sequence to another high-dimensional time sequence; the closer the two sequences are, the smaller their DTW distance.
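DTW can be computed with the classic dynamic-programming recurrence sketched below; this is a generic illustration, not the exact evaluation code used for the experiments:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two high-dimensional time sequences.
    seq_a: (len_a, dim), seq_b: (len_b, dim). A smaller distance means closer sequences."""
    la, lb = len(seq_a), len(seq_b)
    cost = np.full((la + 1, lb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],              # insertion
                                 cost[i, j - 1],              # deletion
                                 cost[i - 1, j - 1])          # match
    return cost[la, lb]

print(dtw_distance(np.random.rand(5, 3), np.random.rand(7, 3)))
```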
In one scenario, a motion database of 283 sign language words is collected. This motion database is then used to construct the training set Random10k and the test set Random1k. Training set Random10k is constructed as follows: 10000 sentences are obtained by random sentence-making, each sentence containing at most 6 words, and the sign language ground truth (GT) of the whole sentence is obtained by action splicing and fusion. During training, 30% of the words in the action database have no corresponding action.
Test set Random1k was constructed as follows: 1000 sentences are obtained through a random sentence making mode, the longest length of each sentence is 6 words, and then the sign language action of the whole sentence is obtained through an action splicing and fusing mode.
Actions can be extracted from 100 real sign language sentences, and a test set GT100 is constructed.
The scheme of the present application is compared with two baselines. Scheme one: the corresponding sign language action is generated only from the text information. Scheme two: the corresponding sign language action is generated only from the action information. The scheme provided by the present application takes both text information and action information as input.
The effects achieved by each scheme are shown in Table 1; the scheme provided by the present application is superior to the plain-text scheme and the pure-action scheme. Moreover, even when some words in the text have no paired sign language action, the output effect of the model is still better.
TABLE 1 - DTW distance on each verification set (smaller is better)

Verification set | Scheme one (plain text) | Scheme two (pure action) | Scheme of the application (text + action)
Random1k         | 2.33                    | 6.32                     | 1.11
GT100            | 14.55                   | 13.15                    | 9.15
In another scenario, a motion database of 283 sign language words is collected. This motion database is then used to construct the training set Random10k and the test set Random1k. Training set Random10k is constructed as follows: 10000 sentences are obtained by random sentence-making, each sentence containing at most 6 words, and the sign language action of the whole sentence is obtained by action splicing and fusion. During training, it is assumed that all words in the action database have corresponding actions.
Test set Random1k was constructed as follows: 1000 sentences are obtained through a random sentence making mode, the longest length of each sentence is 6 words, and then the sign language action of the whole sentence is obtained through an action splicing and fusing mode.
Actions can be extracted from 100 real sign language sentences, and a test set GT100 is constructed.
The scheme of the present application is compared with two baselines. Scheme one: the corresponding sign language action is generated only from the text information. Scheme two: the corresponding sign language action is generated only from the action information. The scheme provided by the present application takes both text information and action information as input.
The effects achieved by each scheme are shown in Table 2. Experiments show that the output effect of the scheme provided by the present application is superior to that of the plain-text scheme and the pure-action scheme, and that when actions paired with the sign language text are available, inputting the action information into the model allows the model to concentrate on learning better transitions between actions.
TABLE 2 - DTW distance on each verification set (smaller is better)

Verification set | Scheme one (plain text) | Scheme two (pure action) | Scheme of the application (text + action)
Random1k         | 2.376                   | 0.860                    | 0.78
GT100            | 14.55                   | 9.35                     | 8.46
The method and model provided by the present application are described above, and the apparatus provided by the present application is described in detail below.
Referring to fig. 15, a schematic structural diagram of an action generating device provided in the present application, the action generating device being configured to execute the method steps in the first aspect or any implementation manner of the first aspect, the action generating device may include:
the text acquisition module 1501 is configured to acquire an input corpus, where the input corpus includes at least one word;
a word vector conversion module 1502, configured to obtain a word sequence according to the input corpus, where the word sequence includes a vector corresponding to at least one word;
a motion vector conversion module 1503, configured to obtain a motion sequence, where the motion sequence includes a vector corresponding to at least one motion, and the at least one motion includes a motion corresponding to at least one word;
a fusion module 1504, configured to fuse the word sequence and the action sequence to obtain a fusion sequence;
the output module 1505 is configured to output a parameter set by using the fusion sequence as an input of the motion generation network, where the parameter set includes motion parameters, the motion parameters are used for rendering to obtain a motion image, and the motion generation network is used for converting an input vector into a parameter related to a motion.
In a possible implementation manner, the parameter set further includes an expression parameter, and the expression parameter is used for rendering an image of an expression corresponding to the input corpus.
In one possible embodiment, the apparatus further comprises:
the expression vector conversion module 1506 is configured to obtain an expression sequence according to the input corpus, where the expression sequence includes a vector corresponding to at least one expression;
and the fusion module is specifically used for fusing the word sequence, the action sequence and the expression sequence to obtain a fusion sequence.
In one possible embodiment, the apparatus further comprises:
a training module 1507, configured to train the action generation network by using a training set before outputting a parameter set by using the fused sequence as an input of the action generation network, where the training set includes a plurality of sample pairs, and each sample pair includes a group of word sequences and at least one corresponding group of action parameters.
In a possible implementation, the word vector conversion module 1502 is specifically configured to:
converting the input linguistic data or word sequences according to a preset rule to obtain word sequences;
or, the input corpus or word sequence is used as the input of a word vector conversion network to output the word sequence, and the word vector conversion network is used for converting the corpus into the corresponding vector.
In a possible implementation manner, the motion vector conversion module 1503 is specifically configured to query a motion corresponding to an input corpus or word sequence from a preset motion database to obtain a motion sequence; or, the input corpus or word sequence is used as the input of the motion vector conversion network, and the motion sequence is output.
In a possible implementation manner, the text obtaining module 1501 is specifically configured to extract a text from at least one of text, voice, and image, and use the text as an input corpus or convert the text to obtain the input corpus.
Referring to fig. 16, a schematic structural diagram of another motion generating device provided in the present application is as follows.
The motion generating device may include a processor 1601 and a memory 1602. The processor 1601 and the memory 1602 are interconnected by a line. The memory 1602 has stored therein program instructions and data.
The memory 1602 stores program instructions and data corresponding to the steps of fig. 4-11.
The processor 1601 is configured to perform the method steps performed by the action generation apparatus according to any of the embodiments of fig. 4 to 11.
Optionally, the action generating means may further comprise a transceiver 1603 for receiving or transmitting data.
Also provided in embodiments of the present application is a computer-readable storage medium, which stores a program that, when executed on a computer, causes the computer to perform the steps of the method described in the embodiments shown in fig. 4-11.
Alternatively, the motion generating device shown in fig. 16 described above is a chip.
The embodiment of the present application further provides an action generating device, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps executed by the action generating device shown in any one of the foregoing fig. 4 to fig. 11.
Referring to fig. 17, the present application provides a schematic structural diagram of an exercise device for performing the method steps in the first aspect or any embodiment of the first aspect, where the exercise device may include:
the training module 1701 is configured to train a neural network by using a training set to obtain the trained neural network, where the training set includes a plurality of sample pairs, each sample pair includes a group of word sequences and at least one group of corresponding action parameters, the neural network is configured to fuse an input action sequence and a word sequence to obtain a fused sequence, and output a parameter set according to the fused sequence, the word sequence includes a vector corresponding to at least one word in an input corpus, the action sequence includes a vector corresponding to at least one action, the at least one action includes an action corresponding to at least one word, the parameter set includes action parameters, and the action parameters are used for rendering to obtain an action image.
In a possible implementation manner, the neural network is specifically configured to fuse a word sequence, an action sequence and an expression sequence to obtain a fused sequence, the expression sequence is obtained according to an input corpus, and the expression sequence includes a vector corresponding to at least one expression; the parameter set further comprises expression parameters, and the expression parameters are used for rendering to obtain images of expressions corresponding to the input corpus.
In one possible embodiment, the training module 1701 is further configured to pre-train the neural network by using the data set before training the neural network by using the training set to obtain a pre-trained neural network, where the data set may include a plurality of sample pairs, each sample pair may include a word or word vector and a corresponding motion parameter, and then, fine-tune the pre-trained neural network by using the training set to obtain the trained neural network.
It will be appreciated that the neural network may be used to perform the method steps corresponding to the preceding figures 4-11, or the neural network may be the neural network used for action generation as described in the preceding figure 12.
Referring to fig. 18, a schematic structural diagram of another training device provided in the present application is shown as follows.
The training apparatus may include a processor 1801 and a memory 1802. The processor 1801 and memory 1802 are interconnected by wiring. The memory 1802 has stored therein program instructions and data.
The memory 1802 stores therein program instructions and data for the steps performed by the training apparatus of fig. 16.
The processor 1801 is configured to perform the method steps performed by the training apparatus of fig. 16.
Optionally, the training apparatus may further comprise a transceiver 1803 for receiving or transmitting data.
Also provided in embodiments of the present application is a computer-readable storage medium, which stores a program that, when executed on a computer, causes the computer to perform the steps of the method described in the embodiments shown in fig. 4-11.
Optionally, the aforementioned training device shown in fig. 18 is a chip.
The embodiment of the present application further provides a training device, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps executed by the training device shown in any one of the foregoing fig. 4 to fig. 11.
The embodiment of the present application also provides a digital processing chip. The digital processing chip integrates a circuit and one or more interfaces for implementing the functions of the processor 1601, the processor 1801, or both. When integrated with a memory, the digital processing chip may perform the method steps of any one or more of the preceding embodiments. When the digital processing chip is not integrated with a memory, it can be connected to an external memory through the communication interface. The digital processing chip implements the actions performed by the action generating device in the above embodiments according to the program code stored in the external memory.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to execute the steps performed by the action generating apparatus in the method described in the foregoing embodiments shown in fig. 4-11 or fig. 13.
The action generating device or the training device provided by the embodiment of the application can be a chip, and the chip comprises: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer executable instructions stored by the storage unit to cause the chip in the server to perform the action generation method described in the embodiments shown in fig. 4-11 or fig. 13. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
Referring to fig. 19, fig. 19 is a schematic structural diagram of a chip according to an embodiment of the present disclosure, where the chip may be represented as a neural network processor NPU 190, and the NPU 190 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 1903, and the controller 1904 controls the arithmetic circuit 1903 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuitry 1903 includes multiple processing units (PEs) internally. In some implementations, the operational circuitry 1903 is a two-dimensional systolic array. The arithmetic circuit 1903 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 1903 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1902 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 1901 and performs a matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in an accumulator 1908.
The unified memory 1906 is used for storing input data and output data. The weight data is directly transferred to the weight memory 1902 through a direct memory access controller (DMAC) 1905. The input data is also carried into the unified memory 1906 via the DMAC.
A Bus Interface Unit (BIU) 1910 for interaction of the AXI bus with the DMAC and the instruction fetch memory (IFB) 1909.
The bus interface unit 1910 (BIU) is used for the instruction fetch memory 1909 to fetch instructions from an external memory, and is further used for the storage unit access controller 1905 to fetch the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1906, or transfer weight data to the weight memory 1902, or transfer input data to the input memory 1901.
The vector calculation unit 1907 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolutional/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector calculation unit 1907 can store the processed output vector to the unified memory 1906. For example, the vector calculation unit 1907 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 1903, such as linear interpolation of the feature planes extracted by the convolutional layers, and further such as a vector of accumulated values to generate the activation values. In some implementations, the vector calculation unit 1907 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the operational circuitry 1903, e.g., for use in subsequent layers in a neural network.
An instruction fetch buffer 1909 connected to the controller 1904, for storing instructions used by the controller 1904;
the unified memory 1906, the input memory 1901, the weight memory 1902, and the instruction fetch memory 1909 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The operation of each layer in the recurrent neural network may be performed by the operation circuit 1903 or the vector calculation unit 1907.
Wherein, any of the aforementioned processors may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the processes of the aforementioned methods of fig. 4-11 or fig. 13.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, the implementation of a software program is more preferable. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: the above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. An action generating method, comprising:
acquiring an input corpus, wherein the input corpus comprises at least one word;
acquiring a word sequence according to the input corpus, wherein the word sequence comprises a vector corresponding to the at least one word;
obtaining an action sequence, wherein the action sequence comprises a vector corresponding to at least one action, and the at least one action comprises an action corresponding to the at least one word;
fusing the word sequence and the action sequence to obtain a fused sequence;
and taking the fusion sequence as the input of an action generation network, and outputting a parameter set, wherein the parameter set comprises action parameters, the action parameters are used for rendering to obtain an action image, and the action generation network is used for converting the input vector into parameters related to the action.
2. The method according to claim 1, wherein the parameter set further includes an expression parameter, and the expression parameter is used for rendering an image of an expression corresponding to the input corpus.
3. The method of claim 2, further comprising:
obtaining an expression sequence according to the input corpus, wherein the expression sequence comprises at least one vector corresponding to an expression;
the fusing the word sequence and the action sequence to obtain a fused sequence, which comprises:
and fusing the word sequence, the action sequence and the expression sequence to obtain the fused sequence.
4. The method according to any of claims 1-3, wherein prior to said outputting a set of parameters using said fused sequence as an input to an action generating network, the method further comprises:
training the action generation network by using a training set, wherein the training set comprises a plurality of sample pairs, and each sample pair comprises a group of word sequences and at least one group of corresponding action parameters.
5. The method according to any of claims 1-4, wherein said obtaining a word sequence from said input corpus comprises:
converting the input corpus according to a preset rule to obtain the word sequence;
or, the input corpus is used as the input of a word vector conversion network, and the word sequence is output, and the word vector conversion network is used for converting the corpus into corresponding vectors.
6. The method according to any one of claims 1-5, wherein the obtaining a sequence of actions comprises:
inquiring the action corresponding to the input corpus or the word sequence from a preset action database to obtain the action sequence;
or, the input corpus or the word sequence is used as the input of a motion vector conversion network, and the motion sequence is output.
7. The method according to any one of claims 1-6, wherein said obtaining input corpus comprises:
extracting text from at least one of text, voice or image data;
and taking the text as the input corpus, or converting the text to obtain the input corpus.
8. An action generating device, comprising:
the system comprises a text acquisition module, a text processing module and a text processing module, wherein the text acquisition module is used for acquiring an input corpus, and the input corpus comprises at least one word;
the word vector conversion module is used for acquiring a word sequence according to the input corpus, wherein the word sequence comprises a vector corresponding to the at least one word;
the action vector conversion module is used for acquiring an action sequence, wherein the action sequence comprises a vector corresponding to at least one action, and the at least one action comprises an action corresponding to the at least one word;
the fusion module is used for fusing the word sequence and the action sequence to obtain a fusion sequence;
and the output module is used for taking the fusion sequence as the input of an action generation network and outputting a parameter set, wherein the parameter set comprises action parameters, the action parameters are used for rendering to obtain an action image, and the action generation network is used for converting the input vector into parameters related to the action.
9. The apparatus according to claim 8, wherein the parameter set further includes an expression parameter, and the expression parameter is used to render an image of an expression corresponding to the input corpus.
10. The apparatus of claim 9, further comprising:
the expression vector conversion module is used for acquiring an expression sequence according to the input corpus, wherein the expression sequence comprises at least one vector corresponding to an expression;
the fusion module is specifically configured to fuse the word sequence, the action sequence, and the expression sequence to obtain the fusion sequence.
11. The apparatus according to any one of claims 8-10, further comprising:
and the training module is used for training the action generation network by using a training set before the fused sequence is used as the input of the action generation network and a parameter set is output, wherein the training set comprises a plurality of sample pairs, and each sample pair comprises a group of word sequences and at least one group of corresponding action parameters.
12. The apparatus according to any of claims 8-11, wherein the word vector conversion module is specifically configured to:
converting the input corpus according to a preset rule to obtain the word sequence;
or, the input corpus is used as the input of a word vector conversion network, and the word sequence is output, and the word vector conversion network is used for converting the corpus into corresponding vectors.
13. The apparatus according to any one of claims 8-12,
the action vector conversion module is specifically configured to query an action corresponding to the input corpus or the word sequence from a preset action database to obtain the action sequence; or, the input corpus or the word sequence is used as the input of a motion vector conversion network, and the motion sequence is output.
14. The apparatus according to any one of claims 8-13,
the text acquisition module is specifically used for extracting a text from at least one of text, voice or image data; and taking the text as the input corpus, or converting the text to obtain the input corpus.
15. A method of training, comprising:
training a neural network by using a training set to obtain the trained neural network, wherein the training set comprises a plurality of sample pairs, each sample pair comprises a group of word sequences and at least one group of corresponding action parameters, the neural network is used for fusing the input action sequences and the word sequences to obtain a fused sequence and outputting a parameter set according to the fused sequence, the word sequences comprise vectors corresponding to at least one word in an input corpus, the action sequences comprise vectors corresponding to at least one action, the at least one action comprises an action corresponding to the at least one word, the parameter set comprises action parameters, and the action parameters are used for rendering to obtain an action image.
16. The method of claim 15,
the neural network is specifically used for fusing the word sequence, the action sequence and the expression sequence to obtain a fused sequence, the expression sequence is obtained according to the input corpus, and the expression sequence comprises at least one vector corresponding to an expression;
the parameter set further comprises expression parameters, and the expression parameters are used for rendering to obtain images of expressions corresponding to the input corpus.
17. An action generating device, comprising a processor coupled to a memory, the memory storing a program, wherein the program instructions stored in the memory, when executed by the processor, implement the method of any one of claims 1 to 7.
18. A training device, comprising a processor coupled to a memory, the memory storing a program, wherein the program instructions stored in the memory, when executed by the processor, implement the method of any one of claims 15 to 16.
19. A computer readable storage medium comprising a program which, when executed by a processing unit, performs the method of any of claims 1 to 7 or 15 to 16.
20. A computer program product comprising a computer program, characterized in that the computer program realizes the method according to any one of claims 1 to 7 or 15 to 16 when executed by a processor.
CN202110925419.8A 2021-08-12 2021-08-12 Action generation method and device Pending CN113792537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110925419.8A CN113792537A (en) 2021-08-12 2021-08-12 Action generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110925419.8A CN113792537A (en) 2021-08-12 2021-08-12 Action generation method and device

Publications (1)

Publication Number Publication Date
CN113792537A true CN113792537A (en) 2021-12-14

Family

ID=79181669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110925419.8A Pending CN113792537A (en) 2021-08-12 2021-08-12 Action generation method and device

Country Status (1)

Country Link
CN (1) CN113792537A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993873A (en) * 2023-07-31 2023-11-03 支付宝(杭州)信息技术有限公司 Digital human action arrangement method and device
CN116993873B (en) * 2023-07-31 2024-05-17 支付宝(杭州)信息技术有限公司 Digital human action arrangement method and device

Similar Documents

Publication Publication Date Title
US20210233521A1 (en) Method for speech recognition based on language adaptivity and related apparatus
CN112668671B (en) Method and device for acquiring pre-training model
CN113240056B (en) Multi-mode data joint learning model training method and device
WO2020228376A1 (en) Text processing method and model training method and apparatus
CN110599557B (en) Image description generation method, model training method, device and storage medium
EP3940638A1 (en) Image region positioning method, model training method, and related apparatus
CN110490213B (en) Image recognition method, device and storage medium
CN112883149B (en) Natural language processing method and device
CN111816159B (en) Language identification method and related device
CN111951805A (en) Text data processing method and device
CN113205817B (en) Speech semantic recognition method, system, device and medium
JP2023535709A (en) Language expression model system, pre-training method, device, device and medium
CN112216307B (en) Speech emotion recognition method and device
WO2021238333A1 (en) Text processing network, neural network training method, and related device
EP4336490A1 (en) Voice processing method and related device
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN114676234A (en) Model training method and related equipment
CN113505883A (en) Neural network training method and device
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN116432019A (en) Data processing method and related equipment
CN115221846A (en) Data processing method and related equipment
JP2021081713A (en) Method, device, apparatus, and media for processing voice signal
CN115688937A (en) Model training method and device
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination