CN111010589B - Live broadcast method, device, equipment and storage medium based on artificial intelligence
- Publication number: CN111010589B (application CN201911319864.9A)
- Authority: CN (China)
- Prior art keywords: data packet, audio data, image, facial feature, feature data
- Legal status: Active
Classifications
- H04N21/233: Processing of audio elementary streams
- G06T3/04: Context-preserving transformations, e.g. by using an importance map
- H04L65/80: Responding to QoS
- H04N21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/235: Processing of additional data, e.g. scrambling of additional data or processing content descriptors
Abstract
The invention provides a live broadcast method, device, equipment and storage medium based on artificial intelligence. The method includes: receiving a given text for a virtual anchor to perform, acquiring audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, and forming at least one audio data packet and at least one facial feature data packet respectively; performing special effect rendering processing in real time based on the facial feature data in the facial feature data packet to obtain an image data packet carrying an image frame set corresponding to the virtual anchor; extracting the image frame set in the image data packet and the audio data in the audio data packet in real time; and pushing the live broadcast data stream of the virtual anchor according to the image frame set and the audio data.
Description
Technical Field
The present invention relates to artificial intelligence technology, and in particular, to a live broadcast method, apparatus, device, and storage medium based on artificial intelligence.
Background
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
With the development of communication technology, network communication bandwidth has greatly improved, and live video technology has matured and found application in many fields. Meanwhile, with the development of artificial intelligence technology, text-to-speech and image synthesis have become research hotspots. The combination of live video technology and artificial intelligence technology can be useful in many settings, for example replacing a real person in virtual news broadcasting or replacing a game anchor in virtual commentary, and has broad application prospects.
Disclosure of Invention
The embodiment of the invention provides a live broadcast method, a live broadcast device, live broadcast equipment and a storage medium based on artificial intelligence, which can effectively improve the real-time performance and the fluency of live broadcast.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a live broadcast method based on artificial intelligence, which comprises the following steps:
receiving a given text for a virtual anchor to perform, acquiring audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, and respectively forming at least one audio data packet and at least one facial feature data packet;
performing special effect rendering processing on the basis of the facial feature data in the facial feature data packet in real time to obtain an image data packet carrying an image frame set corresponding to the virtual anchor;
extracting the image frame set in the image data packet and the audio data in the audio data packet in real time;
and carrying out live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data.
The embodiment of the invention provides a live broadcast device based on artificial intelligence, which comprises:
the text-to-speech request module is used for receiving a given text for a virtual anchor to perform, acquiring audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, and forming at least one audio data packet and at least one facial feature data packet respectively;
the rendering module is used for performing special effect rendering processing on the basis of the facial feature data in the facial feature data packet in real time to obtain an image data packet carrying an image frame set corresponding to the virtual anchor;
the video plug-flow module is used for extracting the image frame set in the image data packet and the audio data in the audio data packet in real time; and carrying out live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data.
In the foregoing solution, the rendering module is further configured to:
when a first facial feature data packet in at least one facial feature data packet for the given text is formed, performing special effect rendering processing on the basis of facial feature data in the first facial feature data packet in real time to obtain a first image data packet carrying an image frame set corresponding to the virtual anchor;
the video stream pushing module is further configured to:
when a first audio data packet in at least one audio data packet for the given text is formed, extracting the audio data in the first audio data packet in real time to push the audio data;
when a first image data packet carrying an image frame set corresponding to the virtual anchor is formed, extracting the image frame set in the first image data packet in real time so as to push the image frame set.
In the foregoing solution, when the time taken to obtain the first image data packet from the given text is greater than the time taken to obtain the first audio data packet from the given text, the video stream pushing module is further configured to:
when the audio data in the first audio data packet is extracted, pushing the extracted audio data to a live broadcast client in real time, and continuing to push the audio data extracted from subsequent audio data packets to the live broadcast client in real time, until the first image data packet carrying the image frame set is obtained;
and extracting the image frame set in the first image data packet, and pushing the extracted image frame set to the live broadcast client in real time.
In the foregoing solution, when the time taken to obtain the first image data packet from the given text is less than the time taken to obtain the first audio data packet from the given text, the video stream pushing module is further configured to:
when the image frame set in the first image data packet is extracted, pushing the extracted image frame set to a live broadcast client in real time, and continuing to push the image frame sets extracted from subsequent image data packets to the live broadcast client in real time, until the first audio data packet is obtained;
and extracting the audio data in the first audio data packet, and pushing the extracted audio data to the live broadcast client in real time.
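To make the above start-up behavior concrete, the following is a minimal Python sketch of how the video stream pushing module could push each stream as its packets are extracted, without either timeline waiting for the other; the queue layout and the push_to_client callback are assumptions introduced for illustration, not the patent's implementation.

```python
import queue
import threading

def run_stream_pushing(audio_q: queue.Queue, image_q: queue.Queue, push_to_client):
    """Push audio packets and image packets independently as they arrive.
    Whichever first packet is formed earlier is pushed earlier; the later
    stream simply joins in once its first packet is extracted, which covers
    both cases described above."""
    def pump(q, kind):
        while (packet := q.get()) is not None:   # None marks end of this text
            push_to_client(kind, packet)          # pushed in real time

    workers = [threading.Thread(target=pump, args=(audio_q, "audio")),
               threading.Thread(target=pump, args=(image_q, "image"))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

# Usage: audio packets are available first, image packets arrive later.
aq, iq = queue.Queue(), queue.Queue()
for i in range(3):
    aq.put(f"audio-{i}")
aq.put(None)
iq.put("frames-0")
iq.put(None)
run_stream_pushing(aq, iq, lambda kind, p: print("push", kind, p))
```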
In the foregoing solution, the text-to-speech request module is further configured to:
when the given text is received, dividing the given text into at least two speech segments, and converting the speech segments into the corresponding word vectors in real time;
carrying out encoding processing and decoding processing on the word vector to obtain audio features corresponding to the word vector;
and synthesizing the audio features to obtain audio data of each speech segment respectively corresponding to the virtual anchor.
In the foregoing solution, the text-to-speech request module is further configured to:
predicting the mouth key points of the virtual anchor corresponding to each speech segment, and performing normalization processing on the mouth key points so that the mouth key points are adapted to a standard face template;
performing dimensionality reduction processing on the normalized mouth key points to obtain mouth shape feature data corresponding to each speech segment of the virtual anchor;
performing semantic analysis on the speech segments to obtain the semantics represented by the speech segments;
determining facial expression feature data matched with the semantics according to the semantics represented by the speech segments, and combining the mouth shape feature data and the facial expression feature data to form the facial feature data corresponding to each speech segment of the virtual anchor;
and forming the audio data packet and the facial feature data packet respectively based on the facial feature data and the audio data corresponding to the same speech segment.
In the foregoing solution, the text-to-speech request module is further configured to:
prior to normalizing the mouth key points,
sending a query transaction to a blockchain network, wherein the query transaction indicates a smart contract for querying a ledger in the blockchain network, so that a consensus node in the blockchain network queries the ledger by executing the smart contract to obtain the standard face template stored in the ledger; or
according to the identification of the standard face template, querying the standard face template corresponding to the identification from a standard face template database, and determining the hash value of the queried standard face template;
and querying the hash value corresponding to the identification from the blockchain network, and when the queried hash value is consistent with the determined hash value, determining that the queried standard face template has not been tampered with.
In the foregoing solution, the text-to-speech request module is further configured to:
before forming the at least one audio data packet and the at least one facial feature data packet respectively,
sending the given text to a blockchain network, so that a consensus node in the blockchain network performs a compliance check on the given text by executing a smart contract;
and when compliance confirmations returned by more than a preset number of consensus nodes are received, determining that the given text passes the compliance check.
In the foregoing solution, the text-to-speech request module is further configured to:
before the audio data packet and the facial feature data packet matching each other are formed separately,
determining a phoneme order characterized by the audio data;
and determining the sequence of the face feature data carrying the mouth feature data based on the phoneme sequence so as to enable the mouth shape change represented by the sequence of the face feature data to be matched with the phoneme sequence.
In the foregoing solution, the rendering module is further configured to:
determining the product of the playing duration of the audio data and a frame rate parameter as the number of image frames corresponding to the audio data;
extracting facial feature data in the facial feature data packet in real time, and performing special effect rendering processing based on the facial feature data to obtain a facial image corresponding to the facial feature data;
synthesizing the face image and the background image to obtain an image frame corresponding to the virtual anchor, and combining a plurality of synthesized image frames into an image frame set corresponding to the audio data;
wherein the number of image frames in the image frame set equals the determined number of image frames;
and packaging the image frame set to obtain an image data packet carrying the image frame set corresponding to the virtual anchor.
In the foregoing solution, the rendering module is further configured to:
extracting the same number of feature vectors as the number of the image frames from the facial feature data;
and carrying out special effect rendering processing based on the extracted feature vectors to obtain the face images with the same number as the image frames.
In the foregoing solution, the rendering module is further configured to:
analyzing the facial feature data packet to obtain facial feature data contained in the facial feature data packet, and determining a language segment corresponding to the facial feature data;
extracting at least one group of background images corresponding to the semantics based on the semantics represented by the language segments;
each group of background images comprises a model, an action and a live broadcast scene of the virtual anchor;
and synthesizing the facial image and the at least one group of background images to obtain an image frame of the virtual anchor, wherein the virtual anchor is positioned in the live broadcast scene and has the corresponding facial expression and mouth shape in the action and the facial image, and the number of groups of the action is the same as that of the background images.
In the foregoing solution, the rendering module is further configured to:
sending a query transaction to a blockchain network, wherein the query transaction indicates a smart contract for querying a ledger in the blockchain network and query parameters corresponding to the semantics, so that a consensus node in the blockchain network queries the ledger by executing the smart contract to obtain a background image in the ledger that matches the query parameters; or
querying a background image corresponding to the semantics from a background image database according to the semantics, and determining the hash value of the queried background image;
and querying the hash value corresponding to the semantics from the blockchain network, and when the queried hash value is consistent with the determined hash value, determining that the queried background image has not been tampered with.
In the above solution, the apparatus further comprises:
a streaming media service module, configured to:
responding to a live broadcast request sent by a client, distributing a streaming media playing address for the live broadcast request, and returning the streaming media playing address to the client;
the video plug-flow module is further configured to:
and sequentially pushing the image frame set and the audio data to a streaming media interface corresponding to the streaming media playing address in real time.
The embodiment of the invention provides a live broadcast method based on artificial intelligence, which comprises the following steps:
receiving a live broadcast selection instruction of a virtual anchor;
acquiring an image frame set and audio data corresponding to the virtual anchor according to the live broadcast selection instruction;
wherein the set of image frames and the audio data correspond to a given text of the virtual anchor;
and carrying out synchronous processing on the acquired image frame set and the audio data, and presenting the image frame of the virtual anchor and the corresponding audio data in real time.
The embodiment of the invention provides a live broadcast device based on artificial intelligence, which comprises:
the instruction receiving module is used for receiving a live broadcast selection instruction of the virtual anchor;
the data acquisition module is used for acquiring an image frame set and audio data corresponding to the virtual anchor according to the live broadcast selection instruction;
wherein the set of image frames and the audio data correspond to a given text of the virtual anchor;
and the presentation module is used for carrying out synchronous processing on the acquired image frame set and the audio data and presenting the image frame of the virtual anchor and the corresponding audio data in real time.
The embodiment of the invention provides live broadcast equipment based on artificial intelligence, which comprises:
a memory for storing executable instructions;
and the processor is used for realizing the live broadcasting method based on artificial intelligence provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute so as to realize the live broadcast method based on artificial intelligence provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
according to the live broadcast method based on artificial intelligence provided by the embodiment of the invention, the real-time performance of virtual video live broadcast is enhanced and the fluency of live broadcast video is improved by means of a feature extraction mode, a special effect rendering mode and a plug flow parallel processing mode and a mode of separating audio data and facial feature data.
Drawings
Fig. 1 is an alternative structural diagram of an architecture of an artificial intelligence based live broadcast system 100 according to an embodiment of the present invention;
fig. 2 is an alternative structural diagram of an artificial intelligence based live broadcast device provided by an embodiment of the present invention;
figs. 3A-3D are schematic flow diagrams of an alternative artificial intelligence based live broadcast method provided by an embodiment of the invention;
fig. 4 is an alternative structural diagram of a live broadcast system architecture based on artificial intelligence provided by an embodiment of the present invention;
fig. 5 is an alternative structural diagram of an artificial intelligence-based live broadcast system architecture provided by an embodiment of the present invention;
fig. 6 is an implementation architecture diagram of a live broadcast method based on artificial intelligence provided in an embodiment of the present invention;
figs. 7A-7B are block diagrams of an overall framework for a virtual video live service provided by an embodiment of the invention;
fig. 8 is an implementation framework of a virtual video push streaming service provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments of the present invention are explained; the following explanations apply to these terms and expressions as used herein.
1) Text to Speech (TTS): text is intelligently converted into a natural speech stream through artificial intelligence (e.g., neural networks).
2) Virtual live video broadcast: live video synthesized with virtual characters, in which a virtual character plays the role of the virtual anchor and broadcasts in scenes such as news and games.
3) Transactions: equivalent to the computer term "transaction"; they include operations that need to be committed to a blockchain network for execution, and do not refer solely to transactions in a commercial context. Embodiments of the present invention follow this convention given the terminology colloquially used in blockchain technology.
For example, a Deploy transaction is used to install a specified smart contract on a node in a blockchain network, ready to be invoked; an Invoke transaction is used to append a record of the transaction to the blockchain by invoking the smart contract, and to perform operations on the state database of the blockchain, including update operations (adding, deleting, and modifying key-value pairs in the state database) and query operations (querying key-value pairs in the state database).
4) A Blockchain is a storage structure of encrypted, chained transactions formed from blocks.
5) A Blockchain Network is a set of nodes that incorporates new blocks into a blockchain through consensus.
6) Ledger: a general term for the blockchain (also called ledger data) and the state database synchronized with the blockchain. The blockchain records transactions in the form of files in a file system; the state database records the transactions in the blockchain in the form of different types of key-value pairs, supporting fast query of the transactions in the blockchain.
7) Smart Contracts, also known as chaincodes or application codes, are programs deployed in the nodes of a blockchain network; the nodes execute the smart contracts invoked in received transactions to update or query the key-value data of the state database.
8) Consensus: a process in a blockchain network used to reach agreement on the transactions in a block among the multiple nodes involved; the agreed block is appended to the end of the blockchain. Mechanisms for achieving consensus include Proof of Work (PoW), Proof of Stake (PoS), Delegated Proof of Stake (DPoS), Proof of Elapsed Time (PoET), and so on.
9) Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
With the development of communication technology, network communication bandwidth has greatly improved, and live video technology has matured and found application in many fields. Meanwhile, with the development of artificial intelligence technology, text-to-speech and image synthesis have become research hotspots and are developing rapidly. The combination of live video technology and artificial intelligence technology can be useful in many settings, for example replacing a real person in virtual news broadcasting or replacing a game anchor in virtual commentary, and has broad application prospects. In virtual live video technology, generating audio and images consumes a large amount of computing time, so realizing real-time stream pushing of the virtual video, in order to guarantee the real-time performance of the virtual live broadcast, becomes an important factor affecting the final live video quality. The live broadcast methods in the related art mostly target application scenes with an existing stable audio and image input (such as local video stream pushing) or scenes where audio and image data can be acquired rapidly (such as data acquired by a camera), and cannot be applied well to virtual live video; in practical application, the technical schemes in the related art cannot be applied in scenes requiring large computing power. Therefore, the embodiments of the present invention provide a live broadcast method based on artificial intelligence in which image frame data and audio data reach the video stream pushing module as uniformly as possible through parallel processing and data separation. A video can be understood as consisting of two timelines, an audio timeline and an image timeline, and the live video can be played after the client synchronizes the data on the two timelines; the video stream pushing module can therefore receive audio data packets obtained by the text-to-speech request module while the rendering module is still rendering images, which reduces the mutual waiting between the two timelines. This solves the problem of real-time stream pushing of virtual video on the premise that acquiring the audio and video data requires large computing power, effectively avoids the interference of an unstable data source with real-time stream pushing, enhances the real-time performance of virtual live video, and improves the fluency of the live video.
The embodiments of the present invention provide a live broadcast method, device, equipment and storage medium based on artificial intelligence, which can well solve the problem of unstable data sources and effectively improve the real-time performance of virtual video stream pushing. In the following, an exemplary application is described in which the device is implemented as a server.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an artificial intelligence based live broadcast system 100 provided by an embodiment of the present invention, in which a terminal 400 (illustratively, a terminal 400-1 and a terminal 400-2) is connected to a server 200 through a network 300; the network 300 may be a wide area network, a local area network, or a combination of the two. The terminal 400-1 and the terminal 400-2 run a live client 410-1 and a live client 410-2 respectively, where each live client can both provide live content and present live video. The server 200 comprises a text-to-speech request module 2551, a rendering module 2552 and a video stream pushing module 2553. The text-to-speech request module 2551 obtains the corresponding facial feature data and audio data in a streaming manner based on the given texts returned by the terminals 400-1 and 400-2, separates them into two kinds of packets, pushes each facial feature data packet to the rendering module 2552, and pushes each audio data packet to the video stream pushing module 2553. The rendering module 2552 obtains an expression image of the virtual character by three-dimensionally rendering the facial feature data obtained each time and pushes it to the video stream pushing module 2553. The video stream pushing module 2553 asynchronously receives the audio data packets and the expression image data packets, performs video synthesis, and pushes the video to the clients. The text-to-speech request module 2551, the rendering module 2552 and the video stream pushing module 2553 are mutually independent and cooperate in parallel, and the audio data and the facial feature data are separated and shunted independently; since the facial feature data may include mouth shape feature data, the audio data and the mouth shape feature data are likewise separated and shunted independently, so that text data can be automatically synthesized into video in real time and pushed to the live clients 410-1 and 410-2. In the process of three-dimensionally rendering the facial feature data obtained each time, the rendering module 2552 can call image resources in the background image database 500 of the artificial intelligence based live broadcast system 100 to realize richer live video presentation.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an artificial intelligence-based live broadcast server 200 according to an embodiment of the present invention, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220, and a user interface 230. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 250 described in embodiments of the invention is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 253 to enable presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present invention may be implemented in software, and fig. 2 shows an artificial intelligence based live device 255 stored in the memory 250, which may be software in the form of programs, plug-ins, and the like, comprising the following software modules: a text-to-speech request module 2551, a rendering module 2552, a video stream pushing module 2553 and a streaming media service module 2554. These modules are logical, and thus can be arbitrarily combined or further split according to the functions implemented, as described below.
In other embodiments, the artificial intelligence based live broadcasting Device 255 provided by the embodiments of the present invention may be implemented in hardware, for example, the artificial intelligence based live broadcasting Device 255 provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the artificial intelligence based live broadcasting method provided by the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The artificial intelligence based live broadcasting method provided by the embodiment of the invention will be described in conjunction with the exemplary application and implementation of the terminal provided by the embodiment of the invention.
Referring to fig. 3A, fig. 3A is an optional flowchart of a live broadcasting method based on artificial intelligence according to an embodiment of the present invention, which will be described with reference to steps 101 to 105 shown in fig. 3A.
In step 101, the server receives, from the client, a given text for the virtual anchor to perform.
In some embodiments, the client sends a given text for the virtual anchor to perform to the server. The given text may be input by a user or crawled from the network; it may come from a live content provider or from a user watching the live broadcast, and the virtual anchor can perform in real time based on text input through the client by the watching user. In addition, during the live broadcast, the watching user can interact with the virtual anchor: after the user inputs an interactive text during the virtual anchor's live performance, the client can query a corresponding response script according to the interactive text to generate a given text corresponding to the response script, or send the interactive text to the server, which queries the corresponding response script according to the interactive text to generate a given text corresponding to the response script. The virtual anchor then performs live according to the given text corresponding to the response script, thereby realizing interaction between the watching user and the virtual anchor.
In step 102, the server obtains audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, forms at least one audio data packet and at least one facial feature data packet respectively, and continues to process the next given text in real time.
In some embodiments, in step 102 the server acquires the audio data and facial feature data of the corresponding virtual anchor in real time according to the given text, forms at least one audio data packet and at least one facial feature data packet respectively, and continues to process the next given text in real time. The text-to-speech request module may request a text-to-speech service to obtain the audio data packets and facial feature data packets, the text-to-speech service being configured to acquire the audio data and facial feature data of the corresponding virtual anchor in real time according to the given text, form at least one audio data packet and at least one facial feature data packet respectively, and continue to process the next given text in real time; alternatively, the text-to-speech request module may directly perform this processing itself. Here, processing the next given text in real time means that the text-to-speech request module does not wait for the entire stream pushing process to complete: as soon as one given text has been processed, processing of the next given text starts. The next given text may come from the same client or from a different client; different given texts may be for the same live broadcast or for different live broadcasts, and the correspondingly formed audio data packets and facial feature data packets may carry live broadcast identifiers, with different identifiers corresponding to different live broadcasts. Furthermore, based on the processing capability configured by the processor, the text-to-speech request module may have a corresponding parallel processing capability, that is, it may process multiple given texts simultaneously, which may be given texts from different clients or for different live broadcasts. During processing, the given text can be divided into speech segments, and audio data packets and facial feature data packets corresponding to the respective speech segments are generated after the division; after each speech segment is processed to generate one audio data packet and one facial feature data packet, the next speech segment is processed to generate the next audio data packet and facial feature data packet, or multiple audio data packets and facial feature data packets are generated for multiple speech segments simultaneously based on the parallel processing capability configured by the processor.
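The no-mutual-waiting behavior described above can be sketched as three pipeline stages connected by queues. This is a toy illustration only: the whitespace segment split, the packet contents, and running the stages as threads are assumptions; a real system would run them as cooperating services.

```python
import queue
import threading

text_q = queue.Queue()
feature_q = queue.Queue()
av_q = queue.Queue()

def tts_request_stage():
    # Text-to-speech request module: as soon as one given text is packetized,
    # it takes the next one; it never waits for rendering or pushing.
    while (text := text_q.get()) is not None:
        for segment in text.split():                  # toy "speech segment" split
            av_q.put(("audio", f"audio[{segment}]"))  # audio data packet
            feature_q.put(f"features[{segment}]")     # facial feature data packet
    feature_q.put(None)

def rendering_stage():
    # Rendering module: renders each facial feature data packet on arrival.
    while (features := feature_q.get()) is not None:
        av_q.put(("image", f"frames<{features}>"))    # image data packet
    av_q.put(None)

def stream_pushing_stage():
    # Video stream pushing module: pushes packets as they become available.
    while (packet := av_q.get()) is not None:
        print("push", packet)

stages = [threading.Thread(target=f) for f in
          (tts_request_stage, rendering_stage, stream_pushing_stage)]
for s in stages:
    s.start()
text_q.put("hello everyone")
text_q.put(None)
for s in stages:
    s.join()
```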
Referring to fig. 3B, fig. 3B is an optional flowchart of the artificial intelligence based live broadcasting method according to the embodiment of the present invention; the server's acquisition of the audio data of the corresponding virtual anchor in real time according to the given text in step 102 may be specifically implemented through steps 1021 to 1023.
In step 1021, when receiving the given text, the server divides the given text into at least two speech segments, and converts the speech segments into the corresponding word vectors in real time.
In step 1022, the server performs encoding processing and decoding processing on the word vector to obtain the audio feature of the corresponding word vector.
In step 1023, the server synthesizes the audio features to obtain audio data corresponding to each speech segment of the virtual anchor.
In some embodiments, an end-to-end text-to-speech model can be constructed and trained directly by deep learning; after model training is completed, the model can generate the corresponding audio for a given text. The text-to-speech request module in the server can divide the given text into at least two speech segments and convert the speech segments into the corresponding word vectors in real time. For the text-to-speech model, the text must be converted into corresponding word vectors. Here, the given text may be a sentence: the given text is segmented into words, and the word vector corresponding to each word is determined by querying a dictionary, with the dictionary subscript serving as the identifier of each word in the dictionary, so that each segmented word is converted into its corresponding word vector by traversing the dictionary. The word vectors are encoded to obtain an intermediate semantic representation and then decoded to obtain the audio features corresponding to the word vectors, and the resulting audio features are synthesized by an audio synthesis algorithm to obtain the audio data corresponding to each speech segment of the virtual anchor.
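A minimal sketch of this segment-to-audio path follows: dictionary-lookup word vectors, an encode/decode step to audio features, and a synthesis step. The random projections and the tiny dictionary stand in for the trained end-to-end model and are assumptions made purely to show the shapes involved.

```python
import numpy as np

DICTIONARY = ["<unk>", "hello", "everyone", "i", "am", "anchor"]

def to_word_vectors(speech_segment: str) -> np.ndarray:
    # Word segmentation, then dictionary lookup: the dictionary subscript is
    # the identifier of each word, here mapped to a one-hot word vector.
    ids = [DICTIONARY.index(w) if w in DICTIONARY else 0
           for w in speech_segment.lower().split()]
    return np.eye(len(DICTIONARY))[ids]

def encode(vectors: np.ndarray) -> np.ndarray:
    # Stand-in encoder: projects word vectors to an intermediate semantic
    # representation (a real system would use a trained network here).
    rng = np.random.default_rng(0)
    return vectors @ rng.standard_normal((vectors.shape[1], 8))

def decode(hidden: np.ndarray) -> np.ndarray:
    # Stand-in decoder: maps the intermediate representation to audio
    # features (e.g., spectral frames).
    rng = np.random.default_rng(1)
    return hidden @ rng.standard_normal((hidden.shape[1], 80))

def synthesize(features: np.ndarray) -> np.ndarray:
    # Stand-in synthesis step: collapses audio features to a waveform-like signal.
    return np.tanh(features).ravel()

audio = synthesize(decode(encode(to_word_vectors("hello everyone"))))
print(audio.shape)
```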
In some embodiments, the server's acquisition of the facial feature data corresponding to the virtual anchor in real time according to the given text in step 102 can be implemented by the following technical scheme: the mouth key points of the virtual anchor corresponding to each speech segment are predicted, and normalization processing is performed on the mouth key points so that they are adapted to the standard face template; dimensionality reduction processing is performed on the normalized mouth key points to obtain the mouth shape feature data corresponding to each speech segment of the virtual anchor; semantic analysis is performed on the speech segments to obtain the semantics represented by the speech segments; and facial expression feature data matched with the semantics is determined according to the semantics represented by the speech segments, with the mouth shape feature data and the facial expression feature data combined to form the facial feature data corresponding to each speech segment of the virtual anchor. After this technical scheme is executed, the server forms mutually matched audio data packets and facial feature data packets based on the facial feature data and the audio data corresponding to the same speech segment.
In some embodiments, the mouth shape expression is predicted from each speech segment of the given text, with spectral features used to represent the audio. To compute the mouth shape representation, mouth key points need to be extracted from the face and normalized so that they are unaffected by image size, face position, face rotation, and face size. Normalization is important in this process because it makes the generated mouth key points compatible with any image, so that the normalized mouth key points can be adapted to a standard face template. Here, the standard face template is that of a virtual anchor, and different virtual anchor characters correspond to different standard face templates: the virtual anchor can be an animal and the standard face template an animal avatar, or the virtual anchor can be a cartoon character and the standard face template a cartoon character avatar. Dimensionality reduction is performed on the normalized mouth key points to decorrelate the mouth key point features, and the most important components are used as the mouth shape representation. As for the facial feature data, it includes facial expression feature data in addition to the mouth shape feature data; the mouth shape feature data is related to the audio data, while the facial expression feature data is related to the speech segments of the given text. Semantic analysis is performed on the speech segments of the given text, and for the semantics represented by each speech segment, the virtual anchor presents facial expression feature data adapted to the semantics; the mouth shape feature data and the facial expression feature data are combined to form the facial feature data of the virtual anchor.
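A sketch of the normalization and dimensionality-reduction steps follows, assuming a 20-point mouth template; rotation alignment (e.g., a full Procrustes fit) is omitted for brevity, and the PCA-style reduction stands in for the decorrelation described above.

```python
import numpy as np

def normalize_keypoints(mouth_pts: np.ndarray, template_pts: np.ndarray) -> np.ndarray:
    """Remove translation and scale so the mouth key points are unaffected
    by image size, face position, and face size, and fit the standard face
    template (rotation alignment omitted in this sketch)."""
    centered = mouth_pts - mouth_pts.mean(axis=0)
    scale = np.linalg.norm(centered) or 1.0
    template_scale = np.linalg.norm(template_pts - template_pts.mean(axis=0))
    return centered / scale * template_scale + template_pts.mean(axis=0)

def reduce_dimensionality(keypoint_seq: np.ndarray, k: int = 4) -> np.ndarray:
    """PCA-style reduction: decorrelate the key point features and keep the
    k most important components as the mouth shape feature data."""
    flat = keypoint_seq.reshape(len(keypoint_seq), -1)
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return flat @ vt[:k].T

rng = np.random.default_rng(0)
template = rng.random((20, 2))        # assumed 20-point standard mouth template
frames = rng.random((10, 20, 2))      # 10 predicted key point frames
normalized = np.stack([normalize_keypoints(f, template) for f in frames])
print(reduce_dimensionality(normalized).shape)   # (10, 4)
```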
In some embodiments, the given text is divided into at least two speech segments. For example, the given text "Hello everyone, I am anchor Little A" may be divided into two speech segments, "Hello everyone" and "I am anchor Little A", or into three speech segments, "Hello everyone", "I am", and "anchor Little A". Corresponding to the different speech segments, the audio data of each segment is generated based on the segments, namely the audio data of "Hello everyone" and the audio data of "I am anchor Little A"; likewise, the facial feature data corresponding to "Hello everyone" and the facial feature data corresponding to "I am anchor Little A" are obtained based on the divided speech segments, where the facial feature data of "Hello everyone" includes its corresponding mouth shape feature data and facial expression feature data, and the facial feature data of "I am anchor Little A" includes its corresponding mouth shape feature data and facial expression feature data.
In some embodiments, before normalization processing is performed on the mouth key points, a standard face template may be obtained by the following technical scheme: a query transaction is sent to the blockchain network, where the query transaction indicates a smart contract for querying the ledger in the blockchain network, so that a consensus node in the blockchain network queries the ledger by executing the smart contract and obtains the standard face template stored in the ledger; or, according to the identification of the standard face template, the standard face template corresponding to the identification is queried from the standard face template database and the hash value of the queried standard face template is determined, the hash value corresponding to the identification is queried from the blockchain network, and when the queried hash value is consistent with the determined hash value, it is determined that the queried standard face template has not been tampered with.
In some embodiments, any user may submit a standard face template to the blockchain network or the standard face template database through a live client. Obtaining a standard face template from the blockchain network is implemented by a query transaction: the query transaction queries the ledger in a consensus node by executing the corresponding smart contract, thereby obtaining the standard face template stored in the ledger. Referring to fig. 4, fig. 4 is an alternative architecture diagram of the artificial intelligence based live broadcast system 100 provided by an embodiment of the present invention: the terminal 400 uploads the standard face template to the blockchain network, and after consensus is reached through the smart contract executed by the consensus nodes, the standard face template is recorded in the ledger of each accounting node, from which the server 200 may obtain it. The terminal 400 stores the standard face template on the chain, and the server 200 queries and updates it; both querying and updating are performed in the form of transactions that carry the invocation information of the smart contract realizing the data operation. The smart contract is invoked and executed by a consensus node, and the result of its execution takes effect only after obtaining the consensus of the other nodes, thereby ensuring the reliability and consistency of the data. The terminal 400 and the server 200 register with a certificate authority to acquire a certificate (including the public key of the business entity and the digital signature issued by the certificate authority for the public key and the identity information of the business entity); the certificate is attached to a transaction together with the digital signature of the terminal 400 or the server 200 for that transaction and sent to the blockchain network, so that the blockchain network can take the digital certificate and signature out of the transaction, verify the authenticity of the message (that it has not been tampered with) and the identity information of the business entity sending it, and verify according to that identity whether, for example, the sender has the right to initiate the transaction. The consensus function and the sequencing function are functions realized by consensus nodes. The blockchain and the state database are two ways of maintaining the ledger: the transactions uploading standard face templates and querying standard face templates have complete records in the blockchain, while only partial data is recorded in the state database.
In some embodiments, a standard face template may also be queried from the standard face template database according to the identification of the standard face template, where identifications correspond one-to-one to standard face templates; the standard face template is queried according to its unique identification, and the hash value of the queried standard face template is determined so that hash verification can be performed on it. When the queried hash value is consistent with the hash value determined from the blockchain network, the queried standard face template has not been tampered with, and the mouth key points are normalized based on that standard face template.
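A sketch of this tamper check follows, assuming the blockchain record stores a SHA-256 digest under the template identification; the in-memory dictionaries stand in for the ledger and the standard face template database.

```python
import hashlib

on_chain_hashes = {}   # identification -> hash value recorded on the blockchain
template_db = {}       # identification -> standard face template bytes

def register(identification: str, template: bytes) -> None:
    template_db[identification] = template
    on_chain_hashes[identification] = hashlib.sha256(template).hexdigest()

def fetch_verified(identification: str) -> bytes:
    """Query the template from the database, determine its hash value, and
    compare it with the hash value queried from the blockchain network."""
    template = template_db[identification]
    if hashlib.sha256(template).hexdigest() != on_chain_hashes[identification]:
        raise ValueError("standard face template has been tampered with")
    return template   # consistent hashes: not tampered

register("anchor-A", b"face-template-bytes")
assert fetch_verified("anchor-A") == b"face-template-bytes"
```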
In some embodiments, some standard face templates uploaded by users may contain elements, such as violence, that do not conform to public order and good customs. Therefore, a consensus node invokes the smart contract corresponding to the compliance check to perform a compliance check on the standard face templates. The smart contract may perform the compliance check using an artificial intelligence model: the model is trained according to the compliance requirements, and by executing the smart contract the model identifies the input standard face templates and flags those that do not comply; non-compliant standard face templates are not stored on the blockchain network.
In some embodiments, before forming the at least one audio data packet and the at least one facial feature data packet, the following technical scheme is also performed: the given text is sent to the blockchain network so that a consensus node in the blockchain network performs a compliance check on the given text by executing a smart contract; when compliance confirmations returned by more than a preset number of consensus nodes are received, it is determined that the given text passes the compliance check.
In some embodiments, in addition to the compliance check on the standard face template, a compliance check may be performed on the given text. The given text may be sent to the blockchain network for the compliance check before the at least one audio data packet and the at least one facial feature data packet are formed, and is consensus-verified by the consensus nodes based on the compliance rules. When confirmations returned by more than a threshold number of consensus nodes are received, the given text is determined to have passed the compliance check, and subsequent processing is performed. In order to avoid delay caused by the compliance check, the subsequent processing may also begin upon receipt of a confirmation from a single consensus node; when a sufficient number of confirmations have not been returned by the consensus nodes within the subsequent waiting time, the processing is terminated and the cause is returned. Because the compliance check on the given text may fail at different stages, the termination may occur when the at least one audio data packet and the at least one facial feature data packet are formed, when the special effect rendering processing is performed, or when the video stream is pushed, and the client is then required to resubmit the given text.
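A minimal sketch of the confirmation-counting logic, assuming node verdicts are delivered through a queue and that the timeout parameter models the waiting time (both assumptions of the sketch):

```python
import queue

def passes_compliance_check(confirmations: queue.Queue,
                            total_nodes: int,
                            threshold: int,
                            timeout: float) -> bool:
    """True once more than `threshold` nodes have confirmed compliance."""
    approvals = 0
    for _ in range(total_nodes):
        try:
            approved = confirmations.get(timeout=timeout)  # one node's verdict
        except queue.Empty:
            break                          # waiting time exhausted
        if approved:
            approvals += 1
        if approvals > threshold:
            return True
    return False                           # too few confirmations: terminate
```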
In some embodiments, before the mutually matched audio data packet and facial feature data packet are respectively formed, the order of the facial feature data in the facial feature data packet may be determined: the order of the facial feature data carrying the mouth shape feature data is determined based on the phoneme order, so that the mouth shape changes represented by the order of the facial feature data match the phoneme order.
In some embodiments, the facial feature data in the facial feature data packet have no inherent order attribute, but the image frames are presented sequentially, and the image frames are obtained based on the facial feature data, so the facial feature data must be ordered. From the perspective of the mouth shape, the mouth shape changes and the audio output should be consistent; otherwise the mouth shape and the audio would not match. Therefore, the phoneme order in the audio data is determined, and the order of the mouth shape feature data is determined based on the phoneme order. Because the mouth shape feature data, the facial feature data, and the image frames correspond to one another, the order of the facial feature data including the mouth shape feature data can be determined, and each facial feature datum is numbered sequentially, so that the mouth shape changes represented by the image frames obtained from the sequentially arranged facial feature data match the phoneme order. Thus, when the audio is output, the image frames are presented in the order of the facial feature data numbers; see the sketch below.
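A minimal sketch of this numbering, under the simplifying assumption that each phoneme maps to one facial feature datum (the data layout is illustrative only):

```python
def order_facial_features(phoneme_order, features_by_phoneme):
    """Number facial feature data so frames follow the audio's phoneme order.

    phoneme_order: phonemes in the order they occur in the audio data.
    features_by_phoneme: phoneme -> facial feature datum (a dict).
    """
    ordered = []
    for seq, phoneme in enumerate(phoneme_order):
        datum = dict(features_by_phoneme[phoneme])  # copy before annotating
        datum["seq"] = seq        # presentation order used when pushing frames
        ordered.append(datum)
    return ordered
```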
In step 103, the server performs special effect rendering processing in real time based on the facial feature data in the facial feature data packet to obtain an image data packet carrying an image frame set corresponding to the virtual anchor, and continues to process the next facial feature data packet in real time.
In some embodiments, step 103, in which the server performs special effect rendering processing in real time based on the facial feature data in the facial feature data packet to obtain an image data packet carrying the image frame set corresponding to the virtual anchor, and continues to process the next facial feature data packet in real time, is implemented by the rendering module. Processing the next facial feature data packet in real time means that the rendering module does not wait for the whole stream pushing process to complete before handling the next facial feature data packet; that is, as soon as one facial feature data packet has been processed, processing of the next one begins. The facial feature data packets may belong to the same live broadcast, or different received facial feature data packets may belong to different live broadcasts; the correspondingly formed image data packets may carry live broadcast identifiers, and different live broadcast identifiers correspond to different live broadcasts. Further, depending on the processing capability configured by the processor, the rendering module may also have corresponding parallel processing capability, that is, it may process a plurality of facial feature data packets simultaneously, which may be facial feature data packets for different live broadcasts. In addition, the parallelism of the rendering module may also take the form of flow processing inside the module: the process of generating a facial image and the process of synthesizing the facial image with the background image are divided between two sub-modules whose processing is independent and parallel. Each time one facial image is generated, the synthesizing sub-module synthesizes the generated facial image with the background image, and during the synthesis, the facial image generating sub-module starts to generate the next facial image, as sketched below.
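The intra-module flow processing can be sketched as follows; render_face and composite stand in for the two sub-modules and are assumptions of the sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def render_packet(feature_list, backgrounds, render_face, composite):
    """Generate faces and composite them with backgrounds in overlapped
    fashion: while one face is being composited with its background, the
    next face is already being generated."""
    frames, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as synth_pool:
        for features, background in zip(feature_list, backgrounds):
            face = render_face(features)         # generation sub-module (here)
            if pending is not None:
                frames.append(pending.result())  # collect previous composite
            pending = synth_pool.submit(composite, face, background)
        if pending is not None:
            frames.append(pending.result())
    return frames
```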
In some embodiments, in addition to the mouth shape changes influencing the ordering of the facial feature data, that is, the presentation order of the image frames generated in the rendering stage, the actions in the background images may also influence the ordering of the image frames in the special effect rendering stage. The background images belonging to the same facial feature data packet may represent the same group of actions, and the presentation of that group of actions is itself a dynamic process; the background images of the group of actions are therefore ordered according to the order of the facial images that include the mouth shape feature data, so that both the mouth shape changes and the presented actions conform to the output of the subsequent audio data, and the image frames and the audio data are matched and synchronized when the video is subsequently pushed.
Referring to fig. 3C, which builds on fig. 3A, fig. 3C is a schematic diagram of an optional process of the live broadcast method based on artificial intelligence according to the embodiment of the present invention. Step 103, in which the server performs special effect rendering processing in real time based on the facial feature data in the facial feature data packet to obtain an image data packet carrying the image frame set corresponding to the virtual anchor, may be specifically implemented by steps 1031-1034.
In step 1031, the server determines the product of the playing duration of the audio data and the frame rate parameter as the number of image frames corresponding to the audio data.
In step 1032, the server extracts the facial feature data in the facial feature data packet, and performs special effect rendering processing based on the facial feature data to obtain a facial image corresponding to the facial feature data.
In step 1033, the server synthesizes the facial image and the background image to obtain an image frame corresponding to the virtual anchor, and combines a plurality of synthesized image frames into the image frame set corresponding to the audio data, where the number of image frames in the image frame set equals the number determined in step 1031.
In step 1034, the server performs packet processing on the image frame set to obtain an image data packet carrying the image frame set corresponding to the virtual anchor.
In some embodiments, the rendering module in the server determines the product of the playing duration of the audio data and the frame rate parameter as the number of image frames corresponding to the audio data. The playing duration is determined by the text-to-speech request module and is sent to the rendering module along with the facial feature data packet; the rendering module derives the number of image frames corresponding to the audio data from the playing duration, extracts the facial feature data in the facial feature data packet, and performs special effect rendering processing based on the facial feature data to obtain the facial image corresponding to the facial feature data. The facial image and the background image are synthesized to obtain an image frame corresponding to the virtual anchor, and a plurality of synthesized image frames are combined into the image frame set corresponding to the audio data, where the number of image frames in the image frame set equals the determined number; the image frame set is then packaged to obtain the image data packet carrying the image frame set corresponding to the virtual anchor.
In some embodiments, the rendering module is responsible for generating the image frame set based on the facial feature data, and the number of image frames in the image frame set is related to the length of the corresponding audio data; for example, if the length of the audio data is 1 second and the frame rate is 25 frames per second, the number of image frames in the image frame set for the facial feature data packet corresponding to that audio data is 25. The generation of an image frame is mainly a special effect rendering processing based on the facial feature data, so that the obtained facial image can represent the corresponding facial feature data, where the facial feature data includes mouth shape feature data and facial expression feature data. For example, for the facial feature data packet corresponding to the speech segment "Hello everyone", the mouth shape on the obtained facial image is the mouth shape corresponding to "Hello everyone", and the expression on the obtained facial image is the expression corresponding to "Hello everyone". The mouth shape feature data can be applied to any standard face template, where the standard face template includes a real character face template, a cartoon character face template, an animal face template, or a graphic template; in a virtual live broadcast, even a cuboid that opens its mouth to speak can serve as the anchor. The facial image and the background image are synthesized to obtain the image frames of the virtual anchor: a plurality of facial images are generated in the special effect rendering processing based on the facial feature data, and the facial images and the background images are combined to form a plurality of image frames, which constitute the image frame set, where the number of facial images, the number of background images, and the number of image frames are consistent.
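Step 1031 then reduces to a one-line computation, as in this sketch using the figures from the example above:

```python
def frame_count(playing_duration_s: float, frame_rate: int = 25) -> int:
    """Step 1031: image frame count = playing duration x frame rate."""
    return round(playing_duration_s * frame_rate)

assert frame_count(1.0) == 25   # 1 s of audio at 25 fps -> 25 image frames
```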
In some embodiments, the above-mentioned process of extracting facial feature data in the facial feature data packet and performing special effect rendering processing based on the facial feature data to obtain facial images corresponding to the facial feature data may be implemented by the following technical solution, extracting feature vectors with the same number as that of the image frames from the feature vectors corresponding to the facial feature data; and carrying out special effect rendering processing based on the extracted feature vectors to obtain the face images with the same number as the image frames.
In some embodiments, the number of image frames in the image frame set is determined according to the playing length of the audio data: if the length of the audio data is 1 second and the frame rate is 25 frames per second, the image frame set generated from the facial feature data corresponding to that speech segment contains 25 image frames, which means that 25 image frames need to be synthesized, that is, 25 facial images need to be obtained. The facial feature data is expressed mathematically by feature vectors: the facial feature data in the facial feature data packet corresponding to the 1 second of audio data is represented by feature vectors, and different feature vectors can represent different mouth shapes. Because the mouth shape change is a dynamic process, even for only 1 second of audio data, 25 image frames are needed to represent that dynamic process. Therefore, feature vectors equal in number to the number of image frames are extracted from the feature vectors, and special effect rendering processing is performed based on the extracted feature vectors to obtain facial images equal in number to the number of image frames.
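One plausible way to take exactly as many feature vectors as image frames is even-spaced sampling; the list-of-vectors layout below is an assumption of the sketch:

```python
def sample_feature_vectors(vectors, num_frames):
    """Pick exactly `num_frames` feature vectors, evenly spaced."""
    if len(vectors) >= num_frames:
        step = len(vectors) / num_frames
        return [vectors[int(i * step)] for i in range(num_frames)]
    # fewer vectors than frames: repeat the last one so the counts match
    return list(vectors) + [vectors[-1]] * (num_frames - len(vectors))
```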
In some embodiments, the process of synthesizing the facial image and the background image in the above steps may be implemented by the following technical solution: the facial feature data packet is analyzed to obtain the facial feature data it contains, and the speech segment corresponding to the facial feature data is determined; based on the semantics represented by the speech segment, at least one group of background images corresponding to the semantics is extracted, where each group of background images includes a model of the virtual anchor, an action, and a live broadcast scene; and the facial image and the at least one group of background images are synthesized to obtain the image frames of the virtual anchor, where the virtual anchor is in the live broadcast scene and performs the actions with the corresponding facial expressions and mouth shapes of the facial images, and the number of groups of actions is the same as the number of groups of background images.
In some embodiments, during the synthesis, the background image to which a facial image is adapted needs to be determined. The background image includes the virtual anchor model, an action, and a live broadcast scene: the virtual anchor model may be a three-dimensional model, and different virtual anchor models correspond to different body postures and different overall appearances; the action in the background image is the motion that the virtual anchor will present, such as a hand-waving motion or a dancing motion; and the live broadcast scene is the environment in which the virtual anchor is located, such as a room, the open field, a forest, or another specific environment. The background image is determined according to semantics: the facial feature data in the facial feature data packet corresponds to the speech segments one to one, each speech segment has semantics representing its meaning, and the background image adapted to those semantics is selected on that basis. After the facial image and the background image are synthesized, the obtained image frame of the virtual anchor presents the following information: the virtual anchor is in the live broadcast scene, and performs the actions with the corresponding facial expressions and mouth shapes of the facial images.
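A hedged sketch of this selection-and-synthesis step; the tag-based background library layout and the composite function are illustrative assumptions:

```python
def build_frames(face_images, segment_semantics, background_library, composite):
    """Select background groups tagged with the segment's semantics and
    composite one face image over each background image."""
    groups = [g for g in background_library
              if segment_semantics in g["tags"]]          # semantics match
    backgrounds = [img for g in groups for img in g["images"]]
    # the counts of face images and background images are kept equal upstream
    return [composite(face, bg) for face, bg in zip(face_images, backgrounds)]
```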
In some embodiments, the process of extracting the background image corresponding to the semantics may be implemented by sending a query transaction to the blockchain network, where the query transaction indicates an intelligent contract for querying an account book in the blockchain network and a query parameter corresponding to the semantics, so that a consensus node in the blockchain network queries the account book by executing the intelligent contract to obtain the background image in the account book, where the background image meets the query parameter; or inquiring a background image corresponding to the semantics from a background image database according to the semantics, and determining a hash value of the inquired background image; and inquiring a hash value corresponding to the semantics from the blockchain network, and determining that the inquired background image is not tampered when the inquired hash value is consistent with the determined hash value.
In some embodiments, referring to fig. 5, fig. 5 is an optional architecture diagram of the artificial intelligence based live broadcast system 100 provided in the embodiments of the present invention. The terminal 400 uploads a background image to the blockchain network; after the consensus nodes reach consensus through the intelligent contract they execute, the background image is recorded in the account book of each accounting node, and the server 200 may obtain the background image from the blockchain network. Uplink storage of the background image by the terminal 400, uplink storage of the background image by the server 200, querying of the background image, and updating of the background image are all realized by submitting transactions: a transaction carries the invocation information of the intelligent contract that implements the data operation, the intelligent contract is invoked and executed by a consensus node, and the result of the execution takes effect only after obtaining the consensus of the other nodes, thereby ensuring the reliability and consistency of the data. The terminal 400 and the server 200 register with a certificate authority to acquire a certificate (including the public key of the service entity, and the digital signature signed by the certificate authority for the public key and the identity information of the service entity), attach the certificate, together with the digital signature of the terminal 400 or the server 200 for the transaction, to the transaction, and send them to the blockchain network, so that the blockchain network can take the digital certificate and signature out of the transaction, verify that the message has not been tampered with, verify the identity information of the service entity sending the message, and check that identity, for example, to determine whether the service entity has the right to initiate the transaction. The consensus function and the ordering function are functions that can be realized by a consensus node. The blockchain and the state database are two ways in which the account book maintains data: the transaction that uplinks the background image and the transaction that queries the background image have complete records in the blockchain, while only partial data is recorded in the state database.
In some embodiments, the background image may be provided by a dedicated art content provider, or by the user initiating the live broadcast. The background image may be uploaded to the blockchain network so that it is not easily tampered with, which prevents content violating public order and good customs from appearing in the synthesized image frames, and prevents the content of the user initiating the live broadcast from being deliberately damaged during live broadcast generation. The background image may thus be obtained by initiating a query transaction to the blockchain network, where the query transaction indicates the intelligent contract for querying the account book in the blockchain network and the query parameter corresponding to the semantics; the query parameter may be the identifier of the background image or the upload information of the background image, and the background image in the account book that meets the query parameter is obtained by executing the intelligent contract corresponding to the query transaction. In another embodiment, the background image and its hash value are queried from the background image database according to the semantics, and hash verification is performed: the queried hash value is compared with the hash value stored for the semantics and the background image in the blockchain network, and if the comparison results are consistent, the queried background image has not been tampered with.
In step 104, the server extracts the image frame set in the image data packet and the audio data in the audio data packet in real time.
In some embodiments, step 103, in which the server performs special effect rendering processing in real time based on the facial feature data in the facial feature data packet to obtain an image data packet carrying the image frame set corresponding to the virtual anchor, may be implemented as follows: when the first facial feature data packet of the at least one facial feature data packet for the given text is formed, special effect rendering processing is performed in real time based on the facial feature data in the first facial feature data packet to obtain the first image data packet carrying the image frame set corresponding to the virtual anchor. Step 104, in which the server extracts the image frame set in the image data packet and the audio data in the audio data packet in real time, may be implemented as follows: when the first audio data packet of the at least one audio data packet for the given text is formed, the audio data in the first audio data packet is extracted in real time so as to push the audio data; and when the first image data packet carrying the image frame set corresponding to the virtual anchor is formed, the image frame set in the first image data packet is extracted in real time so as to push the image frame set.
In some embodiments, the image processing and the audio processing of the server proceed along two lines. For the image processing: once one facial feature data packet of the at least one facial feature data packet for the given text is formed, special effect rendering processing is performed in real time based on the facial feature data in that packet to obtain an image data packet carrying the image frame set corresponding to the virtual anchor; and once an image data packet carrying the image frame set corresponding to the virtual anchor is formed, the image frame set in the image data packet is extracted in real time to push the image frame set. The image processing is unrelated to the audio processing. For the audio processing: once an audio data packet is formed, the audio data in the audio data packet is extracted in real time to push the audio data. The audio line generates the audio data packets corresponding to the given text from the given text, and the extraction and pushing of the audio data in the audio data packets is independent of the image processing; the image line generates the facial feature data packets corresponding to the given text from the given text, performs special effect rendering to generate the image data packets, and extracts the image frame sets in the image data packets to push them, independently of the audio processing.
In step 105, the server performs live streaming push of the virtual anchor according to the image frame set and the audio data.
In some embodiments, when the time taken to obtain the first image data packet from the given text is longer than the time taken to obtain the first audio data packet from the given text, the live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data in step 105 may be implemented as follows: when the audio data in the first audio data packet is extracted, the extracted audio data is pushed to the live broadcast client in real time, and the audio data in subsequently extracted audio data packets is pushed to the live broadcast client in real time, until the image frame set in the first image data packet is extracted, whereupon the extracted image frame set is also pushed to the live broadcast client in real time.
In some embodiments, when the time taken to obtain the first image data packet from the given text is shorter than the time taken to obtain the first audio data packet from the given text, the live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data in step 105 may be implemented as follows: when the image frame set in the first image data packet is extracted, the extracted image frame set is pushed to the live broadcast client, and the image frame sets in subsequently extracted image data packets are pushed to the live broadcast client in real time, until the audio data in the first audio data packet is extracted, whereupon the extracted audio data is also pushed to the live broadcast client in real time.
In some embodiments, the image data packets and the audio data packets are generated asynchronously, that is, they arrive at the video push streaming module asynchronously, and the video push streaming module extracts and pushes whatever has been generated. When an image data packet arrives at the video push streaming module first, the image frame set in the image data packet is extracted and pushed; when an audio data packet arrives first, the audio data in the audio data packet is extracted and pushed. In other words, whichever packet is obtained first is pushed in real time, and neither waits for the other; for subsequent audio data packets and image data packets, the same first-generated, first-pushed strategy is adopted. After the audio data and the image frame data are pushed to the client, the client synchronizes them and plays the video. In order to reduce the waiting time consumed when the client synchronizes the audio data and the image frame data, that is, to ensure as far as possible that the audio data and the image frame data arrive evenly, the difference between the times consumed in generating the image data packet and the audio data packet corresponding to the same speech segment needs to be kept below a difference threshold. The best case is that the times consumed in generating the image data packet and the audio data packet corresponding to the same speech segment are exactly the same, that is, the two packets arrive at the video push streaming module simultaneously and are pushed to the client simultaneously for synchronized processing and playing.
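The difference-threshold constraint described above can be expressed as a simple check; the threshold value below is purely illustrative:

```python
def arrival_gap_ok(audio_gen_time_s: float,
                   image_gen_time_s: float,
                   difference_threshold_s: float = 0.2) -> bool:
    """True when the audio and image packets of the same speech segment were
    generated close enough in time for the client to synchronize cheaply."""
    return abs(audio_gen_time_s - image_gen_time_s) < difference_threshold_s
```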
In some embodiments, the video streaming module in the server may directly push the extracted image frame set and audio data to the client for live broadcasting, and may further allocate a streaming media playing address to the live broadcasting request in response to a live broadcasting request sent by the client, and return the streaming media playing address to the client.
In some embodiments, when the client sends a live broadcast selection instruction including the given text to the server, the server may allocate a streaming media server address to the client, so that the client continuously pulls the live broadcast video from that address; the server selects a suitable streaming media server address to send to the live broadcast client, where the address may be fixed, or selected from a pre-allocated range, or the like.
In some embodiments, in step 105, the server performs the live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data; this may be implemented by sequentially pushing the image frame set and the audio data in real time to the streaming media interface corresponding to the streaming media playing address.
Referring to fig. 3D, fig. 3D is an optional flowchart of the artificial intelligence based live broadcasting method according to the embodiment of the present invention, which will be described with reference to steps 201 to 203 shown in fig. 3D.
In step 201, the client receives a live selection instruction of the virtual anchor.
In step 202, the client acquires an image frame set and audio data corresponding to the virtual anchor according to the live broadcast selection instruction; wherein the set of image frames and the audio data correspond to a given text of the virtual anchor.
In step 203, the client synchronizes the acquired image frame set and the audio data, and presents the image frame of the virtual anchor and the corresponding audio data in real time.
In some embodiments, the live broadcast request is an instruction for creating a live broadcast, that is, it originates from the live broadcast initiator and carries the given text of the virtual anchor; the live broadcast selection instruction is an instruction for selecting a live broadcast to watch, that is, it originates from a user watching the live broadcast. The live broadcast selection instruction in step 201 comes from the watching user: the client receives the live broadcast selection instruction from the watching user, the instruction points to the virtual anchor, and the virtual anchor corresponds to the given text provided by the live broadcast initiator. After receiving the live broadcast selection instruction, the client obtains the image frame set and the audio data corresponding to the virtual anchor according to the instruction, where the image frame set and the audio data correspond to the given text of the virtual anchor and may be generated by the scheme described above. After the image frame set and the audio data are obtained, they need to be synchronized, and the image frames of the virtual anchor and the corresponding audio data are presented in real time; the audio data and the image frame data must be presented in correspondence, that is, they are mutually matched audio-visual data corresponding to the same speech segment.
In some embodiments, during the live broadcast, the client may receive an interactive text from a watching user, where the interactive text is an interaction with the virtual anchor during the live broadcast; for example, in a food live broadcast, the content of the interactive text may be an inquiry about a restaurant's address or dishes. The client or the server then queries the image resources and dialog resources matching the interactive text, where the dialog resources may be pre-configured and used to answer the watching user's questions, so that the watching user receives a timely response. A quick-reply dialog may also be configured among the dialog resources to respond to the interactive text in real time; while responding through the quick-reply dialog, the client may obtain a given text with substantive content, uploaded by a content provider specifically to answer the interactive text, or the server may directly crawl from the Internet, according to the interactive text, a given text with substantive content for answering it. Speech synthesis processing is performed on that text, and the audio data and facial feature data for responding to the interactive text are determined in combination with the image resources; the subsequent processing is the same as in the live broadcasting method based on artificial intelligence provided by the embodiment of the present invention.
In some embodiments, the dialog resources and the image resources of the virtual anchor can be configured: the image resources of the virtual anchor uploaded by the art resource provider are received, and the dialog resources of the virtual anchor are created, where the image resources include at least one of scene resources, model resources, and action resources. When the image resources of a new virtual anchor are received, a new version identifier is allocated to the received image resources, and configuration information corresponding to the new version is generated, the configuration information including at least one of: a scene resource configuration item, a model resource configuration item, an action resource configuration item, and a dialog resource configuration item. When the received image resources of the virtual anchor update an existing virtual anchor, the image resources and the configuration information of the corresponding version of the existing virtual anchor are updated.
Referring to fig. 6, fig. 6 is an implementation architecture diagram of the live broadcasting method based on artificial intelligence according to the embodiment of the present invention. The text-to-speech request module 2551 obtains facial feature data and audio data based on the given text (corresponding to steps 101 and 102 in fig. 6: in step 101, the server receives the given text for the virtual anchor performance from the client; in step 102, the server obtains the audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, forms at least one audio data packet and at least one facial feature data packet respectively, and continues to process the next given text in real time). The rendering module 2552 performs three-dimensional rendering on the facial feature data obtained each time (corresponding to step 103 in fig. 6: the server performs special effect rendering processing in real time based on the facial feature data in the facial feature data packet, obtains an image data packet carrying the image frame set corresponding to the virtual anchor, and continues to process the next facial feature data packet in real time), obtaining the facial images of the virtual character. The video push streaming module 2553 synthesizes the facial images and audio data obtained each time into a virtual video in real time and pushes it to the client (corresponding to steps 104 and 105 in fig. 6: in step 104, the server extracts the image frame set in the image data packet and the audio data in the audio data packet in real time; in step 105, the server performs the live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data). Cooperating independently and in parallel, the text-to-speech request module 2551, the rendering module 2552, and the video push streaming module 2553 can automatically synthesize the text data into a video in real time and push it to the client.
The three modules are mutually independent and parallel, and the audio data and the facial feature data are separate data streams. "Independent and parallel" means the following: the text-to-speech request module 2551 finishes processing a first given text, sends the obtained first facial feature data packet to the rendering module 2552, and sends the obtained first audio data packet to the video push streaming module 2553; the text-to-speech request module 2551 then continues to process a second given text to obtain a second facial feature data packet and a second audio data packet, pushing the second audio data packet directly to the video push streaming module 2553; and the rendering module 2552 processes the received first facial feature data packet to generate a first image data packet, then continues to process the received second facial feature data packet to generate a second image data packet. That is, the text-to-speech request module 2551 does not wait for the first given text to complete the whole stream pushing process before proceeding, but continues to process the given texts it receives. The facial feature data packet and the audio data packet are not bound together for processing, but are divided into two processing lines: while the text-to-speech request module 2551 processes the first given text to generate the first audio data packet and the first facial feature data packet, the times consumed in generating the two packets differ; once the first audio data packet is generated, it is sent to the video push streaming module 2553, regardless of the generation progress of the facial feature data packet, and once the first facial feature data packet is generated, it is sent to the rendering module 2552, regardless of the generation progress of the audio data packet. The rendering module 2552 is likewise unaffected by the processing progress of the other modules, and continues to process the next facial feature data packet it receives.
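The cooperation of the three modules can be sketched with worker functions connected by queues; tts, render, and push stand in for the modules' real processing, and the None end-packet sentinel is an assumption of the sketch:

```python
import queue

feature_q: queue.Queue = queue.Queue()  # text-to-speech -> rendering
packet_q: queue.Queue = queue.Queue()   # audio and image packets -> push

def tts_worker(texts, tts):
    for text in texts:
        audio_pkt, feature_pkt = tts(text)  # one pair per given text
        packet_q.put(audio_pkt)             # audio bypasses rendering entirely
        feature_q.put(feature_pkt)
    feature_q.put(None)                     # end packet to the renderer
    packet_q.put(None)                      # end packet for the audio line

def render_worker(render):
    while (pkt := feature_q.get()) is not None:
        packet_q.put(render(pkt))           # image packets join the push queue
    packet_q.put(None)                      # end packet for the image line

def push_worker(push):
    ends = 0
    while ends < 2:                         # wait for both end packets
        pkt = packet_q.get()
        if pkt is None:
            ends += 1
        else:
            push(pkt)                       # pushed as soon as it is available
```

Started as three threads (for example, threading.Thread(target=tts_worker, args=(texts, tts)).start(), and likewise for the other two workers), each worker proceeds as soon as its own input is available, mirroring the independent, parallel cooperation described above.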
An exemplary application of the artificial intelligence based live broadcasting method provided by the embodiment of the present invention in an actual application scenario will be described below.
The live broadcasting method based on artificial intelligence provided by the embodiment of the invention can be applied to a plurality of items and product applications including virtual news live broadcasting, virtual game commentary and the like, so that the real-time performance of live broadcasting is effectively improved.
In a virtual live broadcast scene, the artificial intelligence-based live broadcast method provided by the embodiment of the invention can be used for independently acquiring audio data and image data in parallel according to an input text (a given text) and pushing the audio data and the image data to a client in real time so as to realize real-time live broadcast.
Fig. 7A-7B are schematic diagrams of an overall framework of a virtual video live broadcast service provided in an embodiment of the present invention, and referring to fig. 7A-7B, the overall framework of the virtual video live broadcast service includes a live broadcast client, a text-to-speech service, and a virtual video push streaming service, where an implementation method of the virtual video push streaming service is a main innovation of the present invention. In fig. 7A, a live broadcast client sends a live broadcast request to a virtual video push streaming service, where the live broadcast request carries a given text, the virtual video push streaming service requests a text-to-speech service to acquire audio data and facial feature data, the text-to-speech service returns the audio data and the facial feature data as a response, and the virtual video push streaming service continues to execute the live broadcast method provided in the embodiment of the present invention according to the acquired response, so as to push an image frame set and audio data to the live broadcast client. In fig. 7B, the live broadcast client sends a live broadcast request to the virtual video push streaming service, where the live broadcast request carries a given text, the virtual video push streaming service returns a streaming media service address to the client as a response, and obtains audio data and facial feature data from the text-to-speech service request, and the text-to-speech service returns the audio data and the facial feature data as a response, the virtual video push streaming service continues to execute the live broadcast method provided in the embodiment of the present invention according to the obtained response, and pushes the image frame set and the audio data to the streaming media service, and the live broadcast client pulls a video from the streaming media service address corresponding to the streaming media service.
Fig. 8 is an implementation framework of the virtual video push streaming service according to the embodiment of the present invention. Referring to fig. 8, the virtual video push streaming service includes a text-to-speech request module, a rendering module, and a video push streaming module. The live broadcast client sends a text request to the virtual video live broadcast server, where the given text in the text request sent by the client is the words to be spoken by the virtual character of the live broadcast video. The text-to-speech request module initiates a request to the text-to-speech service and obtains the audio data and facial feature data corresponding to the given text in a streaming manner (audio data packets 1-3 and facial feature data packets 1-3); each time the text-to-speech request module obtains one audio data packet and one facial feature data packet from the text-to-speech service, it pushes the facial feature data packet to the three-dimensional rendering module and the audio data packet to the video push streaming module, and when the text-to-speech request module receives the end packet of the text-to-speech service, it sends the end packet to the three-dimensional rendering module and the video push streaming module. Each time the three-dimensional rendering module obtains one facial feature data packet, it extracts the facial feature data in the packet and performs three-dimensional rendering to obtain a group of corresponding facial expression images, synthesizing each facial expression image with a background image into a complete image, thereby obtaining a group of complete live video image frames as an image data packet, which it pushes to the video push streaming module; when it receives the end packet, it sends the end packet on to the video push streaming module. The video push streaming module extracts the audio data or image frame data from each audio data packet pushed by the text-to-speech request module or each of the image data packets 1-3 pushed by the three-dimensional rendering module, and asynchronously pushes the audio or image frame data to the client through the Fast Forward Moving Picture Experts Group (FFmpeg) tool, with synchronization completed by the client, until the end packet pushed by the text-to-speech request module and the end packet pushed by the three-dimensional rendering module have both been received, whereupon the video stream pushing is finished.
The text-to-speech request module, the three-dimensional rendering module, and the video push streaming module cooperate independently and in parallel to realize real-time pushing of the virtual video stream. Before obtaining the streamed data from the text-to-speech service, the text-to-speech request module requests from that service the duration of the audio that will finally be generated for the given text provided by the client, so as to estimate the duration of the live video that will finally be generated. Obtaining the audio and facial feature data from the text-to-speech service in a streaming manner means that the audio and image data can be obtained quickly, enabling real-time live broadcast; separating the facial feature data from the audio data and pushing them respectively to the three-dimensional rendering module and the video push streaming module can effectively mitigate the delay that the processing time of the three-dimensional rendering module imposes on real-time stream pushing.
The live broadcasting method based on artificial intelligence provided by the embodiment of the present invention determines the length of the video according to the given text, and during rendering selects n groups of suitable background images from the prestored general background images to be matched and synthesized with the facial expressions; each group of background images constitutes one complete action, so the n groups of background images complete exactly n actions by the time the video ends.
The video push streaming module mainly uses the FFmpeg tool to push the video stream. Because the audio data and the image data arrive at the video push streaming module independently and asynchronously, the time consumed by three-dimensional rendering can be used to receive the audio data acquired by the text-to-speech request module during stream pushing, which avoids waiting for a large number of audio data packets and improves the real-time performance of stream pushing. When the first data packet is received, stream pushing is initialized and the audio or image data is pushed; when the end packet is received, the stream pushing is ended, completing one full video stream pushing process.
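A hedged example of driving such a push with FFmpeg from Python: raw frames are written to the tool's standard input and encoded to an RTMP stream; the resolution, frame rate, and URL are placeholder assumptions:

```python
import subprocess

def start_ffmpeg_push(width: int = 1280, height: int = 720, fps: int = 25,
                      url: str = "rtmp://example.com/live/stream"):
    """Start an FFmpeg process that reads raw RGB frames on stdin and pushes
    an H.264 FLV stream to the given RTMP address."""
    cmd = ["ffmpeg",
           "-f", "rawvideo", "-pix_fmt", "rgb24",
           "-s", f"{width}x{height}", "-r", str(fps),
           "-i", "-",                     # raw frames arrive on stdin
           "-c:v", "libx264", "-f", "flv", url]
    return subprocess.Popen(cmd, stdin=subprocess.PIPE)

# proc = start_ffmpeg_push()
# proc.stdin.write(frame_bytes)           # one width*height*3-byte frame
```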
The processing performed by the text-to-speech request module, the three-dimensional rendering module, and the video push streaming module is very time-consuming. If it were realized serially, the data processing duration would exceed the video generation duration, and live broadcast of the virtual video could not be realized. The three modules therefore cooperate independently and in parallel: as long as the processing time of each module is less than the video duration, and the client merely waits a fixed delay after sending the request, real-time live broadcast of the virtual video can be realized, where the fixed delay equals the time consumed for the first data acquired by the text-to-speech request module to be transmitted to the video push streaming module and pushed successfully.
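A back-of-envelope check of this real-time condition, with purely illustrative figures:

```python
# Streaming stays real-time when every module's per-segment processing time
# is below the segment's playing duration. All figures are illustrative.
segment_duration_s = 1.0
module_times_s = {"tts_request": 0.3, "rendering": 0.8, "push": 0.1}

assert all(t < segment_duration_s for t in module_times_s.values())

# The client's fixed startup delay is the time for the first data to travel
# through all three modules and be pushed successfully.
fixed_delay_s = sum(module_times_s.values())   # 1.2 s with these figures
print(f"client waits a fixed delay of about {fixed_delay_s:.1f} s")
```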
On the basis of the mutually independent, parallel cooperation of the text-to-speech request module, the three-dimensional rendering module, and the video push streaming module, separating and splitting the audio data and the facial feature data acquired by the text-to-speech request module can effectively counter the influence of unstable audio and facial feature data sources on real-time stream pushing (for example, when only audio data or only facial feature data is received at a time, or when the audio data and facial feature data are unevenly distributed in time). While the three modules cooperate independently, the video push streaming module receives the audio data acquired by the text-to-speech request module during the image rendering of the three-dimensional rendering module, so the time consumed by the audio processing of the text-to-speech service and the time consumed by the image processing of the three-dimensional rendering module overlap effectively, improving the real-time performance of stream pushing.
Continuing with the exemplary architecture of the artificial intelligence based live broadcast device 255 implemented as software modules according to the embodiment of the present invention, in some embodiments, as shown in fig. 2, the software modules of the artificial intelligence based live broadcast device 255 stored in the memory 250 may include: a text-to-speech request module 2551, configured to receive a given text for a virtual anchor performance, obtain audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, and form at least one audio data packet and at least one facial feature data packet respectively; a rendering module 2552, configured to perform special effect rendering processing in real time based on the facial feature data in the facial feature data packet, to obtain an image data packet carrying the image frame set corresponding to the virtual anchor; and a video push streaming module 2553, configured to extract the image frame set in the image data packet and the audio data in the audio data packet in real time, and perform the live broadcast data stream pushing of the virtual anchor according to the image frame set and the audio data.
In some embodiments, rendering module 2552 is further configured to: when a first facial feature data packet in at least one facial feature data packet for a given text is formed, perform special effect rendering processing in real time based on the facial feature data in the first facial feature data packet to obtain a first image data packet carrying the image frame set corresponding to the virtual anchor; the video push streaming module 2553 is further configured to: when a first audio data packet in at least one audio data packet for a given text is formed, extract the audio data in the first audio data packet in real time to push the audio data; and when a first image data packet carrying the image frame set corresponding to the virtual anchor is formed, extract the image frame set in the first image data packet in real time so as to push the image frame set.
In some embodiments, when the time taken to retrieve the first image data packet from the given text is greater than the time taken to retrieve the first audio data packet from the given text, the video push streaming module 2553 is further configured to: when the audio data in the first audio data packet is extracted, push the extracted audio data to the live broadcast client in real time, and push the audio data in subsequently extracted audio data packets to the live broadcast client in real time, until the image frame set in the first image data packet is extracted, and then push the extracted image frame set to the live broadcast client in real time.
In some embodiments, when the time taken to retrieve the first image data packet from the given text is less than the time taken to retrieve the first audio data packet from the given text, the video push streaming module 2553 is further configured to: when the image frame set in the first image data packet is extracted, push the extracted image frame set to the live broadcast client, and push the image frame sets in subsequently extracted image data packets to the live broadcast client in real time, until the audio data in the first audio data packet is extracted, and then push the extracted audio data to the live broadcast client in real time.
In some embodiments, the text-to-speech request module 2551 is further configured to: when a given text is received, dividing the given text into at least two language segments, and converting the language segments into word vectors of corresponding language segments in real time; carrying out encoding processing and decoding processing on the word vectors to obtain audio features corresponding to the word vectors; and synthesizing the audio features to obtain audio data of each speech segment respectively corresponding to the virtual anchor.
In some embodiments, the text-to-speech request module 2551 is further configured to: predicting the key points of the mouth of the virtual anchor corresponding to each language segment, and carrying out normalization processing on the key points of the mouth so as to enable the key points of the mouth to be adaptive to the standard face template; performing dimensionality reduction processing on the mouth key points subjected to normalization processing to obtain mouth shape feature data corresponding to each speech segment of the virtual anchor; performing semantic analysis on the language segments to obtain the semantics represented by the language segments; determining facial expression characteristic data matched with the semantics according to the semantics represented by the language segments, and combining the mouth shape characteristic data and the facial expression characteristic data to form facial characteristic data corresponding to each language segment of the virtual anchor; an audio data packet and a facial feature data packet are formed based on the facial feature data and the audio data corresponding to the same speech passage, respectively.
In some embodiments, the text-to-speech request module 2551 is further configured to: before normalization processing is carried out on the mouth key points, sending query transaction to the blockchain network, wherein the query transaction indicates an intelligent contract used for querying an account book in the blockchain network, so that a consensus node in the blockchain network queries the account book in a mode of executing the intelligent contract, and a standard face template stored in the account book is obtained; or according to the identification of the standard face template, inquiring the standard face template corresponding to the identification from the standard face template database, and determining the hash value of the inquired standard face template; and inquiring a hash value corresponding to the identifier from the blockchain network, and when the inquired hash value is consistent with the determined hash value, determining that the inquired standard face template is not tampered.
In some embodiments, the text-to-speech request module 2551 is further configured to: before at least one audio data packet and at least one facial feature data packet are respectively formed, sending a given text to a blockchain network, so that a consensus node in the blockchain network carries out compliance check on the given text in a mode of executing an intelligent contract; and when the compliance confirmation returned by more than the preset number of the consensus nodes is received, determining that the given text passes the compliance check.
In some embodiments, the text-to-speech request module 2551 is further configured to: before an audio data packet and a facial feature data packet which are matched with each other are respectively formed, determining a phoneme sequence represented by audio data; and determining the sequence of the face feature data carrying the mouth feature data based on the phoneme sequence so that the mouth shape change represented by the sequence of the face feature data is matched with the phoneme sequence.
In some embodiments, rendering module 2552 is further configured to: determine the product of the playing duration of the audio data and the frame rate parameter as the number of image frames corresponding to the audio data; extract the facial feature data in the facial feature data packet in real time, and perform special effect rendering processing based on the facial feature data to obtain a facial image corresponding to the facial feature data; synthesize the facial image and the background image to obtain an image frame corresponding to the virtual anchor, and combine a plurality of synthesized image frames into the image frame set corresponding to the audio data, where the number of image frames in the image frame set equals the determined number; and package the image frame set to obtain an image data packet carrying the image frame set corresponding to the virtual anchor.
In some embodiments, rendering module 2552 is further configured to: extracting feature vectors with the same number as the number of image frames from the face feature data; and carrying out special effect rendering processing based on the extracted feature vectors to obtain the face images with the same number as the image frames.
In some embodiments, rendering module 2552 is further configured to: analyzing the facial feature data packet to obtain facial feature data contained in the facial feature data packet, and determining a language segment corresponding to the facial feature data; extracting at least one group of background images corresponding to the semanteme based on the semanteme represented by the language segment; each group of background images comprises a model, an action and a live broadcast scene of a virtual anchor; and synthesizing the face image and at least one group of background images to obtain an image frame of the virtual anchor, wherein the virtual anchor is in a live scene and has actions and corresponding facial expressions and mouth shapes in the face image, and the number of groups of the actions is the same as that of the background images.
In some embodiments, rendering module 2552 is further configured to: sending a query transaction to the blockchain network, wherein the query transaction indicates an intelligent contract used for querying an account book in the blockchain network and query parameters of corresponding semantics, so that a consensus node in the blockchain network queries the account book in a mode of executing the intelligent contract to obtain a background image which accords with the query parameters in the account book; or inquiring a background image corresponding to the semantics from a background image database according to the semantics, and determining a hash value of the inquired background image; and inquiring a hash value corresponding to the semantics from the blockchain network, and determining that the inquired background image is not tampered when the inquired hash value is consistent with the determined hash value.
In some embodiments, the apparatus further comprises: a streaming media service module 2554, configured to: respond to a live broadcast request sent by a client, allocate a streaming media playing address for the live broadcast request, and return the streaming media playing address to the client; the video push streaming module 2553 is further configured to: sequentially push the image frame set and the audio data in real time to the streaming media interface corresponding to the streaming media playing address.
The embodiment of the invention provides a live broadcast device based on artificial intelligence, which comprises:
the instruction receiving module is used for receiving a live broadcast selection instruction of the virtual anchor;
the data acquisition module is used for acquiring an image frame set and audio data corresponding to the virtual anchor according to the live broadcast selection instruction;
wherein the set of image frames and the audio data correspond to a given text of the virtual anchor;
and the presentation module is used for synchronizing the acquired image frame set and audio data, and presenting the image frames of the virtual anchor and the corresponding audio data in real time.
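On the client side, the synchronization step can be read as pairing image frames with audio by timestamp. A minimal sketch assuming a fixed frame rate and one-second audio chunks (both assumptions, not taken from the source):

```python
def present(frame_set, audio_chunks, fps: float = 25.0):
    """Client-side sketch: pair each image frame with the audio chunk
    covering its timestamp so picture and sound stay in step."""
    frame_interval = 1.0 / fps
    for i, frame in enumerate(frame_set):
        ts = i * frame_interval
        # Chunks are assumed one second long, so int(ts) selects the chunk.
        chunk = audio_chunks[min(int(ts), len(audio_chunks) - 1)]
        yield ts, frame, chunk

for ts, frame, chunk in present(["f0", "f1", "f2"], [b"a0", b"a1"]):
    print(f"{ts:.2f}s -> {frame} / {chunk!r}")
```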
Embodiments of the present invention provide a storage medium storing executable instructions that, when executed by a processor, cause the processor to perform the methods provided by the embodiments of the present invention, for example, the artificial intelligence based live broadcast methods shown in FIGS. 3A-3D.
In some embodiments, the storage medium may be a memory such as an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or may be various devices including one of, or any combination of, the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the artificial intelligence based live broadcast method provided by the embodiments of the present invention uses parallel processing and the separation of audio data from image data to achieve real-time virtual video live broadcast even though acquiring the audio and video data consumes substantial computing power. It enhances the real-time performance of virtual video live broadcast, improves the smoothness of the live video, promotes the development of artificial intelligence in live broadcast services, frees human anchors from live broadcasting, and reduces labor cost.
The above description is only an example of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present invention shall fall within the protection scope of the present invention.
Claims (11)
1. An artificial intelligence based live broadcast method, characterized by comprising the following steps:
receiving a given text for a virtual anchor to perform, acquiring audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, and respectively forming at least one audio data packet and at least one facial feature data packet;
when a first facial feature data packet in at least one facial feature data packet for the given text is formed, performing special effect rendering processing on the basis of facial feature data in the first facial feature data packet in real time to obtain a first image data packet carrying an image frame set corresponding to the virtual anchor;
when a first audio data packet of the at least one audio data packet for the given text is formed, extracting the audio data in the first audio data packet in real time;
when a first image data packet carrying an image frame set corresponding to the virtual anchor is formed, extracting the image frame set in the first image data packet in real time;
when the time spent acquiring the first image data packet from the given text is longer than the time spent acquiring the first audio data packet from the given text, and the audio data in the first audio data packet has been extracted: pushing the extracted audio data to a live broadcast client in real time, and continuing to push the audio data of subsequently extracted audio data packets to the live broadcast client in real time until the image frame set in the first image data packet is extracted, and then pushing the extracted image frame set to the live broadcast client in real time;
when the time spent acquiring the first image data packet from the given text is shorter than the time spent acquiring the first audio data packet from the given text, and the image frame set in the first image data packet has been extracted: pushing the extracted image frame set to the live broadcast client, and continuing to push the image frame sets of subsequently extracted image data packets to the live broadcast client in real time until the audio data in the first audio data packet is extracted, and then pushing the extracted audio data to the live broadcast client in real time.
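Read as a scheduling policy, claim 1 pushes whichever pipeline finishes its first packet first and back-fills the other as it completes. An illustrative sketch with simulated production delays (the delays and names are invented for the example):

```python
import threading
import time
from queue import Queue

out: Queue = Queue()

def produce(kind: str, delay_s: float, payload):
    """Simulate the time needed to obtain the first packet of each kind
    from the given text, then hand over whatever is ready without
    waiting for the slower pipeline (delays are illustrative only)."""
    time.sleep(delay_s)
    out.put((kind, payload))

# Audio is ready sooner here, so it streams first; frames follow once the
# rendering pipeline catches up. The order flips by itself whenever
# rendering finishes first instead.
threading.Thread(target=produce, args=("audio", 0.1, b"pcm-0")).start()
threading.Thread(target=produce, args=("frames", 0.3, ["img-0"])).start()

for _ in range(2):
    kind, payload = out.get()
    print("pushed", kind, payload)
```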
2. The method of claim 1, wherein the acquiring audio data corresponding to the virtual anchor in real time according to the given text comprises:
when the given text is received, dividing the given text into at least two speech segments, and converting each speech segment into a corresponding word vector in real time;
encoding and decoding the word vectors to obtain the audio features corresponding to the word vectors;
and synthesizing the audio features to obtain the audio data of the virtual anchor for each speech segment.
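Claim 2 describes a segment-wise text-to-speech pipeline. A toy sketch of its shape; the embedding, the encoder-decoder, and the vocoder below are trivial stand-ins rather than the claimed models:

```python
from typing import List

def split_segments(text: str) -> List[str]:
    """Split the given text into speech segments (naively, on periods)."""
    return [s.strip() for s in text.split(".") if s.strip()]

def to_word_vectors(segment: str) -> List[List[float]]:
    """Toy word embedding: one small vector per word. A real system
    would use a trained embedding plus an encoder-decoder model."""
    return [[len(w) / 10.0, w.count("a") / 5.0] for w in segment.split()]

def synthesize(vectors: List[List[float]]) -> bytes:
    """Stand-in for encoding/decoding plus a vocoder producing audio."""
    return bytes(int(sum(v) * 100) % 256 for v in vectors)

for seg in split_segments("Hello everyone. Welcome to the stream."):
    print(seg, "->", synthesize(to_word_vectors(seg)))
```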
3. The method of claim 2, wherein the acquiring facial feature data corresponding to the virtual anchor in real time according to the given text comprises:
predicting the mouth key points of the virtual anchor for each speech segment, and normalizing the mouth key points so that they fit a standard face template;
performing dimensionality reduction on the normalized mouth key points to obtain the mouth shape feature data of the virtual anchor for each speech segment;
performing semantic analysis on the speech segments to obtain the semantics represented by each speech segment;
and determining the facial expression feature data matching the semantics represented by each speech segment, and combining the mouth shape feature data and the facial expression feature data into the facial feature data of the virtual anchor for each speech segment;
wherein the separately forming at least one audio data packet and at least one facial feature data packet comprises:
forming mutually matched audio data packets and facial feature data packets based on the facial feature data and the audio data corresponding to the same speech segment.
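One plausible reading of the normalization and dimensionality reduction steps in claim 3 is template fitting followed by a PCA-style projection. A sketch under that reading, with all names hypothetical:

```python
import numpy as np

def normalize_to_template(keypoints: np.ndarray,
                          template_center: np.ndarray,
                          template_scale: float) -> np.ndarray:
    """Fit predicted mouth keypoints to a standard face template by
    removing their own translation and scale, then applying the
    template's center and scale."""
    centered = keypoints - keypoints.mean(axis=0)
    scale = np.linalg.norm(centered) or 1.0
    return centered / scale * template_scale + template_center

def reduce_dim(keypoints: np.ndarray, k: int = 2) -> np.ndarray:
    """PCA-style reduction of the normalized keypoints to k components."""
    x = keypoints - keypoints.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:k].T

pts = np.array([[10.0, 5.0], [12.0, 5.5], [14.0, 5.0], [12.0, 4.0]])
norm = normalize_to_template(pts, np.array([0.0, 0.0]), 1.0)
print(reduce_dim(norm).shape)  # (4, 2)
```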
4. The method of claim 3, wherein before the mutually matched audio data packet and facial feature data packet are formed, the method further comprises:
determining the phoneme sequence represented by the audio data;
and ordering the facial feature data carrying the mouth shape feature data based on the phoneme sequence, so that the mouth shape changes represented by the ordered facial feature data match the phoneme sequence.
5. The method according to claim 3, wherein performing special effect rendering processing in real time based on facial feature data in the facial feature data packet to obtain an image data packet carrying an image frame set corresponding to the virtual anchor comprises:
determining the product of the playing duration of the audio data and a frame rate parameter as the number of image frames corresponding to the audio data;
extracting facial feature data in the facial feature data packet in real time, and performing special effect rendering processing based on the facial feature data to obtain a facial image corresponding to the facial feature data;
synthesizing the face image and the background image to obtain an image frame corresponding to the virtual anchor, and combining a plurality of synthesized image frames into an image frame set corresponding to the audio data;
wherein the number of image frames in the image frame set equals the determined number of image frames;
and packaging the image frame set to obtain an image data packet carrying the image frame set corresponding to the virtual anchor.
6. The method of claim 5, wherein the extracting facial feature data in the facial feature data packet in real time, and performing special effect rendering processing based on the facial feature data to obtain a facial image corresponding to the facial feature data, comprises:
extracting as many feature vectors from the facial feature data as there are image frames;
and performing special effect rendering processing based on the extracted feature vectors to obtain the same number of facial images as image frames.
7. The method according to claim 6, wherein the synthesizing the face image and the background image comprises:
parsing the facial feature data packet to obtain the facial feature data contained in it, and determining the speech segment corresponding to the facial feature data;
extracting at least one group of background images corresponding to the semantics represented by the speech segment;
wherein each group of background images comprises a model, an action, and a live broadcast scene of the virtual anchor;
and synthesizing the facial image with the at least one group of background images to obtain an image frame of the virtual anchor, in which the virtual anchor is placed in the live broadcast scene and performs the action with the facial expression and mouth shape of the facial image, the number of actions being the same as the number of background image groups.
8. The method according to any one of claims 1-7, further comprising:
responding to a live broadcast request sent by a client, distributing a streaming media playing address for the live broadcast request, and returning the streaming media playing address to the client;
wherein pushing the live broadcast data stream of the virtual anchor according to the image frame set and the audio data comprises:
and sequentially pushing the image frame set and the audio data to a streaming media interface corresponding to the streaming media playing address in real time.
9. An artificial intelligence based live broadcast apparatus, characterized by comprising:
a text-to-speech request module, configured to receive a given text for a virtual anchor to perform, acquire audio data and facial feature data corresponding to the virtual anchor in real time according to the given text, and respectively form at least one audio data packet and at least one facial feature data packet;
a rendering module, configured to, when a first facial feature data packet of the at least one facial feature data packet for the given text is formed, perform special effect rendering processing in real time based on the facial feature data in the first facial feature data packet to obtain a first image data packet carrying an image frame set corresponding to the virtual anchor;
and a video stream pushing module, configured to: when a first audio data packet of the at least one audio data packet for the given text is formed, extract the audio data in the first audio data packet in real time; when a first image data packet carrying an image frame set corresponding to the virtual anchor is formed, extract the image frame set in the first image data packet in real time; when the time spent acquiring the first image data packet from the given text is longer than the time spent acquiring the first audio data packet from the given text, and the audio data in the first audio data packet has been extracted, push the extracted audio data to a live broadcast client in real time and continue pushing the audio data of subsequently extracted audio data packets to the live broadcast client in real time until the image frame set in the first image data packet is extracted, then push the extracted image frame set to the live broadcast client in real time; and when the time spent acquiring the first image data packet from the given text is shorter than the time spent acquiring the first audio data packet from the given text, and the image frame set in the first image data packet has been extracted, push the extracted image frame set to the live broadcast client and continue pushing the image frame sets of subsequently extracted image data packets to the live broadcast client in real time until the audio data in the first audio data packet is extracted, then push the extracted audio data to the live broadcast client in real time.
10. An artificial intelligence based live broadcast device, characterized by comprising:
a memory for storing executable instructions;
a processor, configured to implement the artificial intelligence based live broadcast method of any one of claims 1 to 8 when executing the executable instructions stored in the memory.
11. A computer-readable storage medium storing executable instructions that, when executed by a processor, implement the artificial intelligence based live broadcast method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911319864.9A CN111010589B (en) | 2019-12-19 | 2019-12-19 | Live broadcast method, device, equipment and storage medium based on artificial intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111010589A CN111010589A (en) | 2020-04-14 |
CN111010589B true CN111010589B (en) | 2022-02-25 |
Family
ID=70116671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911319864.9A Active CN111010589B (en) | 2019-12-19 | 2019-12-19 | Live broadcast method, device, equipment and storage medium based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111010589B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111225237B (en) | 2020-04-23 | 2020-08-21 | 腾讯科技(深圳)有限公司 | Sound and picture matching method of video, related device and storage medium |
CN113691833B (en) * | 2020-05-18 | 2023-02-03 | 北京搜狗科技发展有限公司 | Virtual anchor face changing method and device, electronic equipment and storage medium |
CN111654715B (en) * | 2020-06-08 | 2024-01-09 | 腾讯科技(深圳)有限公司 | Live video processing method and device, electronic equipment and storage medium |
CN111935491B (en) * | 2020-06-28 | 2023-04-07 | 百度在线网络技术(北京)有限公司 | Live broadcast special effect processing method and device and server |
CN111917866B (en) * | 2020-07-29 | 2021-08-31 | 腾讯科技(深圳)有限公司 | Data synchronization method, device, equipment and storage medium |
CN112004101B (en) * | 2020-07-31 | 2022-08-26 | 北京心域科技有限责任公司 | Virtual live broadcast data transmission method and device and storage medium |
CN112333179B (en) * | 2020-10-30 | 2023-11-10 | 腾讯科技(深圳)有限公司 | Live broadcast method, device and equipment of virtual video and readable storage medium |
CN112543342B (en) * | 2020-11-26 | 2023-03-14 | 腾讯科技(深圳)有限公司 | Virtual video live broadcast processing method and device, storage medium and electronic equipment |
CN114630135A (en) * | 2020-12-11 | 2022-06-14 | 北京字跳网络技术有限公司 | Live broadcast interaction method and device |
CN112616063B (en) * | 2020-12-11 | 2022-10-28 | 北京字跳网络技术有限公司 | Live broadcast interaction method, device, equipment and medium |
CN113570686A (en) * | 2021-02-07 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Virtual video live broadcast processing method and device, storage medium and electronic equipment |
CN113194350B (en) * | 2021-04-30 | 2022-08-19 | 百度在线网络技术(北京)有限公司 | Method and device for pushing data to be broadcasted and method and device for broadcasting data |
CN113965665B (en) * | 2021-11-22 | 2024-09-13 | 上海掌门科技有限公司 | Method and equipment for determining virtual live image |
CN114827663B (en) * | 2022-04-12 | 2023-11-21 | 咪咕文化科技有限公司 | Distributed live broadcast frame inserting system and method |
CN116527956B (en) * | 2023-07-03 | 2023-08-22 | 世优(北京)科技有限公司 | Virtual object live broadcast method, device and system based on target event triggering |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104618786A (en) * | 2014-12-22 | 2015-05-13 | 深圳市腾讯计算机系统有限公司 | Audio/video synchronization method and device |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Virtual human face animation generation method and device |
CN109118562A (en) * | 2018-08-31 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | Explanation video creating method, device and the terminal of virtual image |
CN109637518A (en) * | 2018-11-07 | 2019-04-16 | 北京搜狗科技发展有限公司 | Virtual newscaster's implementation method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8467443B2 (en) * | 2005-04-01 | 2013-06-18 | Korea Electronics Technology Institute | Object priority order compositor for MPEG-4 player |
2019-12-19: Application CN201911319864.9A filed; granted as CN111010589B (status: Active).
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111010589B (en) | Live broadcast method, device, equipment and storage medium based on artificial intelligence | |
CN111010586B (en) | Live broadcast method, device, equipment and storage medium based on artificial intelligence | |
WO2022166709A1 (en) | Virtual video live broadcast processing method and apparatus, and storage medium and electronic device | |
JP7479750B2 (en) | Virtual video live broadcast processing method and device, electronic device | |
CN111741326B (en) | Video synthesis method, device, equipment and storage medium | |
US11882319B2 (en) | Virtual live video streaming method and apparatus, device, and readable storage medium | |
CN107027050B (en) | Audio and video processing method and device for assisting live broadcast | |
CN104777911B (en) | A kind of intelligent interactive method based on holographic technique | |
CN110472099B (en) | Interactive video generation method and device and storage medium | |
WO2022134698A1 (en) | Video processing method and device | |
CN114495927A (en) | Multi-modal interactive virtual digital person generation method and device, storage medium and terminal | |
CN113923462A (en) | Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium | |
US11908476B1 (en) | System and method of facilitating human interactions with products and services over a network | |
CN113282791B (en) | Video generation method and device | |
CN116756285A (en) | Virtual robot interaction method, device and storage medium | |
CN117292022A (en) | Video generation method and device based on virtual object and electronic equipment | |
CN109241331B (en) | Intelligent robot-oriented story data processing method | |
CN115690277A (en) | Video generation method, system, device, electronic equipment and computer storage medium | |
CN116962746A (en) | Online chorus method and device based on continuous wheat live broadcast and online chorus system | |
CN116561294A (en) | Sign language video generation method and device, computer equipment and storage medium | |
CN111966803A (en) | Dialogue simulation method, dialogue simulation device, storage medium and electronic equipment | |
CN118377882B (en) | Accompanying intelligent dialogue method and electronic equipment | |
CN117373455B (en) | Audio and video generation method, device, equipment and storage medium | |
US20240290024A1 (en) | Dynamic synthetic video chat agent replacement | |
WO2024001307A1 (en) | Voice cloning method and apparatus, and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40021758 ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||