
Multimedia interaction method, device, electronic equipment and storage medium

Info

Publication number
CN117812417A
Authority
CN
China
Prior art keywords
multimedia
interaction
interaction data
data
target user
Prior art date
Legal status
Pending
Application number
CN202311746220.4A
Other languages
Chinese (zh)
Inventor
谭雅文
苏军根
Current Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Original Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Technology Innovation Center and China Telecom Corp Ltd
Priority to CN202311746220.4A
Publication of CN117812417A
Legal status: Pending

Abstract

The present disclosure provides a multimedia interaction method and device, electronic equipment, and a storage medium, relating to the technical field of multimedia interaction. The multimedia interaction method includes: acquiring multimedia information of the currently played multimedia; acquiring first interaction data of a target user collected while the multimedia is playing, wherein the target user is a user watching the multimedia; inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and presenting the second interaction data to the target user. The method addresses the poor user experience in the related art caused by the absence of human interaction while a user watches multimedia: it generates natural-language responses to the interaction data produced by the user, thereby simulating the experience of interacting with another person and improving the user experience.

Description

Multimedia interaction method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical field of multimedia interaction, and in particular to a multimedia interaction method, a device, electronic equipment, and a storage medium.
Background
A user has a need for social interaction while viewing multimedia. When the user voices feelings, comments, and the like about what is playing, the user expects a response; if no interaction occurs, the user receives no social feedback, and the viewing experience suffers.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a multimedia interaction method, a device, an electronic device, and a storage medium, which at least to some extent overcome the problem in the related art that the user experience is poor because there is no human interaction while a user watches multimedia.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a multimedia interaction method, including: acquiring multimedia information of the currently played multimedia; acquiring first interaction data of a target user collected while the multimedia is playing, wherein the target user is a user watching the multimedia; inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and presenting the second interaction data to the target user.
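As a rough illustration only (not part of the patent text), the following Python sketch wires the four claimed steps together. The names get_media_info, collect_interaction, nlg_model, and display are hypothetical stand-ins for the components described in the embodiments below, not interfaces defined by this disclosure.

```python
from dataclasses import dataclass

@dataclass
class MediaInfo:
    title: str
    synopsis: str

def interact_once(get_media_info, collect_interaction, nlg_model, display):
    media = get_media_info()                    # step 1: current multimedia info
    first_data = collect_interaction()          # step 2: user's utterance while viewing
    second_data = nlg_model(media, first_data)  # step 3: generate a response
    display(second_data)                        # step 4: present it to the viewer

# Example wiring with trivial stand-ins:
media = MediaInfo("Avatar", "A sci-fi epic set on Pandora.")
interact_once(
    get_media_info=lambda: media,
    collect_interaction=lambda: "These special effects look amazing!",
    nlg_model=lambda m, d: f"Right? {m.title} really shines in scenes like this.",
    display=print,
)
```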
In some embodiments, the multimedia is a television program; acquiring the multimedia information of the currently played multimedia then includes: obtaining, through a set top box, the program information of the television program being played.
In some embodiments, obtaining, through the set top box, the program information of the television program being played includes: determining, through the set top box, the playing mode of the target user; when the playing mode is on-demand, obtaining, through the set top box, program information of the television program ordered by the target user; and when the playing mode is multicast, obtaining, through the set top box, the television program guide of the target user, determining the currently played television program according to the program guide, and obtaining the program information of the currently played television program.
In some embodiments, acquiring the first interaction data of the target user collected while the multimedia is playing includes: acquiring live sound data collected while the television program is playing, the live sound data being sound data collected at the site where the television program is played; and inputting the live sound data into a pre-trained voiceprint matching model and extracting the voice data that matches the target user's voiceprint to obtain the first interaction data.
In some embodiments, before the multimedia information and the first interaction data are input into the pre-trained natural language generation model to generate the second interaction data, the method further includes: acquiring first configuration information, where the first configuration information is used to configure the number M of pieces of first interaction data required to generate the second interaction data. Inputting the multimedia information and the first interaction data into the pre-trained natural language generation model to generate the second interaction data then includes: inputting the multimedia information and the M pieces of first interaction data into the pre-trained natural language generation model to generate the second interaction data.
In some embodiments, the second interaction data is text data, and before the second interaction data is presented to the target user, the method further includes: acquiring second configuration information, where the second configuration information is used to configure the word-count condition that displayed second interaction data must satisfy. Presenting the second interaction data to the target user then includes: presenting, to the target user, the second interaction data that satisfies the word-count condition.
In some embodiments, presenting the second interaction data to the target user includes: displaying the second interaction data to the target user in the form of a bullet screen.
According to another aspect of the present disclosure, there is also provided a multimedia interaction device, including: a first acquisition module for acquiring the multimedia information of the currently played multimedia; a second acquisition module for acquiring the first interaction data of the target user collected while the multimedia is playing, where the target user is a user watching the multimedia; a generation module for inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and a display module for presenting the second interaction data to the target user.
According to another aspect of the present disclosure, there is also provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the multimedia interaction method of any of the embodiments described above via execution of the executable instructions.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multimedia interaction method of any of the above embodiments.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the multimedia interaction method of any of the above embodiments.
According to another aspect of the present disclosure, there is also provided a multimedia interaction system, including: a multimedia interaction device for acquiring the multimedia information of the currently played multimedia, acquiring the first interaction data of the target user collected while the multimedia is playing (the target user being a user watching the multimedia), inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data, and presenting the second interaction data to the target user; and a sound collection device for collecting the first interaction data while the multimedia is playing.
In some embodiments, the multimedia interaction system further includes: a playing device for playing the multimedia and displaying the second interaction data; and a set top box connected to the playing device and located in the same local area network as the sound collection device, the set top box being used to control the playing device to play the multimedia and display the second interaction data, to establish a communication connection with the sound collection device through the local area network, and to control the sound collection device to collect the first interaction data.
According to the multimedia interaction method, device, electronic equipment, and storage medium provided by the present disclosure, the multimedia information of the currently played multimedia and the first interaction data, collected while the multimedia is playing from the target user watching it, are input into a pre-trained natural language generation model, and the generated second interaction data is presented to the target user. A natural-language response can thus be generated for the interaction data produced by the user, simulating the experience of interacting with another person and improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 shows a first schematic diagram of a multimedia interaction system structure in an embodiment of the disclosure;
FIG. 2 shows a second schematic diagram of a multimedia interaction system structure in an embodiment of the disclosure;
FIG. 3 shows a first flowchart of a multimedia interaction method in an embodiment of the disclosure;
FIG. 4 shows a second flowchart of a multimedia interaction method in an embodiment of the disclosure;
FIG. 5 shows a third flowchart of a multimedia interaction method in an embodiment of the disclosure;
FIG. 6 shows a schematic diagram of a multimedia interaction device in an embodiment of the disclosure; and
FIG. 7 shows a block diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The following detailed description of embodiments of the present disclosure refers to the accompanying drawings.
Fig. 1 illustrates an architecture diagram of a multimedia interaction system to which the multimedia interaction method of the embodiments of the present disclosure may be applied. As shown in fig. 1, the system architecture may include a multimedia interaction device 11 and a sound collection device 12.
The multimedia interaction device 11 is used to: acquire the multimedia information of the currently played multimedia; acquire the first interaction data of the target user collected while the multimedia is playing, the target user being a user watching the multimedia; input the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and present the second interaction data to the target user.
The multimedia interaction device 11 may be a variety of electronic devices including, but not limited to, a server, a set top box, a smart television, a smart speaker, a smart phone, a tablet computer, a laptop, a desktop computer, a smart watch, a wearable device, an augmented reality device, a virtual reality device, and the like.
Optionally, the application clients installed on different multimedia interaction devices 11 may be the same client, or clients of the same type of application built for different operating systems. The specific form of the application client may also differ across terminal platforms; for example, it may be a mobile phone client, a PC client, and so on.
In the case where the multimedia interaction device 11 is a server, the server may be a server providing various services, such as a background management server. The background management server can analyze the received data and the like to obtain second interaction data and feed back the processing result to the playing device for display.
Optionally, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), big data, and artificial intelligence platforms.
The sound collection device 12 is used for collecting first interaction data during the process of playing multimedia. The sound collection device 12 may be a set top box with a sound collection module, a smart television, a smart speaker, a smart phone, a tablet computer, a laptop portable computer, a desktop computer, a smart watch, a wearable device, an augmented reality device, a virtual reality device, etc.
The multimedia interaction device 11 and the sound collection device 12 may be connected by a network. The network is a medium for providing a communication link between the multimedia interaction device 11 and the sound collection device 12, and may be a wired network or a wireless network.
Optionally, the wireless or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired, or wireless network, a private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPSec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
In some embodiments, the multimedia interaction system may further comprise: playback device and set top box.
The playing device is used for playing the multimedia and displaying the second interactive data. The playback device may be a smart television, a smart speaker, a smart phone, a tablet computer, a laptop, a desktop computer, a smart watch, a wearable device, an augmented reality device, a virtual reality device, etc.
The set-top box is connected to the playing device and located in the same local area network as the sound collection device 12; it is used to control the playing device to play the multimedia and display the second interaction data, to establish a communication connection with the sound collection device 12 through the local area network, and to control the sound collection device 12 to collect the first interaction data.
The following describes a specific embodiment of the multimedia interaction system of the present disclosure in a concrete application scenario. As shown in fig. 2, the multimedia interaction system includes a cloud server 21, a television 22, a set-top box 23, and a sound box 24.
The cloud server 21 is an optional embodiment of the multimedia interaction device 11 and is configured to: obtain the multimedia information of the multimedia currently played by the television 22; acquire the first interaction data of the target user collected while the television 22 plays the multimedia, the target user being a user watching the multimedia; input the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and present the second interaction data to the target user.
The sound box 24 is an alternative embodiment of the sound collection device 12, used to collect the first interaction data while the television 22 plays the multimedia.
The television 22 is an alternative embodiment of the playing device, used to play the multimedia and present the second interaction data.
The set top box 23 and the cloud server 21 may be connected and communicate through a network. The set top box 23 and the television 22 may be connected and communicate through a High Definition Multimedia Interface (HDMI), an audio/video cable (AV cable), a Video Graphics Array (VGA) interface, or connections such as the wireless technology Wi-Fi, optical fiber, and the like. The set top box 23 and the sound box 24 may be connected and communicate through the local area network.
Under the system architecture described above, embodiments of the present disclosure provide a multimedia interaction method that may be performed by any electronic device with computing processing capabilities. In some embodiments, the multimedia interaction method provided in the embodiments of the present disclosure may be performed by the multimedia interaction device 11 of the above-described system architecture.
Fig. 3 shows a flowchart of a multimedia interaction method in an embodiment of the disclosure, and as shown in fig. 3, the multimedia interaction method provided in the embodiment of the disclosure includes the following steps:
S301, acquiring the multimedia information of the currently played multimedia.
Media refers to a carrier that carries and transmits information or material. Multimedia is an integration of multiple media, typically including text, sound, images, and other media forms.
In a computer system, multimedia refers to a human-computer interactive medium for communicating and transmitting information that combines two or more media; the media used include text, pictures, photographs, sound, animation, film, and the like. Multimedia stores and manages various kinds of information, such as language and text, data, audio, and video, through a computer.
The multimedia in the embodiments of the present disclosure may be any multimedia played by an electronic device, for example the multimedia interaction device 11 shown in fig. 1. In some embodiments it is played by a television, such as the television 22 shown in fig. 2; in other embodiments it may be played by a mobile phone, a tablet computer, a notebook computer, or a desktop computer. Depending on the specific implementation of the multimedia interaction system, the device that plays the multimedia may be deployed in different ways, which the embodiments of the present disclosure do not specifically limit.
The multimedia information is information related to the multimedia, and may include content information of the multimedia, attribute information of the multimedia, and information about its playback. Taking a movie as an example, the multimedia information may include one or more of the movie plot, character introductions, episode titles, the movie title, release date, cast and crew list, reviews, awards, post-production details, running time, and so on; these examples are not exhaustive.
The multimedia information may be obtained through a network. In some embodiments, search information for the multimedia is obtained from an information source, and information related to the multimedia is then searched for on the network according to that search information, thereby obtaining the multimedia information.
In some embodiments, the multimedia is a television program. Correspondingly, the multimedia information of the currently played multimedia is acquired by obtaining, through the set top box, the program information of the television program being played. In a specific alternative embodiment in which the multimedia interaction method is executed by the cloud server shown in fig. 2, the set-top box 23 may collect information about the movie being played, for example the movie title, country, release date, director, and so on, and send it to the cloud server 21; the cloud server 21 may then search the network according to the movie information provided by the set-top box 23 and extract the required information from the search results to obtain the multimedia information.
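A minimal sketch of how the program fields reported by the set-top box might be assembled into a search query is given below; the ProgramInfo type and its fields are illustrative assumptions, and no particular search service is implied by the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class ProgramInfo:
    name: str
    country: str = ""
    release_year: str = ""
    director: str = ""
    extra: dict = field(default_factory=dict)

    def search_query(self) -> str:
        # Join the known fields into a query string any search service could take.
        parts = [self.name, self.country, self.release_year, self.director]
        return " ".join(p for p in parts if p)

info = ProgramInfo(name="Avatar", country="USA", release_year="2009",
                   director="James Cameron")
print(info.search_query())  # -> "Avatar USA 2009 James Cameron"
```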
If the program information of the television program being played is obtained through the set top box, the playing mode of the target user may first be determined through the set top box. If the playing mode is on-demand, the set top box can obtain the program information of the television program ordered by the target user. If the playing mode is multicast, the set top box can obtain the target user's television program guide, determine the currently played television program from the guide, and obtain the program information of that program. The television program guide may be an electronic program guide (EPG), which lets users browse current and upcoming program information and select and control programs on a receiving device.
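The playing-mode branch described above can be sketched as follows; the mode values, the EpgEntry type, and the in-memory guide are assumptions for illustration, not details taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class EpgEntry:
    channel: str
    title: str

def get_program_title(mode: str, on_demand_title, epg, tuned_channel: str) -> str:
    # On demand: the user's own order identifies the program directly.
    if mode == "on_demand":
        return on_demand_title
    # Multicast: look up what is airing now on the tuned channel in the EPG.
    for entry in epg:
        if entry.channel == tuned_channel:
            return entry.title
    raise LookupError(f"no EPG entry for channel {tuned_channel}")

epg = [EpgEntry("CCTV-6", "Avatar"), EpgEntry("CCTV-1", "Evening News")]
print(get_program_title("multicast", None, epg, "CCTV-6"))  # -> Avatar
```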
S302, acquiring the first interaction data of the target user collected while the multimedia is playing.
The target user is a user watching the multimedia. While watching, the user may produce interactive impulses such as questions, feelings, and comments, which may be expressed as text, speech, gestures, and so on. To capture this interaction information, it can be collected in a preset manner to obtain the target user's first interaction data; for example, the user's voice data may be collected through a sound collection device, and information such as its semantics and/or emotion extracted to obtain the first interaction data.
In some embodiments, step S302, acquiring the first interaction data of the target user collected while the multimedia is playing, may include:
S3021, acquiring live sound data collected while the television program is playing; the live sound data is sound data collected at the site where the television program is played.
S3022, inputting the live sound data into a pre-trained voiceprint matching model, and extracting the voice data that matches the target user's voiceprint to obtain the first interaction data.
The user watching the multimedia is physically present where the multimedia is playing. The sound data at the playing site can be collected by a sound collection module built into the playing device, or by a separate sound collection device. In the multimedia interaction system architecture shown in fig. 2, a sound collection module may be built into the sound box 24, and live sound data is collected through the sound box 24 while the television 22 plays the multimedia.
Because the live sound data includes sound from different sources, including the audio played by the multimedia itself, the user's voice must be extracted from it. An alternative implementation extracts the target user's voice data from the collected live sound through voiceprint recognition and matching. Optionally, a voiceprint matching model may be trained in advance to identify and extract, from the sound data, the voice data that matches a preset voiceprint; the preset voiceprint's features can be extracted from voice data entered by the target user in advance. The target user's voice data is thereby extracted, yielding the first interaction data.
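One way such a voiceprint matching step could work, assuming some encoder maps an audio segment to a fixed-length embedding, is sketched below; the random-vector "embeddings" are toy stand-ins for a real voiceprint encoder, and the 0.75 threshold is an arbitrary illustrative value.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_user_speech(segments, embed, user_voiceprint, threshold=0.75):
    # Keep only segments whose embedding matches the enrolled voiceprint,
    # discarding program audio and other speakers.
    return [seg for seg in segments
            if cosine(embed(seg), user_voiceprint) >= threshold]

# Toy usage: random vectors stand in for a real voiceprint encoder.
rng = np.random.default_rng(0)
user_vp = rng.normal(size=128)

def embed(seg):
    if seg["speaker"] == "user":                       # near the enrolled voiceprint
        return user_vp + rng.normal(scale=0.1, size=128)
    return rng.normal(size=128)                        # unrelated sound source

segments = [{"speaker": "user", "text": "These effects are great!"},
            {"speaker": "tv", "text": "(program audio)"}]
print(extract_user_speech(segments, embed, user_vp))
```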
S303, inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data.
The natural language generation model in the embodiments of the present disclosure may be a pre-trained model capable of generating language with the characteristics of natural human language, so that it responds to the first interaction data within the reply background frame defined by the multimedia information and generates second interaction data that is related to the multimedia information and responsive to the first interaction data. In some embodiments, the natural language generation model may be trained with big data as samples; model structures such as natural language processing (NLP) models, artificial intelligence (AI) models, neural network models, and conversational large language models can be combined to configure the model to be trained, which after training can produce relevant responses to the information provided. An existing model from the related art may also be used; the embodiments of the present disclosure do not specifically limit this.
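A minimal sketch of how the multimedia information could define the reply background frame before the user's utterance is appended is shown below; the prompt wording and the build_prompt helper are assumptions for illustration, and the call into the deployed model is left abstract.

```python
def build_prompt(program_info: str, user_utterance: str, word_limit=None) -> str:
    # The program information defines the reply's background frame.
    frame = (f"I am watching {program_info}. "
             "Please act as a companion watching it with me "
             "and respond to what I say.")
    if word_limit is not None:
        frame += f" Limit your reply to {word_limit} words."
    return f"{frame}\nViewer: {user_utterance}\nCompanion:"

prompt = build_prompt('the movie "Avatar"',
                      "The special effects in this scene are really good!",
                      word_limit=20)
print(prompt)  # this string is what the deployed NLG model would receive
```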
S304, the second interaction data is displayed to the target user.
Optionally, the second interaction data may include, but is not limited to, any one or more of the following types of interaction data: text data, voice data, image data, video data, and so on. When the second interaction data is presented to the target user, any device capable of presenting the corresponding media format can present it. In some embodiments, the second interaction data may be presented on the device playing the multimedia, during playback, without interrupting the playback.
In a specific alternative embodiment, the second interaction data is text data, and the playing device can be controlled to display it in the form of a bullet screen. Other forms, such as displaying a pop-up box or an avatar dialogue on the playback device, may also be used; the embodiments of the present disclosure do not specifically limit this.
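Purely as an illustration of what a bullet-screen display command might carry, the following sketch serializes one overlay message; every field name here is hypothetical, since the disclosure does not define a message format.

```python
import json
import time

def danmaku_message(text: str, duration_s: float = 8.0) -> str:
    # Overlay command: drawn on top of the program; playback continues underneath.
    return json.dumps({
        "type": "danmaku",
        "text": text,
        "timestamp": time.time(),
        "duration_s": duration_s,
        "interrupt_playback": False,
    })

print(danmaku_message("Right? The effects in this scene are stunning."))
```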
In some embodiments, first configuration information may be obtained before the multimedia information and the first interaction data are input into the pre-trained natural language generation model to generate the second interaction data. The first configuration information configures the number M (a natural number) of pieces of first interaction data required to generate the second interaction data; in these embodiments, the input to the natural language generation model must include M pieces of first interaction data. For example, the user may configure the response frequency so that a bullet-screen reply is sent for every 3 sentences the user speaks; once the user is detected to have spoken 3 sentences, the information in those sentences is combined, or the 3 sentences are input directly into the natural language generation model together with the multimedia information, and the resulting bullet screen is generated and displayed.
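The "reply every M utterances" behavior can be sketched as a small buffer, as below; the UtteranceBatcher class is an illustrative assumption, not a component named in the disclosure.

```python
class UtteranceBatcher:
    """Accumulates user sentences; flushes once M have arrived."""

    def __init__(self, m: int):
        self.m = m                     # sentences per generated reply
        self.buffer = []

    def add(self, utterance: str):
        self.buffer.append(utterance)
        if len(self.buffer) < self.m:
            return None                # keep waiting
        combined = " ".join(self.buffer)
        self.buffer.clear()
        return combined                # one model input covering M sentences

batcher = UtteranceBatcher(m=3)
batch = None
for s in ["Wow.", "This scene is gorgeous.", "How did they film this?"]:
    batch = batcher.add(s)
print(batch)  # all three sentences combined into one model input
```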
In some embodiments, the second interaction data is text data, and the word-count condition that displayed second interaction data must satisfy (i.e., the second configuration information) can also be configured, so that only second interaction data meeting the condition is displayed. Second interaction data that does not satisfy the second configuration information can be discarded before display, or the natural language generation model can be configured directly to generate second interaction data that meets the word-count condition.
For example, a user may configure replies shorter than a preset number of words (e.g., 3 words) to be ignored and not presented, to mask low-value responses. As another example, the user may configure a word-count constraint on the natural language generation model's output, e.g., configuring it to directly generate response text of 15 ± 5 words.
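A sketch of such a word-count gate is given below; note that it splits on whitespace, whereas text in Chinese would more naturally be counted by character, and both thresholds are illustrative.

```python
def passes_word_count(reply: str, min_words: int = 4, max_words=None) -> bool:
    # Gate applied before display: discard very short replies; optionally
    # also enforce an upper bound on the generated text.
    n = len(reply.split())
    if n < min_words:
        return False
    return max_words is None or n <= max_words

print(passes_word_count("Nice."))                            # False: discarded
print(passes_word_count("Those effects really are superb.")) # True: displayed
```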
The first configuration information and the second configuration information may be preconfigured, for example by a technician, or derived by analyzing statistics of user preferences; they may also be configured directly by the user, for example by entering a word limit into the set top box with a remote control.
In some embodiments, the first interaction data is voice data. Before it is input into the natural language generation model, it may be preprocessed: the first interaction data is input into a pre-trained speech-to-text model, which outputs the first text content corresponding to it, and the first text content is input into the natural language generation model. Further, in some optional embodiments, emotion information and/or semantic information may be extracted from the first text content and input into the natural language generation model.
In some embodiments, the first interaction data is voice data, and before it is input into the natural language generation model it may be preprocessed by extracting emotion information from the user's voice data according to the voiceprint features of the user's voice; the emotion information is then input into the natural language generation model.
The emotion information may include a classification label indicating a human emotion, text expressing an emotion, or the like.
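A sketch of this optional preprocessing step is shown below; both models are passed in as placeholders, since the disclosure only requires that text and/or emotion information reach the natural language generation model.

```python
from dataclasses import dataclass

@dataclass
class PreprocessedInput:
    text: str      # output of the speech-to-text model
    emotion: str   # e.g. a classification label such as "excited"

def preprocess(audio_segment, speech_to_text, classify_emotion):
    text = speech_to_text(audio_segment)        # pre-trained ASR model
    emotion = classify_emotion(audio_segment)   # e.g. from voiceprint features
    return PreprocessedInput(text=text, emotion=emotion)

# Toy stand-ins for the two pre-trained models:
result = preprocess(
    audio_segment=b"...",
    speech_to_text=lambda a: "The special effects are really good!",
    classify_emotion=lambda a: "excited",
)
print(result)
```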
In an alternative embodiment, the multimedia interaction method provided in the present disclosure may be applied to the system architecture shown in fig. 2. The flow of this embodiment is described below with reference to figs. 4 and 5.
As the number of people living alone grows, it is increasingly common for users to watch video at home by themselves on a television. With no one to interact with, their emotional reactions while viewing receive no timely feedback from others, and the viewing experience falls short of watching together with others. With the multimedia interaction method provided by the embodiments of the present disclosure, a user watching alone at home can introduce virtual interaction in text form, simulating the scenario of being accompanied by others, which can greatly improve the viewing experience. In a specific embodiment of the present disclosure, rather than using a single terminal device for the virtual interaction, the sound box and the set top box cooperate: the sound box collects the user's speech, the set top box obtains the program information, and, combining the set top box's playing content with the user's interactive speech collected by the sound box, a generative AI hosted on the cloud server produces the virtual interaction text, which is displayed on the television screen under the set top box's control. This achieves the effect of virtual interaction and enhances the sense of interaction while the user watches.
First, the user starts the virtual interaction function through the set-top box shown in fig. 2. The virtual interaction function may be a function module pre-built into the set-top box or one the user needs to download. The user can issue an instruction to the set top box through a remote control, voice control, or other means to start the virtual interaction function.
After the user starts the virtual interaction function, the set top box first determines the playing form of the program. If the program is on demand, the set top box can obtain the program information directly from the user's selection when ordering; if it is multicast, the set top box can obtain the program information of the currently played multicast stream through the electronic program guide (EPG). The natural language generation model can then construct the response background frame for the virtual interaction, that is, a constraint tying the content of the interactive replies to the program information, so that responses follow the program the user is watching. This simulates the effect of a real companion watching the program and prevents the AI model from giving rambling, off-topic responses to the user's interactive speech.
Next, the set top box signals a sound box in the same local area network to start its voice capture function. The sound box can extract the user's voice according to the user's voiceprint, removing interference from the program audio. The sound box may store the user's voiceprint features in advance; specifically, the user can record speech into the sound box beforehand, from which the sound box extracts and stores the voiceprint features.
The sound box then uploads the voice information to the cloud server, where a deployed speech-to-text model converts it to text; after filtering, the converted text is sent to the set top box. The filtering principle is to discard text of three words or fewer, which prevents the set top box from processing meaningless interjections and wasting computing resources. The user can set how many sentences of speech trigger one virtual-interaction reply, so that the virtual interaction adapts well to the habits and preferences of different users.
The set top box then combines the received text according to the count set by the user and sends it to the cloud server, which inputs it into the natural language generation model. If the user sets the sentence count to 1, each sentence is uploaded to the cloud server and receives a virtual-interaction reply; a setting of M means every M sentences are merged into one input and uploaded. The user can also set the word limit for each reply, so that the virtual interaction adapts well to the preferences of different users.
According to the word limit set by the user, the natural language generation model generates text of the specified length, with the program information as the background frame of the interaction. The set top box controls the television screen to display the text output by the cloud server as a bullet screen, which can be drawn directly on top of the currently playing program; this avoids interrupting playback and minimizes interference with viewing while still achieving virtual interaction with the user. Completing the virtual-interaction reply with text rather than speech also avoids the interactive reply audio drowning out the program's original sound, reducing the impact of the virtual interaction on normal viewing as much as possible.
The user can turn off the virtual interaction function through the set top box; the set top box signals the cooperating sound box to stop the virtual interaction service, and the sound box stops capturing voice upon receiving the signal.
One specific usage scenario of the multimedia interaction method provided by the specific embodiments shown in figs. 4 and 5 is described below.
When the user starts the virtual interaction function, the set top box collects the information of the currently playing program. For example, if the user is watching the on-demand movie "Avatar", the set top box learns the information of the program the user is watching from the order.
Then, the set top box has the program information input into the natural language generation model on the cloud server to generate the virtual interaction's background frame information, for example: "I am watching the movie 'Avatar'; please simulate the scenario where you are watching it with me and respond to my feelings."
During viewing, the sound box collects the interactive speech uttered by the user, extracting the user's voice via the user's voiceprint features and eliminating interference from program audio in the captured live sound. It extracts the user's interactive speech, for example: "The special effects in this scene are really good." The sound box uploads this voice data to the cloud server's speech-to-text model, which converts it into text data; after the text's word count is judged to satisfy the filtering principle (more than three words), the text is sent to the set top box.
If the response frequency set by the user is one reply per sentence, the set top box directly uploads the text to the cloud server's natural language generation model, attaching the reply word-limit condition set by the user, for example: "Please limit your answer to 20 words: The special effects in this scene are really good."
After the set top box obtains the response text generated by the natural language model, it generates display data in bullet-screen form and controls the television to show it on screen. The sound box continues collecting and extracting the user's speech until the user turns off the virtual interaction function.
It should be noted that the acquisition, storage, use, and processing of data in the technical solution of the present disclosure comply with the relevant provisions of national laws and regulations, and all types of data related to individuals, customers, and groups acquired in the embodiments of the present disclosure, such as personal identity data, operation data, and behavior data, have been duly authorized.
Based on the same inventive concept, a multimedia interaction device is also provided in the embodiments of the present disclosure, as described in the following embodiments. Because the principle by which the device embodiment solves the problem is similar to that of the method embodiment, its implementation may refer to the implementation of the method embodiment, and repeated details are omitted.
Fig. 6 shows a schematic diagram of a multimedia interaction device according to an embodiment of the disclosure, as shown in fig. 6, where the device includes: the system comprises a first acquisition module 601, a second acquisition module 602, a generation module 603 and a display module 604.
The first obtaining module 601 is configured to obtain multimedia information of a currently playing multimedia.
The second obtaining module 602 is configured to obtain the first interaction data of the target user collected while the multimedia is playing, where the target user is a user watching the multimedia.
The generating module 603 is configured to input the multimedia information and the first interaction data into a pre-trained natural language generating model, and generate second interaction data.
The display module 604 is configured to display the second interaction data to the target user.
According to the multimedia interaction device provided by the embodiments of the present disclosure, the multimedia information of the currently played multimedia and the first interaction data, collected while the multimedia is playing from the target user watching it, are input into a pre-trained natural language generation model, and the generated second interaction data is presented to the target user. A natural-language response can thus be generated for the interaction data produced by the user, simulating the experience of interacting with another person and improving the user experience.
In some embodiments, the multimedia is a television program; the first obtaining module 601 is further configured to obtain, through the set top box, the program information of the television program being played.
In some embodiments, the first acquisition module 601 is further configured to: determine, through the set top box, the playing mode of the target user; when the playing mode is on-demand, obtain, through the set top box, program information of the television program ordered by the target user; and when the playing mode is multicast, obtain, through the set top box, the target user's television program guide, determine the currently played television program according to the guide, and obtain the program information of the currently played television program.
In some embodiments, the second acquisition module 602 is further configured to: acquire live sound data collected while the television program is playing, the live sound data being sound data collected at the site where the television program is played; and input the live sound data into a pre-trained voiceprint matching model and extract the voice data matching the target user's voiceprint to obtain the first interaction data.
In some embodiments, the apparatus further includes a third obtaining module, configured to obtain first configuration information before the multimedia information and the first interaction data are input into the pre-trained natural language generation model to generate the second interaction data, where the first configuration information is used to configure the number M of pieces of first interaction data required to generate the second interaction data; the generating module 603 is further configured to input the multimedia information and the M pieces of first interaction data into the pre-trained natural language generation model to generate the second interaction data.
In some embodiments, the second interaction data is text data; the apparatus further includes a fourth obtaining module, configured to obtain second configuration information before the second interaction data is presented to the target user, where the second configuration information is used to configure the word-count condition that displayed second interaction data must satisfy; the presentation module 604 is further configured to present, to the target user, the second interaction data that satisfies the word-count condition.
In some embodiments, the presentation module 604 is further configured to present the second interaction data to the target user in the form of a bullet screen.
It should be noted that the first obtaining module 601, the second obtaining module 602, the generating module 603, and the presentation module 604 correspond to S301 to S304 in the method embodiment; these modules and their corresponding steps implement the same examples and application scenarios, but are not limited to what the method embodiment discloses. The modules described above may be implemented as part of a device in a computer system, such as a set of computer-executable instructions.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.) or an embodiment combining hardware and software aspects may be referred to herein as a "circuit," module "or" system.
An electronic device 700 according to such an embodiment of the present disclosure is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 7, the electronic device 700 is embodied in the form of a general purpose computing device. Components of electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one memory unit 720, and a bus 730 connecting the different system components, including the memory unit 720 and the processing unit 710.
Wherein the storage unit stores program code that is executable by the processing unit 710 such that the processing unit 710 performs steps according to various exemplary embodiments of the present disclosure described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 710 may perform the following steps of the method embodiment described above: acquiring multimedia information of currently played multimedia; acquiring first interaction data of a target user acquired in the process of playing multimedia; wherein the target user is a user watching multimedia; inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and displaying the second interaction data to the target user.
The memory unit 720 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 7201 and/or cache memory 7202, and may further include Read Only Memory (ROM) 7203.
The storage unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 730 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 740 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 700, and/or any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 750. Also, electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 760. As shown, network adapter 760 communicates with other modules of electronic device 700 over bus 730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowcharts may be implemented as a computer program product comprising a computer program which, when executed by a processor, implements the multimedia interaction method described above.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium, which may be a readable signal medium or a readable storage medium, is also provided. The computer readable storage medium has stored thereon a program product capable of implementing the above-described method of the present disclosure. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In this disclosure, a computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, the program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the description of the above embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (12)

1. A multimedia interaction method, comprising:
acquiring multimedia information of currently played multimedia;
acquiring first interaction data of a target user acquired in the process of playing the multimedia; wherein the target user is a user viewing the multimedia;
inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data;
and displaying the second interaction data to the target user.
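
Viewed as an algorithm, claim 1 describes a four-step pipeline: fetch program context, capture a viewer utterance, condition a pre-trained generation model on both, and surface the reply. A minimal sketch of such a pipeline follows; the `MultimediaInfo` fields, the `nlg_model.generate` interface, and the prompt wording are illustrative assumptions, not details taken from the disclosure.

```python
# Illustrative sketch of the claim 1 pipeline; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class MultimediaInfo:
    title: str     # e.g. program title reported by the set top box
    synopsis: str  # e.g. program description

def generate_second_interaction(info: MultimediaInfo,
                                first_interaction: str,
                                nlg_model) -> str:
    """Condition a pre-trained natural language generation model on the
    program information and the viewer's utterance, and return its reply."""
    prompt = (
        f"Program: {info.title}\n"
        f"Synopsis: {info.synopsis}\n"
        f"Viewer said: {first_interaction}\n"
        "Reply as a fellow viewer:"
    )
    return nlg_model.generate(prompt)  # assumed single-call model interface
```
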
2. The multimedia interaction method according to claim 1, wherein the multimedia is a television program, and the acquiring the multimedia information of the currently played multimedia comprises:
acquiring, through a set top box, program information of the played television program.
3. The multimedia interaction method according to claim 2, wherein the acquiring, through the set top box, program information of the played television program comprises:
determining, through the set top box, a playing mode of the target user;
in a case where the playing mode is on-demand, acquiring, through the set top box, program information of the television program requested on demand by the target user; and
in a case where the playing mode is multicast, acquiring, through the set top box, a television program list requested by the target user, determining the currently played television program according to the television program list, and acquiring program information of the currently played television program.
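
As an illustration of the branch in claim 3, the sketch below assumes a set top box client object `stb` exposing hypothetical `get_playing_mode`, `get_on_demand_program_info`, and `get_program_schedule` methods; the disclosure does not specify such an interface.

```python
def get_program_info(stb) -> dict:
    """Resolve the program information according to the playing mode."""
    mode = stb.get_playing_mode()
    if mode == "on_demand":
        # On demand: the set top box knows the requested program directly.
        return stb.get_on_demand_program_info()
    if mode == "multicast":
        # Multicast: derive the current program from the requested
        # television program list, then look up its information.
        schedule = stb.get_program_schedule()
        current = next(p for p in schedule if p["is_now_playing"])
        return current["info"]
    raise ValueError(f"unknown playing mode: {mode}")
```
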
4. The multimedia interaction method according to claim 2, wherein the acquiring the first interaction data of the target user collected during playing of the multimedia comprises:
acquiring live sound data collected while the television program is being played, wherein the live sound data are sound data collected at the site where the television program is played; and
inputting the live sound data into a pre-trained voiceprint matching model, and extracting the voice data matching the voiceprint of the target user to obtain the first interaction data.
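
One common way to realize the voiceprint matching of claim 4 is to embed each speech segment and keep those whose embedding is close to the target user's enrolled voiceprint. The sketch below assumes a `voiceprint_model` with hypothetical `split_into_segments` and `embed` methods, and uses cosine similarity with an arbitrary threshold; the disclosure does not fix these details.

```python
import numpy as np

def extract_target_speech(live_audio: np.ndarray,
                          target_voiceprint: np.ndarray,
                          voiceprint_model,
                          threshold: float = 0.75) -> list:
    """Keep only the speech segments whose voiceprint embedding is
    sufficiently similar to the target user's enrolled voiceprint."""
    matched = []
    for segment in voiceprint_model.split_into_segments(live_audio):
        emb = voiceprint_model.embed(segment)  # assumed embedding API
        similarity = float(
            np.dot(emb, target_voiceprint)
            / (np.linalg.norm(emb) * np.linalg.norm(target_voiceprint))
        )
        if similarity >= threshold:
            matched.append(segment)
    return matched
```
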
5. The multimedia interaction method according to claim 1, wherein before the inputting the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data, the method further comprises:
acquiring first configuration information, wherein the first configuration information is used for configuring the number M of pieces of first interaction data required to generate the second interaction data;
and the inputting the multimedia information and the first interaction data into the pre-trained natural language generation model to generate the second interaction data comprises:
inputting the multimedia information and the M pieces of first interaction data into the pre-trained natural language generation model to generate the second interaction data.
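
Claim 5 makes the number of buffered utterances configurable. One simple reading, sketched below with Python's standard `collections.deque`, is to accumulate utterances in a fixed-size buffer and only invoke the generation model once M of them are available; `M = 3`, the prompt format, and the model interface are example assumptions.

```python
from collections import deque

M = 3  # example value for the configured number of first interaction data

recent_interactions: deque = deque(maxlen=M)

def on_first_interaction(utterance: str, info, nlg_model):
    """Buffer utterances and generate a reply once M are available."""
    recent_interactions.append(utterance)
    if len(recent_interactions) < M:
        return None  # not enough context yet
    prompt = (
        f"Program: {info.title}\n"
        + "\n".join(f"Viewer said: {u}" for u in recent_interactions)
        + "\nReply as a fellow viewer:"
    )
    return nlg_model.generate(prompt)  # assumed model interface
```
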
6. The multimedia interaction method according to claim 1, wherein the second interaction data is text data, and before the displaying the second interaction data to the target user, the method further comprises:
acquiring second configuration information, wherein the second configuration information is used for configuring a word count condition that second interaction data must satisfy to be displayed;
and the displaying the second interaction data to the target user comprises:
displaying the second interaction data satisfying the word count condition to the target user.
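
The word count condition of claim 6 is a display-time filter. A minimal sketch, assuming the condition is a simple maximum length (the value 50 and the `show` callback are examples, not from the disclosure):

```python
MAX_CHARS = 50  # example limit from the second configuration information

def satisfies_word_count(text: str, max_chars: int = MAX_CHARS) -> bool:
    """True if the generated reply is short enough to be displayed,
    e.g. so it fits comfortably as a single bullet-screen comment."""
    return 0 < len(text) <= max_chars

def display_if_allowed(text: str, show) -> None:
    if satisfies_word_count(text):
        show(text)  # `show` stands in for the actual display path
```
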
7. The multimedia interaction method according to any of claims 1-6, wherein the displaying the second interaction data to the target user comprises:
displaying the second interaction data to the target user in the form of a barrage (bullet-screen comment).
8. A multimedia interaction device, comprising:
a first acquisition module, configured to acquire multimedia information of a currently played multimedia;
a second acquisition module, configured to acquire first interaction data of a target user collected during playing of the multimedia, wherein the target user is a user viewing the multimedia;
a generation module, configured to input the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and
a display module, configured to display the second interaction data to the target user.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the multimedia interaction method of any of claims 1-7 via execution of the executable instructions.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the multimedia interaction method of any of claims 1-7.
11. A multimedia interaction system, comprising:
a multimedia interaction device, configured to: acquire multimedia information of a currently played multimedia; acquire first interaction data of a target user collected during playing of the multimedia, wherein the target user is a user viewing the multimedia; input the multimedia information and the first interaction data into a pre-trained natural language generation model to generate second interaction data; and display the second interaction data to the target user; and
a sound collection device, configured to collect the first interaction data during playing of the multimedia.
12. The multimedia interaction system of claim 11, further comprising:
the playing device is used for playing the multimedia and displaying the second interaction data;
the set top box is connected with the playing device and is in the same local area network with the sound collecting device, and is used for controlling the playing device to play the multimedia and display the second interactive data, establishing communication connection with the sound collecting device through the local area network and controlling the sound collecting device to collect the first interactive data.
CN202311746220.4A 2023-12-18 2023-12-18 Multimedia interaction method, device, electronic equipment and storage medium Pending CN117812417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311746220.4A CN117812417A (en) 2023-12-18 2023-12-18 Multimedia interaction method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117812417A (en)

Family

ID=90421161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311746220.4A Pending CN117812417A (en) 2023-12-18 2023-12-18 Multimedia interaction method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117812417A (en)

Similar Documents

Publication Publication Date Title
US11252444B2 (en) Video stream processing method, computer device, and storage medium
WO2021114881A1 (en) Intelligent commentary generation method, apparatus and device, intelligent commentary playback method, apparatus and device, and computer storage medium
US10991380B2 (en) Generating visual closed caption for sign language
US7746986B2 (en) Methods and systems for a sign language graphical interpreter
CN108401192A (en) Video stream processing method, device, computer equipment and storage medium
KR102520019B1 (en) Speech enhancement for speech recognition applications in broadcast environments
CN112423081B (en) Video data processing method, device and equipment and readable storage medium
CN108012173A (en) A kind of content identification method, device, equipment and computer-readable storage medium
US11800202B2 (en) Systems and methods for generating supplemental content for a program content stream
CN111629253A (en) Video processing method and device, computer readable storage medium and electronic equipment
CN111654715A (en) Live video processing method and device, electronic equipment and storage medium
CN112492329B (en) Live broadcast method and device
CN114040255A (en) Live caption generating method, system, equipment and storage medium
CN111479124A (en) Real-time playing method and device
CN112735430A (en) Multilingual online simultaneous interpretation system
CN113630620A (en) Multimedia file playing system, related method, device and equipment
CN117812417A (en) Multimedia interaction method, device, electronic equipment and storage medium
US20180176631A1 (en) Methods and systems for providing an interactive second screen experience
CN114727120B (en) Live audio stream acquisition method and device, electronic equipment and storage medium
CN113742473A (en) Digital virtual human interaction system and calculation transmission optimization method thereof
CN113891108A (en) Subtitle optimization method and device, electronic equipment and storage medium
CN108495163B (en) Video barrage reading device, system, method and computer readable storage medium
CN113630613B (en) Information processing method, device and storage medium
KR20120050016A (en) Apparatus for construction social network by using multimedia contents and method thereof
WO2023006820A1 (en) System and method for question answering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination