CN114598922A - Voice message interaction method, device, equipment and storage medium - Google Patents

Voice message interaction method, device, equipment and storage medium

Info

Publication number
CN114598922A
CN114598922A
Authority
CN
China
Prior art keywords
text information
voice
user
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210227932.4A
Other languages
Chinese (zh)
Inventor
段洁斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202210227932.4A
Publication of CN114598922A
Legal status: Pending

Classifications

    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G06F40/30 Handling natural language data — semantic analysis
    • G06N3/045 Neural network architectures — combinations of networks
    • G06N3/08 Neural networks — learning methods
    • G10L15/26 Speech recognition — speech to text systems
    • H04N21/4415 Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N21/4666 Learning process for intelligent management, e.g. learning user preferences for recommending movies, using neural networks
    • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles

Abstract

The invention discloses a voice message interaction method, apparatus, device and storage medium, wherein the method comprises the following steps: acquiring a video file that is being played, and parsing video file information of the video file; collecting user voice information and converting it into initial text information; obtaining the user's voice intention based on the video file information and the initial text information; and if the user's voice intention is an interaction intention, converting the initial text information into target text information and displaying the target text information. The invention achieves the technical effect of simplifying the bullet-screen sending process while watching television.

Description

Voice message interaction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of voice recognition technologies, and in particular, to a voice message interaction method, apparatus, device, and computer-readable storage medium.
Background
In recent years, with the continuous development of smart televisions, users' expectations for the viewing experience in large-screen scenarios have grown ever higher, and television usage scenarios have become increasingly diverse. Bullet-screen ("barrage") culture has gradually emerged, and as a common way for users to interact while watching videos, bullet screens are widely used when watching television videos.
When a user wants to interact via bullet screens while watching television, the user usually has to input the bullet-screen content into the television by means of a remote controller or Bluetooth voice. On the one hand, this requires a remote controller or other peripherals; on the other hand, it requires entering a bullet-screen input mode and other steps that interrupt the viewing experience, so the operation of sending a bullet screen is cumbersome.
Disclosure of Invention
The main purpose of the present invention is to provide a voice message interaction method, apparatus, device and computer-readable storage medium, aiming to solve the problem that sending a bullet screen during television viewing is a cumbersome operation.
In order to achieve the above object, the present invention provides a voice message interaction method, which comprises:
acquiring a video file being played, and analyzing video file information of the video file;
collecting user voice information, and converting the user voice information into initial text information;
acquiring a user voice intention based on the video file information and the initial text information;
and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
Optionally, before the step of obtaining the video file being played, the method further includes:
after the video playing is detected to start, identifying a signal source of the video playing;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
Optionally, the step of collecting the user voice information and converting the user voice information into the initial text information includes:
collecting mixed audio information in a far field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
Optionally, before the step of obtaining the user's voice intention based on the video file information and the initial text information, the method further includes:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
Optionally, the step of obtaining the user's voice intention based on the video file information and the initial text information includes:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
Optionally, after the step of determining the user's voice intention according to the prediction result, the method further includes:
if the same text content in the initial text information is identified as appearing consecutively more than a preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used term setting page, and setting the same text content as an interactive commonly used term to serve as one of the bases for extracting the key text information.
Optionally, if the user voice intention is an interaction intention, the step of converting the initial text information into target text information and displaying the target text information includes:
after the voice intention of the user is judged to be an interaction intention, the initial text information is obtained, and whether the initial text information contains preset sensitive words or not is identified;
if the initial text information does not contain a preset sensitive word, taking the initial text information as the target text information, and displaying the target text information;
and if the initial text information contains a preset sensitive word, filtering the initial text information, taking the filtered initial text information as the target text information, and displaying the target text information.
In addition, to achieve the above object, the present invention further provides a voice message interaction apparatus, including:
the video file analysis module is used for acquiring a video file which is being played and analyzing video file information of the video file;
the voice recognition analysis module is used for acquiring user voice information and converting the user voice information into initial text information;
the user intention acquisition module is used for acquiring the voice intention of the user based on the video file information and the initial text information;
and the text information display module is used for converting the initial text information into target text information and displaying the target text information if the user voice intention is an interactive intention.
In addition, to achieve the above object, the present invention also provides an electronic device including: a memory, a processor, and a voice message interaction program stored on the memory and executable on the processor, the voice message interaction program configured to implement the steps of the voice message interaction method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a voice message interaction program stored thereon, which, when executed by a processor, implements the steps of the voice message interaction method as described above.
The method comprises the steps of acquiring a video file that is being played and parsing video file information of the video file; collecting user voice information and converting it into initial text information; obtaining the user's voice intention based on the video file information and the initial text information; and, if the user's voice intention is an interaction intention, converting the initial text information into target text information and displaying it. In this way, the user inputs voice information to the television device without peripherals such as a remote controller; if the television device judges that the user's voice intention is an interaction intention, it converts the user's voice information into target text information and displays it on the television screen, which simplifies the process of sending a bullet screen and improves the fluency with which the user sends bullet screens.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice message interaction method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a voice message interaction method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a voice message interaction method according to a third embodiment of the present invention;
fig. 5 is a schematic diagram of a voice message interaction apparatus according to an embodiment of the present invention;
the implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Generally, to send a bullet screen on the television side, a user needs to turn on a bullet-screen switch and then input the bullet-screen content either by typing characters with a remote controller or by using the remote controller's Bluetooth voice function.
The main technical scheme of the invention is as follows: acquiring a video file being played, and analyzing video file information of the video file; collecting user voice information, and converting the user voice information into initial text information; acquiring a user voice intention based on the video file information and the initial text information; and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM), or a Non-Volatile Memory (NVM) such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice message interaction program.
In the electronic device shown in fig. 1, the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the electronic device of the present invention may be disposed in the electronic device, and the electronic device calls the voice message interaction program stored in the memory 1005 through the processor 1001 and executes the following steps:
acquiring a video file being played, and analyzing video file information of the video file;
collecting user voice information, and converting the user voice information into initial text information;
acquiring a user voice intention based on the video file information and the initial text information;
and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
after the video playing is detected to start, identifying a signal source of the video playing;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
collecting mixed audio information in a far field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
if the same text content in the initial text information is identified as appearing consecutively more than a preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used term setting page, and setting the same text content as an interactive commonly used term to serve as one of the bases for extracting the key text information.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
after the voice intention of the user is judged to be an interaction intention, the initial text information is obtained, and whether the initial text information contains preset sensitive words or not is identified;
if the initial text information does not contain a preset sensitive word, taking the initial text information as the target text information, and displaying the target text information;
and if the initial text information contains a preset sensitive word, filtering the initial text information, taking the filtered initial text information as the target text information, and displaying the target text information.
An embodiment of the present invention provides a voice message interaction method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a voice message interaction method according to the present invention.
In this embodiment, the voice message interaction method includes:
step S10, acquiring a video file being played, and analyzing video file information of the video file;
the message interaction can be understood as that the user sends comments about the video content in a bullet screen sending mode and expresses own opinions. At present, video websites and video software of many mobile device terminals support sending barrages. When a user watches video played by a television, different signal sources can be selected, and the video content of all the signal sources does not support the transmission of barrage.
As an example, before the step of obtaining the video file being played, the method may include:
step A1, after detecting that video playing starts, identifying the signal source of the video. Generally, a television can be provided with a signal source interface, and Video content from a television box, a network on demand, a computer, a DVD (Digital Video Disc), or other devices or networks can be played by accessing different signal sources.
Step A2, according to the signal type of the signal source, judging whether the video supports message interaction. The signal types may be classified into an HDMI (High Definition Multimedia Interface) signal, a VGA (Video Graphics Array) signal, an AV (Audio & Video) signal, and a network on demand signal. Web-on-demand may be understood as watching internet video on a television. Before web-on-demand, a user may install video application software on a television.
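The signal-source check in steps A1 and A2 can be sketched as follows. The signal-type names and the rule that only network on-demand playback supports bullet screens are illustrative assumptions, not the patent's exact decision logic:

```python
from enum import Enum

class SignalType(Enum):
    HDMI = "hdmi"
    VGA = "vga"
    AV = "av"
    NETWORK_ON_DEMAND = "network_on_demand"

def supports_message_interaction(signal: SignalType) -> bool:
    """In this sketch, only app-driven network on-demand playback can
    overlay and send bullet screens; externally decoded inputs
    (HDMI/VGA/AV) are assumed not to support message interaction."""
    return signal is SignalType.NETWORK_ON_DEMAND
```

Grouping the check behind one predicate keeps the rest of the pipeline (steps S10 onward) independent of how signal types are enumerated.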
When the video played by the television supports sending bullet screens, an Android application package (APK) in the television can acquire the video file being played and parse the video file information related to it. The video file information may contain person information, image information, text information, and the like. The person information may include the names of the characters, cast and crew, etc. in the video file. The image information may be actor and character information obtained by capturing frames as playback progresses and analyzing those frames. The text information may comprise subtitles in the video file.
Step S20, collecting user voice information, and converting the user voice information into initial text information;
when a user watches video played by a television, the television can acquire user voice information through a preset voice recognition system and convert the user voice information into initial text information.
As an example, the step of collecting the user voice information and converting the user voice information into the initial text information may include:
and step B1, collecting mixed audio information in a far-field range, stripping user voice information in the mixed audio information, and converting the user voice information into initial text information. The voice capture system in the television may be a set of 4-way microphones mounted at the bottom of the television screen. Through the microphone array technology, the television can perform far-field speech recognition. The far field range in this embodiment may be 3-5 meters. In the process of collecting the voice information of the user, the television can also generate audio information when playing videos, and the voice collecting system can collect mixed audio information firstly. By configuring an AEC (Acoustic Echo Cancellation) algorithm in the voice acquisition system, Echo Cancellation can be performed on the acquired mixed audio information, the audio information played by the television can be removed, and the user voice information can be stripped. And sending the user voice information obtained after the AEC algorithm processing to an APK (android package), and analyzing the user voice information by the APK to convert the user voice information into initial text information.
Step S30, based on the video file information and the initial text information, obtaining the voice intention of the user;
the user's voice intention may be divided into an interactive intention and an instructional intention according to the purpose of the user's input of voice. When the content contained in the initial text information is successfully matched with the video file information, the voice intention of the user can be judged to be the interaction intention. When the initial text information is a control instruction such as fast forward, fast backward, pause and the like, the voice intention of the user can be judged to be an instruction intention. Trigger words related to the control instructions can be preset in the television, and when the trigger words are identified to be contained in the initial text information, the control instructions corresponding to the trigger words are directly executed.
Step S40, if the user voice intention is an interaction intention, converting the initial text information into target text information, and displaying the target text information.
After the user's voice intention is judged to be an interaction intention, the bullet-screen content can be displayed on the television screen according to the initial text information. Note that the finally displayed bullet-screen content may differ from the initial text information: if the initial text information contains a preset sensitive word, the initial text information can be filtered first, and the filtered target text information is then displayed on the television screen. In addition, the APK can also identify the user's identity information and display it in association with the target text information. The user identity information may comprise a user nickname and an avatar.
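A simple sketch of the sensitive-word filtering in step S40, assuming each match is masked with asterisks (the patent does not specify the masking style, and the word list is hypothetical):

```python
def filter_sensitive(text: str, sensitive_words: set[str],
                     mask: str = "*") -> str:
    """Replace each occurrence of a preset sensitive word with a mask
    of equal length; the result becomes the target text information."""
    for word in sensitive_words:
        text = text.replace(word, mask * len(word))
    return text
```

If no sensitive word matches, the text passes through unchanged, matching the branch where the initial text information is used directly as the target text information.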
In this embodiment, after a video is detected to start playing, whether the video supports message interaction is judged. If it does, the video file being played is obtained and its video file information is parsed; user voice information is collected and converted into initial text information; the user's voice intention is obtained based on the video file information and the initial text information; and if the user's voice intention is an interaction intention, the initial text information is converted into target text information and displayed. The user can thus send a bullet screen simply by speaking to the television, without peripherals such as a remote controller, which improves the fluency of sending bullet screens and simplifies the bullet-screen sending flow.
Further, in a second embodiment of the voice message interaction method of the present invention, referring to fig. 3, the voice message interaction method of the present invention includes:
step S11, sending the video file information to a preset server as a training set of an initial intention prediction model in the server to establish a trained target intention prediction model;
An initial intention prediction model can be preset in the server. When the initial intention prediction model is established, existing network video introduction information is used as the training set for a CNN (Convolutional Neural Network) algorithm; the training target at this stage is the interaction intention. The network video introduction information may include the cast and crew of the network video, a plot synopsis, and the like.
When the user watches the video, the APK can send the video file information obtained by analysis to the server, and the video file information is used as a training set to train the initial intention prediction model.
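A toy forward pass of a CNN-style text classifier of the kind the server might host. The vocabulary size, dimensions, class labels, and random (untrained) weights are purely illustrative; actual training on the video-introduction corpus and parsed video file information is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the server-side model: 100-word vocabulary,
# 2 output classes (interaction / instruction). Weights are random,
# i.e. untrained.
EMB_DIM, NUM_FILTERS, KERNEL, NUM_CLASSES = 16, 8, 3, 2
W_emb = rng.normal(size=(100, EMB_DIM))
W_conv = rng.normal(size=(NUM_FILTERS, KERNEL, EMB_DIM))
W_out = rng.normal(size=(NUM_FILTERS, NUM_CLASSES))

def predict_intent(token_ids: list[int]) -> np.ndarray:
    """Forward pass of a 1-D convolutional text classifier:
    embed -> convolve over time -> ReLU + max-pool -> linear -> softmax."""
    x = W_emb[token_ids]                                   # (T, EMB_DIM)
    T = len(token_ids)
    conv = np.stack([
        np.array([np.sum(W_conv[f] * x[t:t + KERNEL])
                  for t in range(T - KERNEL + 1)])
        for f in range(NUM_FILTERS)
    ])                                                     # (F, T-K+1)
    pooled = np.maximum(conv, 0.0).max(axis=1)             # ReLU + max pool
    logits = pooled @ W_out                                # (NUM_CLASSES,)
    p = np.exp(logits - logits.max())
    return p / p.sum()                                     # class probabilities
```

The server would return the argmax class (interaction or instruction) as the prediction result sent back to the television.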
Step S12, extracting key text information in the initial text information;
when the intention prediction is performed, the initial text information may be pre-processed first, and the key text information in the initial text information is extracted, so as to reduce the data processing amount during the intention prediction. The APK may extract the key text information in the initial text information after parsing the user voice information into the initial text information. In another embodiment, the APK sends the initial text message directly to the server, and the server extracts the key text message. The key text information may be extracted by first identifying whether the initial text information includes an interactive keyword, and extracting the interactive keyword in the initial text information as the key text information.
Step S13, the key text information is sent to the server so that the server can obtain a prediction result through the prediction of the target intention prediction model and return the prediction result;
and the APK sends the extracted key text information to the server, the server inputs the key text information into the target intention prediction model, the target intention prediction model can output a prediction result, and the server returns the prediction result to the television.
Step S14, determining the user' S voice intention according to the prediction result;
the prediction result output by the target intention prediction model can be divided into an interaction type and an instruction type, and after receiving the prediction result, the APK can judge that the voice intention of the user is an interaction intention or an instruction intention according to the prediction type.
Step S15, if the same text content is identified as occurring consecutively in the initial text information more than a preset number of times, displaying an interactive commonly used term setting page;
While the user watches a video, the bullet-screen content the user wants to send may have no textual association with the video file information. For example, the user may want to send a bullet screen such as "so funny" or "boring", and such text content may be missed when the key text information is extracted. The same text content occurring consecutively more than a preset number of times is therefore taken as a trigger condition, and when the trigger condition is met, the interactive commonly used term setting page is displayed. Through this page, the user can set frequently sent bullet-screen content as interactive commonly used terms, improving the accuracy of bullet-screen sending. The preset number of times may be set to 2; for example, when the voice information input by the user is "too happy, too happy", the interactive commonly used term setting page will be displayed on the television screen.
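The trigger condition can be sketched as a check over the history of recognized texts. Whether the threshold comparison is strict or inclusive is not fixed by the text (the example uses two occurrences), so this sketch assumes an inclusive comparison.

```python
# Sketch of the trigger condition: if the same recognized text arrives
# PRESET_TIMES or more times in a row, show the commonly-used-term page.
PRESET_TIMES = 2  # value suggested in the description

def should_show_settings_page(history):
    """history: list of recognized texts, newest last."""
    if not history:
        return False
    last = history[-1]
    run = 0
    for text in reversed(history):
        if text != last:
            break
        run += 1  # count the consecutive tail of identical texts
    return run >= PRESET_TIMES

print(should_show_settings_page(["too happy", "too happy"]))  # -> True
print(should_show_settings_page(["hello"]))                   # -> False
```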
Step S16, receiving a setting instruction based on the interactive commonly used term setting page, and setting the same text content as an interactive commonly used term, to serve as one of the bases for extracting the key text information.
The setting instruction may be received as a voice setting instruction from the user, or through a remote controller. After the setting instruction is received, the identified same text content may be set as an interactive commonly used term. When an interactive commonly used term is subsequently identified, it is taken as key text information, extracted, and sent to the server.
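The effect of saving a commonly used term on later extraction can be sketched as follows; the function and variable names are illustrative only, and the exact-match rule is an assumption.

```python
# Hypothetical sketch: once a phrase has been saved as an interactive
# commonly used term, later utterances matching it exactly are taken
# as key text information directly.
COMMON_TERMS = set()

def set_common_term(text):
    """Handle the setting instruction from the settings page."""
    COMMON_TERMS.add(text)

def match_common_term(initial_text):
    """Return the phrase as key text if it is a saved common term."""
    return initial_text if initial_text in COMMON_TERMS else None

set_common_term("too happy")
print(match_common_term("too happy"))  # -> too happy
print(match_common_term("hello"))      # -> None
```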
In this embodiment, video file information is used as a training set to establish a target intention prediction model; key text information is input into the target intention prediction model, and the prediction type it outputs yields the user's voice intention. By setting interactive commonly used terms, the user's interactive phrasing habits can be fitted, improving the accuracy of recognizing the interaction intention.
Further, in a third embodiment of the voice message interaction method of the present invention, referring to fig. 4, the voice message interaction method of the present invention includes:
Step S21, after the user's voice intention is judged to be an interaction intention, acquiring the initial text information and identifying whether the initial text information contains preset sensitive words;
After the user's voice intention is judged to be an interaction intention, the initial text information can be cached by the APK, and whether it contains preset sensitive words is identified. Sensitive words may include uncivil language, words related to national security, and the like.
Step S22, if the initial text information does not contain preset sensitive words, taking the initial text information as target text information and displaying the target text information;
The APK may obtain user identity information, such as the user's nickname and avatar, at installation time. If no sensitive words are recognized in the initial text information, the initial text information can be taken as the target text information and displayed on the television screen together with the user identity information. Specifically, the display mode and display format of the target text information may be set by the user. The display modes may include fixed-position display and scrolling display, and may further include top, center, and bottom display. The display formats may include text font, font size, text color, text transparency, and the like.
Step S23, if the initial text information contains a preset sensitive word, filtering the initial text information, taking the filtered initial text information as target text information, and displaying the target text information.
If the initial text information is recognized as containing sensitive words, the initial text information can be filtered. The filtering may consist of not displaying the sensitive words in the initial text information, or of replacing the sensitive words with symbols. The target text information obtained after filtering is displayed on the television screen.
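Both filtering modes just described (dropping the word, or replacing it with symbols) can be sketched as one function. The sensitive-word list is a placeholder; real lists would be much larger and likely matched as substrings rather than whole words.

```python
# Sketch of the sensitive-word filtering step: either drop matched words
# or replace them with mask symbols. The word list is a placeholder.
SENSITIVE_WORDS = {"badword"}  # hypothetical preset list

def filter_text(initial_text, mask=True):
    out = []
    for word in initial_text.split():
        if word.lower() in SENSITIVE_WORDS:
            if mask:
                out.append("*" * len(word))  # replace with symbols
            # else: drop the word entirely (not displayed)
        else:
            out.append(word)
    return " ".join(out)

print(filter_text("this badword movie"))              # -> this ******* movie
print(filter_text("this badword movie", mask=False))  # -> this movie
```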
In this embodiment, sensitive-word recognition is performed on the initial text information, and if sensitive words are recognized, their content is filtered out, helping maintain a civil and harmonious network environment.
The embodiment of the invention also provides a voice message interaction device. Referring to fig. 5, fig. 5 is a schematic diagram of the voice message interaction device according to the scheme of the embodiment of the invention. As shown in fig. 5, the voice message interaction device may include:
the video file analysis module 101 is configured to acquire a video file being played and analyze video file information of the video file;
the voice recognition and analysis module 102 is used for collecting user voice information and converting the user voice information into initial text information;
a user intention obtaining module 103, configured to obtain a user voice intention based on the video file information and the initial text information;
and the text information display module 104 is configured to, if the user voice intention is an interaction intention, convert the initial text information into target text information, and display the target text information.
Optionally, the voice message interaction apparatus further includes:
the signal identification module is used for identifying a signal source of the video which is being played after the video playing is detected to start;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
Optionally, the speech recognition parsing module 102 is further configured to:
collecting mixed audio information in a far field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
Optionally, the user intention acquisition module 103 is further configured to:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
Optionally, the user intention acquisition module 103 is further configured to:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
Optionally, the user intention acquisition module 103 is further configured to:
if the number of continuous appearance times of the same text content in the initial text information is identified to exceed the preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used expression setting page, and setting the same text content as the interactive commonly used expression to be used as one of the extraction bases of the key text information.
Optionally, the text information display module 104 is further configured to:
after the voice intention of the user is judged to be an interaction intention, the initial text information is obtained, and whether the initial text information contains preset sensitive words or not is identified;
if the initial text information does not contain preset sensitive words, the initial text information is used as target text information, and the target text information is displayed;
and if the initial text information contains preset sensitive words, filtering the initial text information, taking the filtered initial text information as target text information, and displaying the target text information.
An embodiment of the present invention further provides an electronic device, where the electronic device includes: a memory, a processor, and a voice message interaction program stored on the memory and executable on the processor, the voice message interaction program configured to implement the steps of the voice message interaction method as described above.
An embodiment of the present invention further provides a computer-readable storage medium, where a voice message interaction program is stored on the computer-readable storage medium, and when executed by a processor, the voice message interaction program implements the steps of the voice message interaction method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A voice message interaction method, wherein the voice message interaction method is applied to a television device, and the voice message interaction method comprises the following steps:
acquiring a video file being played, and analyzing video file information of the video file;
collecting user voice information, and converting the user voice information into initial text information;
acquiring a user voice intention based on the video file information and the initial text information;
and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
2. The voice message interaction method of claim 1, further comprising, prior to the step of retrieving the video file being played:
after the video playing is detected to start, identifying a signal source of the video playing;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
3. The voice message interaction method as claimed in claim 1, wherein the step of collecting the user voice information and converting the user voice information into the initial text information comprises:
collecting mixed audio information in a far-field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
4. The voice message interaction method as claimed in claim 1, further comprising, before the step of acquiring the user's voice intention based on the video file information and the initial text information:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
5. The voice message interaction method as claimed in claim 4, wherein the step of acquiring the user's voice intention based on the video file information and the initial text information comprises:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
6. The voice message interaction method as claimed in claim 5, further comprising, after the step of determining the user's voice intention according to the prediction result:
if the number of continuous appearance times of the same text content in the initial text information is identified to exceed the preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used expression setting page, and setting the same text content as the interactive commonly used expression to be used as one of the extraction bases of the key text information.
7. The voice message interaction method as claimed in any one of claims 1 to 6, wherein the step of converting the initial text information into target text information and displaying the target text information if the user's voice intention is an interaction intention comprises:
after the voice intention of the user is judged to be an interactive intention, acquiring the initial text information, and identifying whether the initial text information contains preset sensitive words or not;
if the initial text information does not contain preset sensitive words, the initial text information is used as target text information, and the target text information is displayed;
and if the initial text information contains preset sensitive words, filtering the initial text information, taking the filtered initial text information as target text information, and displaying the target text information.
8. A voice message interaction apparatus, comprising:
the video file analysis module is used for acquiring a video file which is being played and analyzing video file information of the video file;
the voice recognition analysis module is used for acquiring user voice information and converting the user voice information into initial text information;
the user intention acquisition module is used for acquiring the voice intention of the user based on the video file information and the initial text information;
and the text information display module is used for converting the initial text information into target text information and displaying the target text information if the user voice intention is an interactive intention.
9. An electronic device, characterized in that the electronic device comprises: a memory, a processor, and a voice message interaction program stored on the memory and executable on the processor, the voice message interaction program configured to implement the steps of the voice message interaction method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a voice message interaction program is stored on the computer-readable storage medium, and when executed by a processor, the voice message interaction program implements the steps of the voice message interaction method as claimed in any one of claims 1 to 7.
CN202210227932.4A 2022-03-07 2022-03-07 Voice message interaction method, device, equipment and storage medium Pending CN114598922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210227932.4A CN114598922A (en) 2022-03-07 2022-03-07 Voice message interaction method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114598922A true CN114598922A (en) 2022-06-07

Family

ID=81809386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210227932.4A Pending CN114598922A (en) 2022-03-07 2022-03-07 Voice message interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114598922A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107613400A (en) * 2017-09-21 2018-01-19 北京奇艺世纪科技有限公司 A kind of implementation method and device of voice barrage
CN109147784A (en) * 2018-09-10 2019-01-04 百度在线网络技术(北京)有限公司 Voice interactive method, equipment and storage medium
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN110709931A (en) * 2017-06-06 2020-01-17 赛普拉斯半导体公司 System and method for audio pattern recognition
CN110942779A (en) * 2019-11-13 2020-03-31 苏宁云计算有限公司 Noise processing method, device and system
CN111586469A (en) * 2020-05-12 2020-08-25 腾讯科技(深圳)有限公司 Bullet screen display method and device and electronic equipment
CN112104909A (en) * 2019-06-18 2020-12-18 上海哔哩哔哩科技有限公司 Interactive video playing method and device, computer equipment and readable storage medium
CN113573155A (en) * 2021-07-22 2021-10-29 深圳创维-Rgb电子有限公司 Voice bullet screen implementation method and device, intelligent device and readable storage medium
CN114120984A (en) * 2021-12-08 2022-03-01 思必驰科技股份有限公司 Voice interaction method, electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination