CN114598922A - Voice message interaction method, device, equipment and storage medium - Google Patents

Voice message interaction method, device, equipment and storage medium

Info

Publication number
CN114598922A
CN114598922A
Authority
CN
China
Prior art keywords
text information
voice
user
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210227932.4A
Other languages
Chinese (zh)
Inventor
段洁斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN202210227932.4A
Publication of CN114598922A
Legal status: Pending

Classifications

    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G06F40/30 Handling natural language data — semantic analysis
    • G06N3/045 Neural network architectures — combinations of networks
    • G06N3/08 Neural networks — learning methods
    • G10L15/26 Speech recognition — speech to text systems
    • H04N21/4415 Acquiring end-user identification using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • H04N21/4666 Learning process for intelligent management, e.g. learning user preferences for recommending movies, using neural networks
    • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles

Abstract

The invention discloses a voice message interaction method, apparatus, device and storage medium, wherein the method comprises the following steps: acquiring a video file that is being played, and parsing video file information of the video file; collecting user voice information and converting it into initial text information; obtaining the user's voice intention based on the video file information and the initial text information; and if the user's voice intention is an interaction intention, converting the initial text information into target text information and displaying the target text information. The invention achieves the technical effect of simplifying the bullet-screen sending process while watching television.

Description

Voice message interaction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of voice recognition technologies, and in particular, to a voice message interaction method, apparatus, device, and computer-readable storage medium.
Background
In recent years, with the continuous development of smart televisions, users' expectations for the viewing experience in large-screen scenarios have grown ever higher, and television usage scenarios have become increasingly diverse. Bullet-screen ("barrage") culture has gradually emerged, and as a common way for users to interact while watching videos, bullet screens are widely used when watching television videos.
When a user wants to interact via bullet screens while watching television, the user usually has to input the bullet-screen content into the television by means of a remote controller or Bluetooth voice. On the one hand, this requires a remote controller or other peripherals; on the other hand, it requires entering a bullet-screen input mode and other steps that interrupt the viewing experience, so the operation of sending a bullet screen is cumbersome.
Disclosure of Invention
The main purpose of the present invention is to provide a voice message interaction method, apparatus, device and computer-readable storage medium, aiming to solve the problem that sending a bullet screen during television viewing is a cumbersome operation.
In order to achieve the above object, the present invention provides a voice message interaction method, which comprises:
acquiring a video file being played, and analyzing video file information of the video file;
collecting user voice information, and converting the user voice information into initial text information;
acquiring a user voice intention based on the video file information and the initial text information;
and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
Optionally, before the step of obtaining the video file being played, the method further includes:
after the video playing is detected to start, identifying a signal source of the video playing;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
Optionally, the step of collecting the user voice information and converting the user voice information into the initial text information includes:
collecting mixed audio information in a far field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
Optionally, before the step of obtaining the user's voice intention based on the video file information and the initial text information, the method further includes:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
Optionally, the step of obtaining the user's voice intention based on the video file information and the initial text information includes:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
Optionally, after the step of determining the user's voice intention according to the prediction result, the method further includes:
if the same text content in the initial text information is identified as appearing consecutively more than a preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used term setting page, and setting the same text content as an interactive commonly used term to serve as one of the bases for extracting the key text information.
Optionally, if the user voice intention is an interaction intention, the step of converting the initial text information into target text information and displaying the target text information includes:
after the voice intention of the user is judged to be an interaction intention, the initial text information is obtained, and whether the initial text information contains preset sensitive words or not is identified;
if the initial text information does not contain a preset sensitive word, taking the initial text information as the target text information, and displaying the target text information;
and if the initial text information contains a preset sensitive word, filtering the initial text information, taking the filtered initial text information as the target text information, and displaying the target text information.
In addition, to achieve the above object, the present invention further provides a voice message interaction apparatus, including:
the video file analysis module is used for acquiring a video file which is being played and analyzing video file information of the video file;
the voice recognition analysis module is used for acquiring user voice information and converting the user voice information into initial text information;
the user intention acquisition module is used for acquiring the voice intention of the user based on the video file information and the initial text information;
and the text information display module is used for converting the initial text information into target text information and displaying the target text information if the user voice intention is an interactive intention.
In addition, to achieve the above object, the present invention also provides an electronic device including: a memory, a processor, and a voice message interaction program stored on the memory and executable on the processor, the voice message interaction program configured to implement the steps of the voice message interaction method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a voice message interaction program stored thereon, which, when executed by a processor, implements the steps of the voice message interaction method as described above.
The method comprises the steps of acquiring a video file that is being played and parsing video file information of the video file; collecting user voice information and converting it into initial text information; obtaining the user's voice intention based on the video file information and the initial text information; and, if the user's voice intention is an interaction intention, converting the initial text information into target text information and displaying it. In this way, the user inputs voice information to the television device without peripherals such as a remote controller; if the television device judges that the user's voice intention is an interaction intention, it converts the user's voice information into target text information and displays it on the television screen, which simplifies the process of sending a bullet screen and improves the fluency with which the user sends bullet screens.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice message interaction method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a voice message interaction method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a voice message interaction method according to a third embodiment of the present invention;
fig. 5 is a schematic diagram of a voice message interaction apparatus according to an embodiment of the present invention;
the implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Generally, to send a bullet screen on the television side, a user needs to turn on a bullet-screen switch and then input the bullet-screen content either by typing characters with a remote controller or by using the remote controller's Bluetooth voice function.
The main technical scheme of the invention is as follows: acquiring a video file being played, and analyzing video file information of the video file; collecting user voice information, and converting the user voice information into initial text information; acquiring a user voice intention based on the video file information and the initial text information; and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM), or a Non-Volatile Memory (NVM) such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice message interaction program.
In the electronic device shown in fig. 1, the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the electronic device of the present invention may be disposed in the electronic device, and the electronic device calls the voice message interaction program stored in the memory 1005 through the processor 1001 and executes the following steps:
acquiring a video file being played, and analyzing video file information of the video file;
collecting user voice information, and converting the user voice information into initial text information;
acquiring a user voice intention based on the video file information and the initial text information;
and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
after the video playing is detected to start, identifying a signal source of the video playing;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
collecting mixed audio information in a far field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
if the same text content in the initial text information is identified as appearing consecutively more than a preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used term setting page, and setting the same text content as an interactive commonly used term to serve as one of the bases for extracting the key text information.
In one embodiment, the electronic device, through the processor 1001 calling the voice message interaction program stored in the memory 1005, may further perform the following steps:
after the voice intention of the user is judged to be an interaction intention, the initial text information is obtained, and whether the initial text information contains preset sensitive words or not is identified;
if the initial text information does not contain a preset sensitive word, taking the initial text information as the target text information, and displaying the target text information;
and if the initial text information contains a preset sensitive word, filtering the initial text information, taking the filtered initial text information as the target text information, and displaying the target text information.
An embodiment of the present invention provides a voice message interaction method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a voice message interaction method according to the present invention.
In this embodiment, the voice message interaction method includes:
step S10, acquiring a video file being played, and analyzing video file information of the video file;
the message interaction can be understood as that the user sends comments about the video content in a bullet screen sending mode and expresses own opinions. At present, video websites and video software of many mobile device terminals support sending barrages. When a user watches video played by a television, different signal sources can be selected, and the video content of all the signal sources does not support the transmission of barrage.
As an example, before the step of obtaining the video file being played, the method may include:
step A1, after detecting that video playing starts, identifying the signal source of the video. Generally, a television can be provided with a signal source interface, and Video content from a television box, a network on demand, a computer, a DVD (Digital Video Disc), or other devices or networks can be played by accessing different signal sources.
Step A2, according to the signal type of the signal source, judging whether the video supports message interaction. The signal types may be classified into an HDMI (High Definition Multimedia Interface) signal, a VGA (Video Graphics Array) signal, an AV (Audio & Video) signal, and a network on demand signal. Web-on-demand may be understood as watching internet video on a television. Before web-on-demand, a user may install video application software on a television.
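The signal-source check in steps A1 and A2 can be sketched as follows. The signal-type names and the rule that only network on-demand playback supports bullet screens are illustrative assumptions, not the patent's exact decision logic:

```python
from enum import Enum

class SignalType(Enum):
    HDMI = "hdmi"
    VGA = "vga"
    AV = "av"
    NETWORK_ON_DEMAND = "network_on_demand"

def supports_message_interaction(signal: SignalType) -> bool:
    """In this sketch, only app-driven network on-demand playback can
    overlay and send bullet screens; externally decoded inputs
    (HDMI/VGA/AV) are assumed not to support message interaction."""
    return signal is SignalType.NETWORK_ON_DEMAND
```

Grouping the check behind one predicate keeps the rest of the pipeline (steps S10 onward) independent of how signal types are enumerated.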
When the video played by the television supports sending bullet screens, an Android application package (APK) in the television can acquire the video file being played and parse the video file information related to it. The video file information may contain person information, image information, text information, and the like. The person information may include the names of the characters, cast and crew, etc. in the video file. The image information may be actor and character information obtained by capturing frames as playback progresses and analyzing those frames. The text information may comprise subtitles in the video file.
Step S20, collecting user voice information, and converting the user voice information into initial text information;
when a user watches video played by a television, the television can acquire user voice information through a preset voice recognition system and convert the user voice information into initial text information.
As an example, the step of collecting the user voice information and converting the user voice information into the initial text information may include:
and step B1, collecting mixed audio information in a far-field range, stripping user voice information in the mixed audio information, and converting the user voice information into initial text information. The voice capture system in the television may be a set of 4-way microphones mounted at the bottom of the television screen. Through the microphone array technology, the television can perform far-field speech recognition. The far field range in this embodiment may be 3-5 meters. In the process of collecting the voice information of the user, the television can also generate audio information when playing videos, and the voice collecting system can collect mixed audio information firstly. By configuring an AEC (Acoustic Echo Cancellation) algorithm in the voice acquisition system, Echo Cancellation can be performed on the acquired mixed audio information, the audio information played by the television can be removed, and the user voice information can be stripped. And sending the user voice information obtained after the AEC algorithm processing to an APK (android package), and analyzing the user voice information by the APK to convert the user voice information into initial text information.
Step S30, based on the video file information and the initial text information, obtaining the voice intention of the user;
the user's voice intention may be divided into an interactive intention and an instructional intention according to the purpose of the user's input of voice. When the content contained in the initial text information is successfully matched with the video file information, the voice intention of the user can be judged to be the interaction intention. When the initial text information is a control instruction such as fast forward, fast backward, pause and the like, the voice intention of the user can be judged to be an instruction intention. Trigger words related to the control instructions can be preset in the television, and when the trigger words are identified to be contained in the initial text information, the control instructions corresponding to the trigger words are directly executed.
Step S40, if the user voice intention is an interaction intention, converting the initial text information into target text information, and displaying the target text information.
After the user's voice intention is judged to be an interaction intention, the bullet-screen content can be displayed on the television screen according to the initial text information. Note that the finally displayed bullet-screen content may differ from the initial text information: if the initial text information contains a preset sensitive word, the initial text information can be filtered first, and the filtered target text information is then displayed on the television screen. In addition, the APK can also identify the user's identity information and display it in association with the target text information. The user identity information may comprise a user nickname and an avatar.
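A simple sketch of the sensitive-word filtering in step S40, assuming each match is masked with asterisks (the patent does not specify the masking style, and the word list is hypothetical):

```python
def filter_sensitive(text: str, sensitive_words: set[str],
                     mask: str = "*") -> str:
    """Replace each occurrence of a preset sensitive word with a mask
    of equal length; the result becomes the target text information."""
    for word in sensitive_words:
        text = text.replace(word, mask * len(word))
    return text
```

If no sensitive word matches, the text passes through unchanged, matching the branch where the initial text information is used directly as the target text information.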
In this embodiment, after a video is detected to start playing, whether the video supports message interaction is judged. If it does, the video file being played is obtained and its video file information is parsed; user voice information is collected and converted into initial text information; the user's voice intention is obtained based on the video file information and the initial text information; and if the user's voice intention is an interaction intention, the initial text information is converted into target text information and displayed. The user can thus send a bullet screen simply by speaking to the television, without peripherals such as a remote controller, which improves the fluency of sending bullet screens and simplifies the bullet-screen sending flow.
Further, in a second embodiment of the voice message interaction method of the present invention, referring to fig. 3, the voice message interaction method of the present invention includes:
step S11, sending the video file information to a preset server as a training set of an initial intention prediction model in the server to establish a trained target intention prediction model;
An initial intention prediction model can be preset in the server. When the initial intention prediction model is established, existing network video introduction information is used as the training set for a CNN (Convolutional Neural Network) algorithm; the training target at this stage is the interaction intention. The network video introduction information may include the cast and crew of the network video, a plot synopsis, and the like.
When the user watches the video, the APK can send the video file information obtained by analysis to the server, and the video file information is used as a training set to train the initial intention prediction model.
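A toy forward pass of a CNN-style text classifier of the kind the server might host. The vocabulary size, dimensions, class labels, and random (untrained) weights are purely illustrative; actual training on the video-introduction corpus and parsed video file information is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the server-side model: 100-word vocabulary,
# 2 output classes (interaction / instruction). Weights are random,
# i.e. untrained.
EMB_DIM, NUM_FILTERS, KERNEL, NUM_CLASSES = 16, 8, 3, 2
W_emb = rng.normal(size=(100, EMB_DIM))
W_conv = rng.normal(size=(NUM_FILTERS, KERNEL, EMB_DIM))
W_out = rng.normal(size=(NUM_FILTERS, NUM_CLASSES))

def predict_intent(token_ids: list[int]) -> np.ndarray:
    """Forward pass of a 1-D convolutional text classifier:
    embed -> convolve over time -> ReLU + max-pool -> linear -> softmax."""
    x = W_emb[token_ids]                                   # (T, EMB_DIM)
    T = len(token_ids)
    conv = np.stack([
        np.array([np.sum(W_conv[f] * x[t:t + KERNEL])
                  for t in range(T - KERNEL + 1)])
        for f in range(NUM_FILTERS)
    ])                                                     # (F, T-K+1)
    pooled = np.maximum(conv, 0.0).max(axis=1)             # ReLU + max pool
    logits = pooled @ W_out                                # (NUM_CLASSES,)
    p = np.exp(logits - logits.max())
    return p / p.sum()                                     # class probabilities
```

The server would return the argmax class (interaction or instruction) as the prediction result sent back to the television.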
Step S12, extracting key text information in the initial text information;
when the intention prediction is performed, the initial text information may be pre-processed first, and the key text information in the initial text information is extracted, so as to reduce the data processing amount during the intention prediction. The APK may extract the key text information in the initial text information after parsing the user voice information into the initial text information. In another embodiment, the APK sends the initial text message directly to the server, and the server extracts the key text message. The key text information may be extracted by first identifying whether the initial text information includes an interactive keyword, and extracting the interactive keyword in the initial text information as the key text information.
Step S13, the key text information is sent to the server so that the server can obtain a prediction result through the prediction of the target intention prediction model and return the prediction result;
and the APK sends the extracted key text information to the server, the server inputs the key text information into the target intention prediction model, the target intention prediction model can output a prediction result, and the server returns the prediction result to the television.
Step S14, determining the user' S voice intention according to the prediction result;
the prediction result output by the target intention prediction model can be divided into an interaction type and an instruction type, and after receiving the prediction result, the APK can judge that the voice intention of the user is an interaction intention or an instruction intention according to the prediction type.
Step S15, if the same text content is identified as occurring consecutively in the initial text information more than a preset number of times, displaying an interactive commonly used term setting page;
While the user watches a video, the bullet-screen content the user wants to send may have no textual association with the video file information. For example, the user may want to send a bullet screen such as "so funny" or "boring", and such text content may be missed when the key text information is extracted. The same text content occurring consecutively more than a preset number of times is therefore taken as a trigger condition, and when the trigger condition is met, the interactive commonly used term setting page is displayed. Through this page, the user can set frequently sent bullet-screen content as interactive commonly used terms, improving the accuracy of bullet-screen sending. The preset number of times may be set to 2; for example, when the voice information input by the user is "too happy, too happy", the interactive commonly used term setting page will be displayed on the television screen.
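The trigger condition can be sketched as a check over the history of recognized texts. Whether the threshold comparison is strict or inclusive is not fixed by the text (the example uses two occurrences), so this sketch assumes an inclusive comparison.

```python
# Sketch of the trigger condition: if the same recognized text arrives
# PRESET_TIMES or more times in a row, show the commonly-used-term page.
PRESET_TIMES = 2  # value suggested in the description

def should_show_settings_page(history):
    """history: list of recognized texts, newest last."""
    if not history:
        return False
    last = history[-1]
    run = 0
    for text in reversed(history):
        if text != last:
            break
        run += 1  # count the consecutive tail of identical texts
    return run >= PRESET_TIMES

print(should_show_settings_page(["too happy", "too happy"]))  # -> True
print(should_show_settings_page(["hello"]))                   # -> False
```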
Step S16, receiving a setting instruction based on the interactive commonly used term setting page, and setting the same text content as an interactive commonly used term, to serve as one of the bases for extracting the key text information.
The setting instruction may be received as a voice setting instruction from the user, or through a remote controller. After the setting instruction is received, the identified same text content may be set as an interactive commonly used term. When an interactive commonly used term is subsequently identified, it is taken as key text information, extracted, and sent to the server.
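The effect of saving a commonly used term on later extraction can be sketched as follows; the function and variable names are illustrative only, and the exact-match rule is an assumption.

```python
# Hypothetical sketch: once a phrase has been saved as an interactive
# commonly used term, later utterances matching it exactly are taken
# as key text information directly.
COMMON_TERMS = set()

def set_common_term(text):
    """Handle the setting instruction from the settings page."""
    COMMON_TERMS.add(text)

def match_common_term(initial_text):
    """Return the phrase as key text if it is a saved common term."""
    return initial_text if initial_text in COMMON_TERMS else None

set_common_term("too happy")
print(match_common_term("too happy"))  # -> too happy
print(match_common_term("hello"))      # -> None
```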
In this embodiment, video file information is used as a training set to establish a target intention prediction model; key text information is input into the target intention prediction model, and the prediction type it outputs yields the user's voice intention. By setting interactive commonly used terms, the user's interactive phrasing habits can be fitted, improving the accuracy of recognizing the interaction intention.
Further, in a third embodiment of the voice message interaction method of the present invention, referring to fig. 4, the voice message interaction method of the present invention includes:
Step S21, after the user's voice intention is judged to be an interaction intention, acquiring the initial text information and identifying whether the initial text information contains preset sensitive words;
After the user's voice intention is judged to be an interaction intention, the initial text information can be cached by the APK, and whether it contains preset sensitive words is identified. Sensitive words may include uncivil language, words related to national security, and the like.
Step S22, if the initial text information does not contain preset sensitive words, taking the initial text information as target text information and displaying the target text information;
The APK may obtain user identity information, such as the user's nickname and avatar, at installation time. If no sensitive words are recognized in the initial text information, the initial text information can be taken as the target text information and displayed on the television screen together with the user identity information. Specifically, the display mode and display format of the target text information may be set by the user. The display modes may include fixed-position display and scrolling display, and may further include top, center, and bottom display. The display formats may include text font, font size, text color, text transparency, and the like.
Step S23, if the initial text information contains a preset sensitive word, filtering the initial text information, taking the filtered initial text information as target text information, and displaying the target text information.
If the initial text information is recognized as containing sensitive words, the initial text information can be filtered. The filtering may consist of not displaying the sensitive words in the initial text information, or of replacing the sensitive words with symbols. The target text information obtained after filtering is displayed on the television screen.
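Both filtering modes just described (dropping the word, or replacing it with symbols) can be sketched as one function. The sensitive-word list is a placeholder; real lists would be much larger and likely matched as substrings rather than whole words.

```python
# Sketch of the sensitive-word filtering step: either drop matched words
# or replace them with mask symbols. The word list is a placeholder.
SENSITIVE_WORDS = {"badword"}  # hypothetical preset list

def filter_text(initial_text, mask=True):
    out = []
    for word in initial_text.split():
        if word.lower() in SENSITIVE_WORDS:
            if mask:
                out.append("*" * len(word))  # replace with symbols
            # else: drop the word entirely (not displayed)
        else:
            out.append(word)
    return " ".join(out)

print(filter_text("this badword movie"))              # -> this ******* movie
print(filter_text("this badword movie", mask=False))  # -> this movie
```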
In this embodiment, sensitive-word recognition is performed on the initial text information, and if sensitive words are recognized, their content is filtered out, helping maintain a civil and harmonious network environment.
The embodiment of the invention also provides a voice message interaction device. Referring to fig. 5, fig. 5 is a schematic diagram of the voice message interaction device according to the scheme of the embodiment of the invention. As shown in fig. 5, the voice message interaction device may include:
the video file analysis module 101 is configured to acquire a video file being played and analyze video file information of the video file;
the voice recognition and analysis module 102 is used for collecting user voice information and converting the user voice information into initial text information;
a user intention obtaining module 103, configured to obtain a user voice intention based on the video file information and the initial text information;
and the text information display module 104 is configured to, if the user voice intention is an interaction intention, convert the initial text information into target text information, and display the target text information.
Optionally, the voice message interaction apparatus further includes:
the signal identification module is used for identifying a signal source of the video which is being played after the video playing is detected to start;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
Optionally, the speech recognition parsing module 102 is further configured to:
collecting mixed audio information in a far field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
Optionally, the user intention acquisition module 103 is further configured to:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
Optionally, the user intention acquisition module 103 is further configured to:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
Optionally, the user intention acquisition module 103 is further configured to:
if the number of continuous appearance times of the same text content in the initial text information is identified to exceed the preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used expression setting page, and setting the same text content as the interactive commonly used expression to be used as one of the extraction bases of the key text information.
Optionally, the text information display module 104 is further configured to:
after the voice intention of the user is judged to be an interaction intention, the initial text information is obtained, and whether the initial text information contains preset sensitive words or not is identified;
if the initial text information does not contain preset sensitive words, the initial text information is used as target text information, and the target text information is displayed;
and if the initial text information contains preset sensitive words, filtering the initial text information, taking the filtered initial text information as target text information, and displaying the target text information.
An embodiment of the present invention further provides an electronic device, where the electronic device includes: a memory, a processor, and a voice message interaction program stored on the memory and executable on the processor, the voice message interaction program configured to implement the steps of the voice message interaction method as described above.
An embodiment of the present invention further provides a computer-readable storage medium, where a voice message interaction program is stored on the computer-readable storage medium, and when executed by a processor, the voice message interaction program implements the steps of the voice message interaction method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A voice message interaction method, wherein the voice message interaction method is applied to a television device, and the voice message interaction method comprises the following steps:
acquiring a video file being played, and analyzing video file information of the video file;
collecting user voice information, and converting the user voice information into initial text information;
acquiring a user voice intention based on the video file information and the initial text information;
and if the user voice intention is an interactive intention, converting the initial text information into target text information, and displaying the target text information.
2. The voice message interaction method of claim 1, further comprising, prior to the step of retrieving the video file being played:
after the video playing is detected to start, identifying a signal source of the video playing;
judging whether the video supports message interaction or not according to the signal type of the signal source;
if yes, executing the steps of acquiring the video file being played and analyzing the video file information of the video file.
3. The voice message interaction method as claimed in claim 1, wherein the step of collecting the user voice information and converting the user voice information into the initial text information comprises:
collecting mixed audio information in a far-field range of the television equipment;
and extracting user voice information in the mixed audio information, and converting the user voice information into initial text information.
4. The voice message interaction method as claimed in claim 1, further comprising, before the step of acquiring the user's voice intention based on the video file information and the initial text information:
and sending the video file information to a preset server to serve as a training set of an initial intention prediction model in the server so as to establish a trained target intention prediction model.
5. The voice message interaction method as claimed in claim 4, wherein the step of acquiring the user's voice intention based on the video file information and the initial text information comprises:
extracting key text information in the initial text information;
sending the key text information to the server so that the server can predict through the target intention prediction model to obtain a prediction result and return the prediction result;
and determining the voice intention of the user according to the prediction result.
6. The voice message interaction method as claimed in claim 5, further comprising, after the step of determining the user's voice intention according to the prediction result:
if the number of continuous appearance times of the same text content in the initial text information is identified to exceed the preset number of times, displaying an interactive commonly used term setting page;
and receiving a setting instruction based on the interactive commonly used expression setting page, and setting the same text content as the interactive commonly used expression to be used as one of the extraction bases of the key text information.
7. The voice message interaction method as claimed in any one of claims 1 to 6, wherein the step of converting the initial text information into target text information and displaying the target text information if the user's voice intention is an interaction intention comprises:
after the voice intention of the user is judged to be an interactive intention, acquiring the initial text information, and identifying whether the initial text information contains preset sensitive words or not;
if the initial text information does not contain preset sensitive words, the initial text information is used as target text information, and the target text information is displayed;
and if the initial text information contains preset sensitive words, filtering the initial text information, taking the filtered initial text information as target text information, and displaying the target text information.
8. A voice message interaction apparatus, comprising:
the video file analysis module is used for acquiring a video file which is being played and analyzing video file information of the video file;
the voice recognition analysis module is used for acquiring user voice information and converting the user voice information into initial text information;
the user intention acquisition module is used for acquiring the voice intention of the user based on the video file information and the initial text information;
and the text information display module is used for converting the initial text information into target text information and displaying the target text information if the user voice intention is an interactive intention.
9. An electronic device, characterized in that the electronic device comprises: a memory, a processor, and a voice message interaction program stored on the memory and executable on the processor, the voice message interaction program configured to implement the steps of the voice message interaction method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a voice message interaction program is stored on the computer-readable storage medium, and when executed by a processor, the voice message interaction program implements the steps of the voice message interaction method as claimed in any one of claims 1 to 7.
CN202210227932.4A 2022-03-07 2022-03-07 Voice message interaction method, device, equipment and storage medium Pending CN114598922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210227932.4A CN114598922A (en) 2022-03-07 2022-03-07 Voice message interaction method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114598922A true CN114598922A (en) 2022-06-07

Family

ID=81809386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210227932.4A Pending CN114598922A (en) 2022-03-07 2022-03-07 Voice message interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114598922A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107613400A (en) * 2017-09-21 2018-01-19 北京奇艺世纪科技有限公司 A kind of implementation method and device of voice barrage
CN109147784A (en) * 2018-09-10 2019-01-04 百度在线网络技术(北京)有限公司 Voice interactive method, equipment and storage medium
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN110709931A (en) * 2017-06-06 2020-01-17 赛普拉斯半导体公司 System and method for audio pattern recognition
CN110942779A (en) * 2019-11-13 2020-03-31 苏宁云计算有限公司 Noise processing method, device and system
CN111586469A (en) * 2020-05-12 2020-08-25 腾讯科技(深圳)有限公司 Bullet screen display method and device and electronic equipment
CN112104909A (en) * 2019-06-18 2020-12-18 上海哔哩哔哩科技有限公司 Interactive video playing method and device, computer equipment and readable storage medium
CN113573155A (en) * 2021-07-22 2021-10-29 深圳创维-Rgb电子有限公司 Voice bullet screen implementation method and device, intelligent device and readable storage medium
CN114120984A (en) * 2021-12-08 2022-03-01 思必驰科技股份有限公司 Voice interaction method, electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination