WO2023216414A1 - Voice interaction system and voice interaction method - Google Patents

Voice interaction system and voice interaction method

Info

Publication number
WO2023216414A1
WO2023216414A1 (PCT/CN2022/106046)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
unit
text
screen system
instruction
Prior art date
Application number
PCT/CN2022/106046
Other languages
English (en)
French (fr)
Inventor
徐遥令
徐小清
沈思宽
吴伟
张曼华
张威轶
孙彦竹
姜晓飞
伍银河
袁新艳
Original Assignee
深圳创维-Rgb电子有限公司 (Shenzhen Skyworth-RGB Electronic Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳创维-Rgb电子有限公司 (Shenzhen Skyworth-RGB Electronic Co., Ltd.)
Publication of WO2023216414A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the present application relates to the field of television technology, and in particular to a voice interaction system and a voice interaction method.
  • intelligent voice is increasingly used in mobile electronic products.
  • as a convenient interaction method, intelligent voice has also gradually begun to be applied and recognized by users.
  • the smart voice interaction technology currently used by TVs mainly relies on the processor of the TV terminal to collect speech, understand speech, and generate and execute instructions. This method occupies considerable processor resources during voice interaction and can easily cause the TV to freeze, resulting in a poor user experience.
  • the main purpose of this application is to provide a voice interaction system and a voice interaction method, aiming to solve the problem of TV lag during existing TV voice interaction.
  • this application provides a voice interaction system, which includes: a main screen system, a secondary screen system that establishes a communication connection with the main screen system, and a voice platform that establishes a network connection with the secondary screen system;
  • a voice interaction system which includes: a main screen system, a secondary screen system that establishes a communication connection with the main screen system, and a voice platform that establishes a network connection with the secondary screen system;
  • the main screen system and the secondary screen system are provided in a television;
  • the secondary screen system is used to generate voice packets based on the audio signals collected by the main screen system, send the voice packets to the voice platform, parse the text packets fed back by the voice platform based on the voice packets to generate instruction text, and generate a comprehensive information package based on the instruction text;
  • the voice platform is used to generate mixed data packets according to the comprehensive information packet;
  • the secondary screen system is also used to parse the mixed data packet, obtain the voice response text and response audio signal, display the voice response text, and send the response audio signal to the main screen system for output.
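The round trip described above can be pictured as a minimal simulation. All function names and packet layouts below are illustrative assumptions; the patent specifies the message flow between the three parties, not an API.

```python
# Hypothetical sketch of the main screen / secondary screen / voice platform
# message flow. All names and packet layouts are illustrative assumptions.

def voice_platform_text_packet(voice_packet):
    # Stand-in for the platform's speech recognition and understanding.
    return {"voice_text": voice_packet["audio"],
            "instruction_text": "SET_VOLUME 25"}

def voice_platform_mixed_packet(info_packet):
    # Stand-in for response-text generation and text-to-speech synthesis.
    return {"response_text": "The volume has been adjusted to 25",
            "response_audio": b"<tts-bytes>"}

def secondary_screen_round_trip(audio_signal, main_screen_execute):
    voice_packet = {"audio": audio_signal}              # packet from collected audio
    text_packet = voice_platform_text_packet(voice_packet)
    instruction = text_packet["instruction_text"]
    response_info = main_screen_execute(instruction)    # main screen executes, feeds back
    info_packet = {"description": "please adjust the volume",
                   "response": response_info}           # comprehensive information package
    mixed = voice_platform_mixed_packet(info_packet)
    # The secondary screen displays the text; the main screen plays the audio.
    return mixed["response_text"], mixed["response_audio"]
```

The sketch shows why the main screen stays lightweight: it only supplies audio and executes `main_screen_execute`, while all understanding happens off-device.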
  • the main screen system includes:
  • an acoustic-to-electrical conversion unit, which is used to collect external sound signals;
  • an amplitude adjustment unit, which is used to obtain the internal audio signal;
  • a noise reduction unit, which is connected to the acoustic-to-electrical conversion unit and the amplitude adjustment unit respectively; the noise reduction unit is used to perform noise reduction processing on the external sound signal according to the internal audio signal, to generate an audio signal corresponding to the voice in the external sound signal, and to output the audio signal to the secondary screen system.
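As a rough illustration of the noise reduction unit's role, the sketch below subtracts the amplitude-adjusted internal (program) audio from the externally captured signal. A real TV would use adaptive echo cancellation; the sample-wise subtraction and the `gain` parameter here are simplifying assumptions.

```python
def reduce_noise(external, internal, gain=1.0):
    # Subtract the amplitude-adjusted internal (program) audio from the
    # external microphone signal, sample by sample, leaving an approximation
    # of the user's voice. Illustrative only; not adaptive echo cancellation.
    n = min(len(external), len(internal))
    return [external[i] - gain * internal[i] for i in range(n)]
```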
  • the secondary screen system includes:
  • a sound monitoring and voice acquisition module configured to generate a voice packet according to the audio signal output by the main screen system, and send the voice packet to the voice platform;
  • the text acquisition and instruction matching module is used to receive the text packet fed back by the voice platform based on the voice packet, parse the text packet to generate instruction text, determine the matching instruction corresponding to the instruction text, and output the matching instruction to the main screen system;
  • the information fusion and data decomposition module is connected to the text acquisition and instruction matching module; the information fusion and data decomposition module is used to obtain description information corresponding to the instruction text, receive the response information fed back after the main screen system performs the operation corresponding to the matching instruction, generate a comprehensive information package according to the response information and the description information, and send the comprehensive information package to the voice platform; the information fusion and data decomposition module is also used to receive and parse the mixed data packet output by the voice platform, generate the voice response text and response audio signal, and send the response audio signal to the main screen system for output;
  • a display module, which is connected to the information fusion and data decomposition module; the display module is used to receive and display the voice response text output by the information fusion and data decomposition module.
  • the sound monitoring and voice acquisition module includes a first register, an audio monitoring unit, a switch unit, a delay unit, a conversion cache unit, a feature recognition unit and an extraction and encoding unit;
  • the first input end of the audio monitoring unit is connected to the input end of the delay unit, and the second input end of the audio monitoring unit and the first input end of the feature recognition unit are respectively connected to the first register;
  • the output end of the audio monitoring unit is connected to the input end of the switch unit and the second input end of the feature recognition unit respectively, and the output end of the switch unit is connected to the input end of the conversion cache unit; the conversion cache unit is also connected to the feature recognition unit and the extraction and encoding unit; the extraction and encoding unit is connected to the feature recognition unit;
  • the first register is used to store a preset time length, a preset energy threshold and a preset reference feature value;
  • the audio monitoring unit is configured to receive the audio signal output by the main screen system, and output an interception trigger signal when it detects that the audio signal within the preset time length reaches the preset energy threshold;
  • the switch unit is used to turn on when receiving the interception trigger signal
  • the delay unit is configured to output the audio signal delayed for a preset time length to the conversion cache unit when the switch unit is turned on;
  • the conversion cache unit is configured to allocate a starting storage address to store the audio signal and output the starting storage address when receiving the interception trigger signal;
  • the feature recognition unit is configured to read the preset reference feature value and the audio signal at the starting storage address when receiving the interception trigger signal, and to output an extraction trigger signal to the extraction and encoding unit when the features of the audio signal are consistent with the preset reference feature value;
  • the extraction and encoding unit is configured to read the audio signal according to the starting storage address when receiving the extraction trigger signal, encode the audio signal to form a voice packet, and send the voice packet to the voice platform.
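The audio monitoring unit's energy check can be pictured as a sliding-window energy detector. The window length and threshold play the roles of the preset time length and preset energy threshold stored in the first register; the function name and return convention are assumptions.

```python
def monitor_audio(samples, window_len, energy_threshold):
    # Slide a window of `window_len` samples over the signal and return the
    # index where the window's energy first reaches `energy_threshold` (the
    # point at which an interception trigger signal would be issued), or None.
    for start in range(len(samples) - window_len + 1):
        window = samples[start:start + window_len]
        if sum(s * s for s in window) >= energy_threshold:
            return start
    return None
```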
  • the text acquisition and instruction matching module includes a decoding and parsing unit, an instruction matching unit, a second register, and a storage unit; the instruction matching unit is respectively connected to the decoding and parsing unit, the second register, and the storage unit;
  • the decoding and parsing unit is used to receive and decode the text packet fed back by the voice platform to obtain combined text, and to parse the combined text to obtain voice text and instruction text;
  • the second register is used to store a preset similarity;
  • the storage unit is used to store an instruction table, wherein the instruction table includes a plurality of instruction records and description field information of each instruction record;
  • the instruction matching unit is used to obtain the preset similarity and read each piece of description field information in the instruction table; when the similarity between the instruction text and a piece of description field information reaches the preset similarity, the instruction corresponding to that description field information is recorded as the matching instruction corresponding to the instruction text, and the matching instruction is output to the main screen system.
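A minimal sketch of the similarity comparison, using Python's standard `difflib` ratio as a stand-in measure (the patent does not specify how similarity is computed); the table layout is also an assumption.

```python
import difflib

def match_instruction(instruction_text, instruction_table, preset_similarity=0.6):
    # Compare the instruction text against each record's description field
    # and return the instruction of the first record whose similarity
    # reaches the preset threshold, or None if nothing matches.
    for record in instruction_table:
        ratio = difflib.SequenceMatcher(
            None, instruction_text, record["description"]).ratio()
        if ratio >= preset_similarity:
            return record["instruction"]
    return None
```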
  • the information fusion and data decomposition module includes an information fusion unit, a coding unit and a decoding and decomposition unit;
  • the information fusion unit is configured to receive the response information fed back after the main screen system performs the operation corresponding to the matching instruction, and obtain the description information corresponding to the instruction text, and use the response information and the The descriptive information is used to generate comprehensive information;
  • the encoding unit is connected to the information fusion unit; the encoding unit is used to encode the comprehensive information into the comprehensive information package, and output the comprehensive information package to the voice platform;
  • the decoding and decomposition unit is used to receive and parse the mixed data packet output by the voice platform, separate the voice response text and the response audio signal, send the voice response text to the display module, and send the response audio signal to the main screen system for output.
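The fusion and decomposition roles can be sketched with two tiny helpers, mirroring the "please adjust the volume" / "volume 25" example given later in the description; the dict-based packet layout is an assumption.

```python
def fuse(description, response):
    # Fuse the instruction's description information and the response
    # information into comprehensive information.
    return f"{description}, {response}"

def decompose(mixed_packet):
    # Separate a mixed data packet into its voice response text (for the
    # secondary screen display) and response audio (for the main screen).
    return mixed_packet["text"], mixed_packet["audio"]
```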
  • the voice platform includes a language understanding and text generation module and an information analysis and data generation module;
  • the language understanding and text generation module is used to generate a corresponding text packet according to the voice packet sent by the secondary screen system, and send the text packet to the secondary screen system;
  • the information analysis and data generation module is configured to receive the comprehensive information packet sent by the secondary screen system, generate a mixed data packet according to the comprehensive information packet, and send the mixed data packet to the secondary screen system.
  • the language understanding and text generation module includes a decoding recognition unit, a combined encoding unit and a logical structure conversion unit;
  • the decoding and recognition unit is used to receive and decode the voice packet sent by the secondary screen system to obtain a voice audio signal, and identify the voice audio signal and convert it into voice text;
  • the logical structure conversion unit is connected to the decoding recognition unit and is used to understand the voice text and convert the voice text into instruction text that conforms to the preset voice structure;
  • the combined encoding unit is respectively connected to the decoding and recognition unit and the logical structure conversion unit; the combined encoding unit is used to combine the voice text and the instruction text in a preset order to form a combined text, encode the combined text into the text packet, and send the text packet to the secondary screen system.
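One way to picture the "preset order" combination is a length-prefixed layout: voice text first, instruction text second. The 4-byte big-endian length prefixes are an assumption; the patent only requires that the two texts be combined in a fixed order so the secondary screen can parse them apart.

```python
def encode_text_packet(voice_text, instruction_text):
    # Combine the two texts in a fixed order (voice text first) and encode
    # each as a 4-byte big-endian length prefix followed by UTF-8 bytes.
    v = voice_text.encode("utf-8")
    i = instruction_text.encode("utf-8")
    return len(v).to_bytes(4, "big") + v + len(i).to_bytes(4, "big") + i

def decode_text_packet(packet):
    # Recover (voice_text, instruction_text) from the packet.
    vlen = int.from_bytes(packet[:4], "big")
    voice_text = packet[4:4 + vlen].decode("utf-8")
    off = 4 + vlen
    ilen = int.from_bytes(packet[off:off + 4], "big")
    instruction_text = packet[off + 4:off + 4 + ilen].decode("utf-8")
    return voice_text, instruction_text
```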
  • the information analysis and data generation module includes an analysis decoding unit, a synthesis conversion unit and a hybrid encoding unit;
  • the analysis and decoding unit is configured to receive and decode the comprehensive information packet sent by the secondary screen system to obtain comprehensive information, and analyze the comprehensive information to obtain the voice response text;
  • the synthesis conversion unit is connected to the output end of the analysis and decoding unit; the synthesis conversion unit is used to convert the voice response text into the response audio signal;
  • the mixed encoding unit is connected to the analysis and decoding unit and the synthesis conversion unit; the mixed encoding unit is used to mix and encode the voice response text and the response audio signal to generate the mixed data packet, and send the mixed data packet to the secondary screen system.
  • this application also provides a voice interaction method, which is applied to the secondary screen system; the voice interaction method includes the steps:
  • receiving the mixed data packet generated by the voice platform based on the comprehensive information package, parsing the mixed data packet to obtain a voice response text and a response audio signal, displaying the voice response text, and sending the response audio signal to the main screen system for output.
  • This application provides a voice interaction system and a voice interaction method.
  • the secondary screen system acquires audio signals in real time, generates voice packets, parses the text packets fed back by the voice platform to generate instruction text and comprehensive information packages, and parses the mixed data packet sent by the voice platform to obtain the voice response text and response audio signal, displays the voice response text, and sends the response audio signal to the main screen system for output; the voice platform mainly performs speech understanding; the main screen system only collects sound and performs the corresponding operations. Therefore, in the process of realizing voice interaction, less of the main screen system's processor resources are occupied, the voice interaction response is fast with little delay, TV video processing resources are not occupied, and the video display is clear and smooth, which greatly improves the user experience.
  • Figure 1 is a module schematic diagram of an embodiment of the voice interaction system of the present application.
  • Figure 2 is a module schematic diagram of another embodiment of the voice interaction system of the present application.
  • FIG. 3 is a timing diagram of an embodiment of the voice interaction system of the present application.
  • FIG. 4 is a partially detailed module schematic diagram of an embodiment of the voice interaction system of the present application.
  • Figure 5 is a schematic structural diagram of the text acquisition and instruction matching module of an embodiment of the voice interaction system of the present application.
  • Figure 6 is a schematic structural diagram of an instruction list of an embodiment of the voice interaction system of the present application.
  • FIG. 7 is a partially detailed module schematic diagram of another embodiment of the voice interaction system of the present application.
  • Figure 8 is a schematic diagram of the combined text structure of an embodiment of the voice interaction system of the present application.
  • FIG. 9 is a partially detailed module schematic diagram of another embodiment of the voice interaction system of the present application.
  • Figure 10 is a schematic diagram of the mixed data structure of an embodiment of the voice interaction system of the present application.
  • FIG. 11 is a schematic flowchart of an embodiment of the voice interaction method of this application.
  • the smart voice interaction technology used by TVs mainly adopts two solutions. In the first, the processor of the TV terminal understands speech, then generates and executes instructions; this occupies considerable processor resources during voice interaction, causing the TV to freeze and giving a poor user experience, and because the TV's voice acquisition and recognition are always running, energy consumption is high. In the second, voice detection and extraction, recognition and understanding, command generation, etc. are completed by a back-end voice platform, and the TV terminal only executes commands; information must be transmitted repeatedly between the voice platform and the TV terminal to complete the intelligent voice interaction, so the delay is large, especially when network conditions are poor, resulting in serious lag and a poor experience.
  • the voice interaction system includes: a main screen system 100, a secondary screen system 200 that establishes a communication connection with the main screen system 100, and a voice platform 300 that establishes a network connection with the secondary screen system 200; wherein the main screen system 100 and the secondary screen system 200 are provided in a television;
  • the secondary screen system 200 is configured to generate a voice packet according to the audio signal collected by the main screen system 100, send the voice packet to the voice platform 300, parse the text packet fed back by the voice platform 300 based on the voice packet to generate instruction text, and generate a comprehensive information package according to the instruction text;
  • the voice platform 300 is used to generate a mixed data package according to the comprehensive information package;
  • the secondary screen system 200 is also used to parse the mixed data package to obtain a voice response text and a response audio signal, display the voice response text, and send the response audio signal to the main screen system 100 for output.
  • the main screen system 100 is provided with a sound collection module 11, an instruction execution and information feedback module 12 and an audio driver module 13; the sound collection module 11 is used to collect external sound signals and output corresponding audio signals to the secondary screen system 200.
  • the sound collection module 11 is used to collect external sound signals of the television and internal audio signals of the television. It can be understood that the external sound signals are sounds originating outside the television, including interactive voice signals issued by the user.
  • the sound collection module 11 can filter out the audio signals played by the TV, generate audio signals that only include external sound signals, and send them to the secondary screen system 200 .
  • the audio driving module 13 can be selected according to the actual situation, such as a speaker, and the audio driving module 13 is used to emit sound according to the response audio signal.
  • the secondary screen system 200 processes the audio signal, extracts the audio signal that meets the preset extraction features, encodes the audio signal that meets the preset extraction features to form a voice packet, and sends the voice packet to Voice platform 300.
  • the preset extraction features can be set according to the characteristics of the external sound signal, such as the preset time length, preset energy threshold and preset reference feature value. If an audio signal does not meet the preset extraction features at all, the corresponding segment of the external sound signal does not include a voice signal carrying interactive instructions issued by the user; if multiple consecutively obtained audio signals fail to match, the secondary screen system 200 stops processing audio signals and enters a sleep state to reduce power consumption.
  • after receiving the voice packet, the voice platform 300 decodes it into a voice audio signal, recognizes the voice audio signal to obtain the corresponding text, encodes the text into a text packet, and feeds it back to the secondary screen system 200.
  • after receiving the text packet, the secondary screen system 200 decodes it to obtain the text, performs text parsing to obtain the voice text and instruction text, determines the matching instruction corresponding to the instruction text, and outputs the matching instruction to the main screen system 100; the main screen system 100 executes the operation corresponding to the matching instruction.
  • the operation corresponding to the matching instruction may be an internal operation of the main screen system 100, such as volume adjustment; it may also be a combined internal and external operation, such as calling an internal video player to obtain audio and video content from a content service platform, outputting the content resulting from the instructed operation to the audio and video processing module of the main screen system 100, or directly controlling the audio and video module to switch working states. The main screen system 100 also generates response information for the executed operation, such as "volume 25" or "starting video playback", and sends it to the secondary screen system 200.
  • the voice text is output to the display module 24 of the secondary screen system 200, and the display module 24 displays the voice text so that the user can see the recognized text form of the voice signal. Furthermore, if the user finds that the voice signal was recognized incorrectly, the voice message can be sent to the TV again in time, without having to wait for incorrect voice interaction feedback from the TV before realizing the error. This improves the timeliness of feedback and the visibility of human-computer interaction.
  • the secondary screen system 200 obtains the description information of the current matching instruction, fuses the response information and the instruction's description information into comprehensive information, encodes it into a comprehensive information package and sends it to the voice platform 300. For example, if the response information is "volume 25" and the instruction description information is "please adjust the volume", then the comprehensive information is "please adjust the volume, volume 25".
  • the voice platform 300 generates a mixed data packet based on the integrated information packet.
  • the voice platform 300 decodes the comprehensive information package to obtain the comprehensive information, analyzes and understands the comprehensive information, and obtains the voice response text. For example, if the decoded comprehensive information is "please adjust the volume, volume 25", the parsed voice response text is "The volume has been adjusted to 25"; the voice response text is then converted into a response audio signal, and finally the response text and response audio signal are mixed and encoded into a mixed data packet, which is transmitted to the TV secondary screen system 200 through the network.
  • the secondary screen system 200 parses the received mixed data packet to obtain the voice response text and the response audio signal, sends the voice response text to the display module 24 of the secondary screen system 200 for display so that the user can see the visual feedback text, and sends the response audio signal to the main screen system 100 for output, thereby completing the "voice-to-voice" intelligent interaction with the user.
  • the main screen system 100 is mainly used to collect external sounds to generate audio signals and transmit them to the secondary screen system 200.
  • the secondary screen system 200 generates voice packets corresponding to the voice signals from the audio signals and transmits them to the voice platform 300 through the network.
  • the voice platform 300 converts the voice packet into instruction text that conforms to the predetermined language structure and transmits it to the secondary screen system 200 through the network.
  • the secondary screen system 200 determines the matching command through the command text and sends it to the main screen system 100.
  • the main screen system 100 executes the command and feeds the response information back to the secondary screen system 200; further, the secondary screen system 200 fuses the response information and the description information of the matching instruction into comprehensive information, and transmits it to the voice platform 300 through the network.
  • the voice platform 300 parses and converts the comprehensive information to obtain the voice response text and response audio signal, and transmits the resulting mixed data packets to the TV secondary screen system 200 through the network.
  • the secondary screen system 200 decodes and decomposes the mixed data packets, separating the response text to drive the secondary screen display and the response audio signal to drive the sound module of the main screen system 100 to emit sound, thereby realizing human-computer voice interaction.
  • the secondary screen system 200 acquires audio signals in real time, generates voice packets, parses the text packets fed back by the voice platform 300, generates command text and comprehensive information packets, and analyzes the mixed data packets sent by the voice platform 300 to obtain Voice response text and response audio signal, display the voice response text, and send the response audio signal to the main screen system 100;
  • the voice platform 300 mainly performs speech understanding, and the main screen system 100 only collects sounds and performs the corresponding operations; therefore the voice interaction delay is small, the response speed is fast, TV video processing resources are not occupied, and the video display is clear and smooth. Moreover, compared with existing network-based interaction, the delay is smaller, the interaction experience is better, the power consumption of voice processing is reduced, and the user experience is improved.
  • the secondary screen system 200 includes a sound monitoring and voice acquisition module 21, a text acquisition and instruction matching module 22, an information fusion and data decomposition module 23 and a display module 24;
  • the voice monitoring and voice acquisition module 21 is used to generate a voice packet according to the audio signal output by the main screen system 100, and send the voice packet to the voice platform 300;
  • the text acquisition and instruction matching module 22 is used to receive the text packet fed back by the voice platform 300, parse the text packet to generate instruction text, determine the matching instruction corresponding to the instruction text, and output the matching instruction to the main screen system 100; the information fusion and data decomposition module 23 is connected to the text acquisition and instruction matching module 22;
  • the information fusion and data decomposition module 23 is used to obtain the description information corresponding to the instruction text, receive the response information fed back after the main screen system 100 executes the operation corresponding to the matching instruction, and according to The response information and the description information generate a comprehensive information package, and send the comprehensive information package to the voice platform 300;
  • the information fusion and data decomposition module 23 is also used to receive and parse the mixed data package output by the voice platform 300 to generate the voice response text and the response audio signal, and to send the response audio signal to the main screen system 100 for output; the display module 24 is connected to the information fusion and data decomposition module 23, and the display module 24 is configured to receive and display the voice response text output by the information fusion and data decomposition module 23.
  • the secondary screen system 200 queries the instruction records in the stored instruction table according to the instruction text, finds the instruction record that is most similar to the instruction text, determines it as a matching instruction, and outputs the matching instruction to the main screen system 100 .
  • the instruction execution and information feedback module 12 in the main screen system 100 can be used to write a storage instruction table into the text acquisition and instruction matching module 22 of the secondary screen system 200 in advance.
  • each instruction record in the stored instruction table contains an instruction executable by the TV main screen system 100 together with its description information.
  • the main screen system 100 processor consumes less resources and provides good video display.
  • the network interaction delay is small, the interaction experience is good, and the voice processing power consumption is small and the efficiency is high.
  • the voice platform 300 includes a language understanding and text generation module 31 and an information analysis and data generation module 32; the language understanding and text generation module 31 is used to generate a corresponding text packet according to the voice packet sent by the secondary screen system 200 and send the text packet to the secondary screen system 200; the information analysis and data generation module 32 is used to receive the comprehensive information packet sent by the secondary screen system 200, generate mixed data packets according to the comprehensive information packet, and send the mixed data packets to the secondary screen system 200.
  • the language understanding and text generation module 31 of the voice platform 300 decodes the voice packets to obtain the speech and performs speech understanding: it converts the speech into voice text, converts the voice text into instruction text that conforms to the predetermined language structure, combines the instruction text and the corresponding voice text to form a combined text, encodes it into a text package, and transmits it to the text parsing unit of the TV secondary screen system 200 through the network.
  • the information analysis and data generation module 32 decodes the comprehensive information package to obtain the comprehensive information, analyzes the comprehensive information to obtain the voice response text, and converts the voice response text into a response audio signal; the response text and response audio signal are then mixed-encoded into mixed data packets and transmitted to the TV secondary screen system 200 through the network.
  • the sound collection module 11 of the main screen system 100 specifically includes an acoustic-to-electrical conversion unit 110, an amplitude adjustment unit 111 and a noise reduction unit 112; the acoustic-to-electrical conversion unit 110 is used to collect external sound signals; the amplitude adjustment unit 111 is used to obtain the internal audio signal; the noise reduction unit 112 is used to perform noise reduction processing on the external sound signal according to the internal audio signal, to generate an audio signal corresponding to the voice components in the external sound signal, and to output the audio signal to the secondary screen system 200; wherein the noise reduction unit 112 is connected to the acoustic-to-electrical conversion unit 110 and the amplitude adjustment unit 111 respectively.
  • after receiving the external sound signal, the acoustic-to-electrical conversion unit 110 performs acoustic-to-electrical conversion to obtain the external sound audio signal. The amplitude adjustment unit 111 obtains the program audio signal output by the TV audio and video processing module, i.e. the internal audio signal, and performs amplitude adjustment to obtain a program audio signal with a set amplitude. The noise reduction unit 112 then performs denoising: it compares the frequency content of the external sound audio signal with that of the program audio signal and removes the program audio component from the external sound audio signal, yielding the denoised audio signal. This extracts the external sound signal, so that a clear and accurate speech signal from the user can be obtained, improving the accuracy of voice interaction.
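The denoising step above can be sketched in a few lines. This is a minimal pure-Python illustration, not the patented implementation: it assumes the program audio appears in the microphone signal as a scaled copy, estimates that scale with a least-squares gain, and subtracts the scaled reference. The function name `denoise` and the list-based signal representation are assumptions for the example.

```python
def denoise(external, internal):
    """Remove the TV's own program audio (the internal reference) from the
    microphone signal, keeping the user's speech.

    `external` is the microphone (external sound) signal, `internal` the
    amplitude-adjusted program audio; both are equal-length sample lists.
    """
    # Least-squares gain: how loudly the program audio appears in the mic
    # signal (assumes the user's speech is uncorrelated with the program).
    num = sum(e * i for e, i in zip(external, internal))
    den = sum(i * i for i in internal) or 1.0
    gain = num / den
    # Subtract the scaled reference sample by sample.
    return [e - gain * i for e, i in zip(external, internal)]
```

With orthogonal test signals the program component cancels exactly; on real audio the subtraction is only approximate, which is why practical systems work frame by frame in the frequency domain.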
  • the sound monitoring and speech acquisition module 21 includes a first register 210, an audio monitoring unit 211, a switch unit 212, a delay unit 213, a conversion cache unit 214, a feature recognition unit 215 and an extraction and encoding unit 216. The first input of the audio monitoring unit 211 is connected to the input of the delay unit 213; the second input of the audio monitoring unit 211 and the first input of the feature recognition unit 215 are each connected to the first register 210; the output of the audio monitoring unit 211 is connected to the input of the switch unit 212 and to the second input of the feature recognition unit 215; the output of the switch unit 212 is connected to the input of the conversion cache unit 214; the conversion cache unit 214 is also connected to the feature recognition unit 215 and the extraction and encoding unit 216; and the extraction and encoding unit 216 is connected to the feature recognition unit 215.
  • the first register 210 is used to store a preset time length, a preset energy threshold and a preset reference feature value;
  • the audio monitoring unit 211 is used to receive the audio signal output by the main screen system 100 and, when it detects that the audio signal within the preset time length reaches the preset energy threshold, to output an interception trigger signal;
  • the switch unit 212 is used to turn on upon receiving the interception trigger signal;
  • the delay unit 213 is used, when the switch unit 212 is on, to output the audio signal delayed by the preset time length to the conversion cache unit 214;
  • the conversion cache unit 214 is used, upon receiving the interception trigger signal, to allocate a starting storage address at which to store the audio signal, and to output the starting storage address;
  • the feature recognition unit 215 is used, upon receiving the interception trigger signal, to read the preset reference feature value and the audio signal at the starting storage address, and, when the features of the audio signal match the preset reference feature value, to output an extraction trigger signal to the extraction and encoding unit 216;
  • the extraction and encoding unit 216 is used, upon receiving the extraction trigger signal, to read the audio signal according to the starting storage address, encode the audio signal to form a voice packet, and send the voice packet to the voice platform 300.
  • the preset time length read by the audio monitoring unit 211 from the first register 210 is, for example, Ts, and the preset energy threshold is Es. The audio monitoring unit 211 monitors the average energy value of the audio signal over each time length Ts in real time. If it detects that the average energy value of the audio signal within Ts reaches the preset energy threshold Es, the audio monitoring unit 211 generates an interception trigger signal and starts intercepting audio.
  • under the control of the interception trigger signal, the switch unit 212 turns on the audio switch. The audio signal passes through the delay unit 213, whose delay time can be set to Ts, so that the monitored audio signal whose average energy value reached Es is output through the audio switch to the conversion cache unit 214.
  • the conversion cache unit 214 allocates a starting storage address, performs format conversion on the received audio signal, stores the audio signal starting from the starting storage address, and sends the starting storage address to the feature recognition unit 215. It should be noted that the cache unit may store multiple audio segments.
  • the feature recognition unit 215 starts working after receiving the interception trigger signal. It reads the preset reference feature value from the first register 210, reads the audio signal stored at the starting storage address of the conversion cache unit 214, analyzes the features of this audio signal and compares them with the preset reference feature value. If they do not match, it reads the audio signal stored at the next storage address and again compares its features with the preset reference feature value, continuing in this way until the features of the audio signal stored at some storage address match the preset reference feature value; it then sends an extraction trigger signal to the extraction and encoding unit 216 and marks that storage address as the speech extraction starting address, which it outputs to the extraction and encoding unit 216.
  • the extraction and encoding unit 216 starts working after receiving the extraction trigger signal. Starting from the speech extraction starting address of the conversion cache unit 214, it reads the stored audio signals in sequence; the audio read in this way is the speech to be acquired. It encodes the acquired speech and outputs the encoded speech signal to form a voice packet, which is transmitted over the network to the voice platform 300.
  • it should also be noted that during speech acquisition the audio monitoring unit 211 continues to monitor the audio. When it detects that the average energy value of the audio signal has failed to reach the energy threshold Es for N consecutive windows of length Ts (N being a preset count that can be set according to the actual situation), the audio monitoring unit 211 generates an interception end signal to end this audio interception; the switch unit 212 turns off the audio switch under the control of the interception end signal, closing the audio signal transmission channel.
  • upon receiving the interception end signal, the feature recognition unit 215 outputs an extraction end signal to the conversion cache unit 214 and the encoding unit, and enters the sleep state, i.e. a low-power state. After receiving the extraction end signal, the conversion cache unit 214 clears its cache and enters the sleep state; the encoding unit likewise enters the sleep state after receiving the extraction end signal. This reduces the power consumption of the TV.
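The energy-gated interception logic of the audio monitoring unit 211 can be illustrated with a small sketch. This is a deliberate simplification: it works on a list of per-window (length Ts) average energy values rather than raw samples, `es` stands for the preset energy threshold Es, and `n_end` for the preset count N of quiet windows that ends an interception.

```python
def intercept(energies, es, n_end):
    """Return (start, end) window-index pairs for intercepted segments.

    `energies` holds one average-energy value per window of length Ts.
    A segment starts when a window reaches `es` (interception trigger
    signal) and ends after `n_end` consecutive windows below `es`
    (interception end signal).
    """
    segments, start, below = [], None, 0
    for i, e in enumerate(energies):
        if start is None:
            if e >= es:
                start = i          # interception trigger signal
                below = 0
        elif e < es:
            below += 1
            if below >= n_end:     # interception end signal
                segments.append((start, i - n_end + 1))
                start = None
        else:
            below = 0
    if start is not None:          # audio still active at end of stream
        segments.append((start, len(energies)))
    return segments
```

In the real module the trigger also opens the audio switch and the delay unit replays the Ts of audio preceding the trigger, so the start of the utterance is not clipped.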
  • the text acquisition and instruction matching module 22 includes a decoding and parsing unit 220, an instruction matching unit 221, a second register 222 and a storage unit 223. The instruction matching unit 221 is connected to the decoding and parsing unit 220, the second register 222 and the storage unit 223 respectively. The decoding and parsing unit 220 is used to receive and decode the text packet fed back by the voice platform 300 to obtain the combined text, and to parse the combined text to obtain the voice text and the instruction text.
  • the second register 222 is used to store the preset similarity;
  • the storage unit 223 is used to store an instruction table, where the instruction table includes multiple instruction records and the description field information of each instruction record. The instruction matching unit 221 is used to obtain the preset similarity and read each piece of description field information in the instruction table; when the similarity between the instruction text and a piece of description field information reaches the preset similarity, the instruction of the record to which that description field information belongs is taken as the matching instruction corresponding to the instruction text, and the matching instruction is output to the main screen system 100.
  • the working principle of the text acquisition and instruction matching module 22 is as follows: the decoding and parsing unit 220 receives the text packet and decodes it to obtain the combined text, then parses the text to obtain the voice text and the instruction text; it outputs the voice text to the display module 24 of the secondary screen system 200 and the instruction text to the instruction matching unit 221.
  • after receiving the instruction text, the instruction matching unit 221 reads the preset similarity from the second register 222 and reads the stored instruction table from the storage unit 223. The structure of the instruction table is shown in Figure 6: it contains instruction record 1, instruction record 2, and so on; each instruction record contains description information and an instruction, and the description information contains field 1, field 2, and so on.
  • matching may proceed as follows: the instruction matching unit 221 reads the description information fields of an instruction record in turn and compares the similarity between each field and the instruction text. If the similarity reaches the preset similarity, the instruction of that record is the matching instruction, and the matching instruction is output to the main screen system 100; otherwise the next instruction record is queried. For example, the instruction matching unit 221 reads the description field information of record 1 and first compares field 1 with the instruction text; if the similarity reaches the preset similarity, the instruction of record 1 is the matching instruction; otherwise it compares field 2 with the instruction text, and so on. If no field of record 1 meets the requirement, it reads the description field information of record 2 for comparison. Alternatively, every record whose similarity reaches the preset similarity may be treated as a first matching instruction, and the first matching instruction with the greatest similarity to the instruction text taken as the matching instruction.
  • in this way the instruction records in the stored instruction table are queried using the instruction text, the most similar instruction record is found as the matching instruction, and the matching instruction is output to the main screen system 100, improving the accuracy of voice interaction.
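As an illustration of the field-by-field similarity matching, the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever similarity measure the system actually uses. The table layout (a `description` field list plus a `command`) and the 0.6 preset similarity are assumptions for the example, not values from the patent.

```python
from difflib import SequenceMatcher

def match_instruction(instruction_text, table, preset_similarity=0.6):
    """Scan the instruction table record by record, field by field; the
    first record with a description field reaching the preset similarity
    supplies the matching instruction. Returns None if nothing matches."""
    for record in table:
        for field in record["description"]:
            sim = SequenceMatcher(None, field, instruction_text).ratio()
            if sim >= preset_similarity:
                return record["command"]
    return None
```

A miss (return value `None`) corresponds to the case where no instruction record in the stored table is similar enough to the instruction text.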
  • the speech understanding and text generation module 31 includes a decoding and recognition unit 310, a combined encoding unit 311 and a logical structure conversion unit 312. The decoding and recognition unit 310 is used to receive and decode the voice packet sent by the secondary screen system 200 to obtain a speech audio signal, and to recognize the speech audio signal and convert it into voice text. The logical structure conversion unit 312 is connected to the decoding and recognition unit 310 and is used to understand the voice text and convert it into instruction text that conforms to a preset language structure. The combined encoding unit 311 is connected to the decoding and recognition unit 310 and the logical structure conversion unit 312 respectively; it is used to combine the voice text and the instruction text in a preset order to form a combined text, encode the combined text into the text packet, and send the text packet to the secondary screen system 200.
  • in this embodiment, the decoding and recognition unit 310 receives and decodes the voice packet to obtain a speech audio signal, then performs audio signal recognition to convert the audio into voice text.
  • a language structure for conversion, i.e. the preset language structure, is set in advance in the logical structure conversion unit 312 and can be configured according to the user's language habits and similar factors. After logically understanding the voice text, the unit converts the voice text into instruction text that conforms to the preset language structure.
  • the combined encoding unit 311 combines the voice text and the instruction text in a preset order to form a combined text, which is then encoded into a text packet and transmitted over the network to the TV secondary screen system 200. The preset order may simply be front-to-back, and the combined text structure is shown in Figure 8. This completes the recognition and conversion of the voice packet, so that the main screen system 100 and the secondary screen system 200 of the TV can perform the corresponding operations.
  • the information analysis and data generation module 32 includes an analysis and decoding unit 320, a synthesis and conversion unit 321 and a mixed encoding unit 322. The analysis and decoding unit 320 is used to receive and decode the comprehensive information packet sent by the secondary screen system 200 to obtain the comprehensive information, and to analyze the comprehensive information to obtain the voice response text. The synthesis and conversion unit 321 is connected to the output of the analysis and decoding unit 320 and is used to convert the voice response text into the response audio. The mixed encoding unit 322 is connected to the synthesis and conversion unit 321 and the analysis and decoding unit 320; it is used to mix-encode the voice response text and the response audio, generate the mixed data packet, and send the mixed data packet to the secondary screen system 200. The structure of the mixed data is illustrated in Figure 10.
  • the information fusion and data decomposition module 23 includes an information fusion unit 230, an encoding unit 231 and a decoding and decomposition unit 232. The information fusion unit 230 is used to receive the response information fed back after the main screen system 100 performs the operation corresponding to the matching instruction, to obtain the description information corresponding to the instruction text, and to generate comprehensive information from the response information and the description information. The encoding unit 231 is connected to the information fusion unit 230 and is used to encode the comprehensive information into the comprehensive information packet and output it to the voice platform 300. The decoding and decomposition unit 232 is used to receive and parse the mixed data packet output by the voice platform 300, separate out the voice response text and the response audio signal, send the voice response text to the display module 24, and send the response audio signal to the main screen system 100 for output.
  • in this embodiment, the information fusion unit 230 receives the response information fed back by the main screen system 100, obtains the description information of the current instruction record from the text acquisition and instruction matching module 22, and fuses the response information and the instruction description information into comprehensive information. For example, if the response information is "volume 25" and the instruction description information is "please adjust the volume", the comprehensive information is "please adjust the volume, volume 25".
  • the encoding unit 231 encodes the comprehensive information into a comprehensive information package and sends it to the voice platform 300 through the network.
  • the analysis and decoding unit 320 of the voice platform 300 decodes the comprehensive information packet to obtain the comprehensive information, then analyzes and understands it to obtain the voice response text. For example, if the decoded comprehensive information is "please adjust the volume, volume 25", the resulting voice response text is "the volume has been adjusted to 25". The voice response text is output to the synthesis and conversion unit 321 and the mixed encoding unit 322; the synthesis and conversion unit 321 converts the voice response text into response audio, and the mixed encoding unit 322 mix-encodes the response text and the response audio signal into a mixed data packet, whose structure is shown in Figure 10. The mixed data packet is transmitted over the network to the decoding and decomposition unit 232 of the TV secondary screen system 200. After receiving the mixed data packet, the decoding and decomposition unit 232 decodes and decomposes the data, separating out the response text, which is transmitted to the display module 24 of the secondary screen, and the response audio signal, which is transmitted to the speaker of the main screen system 100, so that the speaker, driven by the response audio signal, emits the voice interaction sound.
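The mixed encoding and the decoding/decomposition performed by unit 232 can be sketched with a length-prefixed layout: a 4-byte big-endian length for the UTF-8 response text, followed by the raw response audio bytes. This format is an assumption for illustration; Figure 10 defines the actual structure.

```python
import struct

def mix_encode(response_text, response_audio):
    """Mixed encoding sketch: 4-byte length prefix for the UTF-8 response
    text, then the raw response audio bytes."""
    text = response_text.encode("utf-8")
    return struct.pack(">I", len(text)) + text + response_audio

def mix_decode(packet):
    """Decoding and decomposition: separate the response text and the
    response audio again, as unit 232 does on the secondary screen."""
    (text_len,) = struct.unpack(">I", packet[:4])
    text = packet[4:4 + text_len].decode("utf-8")
    audio = packet[4 + text_len:]
    return text, audio
```

The length prefix lets the decomposer split one byte stream into its two payloads without any delimiter appearing inside the audio data.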
  • this application also provides a voice interaction method, applied to the secondary screen system of a TV. Referring to Figure 11, the voice interaction method includes the following steps:
  • Step S10: generate a voice packet according to the audio signal collected by the main screen system, and send the voice packet to the voice platform;
  • Step S20: receive and parse the text packet fed back by the voice platform based on the voice packet, generate instruction text, and generate a comprehensive information packet according to the instruction text;
  • Step S30: receive the mixed data packet generated by the voice platform according to the comprehensive information packet, parse the mixed data packet to obtain the voice response text and the response audio signal, display the voice response text, and send the response audio signal to the main screen system for output.
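Steps S10 to S30 can be summarized as one round trip. The sketch below injects stand-in objects for the voice platform, main screen system and display, so every name used here (`understand`, `respond`, `execute`, `play`, `show`) is hypothetical and only mirrors the roles described in the method.

```python
def voice_interaction_round(audio_signal, platform, main_screen, display):
    """One round of the voice interaction method (steps S10 to S30)."""
    # S10: the secondary screen system turns the collected audio into a
    # voice packet and sends it to the voice platform for understanding.
    voice_text, instruction_text = platform.understand(audio_signal)
    # S20: parse the fed-back text packet, have the main screen execute the
    # matched instruction, and fuse its response with the description info.
    response_info = main_screen.execute(instruction_text)
    comprehensive_info = f"{voice_text}, {response_info}"
    # S30: the platform turns the comprehensive info into a mixed data
    # packet; the response text is displayed, the response audio played.
    response_text, response_audio = platform.respond(comprehensive_info)
    display.show(response_text)
    main_screen.play(response_audio)
    return response_text
```

The point of the sketch is the division of labor: the secondary screen orchestrates, the platform understands and responds, and the main screen only executes and plays.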
  • the structures of the main screen system, the secondary screen system and the voice platform can be set up with reference to the embodiments above and are not described again. The secondary screen system acquires audio signals in real time, generates voice packets, parses the text packets fed back by the voice platform to generate instruction text and comprehensive information packets, and parses the mixed data packets sent by the voice platform to obtain the voice response text and response audio signal; it displays the voice response text and sends the response audio signal to the main screen system for output. The voice platform mainly performs speech understanding, and the main screen system only collects sound and carries out the corresponding operations.
  • the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. On this basis, the technical solution of this application, in essence or in the part that contributes beyond the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk or optical disk) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, etc.) to execute the methods described in the various embodiments of this application.


Abstract

This application proposes a voice interaction system and a voice interaction method. The voice interaction system includes a main screen system, a secondary screen system in communication with the main screen system, and a voice platform connected to the secondary screen system over a network; the main screen system and the secondary screen system are arranged in a TV. The secondary screen system is used to generate a voice packet according to the audio signal collected by the main screen system, send the voice packet to the voice platform, parse the text packet fed back by the voice platform based on the voice packet to generate instruction text, and generate a comprehensive information packet according to the instruction text. The voice platform is used to generate a mixed data packet according to the comprehensive information packet. The secondary screen system is further used to parse the mixed data packet to obtain a voice response text and a response audio signal, display the voice response text, and send the response audio signal to the main screen system for output.

Description

Voice interaction system and voice interaction method
This application claims priority to Chinese patent application No. 202210527135.8, filed on May 13, 2022, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of television technology, and in particular to a voice interaction system and a voice interaction method.
Background
With the development of artificial intelligence technology, intelligent voice has found more and more applications in mobile electronic products. In television products, intelligent voice has also begun to be adopted as a convenient mode of interaction and has gained user acceptance. The intelligent voice interaction technology currently used in TVs mainly relies on the TV terminal's processor to collect speech, understand it, and generate and execute instructions. This approach occupies considerable processor resources during voice interaction, easily causes the TV to stutter, and results in a poor user experience.
Technical problem
The main purpose of this application is to provide a voice interaction system and a voice interaction method, aiming to solve the problem of TV stuttering during existing TV voice interaction.
Technical solution
To achieve the above purpose, this application provides a voice interaction system, including: a main screen system, a secondary screen system in communication with the main screen system, and a voice platform connected to the secondary screen system over a network; the main screen system and the secondary screen system are arranged in a TV;
the secondary screen system is used to generate a voice packet according to the audio signal collected by the main screen system, send the voice packet to the voice platform, parse the text packet fed back by the voice platform based on the voice packet to generate instruction text, and generate a comprehensive information packet according to the instruction text;
the voice platform is used to generate a mixed data packet according to the comprehensive information packet;
the secondary screen system is further used to parse the mixed data packet to obtain a voice response text and a response audio signal, display the voice response text, and send the response audio signal to the main screen system for output.
In one embodiment, the main screen system includes:
an acoustic-to-electrical conversion unit, used to collect external sound signals;
an amplitude adjustment unit, used to obtain an internal audio signal;
a noise reduction unit, connected to the acoustic-to-electrical conversion unit and the amplitude adjustment unit respectively; the noise reduction unit is used to perform noise reduction on the external sound signal according to the internal audio signal, so as to generate the audio signal corresponding to the speech in the external sound signal, and to output the audio signal to the secondary screen system.
In one embodiment, the secondary screen system includes:
a sound monitoring and speech acquisition module, used to generate a voice packet according to the audio signal output by the main screen system and send the voice packet to the voice platform;
a text acquisition and instruction matching module, used to receive the text packet fed back by the voice platform based on the voice packet, parse the text packet to generate instruction text, determine the matching instruction corresponding to the instruction text, and output the matching instruction to the main screen system;
an information fusion and data decomposition module, connected to the text acquisition and instruction matching module; the information fusion and data decomposition module is used to obtain the description information corresponding to the instruction text, receive the response information fed back after the main screen system performs the operation corresponding to the matching instruction, generate a comprehensive information packet according to the response information and the description information, and send the comprehensive information packet to the voice platform; the information fusion and data decomposition module is further used to receive and parse the mixed data packet output by the voice platform, generate the voice response text and response audio signal, and send the response audio signal to the main screen system for output;
a display module, connected to the information fusion and data decomposition module and used to receive and display the voice response text output by the information fusion and data decomposition module.
In one embodiment, the sound monitoring and speech acquisition module includes a first register, an audio monitoring unit, a switch unit, a delay unit, a conversion cache unit, a feature recognition unit and an extraction and encoding unit;
the first input of the audio monitoring unit is connected to the input of the delay unit; the second input of the audio monitoring unit and the first input of the feature recognition unit are each connected to the first register; the output of the audio monitoring unit is connected to the input of the switch unit and the second input of the feature recognition unit; the output of the switch unit is connected to the input of the conversion cache unit; the conversion cache unit is also connected to the feature recognition unit and the extraction and encoding unit; the extraction and encoding unit is connected to the feature recognition unit;
the first register is used to store a preset time length, a preset energy threshold and a preset reference feature value;
the audio monitoring unit is used to receive the audio signal output by the main screen system and, when it detects that the audio signal within the preset time length reaches the preset energy threshold, to output an interception trigger signal;
the switch unit is used to turn on upon receiving the interception trigger signal;
the delay unit is used, when the switch unit is on, to output the audio signal delayed by the preset time length to the conversion cache unit;
the conversion cache unit is used, upon receiving the interception trigger signal, to allocate a starting storage address at which to store the audio signal, and to output the starting storage address;
the feature recognition unit is used, upon receiving the interception trigger signal, to read the preset reference feature value and the audio signal at the starting storage address, and, when the features of the audio signal match the preset reference feature value, to output an extraction trigger signal to the extraction and encoding unit;
the extraction and encoding unit is used, upon receiving the extraction trigger signal, to read the audio signal according to the starting storage address, encode the audio signal to form a voice packet, and send the voice packet to the voice platform.
In one embodiment, the text acquisition and instruction matching module includes a decoding and parsing unit, an instruction matching unit, a second register and a storage unit; the instruction matching unit is connected to the decoding and parsing unit, the second register and the storage unit respectively;
the decoding and parsing unit is used to receive and decode the text packet fed back by the voice platform to obtain a combined text, and to parse the combined text to obtain the voice text and the instruction text;
the second register is used to store a preset similarity;
the storage unit is used to store an instruction table, where the instruction table includes multiple instruction records and the description field information of each instruction record;
the instruction matching unit is used to obtain the preset similarity and read each piece of description field information in the instruction table; when the similarity between the instruction text and a piece of description field information reaches the preset similarity, the instruction record corresponding to that description field information is taken as the matching instruction corresponding to the instruction text, and the matching instruction is output to the main screen system.
In one embodiment, the information fusion and data decomposition module includes an information fusion unit, an encoding unit and a decoding and decomposition unit;
the information fusion unit is used to receive the response information fed back after the main screen system performs the operation corresponding to the matching instruction, obtain the description information corresponding to the instruction text, and generate comprehensive information according to the response information and the description information;
the encoding unit is connected to the information fusion unit; the encoding unit is used to encode the comprehensive information into the comprehensive information packet and output the comprehensive information packet to the voice platform;
the decoding and decomposition unit is used to receive and parse the mixed data packet output by the voice platform, separate out the voice response text and the response audio signal, send the voice response text to the display module, and send the response audio signal to the main screen system for output.
In one embodiment, the voice platform includes a language understanding and text generation module and an information analysis and data generation module;
the language understanding and text generation module is used to generate the corresponding text packet according to the voice packet sent by the secondary screen system and send the text packet to the secondary screen system;
the information analysis and data generation module is used to receive the comprehensive information packet sent by the secondary screen system, generate a mixed data packet according to the comprehensive information packet, and send the mixed data packet to the secondary screen system.
In one embodiment, the language understanding and text generation module includes a decoding and recognition unit, a combined encoding unit and a logical structure conversion unit;
the decoding and recognition unit is used to receive and decode the voice packet sent by the secondary screen system to obtain a speech audio signal, and to recognize the speech audio signal and convert it into voice text;
the logical structure conversion unit is connected to the decoding and recognition unit and is used to understand the voice text and convert the voice text into instruction text that conforms to a preset language structure;
the combined encoding unit is connected to the decoding and recognition unit and the logical structure conversion unit respectively; the combined encoding unit is used to combine the voice text and the instruction text in a preset order to form a combined text, encode the combined text into the text packet, and send the text packet to the secondary screen system.
In one embodiment, the information analysis and data generation module includes an analysis and decoding unit, a synthesis and conversion unit and a mixed encoding unit;
the analysis and decoding unit is used to receive and decode the comprehensive information packet sent by the secondary screen system to obtain comprehensive information, and to analyze the comprehensive information to obtain the voice response text;
the synthesis and conversion unit is connected to the output of the analysis and decoding unit; the synthesis and conversion unit is used to convert the voice response text into the response audio;
the mixed encoding unit is connected to the analysis and decoding unit and the synthesis and conversion unit; the mixed encoding unit is used to mix-encode the voice response text and the response audio, generate the mixed data packet, and send the mixed data packet to the secondary screen system.
To achieve the above purpose, this application further provides a voice interaction method, applied to a secondary screen system; the voice interaction method includes the steps of:
generating a voice packet according to the audio signal collected by the main screen system, and sending the voice packet to a voice platform;
receiving and parsing the text packet fed back by the voice platform based on the voice packet, generating instruction text, and generating a comprehensive information packet according to the instruction text;
receiving the mixed data packet generated by the voice platform according to the comprehensive information packet, parsing the mixed data packet to obtain a voice response text and a response audio signal, displaying the voice response text, and sending the response audio signal to the main screen system for output.
Beneficial effects
This application provides a voice interaction system and a voice interaction method. In this voice interaction system the secondary screen system acquires audio signals in real time, generates voice packets, parses the text packets fed back by the voice platform to generate instruction text and comprehensive information packets, parses the mixed data packets sent by the voice platform to obtain the voice response text and response audio signal, displays the voice response text, and sends the response audio signal to the main screen system for output; the voice platform mainly performs speech understanding; the main screen system only collects sound and carries out the corresponding operations. Voice interaction is thus achieved while occupying few processor resources of the main screen system, with fast response and low delay, without occupying the TV's video processing resources, so that the video display remains clear and smooth, greatly improving the user experience.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application or in the existing technology more clearly, the drawings needed in the description of the embodiments or the existing technology are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
Figure 1 is a module diagram of an embodiment of the voice interaction system of this application;
Figure 2 is a module diagram of another embodiment of the voice interaction system of this application;
Figure 3 is a timing diagram of an embodiment of the voice interaction system of this application;
Figure 4 is a partially detailed module diagram of an embodiment of the voice interaction system of this application;
Figure 5 is a structural diagram of the text acquisition and instruction matching module in an embodiment of the voice interaction system of this application;
Figure 6 is a structural diagram of the instruction table in an embodiment of the voice interaction system of this application;
Figure 7 is a partially detailed module diagram of another embodiment of the voice interaction system of this application;
Figure 8 is a structural diagram of the combined text in an embodiment of the voice interaction system of this application;
Figure 9 is a partially detailed module diagram of yet another embodiment of the voice interaction system of this application;
Figure 10 is a structural diagram of the mixed data in an embodiment of the voice interaction system of this application;
Figure 11 is a flow diagram of an embodiment of the voice interaction method of this application.
The realization of the purpose, functional features and advantages of this application will be further explained with reference to the embodiments and the accompanying drawings.
Description of reference numbers:
No. Name No. Name
100 main screen system 213 delay unit
200 secondary screen system 214 conversion cache unit
300 voice platform 215 feature recognition unit
11 sound collection module 216 extraction and encoding unit
12 instruction execution and information feedback module 220 decoding and parsing unit
13 audio drive module 221 instruction matching unit
21 sound monitoring and speech acquisition module 222 second register
22 text acquisition and instruction matching module 223 storage unit
23 information fusion and data decomposition module 310 decoding and recognition unit
24 display module 311 combined encoding unit
31 speech understanding and text generation module 312 logical structure conversion unit
32 information analysis and data generation module 230 information fusion unit
110 acoustic-to-electrical conversion unit 231 encoding unit
111 amplitude adjustment unit 232 decoding and decomposition unit
112 noise reduction unit 320 analysis and decoding unit
210 first register 321 synthesis and conversion unit
211 audio monitoring unit 322 mixed encoding unit
212 switch unit 1 television
Embodiments of the invention
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
It should be noted that all directional indications in the embodiments of this application (such as up, down, left, right, front, back, etc.) are only used to explain the relative positional relationships and motion between components in a particular attitude (as shown in the drawings); if that particular attitude changes, the directional indication changes accordingly.
The intelligent voice interaction technology currently used in TVs mainly adopts two schemes. In the first, the TV terminal's processor understands the speech, generates instructions and executes them; this occupies considerable processor resources during voice interaction, causing the TV to stutter and a poor user experience, and because the TV's speech acquisition and recognition stay active continuously, energy consumption is high. In the second, speech detection and extraction, recognition and understanding, and instruction generation are performed by a back-end voice platform, and the TV terminal only executes instructions; the voice platform and the TV terminal must exchange information repeatedly to complete an intelligent voice interaction, causing large delays, and the interaction lags severely, especially under poor network conditions, giving a poor experience.
In view of the above problems, this application provides a voice interaction system. Referring to Figure 1, in one embodiment the voice interaction system includes: a main screen system 100, a secondary screen system 200 in communication with the main screen system 100, and a voice platform 300 connected to the secondary screen system 200 over a network; the main screen system 100 and the secondary screen system 200 are arranged in a TV.
The secondary screen system 200 is used to generate a voice packet according to the audio signal collected by the main screen system 100, send the voice packet to the voice platform 300, parse the text packet fed back by the voice platform 300 based on the voice packet to generate instruction text, and generate a comprehensive information packet according to the instruction text; the voice platform 300 is used to generate a mixed data packet according to the comprehensive information packet; the secondary screen system 200 is further used to parse the mixed data packet to obtain a voice response text and a response audio signal, display the voice response text, and send the response audio signal to the main screen system 100 for output.
In this embodiment, referring to Figure 2, the main screen system 100 is provided with a sound collection module 11, an instruction execution and information feedback module 12 and an audio drive module 13. The sound collection module 11 is used to collect external sound signals and output the corresponding audio signal to the secondary screen system 200. Specifically, the sound collection module 11 collects the sound signals outside the TV and the audio signals inside the TV. It can be understood that the external sound signals are the sound signals outside the TV, including the user's speech signals containing interaction instructions, environmental sound signals and the audio played by the TV; the sound collection module 11 can filter out the audio played by the TV and generate an audio signal containing only the external sound signal, which is sent to the secondary screen system 200. The audio drive module 13 can be chosen according to the actual situation, for example a speaker; it is used to emit sound according to the response audio signal.
Referring to Figure 3, after receiving the audio signal, the secondary screen system 200 processes it, extracts the audio that matches preset extraction features, encodes that audio to form a voice packet, and sends the voice packet to the voice platform 300. The preset extraction features can be set according to the characteristics of the external sound signal, such as a preset time length, a preset energy threshold and a preset reference feature value. If the audio signal does not match the preset extraction features at all, that segment of external sound contains no user speech signal with interaction instructions; if several consecutively acquired audio signals all fail to match, the secondary screen system 200 stops processing audio signals and this function enters a sleep state to reduce power consumption.
After receiving the voice packet, the voice platform 300 decodes it into a speech audio signal, recognizes the speech audio signal to obtain the corresponding text, encodes the text into a text packet, and feeds it back to the secondary screen system 200.
After receiving the text packet, the secondary screen system 200 decodes it to obtain the text, parses the text to obtain the voice text and the instruction text, determines the matching instruction corresponding to the instruction text, and outputs the matching instruction to the main screen system 100. The main screen system 100 performs the operation corresponding to the matching instruction. That operation may be an internal operation of the main screen system 100, such as volume adjustment, or an internal-and-external operation, such as invoking the internal video player and obtaining audio and video content from a content service platform, outputting the resulting content to the audio and video processing module of the main screen system 100 or directly controlling the audio and video module to switch its working state. The main screen system 100 then generates the response information of the executed operation, such as "volume 25" or "video playback started", and sends it to the secondary screen system 200.
In addition, the voice text is output to the display module 24 of the secondary screen system 200, which displays it so that the user can see the recognized text form of the speech they uttered. If the user finds that their speech has been misrecognized, they can immediately speak to the TV again instead of only realizing the error after the TV feeds back incorrect interaction information, which improves the timeliness of feedback and the visibility of the human-machine interaction.
The secondary screen system 200 obtains the description information of the current matching instruction, fuses the response information and the instruction's description information into comprehensive information, and encodes it into a comprehensive information packet sent to the voice platform 300. For example, if the response information is "volume 25" and the instruction description information is "please adjust the volume", the comprehensive information is "please adjust the volume, volume 25".
The voice platform 300 generates a mixed data packet according to the comprehensive information packet. It decodes the comprehensive information packet to obtain the comprehensive information, analyzes and understands it, and obtains the voice response text; for example, if the decoded comprehensive information is "please adjust the volume, volume 25", the resulting voice response text is "the volume has been adjusted to 25". It then converts the voice response text into a response audio signal, and finally mix-encodes the response text and the response audio signal into a mixed data packet, which is transmitted over the network to the TV secondary screen system 200.
After parsing the received mixed data packet, the secondary screen system 200 obtains the voice response text and the response audio signal, sends the voice response text to the display module 24 of the secondary screen system 200 for display so that the user sees the visualized feedback text, and sends the response audio signal to the main screen system 100 for output, thereby completing the "speech-to-speech" intelligent interaction with the user.
In this embodiment, the main screen system 100 mainly collects external sound, generates the audio signal and transmits it to the secondary screen system 200; the secondary screen system 200 generates from the audio signal the voice packet corresponding to the speech signal and transmits it over the network to the voice platform 300; the voice platform 300 converts the voice packet into instruction text conforming to the predetermined language structure and transmits it over the network to the secondary screen system 200; the secondary screen system 200 determines the matching instruction from the instruction text and transmits it to the main screen system 100; the main screen system 100 executes the instruction and feeds the execution response information back to the secondary screen system 200. The secondary screen system 200 further fuses the response information and the description information of the matching instruction into comprehensive information and transmits it over the network to the voice platform 300; the voice platform 300 parses and converts the comprehensive information to obtain the voice response text and response audio signal and mixes them into a mixed data packet transmitted over the network to the TV secondary screen system 200; the secondary screen system 200 decodes and decomposes the mixed data packet, separating out the response text to drive the secondary screen display and the response audio signal to drive the sound module of the main screen system 100 to emit sound, realizing human-machine voice interaction.
Through the above structure and method, the secondary screen system 200 acquires audio signals in real time, generates voice packets, parses the text packets fed back by the voice platform 300 to generate instruction text and comprehensive information packets, parses the mixed data packets sent by the voice platform 300 to obtain the voice response text and response audio signal, displays the voice response text, and sends the response audio signal to the main screen system 100. The voice platform 300 mainly performs speech understanding, and the main screen system 100 only collects sound and carries out the corresponding operations. Voice interaction is thus achieved while occupying few processor resources of the main screen system 100, with low delay and fast response, without occupying the TV's video processing resources, so that the video display remains clear and smooth; moreover, compared with the existing technology, the network interaction delay is small, the interaction experience good, and the speech-processing power consumption low, improving the user experience.
Further, referring again to Figure 2, the secondary screen system 200 includes a sound monitoring and speech acquisition module 21, a text acquisition and instruction matching module 22, an information fusion and data decomposition module 23 and a display module 24. The sound monitoring and speech acquisition module 21 is used to generate a voice packet according to the audio signal output by the main screen system 100 and send the voice packet to the voice platform 300. The text acquisition and instruction matching module 22 is used to receive the text packet fed back by the voice platform 300, parse the text packet to generate instruction text, determine the matching instruction corresponding to the instruction text, and output the matching instruction to the main screen system 100. The information fusion and data decomposition module 23 is connected to the text acquisition and instruction matching module 22; it is used to obtain the description information corresponding to the instruction text, receive the response information fed back after the main screen system 100 performs the operation corresponding to the matching instruction, generate a comprehensive information packet according to the response information and the description information, and send the comprehensive information packet to the voice platform 300; it is further used to receive and parse the mixed data packet output by the voice platform 300, generate the voice response text and response audio signal, and send the response audio signal to the main screen system 100 for output. The display module 24 is connected to the information fusion and data decomposition module 23 and is used to receive and display the voice response text output by the information fusion and data decomposition module 23.
In this embodiment, the secondary screen system 200 queries the instruction records in the stored instruction table according to the instruction text, determines the instruction record most similar to the instruction text as the matching instruction, and outputs the matching instruction to the main screen system 100. The instruction execution and information feedback module 12 in the main screen system 100 can write the stored instruction table into the text acquisition and instruction matching module 22 of the secondary screen system 200 in advance; each instruction record of the stored instruction table is an instruction executable by the TV main screen system 100 together with its description information.
By having the secondary screen system 200 detect audio information in real time, interact with the voice platform 300, process the corresponding data and output the corresponding instructions to the main screen system 100 for execution, the main screen system 100's processor resource usage is kept low and its video display good, the network interaction delay is small and the interaction experience good, and the speech processing consumes little power and is efficient.
Further, the voice platform 300 includes a language understanding and text generation module and an information analysis and data generation module 32. The language understanding and text generation module is used to generate the corresponding text packet according to the voice packet sent by the secondary screen system 200 and send the text packet to the secondary screen system 200; the information analysis and data generation module 32 is used to receive the comprehensive information packet sent by the secondary screen system 200, generate a mixed data packet according to the comprehensive information packet, and send the mixed data packet to the secondary screen system 200.
In this embodiment, the speech understanding and text generation module 31 of the voice platform 300 decodes the voice packet to obtain the speech and performs speech understanding: it converts the speech into voice text, converts the voice text into instruction text conforming to the predetermined language structure, combines the instruction text with the corresponding voice text to form a text, encodes it into a text packet, and transmits it over the network to the text parsing unit of the TV secondary screen system 200. The information analysis and data generation module 32 decodes the comprehensive information packet to obtain the comprehensive information, analyzes it to obtain the voice response text, converts the voice response text into a response audio signal, then mix-encodes the response text and response audio signal into a mixed data packet and transmits it over the network to the TV secondary screen system 200.
Further, referring to Figure 4, the sound collection module 11 of the main screen system 100 specifically includes an acoustic-to-electrical conversion unit 110, an amplitude adjustment unit 111 and a noise reduction unit 112. The acoustic-to-electrical conversion unit 110 is used to collect external sound signals; the amplitude adjustment unit 111 is used to obtain the internal audio signal; the noise reduction unit 112 is used to perform noise reduction on the external sound signal according to the internal audio signal, so as to generate the audio signal corresponding to the speech in the external sound signal, and to output the audio signal to the secondary screen system 200; the noise reduction unit 112 is connected to the acoustic-to-electrical conversion unit 110 and the amplitude adjustment unit 111 respectively.
In this embodiment, after receiving the external sound signal, the acoustic-to-electrical conversion unit 110 performs acoustic-to-electrical conversion to obtain the external sound audio signal; the amplitude adjustment unit 111 obtains the program audio signal output by the TV audio and video processing module, i.e. the internal audio signal, and performs amplitude adjustment to obtain a program audio signal with a set amplitude; the noise reduction unit 112 then performs denoising, i.e. it compares the frequency content of the external sound audio signal and the program audio signal and removes the program audio component from the external sound audio signal, obtaining the denoised audio signal. This extracts the external sound signal, so that a clear and accurate speech signal from the user can be obtained, improving the accuracy of voice interaction.
Further, the sound monitoring and speech acquisition module 21 includes a first register 210, an audio monitoring unit 211, a switch unit 212, a delay unit 213, a conversion cache unit 214, a feature recognition unit 215 and an extraction and encoding unit 216. The first input of the audio monitoring unit 211 is connected to the input of the delay unit 213; the second input of the audio monitoring unit 211 and the first input of the feature recognition unit 215 are each connected to the first register 210; the output of the audio monitoring unit 211 is connected to the input of the switch unit 212 and the second input of the feature recognition unit 215; the output of the switch unit 212 is connected to the input of the conversion cache unit 214; the conversion cache unit 214 is also connected to the feature recognition unit 215 and the extraction and encoding unit 216; the extraction and encoding unit 216 is connected to the feature recognition unit 215.
The first register 210 is used to store a preset time length, a preset energy threshold and a preset reference feature value. The audio monitoring unit 211 is used to receive the audio signal output by the main screen system 100 and, when it detects that the audio signal within the preset time length reaches the preset energy threshold, to output an interception trigger signal. The switch unit 212 is used to turn on upon receiving the interception trigger signal. The delay unit 213 is used, when the switch unit 212 is on, to output the audio signal delayed by the preset time length to the conversion cache unit 214. The conversion cache unit 214 is used, upon receiving the interception trigger signal, to allocate a starting storage address at which to store the audio signal, and to output the starting storage address. The feature recognition unit 215 is used, upon receiving the interception trigger signal, to read the preset reference feature value and the audio signal at the starting storage address, and, when the features of the audio signal match the preset reference feature value, to output an extraction trigger signal to the extraction and encoding unit 216. The extraction and encoding unit 216 is used, upon receiving the extraction trigger signal, to read the audio signal according to the starting storage address, encode the audio signal to form a voice packet, and send the voice packet to the voice platform 300.
In this embodiment, the preset time length read by the audio monitoring unit 211 from the first register 210 is, for example, Ts, and the preset energy threshold is Es. The audio monitoring unit 211 monitors the average energy value of the audio signal over each time length Ts in real time; if it detects that the average energy value within Ts reaches the preset energy threshold Es, it generates an interception trigger signal and starts intercepting audio.
Specifically, under the control of the interception trigger signal, the switch unit 212 turns on the audio switch. The audio signal passes through the delay unit 213, whose delay time can be set to Ts, so that the monitored audio signal whose average energy value reached Es is output through the audio switch to the conversion cache unit 214.
The conversion cache unit 214 allocates a starting storage address, performs format conversion on the received audio signal, stores the audio signal starting from the starting storage address, and sends the starting storage address to the feature recognition unit 215. It should be noted that the cache unit may store multiple audio segments.
The feature recognition unit 215 starts working after receiving the interception trigger signal. It reads the preset reference feature value from the first register 210, reads the audio signal stored at the starting storage address of the conversion cache unit 214, analyzes the features of this audio signal and compares them with the preset reference feature value. If they do not match, it reads the audio signal stored at the next storage address and again compares its features with the preset reference feature value, continuing in this way until the features of the audio signal stored at some storage address match the preset reference feature value; it then sends an extraction trigger signal to the extraction and encoding unit 216 and marks that storage address as the speech extraction starting address, which it outputs to the extraction and encoding unit 216.
The extraction and encoding unit 216 starts working after receiving the extraction trigger signal. Starting from the speech extraction starting address of the conversion cache unit 214, it reads the stored audio signals in sequence; the audio read in this way is the speech to be acquired. It encodes the acquired speech and outputs the encoded speech signal to form a voice packet, which is transmitted over the network to the voice platform 300.
It should also be noted that during the speech acquisition following the interception trigger signal, the audio monitoring unit 211 continues to monitor the audio. When it detects that the average energy value of the audio signal has failed to reach the energy threshold Es for N consecutive windows of length Ts (N being a preset count that can be set according to the actual situation), the audio monitoring unit 211 generates an interception end signal to end this audio interception; the switch unit 212 turns off the audio switch under the control of the interception end signal, closing the audio signal transmission channel. Upon receiving the interception end signal, the feature recognition unit 215 outputs an extraction end signal to the conversion cache unit 214 and the encoding unit and enters the sleep state, i.e. a low-power state; after receiving the extraction end signal, the conversion cache unit 214 clears its cache and enters the sleep state, and the encoding unit likewise enters the sleep state. This reduces the power consumption of the TV.
Further, referring to Figure 5, the text acquisition and instruction matching module 22 includes a decoding and parsing unit 220, an instruction matching unit 221, a second register 222 and a storage unit 223. The instruction matching unit 221 is connected to the decoding and parsing unit 220, the second register 222 and the storage unit 223 respectively. The decoding and parsing unit 220 is used to receive and decode the text packet fed back by the voice platform 300 to obtain the combined text and to parse the combined text to obtain the voice text and the instruction text; the second register 222 is used to store a preset similarity; the storage unit 223 is used to store an instruction table, where the instruction table includes multiple instruction records and the description field information of each instruction record; the instruction matching unit 221 is used to obtain the preset similarity and read each piece of description field information in the instruction table, and, when the similarity between the instruction text and a piece of description field information reaches the preset similarity, to take the instruction record corresponding to that description field information as the matching instruction corresponding to the instruction text and output the matching instruction to the main screen system 100.
In this embodiment, the working principle of the text acquisition and instruction matching module 22 is as follows: the decoding and parsing unit 220 receives the text packet and decodes it to obtain the combined text, then parses the text to obtain the voice text and the instruction text; the voice text is output to the display module 24 of the secondary screen system 200 and the instruction text to the instruction matching unit 221.
After receiving the instruction text, the instruction matching unit 221 reads the preset similarity from the second register 222 and reads the stored instruction table from the storage unit 223. The structure of the instruction table is shown in Figure 6: it contains instruction record 1, instruction record 2, and so on; each instruction record contains description information and an instruction, and the description information contains field 1, field 2, and so on. After comparing, field by field, the similarity between each instruction record and the instruction text and judging whether that similarity reaches the preset similarity, the procedure may be: the instruction matching unit 221 reads the description information fields of an instruction record in turn and compares each field with the instruction text; if the similarity reaches the preset similarity, the instruction of that record is the matching instruction and is output to the main screen system 100; otherwise the next instruction record is queried. For example, the instruction matching unit 221 reads the description field information of record 1 and first compares field 1 with the instruction text; if the similarity reaches the preset similarity, the instruction of record 1 is the matching instruction, otherwise it compares field 2, and so on; if no field of record 1 meets the requirement, it reads the description field information of record 2 for comparison. The procedure may also include: if the similarity reaches the preset similarity, the record is judged a first matching instruction; among the first matching instructions, the one with the greatest similarity to the instruction text is taken as the matching instruction.
In this way the instruction records in the stored instruction table are queried using the instruction text, the most similar instruction record is found as the matching instruction, and the matching instruction is output to the main screen system 100, improving the accuracy of voice interaction.
进一步地,参照图7,所述语音理解与文本生成模块31包括解码识别单元310、组合编码单元311和逻辑结构转换单元312;所述解码识别单元310用于接收并解码所述副屏系统200发送的所述语音包得到语音音频信号,并对所述语音音频信号进行识别,转换为语音文字;所述逻辑结构转换单元312与所述解码识别单元310连接,用于对所述语音文字进行理解,并将所述语音文字转换为符合预设语音结构的指令文字;所述组合编码单元311分别与所述解码识别单元310和所述逻辑结构转换单元312连接;所述组合编码单元311用于将所述语音文字和所述指令文字按照预设顺序进行组合,形成组合文本,并将所述组合文本编码为所述文本包,发送所述文本包至所述副屏系统200。
本实施例中,解码识别单元310接收语音包并进行解码处理后得到语音音频信号,进一步进行音频信号识别,将音频转换成语音文字。逻辑结构转换单元312中预先设定有用于转换的语言结构,即预设语音结构,可以根据用户的语言习惯等进行设置;对语音文字进行逻辑理解后,将语音文字转换成符合预设语音结构的指令文字。组合编码单元311将语音文字和指令文字按照预设顺序组合在一起形成组合文本,然后编码为文本包,通过网络传输给电视副屏系统200。其中,预设顺序可以为前后的顺序,组合文本结构如图8所示。从而完成了对语音包的识别与转换,以使电视的主屏系统100和副屏系统200能进行相应的操作。
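组合文本的组合与编解码可用如下Python片段示意(以JSON作为编码格式、以字典字段表示前后顺序均为本示例的假设,专利中仅要求语音文字与指令文字按预设顺序组合):

```python
# 示意性代码:组合编码单元将语音文字与指令文字组合为组合文本并编码为文本包;
# 副屏系统的解码解析单元执行逆过程。
import json

def encode_text_packet(speech_text, instruction_text):
    """按预设顺序(语音文字在前、指令文字在后)组合,编码为字节形式的文本包。"""
    combined = {'speech': speech_text, 'instruction': instruction_text}
    return json.dumps(combined, ensure_ascii=False).encode('utf-8')

def decode_text_packet(packet):
    """解码文本包,解析出语音文字和指令文字。"""
    combined = json.loads(packet.decode('utf-8'))
    return combined['speech'], combined['instruction']
```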
进一步地,参照图9,所述信息解析与数据生成模块32包括解析解码单元320、合成转换单元321和混合编码单元322;所述解析解码单元320用于接收并解码所述副屏系统200发送的所述综合信息包得到综合信息,并对所述综合信息进行解析得到所述语音响应文本;所述合成转换单元321与所述解析解码单元320的输出端连接;所述合成转换单元321用于将所述语音响应文本转换为所述响应音频;所述混合编码单元322与所述合成转换单元321和所述解析解码单元320连接;所述混合编码单元322用于将所述语音响应文本和所述响应音频进行混合编码,生成所述混合数据包,发送所述混合数据包至所述副屏系统200,其中混合数据包的结构可参照图10所示。
还需要说明的是,所述信息融合与数据分解模块23包括信息融合单元230、编码单元231和解码分解单元232;所述信息融合单元230用于接收所述主屏系统100执行所述匹配指令对应的操作后反馈的所述响应信息,以及获取与所述指令文字相对应的描述信息,并根据所述响应信息和所述描述信息生成综合信息;所述编码单元231,与所述信息融合单元230连接;所述编码单元231用于将所述综合信息编码为所述综合信息包,并输出所述综合信息包至所述语音平台300;所述解码分解单元232用于接收并解析所述语音平台300输出的所述混合数据包,分离出所述语音响应文本和所述响应音频信号;并发送所述语音响应文本至所述显示模块24,发送所述响应音频信号至所述主屏系统100进行输出。
本实施例中,信息融合单元230收到主屏系统100反馈的响应信息,并从文本获取与指令匹配模块22中获取当前指令记录的描述信息,将响应信息和指令的描述信息融合为综合信息。比如响应信息为“音量25”,指令描述信息“请调整音量”,则综合信息为“请调整音量,音量25”。编码单元231通过网络将综合信息编码为综合信息包发送给语音平台300。
语音平台300的解析解码单元320对综合信息包进行解码处理得到综合信息,并对综合信息进行解析和理解,得到语音响应文本,比如解码得到上述综合信息为“请调整音量,音量25”,则解析得到的语音响应文本为“已将音量调整至25”;并将语音响应文本输出给合成转换单元321和混合编码单元322;合成转换单元321将语音响应文本转换成响应音频;混合编码单元322将响应文本和响应音频信号混合编码为混合数据包,混合数据包的结构如图10所示;并通过网络传输给电视副屏系统200的解码分解单元232,解码分解单元232收到混合数据包后进行数据解码及分解处理,分离出响应文本传输给副屏的显示模块24,以及分离出响应音频信号传输给主屏系统100的扬声器,以使主屏系统100的扬声器在响应音频信号的驱动下发出语音交互的声音。
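混合数据包的混合编码与分解可用如下Python片段示意(以4字节长度前缀划分文本段与音频段为本示例假设的格式,专利中仅要求将语音响应文本与响应音频混合编码后可再分离):

```python
# 示意性代码:混合编码单元将语音响应文本与响应音频编码为一个混合数据包,
# 副屏系统的解码分解单元再将二者分离。
import struct

def encode_mixed_packet(response_text, response_audio):
    """长度前缀 + 文本字节 + 音频字节,构成混合数据包。"""
    text_bytes = response_text.encode('utf-8')
    return struct.pack('>I', len(text_bytes)) + text_bytes + response_audio

def decode_mixed_packet(packet):
    """按长度前缀分离出语音响应文本与响应音频信号。"""
    (text_len,) = struct.unpack('>I', packet[:4])
    text = packet[4:4 + text_len].decode('utf-8')
    audio = packet[4 + text_len:]
    return text, audio
```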
本申请还提供一种语音交互方法,应用于电视的副屏系统,参见图11,所述语音交互方法包括步骤:
步骤S10,根据主屏系统采集的音频信号生成语音包,发送所述语音包至语音平台;
步骤S20,接收并解析语音平台基于所述语音包反馈的文本包,生成指令文字,根据所述指令文字生成综合信息包;
步骤S30,接收语音平台根据所述综合信息包生成的混合数据包,解析所述混合数据包得到语音响应文本和响应音频信号,显示所述语音响应文本,并发送所述响应音频信号至主屏系统进行输出。
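步骤S10至S30的数据流向可用如下Python片段示意(platform的两个回调与execute_on_main_screen均为本示例虚构的占位接口,仅体现副屏系统在方法中承担的中转角色):

```python
# 示意性代码:副屏系统语音交互方法三个步骤的流程骨架。

def voice_interaction(audio_signal, platform, execute_on_main_screen):
    """platform 需提供 text_for / mixed_for 两个回调(均为本示例假设的接口)。"""
    voice_packet = bytes(audio_signal)                    # S10: 生成语音包并发送
    text_packet = platform['text_for'](voice_packet)      # 语音平台反馈文本包
    instruction = text_packet['instruction']              # S20: 解析得到指令文字
    response = execute_on_main_screen(instruction)        # 主屏系统执行并反馈响应信息
    info_packet = {'instruction': instruction, 'response': response}  # 综合信息包
    mixed = platform['mixed_for'](info_packet)            # S30: 平台生成混合数据包
    return mixed['text'], mixed['audio']                  # 显示响应文本、音频送主屏输出
```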
本实施例中,主屏系统、副屏系统和语音平台的结构可参照上述实施例进行设置,不再赘述。从而由副屏系统实时进行音频信号获取,生成语音包;解析语音平台反馈的文本包,生成指令文字及综合信息包;并解析语音平台发送的混合数据包,得到语音响应文本和响应音频信号,显示所述语音响应文本,并发送所述响应音频信号至所述主屏系统进行输出。语音平台主要进行语音理解,主屏系统仅采集声音及响应相应的操作,从而在实现语音交互的过程中,占用主屏系统的处理器资源少,语音交互延时小、响应速度快、不占用电视视频处理资源,视频显示清晰、流畅;并且,与现有技术相比网络交互时延小、交互体验好、语音处理功耗低,提高了用户的体验感。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是在本申请的发明构思下,利用本申请说明书及附图内容所作的等效结构变换,或直接/间接运用在其他相关的技术领域均包括在本申请的专利保护范围内。

Claims (16)

  1. 一种语音交互系统,其中,所述语音交互系统包括:主屏系统、与所述主屏系统建立通信连接的副屏系统以及与所述副屏系统建立网络连接的语音平台;其中,所述主屏系统和所述副屏系统设置于电视中;
    所述副屏系统,用于根据所述主屏系统采集的音频信号生成语音包,发送所述语音包至所述语音平台,并解析所述语音平台基于所述语音包反馈的文本包,生成指令文字,根据所述指令文字生成综合信息包;
    所述语音平台用于根据所述综合信息包生成混合数据包;
    所述副屏系统,还用于解析所述混合数据包,得到语音响应文本和响应音频信号,显示所述语音响应文本,并发送所述响应音频信号至所述主屏系统进行输出。
  2. 根据权利要求1所述的语音交互系统,其中,所述主屏系统包括:
    声电转换单元,用于采集外部声音信号;
    幅度调整单元,用于获取内部音频信号;
    降噪单元,所述降噪单元分别与所述声电转换单元和所述幅度调整单元连接;所述降噪单元,用于根据所述内部音频信号对所述外部声音信号进行降噪处理,以生成所述外部声音信号中的语音所对应的音频信号,并输出所述音频信号至所述副屏系统。
  3. 根据权利要求1所述的语音交互系统,其中,所述副屏系统包括:
    声音监测与语音获取模块,用于根据所述主屏系统输出的所述音频信号生成语音包,并发送所述语音包至所述语音平台;
    文本获取与指令匹配模块,用于接收所述语音平台基于所述语音包反馈的文本包,解析所述文本包生成指令文字,并确定所述指令文字所对应的匹配指令,输出所述匹配指令至所述主屏系统;
    信息融合与数据分解模块,与所述文本获取与指令匹配模块连接;所述信息融合与数据分解模块用于获取与所述指令文字相对应的描述信息,接收所述主屏系统执行所述匹配指令对应的操作后反馈的响应信息,并根据所述响应信息和所述描述信息生成综合信息包,发送所述综合信息包至所述语音平台;所述信息融合与数据分解模块还用于接收并解析所述语音平台输出的所述混合数据包,生成所述语音响应文本和响应音频信号,发送所述响应音频信号至所述主屏系统进行输出;
    显示模块,与所述信息融合与数据分解模块连接;所述显示模块,用于接收并显示所述信息融合与数据分解模块输出的所述语音响应文本。
  4. 根据权利要求3所述的语音交互系统,其中,所述声音监测与语音获取模块包括第一寄存器、音频监测单元、开关单元、延时单元、转换缓存单元、特征识别单元和提取编码单元;所述音频监测单元的第一输入端和所述延时单元的输入端连接,所述音频监测单元的第二输入端和所述特征识别单元的第一输入端分别与所述第一寄存器连接,所述音频监测单元的输出端分别与所述开关单元的输入端和所述特征识别单元的第二输入端连接,所述开关单元的输出端与所述转换缓存单元的输入端连接,所述转换缓存单元还与所述特征识别单元和所述提取编码单元连接;所述提取编码单元与所述特征识别单元连接。
  5. 根据权利要求4所述的语音交互系统,其中,
    所述第一寄存器,用于存储预设时间长度、预设能量阈值和预设参考特征值;
    所述音频监测单元,用于接收所述主屏系统输出的所述音频信号,并在监测到所述预设时间长度内的音频信号达到所述预设能量阈值时,输出截取触发信号;
    所述开关单元,用于在接收到所述截取触发信号时开启;
    所述延时单元,用于在所述开关单元开启时,输出延时预设时间长度的所述音频信号至所述转换缓存单元;
    所述转换缓存单元,用于在接收到所述截取触发信号时,分配起始存储地址以存储所述音频信号,并输出所述起始存储地址;
    所述特征识别单元,用于在接收到所述截取触发信号时,读取所述预设参考特征值和所述起始存储地址中的所述音频信号,并在所述音频信号的特征与所述预设参考特征值一致时,输出提取触发信号至所述提取编码单元;
    所述提取编码单元,用于在接收到所述提取触发信号时,根据所述起始存储地址读取所述音频信号,并将所述音频信号进行编码形成语音包,发送所述语音包至所述语音平台。
  6. 根据权利要求3所述的语音交互系统,其中,所述文本获取与指令匹配模块包括解码解析单元、指令匹配单元、第二寄存器和存储单元,所述指令匹配单元分别与所述解码解析单元、第二寄存器和存储单元连接。
  7. 根据权利要求6所述的语音交互系统,其中,
    所述解码解析单元,用于接收并解码所述语音平台反馈的文本包,得到组合文本,并解析所述组合文本得到语音文字和指令文字,
    所述第二寄存器,用于存储预设相似度;
    所述存储单元,用于存储指令表,其中,所述指令表包括多个指令记录以及每个所述指令记录的描述字段信息;
    所述指令匹配单元,用于获取所述预设相似度并读取所述指令表中每一条描述字段信息,在所述指令文字与所述描述字段信息的比较相似度达到预设相似度时,将与所述指令文字的比较相似度达到预设相似度的所述描述字段信息对应的指令记录作为所述指令文字所对应的所述匹配指令,输出所述匹配指令至所述主屏系统。
  8. 根据权利要求3所述的语音交互系统,其中,所述信息融合与数据分解模块包括信息融合单元、编码单元和解码分解单元。
  9. 根据权利要求8所述的语音交互系统,其中,
    所述信息融合单元,用于接收所述主屏系统执行所述匹配指令对应的操作后反馈的所述响应信息,以及获取与所述指令文字相对应的描述信息,并根据所述响应信息和所述描述信息生成综合信息;
    所述编码单元,与所述信息融合单元连接;所述编码单元,用于将所述综合信息编码为所述综合信息包,并输出所述综合信息包至所述语音平台;
    所述解码分解单元,用于接收并解析所述语音平台输出的所述混合数据包,分离出所述语音响应文本和所述响应音频信号;并发送所述语音响应文本至所述显示模块,发送所述响应音频信号至所述主屏系统进行输出。
  10. 根据权利要求1所述的语音交互系统,其中,所述语音平台包括语言理解与文本生成模块和信息解析与数据生成模块。
  11. 根据权利要求10所述的语音交互系统,其中,
    所述语言理解与文本生成模块,用于根据所述副屏系统发送的所述语音包生成对应的文本包,并发送所述文本包至所述副屏系统;
    所述信息解析与数据生成模块,用于接收所述副屏系统发送的所述综合信息包,根据所述综合信息包生成混合数据包,发送所述混合数据包至所述副屏系统。
  12. 根据权利要求11所述的语音交互系统,其中,所述语言理解与文本生成模块包括解码识别单元、组合编码单元和逻辑结构转换单元。
  13. 根据权利要求12所述的语音交互系统,其中,
    所述解码识别单元,用于接收并解码所述副屏系统发送的所述语音包得到语音音频信号,并对所述语音音频信号进行识别,转换为语音文字;
    所述逻辑结构转换单元,与所述解码识别单元连接,用于对所述语音文字进行理解,并将所述语音文字转换为符合预设语音结构的指令文字;
    所述组合编码单元,分别与所述解码识别单元和所述逻辑结构转换单元连接;所述组合编码单元用于将所述语音文字和所述指令文字按照预设顺序进行组合,形成组合文本,并将所述组合文本编码为所述文本包,发送所述文本包至所述副屏系统。
  14. 根据权利要求10所述的语音交互系统,其中,所述信息解析与数据生成模块包括解析解码单元、合成转换单元和混合编码单元。
  15. 根据权利要求14所述的语音交互系统,其中,
    所述解析解码单元,用于接收并解码所述副屏系统发送的所述综合信息包得到综合信息,并对所述综合信息进行解析得到所述语音响应文本;
    所述合成转换单元,与所述解析解码单元的输出端连接;所述合成转换单元,用于将所述语音响应文本转换为所述响应音频;
    所述混合编码单元,与所述解析解码单元和所述合成转换单元连接;所述混合编码单元,用于将所述语音响应文本和所述响应音频进行混合编码,生成所述混合数据包,发送所述混合数据包至所述副屏系统。
  16. 一种语音交互方法,其中,所述语音交互方法应用于副屏系统;所述语音交互方法包括步骤:
    根据主屏系统采集的音频信号生成语音包,发送所述语音包至语音平台;
    接收并解析语音平台基于所述语音包反馈的文本包,生成指令文字,根据所述指令文字生成综合信息包;
    接收语音平台根据所述综合信息包生成的混合数据包,解析所述混合数据包得到语音响应文本和响应音频信号,显示所述语音响应文本,并发送所述响应音频信号至主屏系统进行输出。
PCT/CN2022/106046 2022-05-13 2022-07-15 语音交互系统及语音交互方法 WO2023216414A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210527135.8 2022-05-13
CN202210527135.8A CN114945103B (zh) 2022-05-13 2022-05-13 语音交互系统及语音交互方法

Publications (1)

Publication Number Publication Date
WO2023216414A1 true WO2023216414A1 (zh) 2023-11-16

Family

ID=82906432

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/106046 WO2023216414A1 (zh) 2022-05-13 2022-07-15 语音交互系统及语音交互方法

Country Status (2)

Country Link
CN (1) CN114945103B (zh)
WO (1) WO2023216414A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117915160A (zh) * 2024-01-19 2024-04-19 江苏苏桦技术股份有限公司 一种会议教育用显示设备的交互系统及方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102013254A (zh) * 2010-11-17 2011-04-13 广东中大讯通信息有限公司 一种数字电视语音识别人机交互系统及方法
US20130196293A1 (en) * 2012-01-31 2013-08-01 Michael C. Wood Phonic learning using a mobile computing device having motion sensing capabilities
US20150149146A1 (en) * 2013-11-22 2015-05-28 Jay Abramovitz Systems for delivery of audio signals to mobile devices
CN109147784A (zh) * 2018-09-10 2019-01-04 百度在线网络技术(北京)有限公司 语音交互方法、设备以及存储介质
CN110740367A (zh) * 2019-10-23 2020-01-31 海信电子科技(武汉)有限公司 显示设备及语音指令处理方法
CN112511882A (zh) * 2020-11-13 2021-03-16 海信视像科技股份有限公司 一种显示设备及语音唤起方法
CN114283801A (zh) * 2021-12-15 2022-04-05 深圳创维-Rgb电子有限公司 语音交互显示系统及智能显示终端

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005189846A (ja) * 2003-12-05 2005-07-14 Ihm:Kk 音声制御スクリーンシステム
KR102056461B1 (ko) * 2012-06-15 2019-12-16 삼성전자주식회사 디스플레이 장치 및 디스플레이 장치의 제어 방법
CN106251869B (zh) * 2016-09-22 2020-07-24 浙江吉利控股集团有限公司 语音处理方法及装置
CN112788422A (zh) * 2019-11-04 2021-05-11 海信视像科技股份有限公司 显示设备

Also Published As

Publication number Publication date
CN114945103B (zh) 2023-07-18
CN114945103A (zh) 2022-08-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22941358

Country of ref document: EP

Kind code of ref document: A1