CN116192423A

CN116192423A - Voice interaction method, corresponding equipment, server and storage medium

Info

Publication number: CN116192423A
Application number: CN202211518331.5A
Authority: CN
Inventors: 王康
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2022-11-29
Filing date: 2022-11-29
Publication date: 2023-05-30

Abstract

The application provides a voice interaction method, corresponding equipment, a server and a storage medium. The voice interaction method comprises the following steps: before or during the multi-person voice interaction, responding to the encryption starting instruction to enter a voice encryption state; in a voice encryption state, encrypting the acquired voice signal based on a key to generate an encrypted voice signal; the encrypted voice signal is transmitted to a terminal device or a server connected to the voice output device. According to the technical scheme, the voice content in the multi-user voice interaction process can be effectively prevented from being leaked, and the safety of the multi-user voice interaction is improved.

Description

Voice interaction method, corresponding equipment, server and storage medium

Technical Field

The present disclosure relates to the field of voice interaction technologies, and in particular, to a voice interaction method, and corresponding devices, servers, and storage media.

Background

With the popularization of cloud video conferencing software such as DingTalk, voice calls, conferences and the like among multiple users have become more convenient. At the same time, the risk that users' voice privacy is leaked has increased considerably; for example, the operating system and some malicious software may capture the content of a voice call. The related art provides an encryption function in VoIP (Voice over Internet Protocol, a voice call technology) software, but that encryption operates at the data level and the transport layer, and cannot prevent the operating system or malware from capturing the microphone input.

Disclosure of Invention

The embodiment of the application provides a voice interaction method, corresponding equipment, a server and a storage medium, so as to solve the technical problems in the prior art.

In a first aspect, an embodiment of the present application provides a voice output device, including:

the first control switch is positioned at the outer side of the voice output equipment and is used for generating an encryption starting instruction when triggered;

the encryption control module is electrically connected with the first control switch and is used for responding to an encryption starting instruction to enter a voice encryption state before or during multi-user voice interaction, encrypting the acquired voice signal based on a secret key in the voice encryption state and generating an encrypted voice signal;

the first communication module is electrically connected with the encryption control module and is used for transmitting encrypted voice signals to a terminal device or a server connected with the voice output device; the terminal device is used for forwarding the encrypted voice signal to the server, and the server is used for forwarding the encrypted voice signal to the voice receiving device of the interaction object.

In a second aspect, embodiments of the present application provide a server, including:

the second communication module is configured to receive the encrypted voice signal transmitted by the voice output device or the terminal device provided in the first aspect of the embodiment of the present application, and forward the encrypted voice signal to the voice receiving device of the interaction object.

In a third aspect, an embodiment of the present application provides a voice receiving apparatus, including:

the third communication module is configured to receive an encrypted voice signal sent by the server provided in the second aspect of the embodiment of the present application;

and the decryption control module is used for decrypting the encrypted voice signal based on the secret key.

In a fourth aspect, an embodiment of the present application provides a voice interaction method, which is applicable to a voice output device, where the method includes:

before or during the multi-person voice interaction, responding to the encryption starting instruction to enter a voice encryption state;

in a voice encryption state, encrypting the acquired voice signal based on a key to generate an encrypted voice signal;

transmitting the encrypted voice signal to a terminal device or a server connected to the voice output device; the terminal device is used for forwarding the encrypted voice signal to the server, and the server is used for forwarding the encrypted voice signal to the voice receiving device of the interaction object.

In a fifth aspect, an embodiment of the present application provides a voice interaction method, which may be applied to a server, where the method includes:

receiving an encrypted voice signal transmitted by a voice output device or a terminal device provided in the first aspect of the embodiment of the present application;

forwarding the encrypted voice signal to the voice receiving device of the interaction object.

In a sixth aspect, an embodiment of the present application provides a voice interaction method, which is applicable to a voice receiving device, where the method includes:

receiving an encrypted voice signal sent by a server provided in a second aspect of the embodiment of the present application;

decrypting the encrypted voice signal based on the key; the key is a key used in the voice interaction method provided in the fourth aspect of the embodiment of the present application.

In a seventh aspect, embodiments of the present application provide a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the voice interaction method provided in any embodiment of the present application.

Compared with the prior art, the application has the following advantages:

according to the technical scheme of the embodiment of the application, voice interaction can be realized through the voice output equipment, the server and the voice receiving equipment, voice content in multi-user voice interaction can be encrypted on an audio domain, an operating system, malicious software and the like can only acquire encrypted voice data and cannot decrypt the encrypted voice data, so that original voice content cannot be acquired, thereby effectively protecting the call privacy of a user, and improving the safety of voice interaction in the scenes of secret calling, secret conferences and the like; the control operation of whether to enter the voice encryption state can be executed before the multi-user voice interaction or in the multi-user voice interaction process, so that the control flexibility is high; the encryption operation may be implemented based on a combination of hardware and software.

The foregoing is merely an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of this specification, and so that the above and other objects, features and advantages of the present application become more apparent, specific embodiments of the present application are described in detail below.

Drawings

In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the application and are not to be considered limiting of its scope.

Fig. 1 is a schematic diagram of an application scenario of a voice interaction scheme provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of another application scenario of a voice interaction scheme provided in an embodiment of the present application;

Fig. 3 is a schematic structural frame diagram of a voice output device according to an embodiment of the present application;

Fig. 4 is a schematic structural frame diagram of a server according to an embodiment of the present application;

Fig. 5 is a schematic structural frame diagram of a voice receiving device according to an embodiment of the present application;

Fig. 6 is a schematic diagram of interaction between devices in an embodiment of the present application;

Fig. 7 is a schematic flow chart of a voice interaction method according to an embodiment of the present application;

Fig. 8 is a schematic diagram of a principle of spectrum inversion in an embodiment of the present application;

Fig. 9 is a schematic diagram of another principle of spectrum inversion in an embodiment of the present application;

Fig. 10 is a schematic diagram of updating a spectrum cut-off point in an embodiment of the present application;

Fig. 11 is a flowchart of another voice interaction method according to an embodiment of the present application;

Fig. 12 is a flowchart of another voice interaction method according to an embodiment of the present application; and

Fig. 13 is a flowchart of another voice interaction method according to an embodiment of the present application.

Detailed Description

Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

With the popularization of VoIP software such as DingTalk, voice calls, conferences and other interactions among multiple users have become more convenient, and further interaction needs have emerged on this basis. For example, call content needs to be encrypted to prevent eavesdropping; and during a multi-user call, some users occasionally need to talk privately about more confidential matters, which requires a function for private communication that does not disturb the overall call, so as to improve call efficiency and conference efficiency. The related art proposes some solutions to these needs, but they usually encrypt call content at the data level and the transport layer. Such solutions cannot prevent the operating system or malware from capturing the microphone input, the user cannot perceive any improvement in privacy protection, and the server side of the VoIP software may still hold the decryption key, so the VoIP software can eavesdrop while relaying the call. In addition, the call content may be recognized as text by an ASR (Automatic Speech Recognition) system on the server side, so that keywords are extracted and advertisements are pushed.

Based on the above status quo, the embodiment of the application provides a voice interaction scheme, which comprises a voice interaction device, a server and a voice interaction method which can be executed on the voice interaction device. The voice interaction device can be a voice output device for outputting voice or a voice receiving device for receiving voice, the voice output device can encrypt an input voice signal and output an encrypted voice signal, the server can forward the encrypted voice signal output by the voice output device to the voice receiving device, the voice receiving device can receive the encrypted voice signal and decrypt the encrypted voice signal to restore an original voice signal, and the voice output device and the voice receiving device are matched for use, so that encrypted conversation can be realized, and the privacy of a user in the conversation process can be protected.

Fig. 1 shows an application scenario of the voice interaction scheme. Referring to fig. 1, the voice interaction scheme provided in the embodiment of the present application may be applied to a scenario of multi-person conversation interaction, such as online conference, live broadcast, etc., where each user may connect with a terminal device using a voice interaction device (a voice output device or a voice receiving device), and install VoIP software on the terminal device, and voice connection between multiple users may be achieved through VoIP. In the conversation process, a voice output side, a voice transit side and a voice receiving side are involved, a speaker can speak to the voice output device at the voice output side so as to input voice signals to the voice output device, the voice signals can be transmitted to a server at the voice transit side through VoIP software installed on a terminal device A after being encrypted by the voice output device, the server can transmit encrypted voice signals to a terminal device B at the voice receiving side, the VoIP software installed on the terminal device B can transmit received encrypted voice signals to the voice receiving device through the terminal device B, the voice receiving device can decrypt the encrypted voice signals, and the decrypted voice signals can be played to a listener at the voice receiving side.
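The data flow of fig. 1 can be sketched as follows. This is a minimal illustration only: the keyed sign flip used as the "encryption" is a toy stand-in chosen so that the example runs with the standard library, not the scheme of the application, and the names `keystream_signs`, `scramble` and `relay` are hypothetical. The point shown is that terminal device A, the server and terminal device B only ever handle the encrypted signal, while the voice receiving device restores the original with the same key.

```python
import hashlib
from typing import List


def keystream_signs(key: bytes, n: int) -> List[int]:
    """Derive n values of +1 or -1 from the key (illustrative stand-in only)."""
    signs: List[int] = []
    counter = 0
    while len(signs) < n:
        block = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        for byte in block:
            for bit in range(8):
                signs.append(1 if (byte >> bit) & 1 else -1)
        counter += 1
    return signs[:n]


def scramble(samples: List[int], key: bytes) -> List[int]:
    """Keyed sign flip; applying it twice with the same key restores the input."""
    return [s * g for s, g in zip(samples, keystream_signs(key, len(samples)))]


def relay(signal: List[int], observed: List[List[int]]) -> List[int]:
    """Terminal device or server: records what it can see, then forwards it unchanged."""
    observed.append(list(signal))
    return signal


key = b"shared-demo-key"                      # assumed to be known to both end devices
voice = [120, -340, 560, -80, 910, -1024, 33, 7]

encrypted = scramble(voice, key)              # voice output device
observed: List[List[int]] = []
hop = relay(encrypted, observed)              # terminal device A (VoIP software)
hop = relay(hop, observed)                    # server
hop = relay(hop, observed)                    # terminal device B (VoIP software)
decrypted = scramble(hop, key)                # voice receiving device
```

A sign flip alone would be a very weak scrambler in practice; it is used here only because it is its own inverse, which keeps the encrypt and decrypt steps symmetric.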

In the application scenario shown in fig. 1, the voice output device may be any one of a microphone, an earphone with a voice input function, an earphone adapter, a smart speaker, and the like, or may be any one of a mobile phone, a computer, a smart watch, and the like. The voice receiving device may be any one of an earphone without a voice input function, an earphone with a voice input function, an earphone adapter, a smart speaker, and the like. The terminal device may be any one of a mobile phone, a computer, a smart watch, and the like.

Fig. 2 shows another application scenario for implementing the voice interaction scheme. Referring to fig. 2, the voice interaction scheme provided in the embodiment of the present application may be applied to a multi-person call scenario, for example a multi-person online conference, in which some users connect a voice output device to a terminal device on which VoIP software is installed, and other users install VoIP software directly on a voice receiving device, so that voice connection among the users is achieved through VoIP. The call involves a voice output side, a voice transit side and a voice receiving side. A speaker (for example, a user using the voice output device) speaks to the voice output device to input a voice signal; after being encrypted by the voice output device, the voice signal is transmitted to the server on the voice transit side through the VoIP software installed on the terminal device; the server forwards the encrypted voice signal to the voice receiving device on the voice receiving side; and the voice receiving device decrypts the encrypted voice signal received by its VoIP software and plays the decrypted voice signal to a listener on the voice receiving side.

In the application scenario shown in fig. 2, the voice output device may be any one of a microphone, an earphone with a voice input function, an earphone adapter, a smart speaker, and the like, or may be any one of a mobile phone, a computer, a smart watch, and the like, and the voice receiving device may be any one of terminal devices such as a mobile phone, a computer, a smart watch, and the like.

In the application scenario shown in fig. 1, if the voice interaction device used by each user is integrated with a voice output function and a voice receiving function, the voice interaction device of each user can be used as a speaker or a listener, and the voice interaction device of the user can realize an encryption function or a decryption function. In the scenario shown in fig. 2, for a user using a voice output device, if the voice output device is integrated with a voice receiving function at the same time, the user may be a speaker or a listener, and the voice output device may implement an encryption function or a decryption function.

In the application scenario shown in fig. 1 and 2, the speaker can select a listener, and at least a part of the listeners are given authority to listen to the content of the speaker, so that the encrypted voice signal can be output only to the voice receiving apparatus used by the selected listener.

The voice interaction scheme provided by the embodiment of the application can be applied to various actual scenes, such as multi-person online chatting, small-range live broadcasting and the like in life scenes, online conferences, online teaching, internal live broadcasting and the like in office scenes, and can be used when the voice content needs to be protected from leakage, the conversation efficiency needs to be improved and the conference efficiency needs to be improved.

In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following describes related technologies of the embodiments of the present application. The following related technologies may be optionally combined with the technical solutions of the embodiments of the present application, which all belong to the protection scope of the embodiments of the present application.

The embodiment of the application provides a voice output device, as shown in fig. 3, the device may include: a first control switch 301, an encryption control module 302 and a first communication module 303.

The first control switch 301, which may be located outside the speech output device, may be used to generate an encryption on command when triggered. The encryption control module 302 is electrically connected to the first control switch 301, and is configured to enter a voice encryption state in response to an encryption start instruction before or during a multi-user voice interaction, and encrypt an acquired voice signal based on a key in the voice encryption state to generate an encrypted voice signal. The first communication module 303 may be electrically connected to the encryption control module 302 and may be used to transmit encrypted voice signals to a terminal device or a server connected to the voice output device; the terminal device is used for forwarding the encrypted voice signal to the server, and the server is used for forwarding the encrypted voice signal to the voice receiving device of the interaction object.

In one example, when the voice output device is a device such as a microphone, an earphone, an earphone adapter or a smart speaker, it needs to be connected to and work together with a device such as a mobile phone or a tablet computer during the voice interaction. In this case, the voice output device may send the encrypted voice signal to the connected terminal device, the terminal device forwards the encrypted voice signal to the server, and the server forwards it to the voice receiving device. In another example, when the voice output device is itself a terminal device such as a mobile phone, a computer or a smart watch, the encrypted voice signal may be transmitted directly to the server.

The first control switch 301 may be any type of switch, such as a push switch, a toggle switch, a twist switch, or a touch switch. When the first control switch 301 is a push switch, pressing it may generate an encryption start instruction so that the voice output device enters the voice encryption state, and pressing it again may generate an encryption end instruction so that the voice output device ends the voice encryption state. When the first control switch 301 is a toggle switch, toggling it may generate an encryption start instruction so that the voice output device enters the voice encryption state, and toggling it again or toggling it in the other direction may generate an encryption end instruction so that the voice output device ends the voice encryption state. When the first control switch 301 is a twist switch, twisting it may generate an encryption start instruction so that the voice output device enters the voice encryption state, and twisting it again or twisting it in the other direction may generate an encryption end instruction so that the voice output device ends the voice encryption state. When the first control switch 301 is a touch switch, touching it may generate an encryption start instruction so that the voice output device enters the voice encryption state, and touching it again or operating it in the other direction may generate an encryption end instruction so that the voice output device ends the voice encryption state.
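The press-to-start, press-again-to-end behaviour described above can be sketched as a small state machine. The class and member names (`PushSwitch`, `EncryptionControl`, `Instruction`) are hypothetical, and the sketch assumes a push-type switch that simply alternates between the two instructions.

```python
from enum import Enum
from typing import List


class Instruction(Enum):
    ENCRYPTION_START = "encryption_start"
    ENCRYPTION_END = "encryption_end"


class PushSwitch:
    """Hypothetical push-type switch: each press alternates start and end instructions."""

    def __init__(self) -> None:
        self._engaged = False

    def press(self) -> Instruction:
        self._engaged = not self._engaged
        if self._engaged:
            return Instruction.ENCRYPTION_START
        return Instruction.ENCRYPTION_END


class EncryptionControl:
    """Tracks whether the device is currently in the voice encryption state."""

    def __init__(self) -> None:
        self.encrypting = False

    def handle(self, instruction: Instruction) -> None:
        self.encrypting = instruction is Instruction.ENCRYPTION_START


switch = PushSwitch()
control = EncryptionControl()
states: List[bool] = []
for _ in range(3):                 # press three times: on, off, on
    control.handle(switch.press())
    states.append(control.encrypting)
```

Because the control module reacts only to instructions, the same handler would work for a toggle, twist or touch switch that emits the same two instructions.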

The type of the first control switch 301 is not limited to the types listed above, and other types may be used. For the switch types listed above, the specific operation that generates the corresponding instruction is likewise not limited to the manners listed, and other manners may be used. The encryption control module 302 may be a processor.

According to the voice output equipment provided by the embodiment of the application, voice contents in multi-user voice interaction can be encrypted on an audio domain, an operating system, malicious software, an ASR system and the like can only acquire encrypted voice signals and cannot decrypt the encrypted voice signals so that original voice contents cannot be acquired, so that call privacy of a user can be effectively protected, the operating system and the malicious software are prevented from acquiring call contents, and secret call, secret conference and the like are realized; the control operation of whether to enter the voice encryption state can be executed before the multi-user voice interaction or in the multi-user voice interaction process, so that the control flexibility is high; the encryption operation may be implemented based on a combination of hardware and software.
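To make "encrypted on an audio domain" concrete: the transformed signal is still an audio waveform, so it can pass through ordinary audio paths and VoIP software. The figure list mentions spectrum inversion, but this passage does not describe how the application performs it, so the following is only a textbook form of fixed spectrum inversion, offered as an assumption-laden sketch. Multiplying sample n by (-1)^n moves a component at frequency f to fs/2 - f, and applying the same operation again restores the signal. The sample rate, tone frequency and function name are illustrative choices.

```python
import math
from typing import List


def invert_spectrum(samples: List[float]) -> List[float]:
    """Multiply sample n by (-1)^n, which maps frequency f to fs/2 - f."""
    return [s if n % 2 == 0 else -s for n, s in enumerate(samples)]


FS = 8000        # sample rate in Hz (assumed for the example)
TONE_HZ = 1000   # input tone; expected to appear at FS/2 - 1000 = 3000 Hz
N = 64           # number of samples

tone = [math.cos(2 * math.pi * TONE_HZ * n / FS) for n in range(N)]
inverted = invert_spectrum(tone)

# Reference tone at the mirrored frequency, for comparison
expected = [math.cos(2 * math.pi * (FS / 2 - TONE_HZ) * n / FS) for n in range(N)]

# The operation is its own inverse
restored = invert_spectrum(inverted)
```

This fixed inversion involves no key and offers no real secrecy on its own; it is shown only to illustrate that scrambling can happen on the waveform itself rather than on the transmitted data packets.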

In one embodiment, the first control switch 301 may have a plurality of gears, each of which, when triggered, may generate an encryption start instruction of one level, each level of encryption start instruction being associated with a voice encryption state of one level. Correspondingly, the encryption control module 302 may be further configured to, upon receiving an encryption start instruction of a given level, enter the voice encryption state of the level associated with that instruction, and generate voice transmission rule information in that voice encryption state. The first communication module 303 may be further configured to transmit the voice transmission rule information to the terminal device or the server, where the voice transmission rule information may include a specified authority level, which may be the authority level associated with the current voice encryption state.

The user can switch among the gears in the first control switch 301 according to the requirement, so as to switch among different levels of voice encryption states, and on the outer side of the voice output device 300, the gears can be arranged according to the level sequence (from low to high or from high to low) of the encryption start instruction which can be generated, so that the user can conveniently and sequentially switch the gears, and the user can conveniently and rapidly position a certain gear.

The multiple gears in the first control switch 301 enter different voice encryption states by generating encryption starting instructions with different levels, and can be finally associated with multiple authority levels, and through selection and triggering of the gears, interaction objects with corresponding authority levels can be selected as objects for receiving encrypted voice signals, so that different encryption requirements of users can be met.
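The chain from gear, to encryption start instruction level, to voice encryption state, to specified authority level can be sketched as below. The concrete mapping from encryption level to authority level, and all class and function names, are assumptions made for illustration; the text only states that the levels are associated with each other.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EncryptionStartInstruction:
    level: int


@dataclass(frozen=True)
class VoiceTransmissionRuleInfo:
    specified_authority_level: int


# Assumed association: a higher encryption level narrows the audience
# to a higher authority level (1 = highest authority).
AUTHORITY_LEVEL_BY_ENCRYPTION_LEVEL: Dict[int, int] = {1: 3, 2: 2, 3: 1}


def trigger_gear(gear: int) -> EncryptionStartInstruction:
    """Hypothetical switch: gear k generates an encryption start instruction of level k."""
    return EncryptionStartInstruction(level=gear)


class EncryptionControl:
    def __init__(self) -> None:
        self.encryption_level: Optional[int] = None

    def handle(self, instruction: EncryptionStartInstruction) -> VoiceTransmissionRuleInfo:
        # Enter the voice encryption state of the associated level ...
        self.encryption_level = instruction.level
        # ... and generate the rule information to send to the terminal device or server.
        return VoiceTransmissionRuleInfo(
            specified_authority_level=AUTHORITY_LEVEL_BY_ENCRYPTION_LEVEL[instruction.level]
        )


control = EncryptionControl()
rule = control.handle(trigger_gear(3))
```

Keeping the association in a single table makes it easy to arrange the gears in level order on the device, as the description suggests.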

In one embodiment, the voice output device may further include an output module configured to output, for the current voice encryption state, a list of the interaction objects of the multi-user voice interaction, and to generate a selection instruction when a selection operation is performed on the list. Correspondingly, the encryption control module 302 may be further configured to determine, in response to the selection instruction, the selected interaction objects as the interaction objects of the specified authority level, and to generate the voice transmission rule information based on the interaction objects of the specified authority level.

The output module can be a display screen, the interactive object list can be output through the display screen, and a user can finish the selection of the interactive object through touching the appointed area of the display screen so as to generate a selection instruction, so that the user can select the interactive object with the appointed authority level by himself.

In one embodiment, the voice output device may further include a second control switch, where the second control switch is operable to generate the key update instruction when triggered. Correspondingly, the encryption control module 302 may be configured to update the key in response to the key update instruction.

The second control switch may be any one of a push switch, a toggle switch, a twist switch, a touch switch, and the like. When the second control switch is a push switch, the key update instruction may be generated by pressing it. When the second control switch is a toggle switch, the key update instruction may be generated by toggling it. When the second control switch is a twist switch, the key update instruction may be generated by twisting it. When the second control switch is a touch switch, the key update instruction may be generated by touching it.

In one example, the second control switch may be integrated with the first control switch 301 as a single switch; for example, the first control switch 301 may include multiple gears, one of which may be triggered to generate the key update instruction.
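Handling a key update instruction can be sketched as replacing the current key with fresh random bytes. The `KeyStore` class, the 32-byte key length and the version counter are assumptions for illustration; this passage does not state how keys are generated or how an updated key reaches the voice receiving device.

```python
import secrets


class KeyStore:
    """Hypothetical holder of the current key inside the encryption control module."""

    def __init__(self, key_length: int = 32) -> None:
        self._key_length = key_length
        self.version = 0
        self.key = secrets.token_bytes(key_length)

    def update(self) -> None:
        """Respond to a key update instruction by generating a fresh random key."""
        self.key = secrets.token_bytes(self._key_length)
        self.version += 1


store = KeyStore()
old_key = store.key
store.update()          # second control switch triggered
new_key = store.key
```

A version counter of this kind is one possible way for both ends to tell which key a given piece of encrypted audio belongs to.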

The voice output device may be any device such as a microphone, an earphone, an earphone adapter, or a smart speaker. When it is used together with a terminal device such as a mobile phone, a computer, or a smart watch, the voice output device may further include an audio transmission interface for exchanging voice signals with the terminal device, for example a TRS (tip-ring-sleeve, 6.35 mm) interface, an XLR (Cannon connector) interface, or an HDMI (High Definition Multimedia Interface) interface. The voice output device may be plugged into the corresponding interface of the terminal device for a plug-and-play connection, or may be connected to the terminal device wirelessly.

The voice output device may be independent of the terminal device and may encrypt the voice signal on its own. The voice signal is therefore already encrypted when the voice output device transmits it to the terminal device, and the operating system and malicious software on the terminal device cannot decrypt it and thus cannot obtain the voice content.

Based on the same technical concept, the embodiment of the present application further provides a server, as shown in fig. 4, where the server may include a second communication module 401, where the second communication module 401 may be configured to receive an encrypted voice signal transmitted by a voice output device or a terminal device, and forward the encrypted voice signal to a voice receiving device of an interaction object. The voice output device may be any one of the voice output devices provided in the embodiments of the present application, and the encrypted voice signal transmitted by the terminal device may be provided by any one of the voice output devices provided in the embodiments of the present application.

In one embodiment, the second communication module 401 may be further configured to receive voice transmission rule information transmitted by the voice output device or the terminal device. Correspondingly, when forwarding the encrypted voice signal to the voice receiving device of the interactive object, the second communication module 401 may be configured to transmit the encrypted voice signal to the voice receiving device of the interactive object corresponding to the specified authority level in the voice transmission rule information, or transmit the encrypted voice signal to the voice receiving device of the interactive object corresponding to each authority level and send the key to the voice receiving device of the interactive object corresponding to the specified authority level.

The interactive objects of different authority levels may have an inclusion relationship. For example, the interactive objects of the second authority level may include the interactive objects of the first authority level together with additional interactive objects; that is, the set of interactive objects of a lower authority level includes the interactive objects of a higher authority level. The interactive objects of the first authority level have higher authority than those of the second authority level: they can listen both to content that the interactive objects of the second authority level are permitted to hear and to content that those objects are not permitted to hear. In one example, for an online meeting of an enterprise, all employees of the enterprise have the right to listen to part of the meeting content and are interactive objects of the second authority level, while senior managers of the enterprise have the right to listen to all of the meeting content (including highly confidential content) and are interactive objects of the first authority level.

Based on the above manner, the server can make the specified interactive object acquire the voice content in two ways. One way is to transmit the encrypted voice signal only to the voice receiving device of the interactive object corresponding to the designated authority level (e.g., high authority level) in the voice transmission rule information, but not to the voice receiving device of the interactive object of the other authority level (e.g., low authority level) except the designated authority level, that is, the interactive object of the other authority level cannot acquire the encrypted voice signal and further cannot decrypt, so that the voice content which needs to be transmitted to part of the interactive object is prevented from being leaked to other interactive objects; another way is to transmit encrypted voice signals to all the voice receiving devices of the interactive objects indiscriminately, but only send the key to the voice receiving devices of the interactive objects corresponding to the designated authority level (for example, the high authority level), that is, only the voice receiving devices of the interactive objects corresponding to the designated authority level can decrypt the received encrypted voice signals to obtain correct voice content, and the voice receiving devices of the interactive objects corresponding to other authority levels cannot decrypt even if receiving the encrypted voice signals, so that the correct voice content cannot be obtained.
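The two forwarding modes can be sketched as follows. The `Receiver` class and the function names are hypothetical, and the sketch assumes that a smaller number means a higher authority level, so that an interactive object is entitled to content whenever its level is numerically no greater than the specified level (which reproduces the inclusion relationship described above).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Receiver:
    name: str
    authority_level: int                       # assumption: 1 = highest authority
    inbox: List[bytes] = field(default_factory=list)
    key: Optional[bytes] = None


def entitled(receiver: Receiver, specified_level: int) -> bool:
    """Higher-authority objects are included in every lower-authority audience."""
    return receiver.authority_level <= specified_level


def forward_selective(encrypted: bytes, receivers: List[Receiver], specified_level: int) -> None:
    """Mode 1: only entitled receivers get the encrypted voice signal at all."""
    for receiver in receivers:
        if entitled(receiver, specified_level):
            receiver.inbox.append(encrypted)


def forward_broadcast(encrypted: bytes, key: bytes,
                      receivers: List[Receiver], specified_level: int) -> None:
    """Mode 2: everyone gets the encrypted signal, but only entitled receivers get the key."""
    for receiver in receivers:
        receiver.inbox.append(encrypted)
        if entitled(receiver, specified_level):
            receiver.key = key


manager = Receiver("manager", authority_level=1)
employee = Receiver("employee", authority_level=2)
receivers = [manager, employee]

forward_selective(b"frame-1", receivers, specified_level=1)
forward_broadcast(b"frame-2", b"demo-key", receivers, specified_level=1)
```

In the first mode the confidentiality of the content rests on routing; in the second it rests on key distribution, at the cost of the server handling the key.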

In one embodiment, the second communication module 401 is further configured to acquire an interactive object list uploaded by the voice output device or the terminal device, where the interactive object list includes a plurality of interactive objects and organization structure information of the plurality of interactive objects, such as employee information of an enterprise and information on the departments to which the employees belong.

Correspondingly, the server may further include a processing module, which may be electrically connected to the second communication module 401 and may be configured to assign a corresponding authority level to each interactive object in the interactive object list according to the organization structure information of that interactive object. In one example, the authority level of each employee may be determined according to the employee information of the enterprise and the information on the department to which the employee belongs. For example, if employee A belongs to the finance department, employee A has listening authority over financial data; if employee A belongs to a senior management body such as the board of directors, employee A has listening authority over financial information and all other information. Employees of the same department have the same authority level.

Authority levels can be assigned quickly based on the organization structure information of the interactive objects, so that when the encrypted voice signal and the voice transmission rule information are received, the correct voice content can be delivered to the interactive objects of the designated authority level.
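As a minimal sketch of assigning authority levels from organization structure information, the following maps each user to a level by department only. The department names, the numeric levels (1 = highest), and the default level are hypothetical and chosen purely for illustration.

```python
# Hypothetical mapping from department to authority level (1 = highest).
DEPARTMENT_LEVEL = {"board": 1, "finance": 2}
DEFAULT_LEVEL = 3  # assumption: departments not listed get the lowest level


def assign_levels(interactive_objects):
    """Map each user to an authority level using only their department."""
    return {
        user: DEPARTMENT_LEVEL.get(department, DEFAULT_LEVEL)
        for user, department in interactive_objects.items()
    }


def may_listen(user_level, content_level):
    """A user may hear content whose required level is not higher than theirs."""
    return user_level <= content_level


roster = {"A": "finance", "B": "finance", "C": "board", "D": "sales"}
levels = assign_levels(roster)
```

Because the level depends only on the department, employees of the same department automatically receive the same level, and a board member (level 1) passes the check for content of any level.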

The server provided by the embodiment of the present application may be a server of the provider of the voice output device or the voice receiving device, a server of the provider of the VoIP software used to implement the multi-user voice interaction, or another server that offers a certain level of security. The processing module may be any of various types of processors.

Based on the same technical concept, the embodiment of the present application further provides a voice receiving apparatus, as shown in fig. 5, which may include a third communication module 501 and a decryption control module 502. The third communication module 501 may be configured to receive an encrypted voice signal sent by a server, where the server may be any one of servers provided in the embodiments of the present application; the decryption control module 502 may be configured to decrypt the encrypted voice signal based on the key, and obtain correct voice content after decryption, thereby completing effective voice interaction.

In an embodiment, the voice receiving apparatus may further include a third control switch disposed outside the voice receiving apparatus, and the third control switch may generate a decryption start instruction when a start operation is performed, so as to decrypt the received encrypted voice signal. The third control switch can be any one of a button switch, a toggle switch, a twist switch, a touch switch and the like.

The voice receiving device can be any device such as a microphone, an earphone conversion head, an intelligent sound box, a mobile phone, a computer, an intelligent watch and the like.

When the voice receiving device is a microphone, an earphone conversion head, or an intelligent sound box used together with the terminal device, decryption can be performed directly in the device based on the key, which is a hardware decryption mode. The voice receiving device may further include an audio transmission interface, such as a TRS interface, an XLR interface, or an HDMI interface, for transmitting the voice signal to the terminal device. The voice receiving device may be plugged into the corresponding interface of the terminal device for plug-and-play use, or may be connected to the terminal device wirelessly.

When the voice receiving device is a device such as a mobile phone, a computer, or a smart watch, the voice receiving and decrypting functions can be integrated in the terminal device. In this case, a plug-in for decryption can be received in advance and installed on the device, so that the encrypted voice signal received by the VoIP software can be decrypted in subsequent voice interaction, which is a software decryption mode.

In an exemplary scenario, the voice output device in the embodiments of the present application may integrate the voice receiving and voice decrypting functions, for example by integrating the components and functions of the foregoing voice receiving device. Likewise, the voice receiving device may integrate the voice output and voice encrypting functions, for example by integrating the components and functions of the foregoing voice output device. That is, the same device may integrate a voice output function, a voice encryption function, a voice receiving function, and a voice decryption function. Thus, through the same device, the user can output encrypted voice and decrypt received encrypted voice, and can freely switch between the roles of speaker and listener.

The devices (voice output device, server, and voice receiving device) in the embodiments of the present application may further include a memory, which may be used to store a computer program that may be executed to implement voice interactions.

In the embodiment of the present application, the components in each device (the voice output device, the server, and the voice receiving device) can be connected with each other through a bus and exchange data with each other. The bus may be an industry standard architecture (Industry Standard Architecture, ISA) bus, a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 3 and fig. 5, but this does not mean that there is only one bus or one type of bus.

Alternatively, in a specific implementation, if the components in each device are integrated on a chip, the components may communicate with each other through an internal interface.

Fig. 6 shows a schematic diagram of the interaction between devices. Referring to fig. 6, a mobile phone (terminal device) may set a key and upload the key to a VoIP server. The VoIP server may obtain a list of authorized listeners (e.g., listeners selected by the speaker) and issue keys to the devices of the authorized listeners, for example issuing a key at time t1 and issuing a key at time t2. The voice receiving device of a listener that receives the key may decrypt the voice content through hardware or software. The specific working principle of each device provided in the embodiment of the present application will be further described in the subsequent method embodiments.

Based on the same technical concept, referring to fig. 7, the embodiment of the present application further provides a voice interaction method 700, which may be applied to a voice output device, the voice interaction method 700 may include the following steps S701 to S703:

S701, before or during the multi-person voice interaction, entering a voice encryption state in response to an encryption start instruction.

The multi-person voice interaction may be a multi-person online meeting, live broadcast, or the like. "Before the multi-person voice interaction" may refer to a time before the multi-person online meeting, live broadcast, or the like starts, for example 1 minute, 5 minutes, or another interval before the start time. The encryption start instruction may be generated based on the start operation performed by the user on the first control switch of the voice output device; for the first control switch and its start operation, reference may be made to the description in the foregoing device embodiments, which is not repeated here.

S702, in a voice encryption state, the acquired voice signal is encrypted based on a key, and an encrypted voice signal is generated.

S703, transmitting the encrypted voice signal to a terminal device or a server connected to the voice output device.

The terminal device may be configured to forward the encrypted speech signal to a server, which may be configured to forward the encrypted speech signal to a speech receiving device of the interactive object. The interaction object may be a user on the speech receiving side that participates in a multi-person speech interaction, such as the listeners shown in fig. 1 and 2. When outputting the encrypted voice signal to the voice receiving apparatus of the interactive object of the multi-person voice interaction, the encrypted voice signal may be output to all or part of the interactive objects participating in the multi-person voice interaction.

According to the voice interaction method provided by the embodiment of the application, voice contents in multi-user voice interaction can be encrypted on an audio domain, an operating system, malicious software, an ASR system and the like can only acquire encrypted voice signals and cannot decrypt the encrypted voice signals so that original voice contents cannot be acquired, so that call privacy of a user can be effectively protected, the operating system and the malicious software are prevented from acquiring call contents, and secret call, secret conference and the like are realized; the control operation of whether to enter the voice encryption state can be executed before the multi-user voice interaction or in the multi-user voice interaction process, and the control flexibility is high.

In one embodiment, the encryption start instruction includes encryption start instructions of multiple levels. Correspondingly, in the step S701, entering the voice encryption state in response to the encryption start instruction may include: in response to a received encryption start instruction of one level, entering the voice encryption state of the level associated with that encryption start instruction;

generating voice transmission rule information in the voice encryption state, where the voice transmission rule information includes a specified authority level, which is the authority level associated with the current voice encryption state.

In one embodiment, the voice interaction method 700 may further include: outputting, for the current voice encryption state, an interactive object list corresponding to the multi-user voice interaction; and in response to a selection instruction for an interactive object in the interactive object list, determining the selected interactive object as an interactive object of the specified authority level. Correspondingly, generating the voice transmission rule information in the voice encryption state may include: generating the voice transmission rule information based on the interactive objects of the specified authority level. In this way, the user can designate the interactive objects that currently have listening authority, so as to meet the user's actual interaction needs.
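A minimal sketch of generating voice transmission rule information from the user's selection might look as follows. The dictionary layout and the field names `designated_level` and `authorized_objects` are hypothetical; the text does not define a concrete format.

```python
def build_rule_info(encryption_level, object_list, selected_ids):
    """Build hypothetical voice transmission rule information.

    encryption_level: level of the current voice encryption state.
    object_list: IDs of all interactive objects in the session.
    selected_ids: IDs the speaker picked from the list.
    """
    selected = set(selected_ids)
    unknown = selected - set(object_list)
    if unknown:
        raise ValueError(f"not in the interactive object list: {sorted(unknown)}")
    return {
        "designated_level": encryption_level,
        # Keep the order of the original list so the result is deterministic.
        "authorized_objects": [o for o in object_list if o in selected],
    }


rule_info = build_rule_info(1, ["u1", "u2", "u3"], ["u3", "u1"])
```

Rejecting IDs that are not in the list keeps the rule information consistent with the list that was actually shown to the speaker.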

The interactive object list may include hardware information of all interactive objects corresponding to the multi-user voice interaction, for example at least one of the device identification and the hardware configuration information of the voice receiving device of each interactive object, and may also include at least one of the account number, nickname, and avatar of the user acting as the interactive object.

The selection instruction may be generated after the user selects the interactive object in the interactive object list, and the selected interactive object may include all or part of the objects in the interactive object list according to the actual selection of the user.

According to the actual needs of the user, the operation of outputting the interactive object list corresponding to the multi-user voice interaction for the user to select from may be performed one or more times, so as to meet the encryption needs for different objects. Each time, the interactive object list may be output in response to a triggering operation by the user, so that listening rights can be granted to different interactive objects. For example, in an online conference, the speaker can select which users need to listen to the voice content; the selected users are given the higher authority as the designated authority level, and the encrypted voice signal is sent to them.

In one embodiment, the voice interaction method 700 may further include: in the current voice encryption state, switching to a voice encryption state of another level in response to a state switching instruction. The state switching instruction may be generated when the gear of the first control switch of the voice output device is switched.

In this way, voice can be encrypted for interactive objects of different authority levels in voice encryption states of different levels. When different voice content needs to be sent to interactive objects of different authority levels, a separate conference no longer needs to be initiated; the voice encryption state can simply be switched within the same conference. In one example, if only a small part of the content of a conference needs to be sent to some of the participants, the voice encryption state can be switched for that part. This avoids the situation in which those participants cannot switch back to the original conference in time after a separate conference has been initiated, so that the whole conference flows more smoothly and is more efficient.

The state switching instruction may be generated in any of the following ways. In the first way, the user switches the gear of the first control switch on the voice output device, which generates the state switching instruction. In the second way, the state switching instruction is generated after recognition of the voice content; for example, it can be generated after keywords that are likely to introduce voice content of a higher confidentiality level are recognized, such as a statement that the following business data is restricted to colleagues within the project group, which can serve as a keyword that triggers the state switching instruction. In the third way, the state switching instruction is generated at a designated time point; for online conferences with a definite agenda, that is, conferences in which it is known which content will be presented in which time period, the time points at which the voice encryption state needs to be switched can be preset.
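The three trigger sources can be sketched as a single check. The function name, its parameters, and the returned labels are hypothetical; the sketch also assumes that recognized text is available as a string and that time is tracked in seconds since the conference started.

```python
def state_switch_trigger(gear_changed=False, recognized_text="", keywords=(),
                         previous_s=0.0, elapsed_s=0.0, scheduled_s=()):
    """Return which of the three trigger sources fired, or None.

    gear_changed: the first control switch was moved to another gear.
    recognized_text, keywords: recognized voice content and trigger phrases.
    previous_s, elapsed_s: time of the previous and the current check.
    scheduled_s: preset time points at which the state should switch.
    """
    if gear_changed:
        return "switch"
    text = recognized_text.lower()
    if any(k.lower() in text for k in keywords):
        return "keyword"
    # A scheduled point fires once, in the check interval that contains it.
    if any(previous_s < t <= elapsed_s for t in scheduled_s):
        return "schedule"
    return None


fired = state_switch_trigger(
    recognized_text="The following business data is for the project group only",
    keywords=["business data"],
)
```

Checking the interval between two consecutive checks, rather than testing for an exact time, avoids missing a scheduled point when the check does not run at precisely that instant.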

In one embodiment, in the step S702, encrypting the acquired voice signal based on the key in the voice encryption state may include: in voice encryption states of different levels, encrypting the acquired voice signal based on different keys. A key can be allocated to the voice encryption state of each level in advance. The allocation can be performed before the multi-person voice interaction, so that it does not disturb the normal progress of the interaction and the overall efficiency of the interaction is improved; it can also be performed during the multi-person voice interaction, so that keys are associated according to the users' real-time needs.

Different keys are adopted for encryption under different levels of voice encryption states, so that the encryption safety can be improved, and the confidentiality effect of voice interaction can be improved.

In one embodiment, the voice interaction method 700 may further include: displaying voice encryption options on a starting interface or a real-time interaction interface of the multi-user voice interaction, and generating the encryption start instruction in response to a triggering operation on the voice encryption options.

The starting interface may be an interface shown before the voice interaction is entered, on which a voice encryption option can be displayed for the user to choose whether to enter the voice encryption state. The voice encryption option may be displayed as text or as a combination of text and images, where the text may be, for example, "whether to enter the voice encryption state" or "whether to turn on the voice encryption function". The real-time interaction interface may be an interface that mainly displays interaction information; the voice encryption option can be displayed in an edge area of this interface and can be displayed as an image. The voice encryption options presented on the starting interface or the real-time interaction interface may include several sub-options, such as "enter a first level voice encryption state", "enter a second level voice encryption state", and so forth.

In one embodiment, the voice interaction method 700 may further include: updating the key in response to a key update instruction, so that the voice signal acquired in real time can be encrypted based on the updated key. Updating the key can increase the encryption strength and prevent the encryption from becoming ineffective if a key is stolen.

In one example, the key may be updated periodically, that is, at a number of preset time points. The time interval between the preset time points may be fixed; for example, the key may be updated once every 5 minutes after the voice encryption state is entered. The time interval may also vary; for example, the key may be updated once 5 minutes after the voice encryption state is entered and updated again 10 minutes after that, with the interval between updates increasing or decreasing from one update to the next. Updating the key periodically keeps the key fresh, improves the encryption strength, and better prevents leakage of the interaction information.
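The preset update time points can be illustrated with a minimal sketch. The function and parameter names are hypothetical, and a linear change of the interval is only one possible reading of "sequentially increased or sequentially decreased".

```python
def update_times(first_interval_s, count, step_s=0):
    """Return key update times in seconds after entering the encryption state.

    step_s = 0 gives a fixed interval; a positive step lengthens each
    successive interval and a negative step shortens it.
    """
    times, t, interval = [], 0, first_interval_s
    for _ in range(count):
        if interval <= 0:
            raise ValueError("interval must stay positive")
        t += interval
        times.append(t)
        interval += step_s
    return times


fixed = update_times(300, 3)            # every 5 minutes
growing = update_times(300, 3, 300)     # 5 min, then 10 min, then 15 min later
shrinking = update_times(600, 3, -120)  # 10 min, then 8 min, then 6 min later
```

A schedule of this kind only decides when to update; how the sender and the authorized listeners agree on each new key is a separate matter that this sketch does not cover.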

In another example, the key may be updated in response to an update instruction, where the update instruction may be generated when the user triggers a second control switch in the voice output device. In this way, the user can decide whether to update the key and can trigger the update at the time an update is needed, which strengthens the user's autonomy in the encryption process and the operability of the encryption.

In one embodiment, the key includes information of at least one of a spectral boundary point and a spectral cut-off point of the acquired voice signal. Correspondingly, in the step S702, encrypting the acquired voice signal based on the key may include: determining at least one frequency band in the acquired voice signal based on the information of at least one of the spectral boundary point and the spectral cut-off point, and performing spectrum inversion on the at least one frequency band, for example on each frequency band of the at least one frequency band. Spectrum inversion is an encryption algorithm in the analog domain.

In one example, the key may include spectral boundary points Lo and Hi of the speech signal as shown in fig. 8, where Lo is a lower spectral limit and Hi is an upper spectral limit, and based on the lower spectral limit Lo and the upper spectral limit Hi shown in the graph (a) in fig. 8, a unique frequency band, that is, a frequency band between the lower spectral limit Lo and the upper spectral limit Hi, may be determined, the frequency band may be inverted as a whole, an original high-frequency signal in the frequency band may be placed at a low-frequency position, and an original low-frequency signal in the frequency band may be placed at a high-frequency position, to obtain an inverted frequency spectrum shown in (b).

In another example, the key may include the spectral boundary points Lo and Hi and a spectral cut-off point a of the voice signal as shown in fig. 9, where Lo is the spectral lower limit, Hi is the spectral upper limit, and the spectral cut-off point a is a point selected between the spectral lower limit Lo and the spectral upper limit Hi. Based on the spectral lower limit Lo, the spectral upper limit Hi, and the spectral cut-off point a shown in fig. 9 (a), the spectrum is divided by these three points into two frequency bands. The first frequency band, that is, the band between the spectral lower limit Lo and the spectral cut-off point a, may be inverted as a whole, and the second frequency band, that is, the band between the spectral cut-off point a and the spectral upper limit Hi, may also be inverted as a whole, so that the original high-frequency signal in each frequency band is placed at a low-frequency position and the original low-frequency signal in each frequency band is placed at a high-frequency position.

In this embodiment of the present application, the number of spectral cut-off points may be one or more; the single spectral cut-off point a shown in fig. 9 is merely an example and does not limit the number of spectral cut-off points. Setting spectral cut-off points between the spectral boundary points increases the number of frequency bands that are inverted, thereby increasing the encryption strength.
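The way a key made of boundary points and cut-off points determines the bands, and where a given frequency lands after each band is inverted, can be sketched as follows. The function names are hypothetical, and the mapping f -> low + high - f is simply the arithmetic form of "inverting a band as a whole".

```python
def bands_from_key(lo_hz, hi_hz, cut_points_hz=()):
    """Split [lo_hz, hi_hz] into consecutive bands at the given cut-off points."""
    cuts = sorted(cut_points_hz)
    if any(not lo_hz < c < hi_hz for c in cuts):
        raise ValueError("cut-off points must lie strictly between Lo and Hi")
    edges = [lo_hz, *cuts, hi_hz]
    return list(zip(edges[:-1], edges[1:]))


def inverted_frequency(f_hz, bands):
    """Return where a frequency lands after each band is inverted as a whole."""
    for low, high in bands:
        if low <= f_hz <= high:
            return low + high - f_hz
    return f_hz  # outside every band: left unchanged in this sketch


one_band = bands_from_key(0, 4000)           # key with boundary points only
two_bands = bands_from_key(0, 4000, [1600])  # key with one cut-off point
```

With the same 300 Hz input, the single-band key moves the component to 3700 Hz while the two-band key moves it to 1300 Hz, which illustrates why a listener needs the exact points of the key to undo the inversion.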

In one embodiment, determining at least one frequency band in the acquired voice signal based on the information of at least one of the spectral boundary point and the spectral cut-off point may be implemented in either of the following ways. In the first way, the acquired voice signal is input into a low-pass filter to attenuate signals with frequencies above the cut-off frequency, where the cut-off frequency of the low-pass filter is the spectral upper limit, the retained signal below the cut-off frequency is the determined frequency band, and the cut-off frequency may be equal to the carrier frequency, for example 4kHz. In the second way, the acquired voice signal is input into a low-pass filter and into at least one band-pass filter, where the low-pass filter attenuates signals with frequencies above its cut-off frequency and each band-pass filter attenuates signals outside its specified frequency band, so that at least one spectral cut-off point and at least two frequency bands are determined by the low-pass filter and the at least one band-pass filter.

In one example, the acquired speech signal may be input to a low pass filter with a cut-off frequency of 4kHz, which attenuates signals greater than 4kHz, retains signals below 4kHz, i.e., 0-4 kHz, with 0kHz being the lower spectral limit, 4kHz being the upper spectral limit, and 0-4 kHz being the determined frequency band.

In another example, the acquired speech signal may be input to a low-pass filter having a cut-off frequency of 1.6kHz, which attenuates signals greater than 1.6kHz, and a band-pass filter having cut-off frequencies of 1.6kHz and 4kHz, which attenuates signals in other frequency ranges than the range of 1.6 to 4kHz, with 0kHz serving as the lower spectral limit, 4kHz serving as the upper spectral limit, 1.6kHz serving as the spectral cut-off point between the lower spectral limit and the upper spectral limit, and 0 to 1.6kHz and 1.6 to 4kHz serving as the determined two frequency bands.

In one embodiment, spectrum inversion of at least one frequency band may include: the voice signal of each frequency band is mixed with carrier wave (sine wave) to obtain two sidebands, namely an upper sideband and a lower sideband, the upper sideband is attenuated by a low pass filter, and the lower sideband is the frequency band with inverted frequency spectrum. In one example, if the low-pass filter retains a frequency band of 0-4 kHz before inversion, the signal in the frequency band may be mixed with a carrier (e.g., 4 kHz), equivalent to AM (amplitude modulation) modulation, if the frequency in the frequency band is denoted as f, after mixing with the carrier of 4kHz, two sidebands of 4k-f (lower sideband) and 4k+f (upper sideband) may be obtained, and by one low-pass filter of 4kHz, 4k+f may be attenuated, and the retained 4k-f is the frequency band after spectral inversion, i.e., the encrypted frequency band, and all the encrypted frequency bands may together form an encrypted speech signal.
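The two sidebands follow from the product-to-sum identity. As a sketch, for a single component of frequency f mixed with a unit-amplitude carrier of frequency f_c (the symbol f_c is introduced here for illustration):

```latex
\cos(2\pi f t)\,\cos(2\pi f_c t)
  = \tfrac{1}{2}\cos\bigl(2\pi (f_c - f)\,t\bigr)
  + \tfrac{1}{2}\cos\bigl(2\pi (f_c + f)\,t\bigr)
```

With f_c = 4 kHz, a 1 kHz component therefore yields a lower sideband at 3 kHz and an upper sideband at 5 kHz, and only the 3 kHz term survives the 4 kHz low-pass filter. The factor of 1/2 shows that the retained sideband has half the amplitude of the original component.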

In one embodiment, the power of the voice signal may be reduced by the mixing and the subsequent low-pass filtering. To compensate for this loss, loudness normalization may be performed on the signal obtained after the mixing and the subsequent low-pass filtering; specifically, this signal may be multiplied by a value greater than 1 to increase its loudness. In the above step S703, the encrypted voice signal output to the terminal device or the server may be the loudness-normalized signal.
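One possible way to choose the multiplier is to match the root-mean-square (RMS) level of the processed signal to that of the signal before mixing. This RMS-matching rule and the `max_gain` cap are assumptions made for illustration; the text only requires a value greater than 1.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_loudness(processed, reference, max_gain=8.0):
    """Scale `processed` so that its RMS level matches that of `reference`."""
    level = rms(processed)
    if level == 0.0:
        return list(processed)  # silence: nothing to scale
    gain = min(rms(reference) / level, max_gain)  # cap avoids boosting noise
    return [s * gain for s in processed]


reference = [1.0, -1.0, 1.0, -1.0]   # level before mixing
processed = [0.5, -0.5, 0.5, -0.5]   # level after mixing and filtering
restored = normalize_loudness(processed, reference)
```

In this toy input the processed signal has half the amplitude of the reference, so the computed gain is 2.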

In one embodiment, when the key is updated, the information of at least one of the spectral lower limit, the spectral upper limit, and the spectral cut-off point may be updated. In one example, referring to fig. 10, at time t1 the spectral cut-off point may be point a, and the frequency bands are divided and inverted based on the spectral cut-off point a; at time t2 the spectral cut-off point may be updated to point b, and the frequency bands are divided and inverted based on the spectral cut-off point b. An ASR system may include a model that has been retrained against the spectrum inversion encryption method. Updating the information of at least one of the spectral lower limit, the spectral upper limit, and the spectral cut-off point can render such a retrained model ineffective, so that leakage of voice information to the ASR system is better prevented.

In one embodiment, in the above voice interaction method, after frequency spectrum inversion is performed on at least one frequency band, the voice signal after frequency spectrum inversion may be input into a low-pass filter, so that a portion greater than a cut-off frequency is attenuated, and the output encrypted voice signal may be a signal output after processing by the low-pass filter. The cut-off frequency of the low-pass filter can be set according to practical requirements, for example, can be set to 4kHz (kilohertz), the upper limit of a VoIP channel is usually 4kHz, and the human ear is usually more sensitive to signals below 4 kHz.

In one embodiment, the voice interaction method 700 may further include: outputting prompt information, wherein the prompt information is used for prompting a user to close a voice enhancement function in current interaction software (namely VoIP software for realizing voice interaction), and closing the voice enhancement function in the current interaction software in response to a confirmation operation for the prompt information; or, the reverse compensation is performed on the voice signal before the frequency spectrum inversion or the voice signal after the frequency spectrum inversion, and the reverse compensation can be used for compensating the signal attenuation generated by the frequency spectrum inversion.

In practical applications, some interactive software has a voice enhancement function, which performs operations such as voice enhancement and noise reduction and is roughly equivalent to an equalizer (EQ). An EQ is generally used to adjust the gain of voice in different frequency bands; the voice enhancement function generally attenuates the portion to which the human ear is insensitive (i.e., the high-frequency portion) and amplifies the portion to which the human ear is sensitive (i.e., the low-frequency portion). Spectrum inversion places the original low-frequency portion at a high-frequency position and the original high-frequency portion at a low-frequency position. If the voice enhancement function is applied after spectrum inversion, it attenuates what is now the high-frequency portion and enhances what is now the low-frequency portion, which is equivalent to attenuating the portion to which the human ear is sensitive and enhancing the portion to which it is insensitive, and this degrades the effect of the voice interaction.

To solve this problem, the embodiments of the present application provide two schemes: one scheme is to prompt a user to close the original voice enhancement function, so that adverse effects caused by the voice enhancement function after frequency spectrum inversion are avoided; another approach is to reverse compensate the speech signal before spectral inversion to compensate for the signal attenuation expected to occur by spectral inversion, or reverse compensate the speech signal after spectral inversion to compensate for the signal attenuation already occurring after spectral inversion.

In one example, the reverse compensation of the speech signal before spectral inversion or the speech signal after spectral inversion may include: and carrying out enhancement processing on the signal in the designated frequency range of the voice signal before or after the frequency spectrum inversion based on the frequency response of the VoIP channel obtained by testing the VoIP channel in advance so as to compensate the signal attenuation in the designated frequency range after the frequency spectrum inversion and the low-pass filtering processing. The specified frequency range may be a frequency range matching the frequency response of the VoIP channel, and the specified frequency range may be a lower frequency range, for example, 0 to 400Hz (hertz), or a higher frequency range, for example, 3600Hz to 4000Hz, or other ranges, and the specific range may be determined according to the frequency response of the VoIP channel being tested. The specific way of the enhancement processing may be to construct an EQ model with reverse compensation based on the frequency response of the VoIP channel obtained by testing the VoIP channel in advance, and enhance the specified frequency range of the voice signal based on the EQ model.

In one example, if the specified frequency range is 0-400 Hz, the voice signal before spectrum inversion may be reverse compensated as follows: the 0-400 Hz signal in the voice signal before spectrum inversion is enhanced, to make up for the attenuation that occurs after this signal is inverted to the 3600-4000 Hz position and processed by the low-pass filter. The voice signal after spectrum inversion may be reverse compensated as follows: since the original 0-400 Hz signal lies at 3600-4000 Hz after spectrum inversion, the 3600-4000 Hz signal in the inverted voice signal is enhanced, to make up for the attenuation that occurs after the original 0-400 Hz signal is inverted to 3600-4000 Hz and processed by the low-pass filter.
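The relationship between the two compensation options can be sketched with per-band gains. The band-to-gain dictionary is a hypothetical stand-in for a measured VoIP channel frequency response, and a 4 kHz carrier is assumed so that a band (low, high) seen on the channel carries the original band (4000 - high, 4000 - low).

```python
def post_inversion_gains_db(channel_response_db):
    """Boost (dB) to apply AFTER inversion: cancel the loss in each band."""
    return {band: -gain_db for band, gain_db in channel_response_db.items()}


def pre_inversion_gains_db(channel_response_db, carrier_hz=4000):
    """Boost (dB) to apply BEFORE inversion, keyed by the original band.

    A band (low, high) measured on the channel holds, after inversion,
    the original band (carrier - high, carrier - low), so the boost is
    moved to that original band.
    """
    return {
        (carrier_hz - high, carrier_hz - low): -gain_db
        for (low, high), gain_db in channel_response_db.items()
    }


# Hypothetical measurement: the channel loses 6 dB between 3600 and 4000 Hz.
measured = {(3600, 4000): -6.0}
before = pre_inversion_gains_db(measured)
after = post_inversion_gains_db(measured)
```

Both options apply the same amount of boost; they differ only in the band to which it is applied, because the boost is specified either in the original or in the inverted frequency layout.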

In one embodiment, the voice interaction method 700 may further include: before spectral inversion is performed on the at least one frequency band, removing the direct current (DC) component from the acquired voice signal. This prevents a DC bias from arising and prevents the DC component from being inverted into a high-frequency signal during the subsequent spectral inversion, which would cause considerable noise interference.

Fig. 11 shows a specific example of a voice interaction method applied to a voice output device according to an embodiment of the present application, and referring to fig. 11, a voice interaction method 1100 may include the following steps:

S1101, acquiring audio sampling points, namely sampling points of the voice signal input by the speaker;

S1102, removing the direct current bias from the audio sampling points;

S1103, performing reverse compensation on the audio sampling points from which the direct current bias has been removed;

S1104, inputting the reverse-compensated audio sampling points into a 4 kHz low-pass filter for low-pass filtering;

S1105, mixing the filtered audio sampling points with a 4 kHz carrier wave (sine wave);

S1106, inputting the mixed signal into a 4 kHz low-pass filter for low-pass filtering;

S1107, performing loudness normalization on the filtered audio sampling points, and then outputting them.
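The processing chain of steps S1101 to S1107 can be sketched as follows. This is a minimal illustration under several assumptions that the application does not state: a 16 kHz sampling rate, a 101-tap windowed-sinc FIR filter as the 4 kHz low-pass filter, a pass-through placeholder for the channel-specific reverse compensation, and simple peak normalization standing in for loudness normalization. All function names are hypothetical.

```python
import math

FS = 16000         # assumed sampling rate in Hz (not specified in the text)
CUTOFF = 4000.0    # low-pass cutoff used in S1104 and S1106
CARRIER = 4000.0   # carrier frequency used in S1105


def lowpass_taps(cutoff, fs, num_taps=101):
    """Hamming-windowed sinc FIR low-pass coefficients, normalized to unity DC gain."""
    mid = (num_taps - 1) / 2
    fc = cutoff / fs
    taps = []
    for n in range(num_taps):
        x = n - mid
        h = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(h * w)
    total = sum(taps)
    return [t / total for t in taps]


def fir(samples, taps):
    """Direct-form FIR filtering, assuming silence before the first sample."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * samples[i - j]
        out.append(acc)
    return out


def reverse_compensate(samples):
    """Placeholder for S1103: the real EQ depends on the measured VoIP channel."""
    return samples


def scramble(samples, fs=FS):
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]                      # S1102: remove DC bias
    x = reverse_compensate(x)                            # S1103
    taps = lowpass_taps(CUTOFF, fs)
    x = fir(x, taps)                                     # S1104
    x = [v * math.sin(2 * math.pi * CARRIER * n / fs)    # S1105: mix with carrier
         for n, v in enumerate(x)]
    x = fir(x, taps)                                     # S1106: keep the lower sideband
    peak = max(abs(v) for v in x) or 1.0
    return [v / peak for v in x]                         # S1107 (peak normalization here)


def tone_power(samples, freq, fs=FS):
    """Power of the component at freq, from a single-bin DFT correlation."""
    re = sum(v * math.cos(2 * math.pi * freq * n / fs) for n, v in enumerate(samples))
    im = sum(v * math.sin(2 * math.pi * freq * n / fs) for n, v in enumerate(samples))
    return (re * re + im * im) / len(samples) ** 2


# A 1 kHz test tone with a DC offset; after scrambling its energy sits near 3 kHz.
tone = [0.2 + math.cos(2 * math.pi * 1000 * n / FS) for n in range(1600)]
scrambled = scramble(tone)
```

Mixing a component at frequency f with the 4 kHz carrier produces components at 4 kHz - f and 4 kHz + f; the second low-pass filter removes the upper one, which is what moves the 1 kHz test tone to 3 kHz.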

Based on the same technical concept, referring to fig. 12, the embodiment of the present application further provides a voice interaction method 1200, which may be applied to a server, where the voice interaction method 1200 may include the following steps S1201-S1202:

S1201, an encrypted voice signal transmitted by a voice output device or a terminal device is received.

The voice output device may be any one of the voice output devices provided in the embodiments of the present application, and the encrypted voice signal transmitted by the terminal device may be provided by any one of the voice output devices provided in the embodiments of the present application.

S1202, the encrypted voice signal is forwarded to the voice receiving device of the interaction object.

In one embodiment, the voice interaction method 1200 may further include: receiving voice transmission rule information transmitted by the voice output device or the terminal device. Correspondingly, forwarding the encrypted voice signal to the voice receiving device of the interactive object may include: transmitting the encrypted voice signal to the voice receiving device of the interactive object corresponding to the specified authority level in the voice transmission rule information; or transmitting the encrypted voice signal to the voice receiving device of the interactive object corresponding to each authority level, and sending the key to the interactive object corresponding to the specified authority level.
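The two forwarding alternatives can be illustrated with a short sketch. The `Participant` type, the integer representation of authority levels, and the matching of levels by equality are assumptions made for illustration; the application does not specify how levels are encoded or compared.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    level: int  # hypothetical integer encoding of the authority level


def plan_forwarding(participants, specified_level, broadcast=False):
    """Return (signal_recipients, key_recipients) for one encrypted voice signal.

    broadcast=False: only participants at the specified level receive the signal.
    broadcast=True:  everyone receives the signal, but only participants at the
                     specified level are sent the key needed to decrypt it.
    """
    authorized = [p.name for p in participants if p.level == specified_level]
    if broadcast:
        return [p.name for p in participants], authorized
    return authorized, []


participants = [Participant("A", 2), Participant("B", 1), Participant("C", 2)]
```

In the first mode, unauthorized participants never receive the signal; in the second, they receive only a scrambled signal that they cannot decrypt without the key.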

In one embodiment, the voice interaction method 1200 may further include: acquiring device information of each interaction object corresponding to the multi-person voice interaction; and determining, according to the device information, whether the voice receiving device of the interaction object corresponding to the specified authority level is a voice receiving device with a decryption function. If so, a key is sent to the voice receiving device of that interaction object, so that the voice receiving device can decrypt the encrypted voice signal according to the key; if not, the current state is kept. The key may be obtained from the voice output device.

In one example, the device information of the interactive object may include a device identification of the voice receiving device of the interactive object, such as a model number or another specific identification. The device identification indicates whether the voice receiving device has a decryption function, that is, whether it is a decryption device, so whether the voice receiving device of each interactive object is a device with a decryption function can be determined from the device identification. In the embodiment of the present application, a device with a decryption function may be a device in which the key is stored when it leaves the factory.
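A possible shape for the capability check is sketched below. The "model:serial" identifier format, the model names, and the function names are hypothetical; the application only states that the device identification reveals whether the device has a decryption function.

```python
# Hypothetical set of models known to have a decryption function.
DECRYPTION_CAPABLE_MODELS = {"HS-200", "HS-300"}


def has_decryption_function(device_id):
    """Assume identifiers look like 'MODEL:SERIAL' and decide from the model part."""
    model = device_id.split(":")[0]
    return model in DECRYPTION_CAPABLE_MODELS


def select_key_recipients(device_by_participant, authorized):
    """Authorized participants whose receiving device can decrypt get the key;
    for the others nothing is sent (the current state is kept)."""
    return [name for name in authorized
            if has_decryption_function(device_by_participant[name])]


devices = {"A": "HS-200:0001", "B": "GENERIC:77", "C": "HS-300:0420"}
```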

A voice receiving device without a decryption function may be provided with a key so that it can decrypt the encrypted voice signal and the speaker's voice information can be listened to normally. For example, in a multi-user voice interaction scenario, the devices used by the participating users may not be uniform. Only some users may use a voice receiving device with a decryption function, which can decrypt the voice signal directly so that the voice information is heard normally, while other users cannot decrypt directly. A key may be provided to the voice receiving devices of the users who cannot decrypt, so that those users can also listen to the voice information normally.

In one embodiment, the voice interaction method 1200 may further include: acquiring an interactive object list uploaded by the voice output device or the terminal device, where the interactive object list includes a plurality of interactive objects and organization structure information of the plurality of interactive objects; and determining the authority level of each interactive object in the interactive object list according to the organization structure information of that interactive object.

In one example, for a meeting within an enterprise, the organization architecture information may include information on the employees within the enterprise and on the department to which each employee belongs, from which the authority level of each employee is determined. For example, if employee A belongs to the finance department, employee A has listening authority for financial data; if employee A belongs to a senior management body such as the board of directors, employee A has listening authority for accounting information and all other information. Employees of the same department have the same authority level.

In another example, for a meeting between enterprises, the organization architecture information may include user information and information on the enterprise to which each user belongs. For example, if user B belongs to enterprise C, that is, user B is an employee of enterprise C, user B has listening authority for information related to the internal affairs of enterprise C; if user B belongs to another enterprise, user B has no listening authority for information on the internal affairs of enterprise C. Users of the same enterprise may have the same authority level.
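The derivation of authority levels from organization information can be pictured with a small lookup. The department names, the topic labels, and the representation of an authority level as a set of topics are assumptions chosen for this sketch.

```python
# Hypothetical mapping from department to the topics its members may listen to.
LISTENING_SCOPE = {
    "finance": {"finance"},
    "engineering": {"engineering"},
    "board": {"finance", "engineering", "strategy"},
}


def authority_levels(interaction_list):
    """Map each (name, department) entry to the topics that person may hear.

    Members of the same department therefore share the same authority level.
    """
    return {name: LISTENING_SCOPE.get(department, set())
            for name, department in interaction_list}


def may_listen(levels, name, topic):
    return topic in levels.get(name, set())


levels = authority_levels([("A", "finance"), ("D", "board"), ("E", "sales")])
```

For a meeting between enterprises, the same lookup could be keyed by enterprise instead of department.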

In one embodiment, the voice interaction method 1200 may further include: acquiring interaction flow information corresponding to the multi-person voice interaction, and determining the association between each voice interaction period and each authority level according to the interaction flow information. Correspondingly, in step S1202, forwarding the encrypted voice signal to the voice receiving device of the interaction object may include: during the multi-person voice interaction, forwarding the encrypted voice signal to a first interaction object, where the first interaction object may be an interaction object corresponding to the authority level associated with the voice interaction period to which the current moment belongs.

In one example, for an online conference, the interaction flow information may be a conference agenda. The agenda makes clear which content is shared with which participants in which time period, and the participants so determined correspond to determined authority levels, so the association between each time period and each authority level can be determined. When forwarding the encrypted voice signal, the server may identify the time period to which the current moment belongs and send the encrypted voice signal to the participants corresponding to that time period.
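The agenda-based routing can be sketched as a lookup from the current time to an authority level. Representing the agenda as (start minute, end minute, level) tuples and matching participants by equal level are assumptions for illustration.

```python
# Hypothetical agenda: (start_minute, end_minute, authority_level), end exclusive.
AGENDA = [(0, 30, 1), (30, 60, 2)]


def level_for_minute(agenda, minute):
    """Authority level associated with the period containing `minute`, or None."""
    for start, end, level in agenda:
        if start <= minute < end:
            return level
    return None


def recipients_at(agenda, minute, level_by_participant):
    """Participants (first interaction objects) who should receive the signal now."""
    level = level_for_minute(agenda, minute)
    return sorted(name for name, participant_level in level_by_participant.items()
                  if participant_level == level)


participant_levels = {"A": 1, "B": 2, "C": 2}
```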

In one embodiment, the voice interaction method 1200 may further include: during the multi-person voice interaction, sending the association between each voice interaction period and each authority level to a second interaction object. The second interaction object is any interaction object other than the first interaction object, that is, an interaction object whose authority level is not associated with the voice interaction period to which the current moment belongs. For example, in a period T of an online conference, the server may send the association between each conference period and each authority level to a participant whose authority level is not associated with period T. From this association, the participant can determine whether the conference content in the current period T concerns them and whether they have authority to listen to it; on determining that they have no listening authority, the participant may choose to stay in the conference or to leave it temporarily.

Based on the same technical concept, referring to fig. 13, the embodiment of the present application further provides a voice interaction method 1300, which is applicable to a voice receiving device, and the method may include the following steps S1301-S1302:

S1301, the encrypted voice signal sent by the server is received.

The server may be any one of the servers provided in the embodiments of the present application, and the server may send the encrypted voice signal through the voice interaction method 1200 provided in the embodiments of the present application.

S1302, the encrypted voice signal is decrypted based on the key.

The key may be a key used in a voice interaction method applicable to the voice output device provided in any of the embodiments of the present application.

The key in the voice receiving device may be the same key as that in the voice output device; for example, a voice receiving device with a decryption function typically stores the same key as the voice output device in advance. The key in the voice receiving device may also be obtained in real time; for example, a voice receiving device without a decryption function generally does not store the same key as the voice output device in advance, and may obtain in real time the key, or the updated key, that is provided by the voice output device and forwarded by the server.

In an exemplary scenario, during a multi-person voice interaction, multiple voice output devices may output voice at the same time, for example when several people speak simultaneously in a conference. Each voice receiving device may then obtain the keys provided by the multiple voice output devices at the same time, and decrypt the encrypted voice signal output by each voice output device based on the corresponding key. The keys that one voice receiving device receives from the different voice output devices may be the same, for example unified as a preset key, or may be different. For example, different voice output devices may update their keys at different frequencies and under different rules; as another example, the users of different voice output devices may select different encryption ranges and set different authorities for the same listener, so that the keys output to the voice receiving device of that listener differ.

In one embodiment, in step S1302, decrypting the encrypted voice signal based on the key may include: detecting whether suspicious software exists on the current voice receiving device; decrypting the encrypted voice signal based on the key if it is determined that no suspicious software exists on the current voice receiving device; and not decrypting the encrypted voice signal if it is determined that suspicious software exists on the current voice receiving device. The suspicious software may be software capable of monitoring voice interaction content. Detecting suspicious software makes it possible to determine whether the software environment of the current voice receiving device is safe, and decrypting the acquired encrypted voice signal only in a safe environment further strengthens privacy protection.
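The gating logic can be sketched as follows. How suspicious software is actually detected is not described in the application; the name blocklist, the process names, and the toy XOR routine used in place of the real decryption are all assumptions for illustration.

```python
# Hypothetical names of software able to monitor voice interaction content.
SUSPICIOUS_NAMES = {"audio-sniffer", "screen-recorder"}


def environment_is_safe(running_processes, blocklist=SUSPICIOUS_NAMES):
    """True when none of the running process names appears on the blocklist."""
    return not any(name in blocklist for name in running_processes)


def decrypt_if_safe(encrypted, key, running_processes, decrypt):
    """Decrypt only in a safe environment; otherwise leave the signal encrypted."""
    if not environment_is_safe(running_processes):
        return None
    return decrypt(encrypted, key)


def toy_xor_decrypt(data, key):
    """Stand-in for the real decryption routine, used only to exercise the gate."""
    return bytes(b ^ key for b in data)
```

Returning `None` here models "the encrypted voice signal is not decrypted"; a real device might instead play nothing or pass the scrambled audio through.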

In this embodiment of the present application, the encrypted voice signal may specifically be decrypted as follows: spectral inversion is performed on at least one frequency band of the encrypted voice signal according to at least one of the spectrum lower limit, the spectrum upper limit, and the spectrum cut-off point in the key. This amounts to performing the same spectral inversion operation as in encryption once more; that is, decryption is the symmetric operation of encryption, and applying the encryption operation twice is equivalent to decryption.
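The symmetry between encryption and decryption can be checked numerically with a simplified model. The sketch below reverses the order of the DFT bins inside a band delimited by a lower and an upper frequency limit (treated here as the key) and shows that applying the same operation twice restores the input. The bin-reversal formulation, the 8 kHz sampling rate, and the 64-sample frame are assumptions for illustration and are not the carrier-mixing implementation described for the voice output device.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def invert_band(x, fs, f_low, f_high):
    """Reverse the spectrum of real signal x between f_low and f_high (in Hz)."""
    n = len(x)
    k_lo = round(f_low * n / fs)
    k_hi = round(f_high * n / fs)
    if not 1 <= k_lo < k_hi < n // 2:
        raise ValueError("band must lie strictly between DC and the Nyquist bin")
    spectrum = dft(x)
    spectrum[k_lo:k_hi + 1] = spectrum[k_lo:k_hi + 1][::-1]
    # Mirror the change onto the negative-frequency bins so the output stays real.
    for k in range(k_lo, k_hi + 1):
        spectrum[n - k] = spectrum[k].conjugate()
    return idft_real(spectrum)


FS = 8000                 # assumed sampling rate in Hz
KEY = (250.0, 3750.0)     # hypothetical key: spectrum lower and upper limits in Hz
voice = [math.sin(2 * math.pi * 500 * t / FS) + 0.5 * math.sin(2 * math.pi * 1500 * t / FS)
         for t in range(64)]
encrypted = invert_band(voice, FS, *KEY)
decrypted = invert_band(encrypted, FS, *KEY)
```

Because the band reversal is its own inverse, a receiver holding the same band limits needs no separate decryption routine.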

Based on the same technical idea, the embodiments of the present application provide a computer-readable storage medium storing a computer program that when executed by a processor implements the method provided in the embodiments of the present application.

The embodiment of the application also provides a chip, which comprises a processor and is used for calling the instructions stored in the memory from the memory and running the instructions stored in the memory, so that the communication device provided with the chip executes the method provided by the embodiment of the application.

The embodiment of the application also provides a chip, which comprises: the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the application embodiment.

It should be appreciated that the processor in embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting an advanced reduced instruction set machine (Advanced RISC Machines, ARM) architecture.

Further, optionally, the memory in the embodiments of the present application may include a read-only memory and a random access memory. The memory may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may include read-only memory (Read-Only Memory, ROM), programmable ROM (Programmable ROM, PROM), erasable PROM (Erasable PROM, EPROM), electrically erasable PROM (Electrically Erasable PROM, EEPROM), or flash memory, among others. Volatile memory may include random access memory (Random Access Memory, RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, for example static RAM (Static RAM, SRAM), dynamic RAM (Dynamic Random Access Memory, DRAM), synchronous DRAM (Synchronous DRAM, SDRAM), double data rate synchronous DRAM (Double Data Rate SDRAM, DDR SDRAM), enhanced SDRAM (Enhanced SDRAM, ESDRAM), synchronous link DRAM (Synchlink DRAM, SLDRAM), and direct Rambus RAM (Direct Rambus RAM, DR RAM).

In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

Any process or method described in flow charts or otherwise herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed in a substantially simultaneous manner or in an opposite order from that shown or discussed, including in accordance with the functions that are involved.

Logic and/or steps described in the flowcharts or otherwise described herein may, for example, be considered an ordered listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the embodiments described above may be performed by a program that, when executed, comprises one or a combination of the steps of the method embodiments, instructs the associated hardware to perform the method.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The foregoing is merely exemplary embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, which should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A speech output device, comprising:

the encryption control module is electrically connected with the first control switch and is used for responding to the encryption starting instruction to enter a voice encryption state before or during the multi-user voice interaction, and encrypting the acquired voice signal based on a secret key in the voice encryption state to generate an encrypted voice signal;

the first communication module is electrically connected with the encryption control module and is used for transmitting the encrypted voice signal to a terminal device or a server connected with the voice output device; the terminal device is used for forwarding the encrypted voice signal to the server, and the server is used for forwarding the encrypted voice signal to the voice receiving device of the interaction object.

2. The speech output device according to claim 1, wherein the first control switch has a plurality of gear steps, each gear step when triggered generating a level of encryption on command, each level of encryption on command being associated with a level of speech encryption status;

the encryption control module is further used for entering a voice encryption state of one level associated with the encryption starting instruction of one level when the encryption starting instruction of the level is received, and generating voice transmission rule information in the voice encryption state;

the first communication module is further configured to transmit the voice transmission rule information to the terminal device or the server; the voice transmission rule information includes a specified permission level, which is a permission level associated with a current voice encryption state.

3. The speech output device according to claim 2, further comprising:

the output module is used for outputting a plurality of interactive object lists of voice interaction aiming at the current voice encryption state, and generating a selection instruction when the selection operation is executed;

the encryption control module is further used for responding to the selection instruction to determine the selected interactive object as the interactive object with the appointed authority level, and generating the voice transmission rule information based on the interactive object with the appointed authority level.

4. The speech output device according to any one of claims 1-3, further comprising:

the second control switch is used for generating a key updating instruction when triggered;

the encryption control module is used for updating the key in response to the key updating instruction.

5. A server, comprising:

a second communication module, configured to receive the encrypted voice signal transmitted by the voice output device or the terminal device according to any one of claims 1 to 4, and forward the encrypted voice signal to a voice receiving device of the interaction object.

6. The server according to claim 5, wherein the second communication module is further configured to receive voice transmission rule information transmitted by the voice output device or the terminal device;

when forwarding the encrypted voice signal to the voice receiving device of the interactive object, the second communication module is configured to transmit the encrypted voice signal to the voice receiving device of the interactive object corresponding to the specified authority level in the voice transmission rule information, or transmit the encrypted voice signal to the voice receiving device of the interactive object corresponding to each authority level, and send a key to the voice receiving device of the interactive object corresponding to the specified authority level.

7. The server according to claim 5 or 6, wherein the second communication module is further configured to: acquiring an interactive object list uploaded by the voice output equipment or the terminal equipment; the interactive object list comprises a plurality of interactive objects and organization architecture information of the plurality of interactive objects;

the server further includes: and the processing module is electrically connected with the second communication module and is used for determining the corresponding authority level of each interactive object in the interactive object list according to the organization architecture information of each interactive object in the interactive object list.

8. A voice receiving apparatus, comprising:

a third communication module for receiving the encrypted voice signal transmitted by the server of any one of claims 5-7;

9. A method of voice interaction, for use with a voice output device, the method comprising:

before or during the multi-person voice interaction, responding to an encryption starting instruction to enter a voice encryption state;

in the voice encryption state, encrypting the acquired voice signal based on a secret key to generate an encrypted voice signal;

10. The voice interaction method according to claim 9, wherein the encryption on instruction includes a plurality of levels of encryption on instructions;

the method for responding to the encryption starting instruction to enter a voice encryption state comprises the following steps:

responding to a received encryption starting instruction of one level, and entering a voice encryption state of one level associated with the encryption starting instruction of the level;

11. The voice interaction method of claim 10, further comprising:

outputting an interaction object list corresponding to the multi-user voice interaction aiming at the current voice encryption state;

determining the selected interactive object as the interactive object of the appointed authority level in response to a selection instruction for the interactive object in the interactive object list;

The generating voice transmission rule information in the voice encryption state includes:

and generating voice transmission rule information based on the interactive object with the appointed authority level.

12. The voice interaction method of claim 10, further comprising:

in the current voice encryption state, responding to a state switching instruction to switch to another level of voice encryption state; the state switching instruction is generated when the gear of the first control switch of the voice output device is switched.

13. The voice interaction method of claim 12, wherein in the voice encryption state, encrypting the acquired voice signal based on a key comprises:

in the different level voice encryption state, the acquired voice signal is encrypted based on different keys.

14. The voice interaction method of any of claims 9-13, further comprising:

displaying voice encryption options on the starting interface or the real-time interaction interface of the multi-user voice interaction;

and generating an encryption starting instruction in response to a triggering operation for the voice encryption option.

15. The voice interaction method according to any one of claims 9-13, wherein the key comprises at least one of a spectral boundary point and a spectral cut-off point of the acquired voice signal;

The encrypting the acquired voice signal based on the key comprises the following steps:

determining at least one frequency band in the acquired voice signal based on at least one item of information in the frequency spectrum boundary point and the frequency spectrum cut-off point;

and performing frequency spectrum inversion on the at least one frequency band.

16. The voice interaction method of claim 15, further comprising:

outputting prompt information, and closing a voice enhancement function in the current interactive software in response to a confirmation operation for the prompt information;

or, reversely compensating the voice signal before the frequency spectrum inversion or the voice signal after the frequency spectrum inversion; the reverse compensation is used to compensate for signal attenuation resulting from spectral inversion.

17. A voice interaction method, applied to a server, the method comprising:

receiving an encrypted voice signal transmitted by the voice output device or the terminal device of any one of claims 1 to 4;

forwarding the encrypted voice signal to a voice receiving device of the interactive object.

18. The voice interaction method of claim 17, further comprising:

receiving voice transmission rule information transmitted by the voice output equipment or the terminal equipment;

The forwarding the encrypted speech signal to the speech receiving device of the interactive object comprises:

transmitting the encrypted voice signal to voice receiving equipment of an interactive object corresponding to the designated authority level in the voice transmission rule information;

or transmitting the encrypted voice signal to the voice receiving equipment of the interactive object corresponding to each authority level, and sending a secret key to the interactive object corresponding to the appointed authority level.

19. The voice interaction method of claim 17 or 18, further comprising:

acquiring an interactive object list uploaded by the voice output equipment or the terminal equipment; the interactive object list comprises a plurality of interactive objects and organization architecture information of the plurality of interactive objects;

and according to the organization architecture information of each interactive object in the interactive object list, corresponding authority levels of each interactive object in the interactive object list are obtained.

20. The voice interaction method of claim 17, further comprising:

acquiring interaction flow information corresponding to the multi-person voice interaction;

determining the association relation between each voice interaction period and each authority level according to the interaction flow information;

forwarding the encrypted voice signal to a first interaction object in the multi-person voice interaction process; the first interactive object is an interactive object corresponding to the authority level associated with the voice interaction period to which the current moment belongs.

21. The voice interaction method of claim 20, further comprising:

in the multi-person voice interaction process, sending the association relation between each voice interaction period and each authority level to a second interaction object; the second interactive object is other interactive objects except the first interactive object.

22. A voice interaction method, characterized by being applied to a voice receiving apparatus, the method comprising:

receiving an encrypted voice signal transmitted by the server of any one of claims 5 to 7;

decrypting the encrypted speech signal based on a key; the key being a key used in the voice interaction method of any of claims 9-16.

23. The voice interaction method of claim 22, wherein decrypting the encrypted voice signal based on the key comprises:

Detecting whether suspicious software exists on the current voice receiving equipment;

in the event that it is determined that no suspicious software is present on the current speech receiving device, the encrypted speech signal is decrypted based on the key.

24. A computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the voice interaction method of any of claims 9-23.