CN112653789A - Voice mode switching method, terminal and storage medium


Info

Publication number: CN112653789A
Application number: CN202011544069.2A
Authority: CN (China)
Prior art keywords: voice, voice mode, mode, speech, user
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 洪江力
Current Assignee: Shanghai Chuanying Information Technology Co Ltd
Original Assignee: Shanghai Chuanying Information Technology Co Ltd
Application filed by Shanghai Chuanying Information Technology Co Ltd
Priority to CN202011544069.2A
Publication of CN112653789A

Abstract

The application discloses a voice mode switching method, a terminal and a storage medium in the field of terminal technologies, which can intelligently switch to the voice mode appropriate to the current usage scene, solving the problem that existing voice mode switching is not intelligent and cannot meet user expectations. The voice mode switching method comprises the following steps: acquiring scene information; determining a second voice mode as a target voice mode according to the scene information and a decision relationship rule; and switching to the target voice mode. Optionally, a voice mode is a voice parameter group comprising at least one voice parameter, and the voice database comprises at least one voice mode.

Description

Voice mode switching method, terminal and storage medium
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a voice mode switching method, a terminal, and a storage medium.
Background
With the development of information technology, intelligent voice technology has become the most convenient and effective means for people to acquire and exchange information. Intelligent terminal devices such as mobile phones are generally provided with voice assistants that help users solve problems through intelligent conversation and instant question-and-answer interaction.
In some implementations, the voice assistant uses a fixed voice mode, or the user can manually set some of the voice mode parameters. As shown in fig. 1, some mobile phones allow a speaker character to be set as follows: tapping the voice broadcast role in the voice settings pops up a speaker setup/download page where the user can select Chinese-English Mandarin, Northeastern Mandarin, Henan dialect, Hunan dialect, Mandarin (Dawn), Mandarin (Dawn Swallow), and the like. The mobile phone downloads or invokes the mode selected by the user, such as the Northeastern mode, and uses that mode for broadcasting in subsequent voice interactions.
The inventor has found that the available voice modes are limited and can only be switched manually, which cannot meet current user expectations.
Disclosure of Invention
In view of this, the present application provides a voice mode switching method, a terminal and a storage medium that can intelligently switch to a corresponding voice mode according to the current usage scene, solving the problem that voice mode switching is not intelligent and cannot meet user expectations.
The application provides a voice mode switching method, which comprises the following steps:
acquiring scene information;
determining a second voice mode as a target voice mode according to the scene information and the decision relation rule;
and switching to the target voice mode.
According to this voice mode switching method, the second voice mode is determined as the target voice mode based on the acquired current scene information, the terminal switches to the target voice mode, and voice interaction with the user is carried out in the second voice mode. The first voice mode is the voice mode currently in use. The scheme of the embodiments of the application can dynamically switch to a suitable voice mode according to the current scene and avoid using a voice mode that does not fit the current occasion or scene.
Optionally, the method further comprises at least one of the following: the voice mode includes voice parameters, the voice parameters including basic voice parameters and/or voice behavior parameters that describe how the basic voice parameters change over time; the decision relationship rule includes a first correspondence and a second correspondence.
Optionally, the method further comprises at least one of the following: the first correspondence is a preset correspondence between scene information and voice modes; the second correspondence is a correspondence between scene information and voice modes that is newly generated by machine learning.
Optionally, the scene information includes environment data, and the method further includes at least one of the following:
when the environment data includes environmental noise, the first correspondence includes: when the environmental noise is greater than a first preset value, the target voice mode is a super-loud voice mode with a volume higher than a first volume, or the current voice mode with its volume increased;
when the environment data includes location information, the first correspondence includes: when the location information is identified as a workplace, the target voice mode is a work voice mode suitable for conversation in a working state;
when the environment data includes the current time, the first correspondence includes: when the current time is night, the target voice mode is a night voice mode suitable for night conversation.
Optionally, the scene information includes user input voice, and the first correspondence includes at least one of the following: when the user input voice is recognized as an elderly person's voice, the target voice mode is an accompanying voice mode suitable for conversation with the elderly; when the user input voice is recognized as a child's voice, the target voice mode is a child voice mode suitable for conversation with children; when the user input voice is recognized as a first dialect, the target voice mode is a first-dialect voice mode that conducts the conversation in that dialect.
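Purely as an illustration, the preset first correspondence can be thought of as a small rule table mapping scene information to voice modes. The sketch below assumes hypothetical thresholds, field names and mode identifiers; none of these values come from the application itself.

```python
# Minimal sketch of a preset "first correspondence": scene information -> target voice mode.
# All thresholds and mode names are illustrative assumptions, not values from the application.

NOISE_THRESHOLD_DB = 70          # hypothetical "first preset value" for environmental noise
NIGHT_HOURS = range(22, 24)      # hypothetical definition of "night" (22:00-23:59 and 0:00-6:59)
EARLY_HOURS = range(0, 7)

def first_correspondence(scene):
    """Return a target voice mode name for a scene-info dict, or None if no rule fires."""
    speaker = scene.get("speaker")           # e.g. "elderly", "child", or a dialect label
    if speaker == "elderly":
        return "accompanying_mode"           # suitable for conversation with the elderly
    if speaker == "child":
        return "child_mode"                  # suitable for conversation with children
    if speaker == "first_dialect":
        return "first_dialect_mode"          # reply in the same dialect

    if scene.get("noise_db", 0) > NOISE_THRESHOLD_DB:
        return "super_loud_mode"             # or: raise the volume of the current mode
    if scene.get("location") == "workplace":
        return "work_mode"
    hour = scene.get("hour")
    if hour is not None and (hour in NIGHT_HOURS or hour in EARLY_HOURS):
        return "night_mode"
    return None

# Example: noisy street at 23:00 with an adult speaker -> the speaker rules do not fire,
# so the noise rule selects the super-loud mode.
print(first_correspondence({"noise_db": 82, "hour": 23, "speaker": "adult"}))  # super_loud_mode
```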
Optionally, determining the second voice mode as the target voice mode according to the scene information and the decision relationship rule includes: extracting a noise level parameter from the environment data, extracting the basic voice parameters and/or voice behavior parameters of the user's voice from the user input voice, and matching the extracted noise level parameter, basic voice parameters and/or voice behavior parameters against the voice modes in a voice database according to the decision relationship rule to determine the target voice mode.
Optionally, after the switching to the target speech mode, the method further comprises:
receiving annotation feedback;
and performing machine learning on the annotation feedback to correct the first correspondence or generate a second correspondence, and/or to correct the voice mode or generate a new voice mode.
Optionally, before switching to the target voice mode, the method further includes: acquiring transition parameters for the voice mode switch, and switching to the target voice mode according to the transition parameters.
The application provides a voice mode switching method, which comprises the following steps:
acquiring scene information, wherein the scene information at least comprises one of environment data and user input voice;
selecting, according to the scene information and a decision relationship rule, a corresponding voice mode from a voice database or generating a new voice mode as the target voice mode; optionally, the decision relationship rule includes the correspondence between scene information and voice modes, a voice mode is a voice parameter group including at least one voice parameter, and the voice database includes at least one voice mode;
and switching the current voice mode to the target voice mode, and carrying out voice interaction with the user according to the target voice mode.
Optionally, the decision relationship rule is stored in a database; the decision relationship rule includes a first correspondence and a second correspondence, where the first correspondence is a preset correspondence between scene information and voice modes and the second correspondence is a correspondence between scene information and voice modes that is newly generated by machine learning.
Optionally, when the scene information includes the environment data and the environment data includes environmental noise, the first correspondence includes: when the environmental noise is greater than a first preset value, the target voice mode is a super-loud voice mode with a volume higher than a first volume, or the current voice mode with its volume increased; when the environment data includes location information, the first correspondence includes: when the location information is identified as a workplace, the target voice mode is a work voice mode suitable for conversation in a working state; when the environment data includes the current time, the first correspondence includes: when the current time is night, the target voice mode is a night voice mode suitable for night conversation;
when the scene information includes user input voice, the first correspondence includes: when the user input voice is recognized as an elderly person's voice, the target voice mode is an accompanying voice mode suitable for conversation with the elderly; when the user input voice is recognized as a child's voice, the target voice mode is a child voice mode suitable for conversation with children; when the user input voice is recognized as a first dialect, the target voice mode is a first-dialect voice mode that conducts the conversation in that dialect.
Optionally, after the current voice mode is switched to the target voice mode, the voice mode switching method further includes: receiving annotation feedback, input by the user in response to the switch, indicating whether the switch was correct; the annotation feedback is used for machine learning to correct the first correspondence or generate a second correspondence, and to correct the voice mode or generate a new voice mode.
Optionally, the voice parameters include basic voice parameters and voice behavior parameters that describe how the basic voice parameters change over time, the basic voice parameters including the volume, softness, tone and pitch of the sound.
Optionally, the scene information includes environment data and user input voice; selecting a corresponding voice mode from the voice database as the target voice mode according to the scene information and the decision relationship rule includes: extracting, from the environment data and the user input voice, a noise level parameter, the basic voice parameters of the user's voice and the voice behavior parameters, where the noise level parameter characterizes how high or low the environmental noise is; and matching the extracted noise level parameter, basic voice parameters of the user's voice and voice behavior parameters against the voice modes in the voice database according to the decision relationship rule to determine the target voice mode.
Optionally, matching the extracted noise level parameter, basic voice parameters of the user's voice and voice behavior parameters against the voice modes in the voice database according to the decision relationship rule to determine the target voice mode includes:
determining at least one initial voice mode according to the noise level parameter, the basic voice parameters of the user's voice, the voice behavior parameters and the decision relationship rule;
mapping the noise level parameter, the basic voice parameters of the user's voice and the voice behavior parameters into a group of voice parameters according to the decision relationship rule, and scoring the credibility of each of the at least one initial voice mode, where optionally the scoring items of the credibility score include: the degree to which the mapped group of voice parameters matches the voice parameters of the initial voice mode, the priority of the initial voice mode, and restrictions on selecting the initial voice mode;
selecting the first initial voice mode with the highest credibility score; when its score is greater than or equal to a preset value, taking the first initial voice mode as the target voice mode; and when its score is less than the preset value, modifying the voice parameters of the first initial voice mode according to the decision relationship rule and outputting the modified voice mode as the target voice mode.
Optionally, after selecting a corresponding voice mode from the voice database as the target voice mode according to the scene information and the decision relationship rule, and before switching the current voice mode to the target voice mode, the voice mode switching method further includes: acquiring the switching timing and transition parameters of the voice mode; and when the current voice mode is switched to the target voice mode, performing the switch according to the switching timing and the transition parameters.
Optionally, the switching timing and transition parameters are preset and are learned and corrected according to corresponding user annotation feedback as the voice mode switching method is executed.
Optionally, the voice database comprises: preset voice modes, custom voice modes and a first voice mode, where the first voice mode is a new voice mode generated through machine learning from the collected scene information and user annotation feedback.
Optionally, the method further comprises: learning the voice input of a specific person through machine learning according to the user annotation feedback, and updating the learned voice parameters into the custom voice mode.
Optionally, the database comprises at least one of:
a standard mode, a silent mode, a business voice mode suitable for use in business activities, an accompanying voice mode suitable for conversation with the elderly, a child voice mode suitable for conversation with children, a super-loud voice mode suitable for use in noisy environments, a night voice mode suitable for night conversations, a humorous funny voice mode, and a user-defined speaker mode.
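As one possible illustration of how such a database could be laid out, each voice mode can be stored as a parameter group keyed by a mode name. The sketch below is an assumption only; the parameter names and values are hypothetical and not taken from the application.

```python
# Hypothetical layout of a voice database: each voice mode is a voice parameter group.
# Parameter names and values are illustrative only.
voice_database = {
    "standard_mode":     {"volume": 0.6, "softness": 0.5, "pitch": 0.50, "rate": 1.0},
    "silent_mode":       {"volume": 0.0, "softness": 0.5, "pitch": 0.50, "rate": 1.0},
    "business_mode":     {"volume": 0.6, "softness": 0.4, "pitch": 0.45, "rate": 1.0},
    "accompanying_mode": {"volume": 0.7, "softness": 0.8, "pitch": 0.50, "rate": 0.8},
    "child_mode":        {"volume": 0.6, "softness": 0.9, "pitch": 0.65, "rate": 0.8},
    "super_loud_mode":   {"volume": 1.0, "softness": 0.3, "pitch": 0.50, "rate": 1.0},
    "night_mode":        {"volume": 0.3, "softness": 0.9, "pitch": 0.45, "rate": 0.9},
    "funny_mode":        {"volume": 0.7, "softness": 0.6, "pitch": 0.60, "rate": 1.1},
    "custom_speaker":    {"volume": 0.6, "softness": 0.5, "pitch": 0.50, "rate": 1.0},
}
```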
On the other hand, the present application further provides a voice mode switching apparatus, including: an acquisition unit configured to acquire scene information, the scene information including at least one of environmental data and user input speech; a decision unit, configured to select a corresponding voice mode as a target voice mode in a voice database according to the scene information and a decision relationship rule, where optionally, the decision relationship rule includes a correspondence between the scene information and the voice mode, the voice mode is a group of voice parameters including at least one voice parameter, and the voice database includes at least one voice mode; and the switching unit is used for switching the current voice mode to the target voice mode and carrying out voice interaction with the user according to the target voice mode.
In another aspect, the present application further provides a terminal, including a memory and a processor, where the memory stores a program, and the program is used for being executed by the processor to perform any one of the above-mentioned voice mode switching methods.
In another aspect, the present application further provides a readable storage medium, in which a program is stored, the program being used for being executed by a processor to execute the voice mode switching method according to any one of the above-mentioned methods.
According to this voice mode switching method, a corresponding voice mode is selected from the voice database as the target voice mode, or a new voice mode is generated directly as the target voice mode, according to the scene information and the decision relationship rule; the current voice mode is then switched to the target voice mode, and voice interaction with the user is carried out in the target voice mode. The usage scene at the time the voice mode switch occurs can be obtained from the scene information. The voice mode switching method can therefore intelligently switch to the corresponding voice mode according to the current usage scene, solving the problem that existing voice mode switching is not intelligent and cannot meet user expectations.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a voice mode switch page of a conventional voice assistant;
fig. 2 is a schematic hardware structure diagram of a mobile terminal implementing various embodiments of the present application;
fig. 3 is a flowchart illustrating a voice mode switching method according to a first embodiment of the present application;
fig. 4 is a flowchart illustrating a voice mode switching method according to a second embodiment of the present application;
FIG. 5 is a schematic diagram of an experience database of a second embodiment of the present application;
FIG. 6 is a schematic diagram of a speech database of a second embodiment of the present application;
fig. 7 is a flowchart illustrating a further voice mode switching method according to a second embodiment of the present application;
FIG. 8 is a schematic flow chart of determining a target speech mode according to a second embodiment of the present application;
fig. 9 is a flowchart illustrating another speech mode switching method according to a second embodiment of the present application;
fig. 10 is a flowchart illustrating another speech mode switching method according to a second embodiment of the present application;
fig. 11 is a schematic diagram of a voice mode switching apparatus according to a second embodiment of the present application;
fig. 12 is a schematic diagram illustrating a switching process of a voice mode switching apparatus according to a second embodiment of the present application;
fig. 13 is a schematic diagram of a terminal according to a second embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings. With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the recitation of an element by the phrase "comprising an … …" does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element, and further, where similarly-named elements, features, or elements in different embodiments of the disclosure may have the same meaning, or may have different meanings, that particular meaning should be determined by their interpretation in the embodiment or further by context with the embodiment.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first correspondence may also be referred to as a second correspondence, and similarly, a second correspondence may also be referred to as a first correspondence, without departing from the scope herein. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining". Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or," "and/or," "including at least one of the following," and the like, as used herein, are to be construed as inclusive, meaning any one or any combination. For example, "includes at least one of: A, B, C" means "any of the following: A; B; C; A and B; A and C; B and C; A and B and C"; likewise, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A and B and C". An exception to this definition occurs only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.
It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in an order indicated by arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not performed in a strictly fixed order and may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or multiple stages that are not necessarily performed at the same moment but may be performed at different times, and that are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined" or "in response to a determination" or "when (a stated condition or event) is detected" or "in response to detecting (a stated condition or event)", depending on the context.
It should be noted that step numbers such as S201 and S202 are used herein for the purpose of more clearly and briefly describing the corresponding contents, and do not constitute a substantial limitation on the sequence, and those skilled in the art may perform S202 first and then S201 in the specific implementation, but these should be within the scope of the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for the convenience of description of the present application, and have no specific meaning in themselves.
The terminal in the following description may be any electronic device that requires identity authentication. In general, the terminals in the above description may be mobile terminals. The mobile terminal may be implemented in various forms. For example, the mobile terminal described in the present application may include mobile terminals such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a smart band, a pedometer, and the like, and fixed terminals such as a Digital TV, a desktop computer, and the like.
The following description will be given taking a mobile terminal as an example, and it will be understood by those skilled in the art that the configuration according to the embodiment of the present application can be applied to a fixed type terminal or other electronic devices, in addition to elements particularly used for mobile purposes.
Referring to fig. 2, which is a schematic diagram of a hardware structure of a mobile terminal for implementing various embodiments of the present application, the mobile terminal 100 may include: RF (Radio Frequency) unit 101, WiFi module 102, audio output unit 103, a/V (audio/video) input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, processor 110, and power supply 111. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 2 is not intended to be limiting of mobile terminals, and that a mobile terminal may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile terminal in detail with reference to fig. 2:
the radio frequency unit 101 may be configured to receive and transmit signals during information transmission and reception or during a call, and specifically, receive downlink information of a base station and then process the downlink information to the processor 110; in addition, the uplink data is transmitted to the base station. Typically, radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 can also communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA2000(Code Division Multiple Access 2000), WCDMA (Wideband Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), FDD-LTE (Frequency Division duplex Long Term Evolution), and TDD-LTE (Time Division duplex Long Term Evolution).
WiFi belongs to short-distance wireless transmission technology, and the mobile terminal can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 102, and provides wireless broadband internet access for the user. Although fig. 2 shows the WiFi module 102, it is understood that it does not belong to the essential constitution of the mobile terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the mobile terminal 100 is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output related to a specific function performed by the mobile terminal 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 103 may include a speaker, a buzzer, and the like.
The A/V input unit 104 is used to receive audio or video signals. The A/V input unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or other storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 may receive sounds (audio data) in a phone call mode, a recording mode, a voice recognition mode, or the like, and can process such sounds into audio data. In the phone call mode, the processed audio (voice) data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 101. The microphone 1042 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated in the course of receiving and transmitting audio signals.
The mobile terminal 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that may optionally adjust the brightness of the display panel 1061 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 1061 and/or the backlight when the mobile terminal 100 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the device posture (such as mobile phone horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The display unit 106 is used to display information input by a user or information provided to the user. The Display unit 106 may include a Display panel 1061, and the Display panel 1061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 107 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the user input unit 107 may include a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect a touch operation performed by a user on or near the touch panel 1071 (e.g., an operation performed by the user on or near the touch panel 1071 using a finger, a stylus, or any other suitable object or accessory), and drive a corresponding connection device according to a predetermined program. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Optionally, the touch detection device detects a touch orientation of a user, detects a signal caused by a touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 110, and can receive and execute commands sent by the processor 110. In addition, the touch panel 1071 may be implemented in various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. In addition to the touch panel 1071, the user input unit 107 may include other input devices 1072. In particular, other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like, and are not limited to these specific examples.
Alternatively, the touch panel 1071 may cover the display panel 1061, and when the touch panel 1071 detects a touch operation thereon or nearby, the touch panel 1071 transmits the touch operation to the processor 110 to determine the type of the touch event, and then the processor 110 provides a corresponding visual output on the display panel 1061 according to the type of the touch event. Although the touch panel 1071 and the display panel 1061 are shown in fig. 2 as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 1071 and the display panel 1061 may be integrated to implement the input and output functions of the mobile terminal, and is not limited herein.
The interface unit 108 serves as an interface through which at least one external device is connected to the mobile terminal 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal 100 and external devices.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a program storage area and a data storage area, and optionally, the program storage area may store an operating system, an application program (such as a sound playing function, an image playing function, and the like) required by at least one function, and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 109 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 110 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby performing overall monitoring of the mobile terminal. Processor 110 may include one or more processing units; preferably, the processor 110 may integrate an application processor and a modem processor, optionally, the application processor mainly handles operating systems, user interfaces, application programs, etc., and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The mobile terminal 100 may further include a power supply 111 (e.g., a battery) for supplying power to various components, and preferably, the power supply 111 may be logically connected to the processor 110 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system.
Although not shown in fig. 2, the mobile terminal 100 may further include a bluetooth module or the like, which is not described in detail herein.
Based on the above mobile terminal hardware structure, various embodiments of the present application are provided.
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. It should also be understood that the mobile terminal hardware architecture and communication network system described herein are only used to assist understanding of the present application and are not intended to limit the present application.
Example one
As shown in fig. 3, an embodiment of the present application provides a voice mode switching method, including:
s201, acquiring scene information;
s202, determining a second voice mode as a target voice mode according to the scene information and the decision relation rule;
s203, switching to the target voice mode.
The scene information here refers to the usage scene of the voice conversation when a voice mode switch occurs, that is, the usage scene of the voice mode. The decision relationship rule is used to determine which voice mode the acquired scene information should correspond to; it may, for example, be a function that expresses the correspondence between scene information and voice modes. The first voice mode is typically the voice mode currently in use. In this embodiment of the application, the second voice mode is determined as the target voice mode according to the acquired current scene information, the terminal switches to the target voice mode, and voice interaction with the user is carried out in the second voice mode. The scheme of this embodiment can dynamically switch to a suitable voice mode according to the current scene information and avoid using a voice mode that does not fit the current occasion or scene.
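A minimal skeleton of steps S201-S203 might look like the following; the three helper callables (get_scene_info, decide, switch_to) are placeholder assumptions standing in for the pieces the embodiment describes, not functions defined by the application.

```python
# Skeleton of steps S201-S203; the three helper functions are hypothetical placeholders.
def switch_voice_mode(get_scene_info, decide, switch_to):
    scene = get_scene_info()                 # S201: acquire scene information
    target_mode = decide(scene)              # S202: apply the decision relationship rule
    if target_mode is not None:
        switch_to(target_mode)               # S203: switch to the target voice mode
    return target_mode
```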
Optionally, the voice mode may include voice parameters, the voice parameters including: basic voice parameters and/or voice behavior parameters that describe how the basic voice parameters change over time.
Optionally, the decision relationship rule may include a first corresponding relationship and a second corresponding relationship. The first corresponding relationship may be preset before factory shipment, and the second corresponding relationship may be newly generated or downloaded in actual application, or newly acquired through software upgrade.
In some embodiments, the first corresponding relationship may be a preset corresponding relationship between the scene information and the voice mode; the second correspondence may be a correspondence of scene information newly generated by machine learning and a voice pattern.
Optionally, the context information may include environmental data.
Optionally, when the environment data includes environmental noise, the first correspondence may include: when the environmental noise is greater than a first preset value, the target voice mode is a super-loud voice mode with a volume higher than a first volume, or the current voice mode with its volume increased.
Optionally, when the environment data includes location information, the first correspondence may include: when the location information is identified as a workplace, the target voice mode is a work voice mode suitable for conversation in a working state.
Optionally, when the environment data includes the current time, the first correspondence may include: when the current time is night, the target voice mode is a night voice mode suitable for night conversation.
In addition, optionally, the scene information may include user input voice.
Optionally, when the scene information includes user input voice, the first correspondence includes at least one of the following: when the user input voice is recognized as an elderly person's voice, the target voice mode is an accompanying voice mode suitable for conversation with the elderly.
Alternatively, when the user input voice is recognized as a child's voice, the target voice mode is a child voice mode suitable for conversation with children.
Optionally, when the user input voice is recognized as a first dialect, the target voice mode is a first-dialect voice mode that conducts the conversation in that dialect.
This embodiment does not limit the acquired scene information, which may be one or more of the items described above; nor does it limit the decision relationship rule, which may likewise be one or more of the correspondences described above.
Example two
As shown in fig. 4, an embodiment of the present application provides a voice mode switching method, including:
201. acquiring scene information, wherein the scene information at least comprises one of environment data and user input voice;
Scene information is the usage scene of a voice mode. For example, the current usage scene of the voice assistant generally includes user information (which user is being talked to), environment information (in what environment the conversation takes place, e.g. a noisy or a quiet environment), the APP that invokes the voice assistant when the conversation occurs, and so on. The APP invoking the voice assistant may be used to infer, for example, whether the user is in a serious work setting or in a leisure and entertainment context.
202. Selecting, according to the scene information and a decision relationship rule, a corresponding voice mode from a voice database, or generating a new voice mode, as the target voice mode; optionally, the decision relationship rule includes the correspondence between scene information and voice modes, a voice mode is a voice parameter group including at least one voice parameter, and the voice database includes at least one voice mode;
the speech parameters of the present application refer to characteristic parameters for distinguishing different sounds, and exemplarily include: basic speech parameters and speech behavior parameters of the basic speech parameters over time, the basic speech parameters including volume, softness, tone, timbre and pitch of sound.
The voice mode in the present application is a voice mode defined by a plurality of voice parameters. Illustratively, the voice mode may be: a standard mode, a silent mode, a business voice mode suitable for use in business activities, an accompanying voice mode suitable for conversation with the elderly, a child voice mode suitable for conversation with children, a super-loud voice mode suitable for use in noisy environments, a night voice mode suitable for night conversations, a humorous funny voice mode, or a user-defined speaker mode. Taking the child voice mode suitable for conversation with children as an example, each voice parameter of the child voice mode is biased toward values suitable for conversation with a child, for example a moderate volume, a slow speech rate, a soft voice, and a sweet or naturally mellow timbre.
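As a concrete illustration of a voice mode being a group of voice parameters, the child voice mode could be represented roughly as below. The dataclass fields and values are assumptions chosen for illustration, not the application's actual parameters.

```python
# Illustrative representation of a voice mode as a group of voice parameters.
# Field names and values are assumptions, not taken from the application.
from dataclasses import dataclass, field

@dataclass
class VoiceMode:
    name: str
    volume: float        # 0.0 (mute) .. 1.0 (max)
    softness: float      # higher = softer
    tone: float          # relative tone / intonation setting
    timbre: str          # e.g. "sweet", "mellow", "neutral"
    pitch: float         # relative pitch
    speech_rate: float   # 1.0 = normal speaking speed
    behavior: dict = field(default_factory=dict)  # how the basic parameters change over time

child_mode = VoiceMode(
    name="child_mode", volume=0.6, softness=0.9, tone=0.6,
    timbre="sweet", pitch=0.65, speech_rate=0.8,
    behavior={"volume_ramp": "gentle"},
)
```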
At least one voice mode is stored in the voice database. Preferably, the voice database stores a plurality of voice modes. Illustratively, the voice database may include at least one of: a standard mode, a silent mode, a business voice mode suitable for use in business activities, an accompanying voice mode suitable for conversation with the elderly, a child voice mode suitable for conversation with children, a super-loud voice mode suitable for use in noisy environments, a night voice mode suitable for night conversations, a humorous funny voice mode, and a user-defined speaker mode.
The decision relationship rule includes the correspondence between scene information and a voice mode (defined by a group of voice parameters). The decision relationship rule may be preset. An exemplary decision relationship rule may be: when a scene in which a child is using the mobile phone is identified, voice parameters such as the volume, softness, tone and timbre of the voice are reasonably constrained so that the voice suits the comprehension and psychological characteristics of a child, for example a moderate volume, a moderate speech rate, a softer voice, and the like.
Optionally, the decision relation rule may collect empirical data and user annotation feedback in subsequent voice mode switching, and evolve through autonomous learning. Autonomous learning herein refers to machine learning.
The target voice mode is determined according to the usage scene and the decision relationship rule. In this step, based on the received scene information, a corresponding voice mode is matched in the voice database according to the decision relationship rule and used as the target voice mode; when the matching is unsuccessful, the voice parameters are modified according to the decision relationship rule to generate a new voice mode, which is used as the target voice mode.
203. Switching the current voice mode to the target voice mode, and carrying out voice interaction with the user in the target voice mode. In this step, the current voice mode is switched to the target voice mode, and the target voice mode is used for the dialogue with the user.
In the present application, the usage scene at the time of the voice mode switch is obtained from the input scene information, and the terminal then intelligently switches to the corresponding voice mode according to the current usage scene, which solves the problem that existing voice mode switching is not intelligent and cannot meet user expectations.
In some embodiments, as shown in FIG. 5, the decision relationship rules are stored in a database, which may, for example, be named the experience database 20. The decision relationship rule comprises a first correspondence 21 and a second correspondence 22, where the first correspondence 21 is a preset correspondence between scene information and voice modes and the second correspondence 22 is a correspondence between scene information and voice modes newly generated by machine learning. Alternatively, the second correspondence 22 may be generated by learning from and evolving the first correspondence 21, or may be newly generated by learning from the collected scene information and user annotation feedback.
In some embodiments, if, according to the decision relationship rule, the matching degree between the scene information and every voice mode in the voice database is less than a preset score, the voice parameters of the best-matching voice mode (i.e., the voice mode with the highest score) may be modified appropriately, and the modified voice mode may be stored in the voice database as a new voice mode.
The exemplary voice database 30 shown in FIG. 6 includes: preset voice modes 31, custom voice modes 32 and a first voice mode 33.
The preset voice modes 31 illustratively include: a standard mode, an accompanying voice mode, a child voice mode, a super-loud voice mode, a funny voice mode, a night voice mode, a first-dialect voice mode, and the like. The first-dialect voice mode may be, for example, a Northeastern dialect voice mode, a Shanxi dialect voice mode or a Henan dialect voice mode. New preset voice modes 31 can also be added by downloading them from the internet.
The voice parameters in the custom voice mode 32 may be modified by the user autonomously. Alternatively, in some other embodiments, the custom voice mode may evolve autonomously through machine learning. The voice mode switching method of this embodiment then further includes: learning the voice input of a specific person through machine learning according to the user annotation feedback, and updating the learned voice parameters into the custom voice mode.
The first speech pattern 33 is a new speech pattern generated by machine learning based on the collected scene information and user annotation feedback. The first speech pattern 33 may be an evolved speech pattern, i.e., a modified preset speech pattern, obtained by learning the collected scene information and user annotation feedback based on the preset speech pattern 31 and the custom speech pattern 32; or a new voice mode directly generated by learning the collected scene information and the user labeling feedback.
The annotations and user annotation feedback described herein refer to operations, voice parameter data, or inferences about voice mode switching that are input or confirmed by the user, and may include, for example, user modification of the voice mode, user feedback on whether the voice mode was switched correctly, and modification of or feedback on the timing of the voice mode switch. During execution of a voice mode switch, if the user finds that an inference about the switch is wrong, the user generally needs to correct the annotation; if the inference is correct, no correction is needed.
In some embodiments, the context information includes environmental data, which may include ambient noise, location information, current time, etc., and user input speech. The voice database may include: an ultra-high pitch voice mode, a work voice mode, a night voice mode, an accompanying voice mode, a child voice mode, and a first party voice mode.
When the environment data includes environmental noise, an exemplary first correspondence includes: when the environmental noise is greater than a first preset value, the target voice mode is a super-loud voice mode with a volume higher than a first volume, or the current voice mode with its volume increased; when the environment data includes location information, an exemplary first correspondence includes: when the location information is identified as a workplace, the target voice mode is a work voice mode suitable for conversation in a working state; when the environment data includes the current time, an exemplary first correspondence includes: when the current time is night, the target voice mode is a night voice mode suitable for night conversations.
When the scene information includes user input voice, exemplary first correspondences include: when the user input voice is recognized as an elderly person's voice, the target voice mode is an accompanying voice mode suitable for conversation with the elderly; when the user input voice is recognized as a child's voice, the target voice mode is a child voice mode suitable for conversation with children; when the user input voice is recognized as a first dialect, the target voice mode is a first-dialect voice mode that conducts the conversation in that dialect.
In this embodiment, a plurality of decision relationship rules and a plurality of voice modes are preset, so that, based on the environment data and the user input voice, a corresponding voice mode can be matched in the voice database according to the decision relationship rules and output as the target voice mode, or the voice parameters of a matched voice mode can be modified according to the decision relationship rules and the modified mode output as the target voice mode. For example, when the environmental noise is loud and a child's voice is input, according to the first correspondence the child voice mode is selected and its volume is appropriately increased, for example to above the first volume.
As shown in fig. 7, in some embodiments, the scene information includes environment data and user input voice; step 202 comprises:
2021. Extracting, from the environment data and the user input voice, a noise level parameter, the basic voice parameters of the user's voice and the voice behavior parameters, where the noise level parameter characterizes how high or low the environmental noise is;
In this step, a noise level parameter characterizing the level of the environmental noise, basic voice parameters such as the volume, softness, tone, timbre and pitch of the sound, and voice behavior parameters describing how the basic voice parameters change over time are extracted from the input scene information through voice recognition technology. The scene information can then be represented by a seven-dimensional vector (noise level parameter, volume, softness, tone, timbre, pitch, voice behavior parameter).
2022. And matching the extracted noise level parameters, basic voice parameters of the user voice and voice behavior parameters with the voice modes in the voice database according to the decision relation rule to determine a target voice mode.
According to the decision relationship rule, the seven-dimensional vector representing the scene information is mapped into a group of voice parameters (including at least one voice parameter); the mapped parameter group is matched one by one against each voice mode in the voice database, and the target voice mode is determined according to the matching degree, the priority of each voice mode, restrictions on selecting each voice mode, and the like.
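A minimal sketch of representing the scene as a seven-dimensional vector and mapping it to a voice-parameter group might look as follows. The mapping rule itself is an assumed stand-in, since the application does not fix a concrete formula; the field names and weights are hypothetical.

```python
# Hypothetical sketch: scene information as a 7-dimensional vector and a simple mapping
# to a voice-parameter group. The mapping rule is an assumed stand-in.
scene_vector = {
    "noise_level": 0.8,   # level of the environmental noise
    "volume": 0.4,        # basic voice parameters extracted from the user's voice
    "softness": 0.7,
    "tone": 0.5,
    "timbre": 0.5,
    "pitch": 0.6,
    "behavior": 0.3,      # how the user's basic parameters change over time
}

def map_to_voice_parameters(vec):
    """Map the scene vector to a desired voice-parameter group (illustrative rule only)."""
    return {
        "volume": min(1.0, vec["volume"] + vec["noise_level"] * 0.5),  # speak louder in noise
        "softness": vec["softness"],
        "pitch": vec["pitch"],
        "rate": 1.0 - 0.2 * vec["behavior"],
    }

print(map_to_voice_parameters(scene_vector))
```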
Optionally, as shown in fig. 8, the matching process of this step includes:
221. determining at least one initial voice mode according to the noise level parameter, the voice basic parameter and the voice behavior parameter of the voice of the user and the decision relation rule;
matching is carried out according to the decision relation rule, and at least one initial voice mode is determined. Due to the existence of a plurality of decision relation rules, one or more voice modes can be matched according to different decision relation rules, and the matched voice mode or modes are called initial voice modes.
222. Mapping the noise level parameter, the basic voice parameters of the user's voice and the voice behavior parameters into a group of voice parameters according to the decision relationship rule, and scoring the credibility of each initial voice mode. Optionally, the scoring items of the credibility score include: the degree to which the mapped group of voice parameters matches the voice parameters of the initial voice mode, the priority of the initial voice mode, and restrictions on selecting the initial voice mode;
In this step, the credibility of the at least one initial voice mode obtained in step 221 is scored. The scoring items include: the degree to which the voice parameter group mapped from the seven-dimensional vector representing the scene information matches the voice parameters of each initial voice mode; the priority of each initial voice mode; the restriction factors of each initial voice mode; and the like. The priority and selection restrictions of each initial voice mode are built into the device system at the factory and can be corrected through user settings or annotations in subsequent use.
223. Screening out the first initial voice mode with the highest credibility score; when the score of the first initial voice mode is greater than or equal to a preset value, selecting the first initial voice mode as the target voice mode; when the score of the first initial voice mode is smaller than the preset value, correcting the voice parameters of the first initial voice mode according to the decision relation rule, and outputting the corrected voice mode as the target voice mode.
This step makes the final decision according to the credibility scores. When the highest credibility score is high enough, the initial voice mode with the highest score is selected as the target voice mode. When the credibility score is low, the voice parameters of the highest-scoring initial voice mode are corrected according to the decision relation rule, and the newly formed, corrected voice mode is output as the target voice mode.
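A minimal sketch of steps 221-223 (initial-mode screening, credibility scoring and the final decision), reusing the hypothetical VoiceMode structure from the earlier sketch; the threshold value and the parameter-correction rule are assumptions:

```python
# Credibility threshold (the "preset value" in step 223); the value is illustrative.
CREDIBILITY_THRESHOLD = 0.7

def decide_target_mode(initial_modes, mapped_params, credibility_score):
    """initial_modes: voice modes matched by the decision relation rules (step 221).
    credibility_score(mode, params) -> float combines matching degree, priority
    and restriction factors (step 222)."""
    scored = [(credibility_score(m, mapped_params), m) for m in initial_modes]
    best_score, best_mode = max(scored, key=lambda item: item[0])

    if best_score >= CREDIBILITY_THRESHOLD:
        return best_mode                 # confident enough: use it directly
    # Otherwise correct the best mode's parameters toward the mapped set and
    # output the corrected mode as the target voice mode (step 223).
    corrected = dict(best_mode.params)
    for key, value in mapped_params.items():
        corrected[key] = 0.5 * corrected.get(key, value) + 0.5 * value
    return VoiceMode(name=best_mode.name + "_corrected",
                     params=corrected, priority=best_mode.priority)
```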
As shown in fig. 9, in some embodiments, after switching the current voice mode to the target voice mode, the voice mode switching method further includes:
204. receiving annotation feedback input by the user indicating whether the switching performed in response is correct;
205. performing machine learning on the annotation feedback to correct the first correspondence or generate a second correspondence, and/or to correct a voice mode or generate a new voice mode.
In this embodiment, after the switching is completed, a prompt may be output asking the user to annotate whether the switching was correct. The annotation feedback is used for machine learning; the learning result may be used to correct the decision relation rules or generate new decision relation rules, and may also be used to correct voice modes or generate new voice modes, so that subsequent voice mode matching is more reasonable and switching is more accurate.
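As one possible (assumed) realization of this feedback loop, each decision relation rule could carry a weight that is reinforced or weakened by the user's annotation:

```python
# A minimal sketch, assuming rules are identified by an id and weighted;
# the update scheme is an assumption, not the patent's learning method.
def learn_from_feedback(rule_weights: dict, rule_id: str,
                        switch_was_correct: bool, learning_rate: float = 0.1) -> None:
    """Reinforce rules that produced correct switches, weaken the others."""
    delta = learning_rate if switch_was_correct else -learning_rate
    rule_weights[rule_id] = max(0.0, rule_weights.get(rule_id, 1.0) + delta)
```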
As shown in fig. 10, in some embodiments, after a corresponding voice mode in the voice database is selected as the target voice mode according to the scene information and the decision relation rule, and before the current voice mode is switched to the target voice mode, the voice mode switching method further includes: 206. acquiring the switching timing and the transition parameters of the voice mode. In step 203, when the current voice mode is switched to the target voice mode, the switching is then performed according to the switching timing and the transition parameters.
The switching timing refers to when the switching is initiated, e.g., within a few seconds after the target voice mode is determined. The transition parameters are the set values of the voice parameters during the process of switching from the current voice mode to the target voice mode. For example, the gap between the current and target values in each dimension of the voice mode may be divided into equal segments over time. For example, when the voice mode is switched, the volume can be increased or decreased from the current volume to the volume of the target voice mode, changing gradually along the path current volume - transition volume parameters - target voice mode volume.
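A minimal sketch of such a gradual transition, assuming a fixed number of equal steps and a hypothetical apply_params callback that pushes each intermediate parameter set to the speech engine:

```python
import time

def transition(current: dict, target: dict, steps: int = 5,
               interval_s: float = 0.2, apply_params=print) -> None:
    """Split the gap in every voice-parameter dimension into equal segments
    and apply the intermediate values one by one until the target is reached."""
    for i in range(1, steps + 1):
        frame = {k: current.get(k, v) + (v - current.get(k, v)) * i / steps
                 for k, v in target.items()}
        apply_params(frame)          # e.g. hand the values to the TTS engine
        time.sleep(interval_s)

# Usage: transition({"volume": 0.3}, {"volume": 0.8}) ramps the volume
# gradually from 0.3 to 0.8 instead of jumping in one step.
```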
In some other embodiments, the switching timing and transition parameters are preset, and during execution of the voice mode switching method they are learned and corrected according to the corresponding user annotation feedback.
According to the voice mode switching method, a corresponding voice mode is selected from the voice database as the target voice mode according to the scene information and the decision relation rule; user annotation feedback can be learned during later execution to correct and expand the decision relation rules and the voice modes in the voice database, so that subsequent voice mode switching is more intelligent and accurate.
On the other hand, as shown in fig. 11, the present application also provides a voice mode switching apparatus 30, including: an acquiring unit 31 configured to acquire scene information, the scene information including at least one of environment data and user input speech; a decision unit 32, configured to select a corresponding voice mode as a target voice mode in a voice database according to the scene information and a decision relationship rule, where optionally, the decision relationship rule includes a correspondence between the scene information and the voice mode, the voice mode is a group of voice parameters including at least one voice parameter, and the voice database includes at least one voice mode; and a switching unit 33, configured to switch the current voice mode to a target voice mode, and perform voice interaction with the user according to the target voice mode.
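A minimal structural sketch of how the three units could be wired together; the unit interfaces (acquire, decide, switch_to) are assumptions for illustration:

```python
class VoiceModeSwitchingApparatus:
    """Sketch of apparatus 30: acquiring unit 31, decision unit 32, switching unit 33."""

    def __init__(self, acquiring_unit, decision_unit, switching_unit):
        self.acquiring_unit = acquiring_unit    # unit 31: collects scene information
        self.decision_unit = decision_unit      # unit 32: selects the target voice mode
        self.switching_unit = switching_unit    # unit 33: performs the switch

    def run_once(self):
        scene_info = self.acquiring_unit.acquire()            # env data / user speech
        target_mode = self.decision_unit.decide(scene_info)   # database + decision rules
        self.switching_unit.switch_to(target_mode)            # interact in the new mode
```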
The voice mode switching apparatus 30 provided in this embodiment is further described below, using a mobile phone as an example.
The voice mode may be set to a manual mode or an automatic mode. In the manual mode, the voice mode is fixed by setting: for example, the user may select a built-in voice mode of the system, or a user-defined voice mode. In the automatic mode, matching is performed automatically by judging the scene: a new parameter set matching the scene is automatically assembled from the parameters collected on site, and this new parameter set is used as the output parameters.
A voice mode comprises parameters such as the pitch, softness and timbre of the voice. From these parameters, the system can synthesize several common, clearly distinguishable modes and build them into the system; these are called preset voice modes. The user may also customize voice modes. A user-defined voice mode is created by setting a group of voice parameters which together form a voice mode; by collecting user behavior parameters or annotation data over a long period of time, the system can also automatically learn to generate several new, distinguishable voice modes and allow the user to select among them.
Illustratively, the voice assistant has a plurality of built-in voice modes and also includes different user-defined voice modes; all of these voice modes are stored in the voice database and can include an accompany mode, a child mode, an ultra-loud mode, a standard mode, a funny mode, a user-defined speaker mode, and the like. A custom voice mode may learn the voice input of a particular person based on annotations, and update the learned voice parameters or behavior parameters into the custom voice mode, such as a higher volume, a different speaker, or a different tone.
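A minimal sketch of how such a voice database of preset and user-defined modes might be stored; the mode names and parameter values are illustrative assumptions:

```python
VOICE_DATABASE = {
    "accompany":  {"volume": 0.5, "softness": 0.9, "pitch": 0.4, "speed": 0.7},
    "child":      {"volume": 0.6, "softness": 0.8, "pitch": 0.8, "speed": 0.8},
    "ultra_loud": {"volume": 1.0, "softness": 0.3, "pitch": 0.5, "speed": 1.0},
    "standard":   {"volume": 0.6, "softness": 0.5, "pitch": 0.5, "speed": 1.0},
}

def add_custom_mode(name: str, learned_params: dict) -> None:
    """Store a user-defined or automatically learned voice mode."""
    VOICE_DATABASE[name] = dict(learned_params)
```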
When the user carries out voice interaction, the corresponding voice parameters or behavior parameters are extracted from the user input and matched against the built-in and user-defined voice modes for inference. If the parameters are found to belong to one voice mode, switching is carried out; switching can also be performed dynamically as the input changes, and the optimal switching timing and transition parameters can be selected according to empirical data when switching.
The voice assistant needs to switch voice modes according to the collected scene data and the decision relation rules, so that the voice mode matches the scene. If a child is currently in conversation with the voice assistant, the voice assistant can automatically switch to a voice mode whose voice, tone, speed and the like are comparable to the child's. If the current user carries out the conversation in a very formal, standard manner, the voice assistant can switch to a business mode or a standard mode. If a conversation with an elderly person is currently detected, the assistant switches to an accompany voice mode, and so on. The contextual mode further includes the environmental information of the current user, such as whether the scene is noisy or particularly quiet, or whether it is night as judged from the time, and different volumes are automatically set for different environments. The real-time scenario is also a factor in deciding mode switching intelligently: when the environment is noisy, the volume needs to be turned up automatically, and when a quiet area is identified, the volume can be turned down.
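A minimal sketch of first-correspondence rules of the kind listed above (child voice, elderly voice, noisy scene, night time); the classifier outputs and thresholds are assumptions:

```python
def pick_mode(speaker_type: str, noise_level: float, hour_of_day: int) -> str:
    """Map recognized scene information to the name of a preset voice mode."""
    if speaker_type == "child":
        return "child"                 # child-like voice, matching tone and speed
    if speaker_type == "elderly":
        return "accompany"             # companionship mode for elderly users
    if noise_level > 0.8:
        return "ultra_loud"            # very noisy environment: raise the volume
    if hour_of_day >= 22 or hour_of_day < 7:
        return "night"                 # quiet, low-volume mode at night
    return "standard"
```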
As shown in fig. 12, the decision unit 32 includes a voice mode determination module and a decision module that work together. The voice mode determination module first determines a plurality of possible initial voice modes according to the user input and the environment data, and then inputs the current parameters and the selected initial voice modes into the decision module. The decision module gives each mode a credibility score according to parameter matching, priority, other restriction factors and the like; the results are then fed back to the voice mode determination module, which makes the final decision according to the credibility scores and other information, selecting the optimal voice mode as the target voice mode.
The voice assistant can switch voice modes appropriately, and such switching needs to be supported by empirical data, which is obtained by continuously learning from the parameters of user input collected over a long period and from user feedback. "Appropriate switching" refers to whether the switching timing and the transition parameters are suitable. Whether to switch, with which parameters and at what timing should be judged promptly and accurately according to the scene input and the empirical data, and the user feedback and user behavior after switching should be stored promptly as empirical data.
After the voice mode is switched, the user can give corresponding annotation feedback on whether the switching was correct. This feedback is used by the voice assistant for decision learning and for learning of the voice modes; various annotation parameters can be provided, and the voice parameters are optimized according to the different annotation parameters. For example, in decision learning, the decision relation rules are corrected and stored in the experience database; according to the voice mode parameters fed back by the user, the voice modes are learned, corrected or newly generated, and stored in the voice database.
The voice mode switching apparatus of this embodiment automatically switches to the voice mode matching the user's scene or behavior according to intelligent decisions and empirical data, improving the user experience. The voice assistant can also learn automatically, using the collected parameters or feedback as empirical knowledge, so that it can truly anticipate the user's needs at all times. The user annotation feedback reduces the learning cost and is highly practical to implement.
On the other hand, as shown in fig. 13, the present application further provides a terminal 40, which includes a memory 401 and a processor 402, where the memory 401 stores a program to be executed by the processor 402 to perform any one of the above voice mode switching methods.
In another aspect, the present application further provides a readable storage medium, in which a program is stored, the program being used for being executed by a processor to execute the voice mode switching method according to any one of the above-mentioned methods.
By adopting the above voice mode switching apparatus and terminal, the corresponding voice mode can be switched to intelligently according to the current usage scene, solving the problem that existing voice mode switching is not intelligent and cannot meet user expectations. The apparatus and terminal can also evolve through machine learning, continuously optimizing the decision functions, enriching the voice modes in the voice database, and, by learning the user's habits, gradually switching to a parameter combination or voice mode that meets the user's expectations.
Embodiments of the present application also provide a computer program product, which includes computer program code, when the computer program code runs on a computer, the computer is caused to execute the method in the above various possible embodiments.
Embodiments of the present application further provide a chip, which includes a memory and a processor, where the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory, so that a device in which the chip is installed executes the method in the above various possible embodiments.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the present application, the same or similar term concepts, technical solutions and/or application scenario descriptions will be generally described only in detail at the first occurrence, and when the description is repeated later, the detailed description will not be repeated in general for brevity, and when understanding the technical solutions and the like of the present application, reference may be made to the related detailed description before the description for the same or similar term concepts, technical solutions and/or application scenario descriptions and the like which are not described in detail later.
In the present application, each embodiment is described with emphasis, and reference may be made to the description of other embodiments for parts that are not described or illustrated in any embodiment.
The technical features of the technical solution of the present application may be arbitrarily combined, and for brevity of description, all possible combinations of the technical features in the embodiments are not described, however, as long as there is no contradiction between the combinations of the technical features, the scope of the present application should be considered as being described in the present application.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a controlled terminal, or a network device) to execute the method of each embodiment of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A method for switching speech modes, the method comprising:
acquiring scene information;
determining a second voice mode as a target voice mode according to the scene information and the decision relation rule;
and switching to the target voice mode.
2. The method of claim 1, further comprising at least one of:
the voice mode includes voice parameters including: basic voice parameters and/or voice behavior parameters of the basic voice parameters changing over time;
the decision relationship rule includes a first correspondence and a second correspondence.
3. The method of claim 2,
the first corresponding relation is a preset corresponding relation between the scene information and the voice mode;
the second correspondence is a correspondence between the scene information and the voice mode that is newly generated through machine learning.
4. The method of claim 2, wherein the scene information comprises environment data; the method further comprises at least one of:
when the environmental data includes environmental noise, the first correspondence includes: when the environmental noise is larger than a first preset value, the target voice mode selects an ultra-high voice mode with the volume higher than the first volume or increases the volume in the current voice mode;
when the environment data includes location information, the first correspondence includes: when the position information is identified as a workplace, the target voice mode selects a working voice mode suitable for conversation in a working state;
when the environment data includes a current time, the first correspondence includes: and if the current time is night, selecting a night voice mode suitable for night conversation from the target voice mode.
5. The method of claim 2, wherein the scene information comprises user input speech, and wherein the first correspondence comprises at least one of:
when the voice input by the user is recognized as the voice of an elderly person, the target voice mode selects an accompanying voice mode suitable for conversation with the elderly;
when the voice input by the user is recognized as the voice of the child, selecting a child voice mode suitable for conversation with the child by the target voice mode;
when the voice input by the user is recognized as a first dialect, the target voice mode selects a first-dialect voice mode for carrying out the conversation using the first dialect.
6. The method according to any one of claims 1 to 5, wherein the scene information comprises environment data and user input speech, and the determining a second voice mode as a target voice mode according to the scene information and a decision relation rule comprises:
extracting a noise level parameter from the environment data, extracting the basic voice parameter and/or the voice behavior parameter of the voice of the user from the voice input by the user, and matching the extracted noise level parameter, the basic voice parameter and/or the voice behavior parameter with the voice mode in the voice database according to the decision relation rule to determine a target voice mode.
7. The method according to any of claims 1 to 5, further comprising, after said switching to said target speech mode:
receiving annotation feedback;
and performing machine learning on the annotation feedback to correct the first corresponding relation or generate a second corresponding relation, and/or correct the voice mode or generate a new voice mode.
8. The method according to any of claims 1 to 5, further comprising, prior to said switching to said target speech mode: and acquiring transition parameters for switching the voice modes, and switching to the target voice mode according to the transition parameters.
9. A terminal, characterized by comprising a memory and a processor, the memory storing a program for execution by the processor to perform the voice mode switching method of any one of claims 1 to 8.
10. A readable storage medium, in which a program is stored, the program being adapted to be executed by a processor to perform the speech mode switching method according to any one of claims 1 to 8.
CN202011544069.2A 2020-12-23 2020-12-23 Voice mode switching method, terminal and storage medium Pending CN112653789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011544069.2A CN112653789A (en) 2020-12-23 2020-12-23 Voice mode switching method, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011544069.2A CN112653789A (en) 2020-12-23 2020-12-23 Voice mode switching method, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN112653789A true CN112653789A (en) 2021-04-13

Family

ID=75360118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011544069.2A Pending CN112653789A (en) 2020-12-23 2020-12-23 Voice mode switching method, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112653789A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113531809A (en) * 2021-07-05 2021-10-22 青岛海尔空调器有限总公司 Method, system, device, electronic equipment and storage medium for switching tone colors of air conditioner
CN114327352A (en) * 2021-12-24 2022-04-12 芜湖雄狮汽车科技有限公司 Method and device for replacing voice image of vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination