CN114203148A - Analog voice playing method and device, electronic equipment and storage medium - Google Patents

Analog voice playing method and device, electronic equipment and storage medium

Info

Publication number
CN114203148A
CN114203148A (application CN202010899170.3A)
Authority
CN
China
Prior art keywords
voice
target
user
target user
simulated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010899170.3A
Other languages
Chinese (zh)
Inventor
李芳
吴玲
田书君
许升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Washing Machine Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Washing Machine Co Ltd
Haier Smart Home Co Ltd
Priority date: 2020-08-31 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-08-31
Publication date: 2022-03-18
Application filed by Qingdao Haier Washing Machine Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Washing Machine Co Ltd
Priority to CN202010899170.3A
Publication of CN114203148A
Legal status (current): Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention belongs to the technical field of intelligent household appliances, and in particular relates to a simulated voice playing method and apparatus, an electronic device, and a storage medium. The invention aims to solve the problem that existing simulated voice playing methods reduce the efficiency of human-computer interaction between the user and the intelligent household appliance and degrade the user experience. In the method, environmental sound information is acquired and a target user is determined from it; a target simulated voice is determined according to the voice content uttered by the target user, the voice characteristics of the target simulated voice being different from those of the target user; and voice is played using the target simulated voice. Because the target simulated voice is determined from the voice content uttered by the target user and is chosen to have voice characteristics different from the target user's before playback, the played voice cannot be confused with the target user's own voice and therefore does not disrupt communication between users, which improves human-computer interaction efficiency and user experience.

Description

Analog voice playing method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of intelligent household appliances, and in particular relates to a simulated voice playing method and apparatus, an electronic device, and a storage medium.
Background
At present, as voice recognition technology matures, voice control and voice interaction have become common functions of intelligent household appliances. To further improve the user experience and meet users' individual needs, manufacturers equip intelligent household appliances with a voice simulation function: the appliance imitates the user's pronunciation when playing voice, which makes voice interaction between the user and the appliance more engaging and satisfies diverse user demands.
However, in actual use of such intelligent household appliances, playing voice with a simulated pronunciation often interferes with the user's auditory judgment, which reduces the efficiency of human-computer interaction between the user and the appliance and degrades the user experience.
Accordingly, there is a need in the art for a new simulated voice playing method, apparatus, electronic device and storage medium to solve the above problems.
Disclosure of Invention
In order to solve the above problem in the prior art, namely that existing methods reduce the efficiency of human-computer interaction between the user and the intelligent household appliance and degrade the user experience, the invention provides a simulated voice playing method and apparatus, an electronic device, and a storage medium.
According to a first aspect of the embodiments of the present invention, a simulated voice playing method applied to an electronic device is provided, including:
acquiring environmental sound information, and determining a target user according to the environmental sound information; determining a target simulated voice according to the voice content uttered by the target user; and playing voice using the target simulated voice.
In a preferred technical solution of the above simulated voice playing method, determining a target user according to the environmental sound information includes: acquiring user voiceprint characteristics according to the environmental sound information; and determining the target user according to the user voiceprint characteristics.
In a preferred technical solution of the above simulated voice playing method, the method further includes: acquiring environment image information, and obtaining user position information according to the environment image information; and extracting the voice content uttered by the target user from the environmental sound information according to the user position information.
In a preferred technical solution of the above simulated voice playing method, determining a target simulated voice according to the voice content uttered by the target user includes: performing semantic analysis on the voice content uttered by the target user and determining a semantic type corresponding to the voice content; and determining the target simulated voice according to the semantic type.
In a preferred technical solution of the above simulated voice playing method, determining the target simulated voice according to the semantic type includes: if the semantic type is a first type, determining an alternative voice as the target simulated voice; otherwise, determining the currently used simulated voice as the target simulated voice; the voice content corresponding to the first type is used to control the electronic device to execute an instruction.
In a preferred technical solution of the above simulated voice playing method, the alternative voice is a preset robot voice or a preset alternative simulated voice, where the voice characteristics of the alternative simulated voice are different from those of the target user.
In a preferred technical solution of the above simulated voice playing method, after the environment information is acquired, the method further includes: determining a specific target user according to the environment information; and setting a preset specific voice as the currently used simulated voice according to the specific target user.
According to a second aspect of the embodiments of the present invention, a simulated voice playing apparatus applied to an electronic device is provided, the apparatus including:
an acquisition module, configured to acquire environmental sound information and determine a target user according to the environmental sound information;
a determining module, configured to determine a target simulated voice according to the voice content uttered by the target user;
and a playing module, configured to play voice using the target simulated voice.
In a preferred technical solution of the above simulated voice playing apparatus, when determining the target user according to the environmental sound information, the acquisition module is specifically configured to: acquire user voiceprint characteristics according to the environmental sound information; and determine the target user according to the user voiceprint characteristics.
In a preferred technical solution of the above simulated voice playing apparatus, the apparatus further includes:
a positioning module, configured to acquire environment image information and obtain user position information according to the environment image information;
and an extraction module, configured to extract the voice content uttered by the target user from the environmental sound information according to the user position information.
In a preferred technical solution of the above simulated voice playing apparatus, the determining module is specifically configured to: perform semantic analysis on the voice content uttered by the target user and determine a semantic type corresponding to the voice content; and determine the target simulated voice according to the semantic type.
In a preferred technical solution of the above simulated voice playing apparatus, when determining the target simulated voice according to the semantic type, the determining module is specifically configured to: if the semantic type is a first type, determine an alternative voice as the target simulated voice; otherwise, determine the currently used simulated voice as the target simulated voice; the voice content corresponding to the first type is used to control the electronic device to execute an instruction.
In a preferred technical solution of the above simulated voice playing apparatus, the alternative voice is a preset robot voice or a preset alternative simulated voice, where the voice characteristics of the alternative simulated voice are different from those of the target user.
In a preferred technical solution of the above simulated voice playing apparatus, the apparatus further includes:
a setting module, configured to determine a specific target user according to the environment information, and to set a preset specific voice as the currently used simulated voice according to the specific target user.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device, including: a memory, a processor, and a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to perform the simulated voice playing method according to any one of the first aspect of the embodiments of the present invention.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the simulated voice playing method according to any one of the first aspect of the embodiments of the present invention.
A person skilled in the art can understand that the simulated voice playing method of the present invention acquires environmental sound information and determines a target user from it; determines a target simulated voice according to the voice content uttered by the target user, the voice characteristics of the target simulated voice being different from those of the target user; and plays voice using the target simulated voice. Because the target simulated voice is determined from the voice content uttered by the target user, when that content relates to the simulated voice the electronic device is about to play, a target simulated voice whose characteristics differ from the target user's is selected and played, which prevents the played voice from conflicting with the target user's own voice and disrupting communication between users, thereby improving human-computer interaction efficiency and user experience.
Drawings
The following describes preferred embodiments of the simulated voice playing method, apparatus, and electronic device according to the present invention with reference to the accompanying drawings. The attached drawings are as follows:
fig. 1 is an application scene diagram of a simulated voice playing method according to an embodiment of the present application;
fig. 2 is a flowchart of a method for playing simulated voice according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a user's voice interaction with an electronic device according to an embodiment of the present application;
fig. 4 is a flowchart of a method for playing simulated voice according to another embodiment of the present application;
FIG. 5 is a schematic diagram of the classification of speech content according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a simulated voice playing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a simulated voice playing apparatus according to another embodiment of the present application;
fig. 8 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
First, it should be understood by those skilled in the art that these embodiments are merely intended to explain the technical principles of the present invention and are not intended to limit its scope; they can be adjusted as needed by those skilled in the art to suit particular applications. For example, although the simulated voice playing method of the present invention is described in conjunction with an intelligent washing machine, this is not limiting, and other devices with voice interaction requirements, such as intelligent refrigerators and intelligent televisions, may also be configured with the simulated voice playing method of the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" should be interpreted broadly: it may denote a fixed connection, a detachable connection, or an integral connection; a mechanical or an electrical connection; a direct connection, an indirect connection through an intermediate medium, or internal communication between two components. The specific meaning of the above term in the present invention can be understood by those skilled in the art according to the specific situation.
The terms referred to in this application are explained first:
1) Intelligent household appliance equipment: a household appliance product formed by introducing a microprocessor, sensor technology and network communication technology into conventional household appliances. It has the characteristics of intelligent control, intelligent perception and intelligent application, and its operation usually depends on modern technologies such as the Internet of Things, the Internet and electronic chips; for example, by connecting to a cloud server, an intelligent household appliance allows the user to control and manage it remotely.
2) "Plurality" means two or more, and other quantifiers are analogous. "And/or" describes an association between objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
3) "Correspond" may refer to an association or binding relationship; "A corresponds to B" means that there is an association or binding relationship between A and B.
The following explains an application scenario of the embodiment of the present application:
fig. 1 is an application scenario diagram of the simulated voice playing method according to an embodiment of the present application. As shown in fig. 1, the simulated voice playing method can be applied to an electronic device such as an intelligent washing machine. In the scenario provided by this embodiment, the intelligent washing machine is placed in a multi-user environment that includes a target user and other users; after being preset accordingly, the intelligent washing machine can play voice that imitates the target user's voice.
In this application scenario, the intelligent washing machine can play control instructions or prompt messages by imitating the target user's pronunciation, which gives the user a sense of familiarity and interest and improves the user experience. However, because the simulated pronunciation is similar to the pronunciation of the imitated target user, when the target user is present and the appliance plays voice in the target user's simulated voice, other users cannot tell whether the voice they hear was produced by the appliance or by the target user. This reduces the interaction efficiency between the other users and the appliance and degrades the user experience.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a simulated voice playing method according to an embodiment of the present application. The method is applied to an electronic device and, as shown in fig. 2, includes the following steps:
Step S101, environmental sound information is acquired.
The environmental sound information refers to sound information obtained from the environment in which the electronic device applying the simulated voice playing method of this embodiment is located. The electronic device may collect this information from the environment through its own sensors; for example, it may obtain sound signals in the environment through a sound sensor and a vibration sensor. Of course, in another possible implementation, the electronic device may obtain the collected environmental sound information from other devices, such as other terminal devices, server devices, or network devices. The specific way of obtaining the environmental sound information is not limited here.
Further, the environmental sound information may be obtained by one or more directional sound sensors arranged on, or in communication with, the electronic device. In one possible implementation, a plurality of sound sensors point in different directions and/or are arranged at different positions, and collecting environmental sound information through them makes it possible to locate and identify sound sources.
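As a purely illustrative sketch (not part of the patent disclosure), the snippet below shows one way such a multi-channel capture step might look in Python; the sounddevice package, the sample rate, and the number of directional microphones are all assumptions made for illustration.

```python
# Illustrative sketch only: the patent names no audio API; the
# `sounddevice` package and the channel layout below are assumptions.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000          # Hz, a common rate for speech processing
CLIP_SECONDS = 2.0
NUM_MICS = 4                  # hypothetical directional microphones

def capture_ambient_sound() -> np.ndarray:
    """Record a short multi-channel clip of ambient sound.

    Returns an array of shape (samples, NUM_MICS), one column per
    directional microphone, which later steps can use both for
    voiceprint analysis and for locating sound sources.
    """
    frames = int(SAMPLE_RATE * CLIP_SECONDS)
    clip = sd.rec(frames, samplerate=SAMPLE_RATE,
                  channels=NUM_MICS, dtype="float32")
    sd.wait()                 # block until the recording is finished
    return clip
```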
Step S102, a target user is determined according to the environmental sound information.
Since the user and the electronic device are in the same environment, for example in the same room or suite of rooms, voiceprint recognition performed on the environmental sound information can determine which specific user in that environment is speaking, i.e. the target user. The target user may be one user or multiple users, which is not limited here. Specifically, by analyzing and classifying the collected environmental sound information, one or more corresponding users can be determined, and one or more of them can be taken as target users; details are not repeated here.
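A minimal illustration of the matching part of this step is given below: an extracted voiceprint vector is compared against enrolled user profiles by cosine similarity. This is not the patent's actual recognition model; the enrolled profiles, the feature vector (a sketch of its extraction follows step S201 below), and the threshold are assumptions.

```python
# Minimal matching sketch, not the patent's actual voiceprint model:
# enrolled voiceprints and the similarity threshold are assumptions.
from typing import Dict, Optional
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def identify_target_user(voiceprint: np.ndarray,
                         enrolled: Dict[str, np.ndarray],
                         threshold: float = 0.85) -> Optional[str]:
    """Return the enrolled user whose stored voiceprint best matches the
    voiceprint extracted from the ambient sound, or None if no match."""
    best_user, best_score = None, threshold
    for user_id, reference in enrolled.items():
        score = cosine_similarity(voiceprint, reference)
        if score > best_score:
            best_user, best_score = user_id, score
    return best_user
```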
Step S103, a target simulated voice is determined according to the voice content uttered by the target user, the voice characteristics of the target simulated voice being different from those of the target user.
Illustratively, by analyzing and recognizing the collected voice of the target user, the specific voice content of the target user can be obtained. The voice content is the content of a spoken exchange between the target user and the electronic device or between target users; for example, a dialogue between target users might be "hi, how come you are back so early today", while a dialogue between the target user and the electronic device might be "adjust the temperature to 26 degrees Celsius". By performing semantic analysis and classification on this content, the electronic device can distinguish a dialogue between target users from a dialogue with the electronic device.
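The patent does not name a speech-recognition engine for turning the extracted clip into text. As one possible illustration, the open-source SpeechRecognition package could be used; the file path, language code, and choice of online recognizer below are assumptions.

```python
# Hedged ASR sketch: any on-device or cloud recognizer could be
# substituted; the SpeechRecognition package is used only as an example.
import speech_recognition as sr

def transcribe_clip(wav_path: str) -> str:
    """Transcribe a mono WAV clip of the target user's speech to text."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    # Online recognizer used purely for illustration.
    return recognizer.recognize_google(audio, language="zh-CN")
```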
Furthermore, the electronic device may determine the corresponding target simulated voice according to this distinction. Fig. 3 is a schematic diagram of voice interaction between the user and the electronic device according to an embodiment of the present application. As shown in fig. 3, when the electronic device determines that the target user is nearby and is conversing with another user, no voice interaction with the electronic device is required at that moment, and the target user and the other user can be regarded as being in a communicating state. Users in a communicating state are not easily disturbed by information from an external electronic device and can readily tell whether a voice they hear was produced by the target user or by the electronic device, so the currently used simulated voice can remain set to the target user's voice. When, instead, the electronic device determines that the target user is present and the target user's voice content is an instruction directing the electronic device to work, the target user and the other users can be regarded as being in a non-communicating state, for example each doing their own things. Users in a non-communicating state do not pay attention to one another as they would in a communicating state, so when they suddenly hear voice from an external electronic device they cannot tell whether the response they hear was produced by the target user or by the electronic device. To prevent a response whose sound characteristics match the target user's from confusing the other users' hearing, the target simulated voice is set to a simulated voice different from the target user's, such as a robot voice, which makes the voice easier to recognize and improves the interaction efficiency between the user and the electronic device.
Specifically, there are various ways of determining the target simulated voice: for example, according to a preset user-identifier mapping table and the identifier of the target user, a simulated voice whose voice characteristics differ from the target user's may be selected from an alternative simulated voice library as the target simulated voice, or the robot voice may be used directly as the target simulated voice. This can be set according to specific needs and scenarios and is not specifically limited here.
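A minimal sketch of this mapping-table selection is shown below. The voice-profile names, the contents of the alternative voice library, and the "first_type" label are illustrative assumptions, not values taken from the patent.

```python
# Sketch of the mapping-table approach described above; profile names
# and library contents are purely illustrative.
ALTERNATIVE_VOICE_LIBRARY = ["robot_voice", "female_profile_2", "male_profile_1"]

# Hypothetical mapping from an enrolled user to the simulated-voice
# profile built from that user's pronunciation characteristics.
USER_TO_SIMULATED_VOICE = {"user_a": "voice_profile_user_a"}

def select_target_simulated_voice(target_user: str, semantic_type: str,
                                  current_voice: str) -> str:
    """Pick a voice whose characteristics differ from the target user's
    when the utterance is a control instruction (the 'first type')."""
    if semantic_type == "first_type":
        own_profile = USER_TO_SIMULATED_VOICE.get(target_user)
        # Any alternative voice other than the target user's own profile
        # is acceptable; the robot voice serves as the final fallback.
        for candidate in ALTERNATIVE_VOICE_LIBRARY:
            if candidate != own_profile:
                return candidate
        return "robot_voice"
    return current_voice
```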
Step S104, voice is played using the target simulated voice.
Illustratively, the electronic device plays the target simulated voice through its sound playing unit to carry out human-computer interaction with the user; because the voice characteristics of the target simulated voice differ from the target user's, the played voice does not interfere with the target user's voice, and human-computer interaction efficiency is improved. Specifically, a simulated voice may be a preset voice library carrying a user's voice characteristics, and the target simulated voice may then be the voice library carrying the target user's voice; such a library contains a number of preset voice messages, and different messages are played under different trigger conditions to complete voice playback. It is also understood that a simulated voice may be obtained through a conversion model capable of voice-characteristic transfer, i.e. a model that converts given text or information into sound carrying the user's voice characteristics. The target simulated voice can therefore be obtained through the conversion model corresponding to the target user; such a model can be designed and trained by deep learning or similar means, and the process is not repeated here.
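As an illustration of the preset voice-library variant only, the sketch below maps each voice profile to prerecorded clips and plays the selected one; the simpleaudio package, the file paths, and the profile and message keys are assumptions.

```python
# Sketch of the preset voice-library idea; the `simpleaudio` package and
# all file paths are assumptions, not part of the patent disclosure.
import simpleaudio

# Each simulated-voice profile maps trigger phrases to prerecorded clips.
VOICE_LIBRARY = {
    "robot_voice": {"wash_done": "clips/robot/wash_done.wav"},
    "voice_profile_user_a": {"wash_done": "clips/user_a/wash_done.wav"},
}

def play_with_simulated_voice(voice_profile: str, message_key: str) -> None:
    """Play the clip for `message_key` using the chosen voice profile."""
    clip_path = VOICE_LIBRARY[voice_profile][message_key]
    playback = simpleaudio.WaveObject.from_wave_file(clip_path).play()
    playback.wait_done()
```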
In this embodiment, environmental sound information is acquired and a target user is determined from it; a target simulated voice is determined according to the voice content uttered by the target user, the voice characteristics of the target simulated voice being different from those of the target user; and voice is played using the target simulated voice. Because the target simulated voice is determined from the voice content uttered by the target user, when that content relates to the simulated voice the electronic device is about to play, a target simulated voice with different voice characteristics from the target user is selected and played, which prevents the played voice from conflicting with the target user's voice and disturbing communication between users, thereby improving human-computer interaction efficiency and user experience.
Fig. 4 is a flowchart of a simulated voice playing method according to another embodiment of the present application. As shown in fig. 4, the method of this embodiment further refines steps S102 to S103 of the method shown in fig. 2 and includes the following steps:
step S201, obtaining the user voiceprint characteristics according to the environmental sound information.
Illustratively, the environmental sound information refers to sound information in the environment of the electronic device to which the method of this embodiment is applied. It may be acquired by a sound sensor such as a microphone. More specifically, it may be obtained by collecting sound signals with one or more sound sensors; if several sound sensors are used, additional signal processing steps such as superposition, mixing and denoising may be involved, which are not described further here.
The environmental sound information describes the sound conditions in the environment, such as sounds produced by users talking, by tables and chairs being moved, or by a television set. These sounds have different voiceprint characteristics and can be distinguished on that basis. The user's voice is the sound of interest and can be separated out through the user's voiceprint characteristics.
Further, there are many ways to obtain the user voiceprint characteristics from the environmental sound information, for example by analyzing the environmental sound information and identifying the frequency-domain components and/or time-domain waveform characteristics of the different sounds; specific methods for time-domain and frequency-domain analysis of sound signals are prior art in the field and are not described here.
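A greatly simplified stand-in for such frequency-domain analysis is sketched below: log energies of a few spectral bands form the feature vector, which can then be compared with the cosine-similarity matcher sketched after step S102. A production system would use MFCCs or a learned speaker embedding; the band count and normalisation are assumptions.

```python
# Simplified stand-in for a real voiceprint model: coarse band energies
# of the spectrum are used as the feature vector (illustration only).
import numpy as np

def extract_voiceprint(signal: np.ndarray, num_bands: int = 32) -> np.ndarray:
    """Return a coarse spectral fingerprint of a mono speech signal."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, num_bands)
    energies = np.array([band.sum() for band in bands])
    features = np.log(energies + 1e-9)
    # Normalise so fingerprints can be compared with cosine similarity.
    return features / (np.linalg.norm(features) + 1e-9)
```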
Step S202, a target user is determined according to the user voiceprint characteristics.
The target user may be a user whose own voice could be confused with the simulated voice of the electronic device. For example, if the pronunciation characteristics of user A are enrolled in the intelligent washing machine and the washing machine can produce simulated voice based on them, user A is the target user.
Step S203, obtaining environment image information, and obtaining user position information according to the environment image information.
Illustratively, the environment image information refers to image information of the environment of the electronic device to which the method of this embodiment is applied. It may be acquired by an image sensor such as a camera. More specifically, it may be obtained by collecting image signals with one or more image sensors; if several image sensors are used, additional image processing steps such as superposition, stitching and denoising may be involved, which are not described further here.
From the environment image information, the position of each user within the coverage of the electronic device's image sensor can be determined. Specifically, by performing image feature recognition on the environment image information, the "users" in it can be distinguished from other objects and extracted, and the user position information corresponding to each user can then be determined. For example, the position information may be the distance and direction between the user and the electronic device, which can be characterized by a vector; it may also be the user's position coordinates within the coverage of the image sensor. The specific form of the user position information is not limited here, as long as it allows the user to be located.
In one possible implementation, while the user position information is acquired from the environment image information, the target user can also be identified among multiple users from the environment image information according to preset appearance information of the target user, and the user position information corresponding to the target user is then determined.
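The patent does not specify a person-detection method; as one hedged illustration, OpenCV's built-in HOG pedestrian detector can locate people in a frame and a rough horizontal bearing can be derived from each bounding box. The field-of-view value and the bearing convention below are assumptions.

```python
# Illustrative sketch only: OpenCV's stock HOG people detector stands in
# for whatever detector an actual product would use.
from typing import List
import cv2
import numpy as np

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def locate_users(frame: np.ndarray,
                 horizontal_fov_deg: float = 60.0) -> List[float]:
    """Return a rough horizontal bearing (degrees, 0 = camera axis)
    for each person detected in the environment image."""
    boxes, _ = hog.detectMultiScale(frame)
    _, width = frame.shape[:2]
    bearings = []
    for (x, y, w, h) in boxes:
        center_x = x + w / 2.0
        # Map the pixel offset from the image centre to an angle.
        bearings.append((center_x / width - 0.5) * horizontal_fov_deg)
    return bearings
```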
Step S204, the voice content uttered by the target user is extracted from the environmental sound information according to the user position information.
As described in the above steps, the environmental sound information may contain not only the voice uttered by the target user but also voices uttered by other users, so directly extracting voice from the environmental sound information gives poor results. Using the user position information, the positions of the target user and of the other users can be determined separately; the environmental sounds received by the several sound sensors from different directions are then weighted according to the user position information, i.e. the weight of the environmental sound coming from the target user's position is increased. The voice information uttered by the target user is then extracted, and speech recognition is performed on it to obtain the voice content uttered by the target user.
In this embodiment of the application, the voice information contained in the environmental sound information is filtered by the user position information, so that the voice information uttered by the target user and its corresponding voice content are extracted. In the multi-user environment of the electronic device the ambient sound is complex, so extracting voice content directly from voiceprint information gives poor results and can hardly meet the requirements of the subsequent analysis of the voice content; after the environmental sound information collected by the different sound sensors is weighted according to the user position information, the effect of amplifying the target user's voice energy is achieved and the accuracy of extracting the target user's voice content is improved.
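A simplified weighted-combination sketch of this emphasis step is given below; it stands in for a real beamforming stage. The microphone bearings, the weighting scheme, and the clip layout (matching the capture sketch after step S101) are assumptions.

```python
# Simplified weighted mix-down of the directional microphone channels,
# standing in for real beamforming; bearings and weights are assumptions.
import numpy as np

MIC_BEARINGS_DEG = np.array([-45.0, -15.0, 15.0, 45.0])  # hypothetical layout

def emphasise_target_user(clip: np.ndarray,
                          target_bearing_deg: float) -> np.ndarray:
    """Weight each channel by how closely its microphone points at the
    target user, then mix down to a single mono signal."""
    angular_gap = np.abs(MIC_BEARINGS_DEG - target_bearing_deg)
    weights = 1.0 / (1.0 + angular_gap)          # closer mics weigh more
    weights /= weights.sum()
    # clip has shape (samples, num_mics); the result is a mono signal.
    return clip @ weights
```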
Step S205, semantic analysis is performed on the voice content uttered by the target user, and the semantic type corresponding to the voice content is determined.
According to the meaning of the voice content uttered by the target user, the voice content can be classified and the corresponding semantic type determined. Fig. 5 is a schematic diagram of the classification of voice content according to an embodiment of the present application. As shown in fig. 5, the voice content uttered by the target user is classified: when the target user issues a voice instruction to the electronic device, for example "adjust the temperature to 25 degrees Celsius" or "dry the clothes", the semantic type corresponding to the voice content is the first type. When the target user is conversing with other users, no voice interaction with the electronic device is involved; such voice content, for example "hi, how come you are back so early today" or "he did not tell me about that", corresponds to the second type. The voice content corresponding to the first type is used to control the electronic device to execute an instruction.
Specifically, there are several ways of performing semantic analysis on the voice content and determining its semantic type. For example, the voice content may be matched against preset instruction information or keywords: if it can be matched, the corresponding semantic type is determined to be the first type; if it cannot, the semantic type is the second type. As another example, the voice content may be recognized by a neural-network-based semantic recognition model trained to convergence, and the semantic type determined from the model's output. This can be set according to specific needs and is not specifically limited here.
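A keyword-matching sketch of this first/second type decision is shown below; the keyword list is illustrative rather than taken from the patent, and the returned labels feed the selection sketch given after step S103.

```python
# Keyword-matching sketch of the semantic-type decision; the keyword
# list below is an assumption made only for illustration.
COMMAND_KEYWORDS = ("adjust", "temperature", "start", "stop", "wash", "dry")

def classify_semantic_type(voice_content: str) -> str:
    """Return 'first_type' for device-control instructions, otherwise
    'second_type' for ordinary conversation between users."""
    text = voice_content.lower()
    if any(keyword in text for keyword in COMMAND_KEYWORDS):
        return "first_type"
    return "second_type"
```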
Step S206, the target simulated voice is determined according to the semantic type.
Exemplarily, if the semantic type is the first type, the alternative voice is determined as the target simulated voice; otherwise, the currently used simulated voice is determined as the target simulated voice. The alternative voice is a preset robot voice or a preset alternative simulated voice whose voice characteristics differ from those of the target user.
Specifically, if the semantic type is the first type, the electronic device has determined that the target user is present and that the target user's voice content is an instruction directing the electronic device to work. The target user and the other users are then in a non-communicating state, and the other users' attention is not focused on the target user, so if the electronic device produced the same sound as the target user it would likely be misheard by the other users. In actual use, when other users mishear the voice and respond as if conversing with the target user, they may ask the target user for confirmation; at that moment the target user is interacting with the electronic device by voice and has no time to respond, or responds hurriedly so that the electronic device receives a wrong voice instruction, which harms the interaction efficiency between the user and the electronic device and the user experience. Therefore, if the semantic type is the first type, an alternative voice whose voice characteristics differ from the target user's is confirmed as the target simulated voice, which avoids the problems caused by the electronic device's simulated voice sharing the target user's voice characteristics.
Correspondingly, if the semantic type is the second type, the electronic device has determined that the target user is nearby and is conversing with other users; the target user and the other users are then in a communicating state. Users in a communicating state are not easily disturbed by information from an external electronic device and can readily tell whether a voice they hear was produced by the target user or by the electronic device, so the target user's voice can remain set as the target simulated voice, giving the user a sense of familiarity and interest and improving the user experience.
Step S207, voice is played using the target simulated voice.
In this embodiment, step S207 is implemented in the same way as step S104 of the embodiment shown in fig. 2; reference may be made to the specific implementation of step S104 and the description of its technical effects, which are not repeated here.
Fig. 6 is a schematic structural diagram of a simulated voice playing apparatus according to an embodiment of the present application. As shown in fig. 6, the simulated voice playing apparatus 3 of this embodiment includes:
an acquisition module 31, configured to acquire environmental sound information and determine a target user according to the environmental sound information;
a determining module 32, configured to determine a target simulated voice according to the voice content uttered by the target user;
and a playing module 33, configured to play voice using the target simulated voice.
The acquisition module 31, the determining module 32, and the playing module 33 are connected in sequence. The simulated voice playing apparatus 3 provided in this embodiment can execute the technical solution of the method embodiment shown in fig. 2; its implementation principle and technical effects are similar and are not repeated here.
Fig. 7 is a schematic structural diagram of a simulated voice playing apparatus according to another embodiment of the present application. As shown in fig. 7, the simulated voice playing apparatus 4 of this embodiment further includes a positioning module 41, an extraction module 42, and a setting module 43 on the basis of the simulated voice playing apparatus 3 shown in fig. 6, where:
in a preferred technical solution of the above simulated voice playing apparatus, when determining the target user according to the environmental sound information, the acquisition module 31 is specifically configured to: acquire user voiceprint characteristics according to the environmental sound information; and determine the target user according to the user voiceprint characteristics.
In a preferred technical solution of the above simulated voice playing apparatus, the apparatus further includes:
the positioning module 41, configured to acquire environment image information and obtain user position information according to the environment image information;
and the extraction module 42, configured to extract the voice content uttered by the target user from the environmental sound information according to the user position information.
In a preferred technical solution of the above simulated voice playing apparatus, the determining module 32 is specifically configured to: perform semantic analysis on the voice content uttered by the target user and determine a semantic type corresponding to the voice content; and determine the target simulated voice according to the semantic type.
In a preferred technical solution of the above simulated voice playing apparatus, when determining the target simulated voice according to the semantic type, the determining module 32 is specifically configured to: if the semantic type is the first type, determine the alternative voice as the target simulated voice; otherwise, determine the currently used simulated voice as the target simulated voice; the voice content corresponding to the first type is used to control the electronic device to execute an instruction.
In a preferred technical solution of the above simulated voice playing apparatus, the alternative voice is a preset robot voice or a preset alternative simulated voice, where the voice characteristics of the alternative simulated voice are different from those of the target user.
In a preferred technical solution of the above simulated voice playing apparatus, the apparatus further includes:
the setting module 43, configured to determine a specific target user according to the environment information, and to set a preset specific voice as the currently used simulated voice according to the specific target user.
The simulated voice playing apparatus 4 provided in this embodiment can execute the technical solutions of the method embodiments shown in fig. 3 to fig. 5; its implementation principles and technical effects are similar and are not repeated here.
Fig. 8 is a schematic view of an electronic device according to an embodiment of the present application, and as shown in fig. 8, an electronic device 5 according to the embodiment includes: a memory 51, a processor 52 and a computer program.
The computer program is stored in the memory 51 and configured to be executed by the processor 52 to implement the analog voice playing method provided in any embodiment corresponding to fig. 2 to 5 in the present application.
The memory 51 and the processor 52 are connected by a bus 53.
The relevant descriptions and effects corresponding to the steps in the embodiments corresponding to fig. 2 to fig. 5 can be understood, and are not described in detail herein.
One embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for playing simulated voice provided in any embodiment of the present application corresponding to fig. 2 to fig. 5.
The computer readable storage medium may be, among others, ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A simulated voice playing method, applied to an electronic device, characterized by comprising the following steps:
acquiring environmental sound information, and determining a target user according to the environmental sound information;
determining a target simulated voice according to the voice content uttered by the target user;
and playing voice using the target simulated voice.
2. The method of claim 1, wherein determining a target user based on the environmental sound information comprises:
acquiring user voiceprint characteristics according to the environmental sound information;
and determining a target user according to the user voiceprint characteristics.
3. The method of claim 1, further comprising:
acquiring environment image information, and acquiring user position information according to the environment image information;
and extracting the voice content uttered by the target user from the environmental sound information according to the user position information.
4. The method of claim 1, wherein determining a target simulated voice based on the voice content uttered by the target user comprises:
performing semantic analysis on the voice content uttered by the target user, and determining a semantic type corresponding to the voice content;
and determining the target simulation voice according to the semantic type.
5. The method of claim 4, wherein determining the target simulated voice based on the semantic type comprises:
if the semantic type is a first type, determining an alternative voice as the target simulated voice;
otherwise, determining the currently used simulated voice as the target simulated voice;
the voice content corresponding to the first type is used for controlling the electronic equipment to execute instructions.
6. The method according to claim 4, wherein the alternative voice is a preset robot voice or a preset alternative simulated voice, wherein the voice feature of the alternative simulated voice is different from the voice feature of the target user.
7. The method according to any one of claims 1-6, further comprising, after obtaining the environment information:
determining a specific target user according to the environment information;
and setting preset specific voice as the currently used simulated voice according to the specific target user.
8. A simulated voice playing apparatus, characterized by comprising:
an acquisition module, configured to acquire environmental sound information and determine a target user according to the environmental sound information;
a determining module, configured to determine a target simulated voice according to the voice content uttered by the target user;
and a playing module, configured to play voice using the target simulated voice.
9. An electronic device, comprising: a memory, a processor, and a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the simulated voice playing method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-executable instructions stored therein, wherein the computer-executable instructions, when executed by a processor, implement the simulated voice playing method according to any one of claims 1 to 7.
CN202010899170.3A 2020-08-31 2020-08-31 Analog voice playing method and device, electronic equipment and storage medium Pending CN114203148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010899170.3A CN114203148A (en) 2020-08-31 2020-08-31 Analog voice playing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010899170.3A CN114203148A (en) 2020-08-31 2020-08-31 Analog voice playing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114203148A 2022-03-18

Family

ID=80644394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010899170.3A Pending CN114203148A (en) 2020-08-31 2020-08-31 Analog voice playing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114203148A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842842A (en) * 2022-03-25 2022-08-02 青岛海尔科技有限公司 Voice interaction method and device of intelligent equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination