CN114710733A - Voice playing method and device, computer-readable storage medium and electronic equipment (Google Patents)

Info

Publication number: CN114710733A
Authority: CN (China)
Prior art keywords: target, user, intention, voice, microphone
Legal status: Pending (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202210327184.7A
Other languages: Chinese (zh)
Inventor: 姚碧莹
Current assignee: Beijing Horizon Robotics Technology Research and Development Co., Ltd. (the listed assignee may be inaccurate)
Original assignee: Beijing Horizon Robotics Technology Research and Development Co., Ltd.
Application filed by Beijing Horizon Robotics Technology Research and Development Co., Ltd.
Priority to CN202210327184.7A
Publication of CN114710733A

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 - Details of transducers, loudspeakers or microphones
    • H04R 1/20 - Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 - Arrangements for obtaining desired directional characteristic only
    • H04R 1/40 - Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 - Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers: microphones
    • H04R 3/00 - Circuits for transducers, loudspeakers or microphones

Abstract

The embodiments of the present disclosure disclose a voice playing method and apparatus, a computer-readable storage medium, and an electronic device. The method includes the following steps: determining a target user having a vocalization intention in a target space based on a detection result of vocalization intention detection; determining a target position of a target part of the target user in the target space; determining a target microphone corresponding to the target user based on the positional relationship between the target position and the microphones of a microphone array in the target space; extracting a target voice signal of the target user from the audio signal collected by the target microphone; and controlling an audio playing device in the target space to play the target voice signal. The embodiments of the disclosure free the user from manually controlling a microphone to collect and play sound, greatly improve the convenience of playing sound through a microphone, and save the cost of providing a separate microphone for sound playback.

Description

Voice playing method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to a voice playing method and apparatus, a computer-readable storage medium, and an electronic device.
Background
At present, in some spaces containing multiple people, the sound emitted by certain people or from certain areas needs to be collected and played back. The current mainstream scheme is to set up a separate microphone that the user holds or wears. For example, in a scenario where a user sings in a vehicle, an additional microphone device needs to be installed in the vehicle as a sound pickup terminal, and parameters of that terminal such as sensitivity and directivity must be designed so that the sound it acquires masks the feedback from the audio playing device. Alternatively, a mobile phone connected to the in-vehicle system may serve as the sound pickup terminal.
In such schemes for acquiring and playing audio, the user must operate manually when configuring the audio acquisition and playback device or enabling the acquisition and playback function, so operation convenience is low.
Disclosure of Invention
Embodiments of the present disclosure provide a voice playing method and apparatus, a computer-readable storage medium, and an electronic device.
An embodiment of the present disclosure provides a voice playing method, including: determining a target user with the vocalization intention in the target space based on the detection result of the vocalization intention; determining a target position of a target part of a target user in a target space; determining a target microphone corresponding to a target user based on the position relation between the target position and the microphones included in the microphone array in the target space; extracting a target voice signal of a target user from an audio signal collected by a target microphone; and controlling an audio playing device in the target space to play the target voice signal.
According to another aspect of the embodiments of the present disclosure, there is provided a voice playback apparatus including: the first determination module is used for determining a target user with the sound production intention in the target space based on the detection result of the sound production intention; the second determination module is used for determining the target position of the target part of the target user in the target space; the third determining module is used for determining a target microphone corresponding to a target user based on the position relation between the target position and the microphones included by the microphone array in the target space; the extraction module is used for extracting a target voice signal of a target user from an audio signal collected by a target microphone; and the playing module is used for controlling the audio playing equipment in the target space to play the target voice signal.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-mentioned voice playing method.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; and the processor is used for reading the executable instruction from the memory and executing the instruction to realize the voice playing method.
Based on the voice playing method and apparatus, computer-readable storage medium, and electronic device provided by the embodiments of the present disclosure, a target user having a vocalization intention in a target space is determined based on a detection result of vocalization intention detection; a target position of a target part of the target user in the target space is then determined; a target microphone corresponding to the target user is determined based on the positional relationship between the target position and the microphones of the microphone array in the target space; a target voice signal of the target user is extracted from the audio signal collected by the target microphone; and finally an audio playing device in the target space is controlled to play the target voice signal. This automatically identifies the target user who has a vocalization intention and automatically allocates a microphone to that user: the user neither needs to manually control a microphone to collect and play sound, nor needs to hold a separate microphone or move to a position where a microphone is installed in order to have audio collected and played. This greatly improves the convenience of playing voice through a microphone and saves the cost of a separately provided microphone for voice playback.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a system diagram to which the present disclosure is applicable.
Fig. 2 is a flowchart illustrating a voice playing method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a voice playing method according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a voice playing method according to another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a voice playing method according to another exemplary embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a target sound reception area where a target portion of a target user is located according to an embodiment of the disclosure.
Fig. 7 is a flowchart illustrating a voice playing method according to another exemplary embodiment of the present disclosure.
Fig. 8 is a flowchart illustrating a voice playing method according to another exemplary embodiment of the present disclosure.
Fig. 9 is a flowchart illustrating a voice playing method according to another exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a voice playing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of a voice playing apparatus according to another exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure, and that the disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and imply neither any particular technical meaning nor any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more, and "at least one" may refer to one, two, or more.
It is also to be understood that any reference to a component, data, or structure in the embodiments of the disclosure may generally be understood as one or more, unless explicitly defined or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" in the present disclosure generally indicates an "or" relationship between the preceding and following objects.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
Under current schemes for audio acquisition and playback in such spaces, a user must operate manually when configuring the audio acquisition and playback device or enabling the acquisition and playback function, and a user far from the control device must move considerably to reach it, so operation convenience is low. For example, in a traveling vehicle, a passenger needs to manually operate a control terminal such as a touch screen; the space in the vehicle is cramped, the operation is inconvenient, and the passenger's movement inside the vehicle affects driving safety.
To solve this problem, embodiments of the present disclosure provide a voice playing method that automatically recognizes a user's vocalization intention and allocates a microphone to that user.
Exemplary System
Fig. 1 illustrates an exemplary system architecture 100 of a voice playback method or voice playback apparatus to which embodiments of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, server 103, microphone array 104, and audio playback device 105. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The microphone array 104 may capture audio signals emitted within the target space. The audio playback device 105 may play back audio signals collected by the microphone array. The target space may be various types of spaces such as a vehicle interior space, a ship interior space, a house interior space, and the like.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. The terminal device 101 may have various communication client applications installed thereon, such as a multimedia application, a search-type application, a web browser application, a shopping-type application, an instant messaging tool, and the like.
The terminal device 101 may be any of various electronic devices, including but not limited to mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (e.g., a car navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The terminal device 101 may control a voice interaction device (which may be the terminal device 101 itself or another device connected to it) to perform voice interaction.
The server 103 may be a server providing various services, for example a background server that processes audio signals uploaded by the terminal device 101. The background server may receive the audio signals, images, and the like uploaded by the terminal device to perform vocalization intention detection, and may perform operations such as speech separation on the received audio signals to obtain the target user's voice signal and send it to the audio playing device 105.
It should be noted that the voice playing method provided by the embodiment of the present disclosure may be executed by the server 103, or may be executed by the terminal device 101, and accordingly, the voice playing apparatus may be disposed in the server 103, or may be disposed in the terminal device 101.
It should be understood that the numbers of terminal devices 101, network 102, server 103, microphone array 104, and audio playback devices 105 in fig. 1 are merely illustrative. There may be any number of terminal devices 101, networks 102, servers 103, microphone arrays 104, and audio playback devices 105, as desired for an implementation. For example, in the case where the audio signal does not require remote processing, the system architecture may not include a network and a server, and only includes a microphone array, a terminal device, and an audio playback device.
Exemplary method
Fig. 2 is a flowchart illustrating a voice playing method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 2, the method includes the following steps:
Step 201: determine a target user having a vocalization intention in the target space based on a detection result of vocalization intention detection.
In this embodiment, the electronic device may determine, based on the detection result of vocalization intention detection, a target user having a vocalization intention in the target space. The detection result may indicate whether a user with a vocalization intention exists in the target space. A vocalization intention may represent any of various forms of vocalizing, such as singing, lecturing, or providing accompaniment.
The electronic device may obtain the detection result according to one or more preset vocalization intention detection modes. For example, a camera may be arranged in the target space; the electronic device may recognize the images it captures, determine from the recognition result whether each user performs a specific motion (for example, the motion of holding a microphone), and determine a user performing that motion to be a target user with a vocalization intention.
Step 202: determine the target position of a target part of the target user in the target space.
In this embodiment, the electronic device may determine the target position of the target user's target part in the target space. The target part may be any of various parts, such as the head, the mouth, or a hand (the hand being suitable, for example, when the user claps or plays an instrument). The electronic device may determine the target position in various ways. For example, it may recognize an image captured by a camera in the target space, determine the position of the target part in the image, and determine the actual position of the target part according to the mapping relationship between the image and the actual area of the target space; that actual position is the target position.
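As a minimal sketch of the image-to-space mapping mentioned above, the following Python snippet maps a detected pixel (for example, the mouth center found by image recognition) to cabin coordinates through a precomputed homography. The homography values and the helper name are illustrative assumptions, not part of the disclosure; a real deployment would obtain the matrix from a one-time camera calibration.

```python
import numpy as np

def pixel_to_cabin_position(pixel_xy, homography):
    """Map an image pixel (e.g., the detected mouth center) to 2D cabin
    coordinates via a 3x3 homography obtained from a one-time calibration
    (hypothetical here) that pairs image points with known cabin positions."""
    p = np.array([pixel_xy[0], pixel_xy[1], 1.0])
    w = homography @ p
    return w[:2] / w[2]  # normalize homogeneous coordinates

# Illustrative homography only; a real matrix comes from calibration.
H = np.array([[0.01, 0.0, -1.5],
              [0.0, 0.01, -0.5],
              [0.0, 0.0, 1.0]])
print(pixel_to_cabin_position((320, 240), H))  # -> [1.7 1.9]
```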
Step 203, determining a target microphone corresponding to the target user based on the position relationship between the target position and the microphones included in the microphone array in the target space.
In this embodiment, the electronic device may determine a target microphone corresponding to the target user based on a position relationship between the target position and microphones included in the microphone array in the target space. Wherein the microphone array comprises a plurality of microphones arranged at different positions within the target space. As an example, when the target space is an in-vehicle space, one microphone may be provided near each seat in the vehicle, or two microphones may be provided above each row of seats in the vehicle.
After determining the target position, the electronic device may determine, from the microphone array, a microphone closest to the target position as the target microphone.
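Selecting the nearest microphone reduces to a distance computation. The sketch below shows one possible implementation; the function name and the seat-level microphone coordinates are assumptions for illustration.

```python
import numpy as np

def select_target_microphone(target_pos, mic_positions):
    """Return the index of the array microphone closest to the target
    position (both expressed in the same cabin coordinate frame)."""
    mics = np.asarray(mic_positions, dtype=float)
    target = np.asarray(target_pos, dtype=float)
    return int(np.argmin(np.linalg.norm(mics - target, axis=1)))

# Four seat-level microphones in a vehicle cabin; coordinates are illustrative.
mics = [(-0.5, 0.8), (0.5, 0.8), (-0.5, -0.8), (0.5, -0.8)]
print(select_target_microphone((0.4, -0.7), mics))  # -> 3 (rear-right seat)
```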
Step 204, extracting a target voice signal of the target user from the audio signal collected by the target microphone.
In this embodiment, the electronic device may extract a target voice signal of the target user from the audio signal collected by the target microphone. Specifically, based on existing speech separation techniques (e.g., blind source separation, sound source localization, adaptive filtering), the electronic device may extract, from the channel of audio collected by the target microphone, the voice signal emitted from the target position; that voice signal is the target voice signal of the target user.
Step 205, controlling the audio playing device in the target space to play the target voice signal.
In this embodiment, the electronic device may control an audio playback device (e.g., the audio playback device 105 shown in fig. 1) in the target space to play the target voice signal. Specifically, the electronic device may generate an instruction for instructing the audio playback device to play the target voice signal, and the audio playback device plays the corresponding target voice signal based on the instruction.
In the method provided by the above embodiment of the present disclosure, a target user having a vocalization intention in a target space is determined based on a detection result of vocalization intention detection; a target position of a target part of the target user in the target space is then determined; a target microphone corresponding to the target user is determined based on the positional relationship between the target position and the microphones of the microphone array in the target space; a target voice signal of the target user is extracted from the audio signal acquired by the target microphone; and finally an audio playing device in the target space is controlled to play the target voice signal. This automatically identifies the target user who has a vocalization intention and automatically allocates a microphone to that user: the user neither needs to manually control a microphone to collect and play sound, nor needs to hold a separate microphone or move to a position where a microphone is installed. This greatly improves the convenience of playing voice through a microphone and saves the cost of a separately provided microphone for voice playback.
In some optional implementations, as shown in fig. 3, in step 201 the following sub-steps are performed for each user to be detected among the at least one user in the target space:
Step 2011: perform vocalization gesture detection on the user to be detected to obtain the vocalization gesture information of that user.
Here, a user to be detected is a user among the at least one user on whom vocalization intention detection is yet to be performed. For any user, before the current round of detection that user is a user to be detected; after the round has been performed, the user no longer is. The term is defined relative to the detection round: if vocalization intention detection needs to be performed on the user again, the user is again treated as a user to be detected before the next round.
The electronic device may perform gesture recognition on images captured by a camera in the target space to obtain the vocalization gesture information, which may indicate whether the user has made a vocalization gesture. As an example, a vocalization gesture may include at least one of: a microphone-holding gesture, a hand-waving gesture, and the like.
Step 2012, determining the voice intention information of the user to be detected based on the voice of the user to be detected.
Specifically, the electronic device may extract the voice signal of the user to be detected from the audio signals collected by the microphone array through a speech separation technique, and recognize that signal to obtain the voice intention information, which may indicate whether the user uttered speech expressing a vocalization intention. As examples, such speech may be "I want to sing", "give me the microphone", or the like.
Step 2013: determine the lip language information of the user to be detected based on the user's lip movements.
Specifically, the electronic device may perform lip-movement feature recognition on a camera-captured image sequence containing the lips of the user to be detected to obtain the lip language information, which indicates whether the speech represented by the user's lip movements expresses a vocalization intention. As an example, if the speech represented by the lip language information is "I want to sing", the user is determined to have a vocalization intention.
Step 2014: in response to determining that at least a first preset number of the vocalization gesture information, the voice intention information, and the lip language information of the user to be detected satisfy their intention judgment conditions, determine that the user to be detected is a target user having a vocalization intention.
The first preset number may be set arbitrarily, for example to 2. Each of the vocalization gesture information, the voice intention information, and the lip language information corresponds to one intention judgment condition. For example, the condition for the vocalization gesture information is that it matches preset gesture information (e.g., gesture information indicating a hand holding a microphone); the condition for the voice intention information is that it matches preset voice information (e.g., voice information indicating "I want to sing"); and the condition for the lip language information is that it matches preset lip language information (e.g., lip language information indicating the speech "I want to sing").
Since steps 2011-2014 are executed for each user to be detected, once this embodiment finishes executing, it can be determined whether each user in the target space has a vocalization intention.
By combining multiple vocalization intention detection modes, this embodiment flexibly uses various means to detect a user's vocalization intention, and the several detection results can cross-check one another, greatly improving the accuracy of the detection. A minimal voting sketch follows.
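The sketch below illustrates, under assumed boolean detector outputs, the at-least-first-preset-number decision of step 2014; the function name and the default threshold of 2 are taken from the example above.

```python
def has_vocalization_intention(gesture_ok, speech_ok, lip_ok, first_preset=2):
    """Step 2014 as a vote: declare a vocalization intention when at least
    `first_preset` of the three detectors (gesture, speech, lip language)
    satisfy their intention judgment condition."""
    return sum(map(bool, (gesture_ok, speech_ok, lip_ok))) >= first_preset

# Gesture matches "holding a microphone" and lip language reads "I want to
# sing", but the spoken request was not captured: still a target user.
print(has_vocalization_intention(True, False, True))  # -> True
```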
In some alternative implementations, as shown in fig. 4, step 204 may include the following sub-steps:
step 2041, a main audio signal is determined based on the audio signal collected by the target microphone.
Specifically, the channel of audio collected by the target microphone may be used directly as the main audio signal, or the collected audio may first undergo filtering or similar processing to obtain the main audio signal.
Step 2042, a reference audio signal is determined based on the audio signals collected by the other microphones of the microphone array.
Specifically, each channel of audio collected by the other microphones may be used as a reference audio signal, or each such channel may first undergo filtering or similar processing to obtain a reference audio signal.
Step 2043, based on the reference audio signal, filtering the main audio signal to obtain a target audio signal of the target user.
Specifically, the electronic device may use an adaptive filtering algorithm that takes the audio signals acquired by microphones other than the target microphone as references and adaptively filters out of the main audio signal the audio corresponding to non-target users. For example, if a passage of speech uttered by another user is collected by a non-target microphone, the audio matching that passage in the main audio signal may be filtered out as noise, yielding the target voice signal of the target user. Alternatively, the electronic device may use a neural network model trained by machine learning, feeding the main audio signal and each reference audio signal into the model, which outputs the target voice signal of the target user.
By distinguishing the main audio signal from the reference audio signals and filtering the main audio signal, this embodiment can remove the voice signals of users other than the target user, preventing their voices from being played along with the target voice signal, so that the target user's voice is played selectively and with high quality.
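As one concrete instance of such adaptive filtering, the following sketch implements a normalized LMS (NLMS) canceller: the target microphone's channel is the primary input, another microphone's channel is the reference, the filter learns the interference path, and the residual approximates the target speech. The filter length, step size, and function name are illustrative assumptions, not the patent's prescribed algorithm.

```python
import numpy as np

def nlms_cancel(primary, reference, taps=64, mu=0.1, eps=1e-8):
    """Normalized LMS interference canceller.

    `primary` is the target microphone's channel; `reference` is another
    microphone's channel that mainly contains interfering speech. The filter
    learns to predict the interference present in `primary` from `reference`,
    and the residual (error) signal is the target-speech estimate."""
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    w = np.zeros(taps)
    out = np.zeros_like(primary)
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]      # most recent reference samples
        y = w @ x                            # predicted interference
        e = primary[n] - y                   # residual = target estimate
        w += mu * e * x / (x @ x + eps)      # normalized weight update
        out[n] = e
    return out
```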
In some alternative implementations, as shown in fig. 5, step 204 includes the following sub-steps:
Step 2044: determine, based on a captured image of the target user's target part, the target sound reception area in which the target part lies within the sound reception range of the target microphone.
Generally, the sound reception range of the target microphone is approximately a sector centered at the microphone's position. According to the mapping relationship between the target part image and the actual space, the electronic device may determine, within the sound reception range, the sub-sector in which the imaged target part lies; that sub-sector may be taken as the target sound reception area.
As an example, as shown in fig. 6, the sound reception range of the target microphone 601 is the sector 602, the target user's mouth is the target part, and the mouth lies within sub-sector 6021 of sector 602; sub-sector 6021 is the target sound reception area. The angle of sub-sector 6021 is a preset angle (e.g., 3°), so the electronic device may fix sub-sector 6021 within sector 602 by determining the position of the target user's mouth.
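A sub-sector membership test of this kind is a small piece of planar geometry. The sketch below, with the 3° preset angle from the example above, checks whether a localized source direction falls inside the target sound reception area; all names and coordinates are illustrative.

```python
import math

def in_target_sector(source_xy, mic_xy, center_angle_deg, width_deg=3.0):
    """Check whether a localized sound source lies inside the sub-sector
    (target sound reception area) of the microphone's pickup range.
    `center_angle_deg` points from the microphone toward the target user's
    mouth; `width_deg` is the preset sector angle (3 degrees in the text)."""
    ang = math.degrees(math.atan2(source_xy[1] - mic_xy[1],
                                  source_xy[0] - mic_xy[0]))
    diff = (ang - center_angle_deg + 180.0) % 360.0 - 180.0  # wrap to +/-180
    return abs(diff) <= width_deg / 2.0

print(in_target_sector((1.0, 0.02), (0.0, 0.0), 0.0))  # -> True (~1.1 deg)
```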
Step 2045, perform sound source localization on the audio signal collected by the target microphone, and determine the position of at least one sound source within the sound reception range of the target microphone.
The method for positioning the sound source can be implemented by using the prior art, and is not described herein again. The electronic device can separate the mixed audio signal collected by the target microphone to obtain at least one separated audio signal, and perform sound source positioning on each separated audio signal, thereby determining the position of the sound source corresponding to each separated audio signal.
Step 2046, suppress audio signals collected from sound sources outside the target sound reception zone.
Specifically, the electronic device may filter the audio signals of the sound sources outside the target sound receiving area, so as to complete suppression of the audio signals of the sound sources outside the target sound receiving area.
Step 2047, extract the target speech signal of the target user from the suppressed audio signal.
Specifically, the electronic device may perform noise reduction on the suppressed audio signal and filter out background noise, thereby obtaining the target voice signal of the target user.
By determining the target sound reception area in which the target user's target part lies and suppressing the audio emitted by sound sources outside that area, this embodiment extracts the target voice signal more accurately and reduces interference from other users' voices when the target voice signal is played.
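Putting steps 2045-2047 together, the sketch below suppresses (drops) every separated source localized outside the target sound reception area and keeps the rest as the target-speech estimate. The separation and localization steps are assumed to have been done upstream; the helper and parameter names are illustrative.

```python
import math
import numpy as np

def _in_sector(pos, mic_xy, center_deg, width_deg):
    ang = math.degrees(math.atan2(pos[1] - mic_xy[1], pos[0] - mic_xy[0]))
    return abs((ang - center_deg + 180.0) % 360.0 - 180.0) <= width_deg / 2.0

def extract_target_speech(separated_sources, source_positions,
                          mic_xy, center_angle_deg, width_deg=3.0):
    """Suppress every separated source whose localized position lies outside
    the target sound reception area and sum the rest as the target speech
    estimate; a real system would also denoise the retained signal."""
    kept = [np.asarray(sig, dtype=float)
            for sig, pos in zip(separated_sources, source_positions)
            if _in_sector(pos, mic_xy, center_angle_deg, width_deg)]
    if not kept:
        return np.zeros(len(np.asarray(separated_sources[0])))
    return np.sum(kept, axis=0)
```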
In some optional implementations, as shown in fig. 7, after step 205, the method further includes:
Step 206: determine whether the target user intends to stop vocalizing, based on stop-vocalization intention detection.
The detection of a stop-vocalization intention may be implemented in various ways; see the alternative embodiments below for specific implementations. A stop-vocalization intention means that the target user no longer needs the electronic device to collect and play his or her voice signal and no longer expects exclusive use of the target microphone.
Step 207: in response to determining that the detection result indicates the target user intends to stop vocalizing, stop extracting the target voice signal of the target user from the audio signal collected by the target microphone.
After the electronic device stops extracting the target voice signal from the audio collected by the target microphone, the audio playing device stops playing the target voice signal; that is, the target microphone is no longer exclusive to the target user, and the electronic device reclaims the right to use it. The target microphone may then continue to collect audio signals, and the electronic device may continue to perform vocalization intention detection on each user in the target space.
By detecting the target user's stop-vocalization intention, this embodiment stops allocating the target microphone to that user once the intention is present, freeing the microphone for other uses. The target user can thus release the microphone without any manual operation, further improving the convenience of microphone control.
In some alternative implementations, as shown in fig. 8, step 206 may include the following sub-steps:
Step 2061: perform vocalization gesture detection on the target user, and generate a first detection result indicating that the target user intends to stop vocalizing in response to the obtained vocalization gesture information failing to match the preset vocalization gesture, or matching a preset stop-vocalization gesture.
Here, vocalization gesture detection may be performed as described in step 2011. As an example, if the preset vocalization gesture is a virtual microphone-holding gesture, then during singing, when the detected gesture information indicates the user is no longer making that gesture, a detection result is generated indicating that the user no longer intends to sing. Alternatively, the preset stop-vocalization gesture may be the user lowering both hands, waving a hand, or the like; when the gesture information represents such a gesture, a detection result indicating that the user no longer intends to sing is generated.
Step 2062: perform speech recognition on the target user, and generate a second detection result indicating that the target user intends to stop vocalizing in response to the determined voice intention information matching preset stop-vocalization speech.
The voice intention information may be obtained as described in step 2012. As an example, the preset stop-vocalization speech includes utterances such as "I don't want to sing" and "stop picking up sound". When the electronic device detects that the target user utters "I don't want to sing" while singing, it generates a detection result indicating that the user no longer intends to sing.
Step 2063: perform lip language recognition on the target user, and generate a third detection result indicating that the target user intends to stop vocalizing in response to the determined lip language information indicating a stop-vocalization intention.
Lip language recognition may be performed as described in step 2013. As an example, when the speech represented by the detected lip language information is "I don't want to sing", the target user is determined to have a stop-vocalization intention and a corresponding detection result is generated.
Step 2064: determine the duration between the moment the target user last stopped vocalizing and the current moment, and generate a fourth detection result indicating that the target user intends to stop vocalizing in response to determining that the duration is greater than or equal to a preset duration.
That is, when the target user is detected to have emitted no voice signal for a sufficiently long time, the target user is determined to have a stop-vocalization intention, and a detection result indicating so is generated.
Step 2065: in response to obtaining at least a second preset number of the first, second, third, and fourth detection results, determine that the target user intends to stop vocalizing.
The second preset number may be set arbitrarily (e.g., 1 or 2). That is, when any second-preset-number of the four detection manners report that the target user intends to stop vocalizing, the target user is determined to truly have that intention, and step 207 is then executed.
This embodiment provides multiple ways of detecting the target user's stop-vocalization intention, flexibly using various means and thereby making it easier for the target user to return the right to use the target microphone. The several detection results can also cross-check one another, greatly improving the accuracy of stop-vocalization intention detection. A combined sketch follows.
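The sketch below combines the silence-duration check of step 2064 with the at-least-second-preset-number decision of step 2065, assuming the other three detection results arrive as booleans; the class name and the 30-second and 2-vote thresholds are illustrative assumptions.

```python
import time

class StopIntentionDetector:
    """Combines the silence-duration check (step 2064) with the
    at-least-second-preset-number vote (step 2065). The 30-second silence
    threshold and the 2-vote threshold are illustrative assumptions."""
    def __init__(self, silence_preset=30.0, second_preset=2):
        self.silence_preset = silence_preset
        self.second_preset = second_preset
        self.last_voice = time.monotonic()

    def on_voice(self):
        # Call whenever the target user's speech is detected.
        self.last_voice = time.monotonic()

    def silence_result(self):
        # Fourth detection result: no speech for the preset duration.
        return time.monotonic() - self.last_voice >= self.silence_preset

    def stop_intention(self, gesture_result, speech_result, lip_result):
        # First three results come from the gesture / speech / lip detectors.
        results = (gesture_result, speech_result, lip_result,
                   self.silence_result())
        return sum(map(bool, results)) >= self.second_preset
```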
In some optional implementations, as shown in fig. 9, after step 205, the method may further include:
Step 208: detect the current state of the target space based on at least one preset state detection mode to obtain at least one piece of state information.
Optionally, the at least one state detection manner may include:
in the first method, motion detection is performed on a user in a target space, and motion information indicating the current motion of the user is obtained.
The method for detecting the motion can adopt the prior art. For example, gesture recognition may be performed on a user in images included in an image sequence captured by a camera in the target space, and the motion of the user may be determined from gesture information corresponding to each image.
And detecting the state of the voice interaction equipment in the target space to obtain the use state information representing the current use state of the voice interaction equipment.
And detecting the environment of the target space to obtain environment type information representing the type of the environment of the target space.
And detecting the motion state of the target space to obtain motion state information representing the motion state of the target space.
The target space in this manner is a movable space, for example, a vehicle interior space.
Step 209: in response to determining that any piece of the at least one piece of state information conforms to its corresponding preset state, reduce the playing volume of the audio playing device in the target space, and/or output prompt information corresponding to the matched preset state.
Here, each piece of state information corresponds to one preset state. As an example of the first mode, in an in-vehicle scene, if a user is detected making a phone call, or a user's body motion may affect the driver's driving, the motion information is determined to conform to the preset motion state.
For the second mode, as an example, the voice interaction device may include a Bluetooth call device. The electronic device may monitor that device and determine that the use-state information conforms to the preset use state when it detects a user making a call through it, or when it detects a user employing the voice interaction device to voice-control equipment in the vehicle.
For the third mode, as an example, when the target space is an in-vehicle space, environment detection may be performed on images of the vehicle's surroundings captured by an exterior camera, or by means of an ADAS (Advanced Driver Assistance System), high-precision map positioning, and the like. When the detected environment type information indicates poor road conditions, congestion, or the presence nearby of special vehicles such as police cars or fire trucks, the environment type information is determined to conform to the preset environment state.
For the fourth mode, the motion state may include the vehicle's traveling speed, acceleration, and the like. When the detected motion state information indicates that the vehicle is traveling fast, braking suddenly, or the like, the motion state information is determined to conform to the preset motion state.
When reducing the playing volume of the audio playing device, the volume may be lowered to the minimum level or to a preset volume. The prompt information may be output in various ways, such as being shown on a display in the target space or played as a prompt tone through an audio playing device.
This embodiment detects the current state of the target space through at least one state detection mode and, when a preset state is matched, reduces the playing volume and/or outputs prompt information. This effectively reduces the interference of audio playback with users and promptly reminds them of the current state of the target space. Applied in a vehicle, it helps reduce the interference of audio playback with the driver and improves driving safety.
In some optional implementations, as shown in fig. 9, after step 209, the method further includes:
Step 210: in response to detecting that the current state of the target space no longer conforms to the corresponding preset state, control the playing volume of the audio playing device to be adjusted to a target volume.
The target volume may be a preset fixed volume, or the volume the audio playing device had before step 209 was performed.
Once audio playback no longer interferes with the users in the target space, this embodiment automatically restores the volume of the audio playing device without the user having to set it manually, further improving operational convenience. A sketch of the duck-and-restore logic follows.
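The sketch below models the volume ducking of step 209 and the restoration of step 210 with a minimal stand-in player object; the class names, the 0.2 ducked volume, and the prompt text are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class StateInfo:
    matches_preset: bool   # does the detected state conform to its preset state?
    prompt: str            # prompt information to output when it does

class Player:
    """Minimal stand-in for the audio playing device (illustrative only)."""
    def __init__(self, volume=0.8):
        self.volume = volume
        self._saved = None
    def duck(self, low=0.2, prompt=""):
        if self._saved is None:
            self._saved = self.volume        # remember the pre-duck volume
        self.volume = low                    # reduce the playing volume
        if prompt:
            print("PROMPT:", prompt)         # output the prompt information
    def restore(self):
        if self._saved is not None:
            self.volume = self._saved        # back to the target volume
            self._saved = None

def update(states, player):
    hits = [s for s in states if s.matches_preset]
    if hits:
        player.duck(prompt=hits[0].prompt)   # step 209
    else:
        player.restore()                     # step 210

p = Player()
update([StateInfo(True, "Emergency braking detected")], p)  # ducked to 0.2
update([StateInfo(False, "")], p)                           # restored to 0.8
```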
In some optional implementations, after step 202, the method may further include:
a marker representing a target location is displayed on a display within the target space.
As an example, the indicia of the target location may be indicia of a corresponding seat.
By displaying a marker representing the target position, this embodiment lets users in the target space see more intuitively how the target microphone is currently being used, so that they can use the microphone array to play voice more flexibly.
Exemplary devices
Fig. 10 is a schematic structural diagram of a voice playing apparatus according to an exemplary embodiment of the present disclosure. The present embodiment can be applied to an electronic device, as shown in fig. 10, the voice playing apparatus includes: a first determining module 1001, configured to determine, based on a detection result of the vocalization intention, a target user with a vocalization intention in a target space; a second determining module 1002, configured to determine a target position of a target portion of a target user in a target space; a third determining module 1003, configured to determine a target microphone corresponding to the target user based on a position relationship between the target position and microphones included in the microphone array in the target space; an extracting module 1004, configured to extract a target speech signal of a target user from an audio signal collected by a target microphone; the playing module 1005 is configured to control the audio playing device in the target space to play the target voice signal.
In this embodiment, the first determination module 1001 may determine the target user having the utterance intention in the target space based on the detection result of the utterance intention. The detection result may indicate whether a user with a vocalization intention exists in the target space. The vocalization intentions may represent various forms of vocalization such as singing, lecturing, accompanying, and the like.
The first determination module 1001 may obtain a detection result of the utterance intention according to various utterance intention detection manners set in advance. For example, a camera may be disposed in the target space, the first determining module 1001 may recognize an image captured by the camera, determine whether each user performs a specific motion (e.g., a motion of holding a microphone) according to the recognition result, and determine the user performing the specific motion as the target user with the sound production intention.
In this embodiment, the second determining module 1002 may determine a target position of a target portion of a target user in a target space. The target portion may be various portions such as a head, a mouth, a hand (for example, suitable for a user to clap hands, operate a musical instrument, and the like), and the like. The second determination module 1002 may determine the target location where the target site is based on various means. For example, an image captured by a camera in the target space may be recognized, a position of the target portion in the image may be determined, and an actual position of the target portion may be determined according to a mapping relationship between the image and an actual area in the target space, where the position is the target position.
In this embodiment, the third determining module 1003 may determine a target microphone corresponding to the target user based on a position relationship between the target position and microphones included in the microphone array in the target space. Wherein the microphone array comprises a plurality of microphones arranged at different positions within the target space. As an example, when the target space is an interior space of a vehicle, one microphone may be provided near each seat in the vehicle, or two microphones may be provided above each row of seats in the vehicle.
After determining the target position, the third determining module 1003 may determine a microphone closest to the target position from the microphone array as the target microphone.
In this embodiment, the extraction module 1004 may extract a target voice signal of the target user from the audio signal collected by the target microphone. Specifically, based on existing speech separation techniques (e.g., blind source separation, sound source localization, adaptive filtering), the extracting module 1004 may extract, from the channel of audio collected by the target microphone, the voice signal emitted from the target position; that voice signal is the target voice signal of the target user.
In this embodiment, the playing module 1005 may control an audio playing device (e.g., the audio playing device 105 shown in fig. 1) in the target space to play the target voice signal. Specifically, the playing module 1005 may generate an instruction for instructing the audio playing device to play the target voice signal, and based on the instruction, the audio playing device plays the corresponding target voice signal.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a voice playing apparatus according to another exemplary embodiment of the present disclosure.
In some alternative implementations, the first determining module 1001 includes: the first detection unit 10011 is configured to perform, for each to-be-detected user of the at least one user in the target space, sounding gesture detection on the to-be-detected user to obtain sounding gesture information of the to-be-detected user; a first determining unit 10012, configured to determine, based on the voice of the user to be detected, voice intention information of the user to be detected; the second determining unit 10013 is configured to determine lip language information of the user to be detected based on the lip action of the user to be detected; the third determining unit 10014 is configured to determine that the user to be detected is a target user with a vocalization intention in response to determining that at least a first preset amount of the vocalization gesture information, the voice intention information, and the lip language information of the user to be detected satisfy the intention determination condition.
In some alternative implementations, the extraction module 1004 includes: a fourth determining unit 10041, configured to determine a main audio signal based on the audio signal collected by the target microphone; a fifth determining unit 10042, configured to determine a reference audio signal based on the audio signals collected by the other microphones of the microphone array; the filtering unit 10043 is configured to perform filtering processing on the main audio signal based on the reference audio signal to obtain a target audio signal of the target user.
In some alternative implementations, the extraction module 1004 includes: a sixth determining unit 10044, configured to determine, based on the target portion image captured by the target user, a target sound pickup area where the target portion of the target user is located within the sound pickup range of the target microphone; the positioning unit 10045 is configured to perform sound source positioning on the audio signal acquired by the target microphone, and determine a position of at least one sound source within a sound reception range of the target microphone; a suppressing unit 10046 configured to suppress an audio signal collected from a sound source located outside the target sound reception area; an extracting unit 10047 is configured to extract a target speech signal of the target user from the suppressed audio signal.
In some optional implementations, the apparatus further comprises: a fourth determining module 1006, configured to determine, based on the detection of the vocalization stopping intention, an intention of the target user to stop vocalizing; and a control module 1007, configured to stop extracting the target speech signal of the target user from the audio signal collected by the target microphone in response to determining that the detection result indicates that the target user has an intention to stop sounding.
In some alternative implementations, the fourth determining module 1006 includes: the second detection unit 10061 is configured to perform utterance gesture detection on the target user, and generate a first detection result indicating that the target user has an utterance stopping intention in response to that the determined utterance gesture information is not matched with the preset utterance gesture or is matched with the preset utterance stopping gesture; a third detecting unit 10062, configured to perform voice recognition on the target user, and generate a second detection result indicating that the target user has the utterance stopping intention in response to that the determined voice intention information matches the preset utterance stopping intention voice; a fourth detecting unit 10063, configured to perform lip language recognition on the target user, and generate a third detection result indicating that the target user has the utterance stopping intention in response to determining that the obtained lip language information indicates the utterance stopping intention; a fifth detecting unit 10064, configured to determine a duration between a time when the target user last stopped sounding and a current time, and generate a fourth detection result indicating that the target user has an intention to stop sounding in response to that the determined duration is greater than or equal to a preset duration; a seventh determining unit 10065, configured to determine an intention of the target user to stop sounding in response to obtaining at least a second preset number of the first detection result, the second detection result, the third detection result, and the fourth detection result.
In some optional implementations, the apparatus further comprises: a detection module 1008, configured to detect the current state of the target space in at least one preset state detection manner to obtain at least one piece of state information; and a first adjusting module 1009, configured to, in response to determining that any piece of the at least one piece of state information conforms to its corresponding preset state, reduce the playing volume of the audio playing device in the target space and/or output prompt information corresponding to the preset environment type.
In some optional implementations, the apparatus further comprises: a second adjusting module 1010, configured to adjust the playing volume of the audio playing device back to the target volume in response to detecting that the current state of the target space no longer conforms to the corresponding preset state.
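Modules 1009 and 1010 together implement a duck-and-restore policy on the playing volume. A minimal sketch, in which `player.set_volume` stands in for whatever volume control the audio playing device actually exposes, and the state readings and volume values are hypothetical:

```python
def update_playback_for_state(state_infos, preset_states, player,
                              target_volume, duck_volume=0.2):
    """state_infos   -- current reading per detection manner,
                        e.g. {"phone_call": True, "door_open": False}
       preset_states -- reading per manner that should trigger ducking
    While any state conforms to its preset, the playing volume is
    reduced (module 1009); once none does, it is adjusted back to the
    target volume (module 1010)."""
    ducked = any(state_infos.get(mode) == preset
                 for mode, preset in preset_states.items())
    player.set_volume(duck_volume if ducked else target_volume)
    return ducked
```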
In some optional implementations, the apparatus further comprises: a display module 1011, configured to display a marker indicating the target position on a display within the target space.
According to the voice playing apparatus provided by this embodiment of the present disclosure, a target user having a vocalization intention in a target space is determined based on a vocalization intention detection result; a target position of a target part of the target user in the target space is then determined; a target microphone corresponding to the target user is determined based on the position relation between the target position and the microphones included in a microphone array in the target space; a target voice signal of the target user is extracted from the audio signal collected by the target microphone; and finally an audio playing device in the target space is controlled to play the target voice signal. The apparatus thus automatically identifies a target user with a vocalization intention and automatically assigns a microphone to that user: the user neither needs to manually operate a microphone to collect and play sound, nor to hold a separate microphone or move to a position where one is installed. This greatly improves the convenience of playing speech through a microphone and saves the cost of providing a dedicated microphone for voice playback.
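The "position relation" used to pick the target microphone is not pinned down by the disclosure; one natural reading is the nearest Euclidean distance between the target position and each microphone of the array. A minimal sketch under that assumption, with all names hypothetical:

```python
import numpy as np

def pick_target_microphone(target_pos, mic_positions):
    """target_pos    -- (x, y, z) of the target part, e.g. the mouth
       mic_positions -- array of shape (M, 3), one row per microphone
                        of the array installed in the target space
    Returns the index of the microphone nearest to the target position."""
    dists = np.linalg.norm(np.asarray(mic_positions, dtype=float)
                           - np.asarray(target_pos, dtype=float), axis=1)
    return int(np.argmin(dists))
```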
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 12. The electronic device may be either or both of the terminal device 101 and the server 103 as shown in fig. 1, or a stand-alone device separate from them, which may communicate with the terminal device 101 and the server 103 to receive the collected input signals therefrom.
FIG. 12 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 12, the electronic device 1200 includes one or more processors 1201 and memory 1202.
The processor 1201 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1200 to perform desired functions.
Memory 1202 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1201 to implement the voice playing methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as a vocalization intention detection result and a target voice signal may also be stored in the computer-readable storage medium.
In one example, the electronic device 1200 may further include: an input device 1203 and an output device 1204, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the terminal device 101 or the server 103, the input device 1203 may be a microphone, a camera, or the like, for inputting an image for intention detection, a captured audio signal, or the like. When the electronic device is a stand-alone device, the input means 1203 may be a communication network connector for receiving input images, audio signals, etc. from the terminal device 101 and the server 103.
The output device 1204 can output various information including a target voice signal and the like to the outside. The output devices 1204 may include, for example, audio playback devices, displays, printers, and communication networks and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 1200 relevant to the present disclosure are shown in fig. 12, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 1200 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice playback method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer program product may include program code for carrying out operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice playback method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may be any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that advantages, effects, and the like, mentioned in the present disclosure are only examples and not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. A voice playing method, comprising:
determining, based on a vocalization intention detection result, a target user having a vocalization intention in a target space;
determining a target position of a target part of the target user in the target space;
determining a target microphone corresponding to the target user based on a position relation between the target position and microphones included in a microphone array in the target space;
extracting a target voice signal of the target user from an audio signal collected by the target microphone;
and controlling audio playing equipment in the target space to play the target voice signal.
2. The method of claim 1, wherein the determining, based on a vocalization intention detection result, a target user having a vocalization intention in a target space comprises:
for each user to be detected among at least one user in the target space, performing vocalization gesture detection on the user to be detected to obtain vocalization gesture information of the user to be detected;
determining voice intention information of the user to be detected based on the voice of the user to be detected;
determining lip language information of the user to be detected based on the lip action of the user to be detected;
and determining that the user to be detected is a target user having a vocalization intention in response to determining that at least a first preset number of items among the vocalization gesture information, the voice intention information, and the lip language information of the user to be detected meet an intention judgment condition.
3. The method of claim 1, wherein the extracting the target voice signal of the target user from the audio signal collected by the target microphone comprises:
determining a main audio signal based on the audio signal collected by the target microphone;
determining a reference audio signal based on audio signals acquired by other microphones of the array of microphones;
and filtering the main audio signal based on the reference audio signal to obtain the target voice signal of the target user.
4. The method of claim 1, wherein the extracting the target voice signal of the target user from the audio signal collected by the target microphone comprises:
determining, based on a target part image captured of the target user, a target sound pickup area where the target part of the target user is located within the sound pickup range of the target microphone;
performing sound source localization on the audio signal collected by the target microphone, and determining the position of at least one sound source within the sound pickup range of the target microphone;
suppressing audio signals collected from sound sources located outside the target sound pickup area;
and extracting the target voice signal of the target user from the suppressed audio signal.
5. The method of claim 1, wherein after the controlling the audio playing equipment in the target space to play the target voice signal, the method further comprises:
determining, based on stop-vocalization intention detection, whether the target user has an intention to stop vocalizing;
and in response to determining that the detection result indicates that the target user has an intention to stop vocalizing, stopping extracting the target voice signal of the target user from the audio signal collected by the target microphone.
6. The method of claim 5, wherein the determining, based on stop-vocalization intention detection, whether the target user has an intention to stop vocalizing comprises:
performing vocalization gesture detection on the target user, and generating a first detection result indicating that the target user has a stop-vocalization intention in response to determining that the obtained vocalization gesture information does not match a preset vocalization gesture or matches a preset stop-vocalization gesture;
performing speech recognition on the target user, and generating a second detection result indicating that the target user has the stop-vocalization intention in response to determining that the obtained voice intention information matches preset stop-vocalization intention speech;
performing lip reading on the target user, and generating a third detection result indicating that the target user has the stop-vocalization intention in response to determining that the obtained lip language information indicates the stop-vocalization intention;
determining the duration between the moment the target user last stopped vocalizing and the current moment, and generating a fourth detection result indicating that the target user has the stop-vocalization intention in response to determining that the duration is greater than or equal to a preset duration;
and determining that the target user has an intention to stop vocalizing in response to obtaining at least a second preset number of the first detection result, the second detection result, the third detection result, and the fourth detection result.
7. The method of claim 1, wherein after the controlling the audio playing equipment in the target space to play the target voice signal, the method further comprises:
detecting the current state of the target space in at least one preset state detection manner to obtain at least one piece of state information;
and in response to determining that any one of the at least one piece of state information conforms to a corresponding preset state, reducing the playing volume of the audio playing device in the target space, and/or outputting prompt information corresponding to the preset environment type.
8. The method of claim 7, wherein after the reducing the playing volume of the audio playing device in the target space and/or outputting the prompt information corresponding to the preset environment type, the method further comprises:
and adjusting the playing volume of the audio playing device back to the target volume in response to detecting that the current state of the target space no longer conforms to the corresponding preset state.
9. The method of claim 1, wherein after the determining a target position of a target part of the target user in the target space, the method further comprises:
displaying a marker indicating the target position on a display within the target space.
10. A voice playing apparatus, comprising:
a first determining module, configured to determine, based on a vocalization intention detection result, a target user having a vocalization intention in a target space;
a second determining module, configured to determine a target position of a target part of the target user in the target space;
a third determining module, configured to determine a target microphone corresponding to the target user based on a position relation between the target position and microphones included in a microphone array in the target space;
an extraction module, configured to extract a target voice signal of the target user from an audio signal collected by the target microphone;
and a playing module, configured to control an audio playing device in the target space to play the target voice signal.
11. A computer-readable storage medium, storing a computer program for performing the method of any one of claims 1-9.
12. An electronic device, the electronic device comprising:
a processor;
a memory for storing executable instructions of the processor;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1 to 9.