WO2015101523A1 - Method of improving the human voice - Google Patents


Info

Publication number
WO2015101523A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
audience
cosmetics
stream
user
Prior art date
Application number
PCT/EP2014/078754
Other languages
French (fr)
Inventor
Peter Ebert
Original Assignee
Peter Ebert
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peter Ebert filed Critical Peter Ebert
Publication of WO2015101523A1 publication Critical patent/WO2015101523A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants

Definitions

  • the present invention relates to the processing of audio streams, more particularly to improving the impression the human voice creates in communication.
  • VoIP: Voice-over-IP
  • the sound of the human voice can be altered.
  • the voice of a singer can be subjected to a plurality of different effects and the voice track thus produced is likely to suit the musical taste of the audience better, increasing sales of recorded music.
  • the speech waveform is considered as the response of a resonator (the vocal tract) to a series of pulses (quasi-periodic glottal pulses during voiced sounds, or noise generated at a constriction during unvoiced sounds).
  • the resonances of the vocal tract will influence the sound of a voice.
  • These resonances are called formants.
  • the influence of these formants on the voice and the resulting differences of different voices can be perceived easily by a human being and can also be automatically evaluated.
  • Formants are manifested in the spectral domain by energy maxima at the resonant frequencies. The frequencies at which the formants occur are primarily dependent upon the shape of the vocal tract, e.g. the positions of the articulators (tongue, lips, jaw, etc.).
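As a rough illustration of how such energy maxima can be located automatically, the sketch below picks the strongest local maxima of a magnitude spectrum. This is one assumed, deliberately simple approach (real formant trackers typically use LPC analysis); the function name and tolerances are illustrative.

```python
import numpy as np

def spectral_peaks(signal, sample_rate, n_peaks=2):
    """Return the frequencies of the strongest local maxima in the
    magnitude spectrum -- a crude stand-in for formant frequencies."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # local maxima: bins larger than both neighbours
    idx = np.where((spectrum[1:-1] > spectrum[:-2]) &
                   (spectrum[1:-1] > spectrum[2:]))[0] + 1
    # keep the n_peaks strongest maxima, returned sorted by frequency
    strongest = idx[np.argsort(spectrum[idx])[-n_peaks:]]
    return sorted(freqs[strongest])

# synthetic "vocal tract response": two damped resonances at 500 and 1500 Hz
sr = 16000
t = np.arange(sr) / sr
sig = (np.exp(-30 * t) * np.sin(2 * np.pi * 500 * t)
       + 0.8 * np.exp(-30 * t) * np.sin(2 * np.pi * 1500 * t))
f1, f2 = spectral_peaks(sig, sr)
```

With the synthetic two-resonance signal above, the two detected peaks land near 500 Hz and 1500 Hz.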
  • a listener may and will associate certain voice characteristics with the look and/or character of a speaker. It should be noted that while voice fundamental frequencies and formants have been explicitly mentioned as giving important clues to a listener about a speaker, further characteristics of a voice may and will determine the impression a listening person has about the speaker.
  • a method of processing an audio stream of a voice of a natural human being for communication, wherein at least one target voice property considered to be appealing in a speaker is defined; a plurality of steps for processing the audio voice stream of the human being in a manner approximating said appealing target property, without destroying the impression of the audience to listen to a natural voice, is defined; a degree to which the target voice is to have said appealing voice property for a particular audience is determined; the plurality of processing steps associated with said determination is applied to the voice stream; and the processed voice stream is reproduced to the audience.
  • This invention therefore introduces inter alia a voice cosmetics system and method suitable to optimize the appeal of human voice transmitted to a remote and/or locally addressed audience as described in detail below.
  • the method can be applied both to outgoing and/or to incoming audio data streams of a voice and/or can be applied locally, e.g. in the chain of a public address system used during a convention, church service or the like.
  • the audience can be a single listener, e.g. a person calling on a telephone such as a mobile or smart phone, or a person called or talked to via a Voice-over-IP service, or can be a group of persons, such as participants of a webinar, the audience in an assembly hall or conference room, e.g. during a video conference, or the like.
  • where reference is made to an audio stream of a voice or the like, it will be understood that usually this will be a stream of digitized data which will contain language spoken by a real human being.
  • the spoken language need not be the only audio signal present in the audio stream; for example, background noise such as noise from a street or from the surroundings of the speaker, such as music playing in a bar, other people speaking and so forth, may be present.
  • as the method of the invention will not give the impression of disguising the speaker but will enhance positive and/or wanted aspects and mitigate negative or unwanted aspects thereof, communication can be improved. Very much in the same manner as cosmetics may be used for facial make-up to show that a person cares for his or her personal visual appearance, the same care can now be given, without adverse effects, to the acoustical appearance, so that even where discernible, the method of the present invention will not stir antipathies, given that the degree to which the property is achieved will be set according to the preferences of that particular audience.
  • while typically the method is applied to an outgoing data stream, e.g. to process audio data transmitted during a phone call to a person called, it is also possible to enhance the listening experience for incoming data streams. For example, it may be useful to process data of incoming calls in a call center where complaints of customers are handled, so as to reduce the stress of employees due to angry calling persons.
  • the employee of the call center is provided with an audio stream voice processed to reduce anger or stress expressed by the calling person, it will be easier to stay friendly for an extended period, thus in turn improving customer satisfaction and hence communication.
  • the method can be applied in hearing aids, head phones, head sets and the like.
  • the method can be applied prior to transmission; in that case, an app, plug-in or other program will be executed in the background during the call.
  • the processing may specifically relate to e.g. the frequency response of the microphone used in registering the voice of the speaking person and the frequency response of an electrodynamic speaker in the device used by the listener without having to disclose information about the device used by the listener to the caller.
  • a corresponding voice processing service may be provided to paying premium users only, e.g. on a pay-per-call or subscription basis.
  • the processing of the voice audio stream is effected in real time, that is in a manner where incoming voice data is processed at least as fast as it is streaming in.
  • real-time processing can be achieved by taking into account a current processor load (or coprocessor load, DSP load and the like) of a device, power consumption and overall computing performance, and adjusting the processing steps to the data processing performance currently available.
  • the method might be executed using less complicated processing while the user is web surfing in the background during a call, while improved quality voice processing can be effected once the user has finished browsing the web.
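The load-adaptive behavior described above can be sketched as follows. This is an illustrative assumption of one possible implementation: the 50 % headroom rule, the two effect paths and all names are invented for the example, not taken from the application.

```python
import time
import numpy as np

def process_cheap(frame):
    # single gain stage only -- minimal CPU cost
    return frame * 0.9

def process_full(frame):
    # gain plus a soft limiter, standing in for a longer effect chain
    return np.tanh(frame * 0.9)

class AdaptiveProcessor:
    """Fall back to cheaper processing when the previous frame took too
    long, so incoming voice data is always processed at least as fast
    as it streams in."""
    def __init__(self, frame_duration_s):
        self.budget = frame_duration_s      # real-time budget per frame
        self.last_cost = 0.0                # measured cost of last frame

    def process(self, frame):
        start = time.perf_counter()
        if self.last_cost > 0.5 * self.budget:   # running out of headroom
            out = process_cheap(frame)
        else:
            out = process_full(frame)
        self.last_cost = time.perf_counter() - start
        return out
```

For 256-sample frames at 48 kHz the per-frame budget is about 5.3 ms; a heavier effect chain is used only while the measured cost stays well inside that budget.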
  • the audio voice processing suggested by the present invention usually can be effected in real time as the overall amount of data will be rather small even for high quality audio streams such as 192kHz 16 bit resolution stereo audio data.
  • the processing of audio data according to the present invention will typically cause only minimal latency, in particular compared to the delay induced in Voice-over-IP transmission. It is possible and preferred to have a latency of less than 15 ms, in particular of less than 5 ms, due to the processing. This is comparable to, or negligible compared with, the latency induced by transmission over an IP network. Therefore, the processing will hardly be discernible.
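The relation between block size and added processing latency is simple arithmetic: block-based processing adds at least one block of delay. The sketch below (the 256-sample block and 48 kHz rate are illustrative choices, not values from the application) shows how a block size can be picked to stay inside the latency budget named above.

```python
# Block-based real-time effects add at least one block of delay;
# choosing the block size therefore bounds the added latency.
def block_latency_ms(block_size, sample_rate):
    return 1000.0 * block_size / sample_rate

# a 256-sample block at 48 kHz adds roughly 5.3 ms -- inside the
# "less than 15 ms, in particular less than 5 ms" target cited above,
# and small next to typical VoIP network latency
latency = block_latency_ms(256, 48000)
```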
  • it is also possible to use the method according to the present invention in video calls. Where the latency-induced delay of the audio stream is too large, it is easy to automatically correct for any delay by re-synchronizing video frames and audio stream according to the delay. Therefore, the original lip movement will either still correspond very closely to the amended voice stream or can be made to do so, thereby improving user experience of an audio-video conference.
  • At least one of the plurality of steps for processing the audio voice stream is selected from the group comprising equalizing, limiting, modulation, flanging, adding chorus, adding reverb, adding air, changing pitch, adding 3D audio effects, compressing, expanding.
  • An advantage of certain of the processing steps suggested such as equalizing, limiting, modulation, flanging, adding chorus, adding reverb, adding air, adding 3D audio effects, compressing, expanding is that they can be done without subjecting the audio stream to a detailed analysis. Instead, they can be used regardless of the current actual behavior of the wave form. Accordingly, only very limited processing power is needed to execute corresponding algorithms on a processor.
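A minimal sketch of such analysis-free, sample-wise effects might look as follows. The specific gains, ratios and thresholds are assumptions chosen for illustration, and the "high shelf" is a deliberately crude first-difference boost rather than a proper filter design.

```python
import numpy as np

def equalize_high_shelf(x, gain):
    """Very crude treble boost: mix a first-difference (high-pass-like)
    component back onto the signal."""
    hp = np.diff(x, prepend=x[:1])
    return x + gain * hp

def limit(x, threshold):
    """Hard limiter: clip samples to +/- threshold."""
    return np.clip(x, -threshold, threshold)

def compress(x, ratio, threshold):
    """Static waveshaping compressor: above the threshold, amplitude
    growth is reduced by `ratio`; no envelope analysis is needed."""
    over = np.abs(x) > threshold
    y = x.copy()
    y[over] = np.sign(x[over]) * (threshold + (np.abs(x[over]) - threshold) / ratio)
    return y

def chain(x):
    # effects applied sample-wise, independent of waveform history
    return limit(compress(equalize_high_shelf(x, 0.3), 4.0, 0.5), 0.8)
```

Each stage maps samples independently of the surrounding waveform, which is why such chains run with very limited processing power.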
  • the actual amount of boost should be judged by ear to provide for optimized effects, depending inter alia on the speaker, the chain of transmission from microphone to electrodynamic speaker and the audience.
  • the electronic component used for emitting sound is referred to as "electrodynamic" speaker not in order to restrict the disclosure to such variety of electronic components but to more clearly distinguish the electronic speaker component from a person speaking.
  • electronic speaker components other than “electrodynamic” speakers might be used just as well, such as e.g. electrostatic headphones or the like.
  • the voice cosmetics methods may therefore comprise algorithms and instructions on how to assemble or morph two or more of the processing steps as voice cosmetic ingredients to cause a desired audio effect when applied to a voice input and the like.
  • one of the particularly important and preferred steps effected in processing the voice stream is pitch control and adding 'air'. These effects are very important ingredients to making voices sound more attractive, typically for males by lowering pitch and adding air, and for females by increasing pitch and adding air.
  • controlling pitch, and certain implementations of adding air other than that mentioned above, may require a more complex approach relying on the analysis of temporal behavior, which however is also well known in the art.
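A deliberately simplified sketch of the two ingredients named above, pitch lowering and adding 'air', could look like this. Naive resampling also stretches duration, as noted in the comments; production pitch shifters use more elaborate time-domain or spectral methods, so this is only an assumed toy model.

```python
import numpy as np

def lower_pitch(x, factor):
    """Crude pitch lowering by resampling: reading the waveform more
    slowly scales all frequencies by `factor` < 1 (it also stretches
    the duration, which a real system would compensate)."""
    old_idx = np.arange(len(x))
    new_idx = np.arange(0, len(x) - 1, factor)
    return np.interp(new_idx, old_idx, x)

def add_air(x, amount):
    """'Air': emphasize the top of the spectrum by mixing in a
    first-difference (high-pass-like) component."""
    hp = np.diff(x, prepend=x[:1])
    return x + amount * hp

# demo: a 200 Hz tone lowered by 10 % should land near 180 Hz
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
voiced = add_air(lower_pitch(tone, 0.9), 0.2)
```

For a male voice the factor would be chosen slightly below 1 and for a female voice slightly above 1, matching the typical treatment described above.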
  • a target property can be selected from making the voice sound more feminine, more masculine, more energetic, more soothing, more sensual or erotic, warmer, more trustworthy, more tender, or similar to the voice of a particular celebrity.
  • a desired target voice characteristic can be selected from a suggested plurality of possible alterations by varying the processing from call to call during an initial number of calls and evaluating each adjustment by success, e.g. revenues achieved in telephone sales marketing, satisfaction of users according to feedback, perceived voice stress of the counterpart and so forth. It will be understood that such adaption may be dynamically changed even during the initial number of calls, e.g. after an early phase of a call where the gender or social status of a person contacted is identified either manually or automatically.
  • Another way to judge whether a certain appealing target property has been achieved by a given combination of processing steps and parameters set for these steps would be by statistical analysis of the response of a sufficient number of persons representative for the audience the target property is to be appealing to.
  • Such response can be gathered in a conventional manner, such as a face-to-face interview, but could also be determined using web-based methods, such as evaluating sales numbers of certain apps, web questionnaires and so forth. Therefore, it is also suggested that, in addition or as an alternative, the impression of the audience that is to listen to a natural voice stream is selected according to a statistical evaluation of a number of different pluralities of steps for processing the audio voice stream of the human being.
  • the statistics evaluated can be based on the response of a group representative for the target audience. In all of these cases, it is perfectly possible to let the persons representative for the audience the target property is to be appealing to select from a number of chained effects initially suggested by experienced users and/or experts.
  • a user might wish to apply voice cosmetics in a manner not discernible to an audience.
  • Particular attention may be paid in that case when the cosmetic treatment according to the method suggested herein is dynamically adapted during an ongoing communication to prevent the treatment from being discernible as such due to glitches and transitions. Therefore, it might be helpful to e.g. cover a transition phase with some added noise, e.g. by temporarily partially defeating a noise reduction or to amend the parameters for cosmetic treatment in a manner sufficiently gradual to be indiscernible.
  • the necessary length of a transition phase can be determined individually for every single given desired property, for different groups of desired properties or for all properties in the same manner.
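The gradual, indiscernible parameter change described in the two bullets above can be sketched as a per-sample ramp across a transition phase; the linear ramp and the function names are illustrative assumptions (a real system might also use the noise-masking variant mentioned above).

```python
import numpy as np

def ramp_parameter(old_value, new_value, n_samples):
    """Per-sample transition between two effect settings, so the change
    is spread over a transition phase instead of a single glitch."""
    return np.linspace(old_value, new_value, n_samples)

def apply_gain_ramped(frame, old_gain, new_gain):
    """Apply a gain change smoothly across one frame rather than as an
    abrupt step at the frame boundary."""
    return frame * ramp_parameter(old_gain, new_gain, len(frame))
```

Longer ramps (more samples) correspond to the longer transition phases that may be determined per property as described above.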
  • At least one further property considered to be appealing in a speaker is defined; and for this at least one further property, a further plurality of steps for processing the audio voice stream of the human being for approximating the respective appealing target property without destroying the impression of the audience to listen to a natural voice is defined, a degree to which the target voice stream is to have the further property is defined, the plurality of processing steps associated with said determinations is applied to the voice stream in real time, and the processed voice stream is transmitted to the audience.
  • the degree to which a certain property is to be achieved or approximated, that is the corresponding determination, may be made in response to at least one of: image analysis to identify males/females in an audience and/or to identify the gender of a single person or few persons constituting the audience, and/or to identify surroundings, determine a current stress level of the audience, and so forth.
  • the result of a voice stress analysis in the audio voice stream of the speaker and/or voice stress analysis in the audio voice stream of the audience can be relied upon for setting or adapting parameters.
  • a male audience considers other features attractive in a male speaker than a female audience.
  • the same male audience will also consider other features attractive in a female speaker than a female audience.
  • the features considered attractive in a female speaker will not be the same as those considered attractive in a male speaker; this holds for both male and female audiences. Therefore, while some features in the voice of a speaker of a specific gender may be attractive to both male and female audiences, it is preferred to adapt the processing step to the respective audience. As the question of what is considered attractive in a voice will depend e.g. on the gender of the audience, it is preferred to automatically adapt the target property to the gender identified.
  • the target property might be selected according to the majority of genders present or might be selected according to the gender of a specific person or persons, e.g. in an audio-video-conference according to those speaking most or to the gender of a conference leader or according to the gender of those currently stressed most or being most aggressive towards a speaker using the voice cosmetic method.
  • the gender of an audience or persons in an audience can be determined using voice analysis techniques applied on incoming voice streams, image analysis, e.g. of images transmitted in an audio-video conference, and so forth. These techniques are well known in the art. Also, a manual override by the speaking person is considered to be possible.
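A crude example of such voice analysis is an autocorrelation-based fundamental-frequency estimate with a simple threshold, sketched below. The 165 Hz threshold and the typical F0 ranges are common rules of thumb, not values from the application, and real systems would use far more robust classifiers.

```python
import numpy as np

def estimate_f0(x, sample_rate, fmin=60.0, fmax=400.0):
    """Autocorrelation pitch estimate restricted to the plausible
    human-voice fundamental-frequency range."""
    x = x - np.mean(x)
    corr = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo = int(sample_rate / fmax)          # shortest candidate period
    hi = int(sample_rate / fmin)          # longest candidate period
    lag = lo + np.argmax(corr[lo:hi])
    return sample_rate / lag

def likely_gender(f0, threshold=165.0):
    """Very rough heuristic: typical male F0 ~85-180 Hz,
    typical female F0 ~165-255 Hz."""
    return 'male' if f0 < threshold else 'female'
```

Applied to a short stretch of an incoming voice stream, this yields one of the determinations that the processing degree can react to.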
  • voice stress analysis is a well-known technique and has been related inter alia to determining whether or not a person tells the truth. In certain instances it might therefore be helpful to also indicate to a speaker and/or to a listener the result of a voice stress analysis if such voice stress analysis is effected.
  • a phone number called or calling can be used to set a desired property. Properties set once can be stored with a given number. Also, in certain instances, a desired property can be set relying on a given local area code of an audience calling or called. E.g. if a sales representative only has business contacts in a given area, it is advisable to choose a voice amelioration increasing trust of the listener.
  • Other parameters that may influence a selected target property or the degree to which it is to be achieved comprise detected ambient sound levels and/or characteristics, a physical location of the speaker, a local time of speaker or audience, and ambient temperature. For example, a person contacted early in the morning in an environment that, due to the background noise received, can be suspected to be an office will be more frequently contacted for business reasons than a person contacted late in the afternoon on a hot summer day in an environment that, according to background noise, might be a beach. Here, even if the person is contacted for business reasons, it might be more advisable to have a less demanding voice.
  • the parameters used to influence or select a target property can be determined by physical detectors such as temperature detectors; however, it is also possible to implement parameter detecting means by way of software, e.g. by a software component for automatically detecting the gender of an audience by listening into the conversation, or, as another example, by a software component for automatically optimizing the attractiveness of the user's voice based on contextual input derived from background noise analysis; it is noted that this can be done by comparison of the background sound pattern to databases.
  • the signal strength of a communication service might necessitate that only a restricted limited bandwidth is used and that accordingly, a setting that does not rely on e.g. extended highs will be preferred.
  • a specific device type is used by a given participant of a communication, it might be helpful to take technical limitations of the device into account, e.g. frequency response characteristics.
  • either each of said number of different pluralities of steps for processing the audio voice stream of the human being suggested and/or the combination thereof observes voice transmission standards defined by a service provider.
  • the method of the present invention will be easily compatible with existing or upcoming standards.
  • audio processing such as data compression, noise reduction and so forth suggested by such known or upcoming standards may still be used as known in the art.
  • target properties may be dynamically adjusted during an ongoing communication.
  • where the voice stress detected increases, it might be helpful to increase a de-stressing cosmetic effect or to make the voice sound more soothing.
  • for example, keywords detected by speech analysis may be identified to be insults or to be typical for speakers having a given degree of education, thus serving as an indicator of social status.
  • it may also become possible to identify by speech recognition whether the voice stream coming from a communication partner indicates that the user speaks a certain dialect or uses specific grammatical constructions that may be indicative of education, typical for a given social status and/or age.
  • the degree to which the target voice stream is to have the appealing voice property may be determined according to at least one of a user input, a sensor means for analyzing at least one of an audience gender, an audience behavior, the social background of an audience.
  • a user can input instructions e.g. via a touch screen or other pointing device such as a computer mouse, or e.g. by speech input, for example in response to monitoring the voice cosmetics currently applied to his or her own voice and/or ad hoc.
  • the method of the present invention will be executed in a context-aware and/or a content-aware manner, particularly in a dynamic manner, automatically selecting target properties and/or automatically adjusting the degree to which target properties are to be achieved in a context- and/or content-aware manner. It is noted that it is perfectly possible to rely only on one of content or context, e.g. either on keywords recognized or e.g. gender identified, without deviating from a preferred implementation of the invention.
  • the degree to which the target voice is to have said appealing voice property is changed during an ongoing communication in an automated manner relying on a change in the determination made.
  • sensors may be provided for automatically detecting the gender of an audience by listening into the conversation, for automatically optimizing the attractiveness of the user's voice based on contextual/sensor input for automatically and dynamically optimizing voice attractiveness throughout a conversation.
  • These sensors may be implemented as hardware components, e.g. as GPS sensors or other sensors available in a smart phone or as one or more software components, in particular a software component specific to the invention, e.g. for analysing the incoming or outgoing voice stream.
  • the sensor may rely on web-retrieved information, e.g. the current local temperature at the place a person calling from a fixed network is situated according to caller ID.
  • Dynamic adaption may also be achieved by relaying information derived such as background noise characteristics or the like to a database where these characteristics are linked to locations or types of locations so that suitable adaptions of the voice cosmetic can be suggested, parameter sets be prepared and transmitted to the user for immediate adaption of an ongoing communication. Therefore, it is possible for users and providers to create, buy, sell, apply one or more voice ameliorating methods via the Internet, e.g. in app stores and the like.
  • in the simplest case, the degree selected is a simple on/off degree.
  • at least three different degrees, such as an ON, an OFF and a MEDIUM setting, or settings in a range from 0 to 10 in 0.5 steps, are provided.
  • the number of effects and/or their relative strength may change at least with one of the different degrees.
  • different effects might be needed if several properties are to be achieved simultaneously.
  • a slider for 'sugarcoating' the user's voice more or less, for example by adding 'air' and slightly increasing the pitch of the user's voice for a female user, or adding 'air' and slightly decreasing the pitch of the user's voice for a male user, might be provided.
  • Another slider might be provided to make the user's voice sound more or less energetic, for example by applying a chain of compressors and/or expander effects.
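The slider-to-parameter mapping described in the two bullets above could be sketched as follows; the scaling constants, the 2-semitone maximum shift and the function name are all assumptions made for the example.

```python
def sugarcoat_params(degree, gender, max_degree=10.0):
    """Map a 0-10 'sugarcoat' slider to effect parameters: an amount of
    added 'air' plus a slight pitch shift (down for male voices, up for
    female voices), both scaled linearly with the degree."""
    d = max(0.0, min(degree, max_degree)) / max_degree
    air = 0.3 * d                                   # 0 .. 0.3 high-band mix
    semitones = (1.0 if gender == 'female' else -1.0) * 2.0 * d
    pitch_factor = 2.0 ** (semitones / 12.0)        # resampling ratio
    return {'air': air, 'pitch_factor': pitch_factor}
```

Degree 0 yields neutral parameters (no air, pitch factor 1.0), so the OFF end of the slider leaves the voice untouched.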
  • a database or look up table may be provided either indicating specific parameters for each combination of settings or, where this is not feasible due to a large number of voice properties offered, general rules to determine preferred combinations can be set up, e.g. rules giving priority to certain effects and reduced priority to other effects.
  • a means for determining a parameter set for effects used in combination to obtain a combination of properties to an adjusted degree is provided.
  • Said means may be implemented as look-up-table of parameters and/or as a set of equations for determining said parameters.
  • the look-up table and/or the equations preferably will again be determined by evaluation of the response of a group of users representative for a target audience; however, the existence of other ways to determine suitable sets of parameters will by now be obvious to the skilled person in view of the preceding sections of the disclosure.
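A look-up table combined with priority rules, as suggested in the preceding bullets, might be sketched like this; all property names, parameter values and the priority order are hypothetical placeholders.

```python
# Explicit entries cover combinations whose parameters were validated
# (e.g. by audience testing); a priority rule merges single-property
# entries when an untested combination is requested.
LUT = {
    frozenset(['warm']):              {'eq_low_shelf_db': 2.0, 'air': 0.05},
    frozenset(['energetic']):         {'comp_ratio': 4.0, 'air': 0.15},
    frozenset(['trustworthy']):       {'eq_low_shelf_db': 1.0, 'reverb': 0.1},
    frozenset(['warm', 'energetic']): {'eq_low_shelf_db': 1.5,
                                       'comp_ratio': 3.0, 'air': 0.10},
}

PRIORITY = ['energetic', 'warm', 'trustworthy']  # earlier entries win

def params_for(properties):
    key = frozenset(properties)
    if key in LUT:                      # tested combination: use as-is
        return dict(LUT[key])
    # fallback rule: merge single-property entries, lowest priority
    # first, so higher-priority properties overwrite conflicting values
    merged = {}
    for prop in sorted(properties, key=PRIORITY.index, reverse=True):
        merged.update(LUT[frozenset([prop])])
    return merged
```

A combination present in the table is returned directly; an absent combination falls back to the priority merge, matching the "rules giving priority to certain effects" mentioned above.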
  • the voice cosmetics methods of the present invention may comprise, inter alia in a preferred embodiment, algorithms, instructions on how to assemble or morph two or more of the voice cosmetic ingredients to cause a desired audio effect when applied to a voice input and the like.
  • Protection is also sought for a data set usable in executing a method as described above, in particular a personalized data set.
  • data sets may be offered in dedicated stores and/or dedicated web stores and/or in departments of stores and/or departments of web stores. Where a store or web store or a department is set up, it will be advantageous to define categories such as business, romance, relatives, friends, where data sets can be found adapted for a particular target audience and/or purpose. In addition and/or as an alternative, it will be advantageous if data sets are offered according to one or more specific characteristics of the intended user, such as gender, age, size, weight.
  • the store may be organized using a database relating to specific descriptors of the data set, the descriptors being e.g. intended purpose or a characteristic of a user. It is possible and advantageous, although not necessary, to use icons describing the intended purpose and/or target audience and/or characteristics. Furthermore, photos of persons representing the specific characteristics of the intended user may be associated with each data set and/or a group of data sets for faster orientation of a potential customer. Applicant reserves the right to file divisional applications relating to such web stores. Applicant also considers data an essential means for execution of the method, even when sold separately. It will be understood that the data set may in certain cases comprise only parameters for algorithms already known to the device of a user.
  • data sets offered might also comprise the algorithms the parameters relate to. It will be understood that when additional data sets to provide for certain effects are offered to a user who already uses other effects, a set of rules relating to the chaining and/or morphing of a plurality of effects might be updated and/or generated as described above. Also, determining such rules may be a task the shop may offer as part of a purchase or separately.
  • a male weighing between 120 and 140 kg and measuring between 160 and 165 cm may use another set of parameters to obtain a given set of voice properties compared to a person measuring 190-195 cm and weighing 85-90 kg.
  • different ways of processing the voice might be offered to a user. For example, a standard average user grade relying on either an average male or female pattern may be offered; a pattern defined for a smaller group may be offered for other users and a personalized data set may be offered for premium users.
  • the number of possible voice properties offered to a particular user might vary depending on grade or the download of a specific parameter set for implementing a given voice cosmetic property might be offered.
  • personalized services can be provided as remote, e.g. cloud service and/or as a local service.
  • Protection is also sought for a device for implementing the method according to any of the previous claims, in particular selected from base stations, smartphone, tablet computer, game console, laptop computer, personal computer, switchboard, wearables, glasses, peripherals such as microphones, headphones and/or headsets, each adapted by specific hardware and/or software for executing the method and stored on the device or otherwise accessible for the device.
  • the method can be implemented by using specific hardware such as FPGAs, digital signal processors or ASICs or e.g. by using a general purpose CPU. It is possible to register sounds using e.g. microphones, bone vibrations, optical recognition of mouth movements. It should be noted that it would be possible to build a device for a stand-alone solution, but in a more preferred embodiment, the device will be a mobile or other device having circuitry adapted for execution of the method steps.
  • wireless headsets or the like are used, it is preferred to provide them with a remote user interface.
  • FIG. 1 is a block diagram illustrating the invented voice cosmetics system and method.
  • FIG. 2 is a flowchart depicting a typical operation of the present invention from a user's perspective.
  • FIG. 3 is a flowchart depicting a typical operation of the present invention from a voice cosmetics service operator's perspective.
  • FIG. 4 is a simplified mockup of a voice cosmetics application operated on a mobile device.
  • FIG. 5 is a block diagram illustrating context sensitive aspects of the invented voice cosmetics system and method.
  • FIG. 6 depicts a simplified mockup of a context sensitive voice cosmetics application operated on a mobile device.
  • FIG. 7 is a flowchart depicting a context sensitive operation of the present invention.
  • FIG. 8A is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as a microphone.
  • FIG. 8B is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as a headset.
  • FIG. 8C is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset.
  • FIG. 8D is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset with an additional audience context input.
  • FIG. 8E is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device as well as in a connected communication device.
  • FIG. 1 is a block diagram illustrating the invented voice cosmetics system and method.
  • this cosmetics method is a method of processing an audio stream of a voice of a natural human being for communication, wherein at least one target voice property considered to be appealing in a speaker is defined; a plurality of steps for processing the audio voice stream of the human being in a manner approximating said appealing target property without destroying the impression of the audience to listen to a natural voice, is defined, a degree to which the target voice is to have said appealing voice property is determined, the plurality of processing steps associated with said determination is applied to the voice stream, and the processed voice stream is reproduced to the audience.
  • the invented system comprises a voice cosmetics application 108 and a voice cosmetics service 140.
  • the voice cosmetics application 108 may be implemented as software, hardware, or a combination thereof. Furthermore, the voice cosmetics application 108 may be operated on any device allowing a user 100 to remotely talk with an audience 118 that consists of one or more other users. Suitable devices particularly include, but are not limited to, smart phones, wearables including Google Glass, tablet computers, laptops and personal computers.
  • the voice cosmetics service 140 may be implemented as software, hardware or a combination thereof.
  • the voice cosmetics service 140 is implemented as a cloud service accessible via the Internet, allowing bidirectional communication with one or more of the voice cosmetics applications 108 via the Internet.
  • the voice cosmetics service 140 may comprise at least one of the following voice cosmetics components: one or more voice cosmetics ingredients 132, one or more voice cosmetics methods 134, one or more voice cosmetics ratings 136, and one or more voice cosmetics resources 138.
  • Providers 130 may comprise any business entity or human being that is able and willing to provide value in the form of such voice cosmetics components or related services to the voice cosmetics service 140 as described in more detail below.
  • the voice cosmetics ingredients 132 may comprise, but not be limited to, best practice voice samples, instructions on how to alter the audio characteristic of a particular voice input to make it sound in a particular way to an audience 118, and the like.
  • the voice cosmetics methods 134 may comprise, but not be limited to, algorithms, instructions on how to assemble or morph two or more of the voice cosmetic ingredients 132 to cause a desired audio effect when applied to a voice input and the like.
  • voice cosmetics application 108, voice cosmetics service 140 or both may be designed in such a way that they can automatically analyze the voice characteristics of a user 100 and then automatically apply respective voice altering effects in such a way as to make the voice of user 100 sound most appealing to a current audience, context, or situation.
  • the voice cosmetics ratings 136 may originate from one or more users 100, one or more audiences 118, one or more providers 130, or a combination thereof.
  • the voice cosmetics ratings 136 may comprise, but not be limited to, structured or unstructured data relating to the perceived quality of one or more of the heretofore-mentioned components 132, 134, 136, 138 of the voice cosmetics service 140, as perceived or measured by the respective rating's origin.
  • the voice cosmetics resources 138 may comprise, but not be limited to, software tools, electronic articles, best practices, related remote services as well as access to human experts suited to assist or consult at least one user 100 in finding and applying desired voice cosmetics to the user's voice.
  • the voice cosmetics application 108 allows user 100 to select and apply at least one of the voice cosmetics components provided by voice cosmetics service 140 in such a way that it has an effect on the audio characteristic of his or her voice input 106 when processed with digital signal processor 114.
  • User 100 can control this effect by providing instructions 102 to the voice cosmetics application 108, for example, via a graphical touch-enabled user interface as illustrated in FIG. 4, as well as speech input, sensor input, a combination thereof and the like.
  • Voice application 108 may be used to dynamically alter the appearance of user 100's voice at intensity levels ranging from very subtle to very strong.
  • digital signal processor 114 is part of a device used by user 100 to remotely talk to an audience 118 (the device is not shown here).
  • the digital signal processor 114 is enabled to receive instructions 112 from voice cosmetics application 108 and react to such instructions by causing voice outputs 110 and 116 to sound in a particular way as desired by user 100.
  • voice cosmetics application 108 may include a digital signal processor or be built into the digital signal processor, whether implemented as software or hardware or a combination thereof, respectively, to cause the same or a similar effect on the voice output 110 and 116 of user 100.
  • voice output-altering methods and sound effects well known in the art such as equalization, limiting, sound effects including, but not limited to, modulation, flanger, chorus, reverberation, 3D audio effects, adding of 'air' and changing of the pitch of the voice output may be applied either directly or indirectly by the voice cosmetics application 108 in the heretofore-mentioned way to create a desired voice cosmetics effect on voice output 110 and 116.
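The effect chain named above can be pictured in a few lines. The following is a minimal sketch only; `add_air` and `shift_pitch` are hypothetical helpers approximating the 'air' and pitch effects mentioned in the text, not the actual implementation of the invention:

```python
import numpy as np

def add_air(samples, amount=0.3):
    """Crude 'air' effect: mix in a first-order difference, which
    emphasizes the breathy high-frequency content of a voice."""
    hp = np.zeros_like(samples)
    hp[1:] = samples[1:] - samples[:-1]
    return samples + amount * hp

def shift_pitch(samples, factor):
    """Naive pitch shift by resampling; note this also changes the
    duration. A real system would use PSOLA or a phase vocoder."""
    idx = np.arange(0.0, len(samples), factor)
    return np.interp(idx, np.arange(len(samples)), samples)

# Chain both effects on one second of a synthetic 200 Hz 'voice'.
rate = 8000
t = np.arange(rate) / rate
voice = np.sin(2 * np.pi * 200 * t)
processed = add_air(shift_pitch(voice, 1.05))
```

In a real-time system these operations would run per buffer rather than on a whole recording, so that no noticeable delay is introduced.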
  • For a female user 100, it is a preferred embodiment of the present invention to apply a combination of adding 'air' to her voice output 110 and 116 and controlling the pitch of her voice output 110 and 116 in such a way (typically higher) that it is perceived as most appealing, particularly among any male fraction of audience 118.
  • voice output 110 is optionally also provided back to user 100 in order to allow user 100 to monitor the currently applied voice cosmetics effects on his or her voice, preferably without a delay.
  • Voice output 116 is provided to audience 118.
  • user 100 may, for example, choose to switch off voice output 110 after having ensured a desired voice cosmetics effect is being applied on his or her voice, for example, in order to avoid unwanted distractions or interferences during a conversation with audience 118.
  • voice cosmetics application 108 is suited to apply the heretofore-mentioned sound effects on the voice input 106 of user 100 without a noticeable delay.
  • the invented system and method may be used, for example, to make the voice of user 100 sound more appealing, feminine, masculine, energetic, similar to the voice of a particular celebrity, or a combination thereof.
  • user 100 may apply, combine, morph, chain or otherwise make use of one or more components of voice cosmetics service 140 via voice cosmetics application 108. Additionally, user 100 may change desired voice cosmetics in an ad hoc fashion between or during conversations with particular audiences 118, for example, to optimize his or her voice's appeal on each particular audience 118.
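Applying, combining and chaining voice cosmetics components, as described above, can be modeled as function composition over audio buffers. This is an illustrative sketch only; the two effect names are invented for the example:

```python
# A voice cosmetics chain modeled as an ordered list of effect
# callables applied to an audio buffer in turn.
def apply_chain(buffer, chain):
    for effect in chain:
        buffer = effect(buffer)
    return buffer

# Hypothetical 'ingredients' a user might have combined:
brighten = lambda buf: [s * 1.1 for s in buf]               # simple gain lift
soften = lambda buf: [max(-1.0, min(1.0, s)) for s in buf]  # hard limiter

out = apply_chain([0.5, 0.95, -0.2], [brighten, soften])
```

Swapping the chain between conversations corresponds to the ad hoc change of voice cosmetics per audience described in the text.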
  • user 100 may want to use different voice cosmetic settings when talking with a prospective business customer versus a person that he or she aspires to get romantically involved with.
  • Voice cosmetics service 140 may be provided as a self-service or a service that is provided at least partly by one or more human service providers.
  • user 100 may, for example, physically visit a voice cosmetics service location that will analyze the voice characteristic of user 100, recommend one or more particular voice cosmetic treatments to user 100 and then enable desired voice cosmetics treatments on the device of user 100 either directly via voice cosmetics application 108 or remotely via voice cosmetics service 140.
  • providers 130 may provide voice cosmetics components to one or more voice cosmetics services 140 for free or for a fee.
  • voice cosmetics application 108 may be designed in such a way that it can be operated in the heretofore-described way without the need to communicate with voice cosmetics service 140.
  • voice cosmetics application may simply be a component of a stand-alone communication device or software that user 100 is operating without the need for a connection with voice cosmetics service 140 or merely intermittent connection with voice cosmetics service 140.
  • voice cosmetics application 108 may be connected with one or more sensors 120 allowing voice cosmetics application 108 to automatically detect one or more aspects of the current user's voice features or context.
  • sensors 120 may be suited to detect, but are not limited to detecting, at least one of the gender of user 100, the gender of one or more members of audience 118, and respective ambient sound levels and characteristics.
  • the term sensor shall be used broadly here to refer to any information input operable to provide voice cosmetics application 108 with information regarding the context of the user 100 or the audience 118. This information may further include, but not be limited to, respective current physical location, respective local time, respective ambient temperature, respective ambient noise characteristics and the like.
  • Audio information input regarding the audience 118 may be provided, for example, by tapping into the host device's audio input stream coming from the audience 118.
  • voice cosmetics application 108 may automatically and continually adapt to the contextual information in such a way that it selects and applies available voice cosmetics components to the voice output of user 100 in order to make user 100's voice sound as appealing as possible to audience 118 during a call.
  • In FIG. 2, a typical operation of the present invention from a user's perspective is depicted.
  • In step 200, a user selects one or more voice cosmetics options using voice cosmetics application 108 on his or her communication device.
  • voice cosmetics options shall refer to one or more voice output altering components that can be selected by a user 100.
  • In step 204, the selected voice cosmetic options are applied to the user's voice output by having the voice cosmetic application 108 instruct digital signal processor 114 on the communication device as described above.
  • In step 206, the user hears the resulting effect on his or her voice output as provided via voice output 110.
  • In step 208, if the user is satisfied with the effect of the selected voice cosmetics options on his or her voice output, the selected voice cosmetics options remain applied to his or her voice output.
  • If the user is not satisfied with the effect, step 212 is performed.
  • In step 212, if the user chooses to try again selecting and applying alternative voice cosmetics options, step 200 is performed.
  • In step 212, if the user chooses not to try again selecting and applying alternative voice cosmetics options, step 214 is performed.
  • In FIG. 3, a typical operation of the present invention from a voice cosmetics service operator's perspective is depicted.
  • In step 300, one or more providers 130 provide one or more voice cosmetics options to voice cosmetics service 140, preferably via the Internet.
  • In step 304, the one or more providers 130 present one or more of their voice cosmetics options of step 300 to a user 100 via a voice cosmetics application 108 implemented on a device used by user 100, the device being remotely connected with voice cosmetics service 140.
  • In step 306, if the user 100 does not choose to apply any of the presented voice cosmetic options to his or her voice output, step 308 is performed.
  • In step 306, if the user 100 chooses to apply one or more of the presented voice cosmetic options to his or her voice output, step 310 is performed via the voice cosmetics application 108 as described above.
  • In step 312, if the user 100 chooses to have the voice cosmetic service 140 or voice cosmetic application 108 periodically monitor the effectiveness of the applied voice cosmetics options on her voice output, step 316 is periodically performed.
  • the effectiveness of applied voice cosmetics options may change, for example, if the user 100 changes the position of his or her mouth relative to his or her voice-transmitting device or if ambient sound patterns change.
  • In step 312, if the user 100 chooses not to have her voice output periodically monitored, step 314 is performed.
  • In step 316, if upon monitoring the user 100's applied voice cosmetics options are deemed non-effective, step 318 is performed, optimizing the applied voice cosmetics options of user 100 via voice cosmetic application 108. In step 316, if upon monitoring the user 100's applied voice cosmetics options are deemed effective, or once step 318 is finished, step 312 is performed.
  • FIG. 4 is a simplified mockup of a voice cosmetics application operated on a mobile device.
  • Voice cosmetics app 402, here representing voice cosmetics application 108 as depicted in FIG. 1, is run on a mobile device 400.
  • voice cosmetics app 402 allows a user 100 (not shown) to apply more or less of the following voice cosmetic options to the voice input of user 100 by moving respective sliders left or right and then tapping on apply button 416:
  • Slider 404 gives the user's voice more or less 'sizzle', for example by amplifying or dampening the high-frequency part of the user's voice;
  • Slider 406 'sugarcoats' the user's voice more or less, for example by adding 'air' and slightly increasing the pitch of the user's voice for a female user, or adding 'air' and slightly decreasing the pitch for a male user;
  • Slider 408 makes the user's voice sound more or less energetic, for example by applying compressor or expander effects;
  • Slider 412 gives the user's voice a higher or lower pitch.
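One hedged way to picture how such sliders could drive a signal processor is a mapping from slider positions to DSP parameters. Every parameter name, range and scale factor below is an assumption made for illustration, not a value taken from the invention:

```python
# Hypothetical mapping from 0..100 slider positions (FIG. 4) to
# DSP parameters.
def sliders_to_params(sizzle, sugarcoat, energy, pitch, female=True):
    return {
        # 'sizzle': +/-6 dB of high-shelf gain around the midpoint
        "high_shelf_gain_db": (sizzle - 50) * 0.12,
        # 'sugarcoat': up to 50% 'air' mixed in ...
        "air_mix": sugarcoat / 100 * 0.5,
        # ... plus a slight pitch nudge, up for female users and down
        # for male users, on top of the explicit pitch slider
        "pitch_factor": 1.0
            + (1 if female else -1) * sugarcoat / 100 * 0.03
            + (pitch - 50) / 50 * 0.1,
        # 'energetic': compression ratio from 1:1 up to 4:1
        "compressor_ratio": 1.0 + energy / 100 * 3.0,
    }

params = sliders_to_params(sizzle=70, sugarcoat=40, energy=50, pitch=50)
```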
  • Tapping on cancel button 414 allows user 100 to have voice cosmetics app 402 ignore unsaved selections.
  • voice cosmetics app 402 is applying the selected voice cosmetics by instructing the digital signal processor (not shown) of mobile device 400 to impact the voice input of user 100 in the desired way.
  • voice cosmetics app 402 may directly impact the voice input of user 100 as selected without involving the digital signal processor of mobile device 400, for example by directly impacting the digital byte stream representing the voice input of user 100 on mobile device 400 after it has been converted from an analog to a digital form and before it is transmitted to an audience (not shown).
  • voice cosmetics above represent only a subset of potential voice cosmetics options and there may be a multitude of alternative or additional voice cosmetics available to user 100 via voice cosmetics app 402.
  • voice cosmetics may at least partly be applied by voice cosmetics app 402 based on one or more voice cosmetics components or respective instructions provided via one or more voice cosmetics services 140 (not shown).
  • a voice cosmetics application 108 can be used in a stand-alone only fashion.
  • the voice cosmetics application 108 is not connected with a voice cosmetics service 140 but is implemented as part of a device in such a way that it can locally apply at least one of a voice cosmetics component 132, 134, 136 and 138 to the voice input 106 of a user 100 when communicating with an audience 118.
  • a voice cosmetics service 140 may be enabled to access a voice output 116 of a user 100 while it is being transmitted to an audience 118, automatically analyze the voice output 116 and instruct a digital signal processor 114 that has an impact on voice output 116 to automatically apply or further optimize one or more voice cosmetics options on voice output 116 according to preset criteria.
  • voice cosmetics options including preset criteria may at least partly depend on contextual information, such as, for example, the phone number of the audience 118 that user 100 is talking with, the service level of user 100 with a respective wireless carrier, the present location of user 100, the device type used by user 100, and the like.
  • FIG. 5 is a block diagram illustrating context sensitive aspects of the invented voice cosmetics system and method.
  • Device 510 contains voice cosmetics application 504 and digital signal processor 508.
  • Device 510 may be any device that includes hardware, software or a combination thereof allowing user 500 to communicate with audience 516 via audio or video call, the device including, but not limited to, a smart phone, a tablet computer, a game console, a laptop computer, a personal computer and the like.
  • Voice cosmetics application 504 is enabled to listen into the ongoing communication stream 512 originating from user 500, analyzing the communication stream 512 and instructing digital signal processor 508 with instructions 506 to alter the communication stream 512 in order to make the user 500's voice sound more appealing to audience 516.
  • voice cosmetics application 504 is enabled to listen into the ongoing communication stream 514 originating from audience 516 and to analyze the communication stream 514 in order to detect contextual information that may be useful to make the user 500's voice sound more appealing to audience 516.
  • contextual information may include, but is not limited to, frequency patterns helping to determine the predominant gender and current stress level of audience 516 and the like.
  • voice cosmetics application 504 may be able to also analyze at least one video portion of communication streams 512 and 514, for example, to detect the predominant gender of audience 516, stress levels and the like using, for example, image recognition.
  • voice cosmetics application 504 may generate and send instructions 506 to the digital signal processor 508 in order to alter the communication stream 512 in such a way as to make the user 500's voice sound more appealing to audience 516.
  • FIG. 6 depicts a simplified mockup of a context sensitive voice cosmetics application operated on a mobile device.
  • Mobile device 600 contains a voice cosmetics app 602 that allows a user (not shown) to control the effect of the voice cosmetics app 602 on his or her voice when communicating with an audience (not shown).
  • In this example, the user is female.
  • Radio button group 603 allows the user to quickly switch between various voice cosmetics effects optimized for specific audiences. For example, if the user is about to call a predominantly male audience, she may select option 604, versus option 606 for a predominantly female audience, or option 608 for an audience that contains both genders.
  • To have the predominant gender of the audience detected automatically, the user can select radio option 610.
  • the voice cosmetics app 602 will ideally initially apply general setting 608, then try to automatically detect the predominant gender of the audience and apply best possible voice cosmetics to the user's voice accordingly in order to maximize the user voice's appeal among the audience.
  • voice cosmetics app 602 may, for example, access the voice stream originating from the audience, search for gender-specific attributes such as gender-specific frequency patterns and then apply optimized voice cosmetics settings accordingly to the voice stream originating from the user. This process may be performed once at the beginning of the conversation, periodically throughout the conversation, or a combination thereof.
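The gender-dependent preset selection described above can be sketched as a simple decision on an estimated fundamental frequency (F0) of the audience's incoming stream, since typical male voices center near 120 Hz and female voices near 200 Hz. The thresholds and preset names below are illustrative assumptions; the option numbers refer to the radio buttons of FIG. 6:

```python
# Choose a voice cosmetics preset from an estimated audience F0.
def preset_for_audience(f0_hz):
    if f0_hz is None:
        return "general"          # detection failed: fall back to 608
    if f0_hz < 155.0:
        return "male_audience"    # option 604
    if f0_hz > 175.0:
        return "female_audience"  # option 606
    return "general"              # ambiguous band: option 608

choices = [preset_for_audience(f) for f in (118.0, 202.0, 160.0, None)]
```

Running this decision once at the start of the conversation and then periodically mirrors the adaptation loop described in the text.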
  • the user may also select from a list of voice cosmetics effects designed to make the user's voice sound like the voice of a selected celebrity.
  • the user may adjust the intensity of at least a subset of all currently active voice cosmetics effects.
  • the user can tap button 616 in order to hear her voice with the selected effects applied.
  • Once the user is satisfied with her settings, she can either apply and save her settings for further use by tapping button 618, or cancel and go back to the prior settings by tapping button 620. It will be understood that the present invention is not limited to the depicted effects and may also allow more than one set of settings to be saved and related to one or more audiences each.
  • FIG. 7 is a flowchart depicting a context sensitive operation of the present invention.
  • a call between a user and an audience is started utilizing a voice cosmetics application in the spirit of the present invention.
  • the call may be an audio call or an audiovisual call.
  • the audience may be one or more persons.
  • In step 702, the voice cosmetics application initially applies previously saved or otherwise available voice cosmetics settings or default settings in such a way that the user's voice stream sounds most appealing to the audience based on best prior knowledge. For example, if the user talked with the audience using the voice cosmetics application before, the voice cosmetics application may retrieve and apply respective previously saved voice cosmetic settings.
  • the previously saved voice cosmetic settings may be linked, for example, to a phone number, video call number, or other unique ID relating to the audience.
  • the voice cosmetics application may have access to a voice cosmetics database or voice cosmetics service such as depicted in FIG. 1 providing best possible voice cosmetic settings for a given phone number, video call number, unique audience ID and the like.
  • the voice cosmetics application may analyze the user's context, for example by utilizing attached sensors 120 as depicted in FIG. 1. Such a context analysis may comprise, but is not limited to, measuring ambient noise levels around the user, measuring a user voice input level, measuring a current stress level of the user and the like.
  • the voice cosmetics application may then optimize the user voice cosmetics settings of step 702 further, for example, in order to counter ambient noise or a low user voice input level and the like.
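The ambient-noise and input-level part of this context analysis can be illustrated with a simple RMS level measurement and make-up gain rule. The -18 dBFS target and 12 dB gain cap are illustrative assumptions:

```python
import math

# Measure the RMS level of an audio frame in dBFS and derive a
# make-up gain when the voice input level is low.
def rms_dbfs(samples):
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -120.0 if rms == 0 else 20.0 * math.log10(rms)

def makeup_gain_db(voice_dbfs, target_dbfs=-18.0, max_gain_db=12.0):
    return max(0.0, min(max_gain_db, target_dbfs - voice_dbfs))

quiet_frame = [0.01] * 480        # a constant frame at about -40 dBFS
gain = makeup_gain_db(rms_dbfs(quiet_frame))
```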
  • In step 706, if the predominant gender among the audience can be automatically detected by the voice cosmetics application, for example by analyzing the audience's voice stream, step 708 is performed and the voice cosmetics settings applied to the user's voice stream are optimized accordingly.
  • In step 706, if the predominant gender among the audience cannot be detected, or after step 708 is performed, step 710 is performed.
  • In step 710, if the call has not ended yet, step 706 is performed.
  • In step 710, if the call is about to end, optional step 712 is performed, saving the latest applied voice cosmetics settings, for example as part of a contact record related to the audience's phone number.
  • In step 714, the call is ended and voice cosmetic settings cease to be applied.
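The saving and restoring of per-audience settings in this flow can be sketched as a small key-value store keyed by the audience's phone number. The JSON file layout is an illustrative assumption:

```python
import json
import os
import tempfile

# Persist the last applied voice cosmetics settings under the
# audience's phone number so they can be restored at the start of the
# next call with that audience.
def save_settings(path, phone, settings):
    store = {}
    if os.path.exists(path):
        with open(path) as fh:
            store = json.load(fh)
    store[phone] = settings
    with open(path, "w") as fh:
        json.dump(store, fh)

def load_settings(path, phone, default=None):
    if not os.path.exists(path):
        return default
    with open(path) as fh:
        return json.load(fh).get(phone, default)

path = os.path.join(tempfile.gettempdir(), "voice_cosmetics_demo.json")
save_settings(path, "+15551234567", {"preset": "male_audience", "air": 0.4})
restored = load_settings(path, "+15551234567")
```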
  • FIG. 8A is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device, including but not limited to, a wired or wireless microphone.
  • Peripheral device 810 comprises a voice cosmetics application 804 and a digital signal processor 808. It will be understood that, as described above, voice cosmetics application 804 may be implemented as hardware, software, a combination thereof, or built directly into digital signal processor 808. Peripheral device 810 picks up the voice output 812 of user 800, converts it into digital information using digital signal processor 808, and transmits the resulting information to communication device 820, wired or wirelessly.
  • peripheral device 810 is not limited to picking up the voice of user 800 by means of a microphone but may also do this by one or more other means including bone vibrations, optical recognition of mouth movements and the like.
  • Communication device 820, for example a smart phone, tablet or other computing device, is wired or wirelessly connected via connection 818 with an audience 830, for example via the Internet (not shown). Audience 830 comprises one or more users (not shown).
  • voice cosmetics application 804 is optimizing the appeal of the user 800's voice output 812 via digital signal processor 808, preferably based on selections or instructions 802 originating from user 800.
  • peripheral device 810 may not allow user 800 to influence the operation of voice cosmetics application 804 but rather use one or more preset instructions or listen into the user's voice output 812 and automatically adapt its operation accordingly.
  • voice cosmetics application 804 may be able to automatically detect the user 800's gender, stress level, current context and the like in order to optimize its operation.
  • peripheral device 810 is not involved in providing voice output 814 originating from audience 830 back to user 800.
  • FIG. 8B is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as a headset.
  • peripheral device 810 provides voice output 814 originating from audience 830 to user 800, for example via a loudspeaker.
  • voice cosmetics application 804 neither influences nor analyzes voice output 814 in order to optimize the user 800's voice appeal.
  • FIG. 8C is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset.
  • voice cosmetics application 804 may analyze voice output 814, for example to automatically detect a context or gender of audience 830, in order to further optimize the user 800's voice appeal as described above and in FIG. 7.
  • FIG. 8D is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset with an additional audience context input.
  • voice cosmetics application 804 may receive additional contextual information 840 directly originating from the audience 830, or originating from one or more of their communication devices or peripherals thereof.
  • the contextual information 840 is not a part of, or dependent upon, voice output 814.
  • Contextual information 840 may comprise, but is not limited to, the number of, the genders of, or other information regarding the individuals comprised in audience 830.
  • This embodiment allows voice cosmetics application 804, for example, to readily apply contextual information 840 to optimize its voice cosmetics operation without depending on analyzing voice output 814.
  • FIG. 8E is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device as well as in a connected communication device.
  • communication device 820 comprises a voice cosmetics application 816 that may be operated by user 800 either in combination with, or separately from, voice cosmetics application 804.
  • This embodiment may, for example, allow the user 800 to remotely control or monitor at least a subset of the functionality of voice cosmetics application 804 from communication device 820 or combine voice cosmetics applications 804 and 816 in order to create a desired effect.


Abstract

To improve the impression a human voice creates in communication, a method of processing an audio stream of a voice of a natural human being for communication is suggested, wherein at least one target voice property considered to be appealing in a speaker is defined; a plurality of steps for processing the audio voice stream of the human being in a manner approximating said appealing target property, without destroying the audience's impression of listening to a natural voice, is defined; a degree to which the target voice is to have said appealing voice property for a particular audience is determined; the plurality of processing steps associated with said determination is applied to the voice stream; and the processed voice stream is reproduced to the audience. Such voice cosmetics may be used, e.g., during a telephone call, a webinar, or when talking in an assembly hall.

Description

Method of improving the human voice
The present invention relates to the processing of audio streams, more particularly to improving the impression the human voice creates in communication.
More and more human conversations are conducted using digital cell phone- or Internet-based technologies such as Voice-over-IP (VOIP). Many of these conversations are creating important first impressions with significant impact on outcomes, particularly in business contexts.
It is well known that while humans speaking use their voice mainly for communicating information about the world, at the same time cues in the voice signal convey rich information, e.g. about a speaker's arousal and emotional state. Extralinguistic cues may be interpreted by an audience as they may reflect more stable speaker characteristics including identity, gender, social, socioeconomic or regional background or biological age.
It is also well known that the sound of the human voice can be altered. For example, in a music studio, the voice of a singer can be subjected to a plurality of different effects and the voice track thus produced is likely to suit the musical taste of the audience better, increasing sales of recorded music. Given the broad range of different music styles it will be easy to understand that to different groups of listeners different styles of music and hence different ways of processing a recorded voice will be attractive.
While this is well understood for the singing voice, and a lot of work of highly qualified personnel is invested in amending music tracks, applicant considers that there also is a need to improve the sound of a voice in everyday life. Just as visual appeal has measurable impact on outcomes, it is understood by the applicant that a great-sounding voice does as well, particularly, but not exclusively, in audio-only conversations where no visual aspects come into play.
Properties of speech have already been the subject of considerable research, e.g. to improve speech recognition. In speech recognition, gender identification is helpful: it facilitates automatic speaker recognition by cutting the search space in half, thereby reducing computation and enhancing the speed of the system. It has been stated that by identifying the gender and removing the gender-specific components, higher compression rates of a speech signal can be achieved, thus enhancing the information content to be transmitted and also saving bandwidth. Furthermore, automatically sorting telephone calls by gender for gender-sensitive surveys has been suggested. Therefore, it should be noted that techniques for identifying the gender of a speaker based on a voice audio data stream are known.
For example, a substantial amount of work in the prior art focuses on the frequency of the voice fundamental (F0) in the speech of speakers who differ in age and sex. The data reported nearly always include an average measure of F0, usually expressed in Hz. Typical values obtained for F0 are 120 Hz for men and 200 Hz for women; the mean values change slightly with age. Many methods and algorithms are in use for pitch detection, divided into two main camps: time domain analysis and frequency domain analysis. Therefore, such analysis gives valuable clues about gender and age.
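A minimal time-domain F0 estimator of the kind referred to above can be written with autocorrelation, followed by the simple gender heuristic built on the typical 120 Hz and 200 Hz means. This is a toy sketch; production pitch detectors additionally handle octave errors, voicing decisions and noise:

```python
import numpy as np

# Autocorrelation-based F0 estimation (time domain analysis).
def estimate_f0(samples, rate, fmin=60.0, fmax=400.0):
    samples = samples - samples.mean()
    ac = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
    lo, hi = int(rate / fmax), int(rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))   # lag of strongest periodicity
    return rate / lag

def guess_gender(f0_hz):
    return "male" if f0_hz < 160.0 else "female"

# 0.3 s of a synthetic 120 Hz 'male' voice.
rate = 8000
t = np.arange(int(0.3 * rate)) / rate
male_like = np.sin(2 * np.pi * 120.0 * t)
f0 = estimate_f0(male_like, rate)
```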
Furthermore, if the speech waveform is considered as the response of a resonator (the vocal tract) to a series of pulses (quasi-periodic glottal pulses during voiced sounds, or noise generated at a constriction during unvoiced sounds), the resonances of the vocal tract will influence the sound of a voice. These resonances are called formants. The influence of these formants on the voice and the resulting differences between different voices can be perceived easily by a human being and can also be automatically evaluated. Formants are manifested in the spectral domain by energy maxima at the resonant frequencies. As the frequencies at which the formants occur are primarily dependent upon the shape of the vocal tract, e.g. the positions of the articulators (tongue, lips, jaw, etc.), a listener may and will associate certain voice characteristics with the look and/or character of a speaker. It should be noted that while voice fundamental frequencies and formants have been explicitly mentioned as giving important clues to a listener about a speaker, further characteristics of a voice may and will determine the impression a listening person has about the speaker.
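As a rough illustration of formants as spectral energy maxima, the strongest local peaks of a magnitude spectrum can be picked out. Real formant trackers typically use LPC analysis of the vocal tract response; the FFT peak picking below is only a toy stand-in, demonstrated on a synthetic signal with two 'resonances':

```python
import numpy as np

# Pick the strongest local maxima of a magnitude spectrum as rough
# 'formant' candidates.
def spectral_peaks(samples, rate, count=2):
    mag = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), 1.0 / rate)
    maxima = [i for i in range(1, len(mag) - 1)
              if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]]
    maxima.sort(key=lambda i: mag[i], reverse=True)   # strongest first
    return sorted(float(freqs[i]) for i in maxima[:count])

# One second of a synthetic signal with two 'resonances' standing in
# for the first two formants of a vowel.
rate = 8000
t = np.arange(rate) / rate
vowel_like = np.sin(2 * np.pi * 700 * t) + 0.8 * np.sin(2 * np.pi * 1200 * t)
peaks = spectral_peaks(vowel_like, rate)
```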
Altering these characteristics may alter the impression caused by a speaker in a single listening person or a group of persons also commonly designated as audience in certain parts of the present application.
In this respect, reference is made to presentation notes of a talk entitled "A new dimension of voice quality manipulation" given by H. Kawahara at "The Listening Talker - an interdisciplinary workshop on natural and synthetic modification of speech in response to listening conditions", Edinburgh, May 2-3, 2012, and to presentations of other participants of this workshop to be found online. According to these presentation notes, temporal fine structure analysis is suggested to be a key to investigate sound/speech texture, and morphing is to provide a complementary strategy to a statistical approach on context.
Furthermore, it is noted that efforts have been made to improve synthesized speech, compare e.g. JP 2007 041012.
Despite this, current voice-transmitting technologies are typically optimized towards low noise and maximum compression and not towards creating or maintaining a voice characteristic that is most appealing to a particular audience.
It is an object of the present invention to provide for an improved way of communication.
According to a first basic aspect of the invention, a method of processing an audio stream of a voice of a natural human being for communication is suggested wherein at least one target voice property considered to be appealing in a speaker is defined; a plurality of steps for processing the audio voice stream of the human being in a manner approximating said appealing target property, without destroying the impression of the audience to listen to a natural voice, is defined; a degree to which the target voice is to have said appealing voice property for a particular audience is determined; the plurality of processing steps associated with said determination is applied to the voice stream; and the processed voice stream is reproduced to the audience.
This invention therefore introduces, inter alia, a voice cosmetics system and method suitable for optimizing the appeal of a human voice transmitted to a remote and/or locally addressed audience, as described in detail below. Hence, the method can be applied both to outgoing and/or to incoming audio data streams of a voice and/or can be applied locally, e.g. in the chain of a public address system used during a convention, church service or the like. It is noted that the audience can be a single listener, e.g. a person calling on a telephone such as a mobile or smart phone, or a person called or talked to via a Voice-over-IP service, or can be a group of persons, such as participants of a webinar, the audience in an assembly hall or conference room, e.g. during a video conference, or the like.
Where reference is made to an audio stream of a voice or the like, it will be understood that usually this will be a stream of digitized data containing language spoken by a real human being. The spoken language need not be the only audio signal present in the audio stream; for example, background noise such as noise from a street, or sounds from the surroundings of the speaker such as music playing in a bar, other people speaking and so forth, may be present.
While it may be possible to partially or fully isolate the speech-related part of such an audio stream that also contains sounds other than those coming from the speaker intending to communicate, it is perfectly possible to apply the method of the present invention to the data stream in its entirety, that is, even if the audio stream of the voice does contain background-noise-related data. It will be obvious that the method of the present invention can be used even if noise reduction is effected, if equalization for purposes of data compression is effected, and/or if compression methods relying on psycho-acoustic principles such as MP3 en-/decoding are used. This clearly is a significant advantage. It has been understood by the applicant that it is possible to significantly amend the appeal of a natural voice without giving the negative and unwanted impression of disguising one's voice, and it has been suggested how, in view of this, communication can be improved in a context-sensitive or context-non-sensitive manner and in a content-sensitive or content-non-sensitive manner.
As applying the method of the invention will not give the impression of disguising the speaker, but will enhance positive and/or wanted aspects of the voice and mitigate negative or unwanted aspects thereof, communication can be improved. Very much in the same manner as cosmetics may be used for facial make-up to show that a person cares for his or her personal optical appearance, the same care can now be given, without adverse effects, to the acoustical appearance, so that even where discernible, the method of the present invention will not stir antipathies, given that the degree to which the property is achieved will be set according to the preferences of that particular audience.
As a degree to which the target voice is to have said appealing voice property for a particular audience is determined, there is no risk that a speaker will give the impression of using an acoustic mask instead of improving the appeal of a voice and hence of a speaker. It will be understood that in the most common and preferred embodiments, characteristics such as perceived intonation will basically be maintained. This will hold even where an extremely complex cosmetic make-up is applied, such as removal of dialect. This could be done by content analysis using speech recognition of the (dialect) words spoken, automatic translation of the content into a dialect-free rendering, and re-synthesis of the spoken language by synthetic speech in a manner adapted to characteristics detected and identified in the initial voice stream, such as emphasis of certain terms and intonation in general, although it is understood that for a practical implementation on a mobile device, the processing power currently available may not yet be sufficient.
It should be noted that while in most embodiments the method is applied to an outgoing data stream, e.g. to process audio data transmitted during a phone call to a person called, it is also possible to enhance the listening experience for incoming data streams. For example, it may be useful to process data of incoming calls in a call center where complaints of customers are handled, so as to reduce the stress of employees caused by angry calling persons. Here, if the employee of the call center is provided with an audio stream processed to reduce anger or stress expressed by the calling person, it will be easier to stay friendly for an extended period, thus in turn improving customer satisfaction and hence communication. It will be understood that for similar reasons, the method can be applied in hearing aids, headphones, headsets and the like.
If used with a cell phone, Voice-Over-IP Service or the like, the method can be applied prior to transmission; in that case, an app, plug-in or other program will be executed in the background during the call.
Furthermore, it is possible to process the voice data stream during transmission, e.g. at a base station of a service provider. In this case, the processing may specifically relate to e.g. the frequency response of the microphone used in registering the voice of the speaking person and the frequency response of an electrodynamic speaker in the device used by the listener without having to disclose information about the device used by the listener to the caller. It is noted that a corresponding voice processing service may be provided to paying premium users only, e.g. on a pay-per-call or subscription basis.
While the processing could be automatically effected off-line, in a preferred embodiment, the processing of the voice audio stream is effected in real time, that is in a manner where incoming voice data is processed at least as fast as it is streaming in.
This also holds where dynamic adaptation of the specific processing applied at a given time is provided for.
Where necessary due to processing power restrictions, real-time processing can be achieved by taking into account the current processor load (or coprocessor load, DSP load and the like) of a device, power consumption and overall computing performance, and adjusting the processing steps to the data processing performance currently available. As an example, the method might be executed using less complicated processing while the user is web surfing in the background during a call, while improved-quality voice processing can be effected once the user has finished browsing the web.
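Such adjustment of the processing steps to the performance currently available can be sketched as a simple tier selector. The tiers, thresholds and effect assignments below are purely hypothetical assumptions for illustration and are not prescribed by the method:

```python
def select_quality(cpu_load, battery_fraction):
    """Pick a processing tier from current device conditions.

    cpu_load and battery_fraction are both in 0.0..1.0; the cut-off
    values are illustrative assumptions only.
    """
    if cpu_load > 0.8 or battery_fraction < 0.1:
        return "basic"      # e.g. equalization only
    if cpu_load > 0.5:
        return "standard"   # e.g. equalization + compression
    return "full"           # e.g. EQ + compression + pitch + 'air'
```

A device would re-evaluate this selection periodically during a call, e.g. switching from "basic" to "full" once background web browsing ends.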
It should however be noted that the audio voice processing suggested by the present invention usually can be effected in real time, as the overall amount of data will be rather small even for high-quality audio streams such as 192 kHz, 16-bit resolution stereo audio data. Also, the processing of audio data according to the present invention will typically cause only minimal latency, in particular compared to the delay induced by Voice-over-IP transmission. It is possible and preferred to have a latency due to the processing of less than 15 ms, in particular of less than 5 ms. This is comparable to, or negligible compared with, the latency induced by transmission over an IP network. Therefore, the processing will hardly be discernible.
Also, it is possible to use the method according to the present invention in video calls. Where the latency-induced delay of the audio stream is too large, it is easy to automatically correct for any delay by re-synchronizing video frames and audio stream according to the delay. Therefore, the original lip movement will either still correspond very closely to the amended voice stream or can be made to do so, thereby improving the user experience of an audio-video conference.
It should be noted that in certain cases, it may be advantageous to register audio data at a higher frequency and/or dynamic resolution than needed for transmission, so as to allow for better processing. For example, clues as to how best to adapt the processing might be derived from non-audible, e.g. ultrasonic, or weak features of the audio data stream; as examples, irregularities of stress-induced voice tremor and/or irregularities of air intake/breathing noises are mentioned.
As has been indicated above, it is frequently preferred if the processing of the audio voice stream is effected prior to transmission to a remote audience over a network. Regarding the steps effected in processing the data of an audio stream of a voice (the voice cosmetic ingredients), it is suggested that in a preferred embodiment at least one of the plurality of steps for processing the audio voice stream is selected from the group comprising equalizing, limiting, modulation, flanging, adding chorus, adding reverb, adding air, changing pitch, adding 3D audio effects, compressing and expanding.
It should be noted and understood that certain of the effects mentioned can be combined to give certain desirable properties to a voice. For example, compression can be effected using different attack/rise times, compression factors and so forth, and a plurality of such effects can be chained as is known in the art. Because the loudness pattern of the source material is modified by the compressor, it may change the character of the signal in subtle to quite noticeable ways depending on the settings used.
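The way a compressor modifies the loudness pattern can be illustrated with a minimal feed-forward sketch: an envelope follower with separate attack and release smoothing, and gain reduction by a fixed ratio above a threshold. The parameter values are illustrative assumptions, not settings prescribed by the method:

```python
def compress(samples, threshold=0.5, ratio=4.0, attack=0.9, release=0.999):
    """Feed-forward dynamic range compressor sketch.

    attack/release are per-sample smoothing coefficients for the envelope
    follower (smaller = faster tracking); level above `threshold` is
    reduced so the excess is divided by `ratio`.
    """
    env = 0.0
    out = []
    for x in samples:
        level = abs(x)
        coeff = attack if level > env else release
        env = coeff * env + (1.0 - coeff) * level
        if env > threshold:
            # Target level: threshold plus the excess divided by the ratio.
            gain = (threshold + (env - threshold) / ratio) / env
        else:
            gain = 1.0
        out.append(x * gain)
    return out
```

Chaining two such compressors with different time constants, as described above, is just function composition: `compress(compress(samples, ...), ...)`.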
An advantage of certain of the processing steps suggested such as equalizing, limiting, modulation, flanging, adding chorus, adding reverb, adding air, adding 3D audio effects, compressing, expanding is that they can be done without subjecting the audio stream to a detailed analysis. Instead, they can be used regardless of the current actual behavior of the wave form. Accordingly, only very limited processing power is needed to execute corresponding algorithms on a processor.
It should be noted that while the effects mentioned might be effected in an analog manner prior to A/D conversion, in general they will be implemented by digital signal processing techniques. The ways the mentioned processing steps can be effected are per se well known in the art, as is the way they are implemented digitally. For example, a known way to "add air" is to use an equalizer in an 'air'-increasing manner by using an EQ centre frequency between 12 and 14 kHz and setting the EQ width to be fairly wide (one to two octaves, or a Q value of around 0.7). Then, if e.g. up to 4 dB or so of boost is added, the high end appears to 'open up' and appears to become more detailed, but without adding harshness to voices. It is explicitly noted that in this example, the actual amount of boost should be judged by ear to provide for optimized effects, depending inter alia on the speaker, the chain of transmission from microphone to electrodynamic speaker, and the audience. It will be understood that the electronic component used for emitting sound is referred to as "electrodynamic" speaker not in order to restrict the disclosure to such a variety of electronic components, but to more clearly distinguish the electronic speaker component from a person speaking. Obviously, electronic speaker components other than "electrodynamic" speakers might be used just as well, such as e.g. electrostatic headphones or the like.
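A minimal sketch of the 'air' equalization described above, assuming the widely used peaking-biquad coefficient formulas from the RBJ Audio EQ Cookbook (a centre frequency of 13 kHz, a Q of 0.7 and 4 dB of boost are taken from the text; the function names are illustrative):

```python
import math, cmath

def air_peaking_eq(fs, f0=13000.0, q=0.7, gain_db=4.0):
    """Biquad peaking-EQ coefficients (RBJ cookbook form) for the 'air'
    band: a wide boost centred between 12 and 14 kHz."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = [1 + alpha * A, -2 * math.cos(w0), 1 - alpha * A]
    a = [1 + alpha / A, -2 * math.cos(w0), 1 - alpha / A]
    # Normalize so that a[0] == 1.
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]

def gain_db_at(b, a, fs, freq):
    """Magnitude response of the biquad at `freq`, in dB."""
    z = cmath.exp(-2j * math.pi * freq / fs)  # z here stands for z^-1
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20.0 * math.log10(abs(h))
```

With these coefficients the boost at the centre frequency is exactly the requested 4 dB, while frequencies in the main vocal range (around 1 kHz) are left essentially untouched, matching the "open up without harshness" behaviour described above.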
The voice cosmetics methods may therefore comprise algorithms and instructions on how to assemble or morph two or more of the processing steps as voice cosmetic ingredients to cause a desired audio effect when applied to a voice input and the like.
It is also noted that among the particularly important and preferred steps effected in processing the voice stream are pitch control and adding 'air'. These effects are very important ingredients in making voices sound more attractive, typically for males by lowering pitch and adding air, and for females by increasing pitch and adding air. In this respect, while a major fraction of the important processing steps can be effected without regard to the transient or temporal behavior of the voice audio stream, pitch control and certain implementations of adding air other than that mentioned above may require a more complex approach relying on the analysis of temporal behavior, which however is also well known in the art.
It will be obvious that several of the voice cosmetic ingredients (or, speaking more technically, the plurality of steps for processing the audio voice stream) will be used together to obtain one certain desired target property. Also, it will be understood and explained hereinafter in more detail that the selection of steps may vary depending on the degree to which a certain voice property is to be achieved and/or to which a natural voice is to be altered. In particular, a target property can be selected from making the voice sound more feminine, more masculine, more energetic, more soothing, more sensual or erotic, warmer, more trustworthy, more tender, or similar to the voice of a particular celebrity.
It will be understood that up to a certain degree, it is known in the art how to achieve these properties in a voice, e.g. from music recording studios. However, while there a setting of effect devices is selected by a mixing engineer, the settings of certain effects according to the present invention will preferably be judged in a different manner, as will be obvious from the disclosure presented above and hereinafter.
For example, for applications such as call centers, it is feasible to determine a desired target voice characteristic from a suggested plurality of possible alterations by varying the processing from call to call during an initial number of calls, evaluating each adjustment by success, e.g. revenues achieved in telephone sales marketing, satisfaction of users according to feedback, voice stress perceived in the counterpart, and so forth. It will be understood that such adaptation may be changed dynamically even during the initial number of calls, e.g. after an early phase of a call in which the gender or social status of a person contacted is identified either manually or automatically.
Another way to judge whether a certain appealing target property has been achieved by a given combination of processing steps, and by the parameters set for these steps, would be statistical analysis of the response of a sufficient number of persons representative of the audience the target property is to be appealing to. Such a response can be gathered in a conventional manner, such as a face-to-face interview, but could also be determined using web-based methods, such as evaluating sales numbers of certain apps, web questionnaires and so forth. Therefore, it is also suggested that in addition or as an alternative, the processing that maintains the impression of the audience of listening to a natural voice stream is selected according to a statistical evaluation of a number of different pluralities of steps for processing the audio voice stream of the human being. The statistics evaluated can be based on the response of a group representative of the target audience. In all of these cases, it is perfectly possible to let the persons representative of the audience the target property is to be appealing to select from a number of chained effects initially suggested by experienced users and/or experts.
Also, it should be noted in this context that it can be established by way of statistical analysis evaluating the response and/or feedback obtained from a group representative for the target audience that a given suggested processing of the voice will not be discernible as such by at least a significant part of the target audience, even though automatic detection thereof might be possible. The fact that a suggested voice treatment is not discernible to a certain part of a representative group can be established during the evaluation of suggested processing steps.
It should be noted that very frequently, a user might wish to apply voice cosmetics in a manner not discernible to an audience. Particular attention may be paid in that case when the cosmetic treatment according to the method suggested herein is dynamically adapted during an ongoing communication to prevent the treatment from being discernible as such due to glitches and transitions. Therefore, it might be helpful to e.g. cover a transition phase with some added noise, e.g. by temporarily partially defeating a noise reduction or to amend the parameters for cosmetic treatment in a manner sufficiently gradual to be indiscernible. The necessary length of a transition phase can be determined individually for every single given desired property, for different groups of desired properties or for all properties in the same manner.
Also, it should be noted that despite restrictions imposed on the voice data stream by transmission characteristics such as the bandwidth available, a large number of different voice-ameliorating measures might be implemented and executed without destroying the impression of the audience to listen to a natural voice.
Furthermore, it is suggested that in addition or as an alternative, at least one further property considered to be appealing in a speaker is defined; and for this at least one further property, a further plurality of steps for processing the audio voice stream of the human being for approximating the respective appealing target property, without destroying the impression of the audience to listen to a natural voice, is defined; a degree to which the target voice stream is to have the further property is defined; the plurality of processing steps associated with said determinations is applied to the voice stream in real time; and the processed voice stream is transmitted to the audience.
In other words, it would be e.g. possible to give the voice a more feminine touch while at the same time de-stressing the voice or, as an alternative to adding this sort of feminine touch, to make it sound more sensual, masculine and trustworthy at the same time.
The degree to which a certain property is to be achieved or approximated, that is, the corresponding determination, may be made in response to at least one of: image analysis to identify males/females in an audience, and/or to identify the gender of a single person or of the few persons constituting the audience, and/or to identify surroundings, or to determine the current stress level of the audience, and so forth. Furthermore, the result of a voice stress analysis of the audio voice stream of the speaker and/or of the audio voice stream of the audience, detected ambient sound levels/characteristics, a physical location of the speaker, a local time of speaker or audience, ambient temperature, the phone number called or calling, the area code of an audience called, an ID of the audience, the service level of the provider of a communication path, the device type of a user, and/or keywords identified in the voice stream by speech recognition can be relied upon for setting or adapting parameters.
In more detail, it has e.g. been found that a male audience considers other features attractive in a male speaker than a female audience does. The same male audience will also consider other features attractive in a female speaker than a female audience does. The features considered attractive in a female speaker will not be the same as those considered attractive in a male speaker; this holds for both male and female audiences. Therefore, while some features in the voice of a speaker of a specific gender may be attractive to both male and female audiences, it is preferred to adapt the processing steps to the respective audience. As the question of what is considered attractive in a voice will depend e.g. on the gender of the audience, it is preferred to automatically adapt the target property to the gender identified. Where different genders are simultaneously addressed, the target property might be selected according to the majority of genders present, or might be selected according to the gender of a specific person or persons, e.g. in an audio-video conference according to those speaking most, or according to the gender of a conference leader, or according to the gender of those currently stressed most or being most aggressive towards a speaker using the voice cosmetic method. The gender of an audience or of persons in an audience can be determined using voice analysis techniques applied to incoming voice streams, image analysis e.g. of images transmitted in an audio-video conference, and so forth. These techniques are well known in the art. Also, a manual override by the speaking person is considered to be possible.
It is possible to determine a current stress level of the audience and to react to the result of a voice stress found in the audio voice stream of both the speaker and/or the audio voice stream received from the audience. It should be noted that voice stress analysis is a well-known technique and has been related inter alia to determining whether or not a person tells the truth. In certain instances it might therefore be helpful to also indicate to a speaker and/or to a listener the result of a voice stress analysis if such voice stress analysis is effected.
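One classic measure examined in work on voice stress and voice quality is pitch jitter, the cycle-to-cycle irregularity of the pitch period; elevated jitter can serve as one input to such an analysis. A minimal sketch of the local jitter measure (the function name is illustrative, and no claim is made that jitter alone suffices for stress detection):

```python
def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, relative to the mean period. `periods` is a list of pitch
    period lengths (e.g. in ms or samples) from a pitch tracker."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))
```

A perfectly regular voice yields a jitter of zero; the more the period lengths fluctuate from cycle to cycle, the higher the value.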
Also, it might be helpful to process the voice in a similar manner or with the intention to achieve similar properties every time a given audience is spoken to. Therefore, a phone number called or calling can be used to set a desired property. Properties set once can be stored with a given number. Also, in certain instances, a desired property can be set relying on a given local area code of an audience calling or called. E.g. if a sales representative only has business contacts in a given area, it is advisable to choose a voice amelioration increasing trust of the listener.
Other parameters that may influence a selected target property, or the degree to which it is to be achieved, comprise detected ambient sound levels and/or characteristics, a physical location of the speaker, a local time of speaker or audience, and ambient temperature. For example, a person contacted early in the morning in an environment that, due to the background noise received, can be suspected to be an office will more frequently be contacted for business reasons than a person contacted late in the afternoon on a hot summer day in an environment that, according to the background noise, might be a beach. Here, even if the person is contacted for business reasons, it might be more advisable to use a less demanding voice.
The parameters used to influence or select a target property can be obtained from physical detectors such as temperature detectors; however, it is also possible to implement parameter detecting means by way of software, e.g. as a software component for automatically detecting the gender of an audience by listening in on the conversation, or, as another example, as a software component for automatically optimizing the attractiveness of the user's voice based on contextual input derived from background noise analysis; it is noted that this can be done by comparison of background sound patterns to databases.
If sufficient processing power is available it is even possible to analyze the voice stream by speech recognition to search for keywords spoken so as to e.g. determine an appropriate make-up.
It should also be noted that in certain cases, the signal strength of a communication service might necessitate that only a restricted bandwidth is used and that accordingly, a setting that does not rely on e.g. extended highs will be preferred. Also, if a specific device type is used by a given participant of a communication, it might be helpful to take technical limitations of the device into account, e.g. frequency response characteristics. In this context, it is noted that each of said number of different pluralities of steps for processing the audio voice stream of the human being suggested, and/or the combination thereof, observes voice transmission standards defined by a service provider. In other words, the method of the present invention will be easily compatible with existing or upcoming standards. It should be noted that audio processing such as data compression, noise reduction and so forth suggested by such known or upcoming standards may still be used as known in the art.
Furthermore, it is suggested that in addition or as an alternative, when the plurality of processing steps associated with said determinations is applied to the voice stream in real time, at least some of these processing steps are effected as mathematical operations on a digitized voice stream and at least some of the mathematical operations are calculated by executing a combination obtained by concatenation thereof. It should be understood that this is particularly straightforward where only linear operations are applied. Also, it should be noted that for certain chained processing steps, approximations may be used.
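The concatenation of mathematical operations mentioned above is particularly simple for linear filters: transfer functions applied in series multiply, so their coefficient polynomials convolve, and a whole chain can be pre-computed into a single equivalent filter before any audio is processed. A sketch (function names are illustrative):

```python
def convolve(p, q):
    """Multiply two polynomials given as coefficient lists
    (i.e. discrete convolution of the coefficient sequences)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def cascade(f1, f2):
    """Combine two linear filters, each given as (b, a) coefficient lists,
    applied in series, into one equivalent filter: the transfer functions
    multiply, so numerators and denominators are convolved."""
    (b1, a1), (b2, a2) = f1, f2
    return convolve(b1, b2), convolve(a1, a2)
```

For example, cascading a pure gain of 2 with a pure gain of 3 yields a single gain-6 filter, and two biquads combine into one fourth-order section; nonlinear steps such as compression cannot be merged this way and, as stated above, may require approximations.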
It should be noted in particular, that in a preferred embodiment, target properties may be dynamically adjusted during an ongoing communication. E.g., if during a phone call voice stress detected increases, it might be helpful to increase a de-stressing cosmetic effect or to make the voice sound more soothing. The same holds if keywords detected by speech analysis are identified to be insults or are typical for speakers having a given degree of education, thus serving as an indicator of social status. It is appreciated that with increasing processing power, it may also become possible to identify by speech recognition whether the voice stream coming from a communication partner indicates that the user speaks a certain dialect or uses specific grammatical constructions that may be indicative of education, typical for a given social status and/or age.
From the above, it will be understood that in addition or as an alternative, the degree to which the target voice stream is to have the appealing voice property may be determined according to at least one of a user input and a sensor means for analyzing at least one of an audience gender, an audience behavior and the social background of an audience. Accordingly, a user can input instructions e.g. via a touch screen or other pointing device such as a computer mouse, or e.g. by speech input, e.g. in response to monitoring the voice cosmetics currently applied to his or her own voice, and/or ad hoc. In preferred embodiments, it is possible for a user to set the intensity levels of certain cosmetic properties and/or to adjust the overall intensity level.
From the above, it will be understood that in a preferred embodiment, the method of the present invention will be executed in a context-aware and/or a content-aware manner, particularly in a dynamic manner, automatically selecting target properties and/or automatically adjusting the degree to which target properties are to be achieved in a context- and/or content-aware manner. It is noted that it is perfectly possible to rely on only one of content or context, e.g. either on keywords recognized or e.g. on gender identified, without deviating from a preferred implementation of the invention.
Therefore, it is suggested for a preferred embodiment that in addition or as an alternative, the degree to which the target voice is to have said appealing voice property is changed during an ongoing communication in an automated manner relying on a change in the determination made.
To this end, sensors may be provided for automatically detecting the gender of an audience by listening in on the conversation, and for automatically optimizing the attractiveness of the user's voice based on contextual/sensor input, dynamically, throughout a conversation. These sensors may be implemented as hardware components, e.g. as GPS sensors or other sensors available in a smart phone, or as one or more software components, in particular a software component specific to the invention, e.g. for analyzing the incoming or outgoing voice stream. Furthermore, in certain cases, the sensor may rely on web-retrieved information, e.g. the current local temperature at the place where a person calling from a fixed network is situated according to caller ID.
Also, as the effectiveness of applied voice cosmetics options may change, for example if the user changes the position of his or her mouth relative to his or her voice-transmitting device or if ambient sound patterns change, such changes can be detected. For example, it is possible to provide a camera repeatedly taking a picture of a person speaking into the microphone of a mobile phone so as to detect changes in posture.
Dynamic adaptation may also be achieved by relaying derived information, such as background noise characteristics or the like, to a database where these characteristics are linked to locations or types of locations, so that suitable adaptations of the voice cosmetics can be suggested and parameter sets prepared and transmitted to the user for immediate adaptation of an ongoing communication. Therefore, it is possible for users and providers to create, buy, sell and apply one or more voice-ameliorating methods via the Internet, e.g. in app stores and the like.
Furthermore, while it is possible that the degree selected is a simple on/off degree, it is preferred that at least three different degrees, such as an ON, an OFF and a MEDIUM setting, or settings in a range from 0 to 10 in 0.5 steps, are provided.
In this preferred case, the number of effects and/or their relative strength may change with at least one of the different degrees. In general, different effects might be needed if several properties are to be achieved simultaneously. For example, a slider for 'sugarcoating' the user's voice to a greater or lesser extent might be provided, for example by adding 'air' and slightly increasing the pitch of the user's voice for a female user, or adding 'air' and slightly decreasing the pitch of the user's voice for a male user. Another slider might be provided to make the user's voice sound more or less energetic, for example by applying a chain of compressor and/or expander effects.
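The 'sugarcoat' slider described above can be sketched as a mapping from a 0..10 setting (in 0.5 steps) to effect parameters. The maximum boost and pitch shift chosen below are hypothetical illustrations; only the sign convention (lower pitch for male users, higher for female users) is taken from the text:

```python
def sugarcoat_chain(level, gender):
    """Map a 0..10 'sugarcoat' slider to hypothetical effect parameters:
    an 'air' boost plus a small pitch shift whose sign depends on the
    user's gender."""
    level = max(0.0, min(10.0, round(level * 2) / 2))  # snap to 0.5 steps
    air_db = 4.0 * level / 10.0        # assumption: up to ~4 dB of 'air'
    semitones = 0.5 * level / 10.0     # assumption: up to half a semitone
    if gender == "male":
        semitones = -semitones          # lower the pitch for male users
    return {"air_db": air_db, "pitch_semitones": semitones}
```

The returned dictionary would then parameterize the actual effect chain (equalizer and pitch shifter); keeping the maximum shift small helps preserve the impression of a natural voice, as required above.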
However, when e.g. giving a voice the impression to be somewhat warmer, only a slight first compression and a slight second different compression need to be applied in series. When more warmth is required as make-up, the compression factor of each compression might be simply increased. Yet, when even more warmth is required, it might be necessary to also equalize the voice so as to avoid an artificial impression. Hence, an additional effect, namely equalization would be needed. It is preferred if an embodiment of the invention provides for this.
As another example, when simply warming up a voice by a medium degree, two different compressions might be needed, while a certain equalization might be necessary to give a certain amount of "air" to a voice and hence reduce the impression that the speaker is stressed. Now, if both effects are needed at the same time, it might be necessary to alter the equalization in order to give a better impression. Therefore, in a preferred embodiment where sliders or the like are used to alter the degree to which one of a plurality of properties is attained, a database or look-up table may be provided indicating specific parameters for each combination of settings; or, where this is not feasible due to a large number of voice properties offered, general rules to determine preferred combinations can be set up, e.g. rules giving priority to certain effects and reduced priority to other effects. In that way, rules stating in a suitable mathematical manner e.g. "the equalization mid frequency for the air effect must be reduced by X% if compression A is increased by Y%, if Y>Z, but the air mid frequency may not fall below value D", or a rule "if for compression A a value Y is requested, it must be set regardless of all other values", can be implemented, thus allowing a suitable parameter set to be determined for chaining the effects necessary to approximate the combined effects as set by the user.
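Such rules can be expressed directly as a small parameter resolver. The concrete values for X, Z and D below are hypothetical placeholders (the text deliberately leaves them open), and the slider-to-parameter mapping is likewise an assumption for illustration:

```python
def resolve_parameters(settings):
    """Rule-based parameter resolver sketch. `settings` holds the user's
    slider values; the rules below express the hypothetical examples
    from the text as ordered constraints."""
    params = {
        "compression_a": settings.get("warmth", 0.0) * 10.0,  # % increase
        "air_mid_hz": 13000.0,                                # default 'air' band
    }
    # Rule: "air mid frequency must be reduced by X% per Y% increase of
    # compression A, if Y > Z, but may not fall below value D".
    Y = params["compression_a"]
    Z, X_PER_Y, D_MIN = 20.0, 0.5, 11000.0   # placeholder constants
    if Y > Z:
        reduced = params["air_mid_hz"] * (1.0 - X_PER_Y * Y / 100.0)
        params["air_mid_hz"] = max(reduced, D_MIN)
    # Rule: an explicitly requested compression value overrides all others.
    if "compression_a_override" in settings:
        params["compression_a"] = settings["compression_a_override"]
    return params
```

Evaluating the rules in a fixed priority order, with clamping to minimum values, is what keeps combined slider settings from producing an artificial-sounding parameter set.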
Accordingly, in a preferred embodiment, a means for determining a parameter set for effects used in combination to obtain a combination of properties to an adjusted degree is provided. Said means may be implemented as a look-up table of parameters and/or as a set of equations for determining said parameters. The look-up table and/or the equations preferably will again be determined by evaluating the response of a group of users representative of a target audience; however, the existence of other ways to determine suitable sets of parameters will by now be obvious to the skilled person in view of the preceding sections of the disclosure. It will be understood that the voice cosmetics methods of the present invention may comprise, inter alia in a preferred embodiment, algorithms, instructions on how to assemble or morph two or more of the voice cosmetic ingredients to cause a desired audio effect when applied to a voice input, and the like.
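The priority rules discussed above can be illustrated with a small reconciliation function. The rule constants (X, Z and the floor D) and the parameter names are hypothetical placeholders standing in for the X%, Y>Z and value D of the worked example; they are not values from the disclosure.

```python
def reconcile_parameters(params: dict) -> dict:
    """Apply priority rules to a combined effect parameter set (sketch)."""
    p = dict(params)
    X, Z, D = 10.0, 20.0, 200.0  # illustrative rule constants

    # Rule: the air-effect mid frequency must be reduced by X% if
    # compression A was increased by more than Z%, but it may never
    # fall below the floor D.
    if p.get("compression_a_increase_pct", 0.0) > Z and "air_mid_hz" in p:
        reduced = p["air_mid_hz"] * (1.0 - X / 100.0)
        p["air_mid_hz"] = max(reduced, D)

    # Rule: an explicitly requested compression value overrides all
    # other considerations.
    if "compression_a_requested" in p:
        p["compression_a"] = p["compression_a_requested"]
    return p
```

In practice such rules would be loaded from the database or look-up table mentioned above rather than hard-coded.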
Protection is also sought for a data set usable in executing a method as described above, in particular a personalized data set. It will be understood that data sets may be offered in dedicated stores and/or dedicated web stores and/or in departments of stores and/or departments of web stores. Where a store, web store or department is set up, it will be advantageous to define categories such as business, romance, relatives and friends, under which data sets adapted for a particular target audience and/or purpose can be found. In addition and/or as an alternative, it will be advantageous if data sets are offered according to one or more specific characteristics of the intended user, such as gender, age, size or weight. It will be understood that it is possible to implement the store using a database relating to specific descriptors of the data set, the descriptors being e.g. intended purpose or a characteristic of a user. It is possible and advantageous, although not necessary, to use icons describing the intended purpose and/or target audience and/or characteristics. Furthermore, photos of persons representing the specific characteristics of the intended user may be associated with each data set and/or a group of data sets for faster orientation of a potential customer. Applicant reserves the right to file divisional applications relating to such web stores. Applicant also considers data an essential means for execution of the method, even when sold separately. It will be understood that the data set may in certain cases comprise only parameters for algorithms already known to the device of a user, e.g. because corresponding software capable of executing the algorithms needed for certain effects has already been installed. In other cases, data sets offered might also comprise the algorithms the parameters relate to.
It will be understood that when additional data sets to provide for certain effects are offered to a user who already uses other effects, a set of rules relating to the chaining and/or morphing of a plurality of effects might be updated and/or generated as described above. Also, determining such rules may be a task the shop may offer as part of a purchase or separately.
It will also be advantageous to process samples of voice streams transmitted to a web store by a potential customer for demonstration purposes. Here, it will be helpful for a potential customer to transmit a voice stream of a given length to the web store and to store the corresponding audio data in the web store for repeated processing using different data sets the customer is considering buying. As an alternative, the data sets might be transmitted to a user for temporary use, e.g. ensuring that only a recorded voice track of a given length can be processed, so that the potential buyer can select a specific data set. It will be understood that once a sufficient number of certain algorithms and software components have been installed, processing a voice to obtain a specific target property will only require a set of parameters input into specific algorithms. While it is possible to use sets of parameters that give satisfying results for average users and typically used devices, it can be understood that improved results may be obtained using parameters that are specifically adapted to a particular user and/or a particular device. Hence, it is possible to personalize a set of parameters for a given desired property or for a given set of desired properties. Such personalisation can take into account speech characteristics of a specific user, and/or his habits such as the typical position of the mouth relative to the microphone of a handheld mobile phone, and/or the specific device used, and/or typical background noise patterns and so forth. By taking these influences into account, an improved result can be obtained.
It is obvious that such personalization may be fully individual or may just take into account typical important characteristics that a large number of people have in common. For example, a male weighing between 120 and 140 kg and being between 160 and 165 cm tall may use a different set of parameters to obtain a given set of voice properties than a person of 190-195 cm and 85-90 kg. It can be understood that in certain instances, different ways of processing the voice might be offered to a user. For example, a standard average user grade relying on either an average male or female pattern may be offered; a pattern defined for a smaller group may be offered for other users; and a personalized data set may be offered for premium users. Also, the number of possible voice properties offered to a particular user might vary depending on grade, or the download of a specific parameter set for implementing a given voice cosmetic property might be offered.
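Selecting a parameter set from group characteristics, as in the height/weight example above, can be sketched as a tiered lookup. The profile table, the band boundaries and the parameter set identifiers are invented for illustration; the disclosure only requires that a more specific set be preferred over the average fallback.

```python
# hypothetical table: (gender, height band in cm, weight band in kg) -> set id
PROFILES = {
    ("male", (160, 165), (120, 140)): "male_short_heavy_v1",
    ("male", (190, 195), (85, 90)):   "male_tall_light_v1",
}

def select_profile(gender: str, height_cm: float, weight_kg: float) -> str:
    """Return the most specific matching parameter set, else an average grade."""
    for (g, (h_lo, h_hi), (w_lo, w_hi)), set_id in PROFILES.items():
        if g == gender and h_lo <= height_cm <= h_hi and w_lo <= weight_kg <= w_hi:
            return set_id
    # standard average user grade as the fallback
    return f"{gender}_average_v1"
```

A premium grade would replace the table lookup with a fully individual data set, as described above.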
It is noted that personalized services can be provided as remote, e.g. cloud service and/or as a local service.
Protection is also sought for a device for implementing the method according to any of the previous claims, in particular selected from base stations, smartphones, tablet computers, game consoles, laptop computers, personal computers, switchboards, wearables, glasses, and peripherals such as microphones, headphones and/or headsets, each adapted by specific hardware and/or software for executing the method, the software being stored on the device or otherwise accessible to the device.
It should be noted in particular that the method can be implemented by using specific hardware such as FPGAs, digital signal processors or ASICs or e.g. by using a general purpose CPU. It is possible to register sounds using e.g. microphones, bone vibrations, optical recognition of mouth movements. It should be noted that it would be possible to build a device for a stand-alone solution, but in a more preferred embodiment, the device will be a mobile or other device having circuitry adapted for execution of the method steps.
Where wireless headsets or the like are used, it is preferred to provide them with a remote user interface.
Additional objects and advantages of the present invention will become apparent to those skilled in the art based on the following drawings and detailed description, which are given by way of non-limiting example only. In particular, where certain features are present in one embodiment but not in another embodiment, it cannot be concluded that these features cannot or may not be implemented in that other embodiment unless explicitly noted.
In the drawings:
FIG. 1 is a block diagram illustrating the invented voice cosmetics system and method.
FIG. 2 is a flowchart depicting a typical operation of the present invention from a user's perspective.
FIG. 3 is a flowchart depicting a typical operation of the present invention from a voice cosmetics service operator's perspective.
FIG. 4 is a simplified mockup of a voice cosmetics application operated on a mobile device.
FIG. 5 is a block diagram illustrating context sensitive aspects of the invented voice cosmetics system and method.
FIG. 6 depicts a simplified mockup of a context sensitive voice cosmetics application operated on a mobile device.
FIG. 7 is a flowchart depicting a context sensitive operation of the present invention.
FIG. 8A is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as a microphone.
FIG. 8B is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as a headset.
FIG. 8C is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset.
FIG. 8D is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset with an additional audience context input.
FIG. 8E is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device as well as in a connected communication device.
FIG. 1 is a block diagram illustrating the invented voice cosmetics system and method. As will be understood from the detailed description hereinafter, this voice cosmetics method is a method of processing an audio stream of the voice of a natural human being for communication, wherein at least one target voice property considered appealing in a speaker is defined; a plurality of steps for processing the audio voice stream of the human being in a manner approximating said appealing target property, without destroying the audience's impression of listening to a natural voice, is defined; a degree to which the target voice is to have said appealing voice property is determined; the plurality of processing steps associated with said determination is applied to the voice stream; and the processed voice stream is reproduced to the audience. In a preferred embodiment, for executing this method, the invented system comprises a voice cosmetics application 108 and a voice cosmetics service 140.
The voice cosmetics application 108 may be implemented as software, hardware, or a combination thereof. Furthermore, the voice cosmetics application 108 may be operated on any device allowing a user 100 to remotely talk with an audience 118 that consists of one or more other users. Suitable devices particularly include, but are not limited to, smart phones, wearables including Google Glass, tablet computers, laptops and personal computers.
The voice cosmetics service 140 may be implemented as software, hardware or a combination thereof. In a preferred embodiment, the voice cosmetics service 140 is implemented as a cloud service accessible via the Internet, allowing bidirectional communication with one or more of the voice cosmetics applications 108 via the Internet.
The voice cosmetics service 140 may comprise at least one of the following voice cosmetics components: one or more voice cosmetics ingredients 132, one or more voice cosmetics methods 134, one or more voice cosmetics ratings 136, and one or more voice cosmetics resources 138.
Providers 130 may comprise any business entity or human being that is able and willing to provide value in the form of such voice cosmetics components or related services to the voice cosmetics service 140 as described in more detail below.
The voice cosmetics ingredients 132 may comprise, but not be limited to, best practice voice samples, instructions on how to alter the audio characteristic of a particular voice input to make it sound in a particular way to an audience 118, and the like.
The voice cosmetics methods 134 may comprise, but not be limited to, algorithms, instructions on how to assemble or morph two or more of the voice cosmetic ingredients 132 to cause a desired audio effect when applied to a voice input and the like.
Furthermore, voice cosmetics application 108, voice cosmetics service 140 or both may be designed in such a way that they can automatically analyze the voice characteristics of a user 100 and then automatically apply respective voice altering effects in such a way as to make the voice of user 100 sound most appealing to a current audience, context, or situation. This includes, for example, the automatic detection of the gender of user 100, the automatic detection of the predominant gender among a current audience 118, and the automatic application of one or more voice cosmetic components making the voice of user 100 be perceived as most appealing by audience 118 accordingly.
The voice cosmetics ratings 136 may originate from one or more users 100, one or more audiences 118, one or more providers 130, or a combination thereof. The voice cosmetics ratings 136 may comprise, but not be limited to, structured or unstructured data relating to the perceived quality of one or more of the heretofore-mentioned components 132, 134, 136, 138 of the voice cosmetics service 140, as perceived or measured by the respective rating's origin.
The voice cosmetics resources 138 may comprise, but not be limited to, software tools, electronic articles, best practices, related remote services as well as access to human experts suited to assist or consult at least one user 100 in finding and applying desired voice cosmetics to the user's voice.
The voice cosmetics application 108 allows user 100 to select and apply at least one of the voice cosmetics components provided by voice cosmetics service 140 in such a way that it has an effect on the audio characteristic of his or her voice input 106 when processed with digital signal processor 114.
User 100 can control this effect by providing instructions 102 to the voice cosmetics application 108, for example, via a graphical touch-enabled user interface as illustrated in FIG. 4, as well as speech input, sensor input, a combination thereof and the like. Voice application 108 may be used to dynamically alter the appearance of user 100's voice in intensity levels ranging from very subtly to very strongly.
In a preferred embodiment, digital signal processor 114 is part of a device used by user 100 to remotely talk to an audience 118 (the device is not shown here).
In this embodiment, the digital signal processor 114 is enabled to receive instructions 112 from voice cosmetics application 108 and react to such instructions by causing voice outputs 110 and 116 to sound in a particular way as desired by user 100.
It will be understood that, additionally or alternatively, voice cosmetics application 108 may include a digital signal processor or be built into the digital signal processor, whether implemented as software, hardware or a combination thereof, to cause the same or a similar effect on the voice output 110 and 116 of user 100. One or more suitable voice output-altering methods and sound effects well known in the art, such as equalization, limiting, modulation, flanger, chorus, reverberation, 3D audio effects, adding of 'air' and changing of the pitch of the voice output, may be applied either directly or indirectly by the voice cosmetics application 108 in the heretofore-mentioned way to create a desired voice cosmetics effect on voice output 110 and 116.
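Chaining two of the effect types named above on a digitized voice buffer might be sketched as follows. The first-difference 'air' emphasis and the static compressor below are deliberately naive stand-ins chosen so the example stays self-contained; a real implementation would use proper parametric equalization, envelope-following compression and dedicated pitch-shifting algorithms.

```python
def add_air(samples: list, mix: float = 0.3) -> list:
    """Emphasize high frequencies by mixing in a first-difference signal."""
    out = [samples[0]]
    for i in range(1, len(samples)):
        out.append(samples[i] + mix * (samples[i] - samples[i - 1]))
    return out

def compress(samples: list, threshold: float = 0.5, ratio: float = 2.0) -> list:
    """Static compressor: attenuate the part of each sample above the threshold."""
    out = []
    for s in samples:
        if abs(s) > threshold:
            excess = abs(s) - threshold
            s = (threshold + excess / ratio) * (1.0 if s > 0 else -1.0)
        out.append(s)
    return out

def process(samples: list) -> list:
    """Apply the illustrative effect chain: 'air' emphasis, then compression."""
    return compress(add_air(samples))
```

The ordering matters: compressing after the 'air' boost tames the extra high-frequency energy, which is one reason the disclosure treats chained effects as a combined parameter set rather than independent knobs.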
It will be understood that the present invention is not limited to such currently available effects only and may apply alternative or additional effects as available.
For a female user 100, it is a preferred embodiment of the present invention to apply a combination of adding 'air' to her voice output 110 and 116 plus controlling the pitch of her voice output 110 and 116 in such a way (typically higher) that it is perceived most appealing particularly among any male fraction of audience 118.
For a male user 100, it is a preferred embodiment of the present invention to apply a combination of adding 'air' to his voice output 110 and 116 plus controlling the pitch of his voice output 110 and 116 in such a way (typically lower) that it is perceived most appealing particularly among any female fraction of audience 118.
It will be understood that the present invention is not limited by these two gender-based default approaches and can also be applied in alternative ways most appealing to a particular audience.
If so desired by user 100, voice output 110 is optionally also provided back to user 100 in order to allow user 100 to monitor the currently applied voice cosmetics effects on his or her voice, preferably without a delay.
Voice output 116 is provided to audience 118.
It will be understood that user 100 may, for example, choose to switch off voice output 110 after having ensured a desired voice cosmetics effect is being applied on his or her voice, for example, in order to avoid unwanted distractions or interferences during a conversation with audience 118.
In a preferred embodiment, voice cosmetics application 108 is suited to apply the heretofore-mentioned sound effects on the voice input 106 of user 100 without a noticeable delay. The invented system and method may be used, for example, to make the voice of user 100 sound more appealing, feminine, masculine, energetic, similar to the voice of a particular celebrity, or a combination thereof.
To that end, user 100 may apply, combine, morph, chain or otherwise make use of one or more components of voice cosmetics service 140 via voice cosmetics application 108. Additionally, user 100 may change desired voice cosmetics in an ad hoc fashion between or during conversations with particular audiences 118, for example, to optimize his or her voice's appeal on each particular audience 118.
For example, user 100 may want to use different voice cosmetic settings when talking with a prospective business customer versus a person that he or she aspires to get romantically involved with.
Voice cosmetics service 140 may be provided as a self-service or a service that is provided at least partly by one or more human service providers. In this case, user 100 may, for example, physically visit a voice cosmetics service location that will analyze the voice characteristic of user 100, recommend one or more particular voice cosmetic treatments to user 100 and then enable desired voice cosmetics treatments on the device of user 100 either directly via voice cosmetics application 108 or remotely via voice cosmetics service 140.
It will be understood that the present invention may be used in many different contexts, including as part of a free or as part of a paid business model.
Also, providers 130 may provide voice cosmetics components to one or more voice cosmetics services 140 for free or for a fee.
Additionally or alternatively, voice cosmetics application 108 may be designed in such a way that it can be operated in the heretofore-described way without the need to communicate with voice cosmetics service 140. In this case, voice cosmetics application may simply be a component of a stand-alone communication device or software that user 100 is operating without the need for a connection with voice cosmetics service 140 or merely intermittent connection with voice cosmetics service 140.
Also, voice cosmetics application 108 may be connected with one or more sensors 120 allowing voice cosmetics application 108 to automatically detect one or more aspects of the current user's voice features or context. For example, sensors 120 may be suitable, but are not limited to, detecting at least one of, the gender of user 100, the gender of one or more members of audience 118, respective ambient sound levels and characteristics. The term sensor shall be used broadly here to refer to any information input operable to provide voice cosmetics application 108 with information regarding the context of the user 100 or the audience 118. This information may further include, but not be limited to, respective current physical location, respective local time, respective ambient temperature, respective ambient noise characteristics and the like.
Audio information input regarding the audience 118 may be provided, for example, by tapping into the host device's audio input stream coming from the audience 118.
Using at least a subset of such contextual information, voice cosmetics application 108 may automatically and continually adapt to the contextual information in such a way that it selects and applies available voice cosmetics components to the voice output of user 100 in order to make user 100's voice sound as appealing as possible to audience 118 during a call.
Now referring to FIG. 2, a typical operation of the present invention from a user's perspective is depicted.
In step 200, a user selects one or more voice cosmetics options using voice cosmetics application 108 on his or her communication device.
The term voice cosmetics options shall refer to one or more voice output altering components that can be selected by a user 100.
In step 204, the selected voice cosmetic options are applied to the user's voice output by having the voice cosmetic application 108 instruct digital signal processor 114 on the communication device as described above.
In step 206, the user hears the resulting effect on his or her voice output as provided via voice output 110.
In step 208, if the user is satisfied with the effect of the selected voice cosmetics options on his or her voice output, the selected voice cosmetics options remain applied to his or her voice output.
After step 210 or if the user is not satisfied with the effect of the selected voice cosmetics options in step 208, step 212 is performed. In step 212, if the user chooses to try again selecting and applying alternative voice cosmetics options, step 200 is performed.
In step 212, if the user chooses to not try again selecting and applying alternative voice cosmetics options, step 214 is performed.
Now referring to FIG. 3, a typical operation of the present invention from a voice cosmetics service operator's perspective is depicted.
In step 300, one or more providers 130 provide one or more voice cosmetics options to voice cosmetics service 140 preferably via the Internet.
In step 304, the one or more providers 130 present one or more of their voice cosmetics options of step 300 to a user 100 via a voice cosmetics application 108 implemented on a device used by user 100, the device remotely connected with voice cosmetics service 140.
In step 306, if the user 100 does not choose to apply any of the presented voice cosmetic options to his or her voice output, step 308 is performed.
In step 306 if the user 100 chooses to apply one or more of the presented voice cosmetic options to his or her voice output, step 310 is performed via the voice cosmetics application 108 as described above.
In optional step 312, if the user 100 chooses to have the voice cosmetic service 140 or voice cosmetic application 108 periodically monitor the effectiveness of the applied voice cosmetics options on her voice output, step 316 is periodically performed. The effectiveness of applied voice cosmetics options may change, for example, if the user 100 changes the position of his or her mouth relative to his or her voice-transmitting device or if ambient sound patterns change.
In optional step 312, if the user 100 chooses to not have her voice output periodically monitored, step 314 is performed.
In step 316, if, upon monitoring, the applied voice cosmetics options of user 100 are deemed ineffective, step 318 is performed, optimizing the applied voice cosmetics options of user 100 via voice cosmetics application 108. In step 316, if, upon monitoring, the applied voice cosmetics options are deemed effective, or once step 318 is finished, step 312 is performed.
FIG. 4 is a simplified mockup of a voice cosmetics application operated on a mobile device.
Voice cosmetics app 402, here representing voice cosmetics application 108 as depicted in FIG. 1, is run on a mobile device 400. In this simplified mockup, voice cosmetics app 402 allows a user 100 (not shown) to apply more or less of the following voice cosmetic options to the voice input of user 100 by moving respective sliders left or right and then tapping on apply button 416:
Slider 404: give the user's voice more or less 'sizzle', for example by amplifying or dampening the high-frequency part of the user's voice;
Slider 406: 'sugarcoat' the user's voice more or less, for example by adding 'air' and slightly increasing the pitch of the user's voice for a female user or adding 'air' and slightly decreasing the pitch of the user's voice for a male user;
Slider 408: make the user's voice sound more or less energetic, for example by applying compressor or expander effects;
Slider 410: make the user's voice sound more or less like the voice of a celebrity x, by analyzing the frequency, energy and pitch characteristics of the celebrity x's voice and then increasing or decreasing available voice cosmetics options applied to the user's voice in such a way that it sounds as similar as possible;
Slider 412: give the user's voice a higher or lower pitch.
Tapping on cancel button 414 allows user 100 to have voice cosmetics app 402 ignore unsaved selections.
In this example, voice cosmetics app 402 is applying the selected voice cosmetics by instructing the digital signal processor (not shown) of mobile device 400 to impact the voice input of user 100 in the desired way.
It will be understood that, in an alternative embodiment, voice cosmetics app 402 may directly impact the voice input of user 100 as selected without involving the digital signal processor of mobile device 400, for example, by directly impacting the digital byte stream representing the voice input of user 100 on mobile device 400 after it has been converted from analog to digital form and before it is transmitted to an audience (not shown).
It will be understood that, the voice cosmetics above represent only a subset of potential voice cosmetics options and there may be a multitude of alternative or additional voice cosmetics available to user 100 via voice cosmetics app 402.
Also, voice cosmetics may at least partly be applied by voice cosmetics app 402 based on one or more voice cosmetics components or respective instructions provided via one or more voice cosmetics services 140 (not shown).
In an alternative embodiment of the present invention, a voice cosmetics application 108 can be used in a stand-alone only fashion. In this case, the voice cosmetics application 108 is not connected with a voice cosmetics service 140 but is implemented as part of a device in such a way that it can locally apply at least one of the voice cosmetics components 132, 134, 136 and 138 to the voice input 106 of a user 100 when communicating with an audience 118.
In yet another embodiment of the present invention, a voice cosmetics service 140 may be enabled to access a voice output 116 of a user 100 while it is being transmitted to an audience 118, automatically analyze the voice output 116 and instruct a digital signal processor 114 that has an impact on voice output 116 to automatically apply or further optimize one or more voice cosmetics options on voice output 116 according to preset criteria.
Additionally, application of voice cosmetics options including preset criteria may at least partly depend on contextual information, such as, for example, the phone number of the audience 118 that user 100 is talking with, the service level of user 100 with a respective wireless carrier, the present location of user 100, the device type used by user 100, and the like.
This embodiment is particularly applicable if the voice cosmetics service 140 is operated directly by a wireless carrier, for example, as a paid premium service for a select group among its wireless customers.
FIG. 5 is a block diagram illustrating context sensitive aspects of the invented voice cosmetics system and method.
Device 510 contains voice cosmetics application 504 and digital signal processor 508. Device 510 may be any device that includes hardware, software or a combination thereof allowing user 500 to communicate with audience 516 via audio or video call, the device including, but not limited to, a smart phone, a tablet computer, a game console, a laptop computer, a personal computer and the like.
Voice cosmetics application 504 is enabled to listen into the ongoing communication stream 512 originating from user 500, to analyze the communication stream 512, and to instruct digital signal processor 508, via instructions 506, to alter the communication stream 512 in order to make user 500's voice sound more appealing to audience 516.
In addition, voice cosmetics application 504 is enabled to listen into the ongoing communication stream 514 originating from audience 516 and to analyze the communication stream 514 in order to detect contextual information that may be useful to make the user 500's voice sound more appealing to audience 516. Such contextual information may include, but is not limited to, frequency patterns helping to determine the predominant gender and current stress level of audience 516 and the like.
In case of audiovisual communication between user 500 and audience 516, voice cosmetics application 504 may be able to also analyze at least one video portion of communication streams 512 and 514, for example, to detect the predominant gender of audience 516, stress levels and the like using, for example, image recognition.
Based on at least a subset of the heretofore-described analyses, voice cosmetics application 504 may generate and send instructions 506 to the digital signal processor 508 in order to alter the communication stream 512 in such a way as to make the user 500's voice sound more appealing to audience 516.
Furthermore, user 500 may influence the operation of voice cosmetics application 504 via instructions 502, for example, by using a user interface as depicted in FIG. 6.
FIG. 6 depicts a simplified mockup of a context sensitive voice cosmetics application operated on a mobile device.
Mobile device 600 contains a voice cosmetics app 602 that allows a user (not shown) to control the effect of the voice cosmetics app 602 on his or her voice when communicating with an audience (not shown). In this example, the user is female.
Radio button group 603 allows the user to quickly switch between various voice cosmetics effects optimized for specific audiences. For example, if the user is about to call a predominantly male audience, she may select option 604, versus option 606 for a predominantly female audience, or option 608 for an audience that contains both genders.
It is an aspect of the present invention to allow saving specific voice cosmetics settings as part of contact information such as a phone number, name, unique ID and the like in order to have voice cosmetics app 602 automatically apply the same settings for future calls with the same audience.
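Saving settings as part of contact information, as described above, amounts to a store keyed by a phone number, name or other unique ID. The class, method names and the sample phone number below are hypothetical; only the contact-keyed save-and-recall behaviour comes from the disclosure.

```python
class SettingsStore:
    """Per-audience voice cosmetics settings, keyed by a contact identifier."""

    def __init__(self):
        self._by_contact = {}

    def save(self, contact_id: str, settings: dict) -> None:
        self._by_contact[contact_id] = dict(settings)

    def for_call(self, contact_id: str, defaults: dict = None) -> dict:
        """Return saved settings for this audience, else general defaults."""
        return self._by_contact.get(contact_id, defaults or {"preset": "mixed"})

store = SettingsStore()
# hypothetical contact and preset names
store.save("+49 151 0000000", {"preset": "male_audience", "intensity": 0.6})
settings = store.for_call("+49 151 0000000")
```

On a future call with the same audience, `for_call` restores the saved preset automatically; unknown callers fall back to the general setting.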
Additionally or alternatively, the user can select radio option 610. In this case, the voice cosmetics app 602 will ideally initially apply general setting 608, then try to automatically detect the predominant gender of the audience and apply best possible voice cosmetics to the user's voice accordingly in order to maximize the user voice's appeal among the audience.
In order to detect the predominant gender of the audience, voice cosmetics app 602 may, for example, access the voice stream originating from the audience, search for gender- specific attributes such as gender-specific frequency patterns and then apply optimized voice cosmetics settings accordingly to the voice stream originating from the user. This process may be performed once at the beginning of the conversation, periodically throughout the conversation, or a combination thereof.
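The gender-specific frequency heuristic just described can be sketched by estimating the fundamental frequency of the audience's voice stream and comparing it against typical adult ranges (roughly 85-180 Hz for male and 165-255 Hz for female speech). The zero-crossing estimator and the single cutoff value are simplifying assumptions; production code would use real pitch tracking such as autocorrelation or the YIN algorithm.

```python
import math

def estimate_f0(samples: list, sample_rate: float) -> float:
    """Crude fundamental-frequency estimate from positive-going zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)
    duration = len(samples) / sample_rate
    return crossings / duration if duration > 0 else 0.0

def predominant_gender(samples: list, sample_rate: float,
                       cutoff_hz: float = 165.0) -> str:
    """Classify a voice stream as predominantly male or female by pitch."""
    return "female" if estimate_f0(samples, sample_rate) >= cutoff_hz else "male"

# a synthetic 120 Hz tone standing in for a male voice stream
rate = 8000
tone = [math.sin(2 * math.pi * 120 * t / rate) for t in range(rate)]
```

As the disclosure notes, such a classification could run once at the start of a call, periodically, or both, feeding the result back into the choice of voice cosmetics settings.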
Optionally, by operating user interface control 612, the user may also select from a list of voice cosmetics effects designed to make the user's voice sound like the voice of a selected celebrity. By operating slider 614, the user may adjust the intensity of at least a subset of all currently active voice cosmetics effects.
Optionally, the user can tap button 616 in order to hear her voice with the selected effects applied.
Once the user is satisfied with her settings, she can either apply and save her settings for further use by tapping button 618, or cancel and go back to prior settings by tapping button 620. It will be understood that the present invention is not limited to the depicted effects and may also allow more than one set of settings to be saved, each related to one or more audiences.
FIG. 7 is a flowchart depicting a context sensitive operation of the present invention. In step 700, a call between a user and an audience is started utilizing a voice cosmetics application in the spirit of the present invention. The call may be an audio call or an audiovisual call. The audience may be one or more persons.
In step 702, the voice cosmetics application initially applies previously saved or otherwise available voice cosmetics settings or default settings in such a way that the user's voice stream sounds most appealing to the audience based on best prior knowledge. For example, if the user talked with the audience using the voice cosmetics application before, the voice cosmetics application may retrieve and apply respective previously saved voice cosmetic settings. The previously saved voice cosmetic settings may be linked, for example, to a phone number, video call number, or other unique ID relating to the audience.
Additionally or alternatively, the voice cosmetics application may have access to a voice cosmetics database or voice cosmetics service, such as depicted in FIG. 1, providing best possible voice cosmetic settings for a given phone number, video call number, unique audience ID and the like. In optional step 704, the voice cosmetics application may analyze the user's context, for example, by utilizing attached sensors 120 as depicted in FIG. 1. Such a context analysis may comprise, but is not limited to, measuring ambient noise levels around the user, measuring a user voice input level, measuring a current stress level of the user and the like.
Based on the results of such an analysis, the voice cosmetics application may then optimize the user voice cosmetics settings of step 702 further, for example, in order to counter ambient noise or a low user voice input level and the like.
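As one hypothetical way to "counter ambient noise or a low user voice input level", the application could measure the levels of the ambient noise and of the user's voice and derive a bounded make-up gain that keeps the voice a target margin above the noise floor. All names and the 15 dB / 12 dB figures below are illustrative assumptions:

```python
import math

def rms_dbfs(samples):
    """Level of a block of float samples (-1.0..1.0) in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-9))  # floor avoids log(0) on silence

def makeup_gain_db(voice_db, noise_db, target_snr_db=15.0, max_gain_db=12.0):
    """Gain needed so the voice sits target_snr_db above the ambient noise,
    clamped to [0, max_gain_db] so quiet rooms are never over-amplified."""
    needed = (noise_db + target_snr_db) - voice_db
    return min(max(needed, 0.0), max_gain_db)
```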
In step 706, if the predominant gender among the audience can be automatically detected by the voice cosmetics application, for example, by analyzing the audience's voice stream, step 708 is performed and voice cosmetics settings applied to the user's voice stream are optimized accordingly.
In step 706, if the predominant gender among the audience cannot be detected, or after step 708 has been performed, step 710 is performed.
In step 710, if the call has not ended as yet, step 706 is performed.
In step 710, if the call is about to end, optional step 712 is performed saving the latest applied voice cosmetics settings, for example, as part of a contact record related to the audience's phone number.
In step 714, the call is ended and voice cosmetic settings cease to be applied.
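The control flow of FIG. 7 can be sketched as a simple loop; the callables and the frame-based iteration below are hypothetical simplifications of steps 702 through 712:

```python
def run_call(audience_frames, detect_gender, default_setting="neutral"):
    """Sketch of the FIG. 7 loop: start from a default or saved setting
    (step 702), then repeatedly try to detect the audience's predominant
    gender (step 706), switching settings on a successful detection
    (step 708), until the stream ends (step 710).

    Returns the history of applied settings; the last entry is what
    step 712 would save, e.g. to a contact record.
    """
    history = [default_setting]                        # step 702
    for frame in audience_frames:                      # steps 706/710 loop
        gender = detect_gender(frame)                  # step 706; None = unknown
        if gender is not None and gender != history[-1]:
            history.append(gender)                     # step 708
    return history                                     # step 712: history[-1] saved
```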
FIG. 8A is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device, including but not limited to, a wired or wireless microphone.
Peripheral device 810 comprises a voice cosmetics application 804 and a digital signal processor 808. It will be understood that, as described above, voice cosmetics application 804 may be implemented as hardware, software, a combination thereof, or built directly into digital signal processor 808. Peripheral device 810 picks up the voice output 812 of user 800, converts it into digital information using digital signal processor 808, and transmits the resulting information to communication device 820, either wired or wirelessly.
It will be understood that peripheral device 810 is not limited to picking up the voice of user 800 by means of a microphone but may also do this by one or more other means including bone vibrations, optical recognition of mouth movements and the like.
Communication device 820, for example a smart phone, tablet or other computing device, is connected, wired or wirelessly, via connection 818 with an audience 830, for example via the Internet (not shown). Audience 830 comprises one or more users (not shown). As described above, voice cosmetics application 804 optimizes the appeal of user 800's voice output 812 via digital signal processor 808, preferably based on selections or instructions 802 originating from user 800.
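The per-block processing performed inside digital signal processor 808 before transmission can be modeled as a concatenation of simple operations. The two example effects below, a gain stage and a soft limiter, are illustrative stand-ins rather than the patent's actual voice cosmetics:

```python
import numpy as np

def make_chain(*effects):
    """Concatenate per-block effects (each a callable on a sample block)
    into a single operation applied left to right."""
    def chain(block):
        for effect in effects:
            block = effect(block)
        return block
    return chain

# Illustrative effects: a dB gain stage and a tanh soft limiter that
# keeps all output samples strictly inside [-1, 1].
def gain(db):
    factor = 10.0 ** (db / 20.0)
    return lambda block: block * factor

soften = np.tanh
```

With such a design, switching the active voice cosmetics amounts to rebuilding the chain, e.g. `make_chain(gain(6.0), soften)`, while the transmit path always calls one function per block.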
There are at least three key advantages of building voice cosmetics application 804 into a peripheral device 810: 1) consistently optimized voice appeal across different communication devices such as communication device 820; 2) communication devices that have no voice cosmetics built in can easily be augmented with voice cosmetics functionality; and 3) voice cosmetic outcomes can be improved by designing the voice cosmetics application in such a way that it uses the voice pickup technology in the peripheral device in the best possible way. It will be understood that, in a simplified embodiment, peripheral device 810 may not allow user 800 to influence the operation of voice cosmetics application 804, but rather use one or more preset instructions, or listen to the user's voice output 812 and automatically adapt its operation accordingly.
In the case of automatically adapting its operation, voice cosmetics application 804 may be able to automatically detect user 800's gender, stress level, current context and the like in order to optimize its operation.
In the depicted embodiment, peripheral device 810 is not involved in providing voice output 814 originating from audience 830 back to user 800.
FIG. 8B is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as a headset. In this case, in addition to the above-described embodiment of FIG. 8A, peripheral device 810 provides voice output 814 originating from audience 830 to user 800, for example via a loudspeaker. In the depicted case, voice cosmetics application 804 neither influences nor analyzes voice output 814 in order to optimize user 800's voice appeal.
FIG. 8C is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset.
In this case, in addition to the above-described embodiment of FIG. 8A, voice cosmetics application 804 may analyze voice output 814, for example to automatically detect a context or gender of audience 830, in order to further optimize the user 800's voice appeal as described above and in FIG. 7.
FIG. 8D is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device such as an advanced headset with an additional audience context input.
In this case, in addition to the above-described embodiment of FIG. 8C, voice cosmetics application 804 may receive additional contextual information 840 directly originating from the audience 830, or originating from one or more of their communication devices or peripherals thereof. In this case, the contextual information 840 is not a part of, or dependent upon, voice output 814.
Contextual information 840 may comprise, but is not limited to, the number of, the genders of, or other information regarding the individuals comprised in audience 830.
This embodiment allows voice cosmetics application 804, for example, to readily apply contextual information 840 to optimize its voice cosmetics operation without depending on analyzing voice output 814.
FIG. 8E is a block diagram illustrating an operation of a voice cosmetics system and method when comprised in a peripheral device as well as in a connected communication device. In this case, in addition to the above-described embodiment of FIG. 8C, communication device 820 comprises a voice cosmetics application 816 that may be operated by user 800 either in combination with, or separately from, voice cosmetics application 804.
This embodiment may, for example, allow the user 800 to remotely control or monitor at least a subset of the functionality of voice cosmetics application 804 from communication device 820 or combine voice cosmetics applications 804 and 816 in order to create a desired effect.
The inventions set forth above are subject to many modifications and changes without departing from the spirit, scope or essential characteristics thereof. Other embodiments of this invention will be obvious to those skilled in the art in view of the above disclosure. Thus, the embodiments explained above should be considered in all respects as illustrative rather than restrictive of the scope of the inventions as defined in the appended claims.
In summary, methods have been disclosed to make a voice sound more attractive to a particular audience.

Claims

1. A method of processing an audio stream of a voice of a natural human being for communication, wherein at least one target voice property considered to be appealing in a speaker is defined; a plurality of steps for processing the audio voice stream of the human being in a manner approximating said appealing target property without destroying the impression of the audience to listen to a natural voice is defined; a degree to which the target voice is to have said appealing voice property is determined; the plurality of processing steps associated with said determination is applied to the voice stream; and the processed voice stream is reproduced to the audience.
2. The method according to claim 1, wherein the processing of the voice audio stream is effected in real time.
3. The method according to one of the previous claims, wherein the processing of the audio voice stream is effected prior to transmission to a remote audience over a network.
4. The method according to one of the previous claims, wherein at least one of the plurality of steps for processing the audio voice stream is selected from the group comprising equalizing, limiting, modulation, flanging, adding chorus, adding reverb, adding air, changing pitch, adding 3D audio effects, compressing, expanding.
5. The method according to one of the previous claims, wherein the at least one target property is selected from
making the voice sound more feminine,
making the voice sound more masculine,
making the voice sound more energetic,
making the voice sound more soothing,
making the voice sound more sensual,
making the voice sound warmer,
making the voice sound more trustworthy,
making the voice sound more tender,
making the voice sound similar to the voice of a particular celebrity.
6. The method according to one of the previous claims, wherein at least one further property considered to be appealing in a speaker is defined; and for this at least one further property, a further plurality of steps for processing the audio voice stream of the human being for approximating the respective appealing target property without destroying the impression of the audience to listen to a natural voice is defined, a degree to which the target voice stream is to have the further property is defined, the plurality of processing steps associated with said determinations is applied to the voice stream in real time, and the processed voice stream is transmitted to the audience.
7. The method according to the previous claims, wherein when the plurality of processing steps associated with said determinations is applied to the voice stream in real time, at least some of these processing steps are effected as mathematical operations on a digitized voice stream and at least some of the mathematical operations are calculated by executing a combination obtained by concatenation thereof.
8. The method according to one of the previous claims, wherein the impression of the audience to listen to a natural voice stream is selected according to a statistical evaluation of a number of different pluralities of steps for processing the audio voice stream of the human being.
9. The method according to one of the previous claims, wherein each of said number of different pluralities of steps for processing the audio voice stream of the human being and/or the combination thereof observes voice transmission standards.
10. The method according to one of the previous claims, wherein the degree to which the target voice stream is to have the appealing voice property is determined according to at least one of
a user input,
a sensor means for analyzing at least one of
an audience gender, an audience behavior,
the social background of an audience.
11. The method according to one of the previous claims, wherein
the determination is made in response to at least one of
image analysis to identify males/females in the audience and /or the gender of a single person or few persons constituting the audience,
current stress level of audience,
result of a voice stress analysis in the audio voice stream of the speaker and/or voice stress analysis in the audio voice stream of the audience,
detected ambient sound levels/characteristics,
a physical location of the speaker,
local time of speaker or audience,
ambient temperature,
phone number,
area code of an audience called,
ID of audience,
service level of provider of a communication path,
device type of user,
keywords identified in the voice stream by speech recognition.
12. The method according to one of the previous claims, wherein the degree to which the target voice is to have said appealing voice property is changed during an ongoing communication in an automated manner relying on a change in the determination made.
13. The method according to one of the previous claims, wherein at least three different degrees to which the target voice is to have said appealing voice property have been defined and the number of effects and/or their relative strength changes at least with one of the different degrees.
14. A data set usable in executing a method of one of the previous claims, in particular a personalized data set.
15. A device for implementing the method according to any of the previous claims, in particular selected from mobile communication network base stations, smartphones, tablet computers, game consoles, laptop computers, personal computers, switchboards, wearables, glasses, peripherals such as microphones, headphones and/or headsets.
PCT/EP2014/078754 2014-01-03 2014-12-19 Method of improving the human voice WO2015101523A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201461923370P 2014-01-03 2014-01-03
US61/923,370 2014-01-03
US201461983167P 2014-04-23 2014-04-23
US61/983,167 2014-04-23

Publications (1)

Publication Number Publication Date
WO2015101523A1 (en) 2015-07-09

Family

ID=52130267

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/078754 WO2015101523A1 (en) 2014-01-03 2014-12-19 Method of improving the human voice

Country Status (1)

Country Link
WO (1) WO2015101523A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4145444A1 (en) * 2021-09-07 2023-03-08 Avaya Management L.P. Optimizing interaction results using ai-guided manipulated speech
US11652921B2 (en) 2020-08-26 2023-05-16 Avaya Management L.P. Contact center of celebrities

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004252085A (en) * 2003-02-19 2004-09-09 Fujitsu Ltd System and program for voice conversion
US7003462B2 (en) * 2000-07-13 2006-02-21 Rockwell Electronic Commerce Technologies, Llc Voice filter for normalizing an agent's emotional response
US20080044048A1 (en) * 2007-09-06 2008-02-21 Massachusetts Institute Of Technology Modification of voice waveforms to change social signaling
JP2008145994A (en) * 2006-12-05 2008-06-26 Mitsumasa Kobayashi Voice conversion management system and voice conversion managing program
US20100195812A1 (en) * 2009-02-05 2010-08-05 Microsoft Corporation Audio transforms in connection with multiparty communication
US20110264453A1 (en) * 2008-12-19 2011-10-27 Koninklijke Philips Electronics N.V. Method and system for adapting communications
US20120016674A1 (en) * 2010-07-16 2012-01-19 International Business Machines Corporation Modification of Speech Quality in Conversations Over Voice Channels
US8457688B2 (en) * 2009-02-26 2013-06-04 Research In Motion Limited Mobile wireless communications device with voice alteration and related methods
CN103489451A (en) * 2012-06-13 2014-01-01 百度在线网络技术(北京)有限公司 Voice processing method of mobile terminal and mobile terminal




Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14815378; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14815378; Country of ref document: EP; Kind code of ref document: A1)