CN115424603A - Voice generation method and device, electronic equipment and storage medium - Google Patents

Voice generation method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN115424603A
CN115424603A (application CN202211056690.3A)
Authority
CN
China
Prior art keywords
voice
objects
initial
speech
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211056690.3A
Other languages
Chinese (zh)
Inventor
张孝东
黄艺超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202211056690.3A priority Critical patent/CN115424603A/en
Publication of CN115424603A publication Critical patent/CN115424603A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/04: Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H04L 51/046: Interoperability with other network applications or services
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00: Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/06: Selective distribution of broadcast services, e.g. multimedia broadcast multicast service [MBMS]; Services to user groups; One-way selective calling services
    • H04W 4/08: User group management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a voice generation method and apparatus, an electronic device, and a storage medium, and belongs to the field of communication technology. The method includes the following steps: determining a plurality of objects based on an initial voice; acquiring second voices, where different objects among the plurality of objects are associated with different second voices; and generating third voices respectively associated with the plurality of objects based on a first voice in the initial voice and the second voices respectively associated with the plurality of objects. The voice content of each third voice includes: the first voice content of the first voice, and the second voice content of the second voice associated with the same object as that third voice.

Description

Voice generation method and device, electronic equipment and storage medium
Technical Field
The present application belongs to the field of communication technologies, and in particular, to a method and an apparatus for generating speech, an electronic device, and a storage medium.
Background
When a mobile terminal user uses a terminal device and needs to socialize, the user generally communicates through a social application, and in such communication, voice has become a widely used channel for transmitting information.
When a mobile terminal user needs to send voices whose contents are mostly the same and differ only in part to a plurality of different contacts, the user can only record the voice message once for each contact; voices that differ only in part of their content cannot be generated quickly and conveniently in batch.
Disclosure of Invention
Embodiments of the present application provide a voice generation method and apparatus, an electronic device, and a storage medium, which can solve the problem that voices differing only in part of their content cannot be generated quickly and conveniently in batch.
In a first aspect, an embodiment of the present application provides a speech generation method, where the method includes:
determining a plurality of objects based on an initial voice;
acquiring second voices, where different objects among the plurality of objects are associated with different second voices;
generating third voices respectively associated with the plurality of objects based on a first voice in the initial voice and the second voices respectively associated with the plurality of objects;
where the voice content of each third voice includes: the first voice content of the first voice, and the second voice content of the second voice associated with the same object as that third voice.
In a second aspect, an embodiment of the present application provides a speech generating apparatus, including:
a first determination module, configured to determine a plurality of objects based on an initial voice;
a first acquisition module, configured to acquire second voices, where different objects among the plurality of objects are associated with different second voices;
a first generation module, configured to generate third voices respectively associated with the plurality of objects based on a first voice in the initial voice and the second voices respectively associated with the plurality of objects;
where the voice content of each third voice includes: the first voice content of the first voice, and the second voice content of the second voice associated with the same object as that third voice.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor and a memory, where the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, which is stored in a storage medium and executed by at least one processor to implement the method according to the first aspect.
In the embodiments of the present application, when it is determined that the acquired initial voice needs to be sent to a plurality of objects, third voices respectively associated with the objects are generated based on a first voice in the initial voice and the second voices respectively associated with the objects. Each generated third voice therefore includes the same first voice content of the first voice together with the second voice content that differs between objects, so that a plurality of voices differing only in part are generated quickly and conveniently.
Drawings
FIG. 1 is a flowchart of a voice generation method provided by an embodiment of the present application;
FIG. 2 is a first schematic diagram of a voice cache interface provided in an embodiment of the present application;
FIG. 3 is a second schematic diagram of a voice cache interface provided in an embodiment of the present application;
FIG. 4 is a third schematic diagram of a voice cache interface provided in an embodiment of the present application;
FIG. 5 is a fourth schematic diagram of a voice cache interface provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a voice generating apparatus provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application can be practiced in orders other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are generally of one class, and these terms do not limit the number of objects; for example, the first object may be one object or a plurality of objects. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The speech generation method, apparatus, electronic device and storage medium provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings by specific embodiments and application scenarios thereof.
FIG. 1 is a schematic flowchart of a voice generation method provided in an embodiment of the present application; as shown in FIG. 1, the voice generation method includes:
step 100, determining a plurality of objects based on initial voice;
alternatively, one object may be a contact, and different objects may be different contacts, for example, multiple objects may include contact a, contact b, contact c, and contact d;
alternatively, one object may be a chat group, and different objects may be different chat groups, for example, the plurality of objects may include chat group A1, chat group A2, chat group A3, chat group A4, and chat group A5;
optionally, one object may be one or more contacts corresponding to one tag, and different objects may be contacts corresponding to different tags, for example, multiple objects may include all contacts corresponding to the tag "family" and all contacts corresponding to the tag "co-worker";
optionally, a speech input of an initial speech of the user may be received;
optionally, the first voice may be obtained from the initial voice;
alternatively, the first voice may be the same part in a plurality of voice messages that need to be sent to a plurality of objects, respectively;
optionally, when a user needs to send a voice with most of the same content and only part of the different content to a plurality of different contacts or different chat groups or contacts corresponding to different tags, a plurality of objects may be determined based on the initial voice first;
optionally, the method may further include acquiring a first voice based on the initial voice, where the first voice includes contents of the same part in voice messages that need to be sent to different objects, respectively;
step 110, obtaining a second voice, wherein different objects in the plurality of objects are associated with different second voices;
optionally, the voice contents of the voices to be sent to different objects are partially, but not completely, the same; the differentiated voice content may be the second voice content of the second voice respectively associated with each object;
optionally, second voices respectively associated with the plurality of objects may be respectively acquired;
step 120, generating third voices respectively associated with the plurality of objects based on the first voice in the initial voice and the second voices respectively associated with the plurality of objects;
the voice content of the third voice includes: the first voice content of the first voice and the second voice content of the second voice which is related to the same object with the third voice.
Optionally, the third voices respectively associated with the plurality of objects may be generated based on the first voice and the second voices respectively associated with the plurality of objects, where all the third voices include the same first voice content of the first voice, and each third voice further includes the second voice content of the second voice associated with the same object as that third voice.
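As a minimal illustration of step 120 (a sketch only: the function and object names are assumptions, and plain text stands in for recorded voice content), each third voice pairs the shared first voice content with the second voice content of the same object, here spoken as an address before the shared content:

```python
# Hypothetical sketch of step 120: combine the shared first voice content
# with each object's second voice content to form the third voices.
def make_third_voices(first_content: str, second_by_object: dict) -> dict:
    """second_by_object maps each object to its second voice content
    (here, the addressee's name spoken before the shared content)."""
    return {obj: f"{name}, {first_content}"
            for obj, name in second_by_object.items()}

third = make_third_voices("how are you", {"contact_a": "A", "contact_b": "B"})
# every third voice shares first_content; only the per-object part differs
```

A real implementation would splice audio segments rather than strings, but the per-object pairing logic is the same.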
Optionally, the second voice content may be the user's form of address (for example, a name) for the object associated with the second voice;
For example, the plurality of objects may include contact A, contact B, contact C, and contact D, and the voice contents of the plurality of third voices may include:
The voice content of the third voice associated with contact A: "A, how are you", where "A" is the second voice content of the second voice associated with contact A, and "how are you" is the first voice content of the first voice;
The voice content of the third voice associated with contact B: "B, how are you", where "B" is the second voice content of the second voice associated with contact B, and "how are you" is the first voice content of the first voice;
The voice content of the third voice associated with contact C: "C, how are you", where "C" is the second voice content of the second voice associated with contact C, and "how are you" is the first voice content of the first voice;
The voice content of the third voice associated with contact D: "D, how are you", where "D" is the second voice content of the second voice associated with contact D, and "how are you" is the first voice content of the first voice.
Optionally, the second voice content of the second voice may be content, such as an address or a time, for a user to make a targeted inquiry or notification to an object associated with the second voice;
for example, the plurality of objects may include contact a, contact b, contact c, and contact d, and the voice content of the plurality of third voices may include:
the voice content of the third voice associated with contact a: "we go to cinema together and are good, wherein" cinema "is the second voice content of the second voice associated with contact A," we go to \8230togetherand are the first voice content of the first voice, wherein ".." can represent a blank voice segment which is used for being replaced by the second voice to generate a third voice, and other embodiments have the same principle and are not described again;
the voice content of the third voice associated with contact b: "do we go to the library together," where "library" is the second phonetic content of the second phonetic associated with contact b, "go together" \8230, do is the first phonetic content of the first phonetic;
voice content of the third voice associated with contact c: "do we go to mall together," wherein "mall" is the second speech content of the second speech associated with contact C, "go together" \8230, do is the first speech content of the first speech;
the voice content of the third voice associated with the contact person: "do we go to the restroom together," where "restroom" is the second phonetic content of the second voice associated with the contact person, and "do we go together \8230and" do "is the first phonetic content of the first voice.
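At the signal level, replacing a blank voice segment amounts to splicing sample sequences (a simplification: a real implementation would work on encoded audio with consistent sample rates; the function name and integer "samples" below are purely illustrative):

```python
def splice_audio(first_samples, blank_start, blank_end, second_samples):
    """Replace the blank segment [blank_start, blank_end) of the first
    voice's samples with the second voice's samples, yielding the
    third voice's samples."""
    return first_samples[:blank_start] + second_samples + first_samples[blank_end:]

# A first voice with a 3-sample silent gap at positions 2..5, filled with
# a 2-sample second voice:
out = splice_audio([1, 2, 0, 0, 0, 3], 2, 5, [7, 8])
```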
Illustratively, the plurality of objects may include chat group A1, chat group A2, chat group A3, chat group A4, and chat group A5.
Optionally, the voice content of the third voice associated with chat group A1: "is your paper finished", where "paper" is the second voice content of the second voice associated with chat group A1, and "is your … finished" is the first voice content of the first voice;
The voice content of the third voice associated with chat group A2: "is your job finished", where "job" is the second voice content of the second voice associated with chat group A2, and "is your … finished" is the first voice content of the first voice;
The voice content of the third voice associated with chat group A3: "is your PPT finished", where "PPT" is the second voice content of the second voice associated with chat group A3, and "is your … finished" is the first voice content of the first voice;
The voice content of the third voice associated with chat group A4: "are your defense materials finished", where "defense materials" is the second voice content of the second voice associated with chat group A4, and "is your … finished" is the first voice content of the first voice;
The voice content of the third voice associated with chat group A5: "is your test paper finished", where "test paper" is the second voice content of the second voice associated with chat group A5, and "is your … finished" is the first voice content of the first voice.
For example, the plurality of objects may include all contacts corresponding to the tag "family" and all contacts corresponding to the tag "colleague";
The voice content of the third voice associated with all contacts corresponding to the tag "family": "shall we organize a family dinner on the weekend", where "the weekend" and "family dinner" are the second voice contents of the second voice associated with all contacts corresponding to the tag "family", and "shall we organize a … on …" is the first voice content of the first voice;
The voice content of the third voice associated with all contacts corresponding to the tag "colleague": "shall we organize a team-building activity on Tuesday", where "Tuesday" and "team-building activity" are the second voice contents of the second voice associated with all contacts corresponding to the tag "colleague", and "shall we organize a … on …" is the first voice content of the first voice.
Optionally, while a chat page of the user is displayed, for example a chat page of a one-to-one chat or a group chat, the user may send a voice message in the chat. If, according to the conversation scenario, the user needs to send voice messages to a plurality of contacts whose contents are mostly the same but differ in a content segment at a certain position, the initial voice may be acquired, a plurality of objects (the sending targets of the voices) may be determined based on the initial voice, the first voice may be acquired from the initial voice, and a plurality of third voices may be generated based on the first voice and the second voices respectively associated with the plurality of objects.
Alternatively, the difference portion between the speech contents of the plurality of third speeches may be at any position of the beginning, middle, or end of the third speech.
Optionally, a third voice being associated with an object may indicate that, after being generated, the third voice can be sent to that object.
In the embodiments of the present application, when it is determined that the acquired initial voice needs to be sent to a plurality of objects, third voices respectively associated with the objects are generated based on a first voice in the initial voice and the second voices respectively associated with the objects. Each generated third voice therefore includes the same first voice content of the first voice together with the second voice content that differs between objects, so that a plurality of voices differing only in part are generated quickly and conveniently.
Optionally, the determining a plurality of objects based on the initial voice includes at least one of:
determining the plurality of objects based on an object indicated by first voice information in the initial voice content of the initial voice;
or
determining the plurality of objects based on second voice contents included in the initial voice content;
or
receiving a first input of a user;
determining the plurality of objects in response to the first input, where the first input includes a user operation indicating the plurality of objects.
Optionally, when it is determined that the transmission target corresponding to the initial voice includes a plurality of objects, the recorded initial voice need not be sent immediately; instead, the first voice may be acquired from the initial voice.
Optionally, after the first voice is obtained, a control of the first voice may be displayed on the voice cache interface.
Optionally, the user may indicate by voice that the transmission target corresponding to the first voice includes a plurality of objects. For example, when recording the initial voice and reaching the position of the differentiated voice content (i.e., the position where the second voice needs to be inserted), the user may speak the first voice information "leave voice position" and then continue to record the remaining voice content.
Optionally, an input of recording an initial voice by a user may be received, and in a case that it is determined that the initial voice content includes first voice information, it is determined that a transmission target corresponding to the first voice includes a plurality of objects, and the plurality of objects are determined.
Alternatively, a location where the second voice needs to be added to the first voice may be determined based on the first voice information.
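One way to realize this trigger is to scan a speech-recognition transcript of the initial voice for the keyword phrase and record where it occurs (the marker phrase is taken from the example above; the function name and transcript-based approach are assumptions, and a real system would map the transcript position back to an audio timestamp):

```python
def find_insertion_marker(transcript: str,
                          marker: str = "leave voice position"):
    """Return the character index in the transcript where the first
    voice information (the marker phrase) occurs, or None if the
    initial voice contains no marker."""
    idx = transcript.find(marker)
    return idx if idx != -1 else None

pos = find_insertion_marker("hello leave voice position shall we meet")
```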
Optionally, the user may indicate through the first input that the transmission target corresponding to the first voice includes a plurality of objects. For example, when recording the initial voice and reaching the position of the differentiated voice content (i.e., the position where the second voice needs to be inserted), the user may perform the first input (and may also leave a blank voice segment in the initial voice while performing the first input), and then continue to record the remaining voice content; the blank voice segment is used to be replaced by the second voice to generate the third voice;
optionally, a first input may be received, and in response to the first input, it may be determined that a transmission target corresponding to the first voice includes a plurality of objects, and the plurality of objects may be determined;
optionally, the position of the second voice needing to be added into the first voice can be determined based on the blank voice segment;
Optionally, FIG. 2 is a first schematic diagram of a voice cache interface provided in an embodiment of the present application. As shown in FIG. 2, the voice cache interface displaying the control of the first voice may include a specific function button, and the first input may be an operation of the user clicking the specific function button;
optionally, the first input may be an operation of a special gesture of the user, and after receiving the operation of the special gesture, the electronic device may determine that the sending target corresponding to the first voice includes multiple objects, and determine the multiple objects;
alternatively, the special gesture may be a slide operation that slides up from the bottom of the screen to the middle of the screen.
Alternatively, the first input may be an operation of long-pressing a special area in the screen, such as an operation of long-pressing a home key area in the screen, and further, when the long-pressing time is greater than a first preset value and/or the long-pressing time is less than a second preset value, it may be determined that the transmission target corresponding to the first voice includes a plurality of objects, and the plurality of objects may be determined.
Alternatively, the first input may be an operation on a special region of the body of the electronic device, such as a home key region of the body of the electronic device, and the long press time is greater than the third preset value and/or the long press time is less than the fourth preset value, and after receiving the long press operation, the electronic device may determine that the transmission target corresponding to the first voice includes a plurality of objects.
Optionally, the first input may be a long press input to any region of the electronic device, for example, a long press input to any region of a screen or any region of a body, and a value range of the long press time is within a preset value range, for example, greater than 3 seconds, or greater than 3 seconds and less than 7 seconds, which is not limited in this embodiment; after the electronic device receives the long press input, it can be determined that the transmission target corresponding to the first voice includes a plurality of objects.
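The long-press checks described in the preceding paragraphs amount to testing the press duration against preset bounds. A sketch (the 3-second and 7-second thresholds are the examples given in the text; the function name is an assumption):

```python
import math

def is_multi_object_trigger(press_seconds: float,
                            min_s: float = 3.0,
                            max_s: float = math.inf) -> bool:
    """Return True if a long press indicates that the transmission
    target of the first voice includes a plurality of objects: the
    press must last longer than min_s and less than max_s.

    The default max_s of math.inf models an open-ended range such as
    "greater than 3 seconds"; pass max_s=7.0 for "greater than 3
    seconds and less than 7 seconds"."""
    return min_s < press_seconds < max_s
```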
Optionally, when recording the initial voice and reaching the position of the differentiated voice contents (i.e., the position where the second voices need to be inserted), the user may perform the first input and, while performing the first input, speak the second voices respectively corresponding to all of the objects in a certain order so that they are recorded into the initial voice, and then continue to record the remaining voice content.
For example, if a first input is received and the differentiated voice content "Xiao Ming, Xiao Hong, Xiao Liang" is received during execution of the first input, so that the initial voice content includes: "Xiao Ming, Xiao Hong, Xiao Liang, shall we see a movie this week", it may be determined that the content of the initial voice other than the differentiated voice content is the first voice content of the first voice, and it may also be determined that the transmission target corresponding to the first voice includes a plurality of objects, namely Xiao Ming, Xiao Hong, and Xiao Liang.
Optionally, the position at which a second voice needs to be added to the first voice may be determined based on the positions, in the initial voice, of the second voice contents of the second voices respectively corresponding to the plurality of objects; this position may be referred to as a first position;
Optionally, when the user records the initial voice and reaches the position of the differentiated voice content (i.e., the position where the second voices need to be inserted), the user may speak the first voice information "the differentiated voice content starts", then speak the second voices corresponding to all of the objects in a certain order so that they are recorded into the initial voice, then speak the first voice information "the differentiated voice content ends", and then continue to record the remaining voice content;
Optionally, the second voices corresponding to the objects in the initial voice may be distinguished by information such as a blank or a pause between every two second voices;
Optionally, it may be determined that the transmission target corresponding to the first voice includes a plurality of objects based on the first voice information "the differentiated voice content starts" and "the differentiated voice content ends", and based on the initial voice content including the second voice contents respectively corresponding to the plurality of objects;
Optionally, the position at which each second voice is inserted into the first voice may be determined based on the positions, in the initial voice, of the first voice information "the differentiated voice content starts" and "the differentiated voice content ends", or of the second voices respectively corresponding to the plurality of objects.
For example, if the received initial voice content is: "the differentiated voice content starts, Xiao Ming, Xiao Hong, Xiao Liang, the differentiated voice content ends, shall we see a movie this week", it may be determined that the voice content remaining after the differentiated voice content and the first voice information are removed from the initial voice is the first voice content of the first voice, and it may also be determined that the transmission target corresponding to the first voice includes a plurality of objects.
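Parsing the spoken start/end markers into a first voice plus per-object second voices can be sketched as follows (a transcript-level simplification: the marker strings are translated from the example, pauses between second voices are modeled as commas, and all names are illustrative):

```python
def split_initial_voice(transcript: str,
                        start_marker: str = "the differentiated voice content starts",
                        end_marker: str = "the differentiated voice content ends"):
    """Split a transcript of the initial voice into
    (first voice content, [second voice contents]).

    The differentiated block between the two markers is split on
    commas, standing in for the pauses between recorded second voices."""
    head, rest = transcript.split(start_marker, 1)
    block, tail = rest.split(end_marker, 1)
    second_contents = [s.strip() for s in block.split(",") if s.strip()]
    first_content = (head + tail).replace("  ", " ").strip()
    return first_content, second_contents

first, seconds = split_initial_voice(
    "the differentiated voice content starts Xiao Ming, Xiao Hong, Xiao Liang "
    "the differentiated voice content ends shall we see a movie this week"
)
```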
Optionally, a piece of first speech may be fixedly cached, for any one object in the plurality of objects, a second speech associated with the object is added to the first speech, a third speech associated with the object is generated and cached (a control of the third speech may also be displayed on the speech caching interface), then a third speech associated with a next object is generated based on the fixedly cached first speech, and so on until third speech corresponding to each object is generated.
Optionally, if there are N objects in total, N pieces of first speech may be obtained through copying and pasting, each piece corresponds to one object, for any one object in the multiple objects, the second speech associated with the object may be added to the first speech corresponding to the object, and the third speech associated with the object is generated, and so on until the third speech corresponding to each object is generated, and part or all of the generated controls of the third speech may be displayed on the speech cache interface.
Optionally, FIG. 3 is a second schematic diagram of the voice cache interface provided in an embodiment of the present application. As shown in FIG. 3, obtaining the N pieces of first voice by copying and pasting may be: receiving a long-press operation of the user on the control of the first voice; displaying a plurality of candidate options in response to the long-press operation, the candidate options including a "copy" option; receiving a selection operation of the user on the "copy" option; displaying an input box for the number of copies on the voice cache interface in response to the selection operation; receiving an operation of the user entering a specific number N in the input box; and obtaining N pieces of the first voice by copying and pasting in response to that operation.
Alternatively, obtaining the N pieces of first voice by copying and pasting may be: receiving a long-press operation of the user on the control of the first voice; displaying a plurality of candidate options in response to the long-press operation, the candidate options including a "copy" option; receiving a selection operation of the user on the "copy" option; setting the first voice as the voice to be pasted in response to the selection operation; receiving an operation of the user selecting a paste position on the voice cache interface or another interface; displaying a control of the pasted first voice at the position selected by the user in response to that operation; and so on, until N pieces of the first voice are obtained by copying and pasting.
Optionally, for any one object in the plurality of objects, the second voice associated with the object may be added to the first voice to generate the third voice associated with the object, so that each third voice includes the same first voice content and a second voice that differs between the objects.
Alternatively, if only one joining position for the second voice is included in a first voice, the second voice associated with a certain object can be automatically inserted at the joining position in that first voice (if a blank voice segment exists at the joining position, the blank voice segment is replaced by the second voice) to generate the third voice associated with the object, and so on. It will be appreciated that N different second voices may be automatically assigned, one by one, to N identical first voices, with each first voice then joined to the second voice assigned to it.
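A hedged sketch of the blank-segment rule above: if a blank segment marks the joining position it is replaced by the second voice; otherwise the second voice is inserted at an assumed default position (the front), an assumption of this sketch rather than a requirement of the embodiment:

```python
BLANK = ""  # stand-in for a blank voice segment at the joining position

def add_second_voice(first_segments, second):
    """Replace the blank segment with the second voice if one exists,
    otherwise insert the second voice at the front of the message."""
    out = list(first_segments)
    if BLANK in out:
        out[out.index(BLANK)] = second
    else:
        out.insert(0, second)
    return out

joined = add_second_voice(["hello ", BLANK, ", see you soon"], "Zhang San")
```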
Optionally, for any one of the plurality of objects, after the second voice associated with the object is added to the first voice and the third voice associated with the object is generated, a control of the third voice may be displayed, and the text content of the second voice associated with the object, or an identifier of the object, may be displayed in a region corresponding to the control of the third voice, so that the user can clearly distinguish the plurality of third voices, or clearly know the object or second voice content to which each third voice corresponds.
Optionally, the region corresponding to the control of the third voice may be in the middle of or inside the control of the third voice, at or just outside the edge of the control, or a region connected to the control of the third voice, which is not limited in this embodiment of the present application.
Alternatively, any scheme that enables a user to clearly distinguish the plurality of third voices, or to clearly know the object or second voice corresponding to each third voice, is applicable to the embodiments of the present application.
Optionally, the generating, based on the first voice in the initial voice and the second voices respectively associated with the plurality of objects, third voices respectively associated with the plurality of objects includes:
for any object in the plurality of objects, receiving a second input of a user, wherein the second input comprises an operation of the user to move a control corresponding to the object to a first area, the control corresponding to the object is a control for identifying the object or a control corresponding to a second voice associated with the object, the first area is an area corresponding to a first position in the first voice, and the first position is used for indicating a joining position of the second voice;
in response to the second input, adding a second voice associated with the object to the first location, resulting in the third voice associated with the object.
Optionally, for any one of the plurality of objects, after the second voice associated with the object is acquired, a control of the second voice associated with the object may be displayed in a voice cache interface or other interface of the electronic device, an operation of a user moving the control of the second voice associated with the object to the first area may be received, and in response to the operation, the second voice is added to the first position of the first voice.
Fig. 4 is a third schematic diagram of the voice cache interface provided in the embodiment of the present application. As shown in fig. 4, the interface includes a control of the second voice associated with the object 'Zhang San' and a control of the second voice associated with the object 'Li Si'. An operation of the user moving the control of the second voice associated with 'Zhang San' to the region corresponding to the first position of the first voice 1 may be received, and in response to the operation, the second voice associated with 'Zhang San' is added at the first position of the first voice 1; an operation of the user moving the control of the second voice associated with 'Li Si' to the region corresponding to the first position of the first voice 2 may be received, and in response to the operation, the second voice associated with 'Li Si' is added at the first position of the first voice 2;
optionally, the contact's avatar may be displayed at the location of the corresponding voice control.
Alternatively, the second voice may be recorded separately by the user before recording the initial voice, recorded together with the initial voice while recording it, or recorded separately after recording the initial voice;
optionally, when the differentiated content is the name by which the user addresses each of the plurality of objects, for any one of the plurality of objects, a control corresponding to the object may be displayed in the voice cache interface or another interface of the electronic device; for example, an operation of the user moving the control corresponding to the object to the first area corresponding to the first position may be received, and in response to the operation, the second voice is added at the first position of the first voice.
Fig. 5 is a fourth schematic diagram of a voice cache interface provided in an embodiment of the present application. As shown in fig. 5, a contact list is displayed, where the contact list includes a control corresponding to the object 'Zhang San' and a control corresponding to the object 'Li Si'. An operation of the user moving the control corresponding to 'Zhang San' to the region corresponding to the first position of the first voice 1 may be received, and in response to the operation, the nickname of 'Zhang San' is converted into voice and added at the first position of the first voice 1; an operation of the user moving the control corresponding to 'Li Si' to the region corresponding to the first position of the first voice 2 may be received, and in response to the operation, the nickname of 'Li Si' is converted into voice and added at the first position of the first voice 2.
Optionally, for any one of the plurality of objects, the control corresponding to the object may be an image identifier of the object, such as an avatar of the object in social software.
Optionally, for any one of the plurality of objects, the control corresponding to the object may be a nickname identification of the object, such as a nickname of the object in social software.
Optionally, for any one of the plurality of objects, the control corresponding to the object may be a remark identifier of the object, such as the remark the user has given the object in social software.
Optionally, for any object in the plurality of objects, the second voice corresponding to the object may be obtained by converting text into voice based on the nickname identifier of the object or the remark identifier of the object.
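The nickname-or-remark conversion could be sketched as follows; `text_to_speech` is a hypothetical stand-in (the embodiment does not name a concrete TTS engine), and preferring the remark over the nickname is an assumption of this sketch:

```python
def text_to_speech(text):
    """Hypothetical TTS stub; a real engine would return audio data."""
    return f"<speech:{text}>"

def second_voice_for(contact):
    """Build an object's second voice from its remark identifier if the
    user has set one, falling back to its nickname identifier."""
    label = contact.get("remark") or contact["nickname"]
    return text_to_speech(label)

v = second_voice_for({"nickname": "Zhang San", "remark": "Boss Zhang"})
```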
Optionally, the first area corresponding to the first position may be an area corresponding to the first position in the control of the first voice.
Optionally, for any object in the multiple objects, the second input may be an operation that the user selects and presses the control corresponding to the object for a long time in the group chat interface of the social software, and then moves the control to the first area corresponding to the first position.
Optionally, for any object in the plurality of objects, the second input may be an operation in which, after the electronic device displays a contact list of the social software, the user long-presses the control corresponding to the object in the contact list and, once the control has been pressed for a certain length of time, moves it to the first area corresponding to the first position.
Optionally, the electronic device may receive an operation of a user exchanging the positions of the object controls in two different third voices. For example, a control corresponding to the contact 'Zhang San' is displayed inside the control of a third voice A, and a control corresponding to the contact 'Li Si' is displayed inside the control of a third voice B; upon receiving an operation of the user exchanging the positions of the control corresponding to 'Zhang San' and the control corresponding to 'Li Si', a new third voice A' and a new third voice B' may be generated in response to the operation, where the new third voice A' no longer includes the voice content corresponding to 'Zhang San' but includes the voice content corresponding to 'Li Si', and the new third voice B' no longer includes the voice content corresponding to 'Li Si' but includes the voice content corresponding to 'Zhang San';
for example, the voice content of the third voice A includes "Zhang San, let's go to a movie together", and the voice content of the third voice B includes "Li Si, let's go to dinner together"; after receiving an operation of the user exchanging the position of the control corresponding to 'Zhang San' and the position of the control corresponding to 'Li Si', a new third voice A' and a new third voice B' may be generated in response to the operation, the new third voice A' including "Li Si, let's go to a movie together" and the new third voice B' including "Zhang San, let's go to dinner together".
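The exchange above can be sketched as swapping only the object segments while the shared first voice content stays in place (placing the object part at segment index 0 is an assumption of this sketch):

```python
def swap_objects(third_a, third_b, obj_index=0):
    """Exchange the object segments of two third voices, yielding the
    new third voices A' and B'; the originals are left unmodified."""
    new_a, new_b = list(third_a), list(third_b)
    new_a[obj_index], new_b[obj_index] = third_b[obj_index], third_a[obj_index]
    return new_a, new_b

a = ["Zhang San,", "let's go to a movie together"]
b = ["Li Si,", "let's go to dinner together"]
a2, b2 = swap_objects(a, b)
```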
Optionally, the control corresponding to the object may also be a control of one or more messages in a chat interface currently displayed by the electronic device, or in a chat interface displayed in the past. Accordingly, after receiving an operation of the user moving this control to the first region: if the one or more messages are voice messages, they may be used directly as the second voice, or key content may first be extracted from them and used as the second voice; if the one or more messages are text messages, their text content may be converted into voice and used as the second voice, or keywords may first be extracted from the text content and the keyword text converted into voice as the second voice.
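The four branches above (voice vs. text, with or without key-content extraction) can be sketched as a single helper; the extractor and the TTS stand-in are hypothetical:

```python
def second_voice_from_message(message, extract=None):
    """Derive a second voice from a chat message: reuse voice messages
    directly (optionally after key-content extraction); convert text
    messages (or their extracted keywords) to speech."""
    if message["type"] == "voice":
        audio = message["audio"]
        return extract(audio) if extract else audio
    text = message["text"]
    if extract:
        text = extract(text)
    return f"<speech:{text}>"  # hypothetical TTS stand-in

v1 = second_voice_from_message({"type": "voice", "audio": "<clip>"})
v2 = second_voice_from_message({"type": "text", "text": "meeting at nine"},
                               extract=lambda t: t.split()[0])
```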
Optionally, if a plurality of contacts are indicated in the initial voice, a plurality of voice controls may be generated according to the number of contacts in the voice.
Optionally, the generating, based on the first voice in the initial voice and the second voices respectively associated with the plurality of objects, third voices respectively associated with the plurality of objects includes:
under the condition that the initial voice comprises second voices respectively associated with a plurality of objects, intercepting the second voices respectively associated with the plurality of objects from the initial voice to obtain a first voice;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the positions of the second voices respectively associated with the plurality of objects in the initial voice.
Optionally, the second voices respectively corresponding to the plurality of objects may be clipped from the initial voice, yielding two parts: one part is the second voices respectively corresponding to the plurality of objects, that is, the differentiated voice, and the other part is the non-differentiated voice, that is, the first voice. The position of the differentiated voice in the initial voice may be mapped to the corresponding position in the first voice as the first position for adding a second voice. The second voices respectively associated with the plurality of objects may then be obtained, and for any one of the plurality of objects, the second voice associated with the object is added at the first position of the first voice to generate the third voice associated with the object.
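A sketch of this clipping embodiment, again modeling voices as segment lists; treating the differentiated content as a single segment at a known index is an assumption of the sketch:

```python
def split_initial_voice(initial_segments, diff_index):
    """Clip the differentiated (second-voice) segment out of the
    initial voice, returning the shared first voice plus the first
    position where per-object content is later re-joined."""
    first_voice = (initial_segments[:diff_index]
                   + initial_segments[diff_index + 1:])
    return first_voice, diff_index

def rejoin(first_voice, first_position, second):
    """Add an object's second voice at the remembered first position."""
    out = list(first_voice)
    out.insert(first_position, second)
    return out

initial = ["Zhang San,", "the meeting starts at nine"]
first, pos = split_initial_voice(initial, 0)
third_for_li_si = rejoin(first, pos, "Li Si,")
```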
Optionally, the generating, based on the first voice in the initial voice and the second voices respectively associated with the plurality of objects, third voices respectively associated with the plurality of objects includes:
under the condition that the initial voice content of the initial voice comprises first voice information, deleting a voice segment corresponding to the first voice information from the initial voice to obtain first voice;
and for any object in the plurality of objects, adding second voice associated with the object to a first position in first voice to obtain third voice associated with the object, wherein the first position is the same as the position of a voice fragment corresponding to the first voice information in the initial voice.
Optionally, the first voice information may be clipped from the initial voice, the remaining portion being the non-differentiated voice, that is, the first voice. The position of the first voice information in the initial voice may likewise be mapped to the corresponding position in the first voice as the first position for adding a second voice. The second voices respectively associated with the plurality of objects may then be acquired, and for any one of the plurality of objects, the second voice associated with the object is added at the first position of the first voice to generate the third voice associated with the object.
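This variant differs from the previous one only in how the cut point is found: the segment carrying the first voice information is located by its content (for example, a spoken cue such as "everyone"; the cue word here is illustrative only) rather than by a preselected index:

```python
def carve_placeholder(initial_segments, placeholder):
    """Delete the segment holding the first voice information and
    remember its position as the first position for later joining."""
    first_position = initial_segments.index(placeholder)
    first_voice = (initial_segments[:first_position]
                   + initial_segments[first_position + 1:])
    return first_voice, first_position

first, pos = carve_placeholder(
    ["everyone", ", dinner is at seven"], "everyone")
```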
Optionally, the voice content of the third voice further includes: a message reference identification referencing a history message, the message reference identification to associate the third voice with the history message.
Alternatively, the electronic device may receive an operation of associating the history message by the user, and insert a message reference identifier at a third position in the first voice or the third voice in response to the operation, for associating with the history message.
Alternatively, the operation of the user to associate the history message may be a special gesture or a click of a preset button.
Alternatively, the operation of associating the history message by the user may be an operation of inputting a message reference identifier by the user.
Alternatively, the message reference identifier may be an '@' symbol.
The third position may be determined similarly to the first position: based on a voice instruction of the user, based on an insertion operation indicated by the user, or based on a blank or pause position left by the user when recording the first voice.
Optionally, the associated history message may be highlighted in the chat interface, such as one or more of highlighted, bolded, underlined, or italicized.
Optionally, if the history message is a history voice message, the third position in the third voice is associated with the history voice message. After the object associated with the third voice receives the third voice, during playback, when playback reaches the third position, the device of that object may automatically play the history voice message, and may also display the chat interface or chat history interface where the history voice message is located;
optionally, if the history message is a history text message, the third position in the third voice is associated with the history text message. After the object associated with the third voice receives the third voice, during playback, when playback reaches the third position, the device of that object may automatically play the voice content generated by automatically converting the history text message, and may also display the chat interface or chat history interface where the history text message is located.
Optionally, during the playing of the third voice and/or when playing to a third position in the third voice, the electronic device displays the history message in a highlighted manner in the chat interface or the chat history recording interface where the history message is located, such as one or more of highlighting, bolding a text or control border, underlining a text or control, or italicizing a text.
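The playback trigger described above can be sketched as follows; `on_play` stands in for whatever audio output the device uses, and the message shapes are assumptions of this sketch:

```python
def play_third_voice(segments, third_position, history, on_play):
    """Play a third voice segment by segment; on reaching the third
    position, also play the associated history message (text history
    messages are first converted to speech)."""
    for i, seg in enumerate(segments):
        on_play(seg)
        if i == third_position:
            if history["type"] == "voice":
                on_play(history["audio"])
            else:
                on_play(f"<speech:{history['text']}>")  # TTS stand-in

played = []
play_third_voice(["hi", "@", "bye"], 1,
                 {"type": "text", "text": "old note"}, played.append)
```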
Alternatively, the electronic device may receive an operation of the user inserting or associating a history message and, in response to the operation: if the history message is a voice message, add it at the second position of the first voice; if the history message is a text message, convert its text content into voice content and add that voice content at the second position of the first voice.
Optionally, the operation of the user to insert or associate the history message may be a special gesture or clicking a preset button;
the second position may be determined based on the voice indication of the user, or based on the operation of inserting the history message indicated by the user, or based on the blank position or pause position of the user when recording the first voice.
In the embodiment of the application, when it is determined that the obtained initial voice needs to be sent to a plurality of objects, a third voice associated with each object is generated based on the first voice in the initial voice and the second voice associated with that object, so that each generated third voice includes the same first voice content of the first voice and second voice content that differs between the objects, and a plurality of voices differing only in part are generated quickly and conveniently.
According to the voice generating method provided by the embodiment of the application, the execution subject may be a voice generating device. The embodiment of the present application takes a voice generating device executing the voice generating method as an example to describe the voice generating device provided in the embodiment of the present application.
Fig. 6 is a schematic structural diagram of a speech generating apparatus according to an embodiment of the present application, and as shown in fig. 6, the speech generating apparatus 600 includes: a determining module 610, an obtaining module 620 and a generating module 630; wherein:
the determining module 610 is configured to determine a plurality of objects based on the initial speech;
the obtaining module 620 is configured to obtain a second voice, where different objects in the multiple objects are associated with different second voices;
the generating module 630 is configured to generate third voices respectively associated with the plurality of objects based on the first voice in the initial voice and the second voices respectively associated with the plurality of objects;
the voice content of the third voice includes: and the first voice content of the first voice and the second voice content of the second voice which is related to the same object with the third voice.
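Purely as an illustration of how the three modules of Fig. 6 chain together (the data shapes and the second-voice construction here are assumptions of this sketch, not the claimed device):

```python
class SpeechGenerator:
    """Toy pipeline mirroring Fig. 6: determining module 610,
    obtaining module 620, generating module 630."""

    def determine_objects(self, initial_voice):          # module 610
        return list(initial_voice["objects"])

    def obtain_second_voices(self, objects):             # module 620
        return {o: f"{o}," for o in objects}

    def generate(self, initial_voice):                   # module 630
        objects = self.determine_objects(initial_voice)
        seconds = self.obtain_second_voices(objects)
        first = initial_voice["first"]
        return {o: [seconds[o]] + first for o in objects}

gen = SpeechGenerator()
thirds = gen.generate({"objects": ["Zhang San", "Li Si"],
                       "first": ["see you at nine"]})
```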
The speech generation device provided in the embodiment of the present application can implement each process implemented by each method embodiment described above, and achieve the same technical effect, and for avoiding repetition, details are not repeated here.
In the embodiment of the application, when it is determined that the obtained initial voice needs to be sent to a plurality of objects, a third voice associated with each object is generated based on the first voice in the initial voice and the second voice associated with that object, so that each generated third voice includes the same first voice content of the first voice and second voice content that differs between the objects, and a plurality of voices differing only in part are generated quickly and conveniently.
Optionally, the determining module is configured to at least one of:
determining the plurality of objects based on an object indicated by first voice information in initial voice content of the initial voice;
or,
determining the plurality of objects based on a second voice content included in the initial voice content;
or,
receiving a first input of a user;
determining the plurality of objects in response to the first input; wherein the first input comprises a user operation indicating the plurality of objects.
Optionally, the generating module is configured to:
for any object in the plurality of objects, receiving a second input of a user, wherein the second input comprises an operation of the user to move a control corresponding to the object to a first area, the control corresponding to the object is a control for identifying the object or a control corresponding to a second voice associated with the object, the first area is an area corresponding to a first position in the first voice, and the first position is used for indicating a joining position of the second voice;
in response to the second input, adding a second voice associated with the object to the first location, resulting in the third voice associated with the object.
Optionally, the generating module is configured to:
under the condition that the initial voice comprises second voices respectively associated with a plurality of objects, intercepting the second voices respectively associated with the plurality of objects from the initial voice to obtain first voices;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the positions of the second voices respectively associated with the plurality of objects in the initial voice.
Optionally, the generating module is configured to:
under the condition that the initial voice content of the initial voice comprises first voice information, deleting a voice segment corresponding to the first voice information from the initial voice to obtain first voice;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the position of a voice segment corresponding to the first voice information in the initial voice.
Optionally, the voice content of the third voice further includes: a message reference identification referencing a history message, the message reference identification to associate the third voice with the history message.
In the embodiment of the application, when it is determined that the obtained initial voice needs to be sent to a plurality of objects, a third voice associated with each object is generated based on the first voice in the initial voice and the second voice associated with that object, so that each generated third voice includes the same first voice content of the first voice and second voice content that differs between the objects, and a plurality of voices differing only in part are generated quickly and conveniently.
The speech generating apparatus in the embodiment of the present application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. The electronic device may be, for example, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an Augmented Reality (AR)/Virtual Reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and may also be a server, a Network Attached Storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, and the like, and the embodiments of the present application are not limited in particular.
The speech generating apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.
The speech generating device provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 5, and is not described here again to avoid repetition.
Optionally, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 7, an electronic device 700 is further provided in an embodiment of the present application and includes a processor 701 and a memory 702, where the memory 702 stores a program or an instruction that can be executed on the processor 701, and when the program or the instruction is executed by the processor 701, the steps of the embodiment of the speech generation method are implemented, and the same technical effect can be achieved, and details are not repeated here to avoid repetition.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 800 includes, but is not limited to: a radio frequency unit 801, a network module 802, an audio output unit 803, an input unit 804, a sensor 805, a display unit 806, a user input unit 807, an interface unit 808, a memory 809, and a processor 810.
Those skilled in the art will appreciate that the electronic device 800 may further comprise a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 810 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation to the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
Wherein the processor 810 is configured to:
determining a plurality of objects based on the initial speech;
acquiring second voice, wherein different objects in the plurality of objects are associated with different second voices;
generating third voices respectively associated with the plurality of objects based on the first voices in the initial voices and the second voices respectively associated with the plurality of objects;
the voice content of the third voice includes: and the first voice content of the first voice and the second voice content of the second voice which is related to the same object with the third voice.
In the embodiment of the application, when it is determined that the obtained initial voice needs to be sent to a plurality of objects, a third voice associated with each object is generated based on the first voice in the initial voice and the second voice associated with that object, so that each generated third voice includes the same first voice content of the first voice and second voice content that differs between the objects, and a plurality of voices differing only in part are generated quickly and conveniently.
Optionally, the processor 810 is configured to at least one of:
determining the plurality of objects based on an object indicated by first voice information in initial voice content of the initial voice; or, determining the plurality of objects based on a second voice content included in the initial voice content; or, receiving a first input of a user; determining the plurality of objects in response to the first input; wherein the first input comprises an operation of a user indicating the plurality of objects.
Optionally, the processor 810 is configured to:
for any object in the plurality of objects, receiving a second input of a user, wherein the second input comprises an operation of the user to move a control corresponding to the object to a first area, the control corresponding to the object is a control for identifying the object or a control corresponding to a second voice associated with the object, the first area is an area corresponding to a first position in the first voice, and the first position is used for indicating a joining position of the second voice;
in response to the second input, adding a second voice associated with the object to the first location, resulting in the third voice associated with the object.
Optionally, the processor 810 is configured to:
under the condition that the initial voice comprises second voices respectively associated with a plurality of objects, intercepting the second voices respectively associated with the plurality of objects from the initial voice to obtain first voices;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the positions of the second voices respectively associated with the plurality of objects in the initial voice.
Optionally, the processor 810 is configured to:
under the condition that the initial voice content of the initial voice comprises first voice information, deleting a voice segment corresponding to the first voice information from the initial voice to obtain first voice;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the position of a voice segment corresponding to the first voice information in the initial voice.
Optionally, the voice content of the third voice further includes: a message reference identification referencing a history message, the message reference identification to associate the third voice with the history message.
In the embodiment of the application, when it is determined that the obtained initial voice needs to be sent to a plurality of objects, a third voice associated with each object is generated based on the first voice in the initial voice and the second voice associated with that object, so that each generated third voice includes the same first voice content of the first voice and second voice content that differs between the objects, and a plurality of voices differing only in part are generated quickly and conveniently.
It should be understood that, in the embodiment of the present application, the input Unit 804 may include a Graphics Processing Unit (GPU) 8041 and a microphone 8042, and the Graphics Processing Unit 8041 processes image data of still pictures or videos obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 806 may include a display panel 8061, and the display panel 8061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 807 includes at least one of a touch panel 8071 and other input devices 8072. A touch panel 8071, also referred to as a touch screen. The touch panel 8071 may include two portions of a touch detection device and a touch controller. Other input devices 8072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
The memory 809 may be used to store software programs as well as various data. The memory 809 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system, application programs or instructions required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. Further, the memory 809 can include volatile memory or non-volatile memory, or the memory 809 can include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchronous Link DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 809 in the present embodiment of the application includes, but is not limited to, these and any other suitable types of memory.
Processor 810 may include one or more processing units; optionally, the processor 810 integrates an application processor, which primarily handles operations related to the operating system, user interface, applications, etc., and a modem processor, which primarily handles wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into processor 810.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium. When the program or the instruction is executed by a processor, the processes of the above speech generation method embodiment are implemented, and the same technical effect can be achieved; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read only memory ROM, a random access memory RAM, a magnetic or optical disk, and the like.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the foregoing speech generation method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, etc.
Embodiments of the present application provide a computer program product, where the program product is stored in a storage medium, and the program product is executed by at least one processor to implement the processes of the foregoing embodiment of the speech generation method, and can achieve the same technical effects, and in order to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; e.g., the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the present embodiments are not limited to those precise embodiments, which are intended to be illustrative rather than restrictive, and that various changes and modifications may be effected therein by one skilled in the art without departing from the scope of the appended claims.

Claims (10)

1. A method of speech generation, comprising:
determining a plurality of objects based on the initial speech;
acquiring second voice, wherein different objects in the plurality of objects are associated with different second voices;
generating third voices respectively associated with the plurality of objects based on the first voices in the initial voices and the second voices respectively associated with the plurality of objects;
the voice content of the third voice includes: and the first voice content of the first voice and the second voice content of the second voice which is related to the same object with the third voice.
2. The speech generation method of claim 1, wherein determining a plurality of objects based on the initial speech comprises at least one of:
determining the plurality of objects based on an object indicated by first voice information in initial voice content of the initial voice;
or,
determining the plurality of objects based on a second voice content included in the initial voice content;
or,
receiving a first input of a user;
determining the plurality of objects in response to the first input; wherein the first input comprises an operation of a user indicating the plurality of objects.
3. The method according to claim 1 or 2, wherein the generating a third voice associated with each of the plurality of objects based on a first voice in the initial voice and a second voice associated with each of the plurality of objects comprises:
for any object in the plurality of objects, receiving a second input of a user, wherein the second input comprises an operation of the user to move a control corresponding to the object to a first area, the control corresponding to the object is a control for identifying the object or a control corresponding to a second voice associated with the object, the first area is an area corresponding to a first position in the first voice, and the first position is used for indicating a joining position of the second voice;
in response to the second input, adding a second voice associated with the object to the first location resulting in the third voice associated with the object.
4. The speech generation method according to claim 1 or 2, wherein generating third speech to which the plurality of objects are respectively associated based on the first speech in the initial speech and the second speech to which the plurality of objects are respectively associated comprises:
under the condition that the initial voice comprises second voices respectively associated with the plurality of objects, cutting the second voices respectively associated with the plurality of objects out of the initial voice to obtain the first voice;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the positions of the second voices respectively associated with the plurality of objects in the initial voice.
5. The method according to claim 1 or 2, wherein the generating a third voice associated with each of the plurality of objects based on a first voice in the initial voice and a second voice associated with each of the plurality of objects comprises:
under the condition that the initial voice content of the initial voice comprises first voice information, deleting a voice segment corresponding to the first voice information from the initial voice to obtain first voice;
and for any object in the plurality of objects, adding a second voice associated with the object to a first position in a first voice to obtain a third voice associated with the object, wherein the first position is the same as the position of a voice segment corresponding to the first voice information in the initial voice.
6. The speech generation method according to claim 1, wherein the voice content of the third voice further comprises: a message reference identification referencing a history message, the message reference identification being used to associate the third voice with the history message.
7. An apparatus for generating speech, the apparatus comprising:
a determination module to determine a plurality of objects based on the initial speech;
the acquisition module is used for acquiring second voice, wherein different objects in the plurality of objects are associated with different second voices;
a generating module, configured to generate third voices respectively associated with the multiple objects based on a first voice in the initial voice and second voices respectively associated with the multiple objects;
the voice content of the third voice includes: the first voice content of the first voice and the second voice content of the second voice which is related to the same object with the third voice.
8. The speech generating apparatus of claim 7, wherein the determining module is configured to at least one of:
determining the plurality of objects based on an object indicated by first voice information in initial voice content of the initial voice;
or,
determining the plurality of objects based on a second voice content included in the initial voice content;
or,
receiving a first input of a user;
determining the plurality of objects in response to the first input; wherein the first input comprises a user operation indicating the plurality of objects.
9. An electronic device comprising a processor and a memory, the memory storing a program or instructions executable on the processor, the program or instructions when executed by the processor implementing the steps of the speech generation method according to any of claims 1-6.
10. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the speech generation method according to any one of claims 1-6.
CN202211056690.3A 2022-08-30 2022-08-30 Voice generation method and device, electronic equipment and storage medium Pending CN115424603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211056690.3A CN115424603A (en) 2022-08-30 2022-08-30 Voice generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211056690.3A CN115424603A (en) 2022-08-30 2022-08-30 Voice generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115424603A true CN115424603A (en) 2022-12-02

Family

ID=84200638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211056690.3A Pending CN115424603A (en) 2022-08-30 2022-08-30 Voice generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115424603A (en)

Similar Documents

Publication Publication Date Title
KR101756042B1 (en) Method and device for input processing
US20130012245A1 (en) Apparatus and method for transmitting message in mobile terminal
CN113518026B (en) Message processing method and device and electronic equipment
CN112035031B (en) Note generation method and device, electronic equipment and storage medium
CN108595438A (en) Information processing method, device and equipment
CN114327088A (en) Message sending method, device, electronic equipment and medium
CN112866469A (en) Method and device for recording call content
WO2023186097A1 (en) Message output method and apparatus, and electronic device
WO2023134599A1 (en) Voice information sending method and apparatus, and electronic device
CN113055529B (en) Recording control method and recording control device
CN115174506A (en) Session information processing method, device, readable storage medium and computer equipment
CN115412634A (en) Message display method and device
CN113312662B (en) Message processing method and device and electronic equipment
CN115424603A (en) Voice generation method and device, electronic equipment and storage medium
CN115061580A (en) Input method, input device, electronic equipment and readable storage medium
CN114374663A (en) Message processing method and message processing device
CN115309487A (en) Display method, display device, electronic equipment and readable storage medium
CN115086747A (en) Information processing method and device, electronic equipment and readable storage medium
CN114374761A (en) Information interaction method and device, electronic equipment and medium
CN114679546A (en) Display method and device, electronic equipment and readable storage medium
CN114024929A (en) Voice message processing method and device, electronic equipment and medium
CN113299290A (en) Method and device for speech recognition, electronic equipment and readable storage medium
CN116610243A (en) Display control method, display control device, electronic equipment and storage medium
CN115457957A (en) Voice information display method and device
CN116708342A (en) Voice message processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination