CN110781344A - Method, device and computer storage medium for voice message synthesis - Google Patents

Method, device and computer storage medium for voice message synthesis

Info

Publication number
CN110781344A
CN110781344A
Authority
CN
China
Prior art keywords
user
voice
message
chat
voice chat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810765112.4A
Other languages
Chinese (zh)
Inventor
童小林
陈晓磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhangmen Science and Technology Co Ltd
Original Assignee
Shanghai Zhangmen Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhangmen Science and Technology Co Ltd
Priority to CN201810765112.4A
Publication of CN110781344A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10 Multimedia information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method, a device and a computer storage medium for synthesizing voice messages. The method comprises the following steps: after detecting that a first user triggers a voice message synthesis function in a group chat scene, collecting voice chat messages sent by at least one second user in the group chat scene; and synthesizing the collected voice chat messages sent by the at least one second user, and outputting the audio data obtained by the synthesis. The method and the device simplify the steps a user performs to view voice chat messages in a group chat scene and improve the efficiency of viewing such messages.

Description

Method, device and computer storage medium for voice message synthesis
[ technical field ]
The present invention relates to the field of data processing technologies, and in particular, to a method, device, and computer storage medium for synthesizing a voice message.
[ background of the invention ]
In current group chat scenarios, the number of voice chat messages becomes very large because many people participate in the group chat. When a user wants to view the voice chat messages sent by a particular user, the user has to search through the chat messages of all users one by one. In addition, when new chat messages arrive and the user returns to the group chat scene after being away, the user may forget which voice chat message was viewed last and has to search the chat history again before continuing with the new messages. The steps for viewing voice chat messages in a group chat scene are therefore cumbersome, and the efficiency of viewing voice chat messages is low.
[ summary of the invention ]
In view of this, the present invention provides a method, a device and a computer storage medium for synthesizing voice messages, which are used to simplify the steps a user performs to view voice chat messages in a group chat scene and to improve the efficiency with which the user views voice chat messages.
The technical solution adopted by the invention to solve the above problem is a method for synthesizing voice messages, which comprises the following steps: after detecting that a first user triggers a voice message synthesis function in a group chat scene, collecting voice chat messages sent by at least one second user in the group chat scene; and synthesizing the collected voice chat messages sent by the at least one second user, and outputting the audio data obtained by the synthesis.
According to this technical solution, the voice chat messages sent by at least one second user in the group chat scene are synthesized into audio data, so that the first user can view the voice chat messages of other users in the group chat scene more quickly without checking them one by one. This simplifies the steps of viewing voice chat messages in a group chat scene and improves the efficiency with which the user views them.
[ description of the drawings ]
Fig. 1 is a flowchart of a method for synthesizing a voice message according to an embodiment of the present invention;
fig. 2 is a block diagram of a computer system/server according to an embodiment of the invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between related objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the related objects before and after it are in an "or" relationship.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
Fig. 1 is a flowchart of a method for synthesizing a voice message according to an embodiment of the present invention. As shown in fig. 1, the method includes:
in 101, after acquiring a function that a first user triggers voice message synthesis in a group chat scene, acquiring a voice chat message sent by at least one second user in the group chat scene.
In this step, after the first user is detected to trigger the voice message synthesis function in the group chat scene, the voice chat messages sent by at least one second user in the current group chat scene are collected. The first user is the current user, and a second user is a user included in the group chat scene.
In some embodiments, the first user may trigger the voice message synthesis function in the group chat scene in the following ways: the first user may click a voice message synthesis button in the group chat interface, or the first user may issue a voice command for voice message synthesis in the group chat interface. The invention does not limit the way in which the first user triggers the voice message synthesis function in the group chat scene. When the first user triggers the voice message synthesis function, the operation of collecting the voice chat messages sent by at least one second user in the group chat scene is executed.
Specifically, the voice chat messages sent by at least one second user in the group chat scene may be collected as follows: determining at least one second user and the identification information corresponding to that second user, where the identification information is information indicating the identity of the second user and may be, for example, the second user's ID, user name, or nickname in the group chat scene; and collecting the voice chat messages sent by the at least one second user in the group chat scene according to the determined identification information, that is, collecting the voice chat messages sent by the second user in the current group chat scene according to the identification information of the second user.
It can be understood that when a user sends a voice chat message in a group chat scene, the message carries, in addition to the voice content itself, identification information indicating the identity of the sender, such as the user's ID, user name, or nickname. Therefore, according to the identification information of the second user, the voice chat messages corresponding to that identification information can be collected from all the voice chat messages in the group chat scene.
In some embodiments, the at least one second user in the group chat scene may be determined in the following ways: a preset user in the group chat scene may be taken as a second user, and there may be one or more preset users; a user selected by the first user in the group chat scene may be taken as a second user, for example a user selected by the first user from the group chat user list, or the user whose avatar or chat message is clicked by the first user in the group chat scene, and the first user may select one or more such users; or, according to the number of chat messages sent by each user in the group chat scene within a preset time period, the users whose send counts rank in the top n may be taken as second users, where n is a positive integer greater than or equal to 1.
Therefore, after the at least one second user in the group chat scene is determined, the identification information corresponding to the determined second user can be obtained, and the voice chat messages sent by the at least one second user in the group chat scene are collected according to that identification information.
When collecting the voice chat messages sent by at least one second user in the group chat scene according to the identification information of the second user, the following manner may be adopted: collecting, within a preset time period, the voice chat messages in the group chat scene that correspond to the identification information of the at least one second user. The preset time period may be a fixed period, for example 20:00 to 22:00, or a period of a predetermined length before the current time, for example the two hours before the current time.
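By way of illustration only, the following Python sketch shows one possible way to pick the most active senders as second users and then collect their voice chat messages within a preset time window. It is not taken from the patent; the message fields (sender_id, kind, send_time, payload) and the helper names are assumptions.

```python
# Hypothetical sketch: select second users and collect their voice chat messages.
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChatMessage:
    sender_id: str       # identification information of the sender (assumed field)
    kind: str            # "voice" or "text"
    send_time: datetime
    payload: bytes       # audio bytes for voice messages, UTF-8 text otherwise


def top_n_senders(messages: List[ChatMessage], n: int) -> List[str]:
    """Identification info of the n users who sent the most messages."""
    counts = Counter(m.sender_id for m in messages)
    return [sender for sender, _ in counts.most_common(n)]


def collect_voice_messages(messages: List[ChatMessage],
                           second_users: List[str],
                           window: timedelta) -> List[ChatMessage]:
    """Voice chat messages sent by the second users within the preset window."""
    cutoff = datetime.now() - window
    return [m for m in messages
            if m.kind == "voice"
            and m.sender_id in second_users
            and m.send_time >= cutoff]
```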
In addition, since a user may send text chat messages as well as voice chat messages while chatting, a text chat message sent by the user may be related to a voice chat message. Therefore, in some embodiments, to ensure that the second user's voice chat messages are collected completely, this step further includes the following: collecting text chat messages sent by the at least one second user in the group chat scene; determining whether a collected text chat message is related to the corresponding voice chat message sent by the second user, for example by calculating the text similarity between the text chat message and the adjacent voice chat message and treating them as related when the similarity exceeds a preset threshold, and otherwise as unrelated; if they are related, performing voice conversion on the collected text chat message to obtain a voice chat message corresponding to the text chat message; and using the obtained voice chat message in the voice message synthesis.
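A minimal sketch of the relatedness test described above, assuming a transcript of the adjacent voice chat message is already available (for example from speech recognition); the similarity measure and the threshold value are assumptions, not the patent's own choice.

```python
# Hypothetical sketch: treat a text message as related to the adjacent voice
# message when their text similarity exceeds a preset threshold.
from difflib import SequenceMatcher


def is_related(text_message: str, voice_transcript: str,
               threshold: float = 0.6) -> bool:
    similarity = SequenceMatcher(None, text_message, voice_transcript).ratio()
    return similarity >= threshold
```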
When performing voice conversion on a text chat message, the following manner may be adopted: extracting the voice features of the second user from the voice chat messages of the second user that correspond to the text chat message, and performing voice conversion on the second user's text chat message using the extracted voice features. With this conversion, the voice chat message obtained from the text chat message matches the tone of the corresponding second user, which improves the listening experience of the first user. If the voice features of the second user cannot be extracted, the text chat message is converted using preset voice features.
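The patent does not name a particular text-to-speech or voice-cloning component, so the following sketch only outlines the control flow: try to derive the sender's voice features from that sender's own voice messages and fall back to preset features when extraction fails. extract_voice_features() and synthesize_speech() are hypothetical placeholders.

```python
# Hypothetical sketch: convert a text chat message to speech in the sender's voice.
from typing import List, Optional

PRESET_VOICE = {"pitch": 1.0, "timbre": "neutral"}  # assumed fallback voice features


def extract_voice_features(voice_samples: List[bytes]) -> Optional[dict]:
    """Placeholder: derive speaker features from the sender's own voice messages."""
    if not voice_samples:
        return None
    return {"pitch": 1.0, "timbre": "speaker"}  # placeholder result


def synthesize_speech(text: str, voice_features: dict) -> bytes:
    """Placeholder for the actual TTS engine call."""
    return b""  # plug in a real TTS engine here


def text_to_voice(text_message: str, senders_voice_samples: List[bytes]) -> bytes:
    features = extract_voice_features(senders_voice_samples) or PRESET_VOICE
    return synthesize_speech(text_message, features)
```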
In 102, the collected voice chat messages sent by the at least one second user are synthesized, and the audio data obtained by the synthesis is output.
In this step, the voice chat messages of the second user collected in step 101 are synthesized, and the audio data obtained by the synthesis is output. The synthesized audio data contains the voice content of each voice chat message.
When synthesizing the collected voice chat messages sent by the at least one second user, the following manner may be adopted: obtaining the sending time of each collected voice chat message sent by the at least one second user, and synthesizing the voice chat messages in chronological order to obtain the corresponding audio data. That is, after the second user's voice chat messages are collected, they are spliced in sequence according to the sending time of each message to obtain the corresponding audio data.
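A minimal sketch of the splicing step, assuming the voice chat messages are available as audio files and using the pydub library for concatenation (the library choice is an assumption, not part of the disclosure).

```python
# Hypothetical sketch: splice voice chat messages in send-time order.
from pydub import AudioSegment  # pip install pydub (ffmpeg required)


def splice_voice_messages(voice_messages, output_path="synthesized.mp3"):
    """voice_messages: objects with .send_time and .file_path attributes (assumed)."""
    ordered = sorted(voice_messages, key=lambda m: m.send_time)
    combined = AudioSegment.empty()
    for message in ordered:
        combined += AudioSegment.from_file(message.file_path)
    combined.export(output_path, format="mp3")
    return output_path
```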
When synthesizing the collected chat messages sent by the at least one second user, the following manner may also be adopted: filtering the voice chat messages sent by the at least one second user according to a preset selection rule to obtain the voice chat messages that meet the rule, and synthesizing those voice chat messages in chronological order.
When filtering the voice chat messages according to the preset selection rule, the following manners may be adopted: selecting voice chat messages that meet a preset duration; or selecting voice chat messages that do not contain specific voice content; or selecting voice chat messages that match the chat topic selected by the first user.
The process of selecting voice chat messages that match the chat topic selected by the first user can be illustrated as follows: obtaining the text content corresponding to a voice chat message sent by the second user; calculating the text similarity between the obtained text content and the chat topic selected by the first user; and taking the voice chat messages whose text similarity exceeds a preset threshold as the voice chat messages matching the chat topic selected by the first user.
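The three selection rules can be sketched as a single filter, again assuming a transcript is available for each voice chat message; the field names and the similarity-based topic test are assumptions.

```python
# Hypothetical sketch: keep only voice chat messages that satisfy the selection rules.
from difflib import SequenceMatcher


def passes_rules(message, min_seconds=1.0, max_seconds=60.0,
                 banned_phrases=(), topic=None, topic_threshold=0.5):
    # Rule 1: the message must satisfy the preset duration.
    if not (min_seconds <= message.duration_seconds <= max_seconds):
        return False
    # Rule 2: the message must not contain specific (banned) voice content.
    if any(phrase in message.transcript for phrase in banned_phrases):
        return False
    # Rule 3: the message must match the chat topic selected by the first user.
    if topic is not None:
        similarity = SequenceMatcher(None, message.transcript, topic).ratio()
        if similarity < topic_threshold:
            return False
    return True


def filter_messages(messages, **rule_kwargs):
    return [m for m in messages if passes_rules(m, **rule_kwargs)]
```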
In addition, synthesizing the voice chat messages sent by the at least one second user may also include the following: analyzing the effective duration of each voice chat message, for example by removing blank audio, ending prompt tones and the like from each message, so that each voice chat message is reduced to its effective duration; and comparing the length of the synthesized audio data with the sum of the effective durations of the voice chat messages to check whether the synthesized audio data is complete. If the two are consistent, the synthesized audio data is complete; if not, the voice chat messages are synthesized again according to the above process. Furthermore, when synthesizing the voice chat messages, the sound frequency of the synthesized audio data may need to be adjusted so that the sound in the synthesized audio data is not distorted.
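The completeness check can be sketched as follows; the tolerance value and the retry loop are assumptions added for illustration.

```python
# Hypothetical sketch: verify that the synthesized audio covers the sum of the
# effective durations of the individual voice chat messages.
def is_synthesis_complete(synthesized_seconds: float,
                          effective_durations,
                          tolerance: float = 0.1) -> bool:
    return abs(synthesized_seconds - sum(effective_durations)) <= tolerance


def synthesize_until_complete(voice_messages, synthesize, effective_duration,
                              max_tries: int = 3):
    """synthesize() and effective_duration() are placeholders for the real steps."""
    durations = [effective_duration(m) for m in voice_messages]
    for _ in range(max_tries):
        audio, length_seconds = synthesize(voice_messages)
        if is_synthesis_complete(length_seconds, durations):
            return audio
    raise RuntimeError("could not produce complete audio data")
```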
This step may further include the following when synthesizing the voice chat messages: when there are multiple second users, adding the identification information of the corresponding second user to each voice chat message, for example adding the user name of the corresponding second user at the beginning of the voice chat message; when there is only one second user, the identification information of that second user may or may not be added to each voice chat message. By adding the identification information of the corresponding second user to the voice chat messages, the first user can tell which second user spoke each voice chat message when listening to the synthesized audio data, which further improves the listening experience of the first user.
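A short sketch of how a spoken announcement of the sender could be prepended to each clip when there are several second users; the tts() callable is a hypothetical placeholder.

```python
# Hypothetical sketch: prepend the sender's name announcement to a voice clip.
from pydub import AudioSegment


def tag_with_sender(voice_clip: AudioSegment, sender_name: str, tts) -> AudioSegment:
    """tts(text) is assumed to return an AudioSegment speaking the given text."""
    announcement = tts(sender_name)
    return announcement + voice_clip
```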
When synthesizing the collected voice chat messages sent by the at least one second user, the following manner may also be adopted: taking the voice chat messages sent by the at least one second user as the input of a pre-trained message synthesis model, and taking the output of the message synthesis model as the synthesized audio data.
The message synthesis model used in the above process is obtained by pre-training in the following way:
obtaining a plurality of voice chat message sets and their corresponding audio data, where each obtained voice chat message set contains a plurality of voice chat messages and the corresponding audio data contains the voice content of those messages; and training a deep learning model with each voice chat message set as input and the audio data corresponding to that set as output, thereby obtaining the message synthesis model. When the audio data used for training has high voice quality, for example high articulation and natural-sounding synthesis, the trained message synthesis model can generate corresponding high-quality audio data from the collected voice chat messages of the second user.
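The patent does not specify the deep learning architecture, so the following PyTorch sketch only illustrates the stated training setup (message sets as input, corresponding audio data as output); the toy model, the tensor shapes and the random placeholder data are all assumptions.

```python
# Hypothetical sketch: train a message synthesis model on (message set, audio) pairs.
import torch
import torch.nn as nn

SAMPLE_LEN = 16000  # one second of 16 kHz audio per example (assumed)


class MessageSynthesisModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )

    def forward(self, x):          # x: (batch, 1, samples), a voice chat message set
        return self.net(x)         # output: (batch, 1, samples), synthesized audio


model = MessageSynthesisModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    # Placeholder batch: replace with real (message set, combined audio) pairs.
    inputs = torch.randn(8, 1, SAMPLE_LEN)
    targets = torch.randn(8, 1, SAMPLE_LEN)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```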
In this step, after the audio data containing the voice content of the voice chat messages is generated, the obtained audio data can be returned directly to the first user, who plays it with other multimedia playing software. Alternatively, corresponding function buttons can be provided for the first user at a preset position of the current group chat interface, and the first user performs corresponding operations on the audio data through these buttons, such as playing, pausing, fast forwarding, rewinding and saving. The preset position of the group chat interface may be the top, the bottom, the sides, and so on, which is not limited by the present invention.
In addition, after the audio data is synthesized, the following may also be performed: obtaining the text content corresponding to the voice content in the synthesized audio data, and generating subtitles corresponding to the audio data from the obtained text content, so that the text corresponding to the currently played content is displayed while the user plays the audio data, which further improves the listening experience of the user.
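The subtitle step can be sketched as writing transcribed segments out in SRT form; the patent does not name a speech recognizer, so the segments are assumed to come from some ASR component as (start_seconds, end_seconds, text) tuples.

```python
# Hypothetical sketch: write ASR segments of the synthesized audio as SRT subtitles.
def format_timestamp(seconds: float) -> str:
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def write_srt(segments, path="synthesized.srt"):
    """segments: iterable of (start_seconds, end_seconds, text) from an ASR system."""
    with open(path, "w", encoding="utf-8") as f:
        for index, (start, end, text) in enumerate(segments, start=1):
            f.write(f"{index}\n")
            f.write(f"{format_timestamp(start)} --> {format_timestamp(end)}\n")
            f.write(f"{text}\n\n")
```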
Fig. 2 illustrates a block diagram of an exemplary computer system/server 012 suitable for implementing some embodiments of the invention. The computer system/server 012 shown in fig. 2 is only an example and should not impose any limitation on the function and scope of the embodiments of the present invention.
As shown in fig. 2, the computer system/server 012 is embodied as a general purpose computing device. The components of computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that couples various system components including the system memory 028 and the processing unit 016.
Bus 018 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 012 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 028 can include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 034 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 2, commonly referred to as a "hard drive"). Although not shown in FIG. 2, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 018 via one or more data media interfaces. Memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present invention.
Program/utility 040 having a set (at least one) of program modules 042 can be stored, for example, in memory 028, such program modules 042 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof might include an implementation of a network environment. Program modules 042 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., keyboard, pointing device, display 024, etc.), with one or more devices that enable a user to interact with the computer system/server 012, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 022. Also, the computer system/server 012 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 020. As shown, the network adapter 020 communicates with the other modules of the computer system/server 012 via bus 018. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 016 executes various functional applications and data processing by running the programs stored in the system memory 028, for example implementing a method for voice message synthesis, which may include:
after detecting that a first user triggers a voice message synthesis function in a group chat scene, collecting chat messages sent by at least one second user in the group chat scene;
and synthesizing the collected chat messages sent by the at least one second user, and outputting the audio data obtained by the synthesis.
The computer program described above may be provided in a computer storage medium, that is, the computer storage medium is encoded with a computer program that, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above embodiments of the invention, for example the method flows executed by the one or more processors described above.
With the development of technology over time, the meaning of "medium" has become broader, and the propagation path of a computer program is no longer limited to tangible media; a program may, for example, be downloaded directly from a network. Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
According to the technical solutions above, the voice chat messages sent by the second user in the group chat scene are synthesized, so that the first user can view the voice chat messages of other users in the group chat scene more quickly without checking them one by one. This simplifies the steps the user performs to view voice chat messages in a group chat scene and improves the efficiency with which the user views them.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (16)

1. A method of voice message synthesis, the method comprising:
after detecting that a first user triggers a voice message synthesis function in a group chat scene, collecting voice chat messages sent by at least one second user in the group chat scene;
and synthesizing the collected voice chat messages sent by the at least one second user, and outputting the audio data obtained by the synthesis.
2. The method of claim 1, wherein collecting voice chat messages sent by at least one second user in a group chat scene comprises:
determining at least one second user in a group chat scene and corresponding identification information thereof;
and acquiring voice chat messages sent by at least one second user in the group chat scene according to the identification information.
3. The method of claim 2, wherein determining at least one second user in a group chat scene comprises:
taking a preset user in a group chat scene as a second user; or
Taking a user selected by the first user in a group chat scene as a second user; or
And taking the users whose number of sent chat messages in the group chat scene ranks in the top n as second users, wherein n is a positive integer greater than or equal to 1.
4. The method of claim 2, wherein the collecting voice chat messages sent by at least one second user in a group chat scene according to the identification information comprises:
and acquiring voice chat messages corresponding to the identification information of at least one second user in the group chat scene in a preset time period.
5. The method of claim 1, further comprising:
collecting text chat messages sent by at least one second user in a group chat scene;
determining whether the collected text chat message is related to the voice chat message sent by the corresponding second user;
if so, carrying out voice conversion on the collected text chat message to obtain a voice chat message corresponding to the text chat message;
and synthesizing the voice message by using the obtained voice chat message.
6. The method of claim 1, wherein synthesizing the captured voice chat message sent by the at least one second user comprises:
acquiring the transmission time of each collected voice chat message transmitted by at least one second user;
and splicing the voice chat messages sent by the at least one second user in chronological order to synthesize the corresponding audio data.
7. The method of claim 1, further comprising:
filtering the voice chat messages sent by at least one second user according to a preset selection rule;
and synthesizing the voice chat messages which are obtained by filtering and meet the preset selection rule and are sent by the at least one second user.
8. The method of claim 7, wherein filtering the voice chat messages sent by the at least one second user according to the preset selection rule comprises:
selecting voice chat messages meeting preset duration; or
Selecting a voice chat message not containing specific voice content; or
And selecting voice chat messages meeting the chat topic selected by the first user.
9. The method of claim 1, wherein synthesizing the captured voice chat message sent by the at least one second user further comprises:
and when a plurality of second users exist, adding the identification information of the corresponding second user for each voice chat message.
10. The method of claim 1, wherein synthesizing the captured voice chat message sent by the at least one second user further comprises:
analyzing the effective duration of each voice chat message sent by the second user;
comparing whether the duration of the synthesized audio data is consistent with the sum of the effective durations of the voice chat messages, and if so, determining that the obtained audio data is complete;
and if they are not consistent, synthesizing the voice chat messages sent by the second user again.
11. The method of claim 1, wherein synthesizing the captured voice chat message sent by the at least one second user further comprises:
and taking the voice chat message sent by at least one second user as the input of a message synthesis model obtained by pre-training, and taking the output result of the message synthesis model as the audio data obtained by synthesis.
12. The method of claim 11, wherein the message composition model is pre-trained by:
acquiring a plurality of voice chat message sets and corresponding audio data thereof, wherein the voice chat message sets comprise a plurality of voice chat messages;
and taking each voice chat message set as input, taking audio data corresponding to each voice chat message set as output, training the deep learning model, and obtaining a message synthesis model.
13. The method of claim 1, wherein outputting the synthesized audio data comprises:
generating a function button for operating the audio data at a preset position of a group chat interface;
and executing corresponding operation on the audio data through the function button triggered by the first user.
14. The method of claim 1, further comprising:
and acquiring text content corresponding to the voice content in the audio data, and generating subtitles corresponding to the audio data according to the acquired text content.
15. An apparatus, characterized in that the apparatus comprises:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-14.
16. A storage medium containing computer-executable instructions for performing the method of any one of claims 1-14 when executed by a computer processor.
CN201810765112.4A 2018-07-12 2018-07-12 Method, device and computer storage medium for voice message synthesis Pending CN110781344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810765112.4A CN110781344A (en) 2018-07-12 2018-07-12 Method, device and computer storage medium for voice message synthesis

Publications (1)

Publication Number Publication Date
CN110781344A true CN110781344A (en) 2020-02-11

Family

ID=69377044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810765112.4A Pending CN110781344A (en) 2018-07-12 2018-07-12 Method, device and computer storage medium for voice message synthesis

Country Status (1)

Country Link
CN (1) CN110781344A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005326811A (en) * 2004-04-14 2005-11-24 Matsushita Electric Ind Co Ltd Voice synthesizer and voice synthesis method
CN1941747A (en) * 2005-09-27 2007-04-04 腾讯科技(深圳)有限公司 Demand telecommunicating method and system
US20080235024A1 (en) * 2007-03-20 2008-09-25 Itzhack Goldberg Method and system for text-to-speech synthesis with personalized voice
CN105959271A (en) * 2016-04-25 2016-09-21 乐视控股(北京)有限公司 Text content information voice conversion method, playing method, and playing device
CN105827516A (en) * 2016-05-09 2016-08-03 腾讯科技(深圳)有限公司 Message processing method and device
CN105939250A (en) * 2016-05-25 2016-09-14 珠海市魅族科技有限公司 Audio processing method and apparatus
CN107274884A (en) * 2017-02-15 2017-10-20 赵思聪 A kind of information acquisition method based on text resolution and phonetic synthesis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张斌; 全昌勤; 任福继: "语音合成方法和发展综述" (A survey of speech synthesis methods and development) *

Similar Documents

Publication Publication Date Title
CN108683937B (en) Voice interaction feedback method and system for smart television and computer readable medium
CN110069608B (en) Voice interaction method, device, equipment and computer storage medium
CN110517689B (en) Voice data processing method, device and storage medium
CN106971009B (en) Voice database generation method and device, storage medium and electronic equipment
US20170169822A1 (en) Dialog text summarization device and method
CN108847214B (en) Voice processing method, client, device, terminal, server and storage medium
JP6469252B2 (en) Account addition method, terminal, server, and computer storage medium
CN108012173B (en) Content identification method, device, equipment and computer storage medium
WO2019021088A1 (en) Navigating video scenes using cognitive insights
CN109275047B (en) Video information processing method and device, electronic equipment and storage medium
CN107948730B (en) Method, device and equipment for generating video based on picture and storage medium
CN110414404A (en) Image processing method, device and storage medium based on instant messaging
CN108932066A (en) Method, apparatus, equipment and the computer storage medium of input method acquisition expression packet
CN110442867B (en) Image processing method, device, terminal and computer storage medium
CN110377750B (en) Comment generation method, comment generation device, comment generation model training device and storage medium
CN109979450A (en) Information processing method, device and electronic equipment
US8868419B2 (en) Generalizing text content summary from speech content
CN111914102A (en) Method for editing multimedia data, electronic device and computer storage medium
CN116775815A (en) Dialogue data processing method and device, electronic equipment and storage medium
CN109858005A (en) Document updating method, device, equipment and storage medium based on speech recognition
CN113923479A (en) Audio and video editing method and device
CN108268443A (en) It determines the transfer of topic point and obtains the method, apparatus for replying text
CN113573128A (en) Audio processing method, device, terminal and storage medium
WO2023239477A1 (en) Video recording processing
CN117319340A (en) Voice message playing method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200211)