CN116935851A - Method and device for voice conversion, voice conversion system and storage medium - Google Patents


Info

Publication number
CN116935851A
CN116935851A (Application CN202210346283.XA)
Authority
CN
China
Prior art keywords
user
input
information
text information
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210346283.XA
Other languages
Chinese (zh)
Inventor
瞿志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haier Smart Home Co Ltd
Qingdao Haier Multimedia Co Ltd
Original Assignee
Haier Smart Home Co Ltd
Qingdao Haier Multimedia Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haier Smart Home Co Ltd, Qingdao Haier Multimedia Co Ltd filed Critical Haier Smart Home Co Ltd
Priority to CN202210346283.XA
Publication of CN116935851A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of intelligent household appliances and discloses a method for voice conversion, comprising the following steps: acquiring input audio and voiceprint information of a user; obtaining corresponding text information according to the input audio; and obtaining output voice corresponding to the text information according to the voiceprint information. The audio input by the user is converted into the text information the user intends to express, and the user's voiceprint information is extracted. The text information is then converted into audio that is commonly used or understandable at the output end, and, according to the user's voiceprint information, the converted output voice carries a mood and intonation consistent with the input voice. Thus, while broadening the applicability of real-time call translation, the translated audio is guaranteed a degree of fidelity to the original, improving the user experience. The application also discloses a device for voice conversion, a voice conversion system and a storage medium.

Description

Method and device for voice conversion, voice conversion system and storage medium
Technical Field
The application relates to the technical field of intelligent household appliances, in particular to a method and a device for voice conversion, a voice conversion system and a storage medium.
Background
At present, with the advent of the globalization era, cross-regional communication has become increasingly frequent. In such communication, people use translation software to communicate smoothly and overcome language barriers. During a voice call, even when the two parties speak different languages, barrier-free conversation can be achieved through real-time translation of the call voice.
The related art provides a method for translating call voice in real time, comprising the following steps: when the electronic terminal needs real-time translation of call voice, transmitting the call voice data received from the base station in real time to a translation server at a first coding rate, and receiving the corresponding translation result from the server; determining whether the time taken to receive the translation result for the call voice data exceeds a first time threshold; and, when it does, transmitting the call voice data received from the base station in real time to the server at a second coding rate, and receiving the corresponding translation result from the server.
In the process of implementing the embodiments of the present disclosure, it is found that at least the following problems exist in the related art:
By this method, real-time call translation can be achieved, but the speaker's mood and intonation are not translated; that is, fidelity to the original cannot be guaranteed after translation. Moreover, when different languages or different dialects are involved, the method is not applicable.
Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, and is intended to neither identify key/critical elements nor delineate the scope of such embodiments, but is intended as a prelude to the more detailed description that follows.
The embodiments of the disclosure provide a method and a device for voice conversion, a voice conversion system and a storage medium, so as to improve the fidelity of the translated audio while broadening the applicability of real-time call translation.
In some embodiments, the above method comprises: acquiring input audio and voiceprint information of a user; obtaining corresponding text information according to the input audio; and obtaining output voice corresponding to the text information according to the voiceprint information.
Optionally, obtaining corresponding text information according to the input audio includes: acquiring input parameters of a user; the input audio is converted into input text information according to the input parameters.
Optionally, the input parameters include: voice category information at the input or user information at the input.
Optionally, after converting the input audio into the input text information, further comprising: obtaining output parameters of a user; and obtaining output text information corresponding to the input text information according to the output parameters.
Optionally, the output parameters include: voice category information at the output end or user information at the output end.
Optionally, obtaining output voice corresponding to the text information according to the voiceprint information includes: and converting the output text information into output voice according to the voiceprint information.
Optionally, after obtaining the output voice corresponding to the text information, the method further comprises: sending the output voice to an output end, and uploading and storing the voiceprint information; and training a conversion model according to the voiceprint information to improve subsequent conversion accuracy.
In some embodiments, the apparatus comprises a processor and a memory storing program instructions, the processor being configured to perform the above method for voice conversion when the program instructions are run.
In some embodiments, the voice conversion system includes: the above-described apparatus for speech conversion.
In some embodiments, the storage medium stores program instructions that, when executed, perform the method for speech conversion described above.
The method and device for voice conversion, the voice conversion system and the storage medium provided by the embodiment of the disclosure can realize the following technical effects:
and converting the audio input by the user into corresponding text information which the user wants to express, and extracting corresponding voiceprint information of the user. The corresponding text information is standard language information, such as Mandarin, english, etc., and may be set by the user. The text information is converted into common or understandable audio at the output end, and the converted output voice content contains the tone and the tone consistent with the input voice according to the voiceprint information of the user. Therefore, on the premise of improving the applicability of the call real-time translation, the reduction degree of the translated audio is improved, and the use experience of a user is improved.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:
FIG. 1 is a schematic diagram of a method for speech conversion provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another method for speech conversion provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of another method for speech conversion provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another method for speech conversion provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of another method for speech conversion provided by an embodiment of the present disclosure;
fig. 6 is a schematic diagram of an apparatus for speech conversion provided by an embodiment of the present disclosure.
Detailed Description
So that the features and techniques of the disclosed embodiments can be understood in more detail, a more particular description of the embodiments, briefly summarized below, is given with reference to the appended drawings, which are not intended to limit the embodiments of the disclosure. In the following description, numerous details are set forth for purposes of explanation in order to provide a thorough understanding of the disclosed embodiments; however, one or more embodiments may still be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawings.
The terms first, second and the like in the description and in the claims of the embodiments of the disclosure and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe embodiments of the present disclosure. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion.
The term "plurality" means two or more, unless otherwise indicated.
In the embodiment of the present disclosure, the character "/" indicates that the front and rear objects are an or relationship. For example, A/B represents: a or B.
The term "and/or" is an associative relationship that describes an object, meaning that there may be three relationships. For example, a and/or B, represent: a or B, or, A and B.
The term "corresponding" may refer to an association or binding relationship, and the correspondence between a and B refers to an association or binding relationship between a and B.
In the embodiments of the disclosure, an intelligent home appliance is a home appliance formed by introducing a microprocessor, sensor technology and network communication technology into the appliance. It has the characteristics of intelligent control, intelligent sensing and intelligent application, and its operation typically depends on modern technologies such as the Internet of Things, the Internet and electronic chips; for example, by connecting to the appliance, a user can remotely control and manage it.
In the disclosed embodiments, a terminal device is an electronic device with a wireless connection function; it can be connected to the intelligent home appliance through the Internet, or directly through Bluetooth, Wi-Fi or the like. In some embodiments, the terminal device is, for example, a mobile device, a computer, an in-vehicle device, etc., or any combination thereof. The mobile device may include, for example, a cell phone, smart home device, wearable device, smart mobile device, virtual reality device, etc., or any combination thereof, where the wearable device includes, for example, smart watches, smart bracelets and pedometers.
As shown in conjunction with fig. 1, an embodiment of the present disclosure provides a method for speech conversion, including:
S01, the terminal equipment acquires input audio and voiceprint information of a user.
S02, the terminal equipment obtains corresponding text information according to the input audio.
S03, the terminal equipment obtains output voice corresponding to the text information according to the voiceprint information.
By adopting the voice conversion method provided by the embodiment of the disclosure, the audio input by the user can be converted into the text information the user intends to express, and the user's voiceprint information can be extracted. The corresponding text information is standard language information, such as Mandarin or English, and may be set by the user. The text information is converted into audio that is commonly used or understandable at the output end, and, according to the user's voiceprint information, the converted output voice carries a mood and intonation consistent with the input voice, achieving real-time translation during the call. Thus, while broadening the applicability of real-time call translation, the translated audio retains a degree of fidelity to the original, improving the user experience.
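The three-step flow S01 to S03 can be sketched in Python. Every function, class and data shape below is a hypothetical placeholder: the patent does not prescribe any particular implementation, so the speech-recognition and synthesis calls are stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Voiceprint:
    """Stand-in for extracted voiceprint features (mood/intonation cues)."""
    speaker_id: str
    features: tuple


def acquire_input(audio: bytes) -> tuple:
    """S01: obtain the input audio and the user's voiceprint information."""
    # A real system would run speaker-feature extraction here.
    return audio, Voiceprint(speaker_id="user-1", features=(len(audio),))


def audio_to_text(audio: bytes) -> str:
    """S02: obtain corresponding text information from the input audio."""
    # Placeholder for a speech-recognition call.
    return f"<transcript of {len(audio)} bytes>"


def text_to_speech(text: str, vp: Voiceprint) -> bytes:
    """S03: obtain output voice for the text, conditioned on the voiceprint."""
    # Placeholder for voiceprint-conditioned synthesis.
    return f"[{vp.speaker_id}] {text}".encode()


def convert(audio: bytes) -> bytes:
    """Run the full S01 -> S02 -> S03 pipeline on one utterance."""
    audio, vp = acquire_input(audio)
    text = audio_to_text(audio)
    return text_to_speech(text, vp)
```

A call such as `convert(raw_call_audio)` would, in a real system, return translated speech carrying the caller's mood and intonation.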
As shown in connection with fig. 2, an embodiment of the present disclosure provides another method for speech conversion, comprising:
S01, the terminal equipment acquires input audio and voiceprint information of a user.
S021, the terminal equipment acquires the input parameters of the user.
S022, the terminal equipment converts the input audio into input text information according to the input parameters.
S03, the terminal equipment obtains output voice corresponding to the text information according to the voiceprint information.
By adopting the method for voice conversion provided by the embodiment of the disclosure, the applicability of real-time call translation can be improved while the translated audio retains a degree of fidelity to the original. To avoid deviations in converting the input audio into input text information, for example when the input audio is misjudged as a dialect with similar pronunciation, the user can constrain the conversion process before using the function, thereby ensuring the accuracy of converting the input audio into input text information. The text information is then converted into audio that is commonly used or understandable at the output end, and, according to the user's voiceprint information, the converted output voice carries a mood and intonation consistent with the input voice. Thus, while broadening the applicability of real-time call translation, the translated audio is guaranteed a degree of fidelity to the original, improving the user experience.
Optionally, the input parameters include: voice category information at the input or user information at the input.
In this way, the accuracy of audio translation can be improved according to the user's needs, further improving the user experience. To avoid deviations in converting the input audio into input text information, for example when the input audio is misjudged as a dialect with similar pronunciation, the user can constrain the conversion process before using the function, thereby ensuring the accuracy of the conversion. The input parameters may be obtained by having the input user select the voice category at the input end (such as English, Japanese, Cantonese or another standard language) before the call starts. Alternatively, the device may use the language information contained in the personal information bound to the user's account, or invoke the language information bound to the identified user's unique personal attributes (such as facial recognition, fingerprint recognition or voiceprint recognition).
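The three sources of the input voice category described above (an explicit pre-call selection, account-bound language information, and language information bound to a biometric identity) can be sketched as a simple resolution chain. All names, the lookup tables and the fallback order are assumptions for illustration only.

```python
# Illustrative bindings: language info attached to accounts and to
# biometric identities (e.g. a recognized voiceprint ID).
ACCOUNT_LANGUAGE = {"alice": "cantonese"}
BIOMETRIC_LANGUAGE = {"vp-007": "japanese"}


def resolve_input_language(selected=None, account=None, biometric_id=None,
                           default="mandarin"):
    """Pick the input-end voice category used to steer speech-to-text."""
    if selected:                              # 1. explicit pre-call choice
        return selected
    if account in ACCOUNT_LANGUAGE:           # 2. account personal info
        return ACCOUNT_LANGUAGE[account]
    if biometric_id in BIOMETRIC_LANGUAGE:    # 3. biometric binding
        return BIOMETRIC_LANGUAGE[biometric_id]
    return default                            # no constraint available
```

For example, `resolve_input_language(account="alice")` would return `"cantonese"` under these illustrative bindings, constraining the recognizer before any audio is processed.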
As shown in conjunction with fig. 3, an embodiment of the present disclosure provides another method for speech conversion, comprising:
S01, the terminal equipment acquires input audio and voiceprint information of a user.
S021, the terminal equipment acquires the input parameters of the user.
S022, the terminal equipment converts the input audio into input text information according to the input parameters.
S023, the terminal equipment acquires output parameters of the user.
S024, the terminal equipment obtains output text information corresponding to the input text information according to the output parameters.
S03, the terminal equipment obtains output voice corresponding to the text information according to the voiceprint information.
By adopting the method for voice conversion provided by the embodiment of the disclosure, the applicability of real-time call translation can be improved while the translated audio retains a degree of fidelity to the original. To improve conversion efficiency and ensure a simultaneous-interpretation effect, the input audio information is converted into standard language information, and that text information is in turn converted into text in the language commonly used at the output end; according to the user's voiceprint information, the converted output voice carries a mood and intonation consistent with the input voice. The output text is converted directly into output-end audio, and constraining the conversion process before using the function improves the efficiency of converting the output text information into output audio. Thus, while broadening the applicability of real-time call translation, the translated audio is guaranteed a degree of fidelity to the original, improving the user experience.
Optionally, the output parameters include: voice category information at the output end or user information at the output end.
In this way, the accuracy of audio translation can be improved according to the user's needs, further improving the user experience. To improve conversion efficiency and ensure a simultaneous-interpretation effect, the input audio information is converted into standard language information, and the text information is converted into text in the language commonly used at the output end. The output text is converted directly into output-end audio, and constraining the conversion process before using the function improves the efficiency of converting the output text information into output audio. The output parameters may be obtained by having the user select the voice category at the output end (such as English, Japanese, Cantonese or another standard language) before the call starts. Alternatively, the device may use the language information contained in the personal information bound to the output-end user's account, or invoke the language information bound to the output-end user according to that user's unique personal attributes (such as facial recognition, fingerprint recognition or voiceprint recognition).
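Step S024, mapping input text information to output text information according to the output parameters, can be sketched as follows. The tiny phrasebook stands in for a real machine-translation service; all data and names are illustrative assumptions.

```python
# Toy translation tables keyed by (source language, output language).
# A real system would call a machine-translation backend instead.
PHRASEBOOK = {
    ("mandarin", "english"): {"你好": "hello"},
    ("english", "japanese"): {"hello": "こんにちは"},
}


def translate_text(text: str, src_lang: str, out_lang: str) -> str:
    """S024: return output-end text for the given output voice category."""
    if src_lang == out_lang:
        # Same language at both ends: no translation needed.
        return text
    table = PHRASEBOOK.get((src_lang, out_lang), {})
    # Fall back to the original text if no translation is known.
    return table.get(text, text)
```

The resulting output text is what the later synthesis step turns into output-end audio.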
As shown in connection with fig. 4, an embodiment of the present disclosure provides another method for speech conversion, comprising:
S01, the terminal equipment acquires input audio and voiceprint information of a user.
S021, the terminal equipment acquires the input parameters of the user.
S022, the terminal equipment converts the input audio into input text information according to the input parameters.
S023, the terminal equipment acquires output parameters of the user.
S024, the terminal equipment obtains output text information corresponding to the input text information according to the output parameters.
S031, the terminal equipment converts the output text information into output voice according to the voiceprint information.
By adopting the method for voice conversion provided by the embodiment of the disclosure, the applicability of real-time call translation can be improved while the translated audio retains a degree of fidelity to the original. In converting the output text information into output voice, the voiceprint information of the input end is matched, so that after translation the audio can reproduce the original mood and intonation of the input end. Thus, while broadening the applicability of real-time call translation, the translated audio is guaranteed a degree of fidelity to the original, improving the user experience.
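Step S031, converting the output text into output voice while carrying the input end's mood and intonation, can be sketched as below. A real system would condition a text-to-speech model on a speaker embedding; here the "synthesis" merely attaches prosody parameters derived from the voiceprint, and all names and the feature set are assumptions.

```python
def extract_prosody(voiceprint: dict) -> dict:
    """Derive mood/intonation parameters from stored voiceprint features."""
    # Defaults stand in for neutral prosody when a feature is missing.
    return {"pitch": voiceprint.get("pitch", 1.0),
            "energy": voiceprint.get("energy", 1.0)}


def synthesize(output_text: str, voiceprint: dict) -> dict:
    """S031: produce output voice whose prosody matches the input speaker."""
    return {"text": output_text, "prosody": extract_prosody(voiceprint)}
```

Because the prosody is taken from the input end's voiceprint, the synthesized output preserves the speaker's mood and intonation regardless of the output language.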
As shown in conjunction with fig. 5, an embodiment of the present disclosure provides another method for speech conversion, comprising:
S01, the terminal equipment acquires input audio and voiceprint information of a user.
S02, the terminal equipment obtains corresponding text information according to the input audio.
S03, the terminal equipment obtains output voice corresponding to the text information according to the voiceprint information.
S04, the terminal equipment sends the output voice to an output end, and uploads and stores the voiceprint information.
S05, the terminal equipment trains a conversion model according to the voiceprint information so as to improve the subsequent conversion accuracy.
By adopting the method for voice conversion provided by the embodiment of the disclosure, the fidelity of the translated audio can be improved. A corresponding conversion model is established according to the user's personal information. Based on the user's usage habits and vocal characteristics, the user's conversion data is uploaded after each use of the function, and the conversion model is continuously corrected according to the conversion accuracy of each use, thereby improving the fidelity of the translated audio. In this way, while broadening the applicability of real-time call translation, the translated audio is guaranteed a degree of fidelity to the original, improving the user experience. The conversion model may be the model used to convert the input audio into input text information, or the model used to convert the output text into output audio.
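Steps S04 and S05, uploading the voiceprint after each use and correcting a per-user conversion model, can be sketched as a simple feedback loop. The running accuracy counter is a toy stand-in for real model training; the class and its signals are assumptions for illustration.

```python
class ConversionModel:
    """Toy per-user model corrected after each use of the function."""

    def __init__(self):
        self.samples = 0   # uses recorded (S04: uploaded and stored)
        self.correct = 0   # uses judged accurate

    def record_use(self, conversion_ok: bool) -> None:
        """S04/S05: store this use's voiceprint data and correct the model."""
        self.samples += 1
        if conversion_ok:
            self.correct += 1

    def accuracy(self) -> float:
        """Current conversion accuracy estimate over all recorded uses."""
        return self.correct / self.samples if self.samples else 0.0
```

Each call then contributes a sample, so the model's accuracy estimate (and, in a real system, its weights) is refined continuously as the user keeps using the function.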
As shown in connection with fig. 6, an embodiment of the present disclosure provides an apparatus for speech conversion, including a processor (processor) 100 and a memory (memory) 101. Optionally, the apparatus may further comprise a communication interface (Communication Interface) 102 and a bus 103. The processor 100, the communication interface 102, and the memory 101 may communicate with each other via the bus 103. The communication interface 102 may be used for information transfer. The processor 100 may invoke logic instructions in the memory 101 to perform the method for speech conversion of the above-described embodiments.
Further, the logic instructions in the memory 101 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
The memory 101 is a computer readable storage medium that can be used to store a software program, a computer executable program, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 100 executes functional applications and data processing, i.e. implements the method for speech conversion in the above-described embodiments, by running program instructions/modules stored in the memory 101.
The memory 101 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the terminal device, etc. Further, the memory 101 may include a high-speed random access memory, and may also include a nonvolatile memory.
The embodiment of the disclosure provides a voice conversion system, which comprises the device for voice conversion.
The disclosed embodiments provide a storage medium storing computer-executable instructions configured to perform the above-described method for speech conversion.
The storage medium may be a transitory storage medium or a non-transitory storage medium.
Embodiments of the present disclosure may be embodied in a software product stored on a storage medium, including one or more instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of a method according to the embodiments of the present disclosure. The aforementioned storage medium may be a non-transitory medium capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or it may be a transitory storage medium.
The above description and the drawings illustrate embodiments of the disclosure sufficiently to enable those skilled in the art to practice them. Other embodiments may involve structural, logical, electrical, process and other changes; the embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of others. Moreover, the terminology used in the present application is for the purpose of describing embodiments only and is not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this disclosure is meant to encompass any and all possible combinations of one or more of the associated items listed. Furthermore, when used in the present disclosure, the terms "comprises", "comprising" and/or variations thereof mean that the recited features, integers, steps, operations, elements and/or components are present, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method or apparatus comprising that element. In this document, each embodiment may be described with emphasis on its differences from the other embodiments, and the same or similar parts of the various embodiments may be referred to one another. For the methods, products, etc. disclosed in the embodiments, where they correspond to the method sections disclosed herein, the description of the method sections may be referred to.
Those of skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. The skilled artisan may use different methods for each particular application to achieve the described functionality, but such implementation should not be considered to be beyond the scope of the embodiments of the present disclosure. It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the embodiments disclosed herein, the disclosed methods, articles of manufacture (including but not limited to devices, apparatuses, etc.) may be practiced in other ways. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units may be merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form. The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to implement the present embodiment. In addition, each functional unit in the embodiments of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. Likewise, in the description corresponding to the flowcharts and block diagrams, operations or steps corresponding to different blocks may occur in orders different from those disclosed, and sometimes no specific order exists between different operations or steps: two consecutive operations or steps may be performed substantially in parallel, or sometimes in reverse order, depending on the functions involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

Claims (10)

1. A method for speech conversion, comprising:
acquiring input audio and voiceprint information of a user;
obtaining corresponding text information according to the input audio;
and obtaining output speech corresponding to the text information according to the voiceprint information.
2. The method according to claim 1, wherein obtaining corresponding text information from the input audio comprises:
acquiring input parameters of a user;
and converting the input audio into input text information according to the input parameters.
3. The method of claim 2, wherein the input parameters comprise: voice category information at the input end or user information at the input end.
4. The method of claim 2, further comprising, after the converting of the input audio into input text information:
obtaining output parameters of a user;
and obtaining output text information corresponding to the input text information according to the output parameters.
5. The method of claim 4, wherein the output parameters comprise: voice category information at the output end or user information at the output end.
6. The method of claim 4, wherein obtaining output speech corresponding to the text information based on the voiceprint information comprises:
and converting the output text information into output voice according to the voiceprint information.
7. The method according to any one of claims 1 to 6, further comprising, after said obtaining the output speech corresponding to the text information:
sending the output voice to an output end, and uploading and storing the voiceprint information;
and training a conversion model according to the voiceprint information to improve subsequent conversion accuracy.
8. An apparatus for speech conversion comprising a processor and a memory storing program instructions, wherein the processor is configured to perform the method for speech conversion of any of claims 1 to 7 when the program instructions are run.
9. A speech conversion system comprising the apparatus for speech conversion of claim 8.
10. A storage medium storing program instructions which, when executed, perform the method for speech conversion of any one of claims 1 to 7.
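Taken together, claims 1 to 7 describe a pipeline: acquire the input audio and the speaker's voiceprint, transcribe the audio to input text using input-end parameters (claims 2-3), map the input text to output text using output-end parameters (claims 4-5), and synthesize output speech in the speaker's voice from the voiceprint (claim 6). The sketch below is a minimal, hypothetical Python illustration of that flow; the function names, the `ConversionRequest` fields, and the placeholder ASR/text-mapping/TTS stubs are assumptions for illustration only, not the patent's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ConversionRequest:
    input_audio: bytes       # claim 1: acquired input audio
    voiceprint: str          # claim 1: acquired voiceprint information
    input_language: str      # claim 3: voice category information at the input end
    output_language: str     # claim 5: voice category information at the output end


def speech_to_text(audio: bytes, language: str) -> str:
    # Placeholder ASR stage (claim 2): converts the input audio into
    # input text information according to the input parameters.
    return f"<{language} transcript of {len(audio)} bytes>"


def map_to_output_text(text: str, target_language: str) -> str:
    # Placeholder text-mapping stage (claim 4): obtains output text
    # information corresponding to the input text according to the
    # output parameters (e.g. a translation step).
    return f"<{target_language} rendering of {text}>"


def text_to_speech(text: str, voiceprint: str) -> bytes:
    # Placeholder synthesis stage (claim 6): converts the output text
    # into output speech conditioned on the stored voiceprint.
    return f"[{voiceprint}] {text}".encode("utf-8")


def convert(req: ConversionRequest) -> bytes:
    # Full pipeline of claim 1, refined by claims 2-6.
    text_in = speech_to_text(req.input_audio, req.input_language)
    text_out = map_to_output_text(text_in, req.output_language)
    return text_to_speech(text_out, req.voiceprint)
```

In a deployed system the three stubs would be backed by real ASR, translation, and voiceprint-conditioned TTS models; claim 7 additionally has the output end store the voiceprint and feed it back to train the conversion model for better subsequent accuracy.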
CN202210346283.XA 2022-04-02 2022-04-02 Method and device for voice conversion, voice conversion system and storage medium Pending CN116935851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210346283.XA CN116935851A (en) 2022-04-02 2022-04-02 Method and device for voice conversion, voice conversion system and storage medium

Publications (1)

Publication Number Publication Date
CN116935851A true CN116935851A (en) 2023-10-24

Family

ID=88376082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210346283.XA Pending CN116935851A (en) 2022-04-02 2022-04-02 Method and device for voice conversion, voice conversion system and storage medium

Country Status (1)

Country Link
CN (1) CN116935851A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727306A (en) * 2023-12-21 2024-03-19 青岛润恒益科技有限公司 Pickup translation method, device and storage medium based on original voiceprint features

Similar Documents

Publication Publication Date Title
JP7062851B2 (en) Voiceprint creation / registration method and equipment
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
JP6469252B2 (en) Account addition method, terminal, server, and computer storage medium
CN108447471A (en) Audio recognition method and speech recognition equipment
CN111261151B (en) Voice processing method and device, electronic equipment and storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN104468959A (en) Method, device and mobile terminal displaying image in communication process of mobile terminal
CN110890088B (en) Voice information feedback method and device, computer equipment and storage medium
CN106713111B (en) Processing method for adding friends, terminal and server
CN107844470B (en) Voice data processing method and equipment thereof
CN107871503A (en) Speech dialogue system and sounding are intended to understanding method
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN112071300B (en) Voice conversation method, device, computer equipment and storage medium
WO2023272616A1 (en) Text understanding method and system, terminal device, and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN110225210A (en) Based on call abstract Auto-writing work order method and system
CN112053692A (en) Speech recognition processing method, device and storage medium
CN116935851A (en) Method and device for voice conversion, voice conversion system and storage medium
CN104050962B (en) Multifunctional reader based on speech synthesis technique
CN111581462A (en) Method for inputting information by voice and terminal equipment
CN111128130B (en) Voice data processing method and device and electronic device
CN116702736A (en) Safe call generation method and device, electronic equipment and storage medium
CN111898363A (en) Method and device for compressing long and difficult sentences of text, computer equipment and storage medium
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication