CN115050349A - Method, apparatus, device and medium for text-to-audio - Google Patents
- Publication number
- CN115050349A (application CN202210669457.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- audio
- local
- chapter
- section
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure relates to a method, apparatus, device, and medium for text-to-audio conversion. The method comprises the following steps: if a text change is detected in a target chapter, determining, based on a preset text unit smaller than one chapter, a first local text of the first chapter text after the change and a first text position of the first local text relative to the second chapter text before the change; performing audio conversion processing on the first local text to generate a first local audio; and generating a second chapter audio after the change based on the first text position, the first local audio, and the first chapter audio before the change. According to the embodiments of the present disclosure, audio re-conversion is performed only on the first local text in which the text change occurred, which shortens the length of text that must be re-converted, saves the time cost and resource cost of the audio generation process of the electronic book, and improves the efficiency of text-to-audio conversion.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for converting text into audio.
Background
With the popularization of smart devices, electronic books have become a mainstream form of reading. To further facilitate reading, related reading applications provide, in addition to the text reading function, a book-listening function that converts text into audio.
Currently, the processing unit for converting the text content of an electronic book into audio is one chapter. Therefore, when the text content of an electronic book changes (for example, through text addition, modification, or deletion), audio conversion must be performed again on the entire text content of the whole chapter, which undoubtedly increases the time cost and resource cost of the audio generation process of the electronic book.
Disclosure of Invention
In order to solve the above technical problem, the present disclosure provides a method, an apparatus, a device, and a medium for text-to-audio conversion, so as to save time cost and resource cost of an audio generation process of an electronic book, thereby improving efficiency of text-to-audio conversion.
In a first aspect, the present disclosure provides a method of text-to-audio, the method comprising:
if the target chapter is detected to have text change, determining a first local text of a first chapter text relative to a second chapter text and a first text position of the first local text based on a preset text unit; the preset text unit is smaller than one chapter, the first chapter text is a chapter text after the target chapter is subjected to text change, and the second chapter text is a chapter text before the target chapter is subjected to text change;
performing audio conversion processing on the first local text to generate a first local audio;
generating a second section of audio after the target section has a text change based on the first text position, the first local audio and the first section of audio; wherein the first chapter audio is the chapter audio before the text change of the target chapter occurs.
In a second aspect, the present disclosure provides an apparatus for text-to-audio conversion, the apparatus comprising:
the first text position determining module is used for determining a first local text of a first chapter text relative to a second chapter text and a first text position of the first local text based on a preset text unit if the target chapter is detected to have a text change; the preset text unit is smaller than one chapter, the first chapter text is a chapter text after the target chapter is subjected to text change, and the second chapter text is a chapter text before the target chapter is subjected to text change;
the first local audio generation module is used for performing audio conversion processing on the first local text to generate first local audio;
a second section audio generating module, configured to generate a second section audio after a text change occurs in the target section based on the first text position, the first local audio, and the first section audio; wherein the first chapter audio is the chapter audio before the text change of the target chapter occurs.
In a third aspect, the present disclosure provides an electronic device, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the method for text-to-audio conversion described in any of the embodiments of the present disclosure.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the method of text-to-audio conversion described in any of the embodiments of the present disclosure.
In the method, apparatus, device, and medium for text-to-audio conversion of the embodiments, when a text change of a target chapter is detected, a first local text of the first chapter text after the change relative to the second chapter text before the change, together with its first text position, is determined based on a preset text unit smaller than one chapter; audio conversion processing is performed on the first local text to generate a first local audio; and a second chapter audio of the target chapter after the change is generated based on the first text position, the first local audio, and the first chapter audio of the target chapter before the change. In the case of a text change, audio re-conversion is performed only on the first local text in which the change occurred, and re-conversion of unchanged local texts is omitted. This shortens the length of text to be re-converted and, while keeping the text content and audio content consistent, saves the time cost and resource cost of the audio generation process of the electronic book and improves the efficiency of text-to-audio conversion.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flowchart of a method for converting text into audio according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart illustrating another method for text-to-audio conversion provided by an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an apparatus for text-to-audio conversion according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The method for converting text into audio provided by the embodiments of the present disclosure can be applied to scenarios in which electronic text is converted into audio, and is particularly suitable for electronic books whose text content has changed after the audio was generated. The method may be performed by a device for text-to-audio conversion, which may be implemented by software and/or hardware and may be integrated in an electronic device with certain computing capabilities. The electronic device may include, but is not limited to, mobile terminals such as a smart phone, a Personal Digital Assistant (PDA), a tablet PC, a notebook PC, a Portable Multimedia Player (PMP), an in-vehicle terminal, and a wearable device, as well as fixed terminals such as a digital TV, a desktop computer, and a smart home device.
Fig. 1 shows a flowchart of a method for text-to-audio conversion according to an embodiment of the present disclosure. As shown in fig. 1, the method of text-to-audio conversion may include the following steps:
s110, if the target chapter is detected to have text change, determining a first local text of the first chapter text relative to a second chapter text and a first text position of the first local text based on a preset text unit.
The target chapter is a chapter of an electronic book whose text is to be converted into corresponding audio. The electronic book in the embodiments of the present disclosure may be a digital version of a regular book, or digital reading content that is not a formal book but is divided into chapters. The preset text unit is a preset data-processing unit for text-to-audio conversion. In the embodiments of the present disclosure, the preset text unit is smaller than one chapter and may be, for example, one paragraph, several sentences, or one sentence.
The first chapter text is the text content of the whole chapter after the text change of the target chapter.
The second chapter text is the text content of the entire chapter before the text change of the target chapter occurs. The first local text is the text content of the preset text unit in which the text change occurs in the first chapter text. The text change may be at least one of adding text content, modifying original text content, and deleting text content. The first text position is the arrangement position of the first local text in the first chapter text or the second chapter text, and depends on the preset text unit. For example, when the preset text unit is a paragraph, the first text position may indicate which paragraph of the chapter text the first local text is; when the preset text unit is a sentence, the first text position may indicate which sentence of the chapter text it is, and so on.
Specifically, the electronic device may first detect whether the text content of the target chapter has changed.
In an example, the electronic device can compare the text content of the target chapter at the current time with its text content before the current time. If the comparison shows that the texts are completely consistent, no text change has occurred in the target chapter; if the comparison shows that the texts are not completely consistent, a text change has occurred in the target chapter.
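As an illustrative sketch (the function name and the digest-based shortcut are assumptions, not taken from the disclosure), such a comparison can be implemented by comparing content digests of the chapter text before and after the current time:

```python
import hashlib

def chapter_changed(stored_text: str, current_text: str) -> bool:
    """Detect a text change by comparing a digest of the chapter text
    stored before the current time against the current text."""
    old_digest = hashlib.sha256(stored_text.encode("utf-8")).digest()
    new_digest = hashlib.sha256(current_text.encode("utf-8")).digest()
    return old_digest != new_digest

print(chapter_changed("A sentence.", "A sentence."))          # False: texts consistent
print(chapter_changed("A sentence.", "An edited sentence."))  # True: text change detected
```

Hashing avoids holding both full chapter texts in memory for the comparison itself; a direct string comparison would work equally well.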
In another example, the electronic device can detect whether notification information related to a change in the text content of the target chapter is received. For example, when the text of the target chapter is modified (by an author, or with the assistance of review software), notification information indicating that the text has been modified is additionally generated and sent to the electronic device. In this way, the electronic device detects that a text change has occurred in the target chapter.
Then, when detecting that a text change has occurred in the target chapter, the electronic device acquires, according to the preset text unit, the first local text of the first chapter text that has changed relative to the second chapter text, together with the first text position of the first local text.
In an example, the electronic device can obtain the first local text and its corresponding first text position via text editing software. In this example, the text editing software may record information related to the text change, such as the time of the change, the text content before and after the change, and the text positions of that content. The electronic device can therefore determine the first local text from the first chapter text according to the text content before and after the change, its text position, and the preset text unit, and then determine the first text position according to the arrangement position of the first local text in the first chapter text or the second chapter text.
In another example, the electronic device may obtain the first local text and its first text position by comparing the first chapter text and the second chapter text one preset text unit at a time, i.e., comparing local texts of the corresponding text length one by one.
In some embodiments, the predetermined text unit is determined based on a text change length and/or a text unit value range. The text change length here refers to a length of a text that is changed when the text content of any chapter is changed within a period of time before the current time. The text unit value range refers to a preset value range of a preset text unit, and is a left-closed right-open interval, the lower limit value of the interval is the minimum processing unit of a chapter, namely, a sentence, and the upper limit value of the interval is a chapter, namely, the value range of the text unit is greater than or equal to one sentence and less than one chapter.
In one example, the preset text unit is determined based on a text change length. In this example, the electronic device may collect text change lengths over a plurality of text changes, and then count statistics such as a mean or a mode of the text change lengths. The statistical value can reflect the average distribution condition of the text change length in multiple text changes, so that the statistical value can be determined as a preset text unit.
In another example, the preset text unit is determined based on a text unit value range. In this example, the electronic device can select a value within a value range of the text unit as the preset text unit. In yet another example, the preset text unit is determined based on the text change length and the text unit value range. In this example, the preset text unit is adaptively adjusted within a value range of the text unit according to a statistical value of a plurality of text change lengths. For example, if the statistical value is within the value range of the text unit, the statistical value can be directly determined as the preset text unit; if the statistical value is not in the text unit value range, a value closest to the statistical value can be selected from the text unit value range to serve as a preset text unit. For another example, in the case that the statistical value is large, the network quality, the device performance, and the like of the user terminal may be considered, and a numerical value that is close to the statistical value and is relatively small is selected from the numeric value range of the text unit as the preset text unit, and the like.
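The adaptive selection described above can be sketched as follows; the function name, the use of the mean as the statistic, and measuring lengths in characters are all illustrative assumptions not fixed by the disclosure:

```python
from statistics import mean

def choose_text_unit(change_lengths, unit_min, unit_max):
    """Adaptively pick a preset text unit from a statistic of observed
    text-change lengths, clamped into the half-open value range
    [unit_min, unit_max) described above."""
    stat = mean(change_lengths)       # the mode could be used instead
    if stat < unit_min:
        return unit_min               # closest in-range value from below
    if stat >= unit_max:
        return unit_max - 1           # closest in-range value below one chapter
    return round(stat)

# Hypothetical change lengths (in characters) from recent edits.
print(choose_text_unit([30, 45, 60], unit_min=20, unit_max=2000))  # 45
```

A smaller clamp target could additionally be chosen when network quality or device performance is limited, as the text notes.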
In some embodiments, the predetermined text unit is a sentence. In view of the fact that the cost of converting text into speech is proportional to the length of text, and the smallest text unit in a chapter is a sentence, in order to improve the universality and conversion efficiency of the text-to-speech audio, the predetermined text unit can be determined as a sentence in the present embodiment.
And S120, performing audio conversion processing on the first local text to generate a first local audio.
Specifically, after the electronic device obtains the first local text, the first local text may be converted into corresponding audio (i.e., the first local audio) using a technology such as Text To Speech (TTS). If there are multiple first local texts, a corresponding number of first local audios are generated.
It should be noted that, if the text change type is the newly added text content or the modified text content, the first local audio is the audio consistent with the content of the first local text. If the text change type is deleting text content, then the first local audio is null.
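A minimal sketch of this step, in which the TTS engine is a placeholder callable (`tts`) rather than any specific real engine, and an empty string stands for a deleted sentence:

```python
def convert_local_texts(local_texts, tts):
    """Run audio conversion on each first local text. An empty string
    represents a deleted sentence, whose first local audio is null
    (here: empty bytes)."""
    local_audios = []
    for text in local_texts:
        if not text:                      # deletion: nothing to synthesize
            local_audios.append(b"")
        else:                             # addition or modification
            local_audios.append(tts(text))
    return local_audios

# Placeholder standing in for a real TTS engine.
fake_tts = lambda s: f"<audio:{s}>".encode()
print(convert_local_texts(["New sentence.", ""], fake_tts))
# [b'<audio:New sentence.>', b'']
```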
And S130, generating second chapter audio after the text change of the target chapter based on the first text position, the first local audio and the first chapter audio.
The first chapter audio is the chapter audio before the text change of the target chapter occurs.
Specifically, the electronic device may process the first section audio by using the first local audio corresponding to the changed text and the first text position thereof, so as to obtain a second section audio.
In an example, the electronic device can modify the first section audio according to the first local audio.
For example, when the first text position is an arrangement position in the second chapter text, the electronic device may locate the corresponding audio position in the first chapter audio according to the first text position, and replace the local audio at that position with the first local audio corresponding to the first text position to generate the second chapter audio.
For another example, when the first text position is an arrangement position in the first chapter text, the first text position may be mapped to the corresponding arrangement position in the second chapter text according to a pre-established position mapping between the arrangement position, in the first chapter text, of local text in which no text change occurs (i.e., the second local text) and the arrangement position of that second local text in the second chapter text. The electronic device can then locate the audio position in the first chapter audio according to the mapped first text position, and replace the local audio at that position with the first local audio corresponding to the first text position to generate the second chapter audio.
It should be noted that if the text change type is adding text content, the local audio at the audio position is empty and the first local audio is inserted at that position; if the text change type is modifying text content, the first local audio replaces the original local audio at that position; and if the text change type is deleting text content, the first local audio is empty, and the replacement effectively deletes the original local audio at that position in the first chapter audio.
In another example, the electronic device may obtain the local audio corresponding to each preset text unit in the first chapter text, and then sequentially concatenate the local audios to generate the second chapter audio. The local audio corresponding to a first local text is, for example, the first local audio obtained through TTS conversion; the local audio corresponding to a second local text can be extracted from the first chapter audio.
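The concatenation approach in this example can be sketched as follows; how the first chapter audio is segmented per sentence, and all names here, are illustrative assumptions:

```python
def splice_chapter_audio(n_sentences, changed_audio, old_pos_of, old_segments):
    """Assemble the second chapter audio sentence by sentence: changed
    positions take freshly converted first local audio, unchanged
    positions reuse segments extracted from the first chapter audio.

    changed_audio : {new position: first local audio (bytes)}
    old_pos_of    : {new position: old position} for unchanged sentences
    old_segments  : per-sentence segments of the first chapter audio
    """
    parts = []
    for pos in range(n_sentences):
        if pos in changed_audio:
            parts.append(changed_audio[pos])             # re-converted
        else:
            parts.append(old_segments[old_pos_of[pos]])  # reused
    return b"".join(parts)

# Sentence 1 was modified; sentences 0 and 2 are reused unchanged.
old_segments = [b"<s0>", b"<s1>", b"<s2>"]
print(splice_chapter_audio(3, {1: b"<s1'>"}, {0: 0, 2: 2}, old_segments))
# b"<s0><s1'><s2>"
```

Real audio splicing would concatenate encoded frames or PCM samples rather than tagged byte strings; the structure of the loop is the point.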
In the method for text-to-audio conversion provided by the embodiment of the present disclosure, when a text change of a target chapter is detected, a first local text of the first chapter text after the change, together with its first text position relative to the second chapter text before the change, is determined based on a preset text unit smaller than one chapter; the first local text is subjected to audio conversion processing to generate a first local audio; and a second chapter audio of the target chapter after the change is generated based on the first text position, the first local audio, and the first chapter audio of the target chapter before the change. In the case of a text change, audio re-conversion is performed only on the first local text in which the change occurred, and re-conversion of unchanged local texts is omitted, which shortens the length of text to be re-converted, saves the time cost and resource cost of the audio generation process of the electronic book, and improves the efficiency of text-to-audio conversion.
Fig. 2 is a flowchart of another method for text-to-audio conversion provided by an embodiment of the present disclosure. This method further optimizes how the first local text with a text change in the first chapter text, and its first text position, are determined based on the preset text unit. On this basis, the generation of the second chapter audio after the text change, based on the first text position, the first local audio, and the first chapter audio, can be further optimized. The preset text unit is taken to be a sentence for the purposes of explanation. Explanations of terms that are the same as or correspond to those in the above embodiments are omitted. Referring to fig. 2, the method for text-to-audio conversion specifically includes:
s210, if the target chapter is detected to have text change, acquiring a first chapter text and a second chapter text.
Specifically, after detecting that a text change has occurred in a target chapter, the electronic device may pull, from the network side, the first chapter text of the target chapter after the text change and the second chapter text of the target chapter before the text change, according to a chapter identifier of the target chapter (e.g., book name + chapter name/chapter code) and information about the change (e.g., its time); alternatively, the electronic device may read the first chapter text and the second chapter text from an internal or external storage medium according to the chapter identifier of the target chapter and the change information.
S220, comparing the first chapter text with the second chapter text based on the preset text unit, and determining the first local text and the first text position.
Specifically, in order to reduce the coupling between the audio conversion function and the text editing function and to improve the universality of the method provided by this embodiment of the present disclosure, the first local text is determined by comparing the first chapter text with the second chapter text, and the first text position is recorded during the comparison.
In specific implementation, the electronic device compares the local text contents of the corresponding text lengths in the first section text and the second section text one by taking the preset text unit as a processing unit. If the comparison result of any local text content is that the text contents are inconsistent, the local text content in the first section of text is changed relative to the corresponding local text content in the second section of text, so that the local text content in the first section of text is determined as the first local text. And, the arrangement position of the local text content in the first chapter text is determined as the first text position of the first local text.
In some embodiments, S220 comprises: according to a preset text unit, decomposing the first section of text into a plurality of third local texts, and decomposing the second section of text into a plurality of fourth local texts; and for each third local text, matching the third local text with each fourth local text, and determining the third local text as the first local text and determining the arrangement position of the third local text in the first chapter text as the first text position when the matching fails.
And the third local text is the local text content in the first chapter text. The fourth local text is the local text content in the second section of text.
Specifically, the electronic device first splits the first chapter text into a plurality of third local texts according to the preset text unit (here exemplified as a sentence) and records the ordering relationship between them. For example, the third local texts are stored as a new text string sequence strListNew according to their text order in the first chapter text. Similarly, the electronic device splits the second chapter text into a plurality of fourth local texts according to the preset text unit and records their ordering relationship. For example, the fourth local texts are stored as an old text string sequence strListOld according to their text order in the second chapter text.
Then, the electronic device traverses each string in the new text string sequence strListNew and performs the following: the traversed string strNew is matched against each string strOld of the old text string sequence strListOld. If strNew partially matches some strOld but is not completely consistent with any of them, the sentence corresponding to strNew has undergone a modification-type text change; strNew is determined to be a first local text, and its sequence number in strListNew is recorded as the first text position of this first local text in the first chapter text. If strNew matches none of the strings at all, the sentence corresponding to strNew has undergone an addition-type text change; strNew is likewise determined to be a first local text, and its sequence number in strListNew is recorded as the first text position of this first local text.
Through the above process, the electronic device obtains every first local text of the first chapter text that has changed relative to the second chapter text, and these first local texts cover all changed text content of the modification or addition type. Deletion-type changes therefore require no separate determination step, which improves the efficiency of determining the first local text and, in turn, the efficiency of text-to-audio conversion.
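A sketch of the traversal in S220, using the sequence names strListNew/strListOld from the text; the string-similarity ratio and the threshold used to distinguish a modified sentence from a newly added one are hypothetical choices, since the disclosure does not fix a matching criterion:

```python
from difflib import SequenceMatcher

def find_first_local_texts(str_list_new, str_list_old, threshold=0.6):
    """Traverse strListNew, matching each strNew against every strOld.
    A ratio of 1.0 against some strOld means the sentence is unchanged;
    a best ratio above `threshold` means it was modified; otherwise it
    is newly added. Returns (first text position, first local text,
    change type) for each changed sentence."""
    results = []
    for pos, str_new in enumerate(str_list_new):
        best = max(
            (SequenceMatcher(None, str_new, str_old).ratio()
             for str_old in str_list_old),
            default=0.0,
        )
        if best == 1.0:
            continue                      # unchanged: skip re-conversion
        kind = "modified" if best >= threshold else "added"
        results.append((pos, str_new, kind))
    return results

str_list_old = ["The cat sat.", "It was warm.", "The end."]
str_list_new = ["The cat sat.", "It was very warm.", "A new line.", "The end."]
print(find_first_local_texts(str_list_new, str_list_old))
```

Only positions 1 and 2 are flagged for re-conversion; the two exact matches are skipped, which is the saving the embodiment describes.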
S230, determining, based on the preset text unit, a second local text of the first section text that has no text change relative to the second section text, and a second text position of the second local text in the second section text.
Specifically, in this embodiment, the electronic device generates the second section audio by acquiring the local audio corresponding to each preset text unit in the second section text and sequentially splicing the local audios. Therefore, in addition to obtaining the first local text and the first text position, the electronic device may obtain a second local text in the first section of text where no text change occurs and an arrangement position of the second local text in the second section of text, that is, a second text position.
In specific implementation, reference may be made to the traversal in S220: the electronic device traverses each character string in the new text character string sequence strListNew and, during the traversal, matches the traversed character string strNew against each character string strOld in the old text character string sequence strListOld. If one of the matching results is completely consistent, the sentence corresponding to strNew has no text change; the sentence corresponding to strNew is determined as a second local text, and the serial number, in strListOld, of the character string strOld that completely matches strNew is recorded as the second text position of this second local text in the second section text.
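By analogy with S220, the determination of unchanged sentences in S230 can be sketched as follows; the exact-match criterion and the function name are assumptions for illustration:

```python
def find_unchanged_sentences(str_list_new, str_list_old):
    # Map each old sentence to its serial number in strListOld, which is
    # the second text position (position in the chapter before the change).
    old_index = {s: i for i, s in enumerate(str_list_old)}
    unchanged = []  # (second_text_position, second_local_text)
    for str_new in str_list_new:
        if str_new in old_index:
            unchanged.append((old_index[str_new], str_new))
    return unchanged

old = ["The sky is blue.", "Birds sing.", "The end."]
new = ["The sky is grey.", "Birds sing.", "A new line.", "The end."]
print(find_unchanged_sentences(new, old))
# [(1, 'Birds sing.'), (2, 'The end.')]
```

These second text positions are what later index into the pre-change chapter audio, so no audio has to be re-synthesized for these sentences.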
It should be noted that the execution order is not limited: S230 may also be executed before S220 or after S240.
S240, an audio conversion process is performed on the first local text to generate a first local audio.
And S250, determining second local audio corresponding to the second local text from the first chapter audio based on the second text position.
Specifically, the first chapter audio may likewise be decomposed, in the preset text unit, into a plurality of local audios (which may be referred to as third local audios) in the same sentence order as the fourth local texts. The electronic device may then select, according to each second text position, the third local audio at the corresponding position as the second local audio corresponding to the second local text.
In some embodiments, S250 comprises: if the first section of audio is in a lossy audio format, converting the first section of audio into a lossless audio format to generate a third section of audio; and performing local audio extraction on the third section of audio based on the audio starting time corresponding to the second text position to obtain second local audio.
The lossy audio format is an audio format subjected to audio encoding processing and/or audio compression processing, such as MP3 format. The lossless audio format is an audio format that has not been subjected to audio encoding processing and audio compression processing, such as a Pulse Code Modulation (PCM) format. And the audio starting time is the starting time recorded according to a preset text unit in the process of generating the third chapter of audio. For example, in an example where the preset text unit is one sentence, the audio start time is the start time of the audio corresponding to each sentence recorded when the third chapter of audio is synthesized.
Specifically, in the process of extracting the second local audio, the electronic device first determines the audio format of the first section of audio. If it is a lossless audio format, subsequent processing is performed directly. If it is a lossy audio format, the audio is first converted into a lossless audio format to obtain the format-converted chapter audio, namely the third section of audio, and subsequent processing is then performed. The reason is that, when the first chapter audio is synthesized and stored, it is generally compressed into a lossy format such as MP3, in which the starting point of a local audio in the binary data of the chapter audio cannot be located directly from the audio start time; in a lossless format such as PCM, by contrast, the starting position of a local audio in the binary data of the chapter audio can be calculated directly from the audio start time.
Then, the electronic device calculates the audio start time of each third local audio from the audio durations of the third local audios. Next, according to the calculated audio start times, the electronic device extracts, from all the third local audios of the third section of audio, the third local audio whose audio start time corresponds to each second text position, and uses it as a second local audio.
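As a minimal sketch of S250, assuming 16-bit mono PCM at 16 kHz (the actual parameters depend on how the chapter audio was synthesized), the audio start times can be accumulated from the per-sentence durations, and the lossless byte stream can then be sliced directly, since in PCM the byte offset is proportional to time:

```python
# Assumed PCM parameters: 16 kHz sample rate, 1 channel, 2 bytes per sample.
BYTES_PER_SECOND = 16000 * 1 * 2

def start_times_from_durations(durations):
    # Start time of each third local audio = cumulative duration of the
    # preceding sentences.
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d
    return starts

def extract_local_audio(pcm_bytes, start_s, duration_s):
    # Direct byte-offset slicing; this is exactly what a lossy format
    # such as MP3 does not allow.
    begin = int(start_s * BYTES_PER_SECOND)
    end = begin + int(duration_s * BYTES_PER_SECOND)
    return pcm_bytes[begin:end]

durations = [1.5, 2.0, 0.5]  # per-sentence durations in seconds
starts = start_times_from_durations(durations)
print(starts)  # [0.0, 1.5, 3.5]
```

To extract the second sentence, one would call extract_local_audio(pcm_bytes, starts[1], durations[1]); the same slicing fails on an MP3 byte stream because frame boundaries there do not map linearly to playback time.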
And S260, splicing the first local audio and the second local audio based on the first text position and the second text position to generate a second section of audio.
Specifically, according to the above process, the electronic device obtains a first text position and a first local audio corresponding to a first local text in which a text change has occurred, and a second text position and a second local audio corresponding to a second local text in which no text change has occurred. Then, the electronic device may sort the first local audios and the second local audios according to the first text position and the second text position, and concatenate the sorted local audios to generate a second section audio.
In some embodiments, S260 comprises: determining a third text position of the second local text in the first section of text based on the position mapping relationship and the second text position, and establishing a correspondence between the third text position and the second local audio; determining an audio ordering of the first local audio and the second local audio based on the correspondence between the first text position and the first local audio and the correspondence between the third text position and the second local audio; and splicing the first local audio and the second local audio based on the audio ordering to generate the second chapter audio.
The position mapping relationship records the correspondence between the arrangement position of the second local text in the first section text and its arrangement position in the second section text.
Specifically, according to the above description, the first text position is an arrangement position of a local text in the first chapter text (the text after the change), while the second text position is an arrangement position of a local text in the second chapter text (the text before the change); because of the text change, the two chapter texts may not be aligned with each other in text position. Therefore, to ensure the correct ordering of the subsequent local audios, the first text position and the second text position should be converted into arrangement positions within the same chapter text. Since the first section text is the chapter text after the change, in this embodiment the second text position is converted into an arrangement position in the first section text. In specific implementation, the electronic device queries the position mapping relationship according to each second text position to obtain the corresponding arrangement position in the first section text, namely the third text position. Based on the correspondence between the second text position and the second local audio described above, a correspondence between the third text position and the second local audio can then be established. Thus, each first local text in the first section text corresponds to a first text position and a first local audio, and each second local text in the first section text corresponds to a third text position and a second local audio.
Then, according to the arrangement order relationship between the first text positions and the third text positions, the electronic device may arrange the first local audios corresponding to the first text positions and the second local audios corresponding to the third text positions in sequence, obtaining a local audio sequence whose audio order is consistent with the text order of the third local texts in the first section of text. The electronic device then splices each first local audio and each second local audio in the local audio sequence in turn to generate the second section of audio.
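The ordering and splicing in S260 can be sketched as follows, with integer positions and short byte strings as illustrative stand-ins for the text positions and local audios. Once the second text positions have been mapped to third text positions, both kinds of local audio are indexed into the same (changed) chapter and can simply be sorted and concatenated:

```python
def splice_second_chapter_audio(first_parts, second_parts):
    # first_parts:  [(first_text_position, first_local_audio_bytes)]
    #               audio re-synthesized for changed sentences
    # second_parts: [(third_text_position, second_local_audio_bytes)]
    #               audio reused from the pre-change chapter audio
    # Both positions index into the changed chapter text, so the two
    # lists can be merged by position and concatenated in order.
    ordered = sorted(first_parts + second_parts, key=lambda p: p[0])
    return b"".join(audio for _, audio in ordered)

first = [(0, b"NEW0"), (2, b"NEW2")]    # re-converted local audios
second = [(1, b"OLD1"), (3, b"OLD3")]   # reused local audios
print(splice_second_chapter_audio(first, second))  # b'NEW0OLD1NEW2OLD3'
```

The sketch makes the cost saving visible: only the entries in first require a call to the audio conversion engine, while the entries in second are byte slices of existing audio.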
On the one hand, the method for converting text into audio provided by the embodiments of the present disclosure can acquire the first section of text and the second section of text, compare them based on the preset text unit, and determine the first local text and the first text position; this reduces the coupling between the audio conversion function and the text editing function, improves the independence of text-to-audio conversion, and thus improves the universality of the method. On the other hand, the second local audio corresponding to the second local text can be determined from the first section of audio based on the second text position, and the first local audio and the second local audio can be spliced based on the first text position and the second text position to generate the second section of audio; the second section audio after the text change is thus generated through a general text-to-audio processing flow, which avoids separate handling of deletion-type changed text and its corresponding local audio, further improving both the efficiency and the universality of the method.
The following is an embodiment of an apparatus for text-to-audio conversion provided by an embodiment of the present disclosure. The apparatus embodiment belongs to the same inventive concept as the method embodiments above; for details not described in detail in the apparatus embodiment, reference may be made to the embodiments of the method for text-to-audio conversion above.
Fig. 3 shows a schematic structural diagram of an apparatus for text-to-audio conversion provided by an embodiment of the present disclosure. As shown in fig. 3, the apparatus 300 for text-to-audio conversion may include:
a first text position determining module 310, configured to determine, based on a preset text unit, a first local text and a first text position of the first local text, where the text change occurs in the first chapter text relative to the second chapter text, if it is detected that the text change occurs in the target chapter; the preset text unit is smaller than one chapter, the first chapter text is a chapter text after the target chapter is subjected to text change, and the second chapter text is a chapter text before the target chapter is subjected to text change;
a first local audio generating module 320, configured to perform audio conversion processing on the first local text to generate a first local audio;
a second section audio generating module 330, configured to generate a second section audio after the text change occurs in the target section based on the first text position, the first local audio, and the first section audio; the first chapter audio is the chapter audio before the text change of the target chapter occurs.
In the apparatus for text-to-audio conversion provided by the embodiments of the present disclosure, when a text change of a target chapter is detected, a first local text of the first chapter text after the text change, and the first text position of the first local text, relative to the second chapter text before the text change, are determined based on a preset text unit smaller than one chapter. Audio conversion processing is performed on the first local text to generate a first local audio, and a second chapter audio of the target chapter after the text change is generated based on the first text position, the first local audio, and the first chapter audio of the target chapter before the text change. In this way, when a text change occurs, audio re-conversion is performed only on the first local text in which the text change occurred, and re-conversion of local texts without text changes is skipped; this shortens the text length to be re-converted, saves time and resource costs in generating audio for an electronic book, and improves the efficiency of text-to-audio conversion.
In some embodiments, the first text position determination module 310 includes:
the chapter text acquisition sub-module is used for acquiring a first chapter text and a second chapter text;
and the first text position determining submodule is used for comparing the first section text with the second section text based on the preset text unit and determining the first local text and the first text position.
Further, the first text position determination submodule is specifically configured to:
according to a preset text unit, decomposing the first section of text into a plurality of third local texts, and decomposing the second section of text into a plurality of fourth local texts;
and for each third local text, matching the third local text with each fourth local text, and determining the third local text as the first local text and determining the arrangement position of the third local text in the first chapter text as the first text position when the matching fails.
In some embodiments, the apparatus 300 for text-to-audio further comprises a second text position determination module for:
determining, based on the preset text unit, a second local text of the first chapter text that has no text change relative to the second chapter text, and a second text position of the second local text in the second chapter text, before generating the second chapter audio of the target chapter after the text change based on the first text position, the first local audio, and the first chapter audio;
accordingly, the second section audio generating module 330 includes:
the second local audio determining submodule is used for determining second local audio corresponding to the second local text from the first section of audio based on the second text position;
and the second section audio generation sub-module is used for splicing the first local audio and the second local audio based on the first text position and the second text position to generate second section audio.
In some embodiments, the second section audio generation sub-module is specifically configured to:
determining a third text position of the second local text in the first chapter of text based on the position mapping relation and the second text position, and establishing a corresponding relation between the third text position and the second local audio; the position mapping relation records the corresponding relation between the arrangement position of the second local text in the first section text and the arrangement position of the second local text in the second section text;
determining an audio ordering of the first and second local audio based on a correspondence between the first text position and the first local audio and a correspondence between the third text position and the second local audio;
and splicing the first local audio and the second local audio based on the audio sorting to generate second chapter audio.
In some embodiments, the second local audio determination sub-module is specifically configured to:
if the first section of audio is in a lossy audio format, converting the first section of audio into a lossless audio format to generate a third section of audio; the lossy audio format is an audio format subjected to audio coding processing and/or audio compression processing; the lossless audio format is an audio format that has not been subjected to audio encoding processing and audio compression processing;
performing local audio extraction on the third section of audio based on the audio starting time corresponding to the second text position to obtain second local audio; and the audio starting time is the starting time recorded according to a preset text unit in the process of generating the third chapter of audio.
In some embodiments, the preset text unit is determined based on the text alteration length and/or a text unit value range.
Further, the preset text unit is a sentence.
The apparatus for converting text into audio provided by the embodiments of the present disclosure can execute the method for converting text into audio provided by any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects for executing the method.
It should be noted that, in the embodiment of the apparatus for text-to-audio conversion, the modules and sub-modules included in the embodiment are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, the specific names of the functional modules/sub-modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present disclosure.
Embodiments of the present disclosure also provide an electronic device that may include a processor and a memory, which may be used to store executable instructions. Wherein the processor may be configured to read the executable instructions from the memory and execute the executable instructions to implement the method of text-to-audio described in any of the embodiments above.
Fig. 4 shows a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
As shown in fig. 4, the electronic device 400 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage means 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data necessary for the operation of the electronic device 400. The processing means 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
It should be noted that the electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of the embodiments of the present disclosure. That is, while fig. 4 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 409, or from the storage device 408, or from the ROM 402. The computer program, when executed by the processing device 401, performs the above-described functions defined in the method of text-to-audio of any embodiment of the present disclosure.
Embodiments of the present disclosure also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the method of text-to-audio in any of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP, and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the steps of the method of text-to-audio explained in any of the embodiments of the present disclosure.
In embodiments of the present disclosure, computer program code for carrying out the operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure. For example, a technical solution may be formed by replacing the above features with (but not limited to) features with similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (11)
1. A method of text-to-audio, comprising:
if it is detected that the target chapter has a text change, determining, based on a preset text unit, a first local text of a first chapter text that has a text change relative to a second chapter text and a first text position of the first local text; the preset text unit is smaller than one chapter, the first chapter text is a chapter text after the target chapter is subjected to the text change, and the second chapter text is a chapter text before the target chapter is subjected to the text change;
performing audio conversion processing on the first local text to generate a first local audio;
generating a second section of audio after the target section has a text change based on the first text position, the first local audio and the first section of audio; wherein the first chapter audio is the chapter audio before the text change of the target chapter occurs.
2. The method of claim 1, wherein the determining, based on the preset text unit, a first local text of a first section of text that has a text change relative to a second section of text and a first text position of the first local text comprises:
acquiring the first chapter text and the second chapter text;
and comparing the first section text with the second section text based on the preset text unit, and determining the first local text and the first text position.
3. The method of claim 2, wherein the comparing the first section text with the second section text based on the preset text unit and determining the first local text and the first text position comprises:
according to the preset text unit, decomposing the first section of text into a plurality of third local texts, and decomposing the second section of text into a plurality of fourth local texts;
and for each third local text, matching the third local text with each fourth local text, and if the matching fails, determining the third local text as the first local text, and determining the arrangement position of the third local text in the first chapter text as the first text position.
4. The method of any one of claims 1-3, wherein, before generating the second chapter audio of the target chapter after the text change based on the first text position, the first local audio, and the first chapter audio, the method further comprises:
determining, based on the preset text unit, a second local text of the first chapter text that has no text change relative to the second chapter text and a second text position of the second local text in the second chapter text;
and wherein generating the second chapter audio of the target chapter after the text change based on the first text position, the first local audio, and the first chapter audio comprises:
determining, from the first chapter audio, a second local audio corresponding to the second local text based on the second text position; and
splicing the first local audio and the second local audio based on the first text position and the second text position to generate the second chapter audio.
5. The method of claim 4, wherein splicing the first local audio and the second local audio based on the first text position and the second text position to generate the second chapter audio comprises:
determining a third text position of the second local text in the first chapter text based on a position mapping relation and the second text position, and establishing a correspondence between the third text position and the second local audio, wherein the position mapping relation records the correspondence between the arrangement position of the second local text in the first chapter text and the arrangement position of the second local text in the second chapter text;
determining an audio ordering of the first local audio and the second local audio based on the correspondence between the first text position and the first local audio and the correspondence between the third text position and the second local audio; and
splicing the first local audio and the second local audio based on the audio ordering to generate the second chapter audio.
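The splicing step of claims 4-5 amounts to ordering all local audio segments by their text position in the new chapter and concatenating them. A minimal sketch, with hypothetical names and audio represented as plain bytes for illustration:

```python
def splice_chapter_audio(changed, unchanged):
    """changed:   {position_in_new_chapter: newly synthesized local audio}
    unchanged: {position_in_new_chapter: local audio reused from the old
               chapter audio} -- these keys play the role of the 'third
               text positions', i.e. old-chapter positions mapped into
               the new chapter via the position mapping relation.
    Returns the spliced second chapter audio."""
    segments = {**changed, **unchanged}
    ordering = sorted(segments)          # audio ordering by text position
    return b"".join(segments[p] for p in ordering)

# Position 1 was re-synthesized; positions 0 and 2 reuse old audio:
new_audio_for_changed = {1: b"BIRDS-SANG-LOUDLY|"}
reused_old_audio = {0: b"THE-SUN-ROSE|", 2: b"THE-DAY-BEGAN|"}
print(splice_chapter_audio(new_audio_for_changed, reused_old_audio))
# b'THE-SUN-ROSE|BIRDS-SANG-LOUDLY|THE-DAY-BEGAN|'
```

In a real implementation the values would be PCM buffers and the concatenation would happen at the sample level, but the ordering logic is the same.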
6. The method of claim 4, wherein determining, from the first chapter audio, the second local audio corresponding to the second local text based on the second text position comprises:
if the first chapter audio is in a lossy audio format, converting the first chapter audio into a lossless audio format to generate a third chapter audio, wherein the lossy audio format is an audio format that has undergone audio encoding processing and/or audio compression processing, and the lossless audio format is an audio format that has undergone neither audio encoding processing nor audio compression processing; and
performing local audio extraction on the third chapter audio based on an audio start time corresponding to the second text position to obtain the second local audio, wherein the audio start time is the start time recorded for each preset text unit in the process of generating the third chapter audio.
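Once the chapter audio is in an uncompressed form, extracting a unit's local audio from the recorded per-unit start times reduces to slicing the PCM buffer. A sketch under the assumption of raw fixed-width PCM; the function name, parameters, and data are hypothetical:

```python
def extract_local_audio(pcm, start_times, position, sample_rate=16000, sample_width=2):
    """pcm: raw (lossless) PCM bytes for the whole chapter audio.
    start_times: start time in seconds of each preset text unit, recorded
    while the chapter audio was generated. Returns the PCM slice covering
    the unit at `position` (from its start time to the next unit's start,
    or to the end of the chapter for the last unit)."""
    def to_byte(t):
        return int(t * sample_rate) * sample_width
    begin = to_byte(start_times[position])
    end = (to_byte(start_times[position + 1])
           if position + 1 < len(start_times) else len(pcm))
    return pcm[begin:end]

# Toy data: 3 units, 4 one-byte samples each, at a 4 Hz "sample rate":
chapter_pcm = b"aaaabbbbcccc"
unit_starts = [0.0, 1.0, 2.0]   # seconds
print(extract_local_audio(chapter_pcm, unit_starts, 1, sample_rate=4, sample_width=1))
# b'bbbb'
```

This is also why claim 6 converts lossy formats first: frame-based codecs such as MP3 cannot be sliced at arbitrary byte offsets, whereas raw PCM can.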
7. The method of claim 1, wherein the preset text unit is determined based on a text change length and/or a text unit value range.
8. The method of claim 7, wherein the predetermined text unit is a sentence.
9. An apparatus for text-to-audio, comprising:
a first text position determining module, configured to determine, based on a preset text unit, a first local text of a first chapter text that has a text change relative to a second chapter text and a first text position of the first local text if a text change of a target chapter is detected, wherein the preset text unit is smaller than one chapter, the first chapter text is the chapter text of the target chapter after the text change, and the second chapter text is the chapter text of the target chapter before the text change;
a first local audio generating module, configured to perform audio conversion processing on the first local text to generate a first local audio; and
a second chapter audio generating module, configured to generate a second chapter audio of the target chapter after the text change based on the first text position, the first local audio, and a first chapter audio, wherein the first chapter audio is the chapter audio of the target chapter before the text change.
10. An electronic device, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the method of text-to-audio according to any of claims 1-8.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of text-to-audio according to any of the preceding claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210669457.6A CN115050349B (en) | 2022-06-14 | 2022-06-14 | Method, apparatus, device and medium for text-to-audio conversion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210669457.6A CN115050349B (en) | 2022-06-14 | 2022-06-14 | Method, apparatus, device and medium for text-to-audio conversion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115050349A true CN115050349A (en) | 2022-09-13 |
CN115050349B CN115050349B (en) | 2024-06-11 |
Family
ID=83161448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210669457.6A Active CN115050349B (en) | 2022-06-14 | 2022-06-14 | Method, apparatus, device and medium for text-to-audio conversion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115050349B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1328321A (en) * | 2000-05-31 | 2001-12-26 | 松下电器产业株式会社 | Apparatus and method for providing information by speech |
JP2005189313A (en) * | 2003-12-24 | 2005-07-14 | Canon Electronics Inc | Device and method for speech synthesis |
US20060190249A1 (en) * | 2002-06-26 | 2006-08-24 | Jonathan Kahn | Method for comparing a transcribed text file with a previously created file |
JP2009086597A (en) * | 2007-10-03 | 2009-04-23 | Hitachi Ltd | Text-to-speech conversion service system and method |
BE1018568A3 (en) * | 2009-09-04 | 2011-03-01 | Koba Vision | READING EQUIPMENT AND METHOD OF READING. |
US20110054647A1 (en) * | 2009-08-26 | 2011-03-03 | Nokia Corporation | Network service for an audio interface unit |
JP2014102380A (en) * | 2012-11-20 | 2014-06-05 | Nippon Hoso Kyokai <Nhk> | Speech synthesizer and program |
CN106710597A (en) * | 2017-01-04 | 2017-05-24 | 广东小天才科技有限公司 | Voice data recording method and device |
CN108519966A (en) * | 2018-04-11 | 2018-09-11 | 掌阅科技股份有限公司 | The replacement method and computing device of e-book particular text element |
US20180336193A1 (en) * | 2017-05-18 | 2018-11-22 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Artificial Intelligence Based Method and Apparatus for Generating Article |
GB202017768D0 (en) * | 2020-11-11 | 2020-12-23 | Sony Interactive Entertainment Inc | Apparatus and method for analysis of audio recordings |
US20210034702A1 (en) * | 2019-07-31 | 2021-02-04 | International Business Machines Corporation | Named entity recognition |
CN112765397A (en) * | 2021-01-29 | 2021-05-07 | 北京字节跳动网络技术有限公司 | Audio conversion method, audio playing method and device |
WO2021190267A1 (en) * | 2020-03-25 | 2021-09-30 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | System and method for providing computer aided memorization of text |
CN113936699A (en) * | 2020-06-29 | 2022-01-14 | 腾讯科技(深圳)有限公司 | Audio processing method, device, equipment and storage medium |
CN114023301A (en) * | 2021-11-26 | 2022-02-08 | 掌阅科技股份有限公司 | Audio editing method, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115050349B (en) | 2024-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112115706B (en) | Text processing method and device, electronic equipment and medium | |
CN108733317B (en) | Data storage method and device | |
CN111767740B (en) | Sound effect adding method and device, storage medium and electronic equipment | |
US11783808B2 (en) | Audio content recognition method and apparatus, and device and computer-readable medium | |
CN112380876B (en) | Translation method, device, equipment and medium based on multilingual machine translation model | |
CN109815448B (en) | Slide generation method and device | |
CN115967833A (en) | Video generation method, device and equipment meter storage medium | |
CN111813465B (en) | Information acquisition method, device, medium and equipment | |
CN116072108A (en) | Model generation method, voice recognition method, device, medium and equipment | |
CN117540021B (en) | Large language model training method, device, electronic equipment and computer readable medium | |
CN114564606A (en) | Data processing method and device, electronic equipment and storage medium | |
CN112954453B (en) | Video dubbing method and device, storage medium and electronic equipment | |
CN112017685B (en) | Speech generation method, device, equipment and computer readable medium | |
CN115050349B (en) | Method, apparatus, device and medium for text-to-audio conversion | |
CN113986958B (en) | Text information conversion method and device, readable medium and electronic equipment | |
CN113807056B (en) | Document name sequence error correction method, device and equipment | |
CN112910855B (en) | Sample message processing method and device | |
CN114547040A (en) | Data processing method, device, equipment and medium | |
CN115374320B (en) | Text matching method and device, electronic equipment and computer medium | |
CN116467178B (en) | Database detection method, apparatus, electronic device and computer readable medium | |
CN112270170B (en) | Implicit expression statement analysis method and device, medium and electronic equipment | |
CN117560417B (en) | Queue control method and device for message transmission, electronic equipment and medium | |
CN114374738B (en) | Information pushing method and device, storage medium and electronic equipment | |
CN115543925B (en) | File processing method, device, electronic equipment and computer readable medium | |
CN114997120B (en) | Method, device, terminal and storage medium for generating document tag |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing. Applicant after: Douyin Vision Co.,Ltd. Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing. Applicant before: Tiktok vision (Beijing) Co.,Ltd. |
GR01 | Patent grant | ||