WO2023241415A1 - Method, apparatus, electronic device, and medium for generating a soundtrack of text

Publication number
WO2023241415A1
Authority
WO
WIPO (PCT)
Prior art keywords
plot
paragraphs
neural network
text
unit
Application number
PCT/CN2023/098710
Inventor
伍林
陈子恺
殷翔
马泽君
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023241415A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/075 Musical metadata derived from musical analysis or for use in electrophonic musical instruments
    • G10H2240/085 Mood, i.e. generation, detection or selection of a particular emotional content or atmosphere in a musical piece

Definitions

  • Embodiments of the present disclosure relate to the field of artificial intelligence technology, and more specifically, to methods, apparatuses, electronic devices, computer-readable storage media, and computer program products for generating a soundtrack of text.
  • In audiobook production, background music (BGM) is often inserted to create an immersive effect. The background music is matched to the plot: for example, a comedic plot is paired with humorous music, and a tragic plot is paired with sad music.
  • embodiments of the present disclosure propose a technical solution for generating text soundtracks.
  • a method for generating a soundtrack of text includes dividing the text into at least one plot unit based on semantics of a plurality of paragraphs of the text.
  • the method also includes determining a plot category for the at least one plot unit.
  • the method also includes determining music matching the at least one plot unit based on the plot category. In this way, the plots in the text can be determined automatically and accurately, and matching background music can be selected for each plot, thereby improving the effect of audiobooks.
  • a method for training a first neural network model is provided.
  • the first neural network model is used to generate hidden state representations and plot category representations of paragraphs in text.
  • the method includes: using the first neural network model to generate plot division representations and hidden state representations for each of the plurality of paragraphs in the training data set, wherein each of the plurality of paragraphs in the training data set has a first label and a second label, the first label indicates whether the corresponding paragraph is at a plot boundary, and the second label indicates the plot category of the corresponding paragraph.
  • the method also includes determining a first loss based on the first label and the plot division representation.
  • the method also includes determining a second loss based on the second label and the hidden state representation.
  • the method also includes updating parameters of the first neural network model based on the first loss and the second loss. Based on this method, when the neural network is trained to divide the text into plots, the neural network also learns the plot category information of the paragraph, so that the trained model has higher plot division accuracy.
  • an apparatus for generating a soundtrack of text includes a plot division module, a plot classification module and a music determination module.
  • the plot division module is configured to divide the text into at least one plot unit based on semantics of a plurality of paragraphs of the text.
  • the plot classification module is configured to determine a plot category for the at least one plot unit.
  • the music determination module is configured to determine music matching the at least one plot unit based on the plot category.
  • an apparatus for training a first neural network model includes a representation generation module configured to use a first neural network model to generate a hidden state representation of a plurality of paragraphs in a training data set, wherein the plurality of paragraphs in the training data set have labels, and the labels indicate the plot categories of the corresponding paragraphs.
  • the apparatus also includes a loss calculation module configured to determine the first loss based on the label and the hidden state representation.
  • the apparatus further includes a parameter update module configured to update parameters of the first neural network model based on the first loss.
  • an electronic device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to the first or second aspect of the present disclosure.
  • a computer-readable storage medium including machine-executable instructions that, when executed by a device, cause the device to perform the method according to the first aspect or the second aspect of the present disclosure.
  • a computer program product comprising machine-executable instructions that, when executed by a device, cause the device to perform the method according to the first or second aspect of the present disclosure.
  • FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • FIG. 2 illustrates a schematic flowchart of a method for generating a soundtrack of text according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of an exemplary plot division according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of the structure of a first neural network model for dividing plots according to an embodiment of the present disclosure
  • Figure 5 shows a schematic diagram of a process of training a first neural network model according to an embodiment of the present disclosure
  • Figure 6 shows a schematic flowchart of a method for training a first neural network model according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of the structure of a second neural network model for determining plot categories according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of a method for determining plot categories according to an embodiment of the present disclosure
  • Figure 9 shows a schematic flowchart of a method of selecting music for a plot according to an embodiment of the present disclosure
  • FIG. 10 shows a schematic block diagram of an apparatus for generating a soundtrack of text according to an embodiment of the present disclosure
  • Figure 11 shows a schematic block diagram of an apparatus for training a neural network model according to an embodiment of the present disclosure
  • Figure 12 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.
  • the term “include” and its variations mean open-ended inclusion, i.e., "including but not limited to." Unless otherwise stated, the term “or” means “and/or”. The term “based on” means “based at least in part on.” The terms “one example embodiment” and “an embodiment” mean “at least one example embodiment.” The term “another embodiment” means “at least one additional embodiment”. The terms “first,” “second,” etc. may refer to different or the same objects. Other explicit and implicit definitions may be included below.
  • embodiments of the present disclosure provide a solution for automatically selecting background music based on text.
  • the text is first divided into several plot units based on the semantics of multiple paragraphs included in the text.
  • Then, the plot category of each plot unit is determined. In some embodiments, the plot category may reflect the emotional information contained in the plot unit.
  • Finally, music matching the plot unit is determined based on the determined plot category. In this way, the scope and category of each plot in the text can be determined automatically and accurately, and matching background music can be selected for each plot, thus enhancing the audiobook effect.
  • Figure 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Text 101 may include content obtained from, for example, a novel or other genre e-book.
  • the text 101 includes several chapters of the e-book, and each chapter may include several paragraphs, and the paragraphs include characters and punctuation marks in any language.
  • text 101 may be input to a text-to-speech system (Text-to-Speech, TTS) 120 to generate speech corresponding to text 101 .
  • Speech may be generated using any known or future-developed text-to-speech technology (e.g., a neural network model).
  • the speech obtained from the text-to-speech conversion system 120 corresponds to the characters in the text 101 and does not include any background music. Therefore, just listening to the speech converted from text 101 lacks immersion for the audience and the effect is not good.
  • Text 101 may also be provided to the soundtrack system 110 .
  • the soundtrack system 110 may be implemented on a single device or a cluster of multiple devices, for example, on a cloud-based server as a cloud service that generates background music from text.
  • the soundtrack system 110 is used to generate background music for the text 101 .
  • the text 101 may include several chapters, and each chapter may include several plots. It should be understood that different plots may contain different emotional information, such as tension, warmth, threat, etc., so appropriate music types need to be selected to match.
  • the soundtrack system 110 is designed to include a plot division module 112, a plot classification module 114, and a music determination module 116.
  • the plot division module 112 uses the paragraphs of the text 101 as the division granularity to divide the text 101 into several plot units (herein, plot unit and plot have the same meaning, and they can be used interchangeably).
  • the plot classification module 114 determines a category for each divided plot unit, and the category reflects the emotional information contained in the plot.
  • the music determination module 116 determines music that matches the plot unit according to the category of the plot unit, for example, selects a piece of music with the same emotional information from the music library, or generates a piece of such music.
  • the plot division module 112 and the plot classification module 114 may use neural network models to automatically divide the text and determine the categories of plots, respectively. A detailed description is given below with reference to Figures 2 to 8.
  • the synthesis module 130 combines the background music and the speech from the text-to-speech system 120 to generate the audiobook 140.
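The division, classification, and music determination steps described above can be pictured as a small pipeline. The following is only a minimal sketch: `PlotUnit`, `generate_soundtrack`, and the three callables are hypothetical names introduced for illustration, and the trained models of the disclosure would take the place of the stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlotUnit:
    """Hypothetical container for one plot unit and its results."""
    paragraphs: List[str]
    category: str = ""
    music: str = ""


def generate_soundtrack(
    paragraphs: List[str],
    divide: Callable[[List[str]], List[List[str]]],
    classify: Callable[[List[str]], str],
    pick_music: Callable[[str], str],
) -> List[PlotUnit]:
    """Divide the text, classify each plot unit, then pick music per unit."""
    units = [PlotUnit(paragraphs=group) for group in divide(paragraphs)]
    for unit in units:
        unit.category = classify(unit.paragraphs)
        unit.music = pick_music(unit.category)
    return units
```

Passing the three stages in as callables keeps the sketch independent of any particular model; each stage can be replaced without touching the others.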
  • An exemplary environment in which embodiments of the present disclosure can be implemented is described above with reference to FIG. 1. It should be understood that Figure 1 is only schematic, and the environment may also include more modules or systems, or some modules or systems may be omitted, or the modules or systems shown may be recombined. Embodiments of the present disclosure may be implemented in environments different from those shown in FIG. 1, and the disclosure is not limited thereto.
  • Figure 2 illustrates a schematic flowchart of a method 200 for generating a soundtrack of text, in accordance with an embodiment of the present disclosure.
  • the method 200 may be implemented, for example, by the soundtrack system 110 as shown in FIG. 1 . It should be understood that method 200 may also include additional actions not shown and/or illustrated actions may be omitted, and the scope of the present disclosure is not limited in this regard.
  • the method 200 is described in detail below in conjunction with FIG. 1 .
  • the text 101 is divided into at least one plot unit based on the semantics of the plurality of paragraphs of the text 101.
  • the text 101 may include several chapters of an e-book, and a chapter may be composed of several paragraphs.
  • text 101 may be a chapter of an electronic book, which includes multiple paragraphs.
  • dividing the text into plot units refers to dividing the text into continuous text subsets with paragraphs as the smallest unit. Each text subset includes at least one paragraph and has the same emotional information.
  • a neural network model may be used to determine paragraphs in the text 101 at plot boundaries, whereby the text 101 may be divided into at least one plot unit based on the paragraphs at the plot boundaries.
  • FIG. 3 illustrates a schematic diagram of an exemplary plot division according to an embodiment of the present disclosure.
  • text 101 is schematically shown as including paragraph 1 through paragraph n, where n is an integer of any suitable size.
  • the neural network model can generate a label for any paragraph from paragraph 1 to paragraph n, and the label indicates whether the corresponding paragraph is at the plot boundary.
  • paragraph k (k is an integer less than n) has the label [SEP] 301, paragraph k+1 has the label [SEP] 302, and the other paragraphs, between paragraph 2 and paragraph k and between paragraph k+1 and paragraph n, have the labels [NON] 303 and 304.
  • the label [SEP] indicates that the paragraph is at plot boundary 305.
  • the above-mentioned neural network model for plot division can be a trained binary classification model, for example, determining whether the label of a paragraph is [NON] or [SEP].
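Once each paragraph carries a [SEP] or [NON] label, turning the labels into plot units is a simple scan. The text does not spell out the exact labelling convention, so the sketch below assumes, purely for illustration, that [SEP] marks the last paragraph of a plot unit; `split_into_plot_units` is a hypothetical helper name.

```python
from typing import List


def split_into_plot_units(paragraphs: List[str], labels: List[str]) -> List[List[str]]:
    """Group consecutive paragraphs into plot units using boundary labels.

    Assumption (not stated in the source): a paragraph labelled "[SEP]"
    closes the current plot unit; "[NON]" paragraphs extend it.
    """
    if len(paragraphs) != len(labels):
        raise ValueError("one label per paragraph is required")
    units: List[List[str]] = []
    current: List[str] = []
    for paragraph, label in zip(paragraphs, labels):
        current.append(paragraph)
        if label == "[SEP]":
            units.append(current)
            current = []
    if current:  # trailing paragraphs without a closing [SEP]
        units.append(current)
    return units
```

Under a different convention (for example, [SEP] marking the first paragraph of a unit), only the position of the `append` relative to the label check would change.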
  • FIG. 4 shows a schematic diagram of the structure of a first neural network model 400 according to an embodiment of the present disclosure.
  • the first neural network model 400 is used to divide the text 101 into plot units.
  • the first neural network model 400 includes a first semantic network 402 .
  • the first semantic network 402 is used to generate semantic representations of individual paragraphs in the text 101 .
  • the first semantic network 402 may be a pre-trained BERT model.
  • the first semantic network 402 receives the character sequence $p_{ij}$ of the text, where $p_{ij}$ represents the j-th paragraph of the i-th chapter of the text 101, and generates the semantic representation $e_{ij}$ of the corresponding paragraph. It should be noted that all paragraphs of the text 101 are input to the first semantic network 402 together, and respective semantic representations of all paragraphs are generated.
  • the generated semantic representation $e_{ij}$ may be the [CLS] representation generated by the BERT model for the paragraph, such as a 768-dimensional vector.
  • the first neural network model 400 also includes a recurrent neural network 404 located behind the first semantic network 402 .
  • the recurrent neural network 404 may be, for example, a Bidirectional Gated Recurrent Unit (BiGRU) model.
  • the recurrent neural network 404 is used to extract the sequential dependency information between adjacent paragraphs of the text 101 to enhance the semantic representation $e_{ij}$, and to generate the hidden state representation $h_{ij}$ of the corresponding paragraph. For example, based on the semantic representation $e_{ij}$ of paragraph $p_{ij}$ and the semantic representations $e_{i,j-1}$ and $e_{i,j+1}$ of the neighboring paragraphs $p_{i,j-1}$ and $p_{i,j+1}$, the hidden state representation $h_{ij}$ of paragraph $p_{ij}$ can be generated.
  • the hidden state representation $h_{ij}$ of paragraph $p_{ij}$ depends on the specific implementation of the recurrent neural network 404.
  • $h_{ij}$ may be, for example, a 512-dimensional vector.
  • the hidden state representation of all paragraphs is generated.
  • the classification of a paragraph as to whether it is at a plot boundary may be determined based on a hidden state representation of the paragraph.
  • the hidden state representation of the paragraph may be provided to the feedforward network 409 of the first neural network model to generate the output vector $o_{ij}$ of the paragraph, from which the division label 410 is obtained.
  • the feedforward network 409 may be, for example, a fully connected layer.
  • the first neural network model 400 may also include a convolutional network 406 and a similarity network 408 located after the recurrent neural network 404.
  • the convolutional network 406 is used to extract coherence between adjacent paragraphs.
  • the convolutional network 406 may include a 1-dimensional convolution layer with a convolution kernel size of 3 and a stride of 1; that is, a convolution operation is performed on three adjacent hidden states $h_{i,j-1}$, $h_{ij}$, $h_{i,j+1}$ to obtain the convolution result $c_{ij}$ of paragraph $p_{ij}$. It should be understood that the above example of the convolutional network 406 is only illustrative, and the present disclosure does not limit the specific implementation of the convolutional network 406.
  • the similarity network 408 can mine similarity information $Sim_{ij}$ between adjacent paragraphs based on the convolution result $c_{ij}$ as an additional hidden state representation.
  • here, $o'_{ij}$ represents the input of the feedforward network 409; $o_{ij}$ represents the output of the feedforward network 409 (also called the plot division representation), which represents the probability that the corresponding paragraph is at a plot boundary; sim() is the similarity calculation function; T represents the transpose operation; and $FF_s()$ and $FF_f()$ represent fully connected layers.
  • plot division labels for each paragraph of the text 101 can be generated from the text 101, thereby dividing the text into several plots.
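To make the convolution and similarity steps more concrete, the sketch below slides a size-3, stride-1 window over a list of hidden-state vectors and then compares neighbouring results. It is only an illustration under stated assumptions: the formulas of the disclosure are not reproduced in this text, so the fixed averaging kernel (in place of learned weights), the zero padding, and the use of cosine similarity are all choices made here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def conv1d_k3(hidden: List[Vector], kernel: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> List[Vector]:
    """Size-3, stride-1 window over paragraphs, zero-padded at both ends.

    Assumption: one scalar weight per window position (a simplification of
    a learned 1-D convolution layer).
    """
    if not hidden:
        return []
    dim = len(hidden[0])
    zero = [0.0] * dim
    padded = [zero] + hidden + [zero]
    out = []
    for j in range(len(hidden)):
        window = padded[j:j + 3]
        out.append([sum(w * v[d] for w, v in zip(kernel, window)) for d in range(dim)])
    return out


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def adjacent_similarity(conv: List[Vector]) -> List[float]:
    """Similarity of each paragraph to the next; the last paragraph gets 0.0."""
    if not conv:
        return []
    return [cosine(conv[j], conv[j + 1]) for j in range(len(conv) - 1)] + [0.0]
```

A low similarity between neighbouring paragraphs is the kind of signal that could hint at a plot boundary, which is why it is useful as an additional input alongside the hidden state.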
  • Embodiments of the present disclosure also provide an effective training method that utilizes the plot carried by the hidden state representation of the paragraph to construct the training target of the first neural network model 400 .
  • Figure 5 shows a schematic diagram of a process of training a neural network model according to an embodiment of the present disclosure.
  • the hidden state representation $h_{ij}$ generated by the recurrent neural network 404 and the plot division representation $o_{ij}$ generated by the feedforward network are provided to the multi-task training module 501.
  • the multi-task training module 501 may be implemented in the soundtrack system 110 of FIG. 1 , or may be implemented on other devices separate from the soundtrack system 110 .
  • the multi-task training module 501 constructs the loss function of the first neural network model 400 based on the hidden state representation $h_{ij}$ and the plot division representation $o_{ij}$, combining a division loss and a plot-category-based loss (formula (6)); the combination involves a hyperparameter.
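A multi-task loss of this kind can be sketched as a boundary (division) loss plus a plot-category loss. The formulas themselves are not reproduced in this text, so everything below is an assumption made for illustration: cross-entropy for both terms, averaging over paragraphs, and the form `division + weight * category` with `weight` as the hyperparameter.

```python
import math
from typing import List


def binary_cross_entropy(p_boundary: float, is_boundary: int) -> float:
    """Loss for one paragraph's boundary probability (label 1 = at boundary)."""
    eps = 1e-12
    p = min(max(p_boundary, eps), 1.0 - eps)
    return -(is_boundary * math.log(p) + (1 - is_boundary) * math.log(1.0 - p))


def categorical_cross_entropy(probs: List[float], target: int) -> float:
    """Loss for one paragraph's plot-category distribution."""
    return -math.log(max(probs[target], 1e-12))


def multitask_loss(
    boundary_probs: List[float],
    boundary_labels: List[int],
    category_probs: List[List[float]],
    category_labels: List[int],
    weight: float = 0.5,  # hypothetical hyperparameter value
) -> float:
    """Assumed combination: mean division loss + weight * mean category loss."""
    n = len(boundary_labels)
    division = sum(
        binary_cross_entropy(p, y) for p, y in zip(boundary_probs, boundary_labels)
    ) / n
    category = sum(
        categorical_cross_entropy(p, y) for p, y in zip(category_probs, category_labels)
    ) / n
    return division + weight * category
```

Setting `weight` to 0 reduces the sketch to plain boundary classification, which makes the contribution of the auxiliary category task easy to isolate.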
  • FIG. 6 shows a schematic flowchart of a method 600 for training a neural network model according to an embodiment of the present disclosure.
  • the method 600 may be implemented, for example, by the multi-task training module 501 as shown in Figure 5 . It should be understood that method 600 may also include additional actions not shown and/or actions shown may be omitted, and the scope of the present disclosure is not limited in this respect. Method 600 is described in detail below in conjunction with FIG. 5 .
  • the first neural network model 400 is used to generate hidden state representations of the plurality of paragraphs in the training data set.
  • the training dataset consists of text consisting of multiple paragraphs, which can have corresponding labels indicating the plot category of the paragraph.
  • plot category labels may be added, for example manually, to individual paragraphs of the e-book.
  • Tags may indicate categories such as warmth, joy, romance, excitement, threat, sadness, hurt, misunderstanding, conflict, positive, negative, neutral, and the like. Consecutive paragraphs with the same plot category label can be considered a plot unit. Thus, plot boundaries are formed at plot changes or chapter changes.
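Because consecutive paragraphs with the same plot category label form one plot unit, both the units and the boundary labels can be derived from the category labels alone. The sketch below shows this for the paragraphs of a single chapter; the helper names are hypothetical, and marking the last paragraph of each unit as [SEP] is an assumed convention rather than one stated in the text.

```python
from itertools import groupby
from typing import List, Tuple


def units_from_category_tags(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (category, start_index, end_index_inclusive) per run of equal tags."""
    units = []
    start = 0
    for category, run in groupby(tags):
        length = len(list(run))
        units.append((category, start, start + length - 1))
        start += length
    return units


def boundary_labels(tags: List[str]) -> List[str]:
    """Derive [SEP]/[NON] labels; assumed: the last paragraph of a unit is [SEP]."""
    labels = ["[NON]"] * len(tags)
    for _, _, end in units_from_category_tags(tags):
        labels[end] = "[SEP]"
    return labels
```

Running the helpers chapter by chapter also places a boundary at every chapter change, since the last paragraph of a chapter always ends a unit.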
  • the first neural network model 400 may include a first semantic network 402 and a recurrent neural network 404.
  • By inputting character representations of paragraphs of text into the first semantic network 402, semantic representations of each paragraph are obtained.
  • the semantic representation can also be input to the recurrent neural network 404, thereby generating hidden state representations for each of the multiple paragraphs. This is similar to the process described with reference to Figure 4.
  • a first loss is determined based on the hidden state representation and labels.
  • the first loss may be the plot-category-based loss described with reference to Figure 5, which can be obtained according to formula (4) and formula (5); the details are not repeated here.
  • parameters of the first neural network are updated based on the first loss.
  • the parameters of the first neural network may be updated iteratively by a gradient descent method.
  • the first neural network model 400 may also include a convolutional network 406 and a similarity network 408.
  • the method 600 may further include generating, based on the convolutional network 406 and the similarity network 408, a plot division representation of each of the plurality of paragraphs from the hidden state representation of the plurality of paragraphs.
  • the second loss may be determined based on the plot division representation and the labels of the plurality of paragraphs.
  • the second loss may be the division loss described with reference to FIG. 5, which can be calculated by formula (3). It should be understood that because the plot category of each paragraph is known from its plot category label, the boundaries between plot units, and hence the plot division labels [SEP] or [NON], are also known.
  • the parameters of the first neural network model may be updated based on the first loss and the second loss.
  • the first loss and the second loss are combined together according to formula (6), and the parameters of the first neural network are iteratively updated through the gradient descent method.
  • Although plot category information is used in the training process, the predicted plot categories of individual paragraphs are not used when the neural network model predicts plot divisions. This is because a single paragraph carries little plot category information, which may lead to errors in plot category prediction.
  • a plot category for at least one plot unit is determined.
  • a single paragraph has less plot information, so plot classification is based on the entirety of the plot unit.
  • another neural network model is used to determine the categories of plot units.
  • FIG. 7 shows a schematic diagram of the structure of a second neural network model 700 for determining plot categories according to an embodiment of the present disclosure.
  • the second neural network model 700 includes a second semantic network 702 and a self-attention network 704.
  • the second semantic network 702 is used to generate a semantic representation of the text content.
  • the second semantic network 702 may be a BERT model and may be a copy of the first semantic network 402 of the trained first neural network model 400.
  • the text contents S1, S2, ... St of the plot unit are provided to the second semantic network 702, and the corresponding semantic representations U1, U2, ... Ut are generated.
  • Although FIG. 7 shows multiple second semantic networks, the second neural network model 700 may include a single second semantic network 702.
  • Self-attention network 704 may include a multi-head attention layer. Using the self-attention network 704, the second neural network model 700 can determine the plot category based on more important text content.
  • the training of the second neural network model 700 may adopt a cross-entropy based loss function.
  • the paragraphs or plot units of the training set for the second neural network model 700 may have labels for plot categories such as warmth, joy, romance, excitement, threat, sadness, hurt, misunderstanding, conflict, positive, negative, neutral.
  • Figure 8 shows a schematic flow diagram of a method 800 for determining episode categories according to an embodiment of the present disclosure.
  • the method 800 may be implemented, for example, by the soundtrack system 110 as shown in FIG. 1. It should be understood that method 800 may also include additional actions not shown and/or actions shown may be omitted, and the scope of the present disclosure is not limited in this regard. Method 800 is described below in conjunction with FIG. 7.
  • Method 800 is used to determine a plot category of a first plot unit in the determined at least one plot unit. Method 800 may also be used to determine plot categories for other plot units.
  • the first plot unit is divided into a plurality of paragraph groups.
  • the paragraphs in the first plot unit can be combined in the order of paragraphs to obtain multiple paragraph groups. For example only, if the first plot unit includes twenty paragraphs, paragraphs 1 to 5 can be combined into paragraph group S1, paragraphs 6 to 10 can be combined into paragraph group S2, and so on.
  • Embodiments of the present disclosure do not limit the number of paragraph groups within a plot unit and the number of paragraphs within each paragraph group.
  • plot units may be divided into paragraph groups in a random manner. For example, first treat the first plot unit as a whole and randomly divide it into two paragraph groups. Then, the longer paragraph group is randomly divided into two smaller paragraph groups, and so on, until the number of paragraph groups in the plot unit reaches a preset number, such as 8 or any other number.
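Both grouping strategies described above are easy to sketch. In the code below, `fixed_size_groups` follows the five-paragraphs-per-group example, and `random_groups` repeatedly splits the longest group at a random point until the preset number of groups is reached. The function names are hypothetical, and measuring "longer" by paragraph count (rather than, say, character count) is an assumption.

```python
import random
from typing import List, Optional


def fixed_size_groups(paragraphs: List[str], size: int = 5) -> List[List[str]]:
    """Combine paragraphs in order into groups of `size` (last one may be shorter)."""
    return [paragraphs[i:i + size] for i in range(0, len(paragraphs), size)]


def random_groups(
    paragraphs: List[str],
    target_groups: int = 8,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Randomly split the longest group in two until `target_groups` is reached."""
    rng = rng or random.Random()
    groups = [list(paragraphs)]
    while len(groups) < target_groups:
        idx = max(range(len(groups)), key=lambda i: len(groups[i]))
        longest = groups[idx]
        if len(longest) < 2:  # nothing left that can be split
            break
        cut = rng.randint(1, len(longest) - 1)  # both halves stay non-empty
        groups[idx:idx + 1] = [longest[:cut], longest[cut:]]
    return groups
```

Both helpers keep the paragraphs in their original order, so each group remains a contiguous stretch of the plot unit.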
  • a group semantic representation for each of the plurality of paragraph groups is generated.
  • the character sequences of each paragraph of the paragraph group are spliced together in sequence and input to the second semantic network 702.
  • the resulting group semantic representation may be, for example, a 768-dimensional vector.
  • the [CLS] output of the BERT model can be used as the group semantic representation; it represents the overall semantic information of the paragraph group. If the spliced character sequence is too long, it can be truncated, with the leading portion used as the input of the second semantic network 702.
  • a plot category representation of the first plot unit is generated from the plurality of group semantic representations based on the self-attention network, to determine the first plot category.
  • the plot category representation indicates the probability that the plot unit belongs to each plot category.
  • the plot category with the maximum probability may be determined as the plot category of the first plot unit.
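Selecting the plot category from the category representation amounts to taking the most probable entry. The sketch below assumes, for illustration only, that the model emits raw scores that are normalized with a softmax; the category list is a subset of the labels mentioned in the text, and `pick_category` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple

# Subset of the plot categories named in the text, for illustration only.
CATEGORIES = ["warmth", "joy", "threat", "sadness"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw category scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_category(scores: Sequence[float], categories: Sequence[str] = CATEGORIES) -> Tuple[str, float]:
    """Return the most probable category and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs[best]
```

Returning the probability alongside the label leaves room for a caller to treat low-confidence predictions differently, although the text does not describe such handling.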
  • the plot category of each plot unit in the text 101 can be determined.
  • music matching the at least one plot unit is determined based on the plot category.
  • the music may be selected from an existing music library, where the music in the music library may have associated tag information. If the label information of the music matches the plot category of the plot unit, for example, the semantic similarity is high, the music can be considered to match the current plot unit. Alternatively, music with a similar style may also be generated based on plot categories.
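Selecting from a tagged music library can be sketched as scoring each track's tags against the plot category and keeping the best match above a threshold. The `Track` type, the `select_track` function, the threshold value, and the exact-match stand-in for semantic similarity are all hypothetical; a real system would plug in an actual semantic similarity measure.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Track(NamedTuple):
    """Hypothetical music-library entry with its tag information."""
    title: str
    tags: Tuple[str, ...]


def exact_match(category: str, tag: str) -> float:
    """Stand-in for a semantic similarity measure: 1.0 if equal, else 0.0."""
    return 1.0 if category == tag else 0.0


def select_track(
    category: str,
    library: List[Track],
    similarity: Callable[[str, str], float] = exact_match,
    threshold: float = 0.5,
) -> Optional[Track]:
    """Return the best-matching track, or None if no track reaches the threshold."""
    scored = [
        (max((similarity(category, tag) for tag in track.tags), default=0.0), track)
        for track in library
    ]
    if not scored:
        return None
    best_score, best_track = max(scored, key=lambda pair: pair[0])
    return best_track if best_score >= threshold else None
```

When no track matches, the `None` result marks the point where the alternative mentioned in the text, generating music from the plot category, could take over.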
  • the speech may be generated by text-to-speech system 120.
  • Figure 9 shows a schematic flowchart of a method 900 of selecting music for a plot according to an embodiment of the present disclosure.
  • first, it is determined whether the current plot is a long plot. For example, if the number of words of the plot exceeds a threshold number (e.g., 200 words), it may be determined that the current plot is a long plot. If it is not a long plot, method 900 proceeds to block 904 to select music that matches the plot category.
  • if it is a long plot, method 900 proceeds to block 906 to determine whether the plot has more dialogue than narrative. If so, at block 908, the dialogue portion is determined as the content to be soundtracked. Otherwise, the method proceeds to block 910, where the narrative portion is determined as the content to be soundtracked.
  • it is then determined whether the length of the content to be soundtracked is greater than a threshold (e.g., 500 words). If it is greater than the threshold, the method 900 proceeds to block 914 to select multiple matching pieces of music and splice them. If not, method 900 proceeds to block 916 to select music that matches the plot.
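The decision flow of method 900 can be summarized as a small function. This is a sketch only: `ScoringPlan` and `plan_music` are hypothetical names, the thresholds are the example values from the text (200 and 500 words), and it is assumed that the second threshold is compared against the word count of the content chosen for scoring.

```python
from typing import NamedTuple


class ScoringPlan(NamedTuple):
    content: str  # "whole", "dialogue", or "narrative"
    pieces: str   # "single" or "multiple" (multiple pieces are spliced)


def plan_music(
    total_words: int,
    dialogue_words: int,
    narrative_words: int,
    long_threshold: int = 200,
    splice_threshold: int = 500,
) -> ScoringPlan:
    """Decide what to score and whether one piece of music suffices."""
    if total_words <= long_threshold:
        # Short plot: one piece matching the plot category.
        return ScoringPlan("whole", "single")
    if dialogue_words > narrative_words:
        content, length = "dialogue", dialogue_words
    else:
        content, length = "narrative", narrative_words
    pieces = "multiple" if length > splice_threshold else "single"
    return ScoringPlan(content, pieces)
```

For instance, a 900-word plot with 600 words of dialogue would be scored over its dialogue portion with several spliced pieces under these example thresholds.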
  • a neural network model is used to perform plot division based on semantic information, using a sequence labeling approach.
  • Embodiments of the present disclosure also provide an effective training method for the neural network model.
  • semantic information and attention mechanisms are also utilized to achieve accurate plot classification. Compared with traditional solutions, embodiments of the present disclosure save a lot of manual work in adding background music to audiobooks and achieve good performance.
  • FIG. 10 shows a schematic block diagram of an apparatus 1000 for generating a soundtrack of text according to an embodiment of the present disclosure.
  • The apparatus 1000 may be arranged at the soundtrack system 110.
  • the apparatus 1000 is implemented, for example, by a computing device or a cluster of devices implementing the soundtrack system 110.
  • the apparatus 1000 includes a plot division module 1010, a plot classification module 1020, and a music determination module 1030.
  • the plot division module 1010 is configured to divide the text into at least one plot unit based on the semantics of multiple paragraphs of the text.
  • the plot classification module 1020 is configured to determine a plot category for the at least one plot unit.
  • the music determination module 1030 is configured to determine music matching the at least one plot unit based on the plot category.
  • the plot division module 1010 is further configured to use the first neural network model to determine paragraphs among the plurality of paragraphs that are at plot boundaries, and, based on the paragraphs determined to be at plot boundaries, divide the text into at least one plot unit.
  • the first neural network includes a first semantic network and a recurrent neural network
  • the plot division module 1010 is further configured to: generate respective semantic representations of the multiple paragraphs based on the first semantic network; generate respective hidden state representations of the multiple paragraphs from their semantic representations based on the recurrent neural network; and determine, based on the hidden state representations, the division category of each paragraph, that is, whether it is at a plot boundary.
  • the first neural network further includes a convolutional network and a similarity network
  • the plot division module 1010 is further configured to: generate additional hidden state representations from the hidden state representations of the multiple paragraphs based on the convolutional network and the similarity network; and generate, based on the hidden state representations and the additional hidden state representations, respective plot division representations of the multiple paragraphs to determine the division categories.
  • The at least one plot unit includes a first plot unit, and the plot classification module 1020 is further configured to determine a first plot category of the first plot unit using a second neural network model.
  • the second neural network model includes a second semantic network and a self-attention network
  • the plot classification module 1020 is further configured to: divide the first plot unit into multiple paragraph groups; generate respective group semantic representations of the paragraph groups based on the second semantic network; and generate, based on the self-attention network, a plot category representation of the first plot unit from the group semantic representations to determine the first plot category.
  • the music determination module is further configured to select matching music from the music library based on the episode category and the length of the speech corresponding to the at least one episode unit.
  • FIG. 11 shows a schematic block diagram of an apparatus 1100 for training a neural network model according to an embodiment of the present disclosure.
  • Device 1100 may be arranged at soundtrack system 110 .
  • apparatus 1100 is implemented, for example, by a computing device or a cluster of devices implementing soundtrack system 110 .
  • the apparatus 1100 includes a representation generation module 1110 configured to generate hidden state representations of a plurality of paragraphs in a training data set using a first neural network model. The paragraphs in the training data set have labels indicating the plot category of the corresponding paragraph.
  • the apparatus 1100 also includes a loss calculation module 1120.
  • the loss calculation module 1120 is configured to determine the first loss based on the label and the hidden state representation.
  • the apparatus 1100 also includes a parameter update module 1130.
  • the parameter update module 1130 is configured to update parameters of the first neural network model based on the first loss.
  • the first neural network model may include a first semantic network and a recurrent neural network.
  • the first semantic network may be, for example, a BERT model
  • the recurrent neural network may be, for example, a bidirectional GRU model.
  • the representation generation module 1110 may be further configured to determine respective semantic representations of the plurality of paragraphs in the first semantic network, and generate respective hidden state representations of the plurality of paragraphs from the semantic representations of the plurality of paragraphs based on the recurrent neural network.
  • the first neural network may also include a convolutional network and a similarity network.
  • a convolutional network can follow a recurrent neural network and can be, for example, a one-dimensional convolutional network.
  • the similarity network can follow the convolutional network and include a similarity calculation layer to calculate the similarity of adjacent paragraphs.
  • the representation generation module 1110 may also be configured to generate respective plot division representations of the plurality of paragraphs from the hidden state representations of the plurality of paragraphs based on the convolutional network and the similarity network.
  • the loss calculation module 1120 may also be configured to determine the second loss based on the plot division representations and the labels.
  • the parameter update module 1130 may be further configured to update the parameters of the first neural network model based on the first loss and the second loss.
  • labels in the training data set may indicate that the corresponding passage has one of the following plot categories: warm, happy, romantic, exciting, threatening, sad, hurt, misunderstanding, conflict, positive, negative, neutral.
  • FIG. 12 illustrates a schematic block diagram of an example device 1200 that may be used to implement embodiments of the present disclosure.
  • a backup system and/or a recovery system may be implemented by the device 1200.
  • device 1200 includes a central processing unit (CPU) 1201 that can perform various appropriate actions and processes in accordance with computer program instructions stored in read-only memory (ROM) 1202 or loaded from storage unit 1208 into random access memory (RAM) 1203.
  • RAM 1203 may also store various programs and data required for the operation of device 1200.
  • the CPU 1201, ROM 1202, and RAM 1203 are connected to each other through a bus 1204.
  • An input/output (I/O) interface 1205 is also connected to bus 1204.
  • Multiple components in the device 1200 are connected to the I/O interface 1205, including: input unit 1206, such as a keyboard or mouse; output unit 1207, such as various types of displays and speakers; storage unit 1208, such as a magnetic disk or optical disk; and communication unit 1209, such as a network card, modem, or wireless communication transceiver.
  • the communication unit 1209 allows the device 1200 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • methods 200, 600, 800 and/or 900 may be performed by the processing unit 1201.
  • methods 200, 600, 800, and/or 900 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208.
  • part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209.
  • When the computer program is loaded into RAM 1203 and executed by CPU 1201, one or more actions of methods 200, 600, 800, and/or 900 described above may be performed.
  • the disclosure may be a method, apparatus, system and/or computer program product.
  • a computer program product may include a computer-readable storage medium having thereon computer-readable program instructions for performing various aspects of the present disclosure.
  • Computer-readable storage media may be tangible devices that can retain and store instructions for use by an instruction execution device.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the above.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber optic cables), or electrical signals transmitted through wires.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to various computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage on a computer-readable storage medium in the respective computing/processing device.
  • Computer program instructions for performing operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet service provider).
  • an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized using state information of the computer-readable program instructions
  • the electronic circuit can execute the computer-readable program instructions to implement various aspects of the disclosure.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions cause a computer, programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function(s).
  • In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

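Embodiments of the present disclosure relate to a method, apparatus, electronic device, and medium for generating a soundtrack for text. The method includes dividing the text into at least one plot unit based on the semantics of multiple paragraphs of the text. The method further includes determining a plot category of the at least one plot unit. The method further includes determining, based on the plot category, music matching the at least one plot unit. According to embodiments of the present disclosure, the extent and category of each plot in the text can be determined automatically and accurately, and matching background music can be selected for each plot, thereby improving the quality of audiobooks.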

Description

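Method, Apparatus, Electronic Device, and Medium for Generating a Soundtrack for Text
Cross-Reference to Related Applications
This application claims priority to Chinese invention patent application No. 202210693446.1, entitled "用于生成文本的配乐的方法、装置、电子设备和介质" (Method, Apparatus, Electronic Device, and Medium for Generating a Soundtrack for Text) and filed on June 17, 2022, which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present disclosure relate to the field of artificial intelligence and, more specifically, to a method, apparatus, electronic device, computer-readable storage medium, and computer program product for generating a soundtrack for text.
Background
In audiobook production, background music (BGM) is often inserted to create an immersive effect. The background music depends on the plot: for example, a comedic plot is paired with witty, humorous music, whereas a tragic plot is paired with sorrowful music.
The choice of background music relies on plot determination. However, existing plot determination usually distinguishes plots manually, which is time-consuming and laborious and incurs high labor costs.
Summary
In view of this, embodiments of the present disclosure propose a technical solution for generating a soundtrack for text.
According to a first aspect of the present disclosure, a method for generating a soundtrack for text is provided. The method includes dividing the text into at least one plot unit based on the semantics of multiple paragraphs of the text. The method further includes determining a plot category of the at least one plot unit. The method further includes determining, based on the plot category, music matching the at least one plot unit. In this way, the plots in the text can be determined automatically and accurately, and matching background music can be selected for them, thereby improving the quality of audiobooks.
According to a second aspect of the present disclosure, a method for training a first neural network model is provided. The first neural network model is used to generate hidden state representations and plot category representations of paragraphs in text. The method includes using the first neural network model to generate respective plot division representations and hidden state representations of multiple paragraphs in a training data set, where each of the paragraphs in the training data set has a first label and a second label, the first label indicating whether the corresponding paragraph is at a plot boundary and the second label indicating the plot category of the corresponding paragraph. The method further includes determining a first loss based on the first labels and the plot division representations. The method further includes determining a second loss based on the second labels and the hidden state representations. The method further includes updating parameters of the first neural network model based on the first loss and the second loss. In this way, while the neural network is trained to divide text into plots, it also learns the plot category information of the paragraphs, so that the trained model achieves higher plot division accuracy.
According to a third aspect of the present disclosure, an apparatus for generating a soundtrack for text is also provided. The apparatus includes a plot division module, a plot classification module, and a music determination module. The plot division module is configured to divide the text into at least one plot unit based on the semantics of multiple paragraphs of the text. The plot classification module is configured to determine a plot category of the at least one plot unit. The music determination module is configured to determine, based on the plot category, music matching the at least one plot unit.
According to a fourth aspect of the present disclosure, an apparatus for training a first neural network model is also provided. The apparatus includes a representation generation module configured to use the first neural network model to generate hidden state representations of multiple paragraphs in a training data set, where the paragraphs in the training data set have labels indicating the plot category of the corresponding paragraph. The apparatus further includes a loss calculation module configured to determine a first loss based on the labels and the hidden state representations. The apparatus further includes a parameter update module configured to update parameters of the first neural network model based on the first loss.
According to a fifth aspect of the present disclosure, an electronic device is provided, including: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, where the instructions, when executed by the at least one processing unit, cause the electronic device to perform the method according to the first aspect or the second aspect of the present disclosure.
According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided, including machine-executable instructions that, when executed by a device, cause the device to perform the method according to the first aspect or the second aspect of the present disclosure.
According to a seventh aspect of the present disclosure, a computer program product is provided, including machine-executable instructions that, when executed by a device, cause the device to perform the method according to the first aspect or the second aspect of the present disclosure.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. It is not intended to identify key or essential features of the present disclosure, nor is it intended to limit the scope of the present disclosure.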
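Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of exemplary embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic flowchart of a method for generating a soundtrack for text according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an exemplary plot division according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of the structure of a first neural network model for dividing plots according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a process of training the first neural network model according to an embodiment of the present disclosure;
FIG. 6 shows a schematic flowchart of a method for training the first neural network model according to an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of the structure of a second neural network model for determining plot categories according to an embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of a method for determining a plot category according to an embodiment of the present disclosure;
FIG. 9 shows a schematic flowchart of a method for selecting music for a plot according to an embodiment of the present disclosure;
FIG. 10 shows a schematic block diagram of an apparatus for generating a soundtrack for text according to an embodiment of the present disclosure;
FIG. 11 shows a schematic block diagram of an apparatus for training a neural network model according to an embodiment of the present disclosure;
FIG. 12 shows a schematic block diagram of an example device that can be used to implement embodiments of the present disclosure.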
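Detailed Description
It is to be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and related provisions.
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided to make the present disclosure more thorough and complete and to fully convey its scope to those skilled in the art.
As used herein, the term "include" and its variants denote open-ended inclusion, that is, "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "an example embodiment" and "an embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
In audiobooks, background music can enhance the listener's sense of immersion and help the listener better understand the storyline. Traditionally, background music suited to the plot has had to be selected manually, which is time-consuming, laborious, and costly. In view of this, embodiments of the present disclosure provide a solution for automatically selecting background music based on text. According to this solution, the text is first divided into several plot units based on the semantics of the paragraphs it contains. Next, the plot category of each plot unit is determined. In some embodiments, the plot category may reflect the emotional information conveyed by the plot unit. Then, music matching the plot unit is determined based on the determined plot category. In this way, the extent and category of each plot in the text can be determined automatically and accurately, and matching background music can be selected for each plot, thereby improving the quality of audiobooks.
Implementation details of embodiments of the present disclosure are described below with reference to FIGS. 1 to 12.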
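FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented.
Text 101 may include content obtained from an e-book, such as a novel or a work of another genre. For example, text 101 includes several chapters of an e-book, each chapter may include several paragraphs, and each paragraph includes characters and punctuation marks in any language. To generate an audiobook, text 101 may be input to a text-to-speech (TTS) system 120 to generate speech corresponding to text 101. Any known or future-developed text-to-speech technology (for example, a neural network model) may be used to generate the speech. The speech obtained from text-to-speech system 120 corresponds to the characters in text 101 and does not include any background music. A listener who hears only the speech converted from text 101 therefore lacks a sense of immersion, and the result is unsatisfactory.
Text 101 may also be provided to a soundtrack system 110. Soundtrack system 110 may be implemented on a single device or on a cluster of multiple devices, for example on a cloud-based server as a cloud service that generates background music from text. Soundtrack system 110 is used to generate background music for text 101. As described above, text 101 may include several chapters, and each chapter in turn contains several plots. It should be understood that different plots may convey different emotional information, such as tension, warmth, or threat, so a suitable type of music needs to be selected to match each of them.
To this end, soundtrack system 110 is designed to include a plot division module 112, a plot classification module 114, and a music determination module 116. Plot division module 112 divides text 101 into several plot units, using the paragraphs of text 101 as the division granularity (herein, "plot unit" and "plot" have the same meaning and are used interchangeably). Plot classification module 114 determines a category for each resulting plot unit, the category reflecting the emotional information conveyed by the plot. Music determination module 116 determines music matching each plot unit according to its category, for example by selecting a piece of music with the same emotional information from a music library or by generating such a piece of music.
In some embodiments, plot division module 112 and plot classification module 114 may each use a neural network model to automatically divide the text and determine the plot categories. This is explained in detail below with reference to FIGS. 2 to 8 and is not elaborated here.
Next, the determined music is provided to a synthesis module 130 as background music. Synthesis module 130 combines the background music with the speech from text-to-speech system 120 to generate an audiobook 140.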
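The data flow through modules 112, 114, and 116 can be sketched as a three-stage pipeline. This is a minimal illustration only: the function `generate_soundtrack` and the `toy_*` stand-ins (including their keyword rule and file names) are hypothetical placeholders for the neural models described later, not part of the disclosure.

```python
from typing import Callable, List, Tuple

Paragraphs = List[str]


def generate_soundtrack(
    paragraphs: Paragraphs,
    divide: Callable[[Paragraphs], List[Paragraphs]],
    classify: Callable[[Paragraphs], str],
    pick_music: Callable[[str], str],
) -> List[Tuple[Paragraphs, str, str]]:
    """Run the three stages in order: divide, classify, match music."""
    result = []
    for unit in divide(paragraphs):       # role of plot division module 112
        category = classify(unit)         # role of plot classification module 114
        music = pick_music(category)      # role of music determination module 116
        result.append((unit, category, music))
    return result


# Hypothetical stand-ins so the sketch runs end to end.
def toy_divide(paragraphs: Paragraphs) -> List[Paragraphs]:
    mid = len(paragraphs) // 2
    return [paragraphs[:mid], paragraphs[mid:]] if mid else [paragraphs]


def toy_classify(unit: Paragraphs) -> str:
    return "sad" if any("tears" in p for p in unit) else "warm"


def toy_pick_music(category: str) -> str:
    return {"sad": "slow_piano.mp3", "warm": "soft_guitar.mp3"}[category]


plan = generate_soundtrack(
    ["The sun rose.", "They laughed.", "Then came tears.", "She left."],
    toy_divide, toy_classify, toy_pick_music,
)
```

Passing the stages in as callables keeps the orchestration independent of how each stage is implemented, which mirrors the modular layout of FIG. 1.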
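An exemplary environment in which embodiments of the present disclosure can be implemented has been described above with reference to FIG. 1. It should be understood that FIG. 1 is merely schematic: the environment may include more modules or systems, some modules or systems may be omitted, or the illustrated modules or systems may be recombined. Embodiments of the present disclosure may be implemented in environments different from that shown in FIG. 1, and the present disclosure is not limited in this respect.
FIG. 2 shows a schematic flowchart of a method 200 for generating a soundtrack for text according to an embodiment of the present disclosure. Method 200 may be implemented, for example, by soundtrack system 110 shown in FIG. 1. It should be understood that method 200 may also include additional actions not shown and/or may omit actions shown, and the scope of the present disclosure is not limited in this respect. Method 200 is described in detail below with reference to FIG. 1.
At block 210, text 101 is divided into at least one plot unit based on the semantics of multiple paragraphs of text 101. As described above, text 101 may include several chapters of an e-book, and each chapter consists of several paragraphs. For example, text 101 may be one chapter of an e-book that includes multiple paragraphs. Herein, dividing text into plot units means dividing the text, with the paragraph as the smallest unit, into contiguous text subsets, each of which includes at least one paragraph and carries the same emotional information.
In some embodiments, a neural network model may be used to determine the paragraphs of text 101 that are at plot boundaries, and text 101 may then be divided into at least one plot unit based on those paragraphs.
FIG. 3 shows a schematic diagram of an exemplary plot division according to an embodiment of the present disclosure. In FIG. 3, text 101 is shown schematically as comprising paragraph 1 through paragraph n, where n is an integer of any suitable size. The neural network model can generate a label for each of paragraphs 1 through n, the label indicating whether the corresponding paragraph is at a plot boundary. As shown in FIG. 3, paragraph k (k being an integer less than n) has the label [SEP] 301, paragraph k+1 has the label [SEP] 302, and the other paragraphs between paragraph 2 and paragraph k and between paragraph k+1 and paragraph n have the labels [NON] 303 and 304. Here, the label [SEP] indicates that a paragraph is at plot boundary 305, whereas the label [NON] indicates that a paragraph lies within a single plot. It should be understood that consecutive [SEP] labels indicate a possible plot boundary 305. In the exemplary division shown in FIG. 3, text 101 is divided into plot 1 and plot 2, where plot 1 comprises paragraphs 1 through k of text 101 and plot 2 comprises paragraphs k+1 through n. Note that the plot division of FIG. 3 is merely illustrative: text 101 may be divided into any number of plots, and each plot may contain any number of paragraphs.
The neural network model used for plot division may be a trained binary classification model that, for example, decides whether the label of a paragraph is [NON] or [SEP].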
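Turning per-paragraph labels into plot units, as in FIG. 3, can be sketched as follows. The sketch assumes one reading of the figure: a boundary is placed between two adjacent paragraphs when both carry [SEP]. The function name and the plain-string label encoding are hypothetical.

```python
from typing import List


def split_by_boundary_labels(paragraphs: List[str], labels: List[str]) -> List[List[str]]:
    """Cut the paragraph sequence wherever two adjacent paragraphs are both 'SEP'."""
    if len(paragraphs) != len(labels):
        raise ValueError("paragraphs and labels must have the same length")
    units: List[List[str]] = []
    current: List[str] = []
    for i, paragraph in enumerate(paragraphs):
        current.append(paragraph)
        has_next = i + 1 < len(paragraphs)
        # Assumed rule: paragraph i closes a plot when it and its successor are both SEP.
        if has_next and labels[i] == "SEP" and labels[i + 1] == "SEP":
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


paragraphs = ["p1", "p2", "p3", "p4", "p5", "p6"]
labels = ["SEP", "NON", "SEP", "SEP", "NON", "SEP"]
units = split_by_boundary_labels(paragraphs, labels)
```

With these labels the cut falls between `p3` and `p4`, matching the case in FIG. 3 where paragraphs k and k+1 are both labeled [SEP].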
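FIG. 4 shows a schematic diagram of the structure of a first neural network model 400 according to an embodiment of the present disclosure. First neural network model 400 is used to divide text 101 into plot units.
First neural network model 400 includes a first semantic network 402, which is used to generate a semantic representation of each paragraph in text 101. In some embodiments, first semantic network 402 may be a pre-trained BERT model. First semantic network 402 receives the character sequence $p_{ij}$ of the text, where $p_{ij}$ denotes the j-th paragraph of the i-th chapter of text 101, and generates the semantic representation $e_{ij}$ of that paragraph. Note that all paragraphs of text 101 are input to first semantic network 402 together, and a semantic representation is generated for each of them.
Where first semantic network 402 is a BERT model (for example, with 12 layers), the generated semantic representation $e_{ij}$ may be the CLS token that the BERT model produces for the paragraph, for example a 768-dimensional vector.
First neural network model 400 further includes a recurrent neural network 404 located after first semantic network 402. In some embodiments, recurrent neural network 404 may be, for example, a bidirectional gated recurrent unit (BiGRU) model.
Recurrent neural network 404 extracts sequential dependency information between adjacent paragraphs of text 101 to enhance the semantic representation $e_{ij}$ and generates the hidden state representation $h_{ij}$ of the corresponding paragraph. For example, the hidden state representation $h_{ij}$ of paragraph $p_{ij}$ can be generated from its semantic representation $e_{ij}$ and from the semantic representations $e_{i,j-1}$ and $e_{i,j+1}$ of the neighboring paragraphs $p_{i,j-1}$ and $p_{i,j+1}$. The form of $h_{ij}$ depends on the specific implementation of recurrent neural network 404. For example, where recurrent neural network 404 is a 512-unit BiGRU, $h_{ij}$ may be, for example, a 512-dimensional vector. After recurrent neural network 404, hidden state representations have likewise been generated for all paragraphs.
In some embodiments, the division category of a paragraph, that is, whether it is at a plot boundary, may be determined based on its hidden state representation. In other words, the hidden state representation of a paragraph may be provided to a feedforward network 409 of the first neural network model to generate the output vector $o_{ij}$ of the paragraph and thereby obtain a division label 410. Here, feedforward network 409 may be, for example, a fully connected layer.
In some embodiments, to achieve higher division accuracy, first neural network model 400 may further include a convolutional network 406 and a similarity network 408 located after recurrent neural network 404.
Convolutional network 406 is used to extract the coherence between adjacent paragraphs. In some embodiments, convolutional network 406 may include a one-dimensional convolutional layer with a kernel size of 3 and a stride of 1; that is, a convolution over three adjacent hidden state representations $h_{i,j-1}$, $h_{ij}$, $h_{i,j+1}$ yields the convolution result $c_{ij}$ of paragraph $p_{ij}$. It should be understood that this example of convolutional network 406 is merely illustrative, and the present disclosure does not limit its specific implementation.
Similarity network 408 can mine the similarity information $\mathrm{Sim}_{ij}$ between adjacent paragraphs from the convolution results $c_{ij}$, to serve as an additional hidden state representation.
Thus, using convolutional network 406 and similarity network 408, the additional hidden state representation $\mathrm{Sim}_{ij}$ can be generated from the hidden state representation $h_{ij}$, and the two can be combined as the input $o'_{ij}$ of feedforward network 409, as shown in equations (1) and (2):

$o_{ij} = FF_f(o'_{ij})$         (2)
where $o'_{ij}$ denotes the input of feedforward network 409; $o_{ij}$ denotes the output of feedforward network 409 (also called the plot division representation), which represents the probability that the corresponding paragraph is at a plot boundary; $\mathrm{sim}()$ is a similarity function, such as cosine similarity; $T$ denotes the transpose operation; $FF_s()$ and $FF_f()$ denote fully connected layers; and the concatenation operator in equation (1) denotes vector concatenation.
Thus, using first neural network model 400, a plot division label can be generated for each paragraph of text 101, so that the text is divided into several plots.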
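The convolution, similarity, and feedforward steps can be illustrated with a small pure-Python sketch. Equation (1) is not legible in this text, so the exact combination is an assumption: here $\mathrm{Sim}_{ij}$ is taken as the cosine similarity between $c_{ij}$ and $c_{i,j+1}$, $o'_{ij}$ is $h_{ij}$ concatenated with that scalar, and $FF_f$ is a single linear layer followed by a sigmoid. The fixed kernel weights and all function names are hypothetical; a trained model would learn its parameters.

```python
import math
from typing import List, Sequence

Vector = List[float]


def conv1d_k3(hidden: List[Vector], kernel: Sequence[float] = (0.25, 0.5, 0.25)) -> List[Vector]:
    """Kernel-size-3, stride-1 convolution along the paragraph axis, zero padded."""
    dim = len(hidden[0])
    padded = [[0.0] * dim] + hidden + [[0.0] * dim]
    return [
        [sum(k * padded[j + t][d] for t, k in enumerate(kernel)) for d in range(dim)]
        for j in range(len(hidden))
    ]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def boundary_probabilities(hidden: List[Vector], weights: Vector, bias: float) -> List[float]:
    """Assumed pipeline: conv -> adjacent cosine similarity -> concat with h -> sigmoid."""
    conv = conv1d_k3(hidden)
    probs = []
    for j, h in enumerate(hidden):
        nxt = conv[j + 1] if j + 1 < len(conv) else conv[j]  # last paragraph compares with itself
        features = h + [cosine(conv[j], nxt)]                # vector concatenation
        logit = sum(w * x for w, x in zip(weights, features)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Two paragraphs about one topic followed by two about another.
hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Hypothetical weights that ignore h and penalize high similarity to the next paragraph.
probs = boundary_probabilities(hidden, weights=[0.0, 0.0, -4.0], bias=2.0)
```

In this toy input the similarity to the following paragraph is lowest at the second paragraph, so it receives the highest boundary probability.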
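It should be understood that first neural network model 400 needs to be trained before it is used for plot division inference. Embodiments of the present disclosure also provide an effective training method, which uses the plot information carried by the hidden state representations of the paragraphs to construct the training objective of first neural network model 400.
This is explained below with reference to FIGS. 5 and 6.
FIG. 5 shows a schematic diagram of a process of training the neural network model according to an embodiment of the present disclosure. As shown, during training, the hidden state representations $h_{ij}$ generated by recurrent neural network 404 and the plot division representations $o_{ij}$ generated by the feedforward network are provided to a multi-task training module 501. Multi-task training module 501 may be implemented in soundtrack system 110 of FIG. 1 or at another device separate from soundtrack system 110.
Multi-task training module 501 constructs the loss function of first neural network model 400 based on the hidden state representations $h_{ij}$ and the plot division representations $o_{ij}$.
First, a division loss based on the deviation of the plot division result is constructed. For each paragraph $p_{ij}$, the label is $y_{ij} = 1$ if the paragraph is [SEP] and $y_{ij} = 0$ otherwise, and the output probability for $y_{ij}$ is assumed to be $o_{ij}$. The division loss is given by formula (3).
Then, a loss based on plot categories is constructed, obtained according to formulas (4) and (5):
$p_{ij} = \mathrm{softmax}(FF_c(h_{ij}))$       (4)
Here the category set comprises the plot classification categories; if the plot category of paragraph $p_{ij}$ is $c$, its plot category label is $l_{ijc} = 1$, otherwise $l_{ijc} = 0$; and $p_{ijc}$ is the probability that the paragraph is predicted as plot category $c$.
Multi-task training module 501 thus constructs the loss function of first neural network model 400 as given by formula (6),
where λ is a hyperparameter.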
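The bodies of formulas (3), (5), and (6) are not legible in this text. Given the definitions above (binary labels $y_{ij}$ with predicted probabilities $o_{ij}$, one-hot category labels $l_{ijc}$ with predicted probabilities $p_{ijc}$, and a single hyperparameter λ), one conventional reading is a binary cross-entropy, a categorical cross-entropy, and a weighted sum. The symbols $\mathcal{L}_{\mathrm{seg}}$, $\mathcal{L}_{\mathrm{cls}}$, and $\mathcal{C}$, the absence of any normalization, and the placement of λ on the category term are all assumptions, not the patent's stated formulas.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{seg}} &= -\sum_{i,j}\Bigl[\, y_{ij}\log o_{ij} + (1-y_{ij})\log\bigl(1-o_{ij}\bigr) \Bigr]
  && \text{(assumed form of (3))} \\
\mathcal{L}_{\mathrm{cls}} &= -\sum_{i,j}\sum_{c\in\mathcal{C}} l_{ijc}\,\log p_{ijc}
  && \text{(assumed form of (5))} \\
\mathcal{L} &= \mathcal{L}_{\mathrm{seg}} + \lambda\,\mathcal{L}_{\mathrm{cls}}
  && \text{(assumed form of (6))}
\end{aligned}
```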
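FIG. 6 shows a schematic flowchart of a method 600 for training the neural network model according to an embodiment of the present disclosure. Method 600 may be implemented, for example, by multi-task training module 501 shown in FIG. 5. It should be understood that method 600 may also include additional actions not shown and/or may omit actions shown, and the scope of the present disclosure is not limited in this respect. Method 600 is described in detail below with reference to FIG. 5.
At block 610, first neural network model 400 is used to generate hidden state representations of multiple paragraphs in a training data set. The training data set includes text composed of multiple paragraphs, and each paragraph may have a corresponding label indicating its plot category.
In some embodiments, plot category labels may be added to the paragraphs of an e-book manually or by other means. A label may indicate a category such as warm, happy, romantic, exciting, threatening, sad, hurt, misunderstanding, conflict, positive, negative, or neutral. Consecutive paragraphs with the same plot category label can be regarded as one plot unit, so a plot boundary is formed wherever the plot or the chapter changes.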
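Deriving boundary labels from per-paragraph category labels can be sketched as follows. The sketch assumes that both paragraphs adjacent to a category change receive [SEP], in line with FIG. 3, and it ignores chapter edges; the function name is hypothetical.

```python
from typing import List


def boundary_labels_from_categories(categories: List[str]) -> List[str]:
    """Mark both neighbors of every category change as 'SEP'; everything else is 'NON'."""
    labels = ["NON"] * len(categories)
    for j in range(len(categories) - 1):
        if categories[j] != categories[j + 1]:
            labels[j] = "SEP"
            labels[j + 1] = "SEP"
    return labels


derived = boundary_labels_from_categories(["warm", "warm", "sad", "sad", "sad"])
```

A single annotation pass therefore yields both label sets needed for multi-task training: the category labels directly and the boundary labels by derivation.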
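In some embodiments, first neural network model 400 may include first semantic network 402 and recurrent neural network 404. The character representations of the paragraphs of the text are input to first semantic network 402 to obtain the semantic representation of each paragraph. The semantic representations may then be input to recurrent neural network 404 to generate the hidden state representation of each paragraph. This is similar to the process described with reference to FIG. 4.
At block 620, a first loss is determined based on the hidden state representations and the labels. Here, the first loss may be the plot-category-based loss described with reference to FIG. 5, which can be obtained according to formulas (4) and (5) and is not described again here.
At block 630, the parameters of the first neural network are updated based on the first loss. The parameters may be updated iteratively by gradient descent.
In some embodiments, first neural network model 400 may further include convolutional network 406 and similarity network 408. Method 600 may further include generating, based on convolutional network 406 and similarity network 408, the plot division representation of each paragraph from the hidden state representations of the paragraphs. A second loss may be determined based on the plot division representations and the labels of the paragraphs. Here, the second loss may be the division loss described with reference to FIG. 5, which can be calculated by formula (3). It should be understood that, because the plot category of each paragraph is known from its plot category label, the boundaries between plot units, that is, the plot division labels [SEP] or [NON], are known as well.
Then, the parameters of the first neural network model may be updated based on the first loss and the second loss. For example, the first loss and the second loss are combined according to formula (6), and the parameters of the first neural network are updated iteratively by gradient descent.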
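How the two losses might be computed and combined can be sketched in plain Python. Formula (4) is implemented as a softmax over class logits; the cross-entropy forms of the two losses and the placement of λ on the category term are assumptions, because formulas (3), (5), and (6) are not legible in this text. All function names are hypothetical, and the gradient-descent update itself is not shown.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Formula (4): turn class logits FF_c(h_ij) into category probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def division_loss(y: List[int], o: List[float], eps: float = 1e-12) -> float:
    """Assumed binary cross-entropy between boundary labels y and probabilities o."""
    return -sum(
        yi * math.log(max(oi, eps)) + (1 - yi) * math.log(max(1.0 - oi, eps))
        for yi, oi in zip(y, o)
    )


def category_loss(true_index: List[int], probs: List[List[float]], eps: float = 1e-12) -> float:
    """Assumed categorical cross-entropy; true_index[j] is the labeled category of paragraph j."""
    return -sum(math.log(max(p[c], eps)) for c, p in zip(true_index, probs))


def total_loss(seg: float, cls: float, lam: float) -> float:
    """Assumed combination: division loss plus lambda times category loss."""
    return seg + lam * cls


seg = division_loss([1, 0], [0.5, 0.5])
cls = category_loss([0], [softmax([0.0, 0.0])])
loss = total_loss(seg, cls, lam=0.5)
```

Sharing the hidden states between both losses is what lets the category supervision shape the representation used for boundary prediction.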
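Note that although plot category information is used during training, the predicted plot categories of individual paragraphs are not used when the neural network model predicts the plot division. This is because a single paragraph contains relatively little plot category information, which may lead to errors in plot category prediction.
Continuing with FIG. 2, at block 220, the plot category of the at least one plot unit is determined. As noted above, a single paragraph carries little plot information, so plot classification is performed on the plot unit as a whole. In some embodiments, another neural network model is used to determine the category of a plot unit.
FIG. 7 shows a schematic diagram of the structure of a second neural network model 700 for determining plot categories according to an embodiment of the present disclosure.
Second neural network model 700 includes a second semantic network 702 and a self-attention network 704. Second semantic network 702 is used to generate semantic representations of text content. In some embodiments, second semantic network 702 may be a BERT model and may be a copy of first semantic network 402 of the trained first neural network model 400.
As shown, the text contents S1, S2, ..., St of a divided plot are provided to second semantic network 702, which generates the corresponding semantic representations U1, U2, ..., Ut. For ease of understanding, FIG. 7 shows multiple second semantic networks, but this is for illustration only; second neural network model 700 may include a single second semantic network 702.
Self-attention network 704 may include a multi-head attention layer. With self-attention network 704, second neural network model 700 can determine the plot category based on the more important text content.
Second neural network model 700 may be trained with a cross-entropy-based loss function. The paragraphs or plot units in the training set for second neural network model 700 may have plot category labels such as warm, happy, romantic, exciting, threatening, sad, hurt, misunderstanding, conflict, positive, negative, and neutral.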
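FIG. 8 shows a schematic flowchart of a method 800 for determining a plot category according to an embodiment of the present disclosure. Method 800 may be implemented, for example, by soundtrack system 110 shown in FIG. 1. It should be understood that method 800 may also include additional actions not shown and/or may omit actions shown, and the scope of the present disclosure is not limited in this respect. Method 800 is described below with reference to FIG. 7.
Method 800 is used to determine the plot category of a first plot unit among the at least one plot unit that has been determined. Method 800 may also be used to determine the plot categories of the other plot units.
At block 810, the first plot unit is divided into multiple paragraph groups. To capture the overall semantic information of the first plot unit rather than that of individual paragraphs, the paragraphs in the first plot unit can be combined in order to obtain multiple paragraph groups. Purely as an example, if the first plot unit includes twenty paragraphs, paragraphs 1 to 5 can be combined into paragraph group S1, paragraphs 6 to 10 into paragraph group S2, and so on. Embodiments of the present disclosure do not limit the number of paragraph groups in a plot unit or the number of paragraphs in each group.
In some embodiments, a plot unit may be split into paragraph groups at random. For example, the first plot unit is first treated as a whole and randomly split into two paragraph groups. The longer of the two groups is then randomly split into two smaller groups, and so on, until the number of paragraph groups in the plot unit reaches a preset number, for example 8 or any other number.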
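The random splitting procedure can be sketched as follows. The sketch generalizes "split the longer group" to "always split the currently longest group", which is an assumption, as are the function name, the `seed` parameter, and the early stop when no group has more than one paragraph.

```python
import random
from typing import List, Optional


def random_paragraph_groups(
    paragraphs: List[str], target_groups: int = 8, seed: Optional[int] = None
) -> List[List[str]]:
    """Repeatedly split the longest group at a random cut point, preserving paragraph order."""
    rng = random.Random(seed)
    groups: List[List[str]] = [list(paragraphs)]
    while len(groups) < target_groups:
        idx = max(range(len(groups)), key=lambda i: len(groups[i]))
        longest = groups[idx]
        if len(longest) < 2:          # nothing left that can be split
            break
        cut = rng.randint(1, len(longest) - 1)   # both sides stay non-empty
        groups[idx:idx + 1] = [longest[:cut], longest[cut:]]
    return groups


unit = [f"paragraph {i}" for i in range(1, 21)]
groups = random_paragraph_groups(unit, target_groups=8, seed=0)
```

Because each split replaces a group in place, concatenating the groups always reproduces the original paragraph order.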
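At block 820, a group semantic representation is generated for each paragraph group based on second semantic network 702. In some embodiments, the character sequences of the paragraphs in a paragraph group are concatenated in order and input to second semantic network 702. Where second semantic network 702 is a BERT model, the resulting group semantic representation may be, for example, a 768-dimensional vector. The CLS token output by the BERT model, which represents the overall semantic information of the paragraph group, can be used as the group semantic representation. If the concatenated character sequence is too long, its leading portion can be used as the input to second semantic network 702.
At block 830, a plot category representation of the first plot unit is generated from the group semantic representations based on the self-attention network, to determine the first plot category. The plot category representation indicates the probability that the plot unit belongs to each plot category. The plot category with the highest probability can be determined as the plot category of the first plot unit.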
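Block 830 can be illustrated with a small pure-Python sketch. It substitutes a single-head scaled dot-product self-attention with identity projections for the multi-head layer, followed by mean pooling and a linear classifier; these simplifications, the toy two-dimensional group vectors, and all names are assumptions made for illustration, not the model of FIG. 7.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(vectors[0]))
    attended = []
    for query in vectors:
        weights = softmax([sum(a * b for a, b in zip(query, key)) / scale for key in vectors])
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))]
        )
    return attended


def classify_plot_unit(
    group_vectors: List[Vector], class_weights: List[Vector], categories: List[str]
) -> Tuple[str, List[float]]:
    """Attend over group representations, mean-pool, score each category, take the argmax."""
    attended = self_attention(group_vectors)
    pooled = [sum(column) / len(attended) for column in zip(*attended)]
    probs = softmax([sum(w * x for w, x in zip(row, pooled)) for row in class_weights])
    best = max(range(len(categories)), key=lambda i: probs[i])
    return categories[best], probs


# Toy 2-dimensional group semantic representations (a BERT CLS vector would have 768 dimensions).
label, probs = classify_plot_unit(
    group_vectors=[[2.0, 0.0], [1.5, 0.5], [0.0, 1.0]],
    class_weights=[[1.0, 0.0], [0.0, 1.0]],
    categories=["warm", "sad"],
)
```

The attention weights let groups with stronger mutual agreement dominate the pooled representation, which is the sense in which the category is decided by the more important text content.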
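With method 800 described above, the plot category of each plot unit in text 101 can be determined.
Continuing with FIG. 2, at block 230, music matching the at least one plot unit is determined based on the plot category. The music may be selected from an existing music library in which each piece of music has associated tag information. If the tag information of a piece of music matches the plot category of the plot unit, for example because their semantic similarity is high, the music can be considered to match the current plot unit. Alternatively, music of a similar style may be generated based on the plot category.
In some embodiments, matching music is selected from the music library based on the plot category and the length of the speech corresponding to the plot unit. The speech may be generated by text-to-speech system 120.
FIG. 9 shows a schematic flowchart of a method 900 for selecting music for a plot according to an embodiment of the present disclosure.
At block 902, it is determined whether the current plot is a long plot. For example, if the number of characters in the plot exceeds a threshold number (for example, 200 characters), the current plot can be determined to be a long plot. If it is not a long plot, method 900 proceeds to block 904 and selects music that matches the plot category.
If it is a long plot, method 900 proceeds to block 906 and determines whether the plot contains more dialogue than narration. If so, at block 908, the dialogue portion is determined to be the content to be scored. Otherwise, the method proceeds to block 910, where the narrative portion is determined to be the content to be scored.
Next, at block 912, it is determined whether the length of the content to be scored is greater than a threshold (for example, 500 characters). If it is greater than the threshold, method 900 proceeds to block 914, where multiple matching pieces of music are selected and spliced together. If not, method 900 proceeds to block 916, where music that matches the plot is selected.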
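The decision flow of method 900 can be sketched as follows. The thresholds are the example values from the text; the quotation-mark test for dialogue, the comparison by character count, the dictionary-shaped music library, and returning every matching track for splicing are hypothetical choices, since the source does not specify them.

```python
from typing import Dict, List

LONG_PLOT_THRESHOLD = 200   # example threshold from block 902, in characters
SPLICE_THRESHOLD = 500      # example threshold from block 912, in characters


def is_dialogue(paragraph: str) -> bool:
    """Hypothetical heuristic: a paragraph containing a quotation mark counts as dialogue."""
    return any(mark in paragraph for mark in ('"', "\u201c", "\u300c"))


def plan_music(paragraphs: List[str], category: str, library: Dict[str, List[str]]) -> dict:
    tracks = library.get(category, [])
    if not tracks:
        raise LookupError(f"no music tagged {category!r}")
    if sum(len(p) for p in paragraphs) <= LONG_PLOT_THRESHOLD:      # block 902 -> 904
        return {"content": "all", "tracks": tracks[:1]}
    dialogue_len = sum(len(p) for p in paragraphs if is_dialogue(p))
    narrative_len = sum(len(p) for p in paragraphs if not is_dialogue(p))
    if dialogue_len > narrative_len:                                 # block 906 -> 908
        content, length = "dialogue", dialogue_len
    else:                                                            # block 906 -> 910
        content, length = "narrative", narrative_len
    if length > SPLICE_THRESHOLD:                                    # block 912 -> 914
        return {"content": content, "tracks": list(tracks)}          # tracks to be spliced
    return {"content": content, "tracks": tracks[:1]}                # block 916


library = {"sad": ["a.mp3", "b.mp3"]}
short_plan = plan_music(["x" * 50], "sad", library)
long_plan = plan_music(["n" * 600], "sad", library)
talk_plan = plan_music(['"' + "d" * 299, "n" * 100], "sad", library)
```

The returned dictionary only records which content is scored and which tracks are chosen; aligning or splicing audio to the speech length is outside this sketch.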
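The method or process for generating a soundtrack for text according to embodiments of the present disclosure has been described above with reference to FIGS. 1 to 9. Compared with existing solutions, embodiments of the present disclosure can determine the plots in text automatically and accurately and select matching background music for them, thereby improving the quality of audiobooks. In some embodiments, a neural network model performs plot division based on semantic information, using a sequence labeling approach. Embodiments of the present disclosure also provide an effective training method for this neural network model. In some embodiments, semantic information and an attention mechanism are also used to achieve accurate plot classification. Compared with traditional solutions, embodiments of the present disclosure save a large amount of manual work in adding background music to audiobooks and achieve good performance.
FIG. 10 shows a schematic block diagram of an apparatus 1000 for generating a soundtrack for text according to an embodiment of the present disclosure. Apparatus 1000 may be arranged at soundtrack system 110. Accordingly, apparatus 1000 is implemented, for example, by a computing device or a cluster of devices implementing soundtrack system 110.
As shown, apparatus 1000 includes a plot division module 1010, a plot classification module 1020, and a music determination module 1030.
Plot division module 1010 is configured to divide the text into at least one plot unit based on the semantics of multiple paragraphs of the text. Plot classification module 1020 is configured to determine a plot category of the at least one plot unit. Music determination module 1030 is configured to determine, based on the plot category, music matching the at least one plot unit.
In some embodiments, plot division module 1010 is further configured to use the first neural network model to determine which of the multiple paragraphs are at plot boundaries and to divide the text into at least one plot unit based on the paragraphs determined to be at plot boundaries.
In some embodiments, the first neural network includes a first semantic network and a recurrent neural network, and plot division module 1010 is further configured to: generate respective semantic representations of the multiple paragraphs based on the first semantic network; generate respective hidden state representations of the multiple paragraphs from their semantic representations based on the recurrent neural network; and determine, based on the hidden state representations, the division category of each paragraph, that is, whether it is at a plot boundary.
In some embodiments, the first neural network further includes a convolutional network and a similarity network, and plot division module 1010 is further configured to: generate additional hidden state representations from the hidden state representations of the multiple paragraphs based on the convolutional network and the similarity network; and generate, based on the hidden state representations and the additional hidden state representations, respective plot division representations of the multiple paragraphs to determine the division categories.
In some embodiments, the at least one plot unit includes a first plot unit, and plot classification module 1020 is further configured to use a second neural network model to determine a first plot category of the first plot unit.
In some embodiments, the second neural network model includes a second semantic network and a self-attention network, and plot classification module 1020 is further configured to: divide the first plot unit into multiple paragraph groups; generate respective group semantic representations of the paragraph groups based on the second semantic network; and generate, based on the self-attention network, a plot category representation of the first plot unit from the group semantic representations to determine the first plot category.
In some embodiments, the music determination module is further configured to select matching music from a music library based on the plot category and the length of the speech corresponding to the at least one plot unit.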
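FIG. 11 shows a schematic block diagram of an apparatus 1100 for training a neural network model according to an embodiment of the present disclosure. Apparatus 1100 may be arranged at soundtrack system 110. Accordingly, apparatus 1100 is implemented, for example, by a computing device or a cluster of devices implementing soundtrack system 110.
As shown, apparatus 1100 includes a representation generation module 1110 configured to use the first neural network model to generate hidden state representations of multiple paragraphs in a training data set. The paragraphs in the training data set have labels indicating the plot category of the corresponding paragraph.
Apparatus 1100 further includes a loss calculation module 1120, which is configured to determine a first loss based on the labels and the hidden state representations.
Apparatus 1100 further includes a parameter update module 1130, which is configured to update parameters of the first neural network model based on the first loss.
In some embodiments, the first neural network model may include a first semantic network and a recurrent neural network. The first semantic network may be, for example, a BERT model, and the recurrent neural network may be, for example, a bidirectional GRU model.
Representation generation module 1110 may be further configured to determine respective semantic representations of the multiple paragraphs based on the first semantic network and to generate respective hidden state representations of the multiple paragraphs from their semantic representations based on the recurrent neural network.
In some embodiments, the first neural network may further include a convolutional network and a similarity network. The convolutional network may follow the recurrent neural network and may be, for example, a one-dimensional convolutional network. The similarity network may follow the convolutional network and include a similarity calculation layer for calculating the similarity of adjacent paragraphs.
Representation generation module 1110 may be further configured to generate, based on the convolutional network and the similarity network, respective plot division representations of the multiple paragraphs from their hidden state representations.
Loss calculation module 1120 may be further configured to determine a second loss based on the plot division representations and the labels. Parameter update module 1130 may be further configured to update the parameters of the first neural network model based on the first loss and the second loss.
In some embodiments, the labels in the training data set may indicate that the corresponding paragraph has one of the following plot categories: warm, happy, romantic, exciting, threatening, sad, hurt, misunderstanding, conflict, positive, negative, neutral.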
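FIG. 12 shows a schematic block diagram of an example device 1200 that can be used to implement embodiments of the present disclosure. For example, a backup system and/or a recovery system according to embodiments of the present disclosure may be implemented by device 1200. As shown, device 1200 includes a central processing unit (CPU) 1201, which can perform various appropriate actions and processes in accordance with computer program instructions stored in read-only memory (ROM) 1202 or loaded from storage unit 1208 into random access memory (RAM) 1203. RAM 1203 may also store various programs and data required for the operation of device 1200. CPU 1201, ROM 1202, and RAM 1203 are connected to one another through bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
Multiple components in device 1200 are connected to I/O interface 1205, including: input unit 1206, such as a keyboard or mouse; output unit 1207, such as various types of displays and speakers; storage unit 1208, such as a magnetic disk or optical disk; and communication unit 1209, such as a network card, modem, or wireless communication transceiver. Communication unit 1209 allows device 1200 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
The processes and procedures described above, such as methods 200, 600, 800, and/or 900, may be performed by processing unit 1201. For example, in some embodiments, methods 200, 600, 800, and/or 900 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209. When the computer program is loaded into RAM 1203 and executed by CPU 1201, one or more actions of methods 200, 600, 800, and/or 900 described above may be performed.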
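The present disclosure may be a method, an apparatus, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for performing various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the above. A computer-readable storage medium as used herein is not to be construed as a transient signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for performing operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++ and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.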
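Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause a computer, programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures show the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.
Various embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.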

Claims (15)

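  1. A method for generating a soundtrack for text, comprising:
    dividing the text into at least one plot unit based on semantics of a plurality of paragraphs of the text;
    determining a plot category of the at least one plot unit; and
    determining, based on the plot category, music matching the at least one plot unit.
  2. The method according to claim 1, wherein dividing the text into at least one plot unit comprises:
    using a first neural network model to determine paragraphs, among the plurality of paragraphs, that are at plot boundaries; and
    dividing the text into at least one plot unit based on the paragraphs determined to be at plot boundaries.
  3. The method according to claim 2, wherein the first neural network comprises a first semantic network and a recurrent neural network, and determining the paragraphs among the plurality of paragraphs that are at plot boundaries comprises:
    generating respective semantic representations of the plurality of paragraphs based on the first semantic network;
    generating, based on the recurrent neural network, respective hidden state representations of the plurality of paragraphs from the semantic representations of the plurality of paragraphs; and
    determining, based on the hidden state representations of the plurality of paragraphs, division categories of the plurality of paragraphs as to whether they are at plot boundaries.
  4. The method according to claim 3, wherein the first neural network further comprises a convolutional network and a similarity network, and determining the division categories of the plurality of paragraphs as to whether they are at plot boundaries comprises:
    generating, based on the convolutional network and the similarity network, additional hidden state representations from the hidden state representations of the plurality of paragraphs; and
    generating, based on the hidden state representations and the additional hidden state representations of the plurality of paragraphs, respective plot division representations of the plurality of paragraphs to determine the division categories.
  5. The method according to claim 1, wherein the at least one plot unit comprises a first plot unit, and determining the plot category of the at least one plot unit comprises:
    using a second neural network model to determine a first plot category of the first plot unit.
  6. The method according to claim 5, wherein the second neural network model comprises a second semantic network and a self-attention network, and determining the first plot category of the first plot unit comprises:
    dividing the first plot unit into a plurality of paragraph groups;
    generating respective group semantic representations of the plurality of paragraph groups based on the second semantic network; and
    generating, based on the self-attention network, a plot category representation of the first plot unit from the group semantic representations of the plurality of paragraph groups to determine the first plot category.
  7. The method according to claim 1, wherein determining the music matching the at least one plot unit comprises:
    selecting matching music from a music library based on the plot category and a length of speech corresponding to the at least one plot unit.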
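  8. A method for training a first neural network model, comprising:
    using the first neural network model to generate hidden state representations of a plurality of paragraphs in a training data set, wherein the plurality of paragraphs in the training data set have respective labels, the labels indicating plot categories of the corresponding paragraphs;
    determining a first loss based on the hidden state representations and the labels; and
    updating parameters of the first neural network model based on the first loss.
  9. The method according to claim 8, wherein the first neural network model comprises a first semantic network and a recurrent neural network, the method comprising:
    determining respective semantic representations of the plurality of paragraphs based on the first semantic network; and
    generating, based on the recurrent neural network, respective hidden state representations of the plurality of paragraphs from the semantic representations of the plurality of paragraphs.
  10. The method according to claim 9, wherein the first neural network further comprises a convolutional network and a similarity network, the method further comprising:
    generating, based on the convolutional network and the similarity network, respective plot division representations of the plurality of paragraphs from the hidden state representations of the plurality of paragraphs;
    determining a second loss based on the plot division representations and the labels; and
    updating the parameters of the first neural network model based on the first loss and the second loss.
  11. The method according to claim 8, wherein a label indicates that the corresponding paragraph has one of the following plot categories: warm, happy, romantic, exciting, threatening, sad, hurt, misunderstanding, conflict, positive, negative, neutral.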
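  12. An apparatus for generating a soundtrack for text, comprising:
    a plot division module configured to divide the text into at least one plot unit based on semantics of a plurality of paragraphs of the text;
    a plot classification module configured to determine a plot category of the at least one plot unit; and
    a music determination module configured to determine, based on the plot category, music matching the at least one plot unit.
  13. An apparatus for training a first neural network model, comprising:
    a representation generation module configured to use the first neural network model to generate hidden state representations of a plurality of paragraphs in a training data set, wherein the plurality of paragraphs in the training data set have labels, the labels indicating plot categories of the corresponding paragraphs;
    a loss calculation module configured to determine a first loss based on the labels and the hidden state representations; and
    a parameter update module configured to update parameters of the first neural network model based on the first loss.
  14. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform the method according to any one of claims 1 to 11.
  15. A computer-readable storage medium comprising machine-executable instructions that, when executed by a device, cause the device to perform the method according to any one of claims 1 to 11.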
PCT/CN2023/098710 2022-06-17 2023-06-06 用于生成文本的配乐的方法、装置、电子设备和介质 WO2023241415A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210693446.1A CN115101032A (zh) 2022-06-17 2022-06-17 用于生成文本的配乐的方法、装置、电子设备和介质
CN202210693446.1 2022-06-17

Publications (1)

Publication Number Publication Date
WO2023241415A1 true WO2023241415A1 (zh) 2023-12-21

Family

ID=83291054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098710 WO2023241415A1 (zh) 2022-06-17 2023-06-06 用于生成文本的配乐的方法、装置、电子设备和介质

Country Status (2)

Country Link
CN (1) CN115101032A (zh)
WO (1) WO2023241415A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115101032A (zh) * 2022-06-17 2022-09-23 北京有竹居网络技术有限公司 用于生成文本的配乐的方法、装置、电子设备和介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169811A1 (en) * 2015-12-09 2017-06-15 Amazon Technologies, Inc. Text-to-speech processing systems and methods
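CN109726308A (zh) * 2018-12-27 2019-05-07 上海连尚网络科技有限公司 Method and device for generating background music for a novel
CN110502748A (zh) * 2019-07-19 2019-11-26 平安科技(深圳)有限公司 Text topic extraction method, apparatus, and computer-readable storage medium
CN110750996A (zh) * 2018-07-18 2020-02-04 广州阿里巴巴文学信息技术有限公司 Method and apparatus for generating multimedia information, and readable storage medium
CN111767740A (zh) * 2020-06-23 2020-10-13 北京字节跳动网络技术有限公司 Sound effect adding method and apparatus, storage medium, and electronic device
CN113722491A (zh) * 2021-09-08 2021-11-30 北京有竹居网络技术有限公司 Method and apparatus for determining a text plot type, readable medium, and electronic device
CN115101032A (zh) * 2022-06-17 2022-09-23 北京有竹居网络技术有限公司 Method, apparatus, electronic device, and medium for generating a soundtrack for text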

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120296637A1 (en) * 2011-05-20 2012-11-22 Smiley Edwin Lee Method and apparatus for calculating topical categorization of electronic documents in a collection
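CN107038154A (zh) * 2016-11-25 2017-08-11 阿里巴巴集团控股有限公司 Text emotion recognition method and apparatus
CN109543722A (zh) * 2018-11-05 2019-03-29 中山大学 Emotion trend prediction method based on a sentiment analysis model
CN109299290A (zh) * 2018-12-07 2019-02-01 广东小天才科技有限公司 Knowledge-graph-based soundtrack recommendation method and electronic device
CN111164601B (zh) * 2019-12-30 2023-07-18 深圳市优必选科技股份有限公司 Emotion recognition method, intelligent apparatus, and computer-readable storage medium
WO2021225550A1 (en) * 2020-05-06 2021-11-11 Iren Yaser Deniz Emotion recognition as feedback for reinforcement learning and as an indicator of the explanation need of users
CN111782576B (zh) * 2020-07-07 2021-10-15 北京字节跳动网络技术有限公司 Background music generation method and apparatus, readable medium, and electronic device
CN112560503B (zh) * 2021-02-19 2021-07-02 中国科学院自动化研究所 Semantic sentiment analysis method fusing deep features and a temporal model
CN113158684B (zh) * 2021-04-21 2022-09-27 清华大学深圳国际研究生院 Emotion analysis method, emotion reminder method, and emotion reminder control apparatus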

Also Published As

Publication number Publication date
CN115101032A (zh) 2022-09-23

Similar Documents

Publication Publication Date Title
Chen et al. Extending context window of large language models via positional interpolation
CN107783960B (zh) 用于抽取信息的方法、装置和设备
US11423233B2 (en) On-device projection neural networks for natural language understanding
CN111368996B (zh) 可传递自然语言表示的重新训练投影网络
US11816439B2 (en) Multi-turn dialogue response generation with template generation
US11948058B2 (en) Utilizing recurrent neural networks to recognize and extract open intent from text inputs
US11271876B2 (en) Utilizing a graph neural network to identify supporting text phrases and generate digital query responses
US11657802B2 (en) Utilizing a dynamic memory network for state tracking
US20170351663A1 (en) Iterative alternating neural attention for machine reading
CN111522958A (zh) 文本分类方法和装置
US20210133279A1 (en) Utilizing a neural network to generate label distributions for text emphasis selection
CN110990555B (zh) 端到端检索式对话方法与系统及计算机设备
US20210056169A1 (en) Example based entity extraction, slot filling and value recommendation
CN113076739A (zh) 一种实现跨领域的中文文本纠错方法和系统
CN111368514A (zh) 模型训练及古诗生成方法、古诗生成模型、设备和介质
CN111767694B (zh) 文本生成方法、装置和计算机可读存储介质
WO2023241415A1 (zh) 用于生成文本的配乐的方法、装置、电子设备和介质
JP2023539470A (ja) 自動ナレッジ・グラフ構成
CN113268560A (zh) 用于文本匹配的方法和装置
JPWO2014073206A1 (ja) 情報処理装置、及び、情報処理方法
US20230094828A1 (en) Audio file annotation
WO2024012284A1 (zh) 音频识别方法、装置、电子设备和计算机程序产品
CN115827865A (zh) 一种融合多特征图注意力机制的不良文本分类方法及系统
US11379738B2 (en) Using higher order actions to annotate a syntax tree with real data for concepts used to generate an answer to a question
JP6309852B2 (ja) 強調位置予測装置、強調位置予測方法及びプログラム

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 23822984
Country of ref document: EP
Kind code of ref document: A1