WO2024077906A1 - Speech text generation method, and training method and apparatus for speech text generation model - Google Patents

Speech text generation method, and training method and apparatus for speech text generation model Download PDF

Info

Publication number
WO2024077906A1
WO2024077906A1 (application PCT/CN2023/087793)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
text
speech
spoken
target
Prior art date
Application number
PCT/CN2023/087793
Other languages
English (en)
French (fr)
Inventor
冯明超
陈蒙
覃杰
Original Assignee
京东科技信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东科技信息技术有限公司
Publication of WO2024077906A1 publication Critical patent/WO2024077906A1/zh

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • The present disclosure relates to the fields of artificial intelligence technology and intelligent customer service technology, and more specifically to a speech text generation method, and a training method, apparatus, device, medium and program product for a speech text generation model.
  • An intelligent dialogue system automatically generates business-related intelligent voice information, or generates intelligent reply information based on a user's voice information, thereby realizing automatic voice interaction with the user to meet the user's relevant needs.
  • However, the intelligent dialogue system usually converts text written in standard written sentences into intelligent voice information, and the generated intelligent voice information is relatively stiff and dull, differing considerably from the voice information produced in normal human conversation.
  • To address this, the present disclosure provides a speech text generation method, and a training method, apparatus, device, medium and program product for a speech text generation model.
  • One aspect of the present disclosure provides a method for generating speech text, comprising:
  • Inserting a target modal particle into the standard text according to the predicted insertion position to obtain a target spoken text includes:
  • inputting the masked standard text into the speech text generation model, so that the speech text generation model inserts the target modal particle at a target insertion position among the predicted insertion positions to generate the target spoken text.
  • The above speech text generation method further includes:
  • obtaining an initial corpus set, wherein the initial corpus set includes an initial spoken corpus text generated from a spoken speech corpus; and
  • determining the modal particle distribution feature based on the spoken corpus part-of-speech tagging result and the spoken corpus modal particle tagging result.
  • Performing part-of-speech tagging on a standard text to obtain a part-of-speech tagging result includes:
  • The above semantic recognition model includes: a first semantic recognition model constructed based on a recurrent neural network model and a conditional random field model; or a second semantic recognition model constructed based on dependency syntactic analysis.
  • Another aspect of the present disclosure further provides a method for training a speech text generation model, comprising:
  • training an initial speech text generation model using a target training set to obtain a trained speech text generation model, wherein the target training set includes the training sample masked standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal particle tagging result of the training sample spoken text.
  • The above training method of the speech text generation model further includes:
  • updating the first sample standard text in the first sample set and the first sample spoken text associated with the first sample standard text, respectively, using the sample confusion words in the sample confusion dictionary, to obtain a second sample set containing a second sample standard text and a second sample spoken text associated with the second sample standard text; and
  • constructing the training sample set according to the first sample set and the second sample set.
  • The above training method of the speech text generation model further includes:
  • constructing the sample confusion dictionary based on a sample standard corpus text and a sample confusion corpus text.
  • The above training method of the speech text generation model further includes:
  • obtaining a sample initial corpus set, wherein the sample initial corpus set includes a sample initial spoken corpus text generated from a sample spoken speech corpus; and
  • determining the distribution characteristics of the sample modal particles.
  • Another aspect of the present disclosure further provides a speech text generation apparatus, comprising:
  • a tagging module configured to perform part-of-speech tagging on a standard text to obtain a part-of-speech tagging result;
  • a first determination module configured to determine a target part of speech from the part-of-speech tagging result according to the modal particle distribution feature;
  • a second determination module configured to determine a predicted insertion position according to the position, in the standard text, of the content corresponding to the target part of speech;
  • an insertion module configured to insert a target modal particle into the standard text according to the predicted insertion position to obtain a target spoken text; and
  • a generation module configured to generate a target speech text according to the target spoken text.
  • Another aspect of the present disclosure further provides a training apparatus for a speech text generation model, comprising:
  • a sample tagging module configured to perform part-of-speech tagging on a training sample standard text in a training sample set and on a training sample spoken text associated with the training sample standard text, respectively, to obtain a first sample part-of-speech tagging result of the training sample standard text, a second sample part-of-speech tagging result of the training sample spoken text, and a sample modal particle tagging result of the training sample spoken text;
  • a first sample determination module configured to determine a sample target part of speech from the first sample part-of-speech tagging result according to the sample modal particle distribution feature;
  • a second sample determination module configured to determine a sample predicted insertion position according to the position, in the training sample standard text, of the sample content corresponding to the sample target part of speech;
  • a sample masking module configured to mask the sample predicted insertion position in the training sample standard text to obtain a training sample masked standard text, wherein the training sample masked standard text has the first sample part-of-speech tagging result; and
  • a training module configured to train an initial speech text generation model using a target training set to obtain a trained speech text generation model, wherein the target training set includes the training sample masked standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal particle tagging result of the training sample spoken text.
  • Another aspect of the present disclosure provides an electronic device, comprising:
  • one or more processors; and
  • a memory for storing one or more programs,
  • wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described above.
  • Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the method described above.
  • Another aspect of the present disclosure provides a computer program product comprising computer-executable instructions which, when executed, implement the method described above.
  • According to the embodiments of the present disclosure, the part of speech of each standard word in the standard text can be obtained from the part-of-speech tagging result, and a predicted insertion position at which a modal particle can be inserted is determined from the part-of-speech tagging result according to the distribution feature. A target modal particle is then inserted into the standard text according to the predicted insertion position, so that the resulting target spoken text has the colloquial characteristics of normal human conversation. The target speech text generated from the target spoken text can therefore at least partially solve the technical problem that related intelligent voice information is relatively stiff and dull and differs markedly from human conversation: the target speech text is closer to the colloquial characteristics of human conversational speech and has anthropomorphic characteristics, thereby improving the user experience during voice interaction.
  • FIG1 schematically shows an exemplary system architecture to which a speech text generation method and apparatus according to an embodiment of the present disclosure can be applied.
  • FIG2 schematically shows a flow chart of a speech text generation method according to an embodiment of the present disclosure.
  • FIG3 schematically shows a flow chart of a speech text generation method according to another embodiment of the present disclosure.
  • FIG4 schematically shows a flow chart of inserting a target modal particle into a standard text according to a predicted insertion position to obtain a target spoken text according to an embodiment of the present disclosure.
  • FIG5 schematically shows an application scenario diagram of a speech text generation method according to an embodiment of the present disclosure.
  • FIG6 schematically shows a flow chart of a method for training a speech text generation model according to an embodiment of the present disclosure.
  • FIG7 schematically shows a block diagram of a speech text generation apparatus according to an embodiment of the present disclosure.
  • FIG8 schematically shows a block diagram of a training apparatus for a speech text generation model according to an embodiment of the present disclosure.
  • FIG9 schematically shows a block diagram of an electronic device suitable for implementing the speech text generation method and the method for training the speech text generation model according to an embodiment of the present disclosure.
  • The voice information generated by an intelligent dialogue system is usually generated from written text, ignoring the modal particles, hesitation words and restatement words that occur in real conversations between people. As a result, although the voice information generated by speech synthesis devices or manual translation is very standard, it is stiff and dull, which can easily make users feel that they are talking to a machine, thereby degrading the user experience.
  • The embodiments of the present disclosure provide a speech text generation method, and a training method, apparatus, device, medium and program product for a speech text generation model.
  • The speech text generation method includes: performing part-of-speech tagging on a standard text to obtain a part-of-speech tagging result; determining a target part of speech from the part-of-speech tagging result according to the modal particle distribution feature; determining a predicted insertion position according to the position, in the standard text, of the content corresponding to the target part of speech; inserting a target modal particle into the standard text according to the predicted insertion position to obtain a target spoken text; and generating a target speech text according to the target spoken text.
  • With this method, the part of speech of each standard word in the standard text can be obtained from the part-of-speech tagging result, and a predicted insertion position at which a modal particle can be inserted is determined from the part-of-speech tagging result according to the distribution feature. A target modal particle is then inserted into the standard text according to the predicted insertion position, so that the resulting target spoken text has the colloquial characteristics of normal human conversation. The generated target speech text can thus at least partially solve the technical problem that related intelligent voice information is stiff and dull and differs greatly from human conversation, is closer to the colloquial characteristics of human conversational speech, and has anthropomorphic characteristics, thereby improving the user experience during voice interaction.
  • The user's authorization or consent is obtained before the user's personal information is obtained or collected.
  • FIG1 schematically shows an exemplary system architecture to which the speech text generation method and apparatus according to an embodiment of the present disclosure can be applied. It should be noted that FIG1 is only an example of a system architecture to which the embodiment of the present disclosure can be applied, in order to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiment of the present disclosure cannot be used in other devices, systems, environments or scenarios.
  • The system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • The network 104 may include various connection types, such as wired and/or wireless communication links, etc.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, etc.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients and/or social platform software (examples only).
  • The terminal devices 101, 102 and 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, etc.
  • The server 105 may be a server that provides various services, such as a background management server (merely an example) that provides support for websites browsed by users using the terminal devices 101, 102 and 103.
  • The background management server may analyze and process received data such as user requests, and feed back the processing results (such as web pages, information, or data obtained or generated according to the user requests) to the terminal device.
  • The speech text generation method provided in the embodiments of the present disclosure can generally be executed by the server 105. Accordingly, the speech text generation apparatus provided in the embodiments of the present disclosure can generally be arranged in the server 105.
  • The speech text generation method provided in the embodiments of the present disclosure can also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
  • Correspondingly, the speech text generation apparatus provided in the embodiments of the present disclosure can also be arranged in a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105.
  • Alternatively, the speech text generation method provided in the embodiments of the present disclosure can also be executed by the terminal device 101, 102 or 103, or by another terminal device different from the terminal devices 101, 102 and 103.
  • Correspondingly, the speech text generation apparatus provided in the embodiments of the present disclosure can also be arranged in the terminal device 101, 102 or 103, or in another terminal device different from the terminal devices 101, 102 and 103.
  • For example, the standard text may be originally stored in any one of the terminal devices 101, 102 or 103 (for example, but not limited to, the terminal device 101), or stored on an external storage device from which it may be imported into the terminal device 101. The terminal device 101 may then locally execute the speech text generation method provided in the embodiments of the present disclosure, or send the standard text to another terminal device, server or server cluster, which then executes the method.
  • The number of terminal devices, networks and servers in FIG1 is only illustrative. Any number of terminal devices, networks and servers may be provided according to implementation requirements.
  • FIG2 schematically shows a flow chart of a method for generating speech text according to an embodiment of the present disclosure.
  • The method includes operations S210 to S250.
  • In operation S210, part-of-speech tagging is performed on the standard text to obtain a part-of-speech tagging result.
  • The standard text may include text used for written communication, such as a standard customer service response text, an email body text, etc.
  • The standard customer service response text can be applied to an intelligent customer service question-answering device, which generates a corresponding speech text based on the standardized response text, thereby realizing voice interaction with the user.
  • In the related art, however, the generated speech text is usually stiff and dull, differing too much from human conversational speech for the intelligent customer service question-answering device to possess spoken characteristics.
  • Part-of-speech tagging is performed on the standard text, and the obtained part-of-speech tagging result may include the standard words generated after word segmentation of the standard text and the part-of-speech features of the standard words.
  • The part-of-speech features may include, for example, an adjective part of speech, a verb part of speech, etc.
  • The embodiments of the present disclosure do not limit the specific method of part-of-speech tagging.
  • For example, a network model built based on a neural network can be used to tag the parts of speech of the standard text, but it is not limited to this.
  • A semantic recognition model built based on a statistical algorithm can also be used to tag the parts of speech of the standard text.
  • The embodiments of the present disclosure do not limit the specific technical means of part-of-speech tagging, and those skilled in the art can make a choice according to actual conditions.
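  • As an illustrative aside (not part of the disclosure), the shape of a part-of-speech tagging result can be sketched with a toy dictionary-based tagger. The lexicon, the English words and the fallback tag "x" are assumptions made up for illustration; the tag labels "v", "r", "a" and "n" follow the labels used in the application scenario described later.

```python
# Toy part-of-speech tagger: a stand-in for the neural or statistical
# semantic recognition models the disclosure describes. The lexicon is
# hypothetical and exists only to make the tagging-result shape concrete.
LEXICON = {
    "excuse me": "v",   # verb-like politeness marker, as in the example scenario
    "you": "r",         # pronoun
    "need": "v",        # verb
    "large-size": "a",  # adjective
    "display": "n",     # noun
}

def pos_tag(words):
    """Return (standard word, part-of-speech tag) pairs; unknown words get 'x'."""
    return [(w, LEXICON.get(w.lower(), "x")) for w in words]

print(pos_tag(["Excuse me", "you", "need", "large-size", "display"]))
```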
  • In operation S220, a target part of speech is determined from the part-of-speech tagging result according to the modal particle distribution feature.
  • The modal particles may include words expressing human emotions such as hesitation and doubt, for example "um", "that is", "for example", "something like", etc. They are not limited thereto, and may also include polite expressions used at the beginning and/or end of a human voice conversation, such as "if you are free" and "if you like".
  • The parts of speech of the words adjacent to the modal particles in a dialogue text can be counted to determine the modal particle distribution feature, so that the target part of speech in the standard text can be predicted based on the distribution characteristics of the modal particles.
  • For example, the target part of speech may include a verb part of speech and an adjective part of speech.
  • In operation S230, a predicted insertion position is determined according to the position, in the standard text, of the content corresponding to the target part of speech.
  • For example, the predicted insertion position may be determined based on the part-of-speech features of the standard words in the standard text, and may be a position adjacent to a standard word having the target part of speech.
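  • The adjacency rule above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes, following the example distribution feature, that positions after verb-tagged words and before adjective-tagged words are candidates, and it represents a predicted insertion position as a gap index (0 means before the first word).

```python
def predicted_insertion_positions(tagged_words, after_pos=("v",), before_pos=("a",)):
    """Collect gap indices adjacent to words whose tag is a target part of speech."""
    positions = set()
    for i, (_word, tag) in enumerate(tagged_words):
        if tag in after_pos:
            positions.add(i + 1)  # the gap just after this word
        if tag in before_pos:
            positions.add(i)      # the gap just before this word
    return sorted(positions)

tagged = [("Excuse me", "v"), ("you", "r"), ("need", "v"),
          ("large-size", "a"), ("display", "n")]
# Gap 3 is both "after the verb" and "before the adjective", so it appears once.
print(predicted_insertion_positions(tagged))
```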
  • In operation S240, the target modal particle is inserted into the standard text according to the predicted insertion position to obtain a target spoken text.
  • When there are multiple predicted insertion positions, a corresponding target modal particle can be inserted at each predicted insertion position, or the target modal particle can be inserted at a target insertion position among the predicted insertion positions.
  • In this way, the colloquial characteristics of the target spoken text can be enhanced without changing the semantic information of the standard text.
  • In operation S250, a target speech text is generated according to the target spoken text.
  • The target speech text may be voice information; for example, the target spoken text may be converted into the target speech text using a related speech synthesis device, so that the target speech text is generated based on the target spoken text.
  • In this way, the part of speech of each standard word in the standard text can be obtained from the part-of-speech tagging result, the predicted insertion position at which a modal particle can be inserted is determined from the tagging result according to the distribution feature, and the target modal particle is inserted into the standard text according to the predicted insertion position.
  • The inserted particle gives the target spoken text the colloquial characteristics of normal human conversation, so that the target speech text generated from it can at least partially solve the technical problem that related intelligent voice information is relatively stiff and dull and differs markedly from human conversation.
  • The target speech text is thus closer to the colloquial characteristics of human conversational speech and has anthropomorphic characteristics, thereby enhancing the user experience during voice interaction.
  • FIG3 schematically shows a flow chart of a method for generating speech text according to another embodiment of the present disclosure.
  • The speech text generation method may further include operations S310 to S330.
  • In operation S310, an initial corpus set is obtained, wherein the initial corpus set includes an initial spoken corpus text generated from a spoken speech corpus.
  • In operation S320, part-of-speech tagging is performed on the initial spoken corpus text to obtain a spoken corpus part-of-speech tagging result and a spoken corpus modal particle tagging result.
  • In operation S330, the modal particle distribution feature is determined according to the spoken corpus part-of-speech tagging result and the spoken corpus modal particle tagging result.
  • The initial spoken corpus text may include, for example, a corpus text generated from the conversational voice information of a real human conversation scene; the corpus text transcribes the conversational voice information, so the initial spoken corpus text contains the spoken corpus modal particles that humans habitually add.
  • From the tagging results, the parts of speech of the spoken corpus words in the initial spoken corpus text can be obtained, as well as the positional relationship between the spoken corpus modal particles and each spoken corpus word.
  • On this basis, the distribution characteristics of the spoken corpus modal particles, that is, the distribution characteristics of modal particles in spoken voice information, can be determined.
  • For example, the modal particle distribution feature may indicate that the statistical probability of a modal particle appearing after a spoken corpus word with a verb part of speech is 0.9, and that the statistical probability of a modal particle appearing before a spoken corpus word with an adjective part of speech is 0.8. By counting the statistical probabilities of such positions, the modal particle distribution feature is determined.
  • A position probability threshold may be set, and a position whose statistical probability is greater than or equal to the position probability threshold may be used as a distribution statistical probability in the modal particle distribution feature.
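  • The counting described in operations S310 to S330 can be sketched as follows. This is a hedged illustration: the tiny corpus, the particle tag "e" and the threshold value are assumptions, and the probability estimated here is the fraction of words of a given part of speech that are immediately followed by a modal particle.

```python
from collections import Counter

def particle_distribution(tagged_sentences, particle_tag="e", threshold=0.5):
    """For each part of speech, estimate P(a modal particle immediately follows
    a word of that part of speech) and keep positions reaching the threshold."""
    pos_total = Counter()
    pos_before_particle = Counter()
    for sent in tagged_sentences:
        for i, (_word, tag) in enumerate(sent):
            if tag == particle_tag:
                continue
            pos_total[tag] += 1
            if i + 1 < len(sent) and sent[i + 1][1] == particle_tag:
                pos_before_particle[tag] += 1
    probs = {t: pos_before_particle[t] / n for t, n in pos_total.items()}
    return {t: p for t, p in probs.items() if p >= threshold}

corpus = [  # made-up spoken corpus: (word, tag) pairs, "e" marks modal particles
    [("need", "v"), ("um", "e"), ("display", "n")],
    [("need", "v"), ("uh", "e"), ("screen", "n")],
    [("see", "v"), ("screen", "n")],
]
print(particle_distribution(corpus))  # only "v" clears the 0.5 threshold
```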
  • In operation S210, performing part-of-speech tagging on a standard text to obtain a part-of-speech tagging result may include the following operations.
  • The standard text is input into a semantic recognition model to obtain the part-of-speech tagging result, wherein the semantic recognition model includes: a first semantic recognition model constructed based on a recurrent neural network model and a conditional random field model; or a second semantic recognition model constructed based on dependency syntactic analysis.
  • The first semantic recognition model may, for example, be constructed by sequentially connecting a recurrent neural network model (RNN model) and a conditional random field model (CRF model), or may be constructed based on a bidirectional long short-term memory network model (Bi-LSTM model) and a conditional random field model (CRF model).
  • The recurrent neural network model may include a long short-term memory network model (LSTM model) or a bidirectional long short-term memory network model (Bi-LSTM model), and those skilled in the art may design the specific network structure of the first semantic recognition model according to actual needs.
  • The second semantic recognition model constructed based on dependency syntactic analysis may include, for example, the LTP (Language Technology Platform) language processing system, etc.
  • FIG4 schematically shows a flow chart of inserting a target modal particle into a standard text according to a predicted insertion position to obtain a target spoken text according to an embodiment of the present disclosure.
  • Operation S240 of inserting the target modal particle into the standard text according to the predicted insertion position to obtain the target spoken text includes operations S410 to S420.
  • In operation S410, the predicted insertion position of the standard text is masked to obtain a masked standard text.
  • In operation S420, the masked standard text is input into the speech text generation model, so that the speech text generation model inserts the target modal particle at a target insertion position among the predicted insertion positions to generate the target spoken text.
  • The speech text generation model may be constructed based on a BERT model and may, for example, include a BERT-WWM model.
  • After the masked standard text is input into the BERT-WWM model, the model may iteratively predict the masks at the predicted insertion positions, determine the target insertion positions among the predicted insertion positions, and determine the target modal particle for each target insertion position from a modal particle set based on the prediction ability of the BERT-WWM model, thereby generating the target spoken text.
  • The speech text generation model can be obtained after training using the relevant training method.
  • When the speech text generation model is a BERT-WWM model, at least some of the standard words in the masked standard text can additionally be replaced with synonyms or homophones based on the prediction ability of the BERT-WWM model, thereby further improving the spoken characteristics of the target spoken text.
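  • Operations S410 to S420 can be sketched with the following control flow. A real implementation would score each mask with a BERT-WWM masked-language model; here score_candidates is a hypothetical stub (its heuristic and the particle inventory are made up) so that only the masking and iterative fill-in logic is illustrated.

```python
MASK = "[MASK]"

def score_candidates(left_context, right_context):
    """Stub standing in for a BERT-WWM scorer: returns (candidate, score) pairs,
    where the empty string means "insert nothing at this position"."""
    if right_context and right_context[0] == "large-size":
        return [("um", 0.9), ("", 0.1)]  # toy rule: hesitate before the adjective
    return [("", 0.8), ("um", 0.2)]

def fill_masked_text(words, positions):
    """Mask each predicted insertion position, then resolve each mask in turn:
    either insert the best-scoring modal particle or drop the mask entirely."""
    out = list(words)
    for p in sorted(positions, reverse=True):  # right-to-left keeps gap indices valid
        out.insert(p, MASK)
    i = 0
    while i < len(out):
        if out[i] == MASK:
            best, _score = max(score_candidates(out[:i], out[i + 1:]),
                               key=lambda c: c[1])
            if best:
                out[i] = best  # this mask becomes a target insertion position
                i += 1
            else:
                del out[i]     # this predicted position gets no particle
        else:
            i += 1
    return out

words = ["Excuse me", "you", "need", "large-size", "display"]
print(fill_masked_text(words, [1, 3]))
```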
  • FIG5 schematically shows an application scenario diagram of the method for generating speech text according to an embodiment of the present disclosure.
  • The application scenario may include a standard text 510, "Excuse me, do you need a large-size display?"
  • The standard text 510 is input into a semantic recognition model 520, which performs part-of-speech tagging on the standard text to obtain a part-of-speech tagging result 530.
  • The semantic recognition model 520 may be constructed based on a bidirectional long short-term memory network model (Bi-LSTM model) and a conditional random field model (CRF model) connected in sequence.
  • The part-of-speech tagging result 530 may include the standard words "Excuse me", "You", "Need", "Large size", "Display" and "What" in the standard text 510.
  • The tagging result may also include the part of speech of each standard word, where "v" represents the verb part of speech, "r" the pronoun part of speech, "a" the adjective part of speech, "n" the noun part of speech, and "e" the modal particle part of speech.
  • From the part-of-speech tagging result 530, the target parts of speech can be determined as the verb part of speech and the adjective part of speech. According to the content corresponding to the target parts of speech, predicted insertion positions are determined after the verb-tagged standard word "Excuse me", after the verb-tagged standard word "Need", and before the adjective-tagged standard word "Large size" in the standard text 510, and each predicted insertion position is masked to obtain a masked standard text 540.
  • The masked standard text 540 may include mask units 541 and 542 corresponding to the predicted insertion positions.
  • The masked standard text 540 is input into a speech text generation model 550, which can determine the predicted insertion positions as target insertion positions, insert the target modal particle " ⁇ " into the mask unit 541 corresponding to one target insertion position, and insert the target modal particle " ⁇ " into the mask unit 542 corresponding to the other, thereby generating a target spoken text 560, "Excuse me, ⁇ , do you need this large-size display?"
  • the target spoken text 560 can have a spoken feature close to human spoken speech information
  • the target speech text generated according to the target spoken text 560 can have a spoken feature, at least partially avoiding the stiffness and dullness of the generated speech information, and reducing the difference with human dialogue speech information.
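A minimal sketch of the insertion stage: given the masked text, a predictor chooses, for each mask unit, either a modal particle or nothing, and masks that are not target insertion positions are dropped. The `predictions` dictionary stands in for the output of the speech text generation model, and the English particles "um" and "uh" are stand-ins for the modal particles in the example; all of these are assumptions for illustration.

```python
MASK = "[MASK]"

def fill_masks(masked_tokens, predictions):
    """predictions: {mask_index: particle_or_None}.

    A particle is inserted at each target insertion position; a None
    (or missing) prediction means the mask is not a target position
    and is simply dropped.
    """
    out, mask_idx = [], 0
    for token in masked_tokens:
        if token == MASK:
            particle = predictions.get(mask_idx)
            if particle is not None:
                out.append(particle)
            mask_idx += 1
        else:
            out.append(token)
    return out

tokens = ["Excuse me", MASK, "you", "need", MASK, "this", "display"]
spoken = fill_masks(tokens, {0: "um", 1: "uh"})
```

Dropping unused masks keeps the output grammatical when the model decides that a candidate position should not receive a particle.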
  • FIG. 6 schematically shows a flowchart of a training method for a speech text generation model according to an embodiment of the present disclosure.
  • the method includes operations S610 to S650 .
  • part-of-speech tagging is performed on the training sample standard text in the training sample set and the training sample spoken text associated with the training sample standard text, respectively, to obtain a first sample part-of-speech tagging result of the training sample standard text, a second sample part-of-speech tagging result of the training sample spoken text, and a sample modal particle tagging result of the training sample spoken text.
  • a sample target part of speech is determined from the first sample part of speech tagging result according to the sample modal particle distribution feature.
  • the predicted insertion position of the sample is determined according to the position of the sample content corresponding to the sample target part of speech in the training sample standard text.
  • the sample prediction insertion position in the training sample standard text is masked to obtain the training sample masked standard text, wherein the training sample masked standard text has a first sample part-of-speech tagging result.
  • the initial speech text generation model is trained using the target training set to obtain a trained speech text generation model, wherein the target training set includes the training sample masked standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal particle tagging result of the training sample spoken text.
  • the training sample standard text may include a standard written text
  • the training sample spoken text may include a spoken text converted from the speech information produced when a sample user reads the training sample standard text aloud. Since the training sample spoken text is generated from the sample user's spoken rendition, it may include sample modal particles.
  • a training sample masked standard text and a training sample spoken text can be combined into a training sample pair, and a similarity label value can be determined based on the similarity between the training sample masked standard text and the training sample spoken text in the training sample pair.
  • the similarity label value can be used to iteratively adjust the weight parameters of the initial speech text generation model, so that the trained speech text generation model can predict the positional relationship between the sample modal particles and the first sample part-of-speech tagging result in the training sample standard text; the target insertion position can then be accurately determined from the predicted insertion positions, and the target sample modal particles can be determined from the sample modal particles.
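The disclosure does not pin down how the similarity label value is computed; as one hedged possibility, a character-level ratio such as Python's `difflib.SequenceMatcher` yields a label in [0, 1] for each (masked standard text, spoken text) training pair:

```python
import difflib

def similarity_label(masked_standard: str, spoken: str) -> float:
    """Label in [0, 1]: similarity of a masked standard text to the
    sample spoken text of the same training pair.

    difflib's ratio is a stand-in for whatever similarity measure the
    training setup actually uses; it is an assumption for illustration.
    """
    return difflib.SequenceMatcher(None, masked_standard, spoken).ratio()

pair_label = similarity_label("Excuse me [MASK] do you need this display",
                              "Excuse me um do you need this display")
```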
  • the speech text generation model trained by the training method provided in the embodiments of the present disclosure can be used in the above-mentioned speech text generation method.
  • the training method for the speech text generation model also includes the following operations.
  • the first sample standard text and the first sample spoken text associated with the first sample standard text in the first sample set are respectively updated using the sample confusion words in the sample confusion dictionary to obtain a second sample set including the second sample standard text and the second sample spoken text; and a training sample set is constructed according to the first sample set and the second sample set.
  • the sample confusion dictionary may include sample confusion word pairs consisting of sample standard words and sample confusion words.
  • the initial speech text generation model can be made to fully learn the similar association relationship between the standard words and the confusion words, so that the speech text generation model obtained after training can automatically replace the standard words in the standard text with the sample confusion words, thereby further enriching the semantic expression of the target spoken text and making the target spoken text closer to the spoken characteristics of normal human conversation.
  • the training method for the speech text generation model may further include the following operations.
  • the sample standard corpus text is processed by a speech synthesis device to obtain a sample speech corpus; speech recognition is performed on the sample speech corpus to obtain a sample confusion corpus text; and a sample confusion dictionary is constructed according to the sample standard corpus text and the sample confusion corpus text.
  • the sample standard corpus text may include text used for written communication, such as standard customer service staff response text, email body text, etc.
  • the sample speech corpus may include speech information generated after the speech synthesis apparatus automatically converts the sample standard corpus text into speech.
  • by performing automatic speech recognition (ASR) on the sample speech corpus, the recognized sample confusion corpus text can be obtained. Due to the recognition capability limitation of the speech recognition device, at least some of the sample standard words in the sample standard corpus text may be recognized as sample confusion words, so that the sample confusion corpus text contains the sample confusion words recognized by the speech recognition device.
  • the sample standard words and the sample confusion words can be made into sample confusion word pairs, and then a sample confusion dictionary can be constructed.
  • the initial sample confusion corpus text whose confidence information is less than or equal to a preset confidence threshold is determined as the sample confusion corpus text, so that sample confusion words that are easily misrecognized can be selected according to the sample confusion corpus text, so that the sample confusion word pairs of the constructed sample confusion dictionary can more accurately reflect the association characteristics of the sample confusion words and the sample standard words.
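The dictionary-construction loop described in this subsection can be sketched as below. `tts` and `asr` stand for the speech synthesis and speech recognition components; their signatures, the confidence threshold value, and the position-wise word alignment are illustrative assumptions, not the exact implementation.

```python
def build_confusion_dict(standard_texts, tts, asr, conf_threshold=0.8):
    """Round-trip each standard corpus text through TTS and ASR.

    tts(text) -> audio; asr(audio) -> (recognized_text, confidence).
    Only low-confidence recognitions are kept as confusion corpus text,
    so the resulting word pairs reflect easily-misrecognized words.
    """
    confusion = {}
    for standard in standard_texts:
        recognized, confidence = asr(tts(standard))
        if confidence > conf_threshold:
            continue  # confident recognitions carry no confusion signal
        for std_word, conf_word in zip(standard.split(), recognized.split()):
            if std_word != conf_word:
                confusion.setdefault(std_word, set()).add(conf_word)
    return confusion
```

Word-by-word position alignment is a simplification; a real pipeline would likely use an edit-distance alignment so that insertions and deletions do not shift the pairing.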
  • the training method for the speech text generation model may further include the following operations.
  • a sample initial corpus set is obtained, wherein the sample initial corpus set includes a sample initial spoken corpus text generated according to a sample spoken speech corpus; part-of-speech tagging is performed on the sample initial spoken corpus text to obtain a sample spoken corpus part-of-speech tagging result and a sample spoken corpus modal particle tagging result; and a sample modal particle distribution feature is determined according to the sample spoken corpus part-of-speech tagging result and the sample spoken corpus modal particle tagging result.
  • the sample initial spoken corpus text may include, for example, a corpus text generated according to the conversation voice information in a real human conversation scene, and the corpus text records the text of the conversation voice information, that is, the sample initial spoken corpus text contains spoken corpus modal particles that humans habitually add.
  • through the part-of-speech tagging of the sample initial spoken corpus text, the part of speech of the sample spoken corpus words in the sample initial spoken corpus text can be obtained, and the positional relationship between the sample spoken corpus modal particles and each sample spoken corpus word can also be obtained.
  • the distribution characteristics of the sample spoken corpus modal particles can be determined, that is, the distribution characteristics of the sample modal particles in the sample spoken voice information can be determined.
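One hedged way to realize such a distribution feature is to count, over the sample initial spoken corpus, which part of speech immediately precedes each modal particle; the `e` particle tag mirrors the FIG. 5 tag codes and is an assumption for illustration:

```python
from collections import Counter

def modal_particle_distribution(tagged_sentences, particle_tag="e"):
    """tagged_sentences: lists of (word, pos) pairs from the POS tagger.

    Returns a Counter over the POS tags that immediately precede a
    modal particle, i.e. the contexts where speakers habitually
    insert particles.
    """
    distribution = Counter()
    for sentence in tagged_sentences:
        for i, (word, pos) in enumerate(sentence):
            if pos == particle_tag and i > 0:
                distribution[sentence[i - 1][1]] += 1
    return distribution

# Toy corpus: particles tend to follow verbs in these sentences.
corpus = [
    [("ask", "v"), ("um", "e"), ("you", "r"), ("need", "v"), ("uh", "e")],
    [("this", "r"), ("is", "v"), ("er", "e"), ("big", "a")],
]
feature = modal_particle_distribution(corpus)
```

A feature like this is what would later mark the verb part of speech as a target part of speech when selecting predicted insertion positions.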
  • the speech text generation model trained by the training method for the speech text generation model provided in the embodiments of the present disclosure can be used in the above-mentioned speech text generation method.
  • FIG. 7 schematically shows a block diagram of a speech text generation apparatus according to an embodiment of the present disclosure.
  • the speech text generating apparatus 700 may include a marking module 710 , a first determining module 720 , a second determining module 730 , an inserting module 740 and a generating module 750 .
  • the tagging module 710 is used to perform part-of-speech tagging on the standard text to obtain a part-of-speech tagging result.
  • the first determination module 720 is used to determine the target part of speech from the part of speech tagging results according to the distribution characteristics of the modal particles.
  • the second determination module 730 is used to determine the predicted insertion position according to the position of the content corresponding to the target part of speech in the standard text.
  • the inserting module 740 is used to insert the target modal particle into the standard text according to the predicted insertion position to obtain the target spoken text.
  • the generating module 750 is used to generate a target speech text according to the target spoken text.
  • the insertion module may include: a mask unit and a generation unit.
  • the masking unit is used to mask the predicted insertion position of the standard text to obtain the masked standard text.
  • the generation unit is used to input the masked standard text into the speech-text generation model so that the speech-text generation model inserts the target modal particle at the target insertion position in the predicted insertion position to generate the target spoken text.
  • the speech text generation apparatus may further include: an acquisition module, a corpus annotation module and a third determination module.
  • the acquisition module is used to acquire an initial corpus set, wherein the initial corpus set includes an initial spoken language corpus text generated according to the spoken language voice corpus.
  • the corpus annotation module is used to perform part-of-speech tagging on the initial spoken corpus text to obtain spoken corpus part-of-speech tagging results and spoken corpus modal particle tagging results.
  • the third determination module is used to determine the distribution characteristics of modal particles according to the spoken corpus part-of-speech tagging results and the spoken corpus modal particle tagging results.
  • the labeling module may include a labeling unit.
  • the tagging unit is used to input the standard text into the semantic recognition model to obtain the part-of-speech tagging results.
  • the semantic recognition model includes:
  • a first semantic recognition model based on a recurrent neural network model and a conditional random field model; or a second semantic recognition model based on dependency syntactic analysis.
  • the speech text generation apparatus part in the embodiments of the present disclosure corresponds to the speech text generation method part in the embodiments of the present disclosure; for details of the speech text generation apparatus part, reference may be made to the speech text generation method part, which will not be elaborated here.
  • FIG. 8 schematically shows a block diagram of a training apparatus for a speech text generation model according to an embodiment of the present disclosure.
  • the training apparatus 800 for the speech text generation model may include a sample annotation module 810, a sample first determination module 820, a sample second determination module 830, a sample mask module 840 and a training module 850.
  • the sample tagging module 810 is used to perform part-of-speech tagging on the training sample standard text in the training sample set and the training sample spoken text associated with the training sample standard text, respectively, to obtain the first sample part-of-speech tagging result of the training sample standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal particle tagging result of the training sample spoken text.
  • the sample first determination module 820 is used to determine the sample target part of speech from the first sample part of speech tagging result according to the sample modal particle distribution feature.
  • the sample second determination module 830 is used to determine the sample prediction insertion position according to the position of the sample content corresponding to the sample target part of speech in the training sample standard text.
  • the sample masking module 840 is used to mask the sample prediction insertion position in the training sample standard text to obtain the training sample masked standard text, wherein the training sample masked standard text has a first sample part-of-speech tagging result.
  • the training module 850 is used to train the initial speech-text generation model using the target training set to obtain a trained speech-text generation model, wherein the target training set includes the training sample masked standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal particle tagging result of the training sample spoken text.
  • the training apparatus for the speech text generation model may further include: a sample updating module and a sample construction module.
  • the sample updating module is used to update the first sample standard text and the first sample spoken text associated with the first sample standard text in the first sample set respectively using the sample confusion words in the sample confusion dictionary to obtain a second sample set including the second sample standard text and the second sample spoken text.
  • the sample construction module is used to construct a training sample set according to the first sample set and the second sample set.
  • the training apparatus for the speech text generation model may further include: a corpus processing module, a recognition module, and a confusion dictionary construction module.
  • the corpus processing module is used to process the sample standard corpus text using a speech synthesis device to obtain a sample speech corpus.
  • the recognition module is used to perform speech recognition on the sample speech corpus to obtain the sample confusion corpus text.
  • the confusion dictionary building module is used to build a sample confusion dictionary based on the sample standard corpus text and the sample confusion corpus text.
  • the training apparatus for the speech text generation model may further include: a sample initial corpus acquisition module, a sample corpus annotation module and a sample third determination module.
  • the sample initial corpus acquisition module is used to acquire a sample initial corpus set, wherein the sample initial corpus set includes a sample initial spoken corpus text generated according to the sample spoken voice corpus;
  • the sample corpus annotation module is used to perform part-of-speech tagging on the sample initial spoken corpus text, and obtain the sample spoken corpus part-of-speech tagging results and the sample spoken corpus modal particle tagging results.
  • the sample third determination module is used to determine the distribution characteristics of the sample modal particles according to the part-of-speech tagging results of the sample spoken corpus and the modal particle tagging results of the sample spoken corpus.
  • the training apparatus part for the speech text generation model in the embodiments of the present disclosure corresponds to the training method part for the speech text generation model in the embodiments of the present disclosure; the description of the training apparatus part may refer to the training method part and will not be repeated here.
  • any one or more of the modules and units, or at least part of the functions of any one of them can be implemented in one module.
  • any one or more of the modules, submodules, units, and subunits can be split into multiple modules for implementation.
  • any one or more of the modules and units can be at least partially implemented as hardware circuits, such as field programmable gate arrays (FPGAs), programmable logic arrays (PLAs), systems on chips, systems on substrates, systems on packages, application specific integrated circuits (ASICs), or can be implemented by hardware or firmware in any other reasonable way of integrating or packaging the circuit, or implemented in any one of the three implementation methods of software, hardware, and firmware, or in any appropriate combination of any of them.
  • one or more of the modules and units can be at least partially implemented as computer program modules, and when the computer program modules are run, the corresponding functions can be performed.
  • any multiple of the annotation module 710, the first determination module 720, the second determination module 730, the insertion module 740, and the generation module 750 can be combined in one module/unit for implementation, or any one of the modules/units can be split into multiple modules/units. Alternatively, at least part of the functions of one or more of these modules/units can be combined with at least part of the functions of other modules/units/sub-units and implemented in one module/unit.
  • At least one of the annotation module 710, the first determination module 720, the second determination module 730, the insertion module 740, and the generation module 750 can be at least partially implemented as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system on a package, an application specific integrated circuit (ASIC), or can be implemented by hardware or firmware such as any other reasonable way of integrating or packaging the circuit, or by any one of the three implementation methods of software, hardware, and firmware, or by a suitable combination of any of them.
  • At least one of the labeling module 710, the first determination module 720, the second determination module 730, the insertion module 740, and the generation module 750 may be at least partially implemented as a computer program module which, when run, can perform the corresponding functions.
  • Figure 9 schematically shows a block diagram of an electronic device suitable for implementing a method for generating speech text and a method for training a speech text generation model according to an embodiment of the present disclosure.
  • the electronic device shown in Figure 9 is only an example and should not bring any limitation to the functions and scope of use of the embodiment of the present disclosure.
  • the electronic device 900 includes a processor 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage part 908 into a random access memory (RAM) 903.
  • the processor 901 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or a related chipset and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc.
  • the processor 901 may also include an onboard memory for caching purposes.
  • the processor 901 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of the present disclosure.
  • the RAM 903 stores various programs and data required for the operation of the electronic device 900.
  • the processor 901, ROM 902 and RAM 903 are connected to each other through a bus 904.
  • the processor 901 performs various operations of the method flow according to the embodiment of the present disclosure by executing the program in ROM 902 and/or RAM 903. It should be noted that the program can also be stored in one or more memories other than ROM 902 and RAM 903.
  • the processor 901 can also perform various operations of the method flow according to the embodiment of the present disclosure by executing the program stored in the one or more memories.
  • the electronic device 900 may further include an input/output (I/O) interface 905, which is also connected to the bus 904.
  • the electronic device 900 may further include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, etc.; an output section 907 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN card, a modem, etc.
  • the communication section 909 performs communication processing via a network such as the Internet.
  • a drive 910 is also connected to the I/O interface 905 as needed.
  • a removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
  • the method flow according to the embodiment of the present disclosure can be implemented as a computer software program.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable storage medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication part 909, and/or installed from the removable medium 911.
  • when the computer program is executed by the processor 901, the above functions defined in the system of the embodiment of the present disclosure are performed.
  • the systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules.
  • the present disclosure also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments; or may exist independently without being assembled into the device/apparatus/system.
  • the above computer-readable storage medium carries one or more programs, and when the above one or more programs are executed, the method according to the embodiment of the present disclosure is implemented.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • it may include, but is not limited to: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable storage medium may include the ROM 902 and/or RAM 903 described above and/or one or more memories other than ROM 902 and RAM 903.
  • the embodiments of the present disclosure also include a computer program product, which includes a computer program, and the computer program contains program code for executing the method provided by the embodiments of the present disclosure.
  • the computer program product runs on an electronic device, the program code is used to enable the electronic device to implement the above method provided by the embodiments of the present disclosure.
  • when the computer program is executed by the processor, the above functions defined in the system/apparatus of the embodiments of the present disclosure are performed.
  • the system, device, module, unit, etc. described above can be implemented by a computer program module.
  • the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device, or may be transmitted and distributed in the form of a signal over a network medium and downloaded and installed through the communication section 909, and/or installed from the removable medium 911.
  • the program code contained in the computer program may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.
  • the program code for executing the computer program provided by the embodiments of the present disclosure can be written in any combination of one or more programming languages.
  • these computer programs can be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages.
  • Programming languages include but are not limited to programming languages such as Java, C++, Python, "C" language, or similar programming languages.
  • the program code can be executed entirely on the user computing device, partially on the user device, partially on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device may be connected to the user computing device through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
  • each box in the flowchart or block diagram may represent a module, a program segment, or a part of a code, and the above-mentioned module, program segment, or a part of the code contains one or more executable instructions for implementing the specified logical function.
  • the functions marked in the box may also occur in an order different from that marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each box in the block diagram or flowchart, and the combination of boxes in the block diagram or flowchart can be implemented with a dedicated hardware-based system that performs the specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the features recited in the various embodiments and/or claims of the present disclosure can be combined in various ways, even if such combinations are not explicitly recorded in the present disclosure; all of these combinations fall within the scope of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a speech text generation method, which can be applied to the technical fields of artificial intelligence and intelligent customer service. The speech text generation method includes: performing part-of-speech tagging on a standard text to obtain a part-of-speech tagging result; determining a target part of speech from the part-of-speech tagging result according to modal-particle distribution features; determining a predicted insertion position according to the position, in the standard text, of the content corresponding to the target part of speech; inserting a target modal particle into the standard text according to the predicted insertion position to obtain a target spoken text; and generating a target speech text according to the target spoken text. The present disclosure also provides a training method for the speech text generation model, a speech text generation apparatus, a training apparatus for the speech text generation model, a device, a medium and a program product.

Description

Speech text generation method, training method for a speech text generation model, and apparatus
This application claims priority to Chinese Patent Application No. 202211231004.1, filed on October 9, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical fields of artificial intelligence and intelligent customer service, and more specifically to a speech text generation method, a training method for a speech text generation model, an apparatus, a device, a medium and a program product.
Background
With the development of artificial intelligence technology, intelligent dialogue systems are used in ever wider application scenarios. An intelligent dialogue system automatically generates business-related intelligent speech information, or generates intelligent reply information according to a user's speech information, so that the system can automatically interact with the user by voice to meet the user's needs.
In the course of realizing the concept of the present disclosure, the inventors found that the related art has at least the following problem: intelligent dialogue systems usually convert text with standard written sentence patterns into intelligent speech information, and the generated intelligent speech information is rather stiff and dull, differing considerably from the speech information produced in normal human conversation.
Summary
In view of this, the present disclosure provides a speech text generation method, a training method for a speech text generation model, an apparatus, a device, a medium and a program product.
One aspect of the present disclosure provides a speech text generation method, including:
performing part-of-speech tagging on a standard text to obtain a part-of-speech tagging result;
determining a target part of speech from the part-of-speech tagging result according to modal-particle distribution features;
determining a predicted insertion position according to the position, in the standard text, of the content corresponding to the target part of speech;
inserting a target modal particle into the standard text according to the predicted insertion position to obtain a target spoken text; and
generating a target speech text according to the target spoken text.
According to an embodiment of the present disclosure, inserting the target modal particle into the standard text according to the predicted insertion position to obtain the target spoken text includes:
masking the predicted insertion position of the standard text to obtain a masked standard text; and
inputting the masked standard text into a speech text generation model, so that the speech text generation model inserts the target modal particle at a target insertion position among the predicted insertion positions to generate the target spoken text.
According to an embodiment of the present disclosure, the speech text generation method further includes:
acquiring an initial corpus set, where the initial corpus set includes an initial spoken corpus text generated according to a spoken speech corpus;
performing part-of-speech tagging on the initial spoken corpus text to obtain a spoken-corpus part-of-speech tagging result and a spoken-corpus modal-particle tagging result; and
determining the modal-particle distribution features according to the spoken-corpus part-of-speech tagging result and the spoken-corpus modal-particle tagging result.
According to an embodiment of the present disclosure, performing part-of-speech tagging on the standard text to obtain the part-of-speech tagging result includes:
inputting the standard text into a semantic recognition model to obtain the part-of-speech tagging result,
where the semantic recognition model includes:
a first semantic recognition model constructed based on a recurrent neural network model and a conditional random field model; or
a second semantic recognition model constructed based on dependency syntactic analysis.
Another aspect of the present disclosure further provides a training method for a speech text generation model, including:
performing part-of-speech tagging on a training sample standard text in a training sample set and on a training sample spoken text associated with the training sample standard text, respectively, to obtain a first sample part-of-speech tagging result of the training sample standard text, a second sample part-of-speech tagging result of the training sample spoken text, and a sample modal-particle tagging result of the training sample spoken text;
determining a sample target part of speech from the first sample part-of-speech tagging result according to sample modal-particle distribution features;
determining a sample predicted insertion position according to the position, in the training sample standard text, of the sample content corresponding to the sample target part of speech;
masking the sample predicted insertion position in the training sample standard text to obtain a training sample masked standard text, where the training sample masked standard text has the first sample part-of-speech tagging result; and
training an initial speech text generation model using a target training set to obtain a trained speech text generation model, where the target training set includes the training sample masked standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal-particle tagging result of the training sample spoken text.
According to an embodiment of the present disclosure, the training method for the speech text generation model further includes:
updating, using sample confusion words in a sample confusion dictionary, a first sample standard text in a first sample set and a first sample spoken text associated with the first sample standard text, respectively, to obtain a second sample set including a second sample standard text and a second sample spoken text; and
constructing the training sample set according to the first sample set and the second sample set.
According to an embodiment of the present disclosure, the training method for the speech text generation model further includes:
processing a sample standard corpus text using a speech synthesis apparatus to obtain a sample speech corpus;
performing speech recognition on the sample speech corpus to obtain a sample confusion corpus text; and
constructing the sample confusion dictionary according to the sample standard corpus text and the sample confusion corpus text.
According to an embodiment of the present disclosure, the training method for the speech text generation model further includes:
acquiring a sample initial corpus set, where the sample initial corpus set includes a sample initial spoken corpus text generated according to a sample spoken speech corpus;
performing part-of-speech tagging on the sample initial spoken corpus text to obtain a sample spoken-corpus part-of-speech tagging result and a sample spoken-corpus modal-particle tagging result; and
determining the sample modal-particle distribution features according to the sample spoken-corpus part-of-speech tagging result and the sample spoken-corpus modal-particle tagging result.
Another aspect of the present disclosure further provides a speech text generation apparatus, including:
a tagging module configured to perform part-of-speech tagging on a standard text to obtain a part-of-speech tagging result;
a first determination module configured to determine a target part of speech from the part-of-speech tagging result according to modal-particle distribution features;
a second determination module configured to determine a predicted insertion position according to the position, in the standard text, of the content corresponding to the target part of speech;
an insertion module configured to insert a target modal particle into the standard text according to the predicted insertion position to obtain a target spoken text; and
a generation module configured to generate a target speech text according to the target spoken text.
Another aspect of the present disclosure further provides a training apparatus for a speech text generation model, including:
a sample tagging module configured to perform part-of-speech tagging on a training sample standard text in a training sample set and on a training sample spoken text associated with the training sample standard text, respectively, to obtain a first sample part-of-speech tagging result of the training sample standard text, a second sample part-of-speech tagging result of the training sample spoken text, and a sample modal-particle tagging result of the training sample spoken text;
a sample first determination module configured to determine a sample target part of speech from the first sample part-of-speech tagging result according to sample modal-particle distribution features;
a sample second determination module configured to determine a sample predicted insertion position according to the position, in the training sample standard text, of the sample content corresponding to the sample target part of speech;
a sample masking module configured to mask the sample predicted insertion position in the training sample standard text to obtain a training sample masked standard text, where the training sample masked standard text has the first sample part-of-speech tagging result; and
a training module configured to train an initial speech text generation model using a target training set to obtain a trained speech text generation model, where the target training set includes the training sample masked standard text, the second sample part-of-speech tagging result of the training sample spoken text, and the sample modal-particle tagging result of the training sample spoken text.
Another aspect of the present disclosure provides an electronic device, including:
one or more processors; and
a memory configured to store one or more programs,
where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed, are used to implement the method described above.
Another aspect of the present disclosure provides a computer program product including computer-executable instructions which, when executed, are used to implement the method described above.
According to the embodiments of the present disclosure, after part-of-speech tagging is performed on the standard text to obtain the part-of-speech tagging result, the part of speech of each standard word in the standard text can be obtained from the part-of-speech tagging result, and a predicted insertion position at which a modal particle can be inserted is determined from the part-of-speech tagging result according to the modal-particle distribution features. Inserting a target modal particle into the standard text according to the predicted insertion position gives the resulting target spoken text the colloquial characteristics of normal human conversation, so that the target speech text generated from the target spoken text can at least partially solve the technical problem that related intelligent speech information is rather stiff and dull and differs considerably from human conversation; the target speech text can come closer to the colloquial characteristics of human conversational speech, achieve the technical effect of an anthropomorphic quality, and improve the user experience during voice interaction.
Brief Description of the Drawings
The above and other objects, features and advantages of the present disclosure will become clearer from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically shows an exemplary system architecture to which the speech text generation method and apparatus according to embodiments of the present disclosure can be applied;
FIG. 2 schematically shows a flowchart of a speech text generation method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of a speech text generation method according to another embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of inserting a target modal particle into a standard text according to a predicted insertion position to obtain a target spoken text according to an embodiment of the present disclosure;
FIG. 5 schematically shows an application scenario diagram of a speech text generation method according to an embodiment of the present disclosure;
FIG. 6 schematically shows a flowchart of a training method for a speech text generation model according to an embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of a speech text generation apparatus according to an embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of a training apparatus for a speech text generation model according to an embodiment of the present disclosure; and
FIG. 9 schematically shows a block diagram of an electronic device suitable for implementing a speech text generation method and a training method for a speech text generation model according to an embodiment of the present disclosure.
具体实施方式
以下,将参照附图来描述本公开的实施例。但是应该理解,这些描述只是示例性的,而并非要限制本公开的范围。在下面的详细描述中,为便于解释,阐述了许多具体的细节以提供对本公开实施例的全面理解。然而,明显地,一个或多个实施例在没有这些具体细节的情况下也可以被实施。此外,在以下说明中,省略了对公知结构和技术的描述,以避免不必要地混淆本公开的概念。
在此使用的术语仅仅是为了描述具体实施例,而并非意在限制本公开。在此使用的术语“包括”、“包含”等表明了所述特征、步骤、操作和/或部件的存在,但是并不排除存在或添加一个或多个其他特征、步骤、操作或部件。
在此使用的所有术语(包括技术和科学术语)具有本领域技术人员通常所理解的含义,除非另外定义。应注意,这里使用的术语应解释为具有与本说明书的上下文相一致的含义,而不应以理想化或过于刻板的方式来解释。
在使用类似于“A、B和C等中至少一个”这样的表述的情况下,一般来说应该按照本领域技术人员通常理解该表述的含义来予以解释(例如,“具有A、B和C中至少一个的系统”应包括但不限于单独具有A、单独具有B、单独具有C、具有A和B、具有A和C、具有B和C、和/或具有A、B、C的系统等)。
智能对话系统产生的语音信息通常根据书面化的文本生成,忽略了真实场景中人与人交谈可能存在的语气词、犹豫词、重述词等。因此,基于语音合成装置或者由人工转译生成的语音信息虽然非常标准,但是十分生硬呆板,很容易让用户察觉到是在与机器对话,从而会降低用户的使用体验。
本公开的实施例提供了一种语音文本生成方法、语音文本生成模型的训练方法、装置、设备、介质及程序产品。该语音文本生成方法包括:对标准文本进行词性标注,得到词性标注结果;根据语气词分布特征从词性标注结果中确定目标词性;根据与目标词性对应的内容在标准文本中的位置确定预测插入位置;根据预测插入位置在标准文本中插入目标语气词,得到目标口语文本;以及根据目标口语文本生成目标语音文本。
根据本公开的实施例,在对标准文本进行词性标注,得到词性标注结果后,可以根据词性标注结果获得标准文本中每个标准词的词性,根据语气词分布特征从词性标注结果中确定可以插入语气词的预测插入位置,根据该预测插入位置在标准文本中插入目标语气词,可以使得到的目标口语文本具有人类正常对话所具备的口语化特性,从而使根据目标口语文本生成的目标语音文本可以至少部分解决相关智能语音信息较为生硬呆板、与人类对话差别较大的技术问题,使目标语音文本更加接近人类对话的语音信息中的口语化特征,使目标语音文本具有拟人化特点,提升用户在进行语音交互过程中的使用体验。
在本公开的技术方案中,在获取或采集用户个人信息之前,均获取了用户的授权或同意。
在本公开的技术方案中,所涉及的用户个人信息的收集、存储、使用、加工、传输、提供、公开和应用等处理,均符合相关法律法规的规定,采取了必要保密措施,且不违背公序良俗。
图1示意性示出了根据本公开实施例的可以应用语音文本生成方法、装置的示例性系统架构。需要注意的是,图1所示仅为可以应用本公开实施例的系统架构的示例,以帮助本领域技术人员理解本公开的技术内容,但并不意味着本公开实施例不可以用于其他设备、系统、环境或场景。
如图1所示,根据该实施例的系统架构100可以包括终端设备101、102、103、网络104和服务器105。网络104用以在终端设备101、102、103和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线和/或无线通信链路等等。
用户可以使用终端设备101、102、103通过网络104与服务器105交互,以接收或发送消息等。终端设备101、102、103上可以安装有各种通讯客户端应用,例如购物类应用、网页浏览器应用、搜索类应用、即时通信工具、邮箱客户端和/或社交平台软件等(仅为示例)。
终端设备101、102、103可以是具有显示屏并且支持网页浏览的各种电子设备,包括但不限于智能手机、平板电脑、膝上型便携计算机和台式计算机等等。
服务器105可以是提供各种服务的服务器,例如对用户利用终端设备101、102、103所浏览的网站提供支持的后台管理服务器(仅为示例)。后台管理服务器可以对接收到的用户请求等数据进行分析等处理,并将处理结果(例如根据用户请求获取或生成的网页、信息、或数据等)反馈给终端设备。
需要说明的是,本公开实施例所提供的语音文本生成方法一般可以由服务器105执行。相应地,本公开实施例所提供的语音文本生成装置一般可以设置于服务器105中。本公开实施例所提供的语音文本生成方法也可以由不同于服务器105且能够与终端设备101、102、103和/或服务器105通信的服务器或服务器集群执行。相应地,本公开实施例所提供的语音文本生成装置也可以设置于不同于服务器105且能够与终端设备101、102、103和/或服务器105通信的服务器或服务器集群中。或者,本公开实施例所提供的语音文本生成方法也可以由终端设备101、102、或103执行,或者也可以由不同于终端设备101、102、或103的其他终端设备执行。相应地,本公开实施例所提供的语音文本生成装置也可以设置于终端设备101、102、或103中,或设置于不同于终端设备101、102、或103的其他终端设备中。
例如,标准文本可以原本存储在终端设备101、102、或103中的任意一个(例如,终端设备101,但不限于此)之中,或者存储在外部存储设备上并可以导入到终端设备101中。然后,终端设备101可以在本地执行本公开实施例所提供的语音文本生成方法,或者将标准文本发送到其他终端设备、服务器、或服务器集群,并由接收该标准文本的其他终端设备、服务器、或服务器集群来执行本公开实施例所提供的语音文本生成方法。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。
图2示意性示出了根据本公开实施例的语音文本生成方法的流程图。
如图2所示,该方法包括操作S210~S250。
在操作S210,对标准文本进行词性标注,得到词性标注结果。
根据本公开的实施例,标准文本可以包括用于书面沟通交流的文本,例如标准客服人员应答文本、邮件正文文本等。
需要说明的是,标准客服人员应答文本可以应用于智能客服问答装置中,该智能客服问答装置根据标准化的标准客服人员应答文本,生成对应的语音文本,从而实现与用户进行语音交互,但生成的语音文本通常较为生硬呆板,与人类对话语音差别过大,不具备口语化特点。
根据本公开的实施例,对标准文本进行词性标注,得到的词性标注结果可以包括标准文本中进行分词后产生的标准词和标准词的词性特征,词性特征例如可以包括形容词词性、动词词性等。
需要说明的是,本公开的实施例针对词性标注的具体方法不做限定,例如可以采用基于神经网络构建的网络模型对标准文本进行词性标注,但不仅限于此,还可以采用基于统计算法构建的语义识别模型对标准文本进行词性标注,本公开的实施例对词性标注的具体技术手段不做限定,本领域技术人员可以根据实际情况进行选择。
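作为示意,下面给出一个基于词典查表的极简词性标注示例,用于说明词性标注结果的数据形式(词表与标签体系均为假设,实际可替换为上述基于神经网络或统计算法构建的模型):

```python
# 极简的词典查表式词性标注示意(仅作演示,非实际模型实现)
# 词性标签为假设:v=动词, r=代词, a=形容词, n=名词, e=语气词
POS_DICT = {
    "请问": "v", "您": "r", "需要": "v",
    "大尺寸的": "a", "显示器": "n", "么": "e",
}

def pos_tag(words):
    """对已分词的标准文本进行词性标注,返回 (标准词, 词性) 列表;未登录词默认标为名词。"""
    return [(w, POS_DICT.get(w, "n")) for w in words]

result = pos_tag(["请问", "您", "需要", "大尺寸的", "显示器", "么"])
```

实际系统中,词性标注结果通常由第一语义识别模型或第二语义识别模型输出,此处仅演示其结构。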
在操作S220,根据语气词分布特征从词性标注结果中确定目标词性。
根据本公开的实施例,语气词可以包括表示犹豫、疑惑等人类情感的词汇,例如“嗯”、“就是”、“比如”、“类似的”等。但不仅限于此,还可以包括在人类语音对话的开始和/或结束时会进行表达的礼貌用语类词,例如“如果有空的话”、“如果你喜欢”等。
根据本公开的实施例,可以根据大量的对话文本,统计对话文本中和语气词邻接的词的词性,确定语气词分布特征,从而可以根据语气词分布特征来预测标准文本中的目标词性。
在本公开的实施例中,目标词性可以包括动词词性和形容词词性。
在操作S230,根据与目标词性对应的内容在标准文本中的位置确定预测插入位置。
根据本公开的实施例,在确定目标词性后,可以根据标准文本中的标准词的词性特征,确定预测插入位置,该预测插入位置可以是具有目标词性的标准词的邻接位置。
在操作S240,根据预测插入位置在标准文本中插入目标语气词,得到目标口语文本。
根据本公开的实施例,在预测插入位置有多个的情况下,可以在每个预测插入位置均插入相应的目标语气词,或者还可以仅在预测插入位置中的目标插入位置插入目标语气词。在标准文本中插入目标语气词后,可以使目标口语文本在不改变标准文本语义信息的情况下,增强口语化特性。
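操作S220~S240中“根据目标词性确定预测插入位置”的逻辑可以用如下示意代码表达(目标词性与插入方位的约定为假设:动词之后、形容词之前为候选插入位置):

```python
# 根据目标词性确定预测插入位置的示意实现
# 位置约定为假设:动词(v)之后、形容词(a)之前为候选插入位置
def predict_insert_positions(tagged, target_after=("v",), target_before=("a",)):
    """tagged: (标准词, 词性) 列表;返回按词间隙编号的预测插入位置列表。
    间隙 i 表示位于第 i 个词之前(i 取 0..len(tagged))。"""
    positions = set()
    for i, (_, pos) in enumerate(tagged):
        if pos in target_after:
            positions.add(i + 1)   # 该词之后的间隙
        if pos in target_before:
            positions.add(i)       # 该词之前的间隙
    return sorted(positions)

tagged = [("请问", "v"), ("您", "r"), ("需要", "v"),
          ("大尺寸的", "a"), ("显示器", "n"), ("么", "e")]
positions = predict_insert_positions(tagged)
# “请问”之后为间隙1;“需要”之后与“大尺寸的”之前为同一间隙3
```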
在操作S250,根据目标口语文本生成目标语音文本。
根据本公开的实施例,目标语音文本可以是语音信息,可以利用相关语音合成设备将目标口语文本转化为目标语音文本,即根据目标口语文本生成目标语音文本。
根据本公开的实施例,在对标准文本进行词性标注,得到词性标注结果后,可以根据词性标注结果获得标准文本中每个标准词的词性,根据语气词分布特征从词性标注结果中确定可以插入语气词的预测插入位置,根据该预测插入位置在标准文本中插入目标语气词,可以使得到的目标口语文本具有人类正常对话所具备的口语化特性,从而使根据目标口语文本生成的目标语音文本可以至少部分解决相关智能语音信息较为生硬呆板、与人类对话差别较大的技术问题,使目标语音文本更加接近人类对话的语音信息中的口语化特性,使目标语音文本具有拟人化特点,提升用户在进行语音交互过程中的使用体验。
图3示意性示出了根据本公开另一实施例的语音文本生成方法的流程图。
如图3所示,语音文本生成方法还可以包括操作S310~操作S330。
在操作S310,获取初始语料集,其中,初始语料集包括根据口语语音语料生成的初始口语语料文本。
在操作S320,对初始口语语料文本进行词性标注,得到口语语料词性标注结果、口语语料语气词标注结果。
在操作S330,根据口语语料词性标注结果和口语语料语气词标注结果,确定语气词分布特征。
根据本公开的实施例,初始口语语料文本例如可以包括根据人类真实对话场景中对话语音信息生成的语料文本,该语料文本记录有对话语音信息的文本,即初始口语语料文本中包含有人类习惯添加的口语语料语气词。通过对初始口语语料文本进行词性标注,可以得到初始口语语料文本中,口语语料词的词性,还可以得到口语语料语气词与各个口语语料词之间的位置关系,通过分析统计该位置关系,可以确定口语语料语气词的分布特征,即可以确定在口语语音信息中的语气词分布特征。
例如,语气词分布特征可以表征在具有动词词性的口语语料词之后位置的统计概率为0.9,在具有形容词词性的口语语料词之前位置的统计概率为0.8,通过统计该些位置的统计概率,确定语气词分布特征。
根据本公开的实施例,可以设定位置概率阈值,并将统计概率大于或等于位置概率阈值的位置的统计概率,作为语气词分布特征中的分布统计概率。
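上述统计语气词分布特征并按位置概率阈值筛选的过程,可以用如下示意代码表达(语料与阈值数值均为假设):

```python
# 统计语气词分布特征的示意:遍历口语语料的词性标注结果,
# 统计语气词(e)出现在各词性之前/之后的统计概率,并按位置概率阈值筛选
from collections import Counter

def mood_word_distribution(tagged_corpus, mood_pos="e", threshold=0.6):
    """tagged_corpus: 若干句子的 (词, 词性) 列表;
    返回 {("after"/"before", 词性): 统计概率},仅保留达到阈值的项。"""
    ctx = Counter()     # 语气词出现在某词性之前/之后的计数
    total = Counter()   # 各词性(语气词除外)出现的总数
    for sent in tagged_corpus:
        for i, (_, pos) in enumerate(sent):
            if pos == mood_pos:
                continue
            total[pos] += 1
            if i + 1 < len(sent) and sent[i + 1][1] == mood_pos:
                ctx[("after", pos)] += 1
            if i > 0 and sent[i - 1][1] == mood_pos:
                ctx[("before", pos)] += 1
    probs = {k: ctx[k] / total[k[1]] for k in ctx}
    return {k: p for k, p in probs.items() if p >= threshold}

corpus = [
    [("请问", "v"), ("嗯", "e"), ("您", "r"), ("需要", "v"),
     ("这个", "e"), ("大尺寸的", "a"), ("显示器", "n")],
    [("我", "r"), ("需要", "v"), ("显示器", "n")],
]
dist = mood_word_distribution(corpus)
```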
需要说明的是,本公开实施例中的“之前”表示与该词邻接,且位于该词位置之前,相应地,“之后”表示与该词邻接,且位于该词位置之后。
根据本公开的实施例,操作S210,对标准文本进行词性标注,得到词性标注结果可以包括如下操作。
将标准文本输入至语义识别模型,得到词性标注结果;其中,语义识别模型包括:基于循环神经网络模型与条件随机场模型构建的第一语义识别模型;或者基于依存句法分析构建的第二语义识别模型。
根据本公开的实施例,第一语义识别模型例如可以是根据循环神经网络模型(RNN模型)与条件随机场模型(CRF模型)依序连接构建得到的,或者还可以是基于双向长短期记忆网络模型(Bi-LSTM模型)与条件随机场模型(CRF模型)依序连接构建得到的。
应该理解的是,循环神经网络模型可以包括长短期记忆网络模型(LSTM模型)、双向长短期记忆网络模型(Bi-LSTM模型),本领域技术人员可以根据实际需求对第一语义识别模型的具体网络结构进行设计。
根据本公开的实施例,基于依存句法分析构建的第二语义识别模型例如可以包括LTP(Language Technology Platform)语言处理系统等。
图4示意性示出了根据本公开实施例的根据预测插入位置在标准文本中插入目标语气词,得到目标口语文本的流程图。
如图4所示,操作S240,根据预测插入位置在标准文本中插入目标语气词,得到目标口语文本包括操作S410~S420。
在操作S410,对标准文本的预测插入位置进行掩码,得到掩码标准文本。
在操作S420,将掩码标准文本输入至语音文本生成模型,以便语音文本生成模型在预测插入位置中的目标插入位置插入目标语气词,生成目标口语文本。
根据本公开的实施例,语音文本生成模型可以是基于BERT模型构建得到的,例如可以包括BERT-WWM模型,掩码标准文本输入至BERT-WWM模型后,可以对预测插入位置的掩码进一步迭代预测,进而从预测插入位置中确定目标插入位置,并基于BERT-WWM模型的预测能力从语气词集中确定各个目标插入位置的目标语气词,实现生成目标口语文本。
需要说明的是,语音文本生成模型可以是经过相关训练方法训练后得到的,在语音文本生成模型是BERT-WWM模型的情况下,还可以基于BERT-WWM模型的预测能力,将掩码标准文本中的至少部分标准词替换为同义词、同音词,从而进一步提升目标口语文本的口语化特性。
图5示意性示出了根据本公开实施例的语音文本生成方法的应用场景图。
如图5所示,该应用场景中可以包括标准文本510“请问您需要大尺寸的显示器么”,将标准文本510输入至语义识别模型520,可以实现对标准文本进行词性标注,得到词性标注结果530。
在本公开的实施例中,语义识别模型520可以是基于双向长短期记忆网络模型(Bi-LSTM模型)与条件随机场模型(CRF模型)依序连接构建得到的。
词性标注结果530可以包括标准文本510中的标准词“请问”、“您”、“需要”、“大尺寸的”、“显示器”、“么”。还可以包括每个标准词各自的词性,其中“v”表示动词词性,“r”表示代词词性,“a”表示形容词词性,“n”表示名词词性,“e”表示语气词词性。
根据语气词分布特征,可以从所述词性标注结果530中确定目标词性为动词词性和形容词词性,并根据目标词性对应的内容,在标准文本510中的动词词性标准词“请问”之后,动词词性标准词“需要”之后,以及形容词词性标准词“大尺寸的”之前,确定预测插入位置,并对每个预测插入位置进行掩码,得到掩码标准文本540。掩码标准文本540中,可以包括每个预测插入位置对应的掩码单元541、542。
将掩码标准文本540输入至语音文本生成模型550,语音文本生成模型可以将预测插入位置确定为目标预测插入位置,并将目标语气词“嗯”插入至目标预测插入位置对应的掩码单元541,将目标语气词“这个”插入至目标预测插入位置对应的掩码单元542,进而生成目标口语文本560“请问嗯您需要这个大尺寸的显示器么”。从而可以使目标口语文本560具有接近人类口语化语音信息的口语化特性,根据目标口语文本560生成的目标语音文本可以具备口语化特性,至少部分避免生成的语音信息生硬呆板,减少与人类对话语音信息的差别。
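图5所示的处理流程可以用如下纯规则示意代码复现。其中 [MASK] 的填充在真实实现中由BERT-WWM等语音文本生成模型预测,此处以固定的选词规则代替,仅演示掩码与插入的流程:

```python
# 掩码与插入流程的纯规则示意(真实场景中 [MASK] 由语音文本生成模型预测填充)
MASK = "[MASK]"

def mask_text(words, positions):
    """在词间隙 positions 处插入 [MASK],得到掩码标准文本(词序列)。"""
    out = []
    for gap in range(len(words) + 1):
        if gap in positions:
            out.append(MASK)
        if gap < len(words):
            out.append(words[gap])
    return out

def fill_masks(masked, choose):
    """choose(i) 返回第 i 个掩码处的目标语气词(None 表示该位置不插入)。"""
    out, i = [], 0
    for tok in masked:
        if tok == MASK:
            word = choose(i)
            i += 1
            if word:
                out.append(word)
        else:
            out.append(tok)
    return out

words = ["请问", "您", "需要", "大尺寸的", "显示器", "么"]
masked = mask_text(words, {1, 3})                      # 掩码标准文本540
spoken = fill_masks(masked, lambda i: ["嗯", "这个"][i])  # 目标口语文本560
```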
图6示意性示出了根据本公开实施例的语音文本生成模型的训练方法的流程图。
如图6所示,该方法包括操作S610~S650。
在操作S610,对训练样本集中的训练样本标准文本和与训练样本标准文本关联的训练样本口语文本分别进行词性标注,得到训练样本标准文本的第一样本词性标注结果、训练样本口语文本的第二样本词性标注结果、训练样本口语文本的样本语气词标注结果。
在操作S620,根据样本语气词分布特征从第一样本词性标注结果中确定样本目标词性。
在操作S630,根据与样本目标词性对应的样本内容在训练样本标准文本中的位置确定样本预测插入位置。
在操作S640,对训练样本标准文本中的样本预测插入位置进行掩码,得到训练样本掩码标准文本,其中,训练样本掩码标准文本具有第一样本词性标注结果。
在操作S650,利用目标训练集训练初始语音文本生成模型,得到训练后的语音文本生成模型,其中,目标训练集包括训练样本掩码标准文本、训练样本口语文本的第二样本词性标注结果、训练样本口语文本的样本语气词标注结果。
根据本公开的实施例,训练样本标准文本可以包括标准的书面化文本,训练样本口语文本可以包括将样本用户对训练样本标准文本进行发音转述后生成的语音信息转化得到的口语化文本。训练样本口语文本由于是经过样本用户的语音转述后,再根据转述的语音生成的,因此可以包含有样本语气词。这至少部分克服了相关技术中采用标准文本训练语音文本生成模型,从而使训练得到的语音文本生成模型不能学习到口语对话表达中可能存在的语气词、犹豫词等的特性的问题。
根据本公开的实施例,可以将训练样本掩码标准文本和训练样本口语文本组成训练样本对,并根据训练样本对中训练样本掩码标准文本和训练样本口语文本的相似度确定相似度标签值,该相似度标签值可以用于迭代地调整初始语音文本生成模型中的权重参数,使生成的语音文本生成模型,可以预测样本语气词与训练样本标准文本中第一样本词性标注结果之间的位置关系,从而可以准确地根据预测插入位置确定目标插入位置,并从样本语气词中确定目标样本语气词。
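作为示意,训练样本对的相似度标签值可以采用如下方式计算(此处以字符集合的Jaccard重叠率作为相似度度量,该度量仅为假设,实际可根据模型训练目标另行设计):

```python
# 计算训练样本对相似度标签值的示意(度量为假设的字符级 Jaccard 重叠率)
def similarity_label(masked_std, spoken):
    """masked_std: 训练样本掩码标准文本;spoken: 训练样本口语文本;
    返回 [0, 1] 区间的相似度标签值。"""
    a, b = set(masked_std), set(spoken)
    return len(a & b) / len(a | b)

pair = ("请问[MASK]您需要[MASK]大尺寸的显示器么",
        "请问嗯您需要这个大尺寸的显示器么")
label = similarity_label(*pair)   # 可用于迭代调整初始语音文本生成模型的权重参数
```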
需要说明的是,根据本公开实施例提供的语音文本生成模型的训练方法训练得到的语音文本生成模型,可以用于上述语音文本生成方法。
根据本公开的实施例,语音文本生成模型的训练方法还包括如下操作。
利用样本混淆词典中的样本混淆词分别更新第一样本集中的第一样本标准文本和与第一样本标准文本关联的第一样本口语文本,得到包含有第二样本标准文本和第二样本口语文本的第二样本集;根据第一样本集与第二样本集构建训练样本集。
根据本公开的实施例,样本混淆词典可以包括样本标准词和样本混淆词构成的样本混淆词对,通过样本混淆词典中的样本混淆词替换第一样本标准文本中的样本标准词,以及通过样本混淆词典中的样本混淆词替换第一样本口语文本中的样本标准词,可以分别得到大量的第二样本标准文本和第二样本口语文本,从而根据第一样本集与第二样本集构建得到训练样本集,可以扩充训练样本数据的数量,以增强训练样本集的训练能力。进一步地,利用包含有第一样本集与第二样本集的训练样本集训练初始语音文本生成模型,可以使初始语音文本生成模型充分学习到标准词与混淆词之间的相似关联关系,从而可以使训练后得到的语音文本生成模型自动将标准文本中的标准词替换为样本混淆词,从而进一步丰富目标口语文本的语义表达方式,使目标口语文本更贴近人类正常对话的口语化特性。
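利用样本混淆词典扩充训练样本集的过程可以示意如下(混淆词典的内容为假设):

```python
# 利用样本混淆词典扩充训练样本集的示意(词典内容为假设数据)
CONFUSION_DICT = {"显示器": ["显视器"], "需要": ["须要"]}

def augment(first_sample_set):
    """first_sample_set: 若干 (第一样本标准文本词列表, 第一样本口语文本词列表);
    按混淆词典逐词替换,生成第二样本集。"""
    second = []
    for std_words, spoken_words in first_sample_set:
        for target, confusions in CONFUSION_DICT.items():
            for conf in confusions:
                if target in std_words or target in spoken_words:
                    second.append((
                        [conf if w == target else w for w in std_words],
                        [conf if w == target else w for w in spoken_words],
                    ))
    return second

first = [(["您", "需要", "显示器"], ["嗯", "您", "需要", "这个", "显示器"])]
second = augment(first)
train_set = first + second   # 根据第一样本集与第二样本集构建训练样本集
```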
根据本公开的实施例,语音文本生成模型的训练方法还可以包括如下操作。
利用语音合成装置处理样本标准语料文本,得到样本语音语料;对样本语音语料进行语音识别,得到样本混淆语料文本;根据样本标准语料文本和样本混淆语料文本,构建样本混淆词典。
根据本公开的实施例,样本标准语料文本可以包括用于书面沟通交流的文本,例如标准客服人员应答文本、邮件正文文本等。样本语音语料可以包括语音合成装置自动识别样本标准语料文本后生成的语音信息。利用语音识别(Automatic Speech Recognition,ASR)装置识别样本语音语料,可以得到识别后的样本混淆语料文本,由于语音识别装置的识别能力限制,可能将样本标准语料文本中的至少部分样本标准词识别为样本混淆词,从而使样本混淆语料文本包含有语音识别装置识别到的样本混淆词。根据样本标准语料文本和样本混淆语料文本的比对结果,可以将样本标准词和样本混淆词组成样本混淆词对,进而构建得到样本混淆词典。
根据本公开的实施例,可以在利用语音识别装置对样本语音语料进行语音识别后,通过确定语音识别装置输出的初始样本混淆语料文本的置信度信息,将置信度信息小于或等于预设置信度阈值的初始样本混淆语料文本确定为样本混淆语料文本,从而可以从样本混淆语料文本中选择出容易被识别错误的样本混淆词,使构建得到的样本混淆词典的样本混淆词对更加准确地体现样本混淆词与样本标准词的关联特征。
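构建样本混淆词典的流程可以用如下示意代码表达。其中语音合成与语音识别的输出以假设数据代替,置信度筛选对应上文的预设置信度阈值:

```python
# 构建样本混淆词典的示意:比对样本标准语料文本与低置信度的 ASR 识别结果,
# 将逐词不一致处组成样本混淆词对(TTS/ASR 输出为假设数据)
def build_confusion_dict(standard_corpus, asr_corpus, confidences, conf_threshold=0.8):
    """standard_corpus / asr_corpus: 等长的分词文本列表;
    confidences: 各条识别结果的置信度;仅保留置信度不高于阈值的样本中的混淆词对。"""
    confusion = {}
    for std, asr, conf in zip(standard_corpus, asr_corpus, confidences):
        if conf > conf_threshold:      # 高置信度样本识别准确,不易产生混淆词
            continue
        for s, a in zip(std, asr):
            if s != a:
                confusion.setdefault(s, set()).add(a)
    return confusion

std = [["请问", "您", "需要", "显示器"], ["这", "是", "邮件"]]
asr = [["请问", "您", "须要", "显视器"], ["这", "是", "邮件"]]
d = build_confusion_dict(std, asr, confidences=[0.6, 0.95])
```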
根据本公开的实施例,语音文本生成模型的训练方法还可以包括如下操作。
获取样本初始语料集,其中,所述样本初始语料集包括根据样本口语语音语料生成的样本初始口语语料文本;对所述样本初始口语语料文本进行词性标注,得到样本口语语料词性标注结果、样本口语语料语气词标注结果;根据所述样本口语语料词性标注结果和所述样本口语语料语气词标注结果,确定所述样本语气词分布特征。
根据本公开的实施例,样本初始口语语料文本例如可以包括根据人类真实对话场景中对话语音信息生成的语料文本,该语料文本记录有对话语音信息的文本,即样本初始口语语料文本中包含有人类习惯添加的口语语料语气词。通过对样本初始口语语料文本进行词性标注,可以得到样本初始口语语料文本中,样本口语语料词的词性,还可以得到样本口语语料语气词与各个样本口语语料词之间的位置关系,通过分析统计该位置关系,可以确定样本口语语料语气词的分布特征,即可以确定在样本口语语音信息中的样本语气词分布特征。
需要说明的是,根据本公开的实施例提供的语音文本生成模型的训练方法训练得到的语音文本生成模型,可以用于上述语音文本生成方法。
图7示意性示出了根据本公开的实施例的语音文本生成装置的框图。
如图7所示,语音文本生成装置700可以包括标注模块710、第一确定模块720、第二确定模块730、插入模块740和生成模块750。
标注模块710用于对标准文本进行词性标注,得到词性标注结果。
第一确定模块720用于根据语气词分布特征从词性标注结果中确定目标词性。
第二确定模块730用于根据与目标词性对应的内容在标准文本中的位置确定预测插入位置。
插入模块740用于根据预测插入位置在标准文本中插入目标语气词,得到目标口语文本。
生成模块750用于根据目标口语文本生成目标语音文本。
根据本公开的实施例,插入模块可以包括:掩码单元和生成单元。
掩码单元用于对标准文本的预测插入位置进行掩码,得到掩码标准文本。
生成单元用于将掩码标准文本输入至语音文本生成模型,以便语音文本生成模型在预测插入位置中的目标插入位置插入目标语气词,生成目标口语文本。
根据本公开的实施例,语音文本生成装置还可以包括:获取模块、语料标注模块和第三确定模块。
获取模块用于获取初始语料集,其中,初始语料集包括根据口语语音语料生成的初始口语语料文本。
语料标注模块用于对初始口语语料文本进行词性标注,得到口语语料词性标注结果、口语语料语气词标注结果。
第三确定模块用于根据口语语料词性标注结果和口语语料语气词标注结果,确定语气词分布特征。
根据本公开的实施例,标注模块可以包括标注单元。
标注单元用于将标准文本输入至语义识别模型,得到词性标注结果。
其中,语义识别模型包括:
基于循环神经网络模型与条件随机场模型构建的第一语义识别模型;或者基于依存句法分析构建的第二语义识别模型。
需要说明的是,本公开的实施例中语音文本生成装置部分与本公开的实施例中语音文本生成方法部分是相对应的,语音文本生成装置部分的描述具体参考语音文本生成方法部分,在此不再赘述。
图8示意性示出了根据本公开的实施例的语音文本生成模型的训练装置的框图。
如图8所示,语音文本生成模型的训练装置800可以包括样本标注模块810、样本第一确定模块820、样本第二确定模块830、样本掩码模块840和训练模块850。
样本标注模块810用于对训练样本集中的训练样本标准文本和与训练样本标准文本关联的训练样本口语文本分别进行词性标注,得到训练样本标准文本的第一样本词性标注结果、训练样本口语文本的第二样本词性标注结果、训练样本口语文本的样本语气词标注结果。
样本第一确定模块820用于根据样本语气词分布特征从第一样本词性标注结果中确定样本目标词性。
样本第二确定模块830用于根据与样本目标词性对应的样本内容在训练样本标准文本中的位置确定样本预测插入位置。
样本掩码模块840用于对训练样本标准文本中的样本预测插入位置进行掩码,得到训练样本掩码标准文本,其中,训练样本掩码标准文本具有第一样本词性标注结果。
训练模块850用于利用目标训练集训练初始语音文本生成模型,得到训练后的语音文本生成模型,其中,目标训练集包括训练样本掩码标准文本、训练样本口语文本的第二样本词性标注结果、训练样本口语文本的样本语气词标注结果。
根据本公开的实施例,语音文本生成模型的训练装置还可以包括:样本更新模块和样本构建模块。
样本更新模块用于利用样本混淆词典中的样本混淆词分别更新第一样本集中的第一样本标准文本和与第一样本标准文本关联的第一样本口语文本,得到包含有第二样本标准文本和第二样本口语文本的第二样本集。
样本构建模块用于根据第一样本集与第二样本集构建训练样本集。
根据本公开的实施例,语音文本生成模型的训练装置还可以包括:语料处理模块、识别模块和混淆词典构建模块。
语料处理模块用于利用语音合成装置处理样本标准语料文本,得到样本语音语料。
识别模块用于对样本语音语料进行语音识别,得到样本混淆语料文本。
混淆词典构建模块用于根据样本标准语料文本和样本混淆语料文本,构建样本混淆词典。
根据本公开的实施例,语音文本生成模型的训练装置还可以包括:样本初始语料获取模块、样本语料标注模块和样本第三确定模块。
样本初始语料获取模块用于获取样本初始语料集,其中,样本初始语料集包括根据样本口语语音语料生成的样本初始口语语料文本;
样本语料标注模块用于对样本初始口语语料文本进行词性标注,得到样本口语语料词性标注结果、样本口语语料语气词标注结果。
样本第三确定模块用于根据样本口语语料词性标注结果和样本口语语料语气词标注结果,确定样本语气词分布特征。
需要说明的是,本公开的实施例中语音文本生成模型的训练装置部分与本公开的实施例中语音文本生成模型的训练方法部分是相对应的,语音文本生成模型的训练装置部分的描述具体参考语音文本生成模型的训练方法部分,在此不再赘述。
根据本公开的实施例的模块、单元中的任意多个、或其中任意多个的至少部分功能可以在一个模块中实现。根据本公开实施例的模块、子模块、单元、子单元中的任意一个或多个可以被拆分成多个模块来实现。根据本公开实施例的模块、单元中的任意一个或多个可以至少被部分地实现为硬件电路,例如现场可编程门阵列(FPGA)、可编程逻辑阵列(PLA)、片上系统、基板上的系统、封装上的系统、专用集成电路(ASIC),或可以通过对电路进行集成或封装的任何其他的合理方式的硬件或固件来实现,或以软件、硬件以及固件三种实现方式中任意一种或以其中任意几种的适当组合来实现。或者,根据本公开实施例的模块、单元中的一个或多个可以至少被部分地实现为计算机程序模块,当该计算机程序模块被运行时,可以执行相应的功能。
例如,标注模块710、第一确定模块720、第二确定模块730、插入模块740和生成模块750中的任意多个可以合并在一个模块/单元中实现,或者其中的任意一个模块/单元可以被拆分成多个模块/单元。或者,这些模块/单元中的一个或多个模块/单元的至少部分功能可以与其他模块/单元/子单元的至少部分功能相结合,并在一个模块/单元中实现。根据本公开的实施例,标注模块710、第一确定模块720、第二确定模块730、插入模块740和生成模块750中的至少一个可以至少被部分地实现为硬件电路,例如现场可编程门阵列(FPGA)、可编程逻辑阵列(PLA)、片上系统、基板上的系统、封装上的系统、专用集成电路(ASIC),或可以通过对电路进行集成或封装的任何其他的合理方式等硬件或固件来实现,或以软件、硬件以及固件三种实现方式中任意一种或以其中任意几种的适当组合来实现。或者,标注模块710、第一确定模块720、第二确定模块730、插入模块740和生成模块750中的至少一个可以至少被部分地实现为计算机程序模块,当该计算机程序模块被运行时,可以执行相应的功能。
图9示意性示出了根据本公开实施例的适于实现语音文本生成方法、语音文本生成模型的训练方法的电子设备的框图。图9示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图9所示,根据本公开实施例的电子设备900包括处理器901,其可以根据存储在只读存储器(ROM)902中的程序或者从存储部分908加载到随机访问存储器(RAM)903中的程序而执行各种适当的动作和处理。处理器901例如可以包括通用微处理器(例如CPU)、指令集处理器和/或相关芯片组和/或专用微处理器(例如,专用集成电路(ASIC)),等等。处理器901还可以包括用于缓存用途的板载存储器。处理器901可以包括用于执行根据本公开实施例的方法流程的不同动作的单一处理单元或者是多个处理单元。
在RAM 903中,存储有电子设备900操作所需的各种程序和数据。处理器901、ROM 902以及RAM 903通过总线904彼此相连。处理器901通过执行ROM 902和/或RAM 903中的程序来执行根据本公开实施例的方法流程的各种操作。需要注意,所述程序也可以存储在除ROM 902和RAM 903以外的一个或多个存储器中。处理器901也可以通过执行存储在所述一个或多个存储器中的程序来执行根据本公开实施例的方法流程的各种操作。
根据本公开的实施例,电子设备900还可以包括输入/输出(I/O)接口905,输入/输出(I/O)接口905也连接至总线904。电子设备900还可以包括连接至I/O接口905的以下部件中的一项或多项:包括键盘、鼠标等的输入部分906;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分907;包括硬盘等的存储部分908;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分909。通信部分909经由诸如因特网的网络执行通信处理。驱动器910也根据需要连接至I/O接口905。可拆卸介质911,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器910上,以便于从其上读出的计算机程序根据需要被安装入存储部分908。
根据本公开的实施例,根据本公开实施例的方法流程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在计算机可读存储介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分909从网络上被下载和安装,和/或从可拆卸介质911被安装。在该计算机程序被处理器901执行时,执行本公开实施例的系统中限定的上述功能。根据本公开的实施例,上文描述的系统、设备、装置、模块、单元等可以通过计算机程序模块来实现。
本公开还提供了一种计算机可读存储介质,该计算机可读存储介质可以是上述实施例中描述的设备/装置/系统中所包含的;也可以是单独存在,而未装配入该设备/装置/系统中。上述计算机可读存储介质承载有一个或者多个程序,当上述一个或者多个程序被执行时,实现根据本公开实施例的方法。
根据本公开的实施例,计算机可读存储介质可以是非易失性的计算机可读存储介质。例如可以包括但不限于:便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
例如,根据本公开的实施例,计算机可读存储介质可以包括上文描述的ROM 902和/或RAM 903和/或ROM 902和RAM 903以外的一个或多个存储器。
本公开的实施例还包括一种计算机程序产品,其包括计算机程序,该计算机程序包含用于执行本公开实施例所提供的方法的程序代码,当计算机程序产品在电子设备上运行时,该程序代码用于使电子设备实现本公开实施例所提供的上述方法。
在该计算机程序被处理器901执行时,执行本公开实施例的系统/装置中限定的上述功能。根据本公开的实施例,上文描述的系统、装置、模块、单元等可以通过计算机程序模块来实现。
在一种实施例中,该计算机程序可以依托于光存储器件、磁存储器件等有形存储介质。在另一种实施例中,该计算机程序也可以在网络介质上以信号的形式进行传输、分发,并通过通信部分909被下载和安装,和/或从可拆卸介质911被安装。该计算机程序包含的程序代码可以用任何适当的网络介质传输,包括但不限于:无线、有线等等,或者上述的任意合适的组合。
根据本公开的实施例,可以以一种或多种程序设计语言的任意组合来编写用于执行本公开实施例提供的计算机程序的程序代码,具体地,可以利用高级过程和/或面向对象的编程语言、和/或汇编/机器语言来实施这些计算机程序。程序设计语言包括但不限于诸如Java,C++,python,“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。本领域技术人员可以理解,本公开的各个实施例和/或权利要求中记载的特征可以进行多种组合和/或结合,即使这样的组合或结合没有明确记载于本公开中。特别地,在不脱离本公开精神和教导的情况下,本公开的各个实施例和/或权利要求中记载的特征可以进行多种组合和/或结合。所有这些组合和/或结合均落入本公开的范围。
以上对本公开的实施例进行了描述。但是,这些实施例仅仅是为了说明的目的,而并非为了限制本公开的范围。尽管在以上分别描述了各实施例,但是这并不意味着各个实施例中的措施不能有利地结合使用。本公开的范围由所附权利要求及其等同物限定。不脱离本公开的范围,本领域技术人员可以做出多种替代和修改,这些替代和修改都应落在本公开的范围之内。

Claims (13)

  1. 一种语音文本生成方法,包括:
    对标准文本进行词性标注,得到词性标注结果;
    根据语气词分布特征从所述词性标注结果中确定目标词性;
    根据与所述目标词性对应的内容在所述标准文本中的位置确定预测插入位置;
    根据所述预测插入位置在所述标准文本中插入目标语气词,得到目标口语文本;以及
    根据所述目标口语文本生成目标语音文本。
  2. 根据权利要求1所述的语音文本生成方法,其中,根据所述预测插入位置在所述标准文本中插入目标语气词,得到目标口语文本包括:
    对所述标准文本的预测插入位置进行掩码,得到掩码标准文本;
    将所述掩码标准文本输入至语音文本生成模型,以便所述语音文本生成模型在所述预测插入位置中的目标插入位置插入目标语气词,生成所述目标口语文本。
  3. 根据权利要求1所述的语音文本生成方法,还包括:
    获取初始语料集,其中,所述初始语料集包括根据口语语音语料生成的初始口语语料文本;
    对所述初始口语语料文本进行词性标注,得到口语语料词性标注结果、口语语料语气词标注结果;
    根据所述口语语料词性标注结果和所述口语语料语气词标注结果,确定所述语气词分布特征。
  4. 根据权利要求1所述的语音文本生成方法,对标准文本进行词性标注,得到词性标注结果包括:
    将所述标准文本输入至语义识别模型,得到所述词性标注结果;
    其中,所述语义识别模型包括:
    基于循环神经网络模型与条件随机场模型构建的第一语义识别模型;或者
    基于依存句法分析构建的第二语义识别模型。
  5. 一种语音文本生成模型的训练方法,包括:
    对训练样本集中的训练样本标准文本和与所述训练样本标准文本关联的训练样本口语文本分别进行词性标注,得到所述训练样本标准文本的第一样本词性标注结果、所述训练样本口语文本的第二样本词性标注结果、所述训练样本口语文本的样本语气词标注结果;
    根据样本语气词分布特征从所述第一样本词性标注结果中确定样本目标词性;
    根据与所述样本目标词性对应的样本内容在所述训练样本标准文本中的位置确定样本预测插入位置;
    对所述训练样本标准文本中的样本预测插入位置进行掩码,得到训练样本掩码标准文本,其中,所述训练样本掩码标准文本具有第一样本词性标注结果;
    利用目标训练集训练初始语音文本生成模型,得到训练后的语音文本生成模型,其中,所述目标训练集包括所述训练样本掩码标准文本、所述训练样本口语文本的第二样本词性标注结果、所述训练样本口语文本的样本语气词标注结果。
  6. 根据权利要求5所述的训练方法,还包括:
    利用样本混淆词典中的样本混淆词分别更新第一样本集中的第一样本标准文本和与所述第一样本标准文本关联的第一样本口语文本,得到包含有第二样本标准文本和第二样本口语文本的第二样本集;
    根据所述第一样本集与所述第二样本集构建所述训练样本集。
  7. 根据权利要求6所述的训练方法,还包括:
    利用语音合成装置处理样本标准语料文本,得到样本语音语料;
    对所述样本语音语料进行语音识别,得到样本混淆语料文本;
    根据所述样本标准语料文本和所述样本混淆语料文本,构建所述样本混淆词典。
  8. 根据权利要求5所述的训练方法,还包括:
    获取样本初始语料集,其中,所述样本初始语料集包括根据样本口语语音语料生成的样本初始口语语料文本;
    对所述样本初始口语语料文本进行词性标注,得到样本口语语料词性标注结果、样本口语语料语气词标注结果;
    根据所述样本口语语料词性标注结果和所述样本口语语料语气词标注结果,确定所述样本语气词分布特征。
  9. 一种语音文本生成装置,包括:
    标注模块,用于对标准文本进行词性标注,得到词性标注结果;
    第一确定模块,用于根据语气词分布特征从所述词性标注结果中确定目标词性;
    第二确定模块,用于根据与所述目标词性对应的内容在所述标准文本中的位置确定预测插入位置;
    插入模块,用于根据所述预测插入位置在所述标准文本中插入目标语气词,得到目标口语文本;以及
    生成模块,用于根据所述目标口语文本生成目标语音文本。
  10. 一种语音文本生成模型的训练装置,包括:
    样本标注模块,用于对训练样本集中的训练样本标准文本和与所述训练样本标准文本关联的训练样本口语文本分别进行词性标注,得到所述训练样本标准文本的第一样本词性标注结果、所述训练样本口语文本的第二样本词性标注结果、所述训练样本口语文本的样本语气词标注结果;
    样本第一确定模块,用于根据样本语气词分布特征从所述第一样本词性标注结果中确定样本目标词性;
    样本第二确定模块,用于根据与所述样本目标词性对应的样本内容在所述训练样本标准文本中的位置确定样本预测插入位置;
    样本掩码模块,用于对所述训练样本标准文本中的样本预测插入位置进行掩码,得到训练样本掩码标准文本,其中,所述训练样本掩码标准文本具有第一样本词性标注结果;
    训练模块,用于利用目标训练集训练初始语音文本生成模型,得到训练后的语音文本生成模型,其中,所述目标训练集包括所述训练样本掩码标准文本、所述训练样本口语文本的第二样本词性标注结果、所述训练样本口语文本的样本语气词标注结果。
  11. 一种电子设备,包括:
    一个或多个处理器;
    存储器,用于存储一个或多个程序,
    其中,当所述一个或多个程序被所述一个或多个处理器执行时,使得所述一个或多个处理器实现权利要求1至8中任一项所述的方法。
  12. 一种计算机可读存储介质,其上存储有可执行指令,该指令被处理器执行时使处理器实现权利要求1至8中任一项所述的方法。
  13. 一种计算机程序产品,包括计算机程序,所述计算机程序在被处理器执行时实现根据权利要求1至8中任一项所述的方法。
PCT/CN2023/087793 2022-10-09 2023-04-12 语音文本生成方法、语音文本生成模型的训练方法、装置 WO2024077906A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211231004.1A CN115620726A (zh) 2022-10-09 2022-10-09 语音文本生成方法、语音文本生成模型的训练方法、装置
CN202211231004.1 2022-10-09

Publications (1)

Publication Number Publication Date
WO2024077906A1 true WO2024077906A1 (zh) 2024-04-18

Family

ID=84861060

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087793 WO2024077906A1 (zh) 2022-10-09 2023-04-12 语音文本生成方法、语音文本生成模型的训练方法、装置

Country Status (2)

Country Link
CN (1) CN115620726A (zh)
WO (1) WO2024077906A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620726A (zh) * 2022-10-09 2023-01-17 京东科技信息技术有限公司 语音文本生成方法、语音文本生成模型的训练方法、装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170674A (zh) * 2017-12-27 2018-06-15 东软集团股份有限公司 词性标注方法和装置、程序产品及存储介质
US10599767B1 (en) * 2018-05-31 2020-03-24 The Ultimate Software Group, Inc. System for providing intelligent part of speech processing of complex natural language
US20210312124A1 (en) * 2020-04-03 2021-10-07 Bewgle Technologies Pvt Ltd. Method and system for determining sentiment of natural language text content
CN114218424A (zh) * 2022-02-22 2022-03-22 杭州一知智能科技有限公司 一种基于wav2vec的语气词插入的语音交互方法及系统
CN114708868A (zh) * 2022-03-17 2022-07-05 北京中科智加科技有限公司 一种文本顺滑的语音识别方法、系统及存储介质
CN114912448A (zh) * 2022-07-15 2022-08-16 山东海量信息技术研究院 一种文本扩展方法、装置、设备及介质
CN115620726A (zh) * 2022-10-09 2023-01-17 京东科技信息技术有限公司 语音文本生成方法、语音文本生成模型的训练方法、装置


Also Published As

Publication number Publication date
CN115620726A (zh) 2023-01-17

Similar Documents

Publication Publication Date Title
KR102401942B1 (ko) 번역품질 평가 방법 및 장치
US11915692B2 (en) Facilitating end-to-end communications with automated assistants in multiple languages
US10176804B2 (en) Analyzing textual data
CN110287278B (zh) 评论生成方法、装置、服务器及存储介质
US11354521B2 (en) Facilitating communications with automated assistants in multiple languages
US9805718B2 (en) Clarifying natural language input using targeted questions
CN107861954B (zh) 基于人工智能的信息输出方法和装置
WO2020052069A1 (zh) 用于分词的方法和装置
CN111177350A (zh) 智能语音机器人的话术形成方法、装置和系统
US20220156467A1 (en) Hybrid Natural Language Understanding
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
CN111414745A (zh) 文本标点确定方法与装置、存储介质、电子设备
WO2024077906A1 (zh) 语音文本生成方法、语音文本生成模型的训练方法、装置
CN113051895A (zh) 语音识别的方法、装置、电子设备、介质和程序产品
CN111460224B (zh) 评论数据的质量标注方法、装置、设备及存储介质
CN112711943A (zh) 一种维吾尔文语种识别方法、装置及存储介质
US11709989B1 (en) Method and system for generating conversation summary
US20210118434A1 (en) Pattern-based statement attribution
US20220343068A1 (en) Intent detection via multi-hop unified syntactic graph
Tumpalan et al. English-filipino speech topic tagger using automatic speech recognition modeling and topic modeling
CN113744737B (zh) 语音识别模型的训练、人机交互方法、设备和存储介质
RU2820264C1 (ru) Способ и система обучения системы чат-бота
Brindha et al. AI based chatbot for education management
CN112699186A (zh) 一种基于暗语的事理图谱构建方法及系统
Gafurov et al. Named Entity Recognition in Natural Language Texts obtained through Audio Interfaces

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876118

Country of ref document: EP

Kind code of ref document: A1