CN109065019B - Intelligent robot-oriented story data processing method and system - Google Patents

Intelligent robot-oriented story data processing method and system

Info

Publication number
CN109065019B
Authority
CN
China
Prior art keywords
story, data, voice-over, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810981546.8A
Other languages
Chinese (zh)
Other versions
CN109065019A (en)
Inventor
贾志强 (Jia Zhiqiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guangnian Wuxian Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201810981546.8A
Publication of CN109065019A
Application granted
Publication of CN109065019B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Abstract

The invention discloses a story data processing method and system for an intelligent robot. The method comprises the following steps: acquiring story text data; parsing the story text data and identifying dialogue and voice-over in the story text; invoking a story data processing model and performing sound-effect processing on the dialogue and voice-over in the story text to generate dialogue and voice-over data with sound effects; and generating and outputting multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects. Compared with the prior art, the method and system can convert a story in text form into multimodal data that can be presented across multiple modalities, and can optimize the presentation of the dialogue and voice-over in the story in a targeted manner, thereby greatly improving the listener's experience when the story is told.

Description

Intelligent robot-oriented story data processing method and system
Technical Field
The invention relates to the computer field, and in particular to a story data processing method and system for an intelligent robot.
Background
In traditional daily life, reading text is the main way people appreciate literary works. In certain scenarios, however, people also appreciate literary works through sound, for example by listening to storytelling performances or recitations. Most commonly, children who cannot yet read enjoy stories by listening to others tell them.
With the continuous development of multimedia technology, more and more multimedia devices are entering daily human life. With the support of multimedia technology, the carrier of literary works in acoustic form, storytelling in particular, has gradually shifted to multimedia devices.
In general, storytelling with a multimedia device has meant that a person tells the story in advance and records an audio file, which the device then simply plays back. With the development of computer technology, and to obtain a sound source simply and conveniently, the prior art also converts text data directly into audio data. Manual recitation and recording then become unnecessary: given only the story text, a multimedia device can tell the story. However, direct text-to-speech conversion guarantees only a literal rendering of the text content and cannot reproduce the expressiveness of a human storyteller. Storytelling based on such text conversion is therefore dry and uninteresting, conveys only the literal meaning of the words, and yields a very poor user experience.
Disclosure of Invention
The invention provides a story data processing method for an intelligent robot, the method comprising the following steps:
acquiring story text data;
parsing the story text data and identifying dialogue and voice-over in the story text;
invoking a story data processing model and performing sound-effect processing on the dialogue and voice-over in the story text to generate dialogue and voice-over data with sound effects;
and generating and outputting multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects.
In an embodiment, the multimodal data further comprises intelligent robot action data, wherein:
corresponding intelligent robot action data is generated for the dialogue and voice-over in the story text.
In an embodiment, the method further comprises:
performing text-to-speech conversion on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects;
converting the text in the story text data other than the dialogue and voice-over into first voice data;
and fusing the dialogue and voice-over speech data with sound effects and the first voice data to generate story voice data.
In one embodiment, invoking a story data processing model to perform sound-effect processing on the dialogue and voice-over in the story text comprises the following steps:
performing text recognition on the story text, decomposing the story into content elements based on the text recognition result, and extracting story elements;
determining sound-effect characteristics matching the dialogue and voice-over according to the story elements corresponding to them;
and converting the dialogue and voice-over into dialogue and voice-over data with sound effects matching those sound-effect characteristics.
In an embodiment, the story elements corresponding to dialogue include the dialogue character, dialogue content, dialogue environment, and/or dialogue context references.
In an embodiment, the story elements corresponding to voice-over include voice-over content, voice-over environment, and/or voice-over context references.
The invention also proposes a storage medium on which program code implementing the method according to the invention is stored.
The invention also provides a story data processing system for an intelligent robot, the system comprising:
a text acquisition module configured to acquire story text data;
a text parsing module configured to parse the story text data and identify dialogue and voice-over in the story text;
a story data processing model library configured to store story data processing models;
a sound-effect processing module configured to invoke the story data processing model, perform sound-effect processing on the dialogue and voice-over in the story text, and generate dialogue and voice-over data with sound effects;
and a multimodal story data generation module configured to generate and output multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects.
In one embodiment, the multimodal story data generation module further comprises:
a first voice conversion unit configured to perform text-to-speech conversion on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects;
a second voice conversion unit configured to convert the text in the story text data other than the dialogue and voice-over into first voice data;
and a voice synthesis unit configured to fuse the dialogue and voice-over data with sound effects and the first voice data to generate story voice data.
The invention also provides an intelligent story machine, comprising:
an input acquisition module configured to collect the user's multimodal input and receive the user's story requirements;
the story data processing system described above, configured to acquire the corresponding story text data according to the user's story requirements and generate the multimodal data;
and an output module configured to output the multimodal data to the user.
Compared with the prior art, the method and system can convert a story in text form into multimodal data that can be presented across multiple modalities, and can optimize the presentation of the dialogue and voice-over in the story in a targeted manner, thereby greatly improving the listener's experience when the story is told.
Additional features and advantages of the invention will be set forth in the description that follows; some will also be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the processes particularly pointed out in the written description and claims, as well as in the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it. In the drawings:
FIGS. 1 and 2 are flow diagrams of methods according to embodiments of the invention;
FIGS. 3 and 4 are schematic diagrams of system structures according to embodiments of the invention;
FIGS. 5 and 6 are schematic diagrams of story machines according to embodiments of the invention.
Detailed Description
The following embodiments of the present invention are described in detail with reference to the accompanying drawings and examples, so that practitioners can fully understand how the invention applies technical means to solve technical problems and achieve technical effects, and can implement the invention accordingly. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments may be combined with each other, and the resulting technical solutions are all within the scope of the present invention.
To address the problems described in the background above, the invention provides a story data processing method for an intelligent robot. In this method, a story in text form is converted into multimodal data that can be presented across multiple modalities, improving the expressiveness of the story content.
Further, in practical application scenarios, when humans communicate with one another, different speakers sound different: every utterance carries the speaker's own vocal characteristics. Story text generally contains both dialogue, spoken by the characters in the story, and voice-over, which narrates around it. Therefore, in one embodiment, matched sound effects are added to the dialogue and voice-over in a targeted manner, so that their vocal expression is more real and vivid, improving the vividness of the storytelling and optimizing the user experience.
The detailed flow of a method according to an embodiment of the invention is described below with reference to the accompanying drawings. The steps shown in the flowcharts can be executed in a computer system containing, for example, a set of computer-executable instructions. Although a logical order of steps is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order than presented here.
As shown in FIG. 1, in one embodiment the method includes the following steps (a minimal sketch of the pipeline follows the list):
S110, acquiring story text data;
S120, parsing the story text data and identifying dialogue and voice-over in the story text;
S131, invoking a story data processing model;
S132, performing sound-effect processing on the dialogue and voice-over in the story text to generate dialogue and voice-over data with sound effects;
S140, generating and outputting multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects generated in step S132.
Further, in one embodiment, the dialogue and voice-over are output primarily as speech through TTS, so the final multimodal output includes the dialogue and voice-over data with sound effects converted into speech. Specifically, in one embodiment, text-to-speech conversion is performed on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects.
Further, to improve the vividness of the story presentation, in one embodiment the story is not told only by voice: the dialogue and voice-over are also displayed as text. Specifically, in one embodiment, the multimodal data includes dialogue and voice-over text data with sound effects.
Further, to enhance the vividness of the story presentation still more, in one embodiment the story is not limited to voice and/or text. Specifically, in an embodiment, the multimodal data generated in step S130 further includes intelligent robot action data, wherein:
corresponding intelligent robot action data is generated for the dialogue and voice-over in the story text.
In this way, when the intelligent robot tells a story, it can output the dialogue and voice-over data with sound effects while performing matching actions, greatly improving the vividness of the storytelling.
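As one hedged illustration of how such action data might be generated, the mapping below keys simple gestures off the segment kind produced by the earlier parsing sketch; the ActionFrame structure and the gesture names are invented for this example and are not specified by the patent.

```python
from dataclasses import dataclass

@dataclass
class ActionFrame:
    gesture: str       # symbolic gesture for the robot to perform
    duration_s: float  # rough time to hold the gesture

# Hypothetical default action per story segment kind.
DEFAULT_ACTIONS = {
    "dialogue": ActionFrame("face_listener_and_gesture", 2.0),
    "voiceover": ActionFrame("slow_head_sweep", 3.0),
    "narration": ActionFrame("idle_sway", 1.5),
}

def actions_for(segments) -> list:
    """Generate corresponding robot action data for each story segment."""
    return [DEFAULT_ACTIONS[seg.kind] for seg in segments]
```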
Further, the story text may contain content other than the dialogue and voice-over. In one embodiment, the text in the story text data other than the dialogue and voice-over is also converted into voice data and fused with the dialogue and voice-over data with sound effects. Specifically, the method further comprises (a fusion sketch follows these steps):
performing text-to-speech conversion on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects;
converting the text in the story text data other than the dialogue and voice-over into first voice data;
and fusing the dialogue and voice-over speech data with sound effects and the first voice data to generate story voice data.
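A minimal sketch of the fusion step, assuming each segment has already been synthesized to an audio file in story order; it uses pydub purely as an illustrative choice, since the patent does not prescribe any particular audio library.

```python
from pydub import AudioSegment  # assumed dependency for this sketch

def fuse_story_audio(ordered_clips) -> AudioSegment:
    """Fuse per-segment audio files into one story voice track.

    ordered_clips: (kind, path) pairs in story order, where "dialogue" and
    "voiceover" clips already carry their sound effects and "narration"
    clips are the first voice data.
    """
    story = AudioSegment.silent(duration=200)  # short lead-in
    for _kind, path in ordered_clips:
        clip = AudioSegment.from_file(path)
        # A brief pause between segments keeps the pacing natural.
        story += clip + AudioSegment.silent(duration=150)
    return story

# fuse_story_audio([("narration", "n1.wav"), ("dialogue", "d1.wav")]).export(
#     "story.wav", format="wav")
```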
Further, to ensure that the sound effects added to the dialogue and voice-over data improve the vividness of the story rather than degrade it through mismatched sound effects, in one embodiment the story text data is parsed to determine the story content, and the sound effects for the dialogue and voice-over are determined from the specific content of the story.
Specifically, in one embodiment, the story text data is parsed based on text recognition techniques, and parsing the story text data includes performing text recognition on the story text data to determine the story content.
Further, in view of how computers analyze text, in one embodiment the story text data is parsed by element decomposition. Specifically, the story is decomposed into content elements based on the text recognition result, and story elements are extracted, the story elements including the style, characters, and/or dialogue of the story.
Specifically, in an embodiment, invoking a story data processing model to perform sound-effect processing on the dialogue and voice-over in the story text includes the following steps (a matching sketch follows the list):
performing text recognition on the story text, decomposing the story into content elements based on the text recognition result, and extracting story elements;
determining sound-effect characteristics matching the dialogue and voice-over according to the story elements corresponding to them;
and converting the dialogue and voice-over into dialogue and voice-over data with sound effects matching those characteristics.
Specifically, in an embodiment, as shown in FIG. 2, the method includes the following steps:
S210, acquiring story text data;
S220, parsing the story text data;
S221, decomposing the story into content elements based on the text recognition result and extracting story elements;
S222, identifying dialogue and voice-over in the story text;
S230, invoking a story data processing model;
S231, determining sound-effect characteristics matching the dialogue and voice-over according to the story elements corresponding to them;
S232, converting the dialogue and voice-over into dialogue and voice-over data with sound effects matching those characteristics.
Specifically, in an embodiment, the parsing target is divided into several specific categories (the story elements), keywords are extracted for each story element, and the extracted keywords together with the story element tags are saved as the parsing result, as the sketch below illustrates.
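As a hedged illustration of saving keywords against story element tags, this sketch scans a text span for fixed per-category vocabularies; the categories and word lists are invented for the example, and a deployed system would use a proper extractor instead.

```python
# Illustrative keyword vocabularies per story element tag.
ELEMENT_VOCAB = {
    "character": ["wolf", "fox", "princess", "grandmother"],
    "environment": ["forest", "night", "castle", "storm"],
    "style": ["fairy tale", "fable", "adventure"],
}

def extract_elements(text: str) -> dict:
    """Extract keywords per story element tag; the tag-to-keywords mapping
    is saved as the parsing result."""
    lowered = text.lower()
    result = {}
    for tag, vocab in ELEMENT_VOCAB.items():
        hits = [word for word in vocab if word in lowered]
        if hits:
            result[tag] = hits
    return result

print(extract_elements("One stormy night the wolf crept through the forest."))
```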
Further, within the story text, the speakers, content, and/or background of the dialogue and voice-over may change as the story progresses. Therefore, in one embodiment, the sound effects for the dialogue and voice-over are determined from the story elements corresponding to each of them; specifically, a sound effect is determined separately for each sentence of dialogue and each sentence of voice-over.
In particular, in one embodiment, the story elements corresponding to dialogue include the dialogue character, dialogue content, dialogue environment, and/or dialogue context references.
In particular, in an embodiment, the story elements corresponding to voice-over include voice-over content, voice-over environment, and/or voice-over context references.
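These two element sets can be captured in small records such as the following sketch; the field names mirror the lists above, while the concrete types are assumptions of this illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogueElements:
    character: Optional[str] = None    # who is speaking
    content: Optional[str] = None      # what is said
    environment: Optional[str] = None  # scene in which it is said
    context_ref: Optional[str] = None  # reference to surrounding context

@dataclass
class VoiceoverElements:
    content: Optional[str] = None
    environment: Optional[str] = None
    context_ref: Optional[str] = None
```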
Further, based on the method of the present invention, the invention also provides a storage medium on which program code implementing the method is stored.
Furthermore, based on the method, the invention also provides a story data processing system for an intelligent robot.
Specifically, as shown in FIG. 3, in an embodiment the system includes:
a text acquisition module 310 configured to acquire story text data;
a text parsing module 320 configured to parse the story text data and identify dialogue and voice-over in the story text;
a story data processing model library 341 configured to store story data processing models;
a sound-effect processing module 340 configured to invoke a story data processing model, perform sound-effect processing on the dialogue and voice-over in the story text, and generate dialogue and voice-over data with sound effects;
and a multimodal story data generation module 330 configured to generate and output multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects.
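Read as a pipeline, modules 310-340 compose naturally; the StorySystem class below is this sketch's own invention, with the modules injected as callables so that the wiring, not any particular implementation, is what the example shows.

```python
class StorySystem:
    """Hypothetical composition of the modules in FIG. 3."""

    def __init__(self, acquire, parse, model_library, generate):
        self.acquire = acquire              # text acquisition module 310
        self.parse = parse                  # text parsing module 320
        self.model_library = model_library  # model library 341
        self.generate = generate            # multimodal generation module 330

    def process(self, source, model_name="default"):
        text = self.acquire(source)
        segments = self.parse(text)
        effect_model = self.model_library[model_name]  # sound-effect module 340
        return self.generate(effect_model(segments))

# Trivial stand-ins show the data flow end to end:
system = StorySystem(acquire=str, parse=str.splitlines,
                     model_library={"default": lambda segs: segs},
                     generate=lambda segs: {"speech": segs})
print(system.process('"Who goes there?"\nThe fox crept closer.'))
```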
Further, in an embodiment, the multimodal story data generation module 330 is further configured to generate corresponding intelligent robot action data for the dialogue and voice-over in the story text.
Further, in an embodiment, as shown in FIG. 4, the multimodal story data generation module 430 further includes:
a voice conversion unit 431 configured to convert the text in the story text data other than the dialogue and voice-over into first voice data;
a voice conversion unit 432 configured to perform text-to-speech conversion on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects;
and a voice synthesis unit 433 configured to fuse the dialogue and voice-over data with sound effects and the first voice data to generate story voice data.
Furthermore, based on the story data processing system, the invention also provides an intelligent story machine. Specifically, as shown in FIG. 5, in an embodiment the story machine includes:
an input acquisition module 510 configured to collect the user's multimodal input and receive the user's story requirements;
a story data processing system 520 configured to acquire the corresponding story text data according to the user's story requirements and generate the multimodal data;
and an output module 530 configured to output the multimodal data to the user.
Specifically, in one embodiment, the output module 530 includes a playing unit configured to play the dialogue and voice-over data with sound effects.
Specifically, as shown in FIG. 6, in an embodiment the story machine includes a smart device 610 and a cloud server 620, wherein:
the cloud server 620 includes a story data processing system 630, configured to invoke the capability interfaces of the cloud server 620 to acquire and parse story text data, and to generate and output multimodal data including the dialogue and voice-over data with sound effects. During data parsing, the story data processing system 630 calls the corresponding processing logic through each capability interface.
Specifically, in an embodiment, the capability interfaces of the cloud server 620 include a text recognition interface 621, a text-to-speech conversion interface 622, and a sound-effect synthesis interface 623.
The smart device 610 includes a human-machine interaction input/output module 611, a communication module 612, a playing module 613, and an action module 614.
The human-machine interaction input/output module 611 is configured to acquire the user's control instructions and determine the user's story-listening requirements.
The communication module 612 is configured to send the story-listening requirements acquired by the human-machine interaction input/output module 611 to the cloud server 620, and to receive multimodal data from the cloud server 620.
The playing module 613 is configured to play the dialogue and voice-over data with sound effects, or the story voice data, in the multimodal data.
The action module 614 is configured to perform corresponding actions according to the intelligent robot action data in the multimodal data.
Specifically, in a typical application scenario, the human-machine interaction input/output module 611 acquires the user's control instruction and determines the user's story-listening requirement.
The communication module 612 sends the story-listening requirement to the cloud server 620.
The cloud server 620 selects the corresponding story text data based on the story-listening requirement. The story data processing system in the cloud server 620 acquires and parses the story text data, then generates and outputs the multimodal data. The multimodal data comprises intelligent robot action data and story voice data, the story voice data including the dialogue and voice-over data with sound effects.
The communication module 612 receives the multimodal data sent by the cloud server 620.
The playing module 613 plays the story voice data in the received multimodal data.
The action module 614 performs the corresponding actions according to the intelligent robot action data in the multimodal data. A device-side sketch of this loop follows.
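The device-side control flow can be summarized in a few lines. Everything here (the endpoint URL, the JSON shape, and the play/perform callables standing in for modules 613 and 614) is hypothetical, illustrating the loop rather than any real protocol.

```python
import json
import urllib.request

CLOUD_URL = "https://cloud.example.com/story"  # hypothetical endpoint

def request_story(listening_requirement: dict) -> dict:
    """Communication module 612: send the user's story-listening requirement
    to the cloud server and receive the multimodal data in reply."""
    body = json.dumps(listening_requirement).encode("utf-8")
    req = urllib.request.Request(CLOUD_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def tell_story(multimodal: dict, play, perform) -> None:
    """Playing module 613 and action module 614: walk the multimodal data,
    playing each voice clip and performing the paired robot action."""
    for step in multimodal.get("steps", []):
        play(step["voice"])          # story voice data (with sound effects)
        perform(step.get("action"))  # intelligent robot action data, if any

# tell_story(request_story({"story": "Little Red Riding Hood"}),
#            play=print, perform=print)
```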
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein, but extend to equivalents thereof as would be understood by those of ordinary skill in the relevant arts. It is also to be understood that the terminology used herein is for describing particular embodiments only and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, appearances of the phrase "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although the embodiments of the present invention have been described above, the description is intended only to aid understanding of the invention, not to limit it. The method of the invention admits various other embodiments, and those skilled in the art may make corresponding changes or modifications without departing from the spirit of the invention; all such changes and modifications are intended to fall within the scope of the appended claims.

Claims (7)

1. A story data processing method for an intelligent robot, characterized by comprising the following steps:
acquiring story text data;
parsing the story text data and identifying dialogue and voice-over in the story text, wherein the story text data is parsed by element decomposition: the story text data serving as the parsing target is divided into several corresponding categories according to preset story element types, keywords are then extracted for each category, and the extracted keywords together with the story element tags are saved as the parsing result;
invoking a story data processing model and performing sound-effect processing on the dialogue and voice-over in the story text to generate dialogue and voice-over data with sound effects;
generating and outputting multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects;
the method further comprising:
performing text-to-speech conversion on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects, wherein the story elements corresponding to voice-over include voice-over content, voice-over environment, and/or voice-over context references;
converting the text in the story text data other than the dialogue and voice-over into first voice data;
and fusing the dialogue and voice-over speech data with sound effects and the first voice data to generate story voice data.
2. The method of claim 1, wherein the multimodal data further comprises intelligent robot action data, wherein:
corresponding intelligent robot action data is generated for the dialogue and voice-over in the story text.
3. The method of claim 2, wherein invoking a story data processing model to perform sound-effect processing on the dialogue and voice-over in the story text comprises:
performing text recognition on the story text, decomposing the story into content elements based on the text recognition result, and extracting story elements;
determining sound-effect characteristics matching the dialogue and voice-over according to the story elements corresponding to them;
and converting the dialogue and voice-over into dialogue and voice-over data with sound effects matching those sound-effect characteristics.
4. The method of claim 3, wherein the story elements corresponding to dialogue include the dialogue character, dialogue content, dialogue environment, and/or dialogue context references.
5. A storage medium having stored thereon program code implementing the method of any one of claims 1-4.
6. A story data processing system for an intelligent robot, characterized in that the system comprises:
a text acquisition module configured to acquire story text data;
a text parsing module configured to parse the story text data and identify dialogue and voice-over in the story text, wherein the story text data is parsed by element decomposition: the story text data serving as the parsing target is divided into several corresponding categories according to preset story element types, keywords are then extracted for each category, and the extracted keywords together with the story element tags are saved as the parsing result;
a story data processing model library configured to store story data processing models;
a sound-effect processing module configured to invoke the story data processing model, perform sound-effect processing on the dialogue and voice-over in the story text, and generate dialogue and voice-over data with sound effects;
and a multimodal story data generation module configured to generate and output multimodal data matching the story text, the multimodal data including the dialogue and voice-over data with sound effects;
the multimodal story data generation module further comprising:
a first voice conversion unit configured to perform text-to-speech conversion on the dialogue and voice-over in the story text, combined with the sound-effect data, to generate dialogue and voice-over speech data with sound effects, wherein the story elements corresponding to voice-over include voice-over content, voice-over environment, and/or voice-over context references;
a second voice conversion unit configured to convert the text in the story text data other than the dialogue and voice-over into first voice data;
and a voice synthesis unit configured to fuse the dialogue and voice-over data with sound effects and the first voice data to generate story voice data.
7. An intelligent story machine, characterized in that the story machine comprises:
an input acquisition module configured to collect the user's multimodal input and receive the user's story requirements;
the story data processing system of claim 6, configured to acquire the corresponding story text data according to the user's story requirements and generate the multimodal data;
and an output module configured to output the multimodal data to the user.
CN201810981546.8A 2018-08-27 2018-08-27 Intelligent robot-oriented story data processing method and system Active CN109065019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810981546.8A 2018-08-27 2018-08-27 CN109065019B (en) Intelligent robot-oriented story data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810981546.8A 2018-08-27 2018-08-27 CN109065019B (en) Intelligent robot-oriented story data processing method and system

Publications (2)

Publication Number Publication Date
CN109065019A CN109065019A (en) 2018-12-21
CN109065019B (en) 2021-06-15

Family

ID=64757210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810981546.8A Active CN109065019B (en) 2018-08-27 2018-08-27 Intelligent robot-oriented story data processing method and system

Country Status (1)

Country Link
CN (1) CN109065019B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390927B * 2019-06-28 2021-11-23 Beijing QIYI Century Science and Technology Co Ltd Audio processing method and device, electronic equipment and computer readable storage medium
CN111415650A * 2020-03-25 2020-07-14 Guangzhou Kugou Computer Technology Co Ltd Text-to-speech method, device, equipment and storage medium
CN113658577A * 2021-08-16 2021-11-16 Tencent Music Entertainment Technology (Shenzhen) Co Ltd Speech synthesis model training method, audio generation method, device and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101354840B * 2008-09-08 2011-09-28 Zhongzhi Ruide Technology (Beijing) Co Ltd Method and apparatus for performing voice reading control of electronic book
CN101694772B * 2009-10-21 2014-07-30 Beijing Vimicro Electronics Co Ltd Method for converting text into rap music and device thereof
JP2013072957A (en) * 2011-09-27 2013-04-22 Toshiba Corp Document read-aloud support device, method and program
US20140122082A1 (en) * 2012-10-29 2014-05-01 Vivotext Ltd. Apparatus and method for generation of prosody adjusted sound respective of a sensory signal and text-to-speech synthesis
CN106985137B * 2017-03-09 2019-11-08 Beijing Guangnian Wuxian Technology Co Ltd Multimodal interaction method and system for intelligent robot

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096932A * 2015-07-14 2015-11-25 Baidu Online Network Technology (Beijing) Co Ltd Voice synthesis method and apparatus of talking book
CN105894873A * 2016-06-01 2016-08-24 Beijing Guangnian Wuxian Technology Co Ltd Child teaching method and device for an intelligent robot

Also Published As

Publication number Publication date
CN109065019A (en) 2018-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant