CN109241331B - Intelligent robot-oriented story data processing method - Google Patents

Intelligent robot-oriented story data processing method

Info

Publication number
CN109241331B
Authority
CN
China
Prior art keywords: story, sound effect, data, text, word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811114587.3A
Other languages
Chinese (zh)
Other versions
CN109241331A (en)
Inventor
贾志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guangnian Infinite Technology Co ltd
Original Assignee
Beijing Guangnian Infinite Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Infinite Technology Co ltd
Priority to CN201811114587.3A
Publication of CN109241331A
Application granted
Publication of CN109241331B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/30 Semantic analysis
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Toys (AREA)

Abstract

The invention discloses a story data processing method and system for an intelligent robot. The method comprises the following steps: acquiring story text data; parsing the story text data, wherein text recognition and word segmentation are performed on the story text data; matching sound effect data to the story text word segmentation results based on a story sound effect model, acquiring the sound effect data corresponding to the word segmentation results, and determining the correlation between the sound effect data and the corresponding word segmentation results; and, based on the correlation, fusing the sound effect data with the story text data to generate and output story content audio data. Compared with the prior art, the method and system can convert a story in text form into story content audio data enriched with sound effects, greatly improving the listener's experience when a story is told.

Description

Intelligent robot-oriented story data processing method
Technical Field
The invention relates to the field of computers, in particular to a story data processing method for an intelligent robot.
Background
In traditional daily life, reading text is the main way people appreciate literary works. In certain scenarios, however, people also appreciate literary works through sound, for example by listening to storytelling performances or recitations. Most commonly, children whose literacy is not yet adequate experience literature by listening to others tell stories.
With the continuous development of multimedia technology, more and more multimedia devices have entered daily life. Supported by multimedia technology, the carrier of literary works in acoustic form, storytelling in particular, is gradually shifting to multimedia devices.
In general, storytelling with a multimedia device has meant telling the story manually in advance and recording an audio file, which the device then simply plays back. With the development of computer technology, the prior art has also adopted text-to-audio conversion in order to obtain a sound source simply and conveniently: no manual recitation or recording is needed, and a multimedia device can tell a story given only the story text. However, direct text-to-speech conversion only guarantees a literal rendering of the text content and cannot reproduce the expressiveness of a human storyteller. Storytelling based on text conversion is therefore quite dry and uninteresting, conveys only the direct meaning of the words, and gives a very poor user experience.
Disclosure of Invention
In order to improve user experience, the invention provides a story data processing method oriented to an intelligent robot, comprising the following steps:
acquiring story text data;
parsing the story text data, wherein text recognition and word segmentation are performed on the story text data;
matching sound effect data to the story text word segmentation results based on a story sound effect model, acquiring the sound effect data corresponding to the word segmentation results, and determining the correlation between the sound effect data and the corresponding word segmentation results;
and, based on the correlation, fusing the sound effect data with the story text data to generate and output story content audio data.
In one embodiment:
the correlation includes the story position corresponding to the sound effect data;
and the sound effect data and the story text data are fused to generate story content audio data, wherein the audio corresponding to the sound effect data is fused at the corresponding story position in the story content audio data.
In one embodiment, the sound effect data includes:
a sound effect tag, the sound effect tag comprising a sound effect type;
and/or,
sound effect control data, the sound effect control data including a sound effect duration.
In one embodiment:
parsing the story text data further includes acquiring the story type corresponding to the current story text data;
and in matching sound effect data to the story text word segmentation results based on a story sound effect model, the matched story sound effect model is invoked based on the story type.
In one embodiment, sound effect data matching is performed on the story text word segmentation results based on a story sound effect model, the sound effect data corresponding to the word segmentation results is acquired, and the matching details between the sound effect data and the corresponding word segmentation results are determined, wherein:
the corresponding sound effect data is selected from a sound effect library according to the story elements corresponding to the word segmentation results and/or their semantics, and the matching details between the sound effect data and the corresponding word segmentation results are determined.
The invention also proposes a storage medium on which a program code implementing the method according to the invention is stored.
The invention also provides a story data processing system oriented to an intelligent robot, comprising:
a text acquisition module configured to acquire story text data;
a text parsing module configured to parse the story text data, wherein text recognition and word segmentation are performed on the story text data;
a sound effect processing module configured to match sound effect data to the story text word segmentation results based on a story sound effect model, acquire the sound effect data corresponding to the word segmentation results, and determine the correlation between the sound effect data and the corresponding word segmentation results;
and a multi-modal story data generation module configured to fuse the sound effect data with the story text data based on the correlation, and to generate and output story content audio data.
In one embodiment:
the text parsing module is further configured to acquire the story type corresponding to the current story text data;
the sound-effect processing module is further configured to invoke a matched story sound-effect model based on the story type.
In one embodiment:
the text parsing module is further configured to acquire the story elements corresponding to the word segmentation results and/or their semantics;
the sound effect processing module is further configured to select the corresponding sound effect data from a sound effect library according to the story elements corresponding to the word segmentation results and/or their semantics, and to determine the matching details between the sound effect data and the corresponding word segmentation results.
The invention also provides an intelligent story machine, comprising:
an input acquisition module configured to collect the user's multi-modal input and confirm the user's story requirement;
the story data processing system, configured to acquire the corresponding story text data according to the user's story requirement and generate story content audio data;
an output module configured to output the story content audio data to a user.
Compared with the prior art, the method and system of the invention can convert a story in text form into story content audio data enriched with sound effects, greatly improving the listener's experience when a story is told.
Additional features and advantages of the invention will be set forth in the description which follows. Also, some of the features and advantages of the invention will be apparent from the description, or may be learned by practice of the invention. The objectives and some of the advantages of the invention may be realized and attained by the process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow diagram of a method according to an embodiment of the invention;
FIGS. 2 and 3 are partial flow diagrams of methods according to embodiments of the invention;
FIG. 4 is a system architecture diagram according to an embodiment of the invention;
FIGS. 5 and 6 are schematic diagrams of a story machine according to embodiments of the present invention.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that practitioners can fully understand how the invention applies technical means to solve technical problems and achieve its technical effects, and can implement it accordingly. It should be noted that, as long as no conflict arises, the embodiments and the features of the embodiments may be combined with one another, and all resulting technical solutions fall within the scope of the present invention.
In order to improve user experience, the invention provides a story data processing method oriented to an intelligent robot. In this method, corresponding sound effects are matched to the story, and the sound effect data is fused with the story text data to generate story content audio data, thereby improving the expressiveness of the story content.
Furthermore, in practical application scenarios, sound effect playback must be matched to the right playing moment: a sound effect played at the wrong time not only fails to improve the expressiveness of the story content but actually degrades the listening experience. Therefore, in the method of the invention, the sound effect data corresponding to the story text word segmentation results is acquired and the correlation between the sound effect data and the corresponding word segmentation results is determined; the sound effect data and the story text data are then fused based on that correlation, so that the final sound effect playback achieves the intended expressive effect.
Compared with the prior art, the method and system of the invention can convert a story in text form into story content audio data enriched with sound effects, greatly improving the listener's experience when a story is told.
The detailed flow of a method according to an embodiment of the invention is described below with reference to the accompanying drawings. The steps shown in the flowcharts may be executed in a computer system containing, for example, a set of computer-executable instructions. Although a logical order of steps is illustrated in the flowcharts, in some cases the steps shown or described may be performed in a different order than presented here.
As shown in fig. 1, in one embodiment, the method includes the following steps:
s110, acquiring story text data;
s120, analyzing the story text data, wherein text recognition and word segmentation processing are carried out on the story text data;
s130, performing sound effect data matching on the story text word segmentation result based on the story sound effect model, acquiring sound effect data corresponding to the story text word segmentation result (S131) and determining the mutual relation between the sound effect data and the text word segmentation result corresponding to the sound effect data (S132);
and S140, fusing the sound effect data and the story text data based on the mutual relation between the sound effect data and the text word segmentation result corresponding to the sound effect data, and generating and outputting story content audio data.
Specifically, in one embodiment, the story text is converted into story speech by a text-to-speech (TTS) engine, and the sound effect audio from the sound effect data is fused into the story speech.
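As a minimal sketch of this conversion-and-fusion step, the following assumes the pydub audio library; synthesize stands in for whatever TTS engine is used, and the splicing scheme is an illustration, not the patented implementation:

```python
from pydub import AudioSegment  # audio splicing/overlay library

def synthesize(sentence: str) -> AudioSegment:
    """Placeholder TTS call: returns speech audio for one story sentence."""
    return AudioSegment.silent(duration=1000)  # a real TTS engine goes here

def build_story_audio(sentences: list, effects: dict) -> AudioSegment:
    """Concatenate per-sentence speech; splice in each matched effect audio
    at the story position recorded in its correlation data."""
    audio = AudioSegment.empty()
    for position, sentence in enumerate(sentences):
        audio += synthesize(sentence)
        if position in effects:  # an effect is anchored after this sentence
            audio += AudioSegment.from_file(effects[position])
    return audio
```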
Further, in one embodiment, when the sound effect data is fused with the story text data, the first consideration is at what position in the story text data the sound effect data should be fused. Specifically, in one embodiment, the correlation between the sound effect data and the corresponding text word segmentation result includes the story position corresponding to the sound effect data. When the sound effect data and the story text data are fused to generate story content audio data, the audio corresponding to the sound effect data is fused at the corresponding story position in the story content audio data.
Further, in order to fuse the audio corresponding to the sound effect data at the corresponding story position as seamlessly as possible, in one embodiment the playing duration of the audio must also be taken into account during fusion. Specifically, in one embodiment, the sound effect data includes the following (a data-layout sketch follows this list):
a sound effect tag, wherein the sound effect tag comprises a sound effect type;
and/or,
sound effect control data, the sound effect control data including a sound effect duration.
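One possible in-memory layout for such a sound effect record is sketched below; the field names are assumptions for illustration, not terms from the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SoundEffectData:
    """Sketch of one matched sound effect record."""
    effect_tag: str                          # sound effect tag: the effect type
    audio_path: str                          # where the effect audio lives in the library
    duration_ms: Optional[int] = None        # sound effect control data: effect duration
    story_position: Optional[int] = None     # correlation: where in the story it plays
    matching_details: Optional[dict] = None  # how it matched its word segmentation result
```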
Further, in one embodiment, fusing the sound effect data with the story text data also requires deciding in what manner the sound effect data is fused into the story text data. For example, the effect audio may be inserted directly into the story, may replace a part of the story text, or may be used as a background sound.
In one embodiment, the manner in which the sound effect data is fused into the story text data is determined by the correlation between the sound effect data and the corresponding text word segmentation result. Specifically, in one embodiment, the correlation includes the matching details between the sound effect data and the corresponding word segmentation result, and those matching details determine the manner in which the sound effect data is fused into the story text data.
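The three fusion manners just described can be distinguished as in the sketch below; the mode names and the millisecond arithmetic are assumptions, with pydub-style audio segments as in the earlier sketch:

```python
from enum import Enum

class FusionMode(Enum):
    INSERT = "insert"          # splice the effect audio into the speech
    REPLACE = "replace"        # effect audio stands in for part of the text
    BACKGROUND = "background"  # effect audio is overlaid under the speech

def fuse(speech, effect, mode: FusionMode, at_ms: int):
    """Apply one sound effect to the story speech per its matching details."""
    if mode is FusionMode.INSERT:
        return speech[:at_ms] + effect + speech[at_ms:]
    if mode is FusionMode.REPLACE:
        # drop the stretch of speech the effect replaces (len() is in ms)
        return speech[:at_ms] + effect + speech[at_ms + len(effect):]
    return speech.overlay(effect, position=at_ms)  # BACKGROUND
```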
Further, to ensure that sound effects enhance the vividness of the storytelling rather than degrading it through mismatched effects, in one embodiment the corresponding sound effect data is determined according to the semantics of the text word segmentation results.
Further, in one embodiment, the corresponding sound effects are determined based on story elements. Specifically, in one embodiment, text recognition is performed on the story text, the story content is decomposed into elements based on the recognition result, and the story elements are extracted; the sound effect data matching a text word segmentation result is then determined according to the story elements corresponding to that result.
Specifically, in one embodiment, the parsing target is divided into several specific categories (story elements), keyword extraction is performed for each story element, and the extracted keywords together with their story element tags are saved as the parsing result. In particular, in one embodiment, the story elements include the story background, story characters, event content, event occurrence context, and/or event progression stage. For example, in a specific application scenario, for the sentence "The class bell rang, and he hurried into the classroom", a school bell sound effect is mixed into the story just as the ringing of the bell is narrated, bringing the child into the scene.
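A minimal sketch of this element-tagging step, assuming the jieba Chinese word segmentation library and a hand-made keyword table (both the table and the element labels are illustrative assumptions):

```python
import jieba  # Chinese word segmentation library

# Illustrative keyword table: story element label -> trigger keywords
ELEMENT_KEYWORDS = {
    "event_occurrence_context": ["上课铃", "教室"],  # class bell, classroom
    "story_background": ["森林", "夜晚"],            # forest, night
}

def extract_story_elements(sentence: str) -> list:
    """Segment the sentence, then tag keywords with their story element labels."""
    words = set(jieba.lcut(sentence))
    return [(keyword, element)
            for element, keywords in ELEMENT_KEYWORDS.items()
            for keyword in keywords
            if keyword in words or keyword in sentence]  # tolerate split tokens

# extract_story_elements("上课铃响了，他急忙跑进教室") would tag both
# "上课铃" and "教室" as event_occurrence_context cues.
```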
Specifically, in one embodiment, sound effect data matching is performed on the story text word segmentation results based on the story sound effect model, and the sound effect data corresponding to the word segmentation results is acquired, wherein:
the corresponding sound effect data is selected from a sound effect library according to the story elements corresponding to the word segmentation results and/or their semantics, and the matching details between the sound effect data and the corresponding word segmentation results are determined.
Specifically, in one embodiment, as shown in fig. 2, the method includes the following steps (a code sketch of this flow follows the list):
s210, text recognition and word segmentation processing are carried out on story text data;
s220, performing content element disassembly on the story based on the text recognition result, and extracting story elements;
s230, matching story elements for the story text word segmentation result;
s240, calling a story sound effect model;
s250, selecting corresponding sound effect data from a sound effect library according to story elements corresponding to the text word segmentation result and/or the semantics of the text word segmentation result;
s260, determining a story position corresponding to the sound effect data;
s270, determining the matching detailed description between the sound effect data and the corresponding text word segmentation result.
Furthermore, in practical application scenarios, different types of stories attract different audiences, whose listening preferences differ accordingly. To improve user experience as far as possible, the sound effect selection strategy should therefore follow the story type; for example, for a forest-adventure story, a strategy favouring animal-related sound effects in a tense style is preferred.
Specifically, in one embodiment, different styles of sound effects are employed depending on the type of story. Furthermore, in one embodiment, sound effects of a uniform style are used within one and the same story, which avoids the jarring feeling produced by switching between effects of different styles and thus improves user experience.
Further, in one embodiment, different story sound effect models are invoked for different types of stories, ensuring that the finally matched sound effect data suits the story type (a dispatch sketch follows). Specifically, in one embodiment:
the story text data is parsed, wherein the story type corresponding to the current story text data is acquired;
and sound effect data matching is performed on the story text word segmentation results based on the story sound effect model, wherein the matched story sound effect model is invoked based on the story type.
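One way to realize this per-type dispatch is sketched below; the genre names, styles, and tag sets are invented for illustration:

```python
class StorySoundEffectModel:
    """Hypothetical per-genre model: a preferred effect style plus allowed tags."""
    def __init__(self, style: str, allowed_tags: set):
        self.style = style
        self.allowed_tags = allowed_tags

STORY_EFFECT_MODELS = {
    "forest_adventure": StorySoundEffectModel("tense", {"animal", "wind", "footsteps"}),
    "bedtime": StorySoundEffectModel("soothing", {"music_box", "rain"}),
}

def get_story_effect_model(story_type: str) -> StorySoundEffectModel:
    """S350: invoke the story sound effect model matched to the story type."""
    return STORY_EFFECT_MODELS.get(
        story_type, StorySoundEffectModel("neutral", set()))
```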
Specifically, in an embodiment, as shown in fig. 3, the method includes the following steps:
s310, text recognition and word segmentation processing are carried out on story text data;
s320, performing content element disassembly on the story based on the text recognition result, and extracting story elements;
s330, determining the story type based on the text recognition result;
s340, matching story elements for the story text word segmentation result;
s350, calling a story sound effect model matched with the story type;
s360, selecting corresponding sound effect data from a sound effect library according to story elements corresponding to the text word segmentation result and/or the semantics of the text word segmentation result;
s370, determining a story position corresponding to the sound effect data;
and S380, determining the matching detailed description between the sound effect data and the corresponding text word segmentation result.
Further, based on the method of the invention, the invention also provides a storage medium on which program code implementing the method is stored.
Furthermore, based on the method, the invention also provides a story data processing system oriented to an intelligent robot.
Specifically, as shown in fig. 4, in an embodiment, the system includes:
a text acquisition module 410 configured to acquire story text data;
a text parsing module 420 configured to parse the story text data, wherein text recognition and word segmentation are performed on the story text data;
a sound effect processing module 430 configured to match sound effect data to the story text word segmentation results based on the story sound effect model, acquire the sound effect data corresponding to the word segmentation results, and determine the correlation between the sound effect data and the corresponding word segmentation results;
and a multi-modal story data generation module 440 configured to fuse the sound effect data with the story text data based on that correlation, and to generate and output story content audio data.
Specifically, in one embodiment, the sound effect processing module 430 is configured to invoke a story sound effect model from the story sound effect model library 431 and to select matching sound effect data from the sound effect library 432 using that model.
Further, in one embodiment:
the text parsing module 420 is further configured to obtain a story type corresponding to the current story text data;
the sound effects processing module 430 is further configured to invoke the matched story sound effects model based on the story type.
Further, in one embodiment:
the text parsing module 420 is further configured to acquire the story elements corresponding to the word segmentation results and/or their semantics;
the sound effect processing module 430 is further configured to select the corresponding sound effect data from the sound effect library according to the story elements corresponding to the word segmentation results and/or their semantics, and to determine the matching details between the sound effect data and the corresponding word segmentation results (a wiring sketch of the whole system follows).
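Pulling the four modules of fig. 4 together, a minimal wiring sketch; the module interfaces and the stub bodies are assumptions for illustration, not the patented implementation:

```python
class StoryDataProcessingSystem:
    """Sketch of the module pipeline in fig. 4 (interfaces are assumed)."""

    def __init__(self, effect_model, effect_library: dict):
        self.effect_model = effect_model      # story sound effect model (431)
        self.effect_library = effect_library  # sound effect library (432)

    def acquire_text(self, story_request: str) -> str:
        # text acquisition module 410: look the story text up (stub)
        raise NotImplementedError

    def parse(self, story_text: str) -> list:
        # text parsing module 420: recognition + word segmentation (stub)
        return [s for s in story_text.split("。") if s]

    def generate_audio(self, sentences: list, effects: list) -> bytes:
        # multi-modal generation module 440: TTS + effect fusion (stub)
        raise NotImplementedError

    def process(self, story_request: str) -> bytes:
        sentences = self.parse(self.acquire_text(story_request))
        # sound effect processing module 430: match effects and correlations
        effects = self.effect_model.match(sentences, self.effect_library)
        return self.generate_audio(sentences, effects)
```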
Furthermore, based on the story data processing system provided by the invention, the invention also provides an intelligent story machine. Specifically, as shown in fig. 5, in one embodiment the story machine includes:
an input acquisition module 510 configured to collect the user's multi-modal input and confirm the user's story requirement;
a story data processing system 520 configured to obtain corresponding story text data according to a user story requirement, and generate story content audio data;
an output module 530 configured to output the story content audio data to a user.
Specifically, in one embodiment, the output module 530 includes a playing unit configured to play the story content audio data.
Specifically, as shown in fig. 6, in one embodiment the story machine includes a smart device 610 and a cloud server 620, wherein:
the cloud server 620 includes a story data processing system 630. The story data processing system 630 is configured to invoke the capability interfaces of the cloud server 620 to acquire and parse story text data, and to generate and output story content audio data containing the fused sound effects. Specifically, each capability interface of the story data processing system 630 calls the corresponding logic process during data parsing.
Specifically, in an embodiment, the capability interfaces of the cloud server 620 include a text segmentation interface 624, a text recognition interface 621, a text/speech conversion interface 622, and a sound effect synthesis interface 623.
The intelligent device 610 includes a human-computer interaction input and output module 611, a communication module 612 and a playing module 613.
The smart device 610 may be a tablet computer, a robot, a mobile phone, a story machine, or a picture book reading robot.
The human-computer interaction input/output module 611 is configured to obtain a control instruction of the user and determine a story listening requirement of the user.
The communication module 612 is configured to output the user story listening requirement acquired by the human-computer interaction input/output module 611 to the cloud server 620, and receive multi-modal data from the cloud server 620.
The playing module 613 is configured to play the story content audio data.
In a specific application scenario, the human-computer interaction input/output module 611 acquires the user's control instruction and determines the user's story listening requirement.
The communication module 612 sends the user's story listening requirement to the cloud server 620.
The cloud server 620 selects the corresponding story text data based on the user's story listening requirement. The story data processing system in the cloud server 620 acquires and parses the story text data, and generates and outputs the story content audio data.
The communication module 612 receives story content audio data sent by the cloud server 620.
The playing module 613 plays the story content audio data received by the communication module 612.
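The exchange between the smart device and the cloud server could look roughly like the sketch below; the endpoint URL, payload shape, and audio format are all assumptions, as the patent does not specify a wire protocol:

```python
import json
from urllib import request

CLOUD_URL = "https://cloud.example/story"  # hypothetical endpoint

def request_story_audio(listening_requirement: dict) -> bytes:
    """Communication module sketch (612): post the user's story listening
    requirement to the cloud server (620) and receive fused story audio."""
    body = json.dumps(listening_requirement).encode("utf-8")
    req = request.Request(CLOUD_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.read()  # audio bytes handed to the playing module (613)
```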
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein but are extended to equivalents thereof as would be understood by those ordinarily skilled in the relevant arts. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, appearances of the phrase "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although embodiments of the present invention have been described above, the description is provided only to aid understanding of the invention and is not intended to limit it. The method of the invention admits various other embodiments. Those skilled in the art may make corresponding changes or modifications without departing from the spirit of the invention, and all such changes or modifications fall within the scope of the appended claims.

Claims (8)

1. A story data processing method oriented to an intelligent robot, characterized by comprising the following steps:
acquiring story text data;
parsing the story text data, wherein text recognition and word segmentation are performed on the story text data;
matching sound effect data to the story text word segmentation results based on a story sound effect model, acquiring the sound effect data corresponding to the word segmentation results, and determining the correlation between the sound effect data and the corresponding word segmentation results;
based on the correlation, fusing the sound effect data with the story text data to generate and output story content audio data;
wherein, in matching sound effect data to the story text word segmentation results based on the story sound effect model, acquiring the corresponding sound effect data and determining the matching details between the sound effect data and the corresponding word segmentation results:
the story content is decomposed into elements based on the text recognition result, and the story elements are extracted; the sound effect data matching a word segmentation result is determined according to the story elements corresponding to that result;
the corresponding sound effect data is selected from a sound effect library according to the story elements corresponding to the word segmentation results and/or their semantics, and the matching details between the sound effect data and the corresponding word segmentation results are determined;
specifically, the parsing target is divided into several specific story elements, keyword extraction is performed for each story element, and the extracted keywords and their story element tags are saved as the parsing result; the story elements include the story background, story characters, event content, event occurrence context, and/or event progression stage.
2. The method of claim 1, wherein:
the correlation includes the story position corresponding to the sound effect data;
and the sound effect data and the story text data are fused to generate story content audio data, wherein the audio corresponding to the sound effect data is fused at the corresponding story position in the story content audio data.
3. The method of claim 1, wherein the sound effect data comprises:
a sound effect tag, the sound effect tag comprising a sound effect type;
and/or,
sound effect control data, the sound effect control data including a sound effect duration.
4. The method of claim 1, wherein:
the story text data is parsed, wherein the story type corresponding to the current story text data is acquired;
and sound effect data matching is performed on the story text word segmentation results based on a story sound effect model, wherein the matched story sound effect model is invoked based on the story type.
5. A storage medium having stored thereon program code for implementing the method according to any one of claims 1-4.
6. An intelligent robot-oriented story data processing system, the system comprising:
a text acquisition module configured to acquire story text data;
a text parsing module configured to parse the story text data, wherein text recognition and word segmentation are performed on the story text data;
a sound effect processing module configured to match sound effect data to the story text word segmentation results based on a story sound effect model, acquire the sound effect data corresponding to the word segmentation results, and determine the correlation between the sound effect data and the corresponding word segmentation results;
a multi-modal story data generation module configured to fuse the sound effect data with the story text data based on the correlation, and to generate and output story content audio data;
the text parsing module being further configured to acquire the story elements corresponding to the word segmentation results and/or their semantics;
the sound effect processing module being further configured to select the corresponding sound effect data from a sound effect library according to the story elements corresponding to the word segmentation results and/or their semantics, and to determine the matching details between the sound effect data and the corresponding word segmentation results;
the sound effect processing module being specifically configured to decompose the story into content elements based on the text recognition result, extract the story elements, and determine the sound effect data matching a word segmentation result according to the story elements corresponding to that result;
wherein, specifically, the text parsing module divides the parsing target into several specific story elements, performs keyword extraction for each story element, and saves the extracted keywords and their story element tags as the parsing result; and the story elements include the story background, story characters, event content, event occurrence context, and/or event progression stage.
7. The system of claim 6, wherein:
the text parsing module is further configured to acquire the story type corresponding to the current story text data;
and the sound effect processing module is further configured to invoke the matched story sound effect model based on the story type.
8. An intelligent story machine, the story machine comprising:
an input acquisition module configured to collect the user's multi-modal input and confirm the user's story requirement;
the story data processing system according to claim 6 or 7, configured to acquire the corresponding story text data according to the user's story requirement and generate story content audio data;
an output module configured to output the story content audio data to a user.
CN201811114587.3A 2018-09-25 2018-09-25 Intelligent robot-oriented story data processing method Active CN109241331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811114587.3A CN109241331B (en) 2018-09-25 2018-09-25 Intelligent robot-oriented story data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811114587.3A CN109241331B (en) 2018-09-25 2018-09-25 Intelligent robot-oriented story data processing method

Publications (2)

Publication Number Publication Date
CN109241331A CN109241331A (en) 2019-01-18
CN109241331B (en) 2022-03-15

Family

ID=65056125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811114587.3A Active CN109241331B (en) 2018-09-25 2018-09-25 Intelligent robot-oriented story data processing method

Country Status (1)

Country Link
CN (1) CN109241331B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950266A (en) * 2019-04-30 2020-11-17 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111767740A (en) * 2020-06-23 2020-10-13 北京字节跳动网络技术有限公司 Sound effect adding method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093658A (en) * 2013-01-14 2013-05-08 中国科学院软件研究所 Child real object interaction story building method and system
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN106557298A * 2016-11-08 2017-04-05 北京光年无限科技有限公司 Intelligent-robot-oriented background dubbing output method and device
CN107451265A (en) * 2017-07-31 2017-12-08 广州网嘉玩具科技开发有限公司 A kind of story platform based on Internet of Things and artificial intelligence technology
CN107729441A (en) * 2017-09-30 2018-02-23 北京酷我科技有限公司 The processing method and system of a kind of audio file

Also Published As

Publication number Publication date
CN109241331A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN110517689B (en) Voice data processing method, device and storage medium
CN109543021B (en) Intelligent robot-oriented story data processing method and system
Schröder The SEMAINE API: Towards a Standards‐Based Framework for Building Emotion‐Oriented Systems
CN108847214B (en) Voice processing method, client, device, terminal, server and storage medium
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
KR101628050B1 (en) Animation system for reproducing text base data by animation
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN109065019B (en) Intelligent robot-oriented story data processing method and system
CN109782997B (en) Data processing method, device and storage medium
CN109460548B (en) Intelligent robot-oriented story data processing method and system
CN114401431A (en) Virtual human explanation video generation method and related device
CN109241331B (en) Intelligent robot-oriented story data processing method
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN109065018B (en) Intelligent robot-oriented story data processing method and system
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
CN112242132A (en) Data labeling method, device and system in speech synthesis
WO2022262080A1 (en) Dialogue relationship processing method, computer and readable storage medium
KR102281298B1 (en) System and method for video synthesis based on artificial intelligence
CN114595314A (en) Emotion-fused conversation response method, emotion-fused conversation response device, terminal and storage device
CN114514576A (en) Data processing method, device and storage medium
US20220236945A1 (en) Information processing device, information processing method, and program
CN114333758A (en) Speech synthesis method, apparatus, computer device, storage medium and product
KR102376552B1 (en) Voice synthetic apparatus and voice synthetic method
CN110718119A (en) Educational ability support method and system based on wearable intelligent equipment special for children

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant