CN117893718A - Dynamic generation method, system, equipment and medium for lecture scene - Google Patents

Dynamic generation method, system, equipment and medium for lecture scene

Info

Publication number
CN117893718A
Authority
CN
China
Prior art keywords
scene
lecture
real
speech
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410303010.6A
Other languages
Chinese (zh)
Other versions
CN117893718B (en)
Inventor
李翔
赵璧
詹歆
艾莉娜
方泽军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinlicheng Education Technology Co ltd
Original Assignee
Xinlicheng Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinlicheng Education Technology Co ltd filed Critical Xinlicheng Education Technology Co ltd
Priority to CN202410303010.6A
Priority claimed from CN202410303010.6A
Publication of CN117893718A
Application granted
Publication of CN117893718B
Legal status: Active (current)
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method, system, equipment and medium for dynamically generating a lecture scene. The method comprises the following steps: acquiring lecture text and real-time lecture data, wherein the real-time lecture data comprise audio data and non-verbal behavior data; performing semantic analysis on the lecture text to obtain a lecture theme vector and an emotional tendency; performing acoustic and behavioral analysis on the real-time lecture data to obtain a talent expression feature vector; and, based on a pre-constructed dynamic mapping relation, matching a corresponding scene template according to the lecture theme vector, the emotional tendency and the talent expression feature vector, and adjusting scene elements of the scene template according to the real-time lecture data to obtain a real-time lecture scene picture. The invention improves the degree of intelligence of lecture training, enables the virtual lecture scene to better match the emotion and rhythm of the lecturer, and greatly improves the lecturer's experience.

Description

Dynamic generation method, system, equipment and medium for lecture scene
Technical Field
The invention relates to the technical field of intelligent lectures, in particular to a method, a system, equipment and a medium for dynamically generating lecture scenes.
Background
At present, in order to give a presenter a better lecture experience, a virtual lecture scene can be generated through AR or VR technology, so that the presenter can experience a realistic lecture environment on a designated virtual stage. However, current virtual lecture scenes are generally fixed: the presenter has to manually select the desired virtual scene before the lecture, so the degree of intelligence is low; moreover, the effect displayed by the selected virtual scene cannot change during the lecture, leaving the presenter in the same virtual lecture scene for a long time and resulting in a poor user experience.
Disclosure of Invention
The embodiment of the invention provides a method, system, equipment and medium for dynamically generating a lecture scene, so as to solve the above problems in the related art. The technical solution is as follows:
In a first aspect, an embodiment of the present invention provides a method for dynamically generating a lecture scene, including:
acquiring lecture text and real-time lecture data, wherein the real-time lecture data comprise audio data and non-verbal behavior data;
performing semantic analysis on the lecture text to obtain a lecture theme vector and an emotional tendency;
performing acoustic and behavioral analysis on the real-time lecture data to obtain a talent expression feature vector;
and, based on a pre-constructed dynamic mapping relation, matching a corresponding scene template according to the lecture theme vector, the emotional tendency and the talent expression feature vector, and adjusting scene elements of the scene template according to the real-time lecture data to obtain a real-time lecture scene picture.
In one embodiment, acoustic and behavioral analysis of real-time speech data includes:
Analyzing the audio data based on a pre-constructed acoustic model to obtain acoustic characteristics; the acoustic features include pitch change value, cadence stability, and energy dynamic range;
Analyzing nonverbal behavior data based on a pre-constructed behavior analysis model to obtain behavior characteristics;
and performing vector conversion on the acoustic features and the behavior features to obtain a talent expression feature vector comprising the acoustic feature vector and the behavior feature vector.
In one embodiment, a method for analyzing emotional tendency includes:
Carrying out emotion analysis on the lecture text based on a natural language analysis model to obtain a first emotion feature;
classifying the first emotion characteristics to obtain positive emotion characteristics and negative emotion characteristics;
And calculating the positive emotion intensity and the negative emotion intensity according to the positive emotion characteristics and the negative emotion characteristics, and estimating emotion tendencies through the positive emotion intensity and the negative emotion intensity.
In one embodiment, adjusting the scene elements of the scene template according to the real-time lecture data includes:
performing specified-action recognition on the non-verbal behavior data, and, when a specified action is recognized, adjusting the scene elements corresponding to the specified action according to a preset rule to obtain target scene elements;
generating a real-time lecture scene picture according to the target scene element;
wherein the scene elements include background, lighting, and sound effects.
In one embodiment, adjusting the scene elements of the scene template according to the real-time lecture data includes:
carrying out emotion analysis on the audio data based on the natural language analysis model to obtain second emotion characteristics;
And adjusting the scene element according to the second emotion characteristics, and updating the real-time lecture scene picture according to the adjusted scene element.
In one embodiment, the method further comprises:
rendering the real-time lecture scene picture through an MR rendering algorithm to obtain a virtual lecture scene;
And superposing the virtual lecture scene in the user field of view picture for display and interaction.
In one embodiment, the method further comprises:
Calculating a comprehensive talent index score according to the talent expression feature vector and a preset weight coefficient;
generating real-time feedback suggestions according to the comprehensive talent index scores and a preset feedback template library based on a natural language generation algorithm;
And carrying out virtualized rendering on the real-time feedback suggestion through an MR rendering algorithm to obtain a virtual suggestion picture, and superposing the virtual suggestion picture in a user visual field picture for display and interaction.
In a second aspect, an embodiment of the present invention provides a system for dynamically generating a speech scene, which executes the method for dynamically generating a speech scene as described above.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor. The memory and the processor communicate with each other via an internal connection; the memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory. When the processor executes the instructions stored in the memory, the processor performs the method of any one of the embodiments of the above aspects.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when run on a computer, performs the method of any one of the above embodiments.
The advantages or beneficial effects in the technical scheme at least comprise:
According to the invention, a suitable lecture scene can be generated according to the lecture theme, emotional tendency and talent expression characteristics of the presenter, which improves the degree of intelligence; meanwhile, scene elements of the lecture scene can be dynamically adjusted according to the presenter's real-time lecture data, so that the virtual lecture scene better matches the presenter's emotion and rhythm, greatly improving the presenter's experience.
The foregoing summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will become apparent by reference to the drawings and the following detailed description.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the disclosure and are not therefore to be considered limiting of its scope.
FIG. 1 is a flow diagram of a dynamic speech scene generation method of the present invention;
Fig. 2 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
Example 1
At present, in order to give a presenter a better lecture experience, a virtual lecture scene can be generated through AR or VR technology, so that the presenter can experience a realistic lecture environment on a designated virtual stage. However, current virtual lecture scenes are generally fixed: the presenter has to manually select the desired virtual scene before the lecture, so the degree of intelligence is relatively low; moreover, the effect displayed by the selected virtual scene cannot change during the lecture, leaving the presenter in the same virtual lecture scene for a long time and resulting in a poor user experience.
To solve the above problems, this embodiment provides a dynamic generation method for a lecture scene that, by combining VR and AR technologies, offers a more realistic and immersive lecture practice experience with real-time feedback and guidance. The presenter first configures the MR equipment, including a high-resolution VR headset, AR glasses with accurate tracking, and multi-function handle controllers, to ensure that all devices are in optimal working condition and ready to provide an immersive experience. After the equipment is started, the system performs environment calibration to ensure that the spatial relationship between the virtual lecture scene and the real environment is correctly mapped. Meanwhile, the presenter is positioned through sensors in the AR glasses and the VR headset, so that the presenter's movements and gestures can be accurately captured.
After the MR equipment is prepared as described above, the following lecture scene dynamic generation steps are performed with reference to Fig. 1:
Step S1: and acquiring speech text and real-time speech data, wherein the real-time speech data comprises audio data and nonverbal behavior data.
The lecture text is submitted to the system by the presenter and records the content of the lecture. The real-time lecture data are acquired by the MR equipment in real time: for example, the audio data can be recorded through a voice recording function on the VR headset or AR glasses, and the non-verbal behavior data, including body movement data, facial expression data, eye movement data and the like, can be collected by the MR equipment.
Step S2: and carrying out semantic analysis on the lecture text to obtain a lecture theme vector and emotion tendencies.
Semantic analysis is performed on the lecture text through a pre-built deep learning model. The model is trained with a large amount of text content as input samples and the corresponding lecture theme vectors as output labels. The lecture text is fed into the trained model, which outputs the corresponding lecture theme vector; the theme of the upcoming lecture can thus be determined, and a lecture scene related to that theme can be matched for display.
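The patent does not disclose the architecture or weights of this deep learning model; as a rough, non-authoritative sketch of the step, the snippet below derives a lecture theme vector by averaging sentence embeddings from an off-the-shelf encoder. The sentence-transformers package, the model name and the sentence-splitting rule are all illustrative assumptions.

```python
# Minimal sketch of Step S2's theme-vector extraction (not the patented model).
# Assumption: a generic pre-trained sentence encoder stands in for the
# pre-built deep learning model described above.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def lecture_theme_vector(lecture_text: str) -> np.ndarray:
    """Encode the lecture text into a fixed-length theme vector."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    sentences = [s.strip() for s in re.split(r"[。.!?！？]", lecture_text) if s.strip()]
    if not sentences:
        sentences = [lecture_text]
    embeddings = model.encode(sentences)          # shape: (n_sentences, dim)
    return np.asarray(embeddings).mean(axis=0)    # simple average as the theme vector
```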
In addition, emotion analysis is performed on the lecture text through a natural language analysis model, keywords related to emotional expression are screened out, and these keywords are marked as first emotion features. The first emotion features are then classified according to the emotion each one expresses, so as to distinguish positive from negative features: features associated with positive emotion are marked as positive emotion features, and features associated with negative emotion are marked as negative emotion features.
According to these marks, the proportion of positive emotion features among all emotion features is calculated to obtain the positive emotion intensity, and the proportion of negative emotion features is calculated to obtain the negative emotion intensity. Comparing the two intensities shows whether positive or negative features account for the larger share: when the positive emotion intensity is greater than the negative emotion intensity, the emotional tendency of the lecture text is positive; when the positive emotion intensity is smaller than the negative emotion intensity, the emotional tendency is negative.
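A minimal sketch of the intensity calculation just described, assuming the first emotion features have already been extracted and that the set of positive keywords is known (the tie-breaking to "neutral" is an assumption, since the equal-intensity case is not specified above):

```python
def emotion_tendency(features: list[str], positive_keywords: set[str]) -> tuple[float, float, str]:
    """Positive/negative emotion intensity as proportions of all first emotion
    features, and the resulting emotional tendency."""
    if not features:
        return 0.0, 0.0, "neutral"
    positive_count = sum(1 for f in features if f in positive_keywords)
    pos_intensity = positive_count / len(features)
    neg_intensity = 1.0 - pos_intensity
    if pos_intensity > neg_intensity:
        tendency = "positive"
    elif pos_intensity < neg_intensity:
        tendency = "negative"
    else:
        tendency = "neutral"   # equal case not specified in the text; assumption
    return pos_intensity, neg_intensity, tendency

# Example: three positive keywords and one negative keyword -> positive tendency.
print(emotion_tendency(["hope", "success", "growth", "fear"], {"hope", "success", "growth"}))
```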
Different emotional tendencies correspond to different lecture atmospheres. By evaluating the emotional tendency of the lecture text, the atmosphere of the lecture can be predicted and a lecture scene matching that atmosphere can be selected, improving the visual experience.
Step S3: and carrying out acoustic and behavioral analysis on the real-time speech data to obtain the characteristic vector of talent expression.
The real-time lecture data mainly comprise the audio data and non-verbal behavior data acquired in real time through the MR equipment while the presenter is speaking; they reflect the presenter's oral delivery.
The vectorization method of the talent expression comprises the following steps:
Step S31: analyzing the audio data based on a pre-constructed acoustic model to obtain acoustic characteristics; the acoustic model can analyze multiple dimensions of speech speed, volume, pause time and the like of a speaker to obtain acoustic features at least comprising pitch change values, rhythm stability and energy dynamic range.
The pitch change value can be determined from the variation in volume: the larger the volume variation, the larger the corresponding pitch change value. The rhythm stability can be determined from changes in speech rate and pause duration: large fluctuations in speech rate, or large differences between successive pauses, correspond to poor rhythm stability. The energy dynamic range is obtained by analyzing the sound intensity of the waveform corresponding to the audio data to determine the energy of the sound, and then measuring how that energy varies. The mapping between each acoustic feature and the audio data is captured by the pre-constructed acoustic model; after the acoustic model has been trained on a large number of training samples, audio data can be input into it to output the required acoustic features, so that the presenter's speaking state can be determined from the acoustic features and the lecture scene can be dynamically adjusted.
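The acoustic model itself is not specified; the sketch below only mirrors how the three features are characterized above (pitch change proxied by volume variation, rhythm stability degraded by speech-rate and pause fluctuations, energy dynamic range from the spread of frame energy). The frame-level inputs and the exact formulas are assumptions.

```python
import numpy as np

def acoustic_features(frame_energy: np.ndarray,
                      speech_rates: np.ndarray,
                      pause_durations: np.ndarray) -> dict:
    """Rough per-utterance acoustic features following the description above."""
    # Pitch change value: proxied by the variation in volume (frame energy).
    pitch_change_value = float(np.std(frame_energy))
    # Rhythm stability: larger swings in rate or pauses -> lower stability.
    rhythm_stability = 1.0 / (1.0 + float(np.std(speech_rates)) + float(np.std(pause_durations)))
    # Energy dynamic range: spread between the loudest and quietest frames.
    energy_dynamic_range = float(frame_energy.max() - frame_energy.min())
    return {"pitch_change_value": pitch_change_value,
            "rhythm_stability": rhythm_stability,
            "energy_dynamic_range": energy_dynamic_range}

print(acoustic_features(np.array([0.2, 0.8, 0.5]), np.array([2.5, 3.1, 2.8]), np.array([0.4, 0.9])))
```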
Step S32: and analyzing the nonverbal behavior data based on a pre-constructed behavior analysis model to obtain behavior characteristics.
The non-verbal behavior data in this embodiment are mainly motion data corresponding to the presenter's body movements, collected through sensors built into the AR glasses and the handle controllers of the MR equipment. The presenter's body movements can be analyzed through a pre-constructed behavior analysis model, which is obtained by learning from a large number of training samples; the trained model can then identify the presenter's body movements. Specific meanings can be assigned to certain movements in advance, for example a given movement can be pre-configured to automatically trigger an associated instruction, so that the corresponding scene adjustment is executed automatically. In this way the presenter can customize the scene effect at any time during the lecture, the lecture and the lecture scene stay matched to each other, and the presenter's operational flexibility and experience are improved.
Step S33: and carrying out vector conversion on the acoustic feature and the behavior feature to obtain a mouth talent expression feature vector comprising the acoustic feature vector and the behavior feature vector.
The features output by the acoustic model and the behavior analysis model are quantized to obtain an acoustic feature vector and a behavior feature vector. This converts the features from descriptive form into numeric form, so that the talent expression feature vector can be fed directly into the dynamic mapping relation for processing, improving compatibility.
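In practice this vector conversion can be as simple as concatenating the quantized features into one numeric vector; a minimal sketch (the field order and behavior fields are assumptions):

```python
import numpy as np

def talent_expression_vector(acoustic: dict, behavior: dict) -> np.ndarray:
    """Concatenate the acoustic feature vector and the behavior feature vector
    into the talent expression feature vector used by the dynamic mapping."""
    acoustic_vec = np.array([acoustic["pitch_change_value"],
                             acoustic["rhythm_stability"],
                             acoustic["energy_dynamic_range"]], dtype=float)
    behavior_vec = np.array(list(behavior.values()), dtype=float)
    return np.concatenate([acoustic_vec, behavior_vec])

print(talent_expression_vector(
    {"pitch_change_value": 0.25, "rhythm_stability": 0.6, "energy_dynamic_range": 0.6},
    {"gesture_openness": 0.7, "movement_frequency": 0.3}))
```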
Step S4: and matching the corresponding scene templates according to the lecture theme vector, the emotion tendency and the talent expression feature vector based on the pre-constructed dynamic mapping relation, and adjusting scene elements of the scene templates according to the real-time lecture data to obtain the real-time lecture scene picture.
The dynamic mapping relation among the lecture theme vector, the emotional tendency, the talent expression feature vector and the scene templates can be determined through a machine learning model; once the lecture theme vector, emotional tendency and talent expression feature vector of the presenter have been obtained, the matching scene template is determined according to this pre-constructed dynamic mapping relation.
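The internal form of the dynamic mapping relation is not disclosed; one simple way to realize such a mapping, sketched below under that assumption, is nearest-neighbour matching of the concatenated inputs against per-template prototype vectors, with the emotional tendency used as a filter. The template names and prototypes are hypothetical.

```python
import numpy as np

def match_scene_template(theme_vec, emotion_tendency, talent_vec, templates):
    """Pick the scene template whose prototype is closest to the current lecture state.
    `templates` maps template name -> (prototype_vector, emotional_tendency)."""
    query = np.concatenate([theme_vec, talent_vec])
    best_name, best_dist = None, float("inf")
    for name, (prototype, tendency) in templates.items():
        if tendency != emotion_tendency:        # keep only templates matching the tendency
            continue
        dist = float(np.linalg.norm(query - prototype))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Hypothetical templates; each would carry default scene and interaction elements.
templates = {
    "bright_keynote_stage": (np.zeros(8), "positive"),
    "solemn_memorial_hall": (np.ones(8), "negative"),
}
print(match_scene_template(np.zeros(5), "positive", np.zeros(3), templates))
```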
Each scene template is configured with default scene elements and interaction elements, so after a template has been selected through the dynamic mapping relation, a basic lecture scene can be displayed from its default elements. The scene elements include the background environment, scene lighting, sound effects and the like. The interaction elements determine how the presenter interacts with the virtual lecture scene: for example, the system performs the corresponding interaction operation after the presenter makes a specified body movement, or after the presenter issues a specified voice command.
In addition, the scene elements of the scene template can be individually adjusted according to the analysis result of the real-time speech data, so that the speech scene is closer to the speech state of a presenter, and the effect of immersive speech is realized.
The method for adjusting the scene elements of the scene template according to the real-time speech data comprises the following steps:
Step S41: performing appointed action recognition on the nonverbal behavior data, and adjusting scene elements corresponding to the appointed action according to a preset rule under the condition that the appointed action is recognized to obtain target scene elements; wherein the scene elements include background, illumination, sound effects, etc.
Step S42: and generating a real-time lecture scene picture according to the target scene element.
When the scene template specifies action interaction as the interaction mode between the presenter and the virtual lecture scene, control instructions can be sent to the lecture scene through the non-verbal behaviors the presenter makes during the lecture. The presenter can define preset rules, including the specified actions and the control instruction each action matches. When analysis of the non-verbal behavior data shows that the presenter is performing a specified action, the corresponding control instruction is invoked and executed, changing the scene elements of the scene template, so that the background, lighting, sound effects and so on of the lecture scene can be adjusted to the presenter's body language. In actual use, sensors built into the AR glasses and handle controllers let the system capture the presenter's gestures and body language; this non-verbal communication can be used to enhance the interactivity of the virtual lecture scene, such as changing the slides shown in the scene with a gesture, or adjusting scene elements such as background, lighting and sound effects. The personally adjusted scene elements are recorded as target scene elements, and the background, lighting or sound effects of the lecture scene are then adjusted on top of the scene template according to these target scene elements to obtain the real-time lecture scene picture.
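A sketch of the preset-rule lookup for specified actions; the action names and the element adjustments in the rule table are purely illustrative assumptions.

```python
# Hypothetical preset rules: specified action -> scene element adjustments.
PRESET_RULES = {
    "swipe_right":     {"background": "next_slide"},
    "raise_both_arms": {"lighting": "brighten", "sound_effect": "applause"},
    "point_forward":   {"lighting": "spotlight_on_speaker"},
}

def adjust_scene_elements(recognized_action: str, scene_elements: dict) -> dict:
    """If the recognized non-verbal behavior is a specified action, apply the
    matching rule and return the target scene elements."""
    rule = PRESET_RULES.get(recognized_action)
    return {**scene_elements, **rule} if rule else scene_elements

print(adjust_scene_elements(
    "swipe_right",
    {"background": "slide_3", "lighting": "normal", "sound_effect": "none"}))
```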
In some embodiments, in addition to adjusting the lecture scene according to the presenter's specified actions, the display effect of the lecture scene can be adjusted according to the audio data produced while the presenter is speaking. The specific steps are as follows:
Presetting a rule, wherein the rule comprises a control instruction matched with a designated emotion state;
When the interaction mode between the lecturer and the virtual lecture scene is determined to be voice interaction according to the scene template, carrying out emotion analysis on the audio data based on the natural language analysis model to obtain a second emotion feature;
and determining an emotion state according to the second emotion characteristics, determining a corresponding control instruction according to the emotion state based on a preset rule, adjusting scene elements according to the control instruction, and updating a real-time lecture scene picture according to the adjusted scene elements.
For example, a preset rule may state that when the presenter's emotional state changes from calm to angry, the corresponding control instruction raises the scene volume. During the lecture, the presenter's audio data are collected in real time and the presenter's emotion is determined with the natural language analysis model; when the preset emotion change is triggered, a control instruction to increase the scene volume is generated, and the default volume level of the lecture scene is raised to the specified level according to that instruction. The display effect of the lecture scene is thereby matched and synchronized with the presenter's emotion, giving the presenter an on-the-spot lecture experience.
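The volume-raising example can be expressed as a small transition table keyed on (previous emotional state, current emotional state); the state labels and volume levels below are assumptions.

```python
# Hypothetical rules: (previous_state, current_state) -> control instruction.
TRANSITION_RULES = {
    ("calm", "angry"): {"volume_level": "high"},
    ("angry", "calm"): {"volume_level": "default"},
}

def on_emotion_change(prev_state: str, curr_state: str, scene_elements: dict) -> dict:
    """Apply the control instruction matched to a preset emotion transition."""
    instruction = TRANSITION_RULES.get((prev_state, curr_state))
    return {**scene_elements, **instruction} if instruction else scene_elements

print(on_emotion_change("calm", "angry", {"volume_level": "default", "lighting": "warm"}))
```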
It should be noted that the training process of the model is already disclosed in the prior art and is not described in detail here; what is claimed in the present application is the application of the model, not the model itself.
The adjusted lecture scene is rendered virtually through MR technology to generate a virtual lecture scene, which is superimposed on the user's field-of-view picture displayed by the VR headset or AR glasses. The VR headset provides a high-resolution, wide-field visual experience that fully immerses the presenter in the virtual lecture scene: through the headset the presenter can see detailed virtual stages, audiences and other visual elements, as if the lecture environment were real.
In some embodiments, real-time feedback can be provided based on the presenter's performance during practice. Specifically, the acoustic features, emotion features, behavior features and other features obtained from the analyses above can be turned into corresponding display pictures and shown through the AR glasses or VR headset. In addition, audience feedback data can be collected while the presenter speaks and rendered as a corresponding virtual picture, so that the presenter can understand his or her performance in real time across multiple dimensions such as voice quality, body language and audience effect; these analysis results help the presenter recognize strengths and areas for improvement.
For example, if behavioral analysis shows that the presenter's body language is stiff, a prompt can be displayed through the AR glasses or VR headset to encourage the presenter to use more open gestures.
In one embodiment, the method further comprises: calculating a comprehensive talent index score according to the talent expression feature vector and preset weight coefficients; generating real-time feedback suggestions from the comprehensive talent index score and a preset feedback template library based on a natural language generation algorithm; and rendering the real-time feedback suggestions as a picture and superimposing it on the user's field of view for display and interaction. It should be noted that natural language generation algorithms are already disclosed in the prior art and are not described again here; what is claimed in the present application is the application of the algorithm, not the algorithm itself.
After the presenter's various expression features have been obtained through the analyses above, each expression feature vector is multiplied by its corresponding weight coefficient and the products are summed to obtain the presenter's comprehensive talent index score, which is displayed on the AR glasses or VR headset so that the presenter can see how the lecture is going.
In addition, based on big-data statistics, different expression feature vectors are matched in advance with corresponding feedback suggestions, and different comprehensive talent index scores are likewise matched with corresponding feedback suggestions, forming a feedback template library. In actual use, a matching real-time feedback suggestion is found according to the presenter's talent expression feature vector and comprehensive talent index score; the suggestion is rendered virtually with MR technology into a corresponding rendering picture, called a virtual suggestion picture, which is superimposed on the user's field-of-view picture displayed by the AR glasses or VR headset, so that the presenter can adjust the lecture accordingly and improve his or her presentation skills.
Specifically, the system receives audio signals of a presenter in real time, and extracts high-level acoustic feature vectors comprising features such as pitch variation, rhythm stability and the like through an acoustic analysis model.
Meanwhile, the system processes the speech text, and utilizes natural language processing technology to evaluate emotion feature vectors, wherein the vectors cover positive emotion intensity, negative emotion intensity, emotion diversity and the like.
Combining the acoustic feature vector and the emotion feature vector, the system calculates the comprehensive talent expression index in a weighted-sum manner using a pre-constructed scoring model Mscore. Different weight coefficients are used to balance the influence of the acoustic and emotion features in the overall score.
According to the comprehensive talent expression index and a preset feedback template library, the system dynamically generates targeted real-time feedback suggestions through a Natural Language Generation (NLG) technology. The various characteristics and real-time feedback suggestions obtained by the analysis are directly displayed in the field of view of the user through the MR technology, so that a presenter can be helped to adjust the lecture skill in real time.
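A minimal sketch of the weighted scoring and template-based feedback described above; the weight values, score bands and template wording are illustrative assumptions, and the natural language generation step is reduced to simple string formatting.

```python
import numpy as np

def composite_talent_score(acoustic_vec, emotion_vec, w_acoustic=0.6, w_emotion=0.4):
    """Weighted combination of acoustic and emotion features (Mscore sketch)."""
    return w_acoustic * float(np.mean(acoustic_vec)) + w_emotion * float(np.mean(emotion_vec))

# Hypothetical feedback template library keyed by score band.
FEEDBACK_TEMPLATES = [
    (0.8, "Excellent pacing and emotional colour; keep this energy (score {score:.2f})."),
    (0.5, "Delivery is steady; try widening your pitch range (score {score:.2f})."),
    (0.0, "Slow down, pause deliberately, and project your voice (score {score:.2f})."),
]

def realtime_feedback(score: float) -> str:
    """Return the first template whose score band the score falls into."""
    for threshold, template in FEEDBACK_TEMPLATES:
        if score >= threshold:
            return template.format(score=score)
    return ""

score = composite_talent_score(np.array([0.7, 0.9, 0.6]), np.array([0.8, 0.4]))
print(realtime_feedback(score))
```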
In addition, the presenter can also perform finer interaction operations, such as adjusting the position of objects in the virtual environment or changing the presentation content, through the multi-function handle controller. The keys and the touch pad of the handle provide rich interaction modes for a presenter, and the intuitiveness and flexibility of control are enhanced.
In the whole lecture preparation process, the MR technology realizes seamless fusion of virtual content and the real world. Not only can the lecturer practice the lecture in the virtual environment, but also the spatial sense and the depth sense in the real environment can be felt, so that the practice effect is more similar to the real lecture situation.
After practice ends, the presenter can give feedback on the scene settings based on his or her experience. The system uses this feedback as new input to further optimize the dynamic mapping model and the scene adjustment strategy, so as to provide more accurate and personalized scene configurations in future lecture practice. In addition, the system periodically reviews the feedback and presentation data of all lectures and optimizes the dynamic mapping model and the real-time adjustment strategy with a machine learning algorithm, improving the adaptability and accuracy of the scene generation engine. Through these calculations and real-time feedback mechanisms, the system can provide highly personalized and timely talent-enhancement guidance, helping the presenter perform better in actual lectures. It should be noted that machine learning algorithms are disclosed in the prior art and are not repeated here; the application claims the application of the algorithm rather than the algorithm itself.
In some embodiments, a multi-user virtual collaboration space is implemented through MR technology, allowing a presenter, coach and team member to co-enter the same virtual presentation scenario. In this space, users can see each other's avatars, communicate speech and gestures, share lecture material, and even perform role playing and real-time feedback on a virtual stage. Specifically, when a presenter, a coach and a team member enter a virtual presentation scene, the virtual avatar and the initial position of each user can be initialized through a preset algorithm, so that all participants are ensured to be synchronous in the same virtual space. And capturing voice and motion data of each user in real time, processing the data through a designated voice analysis algorithm and a designated motion analysis algorithm, synchronously updating the virtual avatar state of each user, and keeping the consistency and real-time of interaction of all users in the virtual collaboration space.
Each user's voice signal is converted into text and analyzed for emotion, so that the system understands what the user is communicating and the user's emotional tendency. According to the user's talent dimension indexes and emotion analysis results, the interactive effects in the virtual environment, such as lighting, sound effects and the reactions of the virtual audience, are dynamically adjusted to enhance immersion and interactivity. The system also collects interaction data among the users, evaluates team collaboration efficiency from their interaction frequency, and generates real-time feedback suggestions from the evaluation results, helping the team improve its collaboration strategy and communication efficiency.
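As a rough sketch of the collaboration-efficiency evaluation, one could score the group by its per-user interaction rate; the squashing function and the example figures are assumptions.

```python
def team_collaboration_efficiency(interaction_counts: dict[str, int], session_minutes: float) -> float:
    """Average interactions per user per minute, squashed into [0, 1)."""
    if not interaction_counts or session_minutes <= 0:
        return 0.0
    per_user_rate = sum(interaction_counts.values()) / (len(interaction_counts) * session_minutes)
    return per_user_rate / (per_user_rate + 1.0)

print(team_collaboration_efficiency({"presenter": 12, "coach": 8, "teammate": 5}, session_minutes=30))
```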
Meanwhile, aiming at the problems of network delay and data synchronization possibly occurring in the multi-user virtual collaboration space, the system implements a dynamic network adjustment strategy and a data interpolation technology, and optimizes the real-time interaction experience between users. And continuously collecting user interaction data and feedback, continuously learning and optimizing the system by using a deep learning technology, and improving the accuracy, response speed and user satisfaction of the system.
The present embodiments are directed to providing a comprehensive, immersive lecture preparation and collaboration environment for lectures and teams by fusing Virtual Reality (VR) and Augmented Reality (AR) technologies. The main functions include:
1. Immersive lecture scene construction: the user is allowed to customize the virtual lecture scene according to the lecture content and style, including stage layout, background, interactive elements, etc.
2. Real-time talent skill assistance: the speech performance indexes such as speech speed, volume and the like are displayed in real time through the AR technology, so that a presenter can be helped to adjust the speech performance in real time.
3. Multiuser virtual collaboration: and a plurality of users are supported to cooperate in the same virtual environment to carry out speech design, exercise and feedback communication.
4. Dynamic scene adaptation: and dynamically adjusting the lecture scene according to the lecture content and the user feedback so as to optimally match the emotion and rhythm of the lecture.
Example two
The embodiment provides a dynamic generation system of a lecture scene, which executes the dynamic generation method of the lecture scene of the first embodiment; specifically, the system includes:
The first analysis module is used for acquiring a speech text, carrying out semantic analysis on the speech text and obtaining a speech topic vector and emotion tendencies;
The second analysis module is used for acquiring real-time speech data, wherein the real-time speech data comprises audio data and nonverbal behavior data; performing acoustic and behavioral analysis on the real-time speech data to obtain a talent expression feature vector;
and the scene adjustment module is used for matching the corresponding scene templates according to the speech theme vector, the emotion tendency and the talent expression feature vector based on the pre-constructed dynamic mapping relation, and adjusting scene elements of the scene templates according to the real-time speech data to obtain a real-time speech scene picture.
The functions of each module in the system of the embodiment of the present invention may be referred to the corresponding descriptions in the above method, and will not be repeated here.
Example III
Fig. 2 shows a block diagram of an electronic device according to an embodiment of the invention. As shown in fig. 2, the electronic device includes: memory 100 and processor 200, and memory 100 stores a computer program executable on processor 200. The processor 200 implements the lecture scene dynamic generation method in the above-described embodiment when executing the computer program. The number of memory 100 and processors 200 may be one or more.
The electronic device further includes:
the communication interface 300 is used for communicating with external equipment and performing data interaction transmission.
If the memory 100, the processor 200, and the communication interface 300 are implemented independently, they may be connected to each other and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in Fig. 2, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 100, the processor 200, and the communication interface 300 are integrated on a chip, the memory 100, the processor 200, and the communication interface 300 may communicate with each other through internal interfaces.
The embodiment of the invention provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the embodiment of the invention.
The embodiment of the invention also provides a chip, which comprises a processor and is used for calling the instructions stored in the memory from the memory and running the instructions stored in the memory, so that the communication equipment provided with the chip executes the method provided by the embodiment of the invention.
The embodiment of the invention also provides a chip, which comprises: the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the invention.
It should be appreciated that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a non-volatile random access memory. The memory may be volatile or non-volatile, or may include both volatile and non-volatile memory. The non-volatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may include random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the present invention are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that various changes and substitutions are possible within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A dynamic generation method of a speech scene is characterized by comprising the following steps:
acquiring speech text and real-time speech data, wherein the real-time speech data comprises audio data and nonverbal behavior data;
carrying out semantic analysis on the speech text to obtain a lecture theme vector and emotion tendencies;
performing acoustic and behavioral analysis on the real-time speech data to obtain a talent expression feature vector;
and matching the corresponding scene templates according to the lecture theme vector, the emotion tendency and the talent expression feature vector based on a pre-constructed dynamic mapping relation, and adjusting scene elements of the scene templates according to the real-time lecture data to obtain a real-time lecture scene picture.
2. The method of claim 1, wherein the performing acoustic and behavioral analysis on the real-time speech data comprises:
Analyzing the audio data based on a pre-constructed acoustic model to obtain acoustic characteristics; the acoustic features include pitch change value, cadence stability, and energy dynamic range;
analyzing the nonverbal behavior data based on a pre-constructed behavior analysis model to obtain behavior characteristics;
and carrying out vector conversion on the acoustic features and the behavior features to obtain the talent expression feature vector comprising the acoustic feature vector and the behavior feature vector.
3. The method for dynamically generating a speech scene according to claim 1, wherein the method for analyzing emotional tendency comprises:
carrying out emotion analysis on the lecture text based on a natural language analysis model to obtain a first emotion feature;
classifying the first emotion characteristics to obtain positive emotion characteristics and negative emotion characteristics;
And calculating positive emotion intensity and negative emotion intensity according to the positive emotion characteristics and the negative emotion characteristics, and estimating the emotion tendency through the positive emotion intensity and the negative emotion intensity.
4. The method for dynamically generating a lecture scene according to claim 1, wherein said adjusting scene elements of the scene template according to the real-time lecture data includes:
performing specified-action recognition on the nonverbal behavior data, and, when a specified action is recognized, adjusting the scene element corresponding to the specified action according to a preset rule to obtain a target scene element; wherein the scene elements include a background, illumination, and sound effects;
And generating the real-time lecture scene picture according to the target scene element.
5. The method for dynamically generating a lecture scene according to claim 1, wherein said adjusting scene elements of the scene template according to the real-time lecture data includes:
carrying out emotion analysis on the audio data based on a natural language analysis model to obtain a second emotion feature;
adjusting the scene element according to the second emotion characteristics to obtain an adjusted scene element; wherein the scene elements include a background, illumination, and sound effects;
And generating the real-time lecture scene picture according to the adjusted scene element.
6. The method for dynamically generating a speech scene according to claim 1, further comprising:
And rendering the real-time lecture scene picture through an MR rendering algorithm to obtain a virtual lecture scene picture which is used for being overlapped on the view picture of the user.
7. The method for dynamically generating a speech scene according to claim 1, further comprising:
Calculating a comprehensive talent index score according to the talent expression feature vector and a preset weight coefficient;
Generating real-time feedback suggestions according to the comprehensive talent index scores and a preset feedback template library based on a natural language generation algorithm;
And carrying out virtualization rendering treatment on the real-time feedback advice through an MR rendering algorithm to obtain a virtual advice picture which is used for being overlapped on the visual field picture of the user.
8. A speech scene dynamic generation system, characterized in that the speech scene dynamic generation method according to any one of claims 1 to 7 is performed.
9. An electronic device, comprising a processor and a memory, wherein the memory stores instructions, and the instructions are loaded and executed by the processor to implement the method for dynamically generating a lecture scene according to any one of claims 1 to 7.
10. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the method for dynamically generating a lecture scene according to any one of claims 1 to 7 is implemented.
CN202410303010.6A 2024-03-18 Dynamic generation method, system, equipment and medium for lecture scene Active CN117893718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410303010.6A CN117893718B (en) 2024-03-18 Dynamic generation method, system, equipment and medium for lecture scene

Publications (2)

Publication Number Publication Date
CN117893718A true CN117893718A (en) 2024-04-16
CN117893718B CN117893718B (en) 2024-06-07

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218413A (en) * 2021-11-24 2022-03-22 星际互娱(北京)科技股份有限公司 Background system for video playing and video editing
CN116484318A (en) * 2023-06-20 2023-07-25 新励成教育科技股份有限公司 Lecture training feedback method, lecture training feedback device and storage medium
CN117541444A (en) * 2023-12-04 2024-02-09 新励成教育科技股份有限公司 Interactive virtual reality talent expression training method, device, equipment and medium
CN117666790A (en) * 2023-12-06 2024-03-08 新励成教育科技股份有限公司 Immersive talent expression training system based on brain-computer interface technology
CN117541445A (en) * 2023-12-11 2024-02-09 新励成教育科技股份有限公司 Talent training method, system, equipment and medium for virtual environment interaction
CN117709311A (en) * 2024-02-05 2024-03-15 新励成教育科技股份有限公司 Cloud-based lecture manuscript management method, device, equipment and storage medium
CN117711444A (en) * 2024-02-05 2024-03-15 新励成教育科技股份有限公司 Interaction method, device, equipment and storage medium based on talent expression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
姜雷 et al.: "Knowledge-graph analysis of the evolution of brain-computer interface research and trends in educational applications, based on SCI and SSCI journal papers from 1985-2018", Journal of Distance Education (《远程教育杂志》), 17 July 2018 (2018-07-17), pages 29-40 *
李翔: "Research and implementation of an online motor-imagery-based brain-computer interface system", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》), 15 July 2013 (2013-07-15), pages 1-69 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant