CN113450804A - Voice visualization method and device, projection equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113450804A
CN113450804A (application No. CN202110697947.2A)
Authority
CN
China
Prior art keywords
target
projection
special effect
user
visualization
Prior art date
Legal status
Pending
Application number
CN202110697947.2A
Other languages
Chinese (zh)
Inventor
李禹
曹琦
王骁逸
张聪
胡震宇
Current Assignee
Shenzhen Huole Science and Technology Development Co Ltd
Original Assignee
Shenzhen Huole Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Huole Science and Technology Development Co Ltd filed Critical Shenzhen Huole Science and Technology Development Co Ltd
Priority to CN202110697947.2A priority Critical patent/CN113450804A/en
Publication of CN113450804A publication Critical patent/CN113450804A/en
Pending legal-status Critical Current

Classifications

    • G10L 15/26 — Speech recognition; speech-to-text systems
    • G06F 16/34 — Information retrieval of unstructured textual data; browsing; visualisation therefor
    • G10L 21/10 — Transformation of speech into a non-audible representation; transforming into visible information
    • G10L 25/63 — Speech or voice analysis specially adapted for estimating an emotional state
    • H04N 9/31 — Projection devices for colour picture display, e.g. using electronic spatial light modulators [ESLM]

Abstract

The disclosure provides a voice visualization method and device, a projection device and a computer-readable storage medium. The present disclosure includes: acquiring collected data of a user, and converting user voice data in the collected data into text information; acquiring a target emotion category corresponding to the collected data; acquiring a target character special effect corresponding to the user voice data according to the target emotion category; and performing visualization processing on the text information according to the target character special effect, and displaying the projection information obtained after the visualization processing on a projection surface. Therefore, the voice visualization method in the embodiments of the disclosure can visualize the text information and can set the visualization special effect of the text information according to the user's real-time target emotion category, thereby enhancing the user's sense of immersion during projection and increasing the flexibility of projection.

Description

Voice visualization method and device, projection equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of voice data processing, and in particular, to a voice visualization method and apparatus, a projection device, and a computer-readable storage medium.
Background
With the popularization of the Internet and the rapid spread of digital audio, users can play various audio data (such as audio novels, songs, etc.) through devices such as mobile phones and tablet computers. In order to improve the diversity of the information displayed while audio data is played, the related art extracts features of the audio data and represents them visually through image rendering, so that the picture changes as the audio data changes; in other words, the listening experience is expressed in a visual language.
After existing projection equipment converts voice into visualized text, the visualization special effect of the text cannot be changed in real time according to the actual scene, and the displayed picture is monotonous and inflexible.
Disclosure of Invention
The present disclosure provides a voice visualization method and device, a projection device and a computer-readable storage medium, and aims to solve the problem that, after existing projection equipment visualizes voice as text, the visualization special effect of the text cannot be changed in real time according to the actual scene, so that the displayed picture is monotonous and not flexible enough.
In a first aspect, the present disclosure provides a method of speech visualization, the method comprising:
acquiring collected data of a user, and converting user voice data in the collected data into text information;
acquiring a target emotion category corresponding to the acquired data;
acquiring a target character special effect corresponding to the user voice data according to the target emotion category;
and performing visualization processing on the character information according to the target character special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, the obtaining of the target emotion classification corresponding to the collected data includes one or more of the following methods:
analyzing user image data in the collected data to obtain the target emotion category;
analyzing the user physiological parameters in the collected data, and determining the target emotion types according to the user physiological parameters;
analyzing the user voice data to obtain the target emotion category, wherein the user voice data comprises at least one audio parameter of volume, tone and speed;
and extracting emotion keywords in the text information, and determining the target emotion category according to the emotion keywords.
In some embodiments, before performing visualization processing on the text information according to the target text special effect and displaying projection information obtained after the visualization processing in a projection surface, the method further includes:
extracting target graphic keywords in the text information;
determining a target graphic special effect corresponding to the target graphic keyword, wherein the target graphic special effect comprises at least one of a weather special effect and an action special effect;
the process of carrying out visualization processing on the character information according to the target character special effect and displaying projection information obtained after visualization processing in a projection plane comprises the following steps:
and performing visualization processing on the character information according to the target character special effect and the target graphic special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, before performing visualization processing on the text information according to the target text special effect and displaying projection information obtained after the visualization processing in a projection surface, the method further includes:
extracting target sound keywords in the text information;
determining a target sound special effect corresponding to the target sound keywords;
the process of carrying out visualization processing on the character information according to the target character special effect and displaying projection information obtained after visualization processing in a projection plane comprises the following steps:
and performing visualization processing on the character information according to the target character special effect and the target sound special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, before performing visualization processing on the text information according to the target text special effect and displaying projection information obtained after the visualization processing in a projection surface, the method further includes:
detecting whether a microphone function is enabled;
if the microphone function is started, acquiring text information in the user voice data, and querying, over a network connection, articles or lyrics corresponding to the text information;
the process of carrying out visualization processing on the character information according to the target character special effect and displaying projection information obtained after visualization processing in a projection plane comprises the following steps:
and performing visualization processing on the article or the lyrics according to the target character special effect, and displaying projection information obtained after visualization processing in a projection plane.
In some embodiments, the visualizing the text information according to the target text special effect and displaying projection information obtained after the visualization processing in a projection plane includes:
detecting whether a microphone function is enabled;
if the microphone function is started, acquiring text information in the user voice data, and querying, over a network connection, articles or lyrics corresponding to the text information;
and performing visualization processing on the article or the lyrics according to the target character special effect, and displaying projection information obtained after visualization processing in a projection plane.
In some embodiments, the obtaining user speech data comprises:
acquiring original audio data and extracting target voiceprint data in the original audio data;
inquiring a preset historical voiceprint database, and judging whether historical voiceprint data matched with the target voiceprint data exist or not;
if the historical voiceprint data matched with the target voiceprint data exist in the historical voiceprint database, the original audio data are used as user voice data.
in a second aspect, the present disclosure provides a speech visualization apparatus comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring the acquisition data of a user and converting the user voice data in the acquisition data into text information;
the emotion acquisition unit is used for acquiring a target emotion category corresponding to the acquired data;
the special effect obtaining unit is used for obtaining a target character special effect corresponding to the user voice data according to the target emotion type;
and the display unit is used for carrying out visualization processing on the character information according to the target character special effect and displaying the projection information obtained after visualization processing in a projection surface.
In some embodiments, the emotion acquisition unit is further configured to implement one or more of the following methods:
analyzing user image data in the collected data to obtain the target emotion category;
analyzing the user physiological parameters in the collected data, and determining the target emotion types according to the user physiological parameters;
analyzing the user voice data to obtain the target emotion category, wherein the user voice data comprises at least one audio parameter of volume, tone and speed;
and extracting emotion keywords in the text information, and determining the target emotion category according to the emotion keywords.
In some embodiments, the speech visualization apparatus further comprises a target graphical effect determination unit configured to:
extracting target graphic keywords in the text information;
determining a target graphic special effect corresponding to the target graphic keyword, wherein the target graphic special effect comprises at least one of a weather special effect and an action special effect;
the display unit is further configured to:
and performing visualization processing on the character information according to the target character special effect and the target graphic special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, the speech visualization apparatus further comprises a target sound effect determination unit, the target sound effect determination unit being further configured to:
extracting target sound keywords in the text information;
determining a target sound special effect corresponding to the target sound keywords;
the display unit is further configured to:
and performing visualization processing on the character information according to the target character special effect and the target sound special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, the speech visualization apparatus further comprises an instruction execution unit for:
inquiring a preset instruction set, and judging whether a target instruction matched with the text information exists in the preset instruction set;
if a target instruction matched with the text information exists in the preset instruction set, executing the target instruction corresponding to the text information;
and when the target instruction is executed, stopping displaying the projection information.
In some embodiments, the voice visualization apparatus further comprises a networked acquisition unit configured to:
detecting whether a microphone function is enabled;
if the microphone function is started, acquiring text information in the user voice data, and querying, over a network connection, articles or lyrics corresponding to the text information;
the display unit is further configured to:
and performing visualization processing on the article or the lyrics according to the target character special effect, and displaying projection information obtained after visualization processing in a projection plane.
In some embodiments, the obtaining unit is further configured to:
acquiring original audio data and extracting target voiceprint data in the original audio data;
inquiring a preset historical voiceprint database, and judging whether historical voiceprint data matched with the target voiceprint data exist or not;
and if the historical voiceprint data matched with the target voiceprint data exist in the historical voiceprint database, taking the original audio data as the user voice data.
In a third aspect, the present disclosure also provides a projection device, which includes a processor and a memory, where the memory stores a computer program, and the processor executes the steps in any one of the voice visualization methods provided by the present disclosure when calling the computer program in the memory.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which is loaded by a processor to perform the steps of the speech visualization method.
In summary, the present disclosure includes: acquiring collected data of a user, and converting user voice data in the collected data into text information; acquiring a target emotion category corresponding to the collected data; acquiring a target character special effect corresponding to the user voice data according to the target emotion category; and performing visualization processing on the text information according to the target character special effect, and displaying the projection information obtained after the visualization processing on a projection surface. Therefore, the voice visualization method in the embodiments of the disclosure can visualize the text information and can set the visualization special effect of the text information according to the user's real-time target emotion category, thereby enhancing the user's sense of immersion during projection and increasing the flexibility of projection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a speech visualization method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of a method for speech visualization provided in embodiments of the present disclosure;
FIG. 3 is a schematic flow chart of acquiring projection information provided in the embodiments of the present disclosure;
FIG. 4 is a schematic flow chart of another embodiment of the present disclosure for acquiring projection information;
FIG. 5 is a flow diagram of executing a target instruction provided in an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart illustrating a process for obtaining projection information according to information obtained through networking, according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an embodiment of a speech visualization apparatus provided in an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an embodiment of a projection device provided in an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
In the description of the embodiments of the present disclosure, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicit indication of the number of technical features indicated. Thus, features defined as "first", "second", may explicitly or implicitly include one or more of the described features. In the description of the embodiments of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
The following description is presented to enable any person skilled in the art to make and use the disclosure. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. In other instances, well-known processes have not been described in detail so as not to obscure the description of the embodiments of the present disclosure with unnecessary detail. Thus, the present disclosure is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed in the embodiments of the present disclosure.
The embodiment of the disclosure provides a voice visualization method and device, projection equipment and a computer readable storage medium. Wherein the speech visualization means may be integrated in the projection device.
First, before describing the embodiments of the present disclosure, the following description will be given about the related contents of the embodiments of the present disclosure with respect to the application context.
A projector is a device that can project images or videos onto a curtain. With the development of the technology, in order to enhance the interactivity between the user and the projection, external devices such as a microphone and a camera have gradually been added to the projector to intelligently obtain user-defined input information, so that the input information can be projected onto the curtain as an image or a video. In order to make the projected content more interesting, special effects can be added to elements in the projected content. When a special effect is generated, the dynamic generation parameters related to the special effect can include a change period and a change amplitude, where the change period refers to the time required for the special effect to complete one cycle. For convenience of explanation, the change amplitude is illustrated below. For example, suppose the projection content is an image projected onto a curtain, the elements in the image are the three characters of the phrase "dajiahao" ("hello everyone"), and the special effect of each element is set to shift left by n centimeters and then shift right by n centimeters back to its original position within one period; the change amplitude then refers to these n centimeters.
The main execution body of the speech visualization method according to the embodiment of the present disclosure may be the speech visualization device provided in the embodiment of the present disclosure, or different types of projection devices such as a server device, a physical host, or a User Equipment (UE) integrated with the speech visualization device, where the speech visualization device may be implemented in a hardware or software manner, and the UE may specifically be a terminal device such as a smart phone, a tablet computer, a notebook computer, a palm computer, a desktop computer, or a Personal Digital Assistant (PDA).
The projection device may operate in a single mode or may operate in a cluster mode.
Referring to fig. 1, fig. 1 is a scene schematic diagram of a speech visualization system provided by an embodiment of the present disclosure. The voice visualization system may include a projection device 100, and a voice visualization apparatus is integrated in the projection device 100.
In addition, as shown in fig. 1, the speech visualization system may further include a memory 200 for storing data, such as storing text data.
It should be noted that the scene schematic diagram of the speech visualization system shown in fig. 1 is merely an example, and the speech visualization system and the scene described in the embodiment of the present disclosure are for more clearly illustrating the technical solution of the embodiment of the present disclosure, and do not form a limitation on the technical solution provided in the embodiment of the present disclosure.
In the following, a speech visualization method provided by an embodiment of the present disclosure is described, where a projection device is used as an execution subject, and the execution subject will be omitted in subsequent embodiments of the method for simplifying and facilitating the description.
Referring to fig. 2, fig. 2 is a schematic flowchart of a speech visualization method provided by an embodiment of the present disclosure. It should be noted that, although a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein. The speech visualization method may specifically include the following steps 201-204, wherein:
201. acquiring the collected data of a user, and converting the user voice data in the collected data into text information.
The collected data refers to user data acquired by the projection equipment through an acquisition peripheral. Illustratively, the collected data may include user voice data, user physiological parameters, image data of the user, and the like. For example, physiological data such as the user's heartbeat and blood pressure can be collected through a Bluetooth-connected bracelet, and image data of the user can be collected through a Bluetooth-connected external camera.
The following describes in detail how the user voice data is acquired and what it contains:
After the projection equipment is started, the voice data of the user is collected through an external acquisition device. The embodiment of the disclosure does not limit the acquisition device; the projection equipment may be connected in advance to an external microphone of the moving-coil, condenser, electret or recently emerging silicon type, and the sound emitted when the user speaks is acquired through the microphone. After the microphone collects the sound, the sound is transmitted to the projection equipment as user voice data to be analyzed and stored.
The user speech data may include various information such as audio parameters and text information.
An audio parameter refers to a sound-related feature; for example, the user voice data may include the volume at which the user speaks, the tone in which the user speaks, and several other sound-related features. Different states of the user can be obtained by analyzing different audio parameters, so when a particular state needs to be obtained, the user voice data can be analyzed according to the actual application scenario to obtain the corresponding audio parameters.
On the other hand, the text information refers to the corresponding text obtained by performing speech recognition on the user voice data. For example, when the user sings a song using a microphone externally connected to the projection apparatus and the lyric of the song is "dajiahao" ("hello everyone"), the projection apparatus can obtain the audio corresponding to "dajiahao", and the text information "hello everyone" can be obtained by performing text recognition on that audio. Specifically, the projection device may perform intelligent character recognition on "dajiahao" in the user voice data through an automatic speech recognition (ASR) module, and determine, according to the relationship between audio parameters in the user voice data, the context and the like, that the text information is the correct phrase rather than an incorrect homophone.
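For illustration only, the following Python sketch shows how a projection device might wrap a generic speech recognition back end to turn microphone audio into text information; the AsrEngine interface, the transcribe method and the 0.5 confidence threshold are assumptions of this sketch and are not defined in the disclosure.

```python
from dataclasses import dataclass

@dataclass
class SpeechResult:
    text: str          # recognized text information, e.g. "dajiahao" -> "hello everyone"
    confidence: float  # recognition confidence in [0, 1]

class AsrEngine:
    """Placeholder for the device's automatic speech recognition (ASR) module."""
    def transcribe(self, pcm_frames: bytes, sample_rate: int) -> SpeechResult:
        raise NotImplementedError  # supplied by the concrete ASR back end

def voice_to_text(engine: AsrEngine, pcm_frames: bytes, sample_rate: int = 16000) -> str:
    """Convert user voice data into text information for later visualization."""
    result = engine.transcribe(pcm_frames, sample_rate)
    # Low-confidence results are discarded instead of being projected as-is
    # (the 0.5 threshold is an assumption of this sketch).
    return result.text if result.confidence >= 0.5 else ""
```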
Further, in order to prevent the projection device from mistaking ambient noise for user voice data and affecting the projection result, the user voice data can first be distinguished and screened according to the volume of the sound when it is collected. For example, when user voice data is collected, sounds whose volume is lower than a preset decibel value are screened out of the collected sound. Because a person normally speaks at 40 to 60 decibels, sounds with a volume lower than 30 decibels can be filtered out as mistakenly collected ambient noise, and sounds with a volume greater than or equal to 30 decibels are taken as user voice data.
In addition, different screening modes can be preset in the projection equipment, with a different noise-filtering decibel threshold in each screening mode, so that the user can adjust the filtering threshold for different scenes. The decibel threshold for screening noise is used to judge, when user voice data is collected, any sound below the threshold of the current screening mode as noise. For example, three screening modes, high decibel, medium decibel and low decibel, with thresholds of 60, 40 and 30 decibels respectively, can be preset. In a KTV or another scene where the ambient noise is loud, the user can switch the screening mode to the high-decibel mode; the projection equipment then filters out sound below 60 decibels as mistakenly collected ambient noise when collecting user voice data, so even if the ambient noise is loud, the projection equipment will not mistake it for user voice data.
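A minimal sketch of the screening modes described above, assuming 16-bit PCM frames; mapping the digital frame level to real-world decibels requires microphone calibration, so the full_scale_db constant below is an assumed calibration value rather than a value given in the disclosure.

```python
import array
import math

# Screening-mode thresholds from the text: low / medium / high decibel modes.
SCREENING_MODES = {"low": 30.0, "medium": 40.0, "high": 60.0}

def frame_level_db(frame: bytes, full_scale_db: float = 90.0) -> float:
    """Rough level of one 16-bit PCM frame, in (assumed) real-world decibels."""
    samples = array.array("h", frame)  # signed 16-bit samples
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < 1.0:
        return 0.0
    dbfs = 20.0 * math.log10(rms / 32768.0)   # level relative to digital full scale (negative)
    return max(0.0, full_scale_db + dbfs)

def screen_frames(frames: list[bytes], mode: str = "low") -> list[bytes]:
    """Keep only frames at or above the threshold of the current screening mode;
    quieter frames are treated as mistakenly collected ambient noise."""
    threshold = SCREENING_MODES[mode]
    return [f for f in frames if frame_level_db(f) >= threshold]
```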
The method of distinguishing noise from user voice data by decibels is easily affected by the user's speaking habits and the surrounding environment, so the user voice data and the noise can instead be distinguished according to the user's voiceprint. For ease of understanding, this embodiment gives a specific scenario in which the projection device filters noise according to the voiceprint; step 201 may be specifically implemented by:
(1) original audio data are obtained, and target voiceprint data in the original audio data are extracted.
The original audio data refers to audio data that is obtained by the projection equipment and has not yet been processed. The original audio data includes at least one of user voice data and ambient noise. Therefore, in order to distinguish the user voice data from the ambient noise, the projection device may first perform voiceprint extraction on the original audio data through a preset voiceprint extraction module to obtain target voiceprint data.
(2) And inquiring a preset historical voiceprint database, and judging whether historical voiceprint data matched with the target voiceprint data exist.
(3) And if the historical voiceprint data which are the same as the target voiceprint data exist in the historical voiceprint database, taking the original audio data as the user voice data.
In order to determine whether the target voiceprint data is from the user voice data, the projection device may query historical voiceprint data stored in a historical voiceprint database, and if data matched with the target voiceprint data exists in the historical voiceprint data, it is indicated that the target voiceprint data is from the user voice data, and the projection device performs subsequent visualization steps on the original audio data. If the historical voiceprint data does not have data matched with the target voiceprint data, the target voiceprint data is from ambient noise, and the projection equipment does not process the original audio data.
Specifically, when the user uses the projection device for the first time, the user can enter his or her own voice; after the voice is entered, the projection device extracts the voiceprint from the entered voice and stores it in the historical voiceprint database. In addition, the projection equipment can upload the updated historical voiceprint database to the cloud over the network so that the data in the cloud is kept up to date. By uploading to the cloud, even if the projection device the user later uses is different from the one on which the voice was entered, that device can still filter out the ambient noise. Each time the device is used after the voiceprint has been entered, the projection device can judge the target voiceprint data against the historical voiceprint database.
It should be noted that if the user has never entered his or her own voice, the projection device will not be able to distinguish the user voice data from the ambient noise. Therefore, if the projection device detects that no voice has ever been entered on the device, it may emit a warning tone or signal to remind the user to enter a voice.
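The voiceprint filtering described above might be sketched as follows; the embedding extractor is left as a placeholder, and the cosine-similarity measure and the 0.75 matching threshold are illustrative assumptions rather than details of the disclosure.

```python
import numpy as np

def extract_voiceprint(pcm_frames: bytes) -> np.ndarray:
    """Placeholder for the device's preset voiceprint extraction module
    (e.g. a speaker-embedding network); returns a fixed-length vector."""
    raise NotImplementedError

def matches_known_user(target: np.ndarray, history: list[np.ndarray],
                       threshold: float = 0.75) -> bool:
    """Return True if the target voiceprint matches any historical voiceprint."""
    for known in history:
        sim = float(np.dot(target, known) /
                    (np.linalg.norm(target) * np.linalg.norm(known) + 1e-9))
        if sim >= threshold:
            return True
    return False

def filter_by_voiceprint(raw_audio: bytes, history: list[np.ndarray]) -> bytes | None:
    """Treat the original audio data as user voice data only when its voiceprint
    is known; otherwise discard it as ambient noise, as described above."""
    target = extract_voiceprint(raw_audio)
    return raw_audio if matches_known_user(target, history) else None
```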
202. And obtaining a target emotion category according to the acquired data.
The target emotion category may include an emotion category obtained according to one or more of collected data such as user voice data, user physiological parameters, user image data and the like.
Illustratively, the target emotion classification may contain an emotion classification derived from the user speech data. The audio parameters in the user speech data are different when the user is in different emotions. For example, when the user is in an angry emotion, the volume in the user's voice data may be greater than the volume when speaking normally, and the pitch may be higher than the pitch when speaking normally. Therefore, the projection device can judge the target emotion category in which the user is located according to different audio parameters contained in the user voice data, wherein the target emotion category is one of preset common emotion categories such as happy emotion, sad emotion and angry emotion, and can also be multiple emotions in the common emotion categories. Specifically, after receiving user voice data transmitted by a microphone, the projection device performs processing such as feature extraction, classification prediction and the like on the user voice data to obtain a target emotion category.
For convenience of understanding, in this embodiment, a projection device obtains a specific scene of a target emotion category according to the user voice data, and step 202 may be implemented specifically by:
and calling a preset first emotion recognition model to carry out prediction processing on the user voice data to obtain a target emotion category.
In the embodiment of the disclosure, the first emotion recognition model is obtained by collecting a large amount of voice data and its corresponding labels in advance and training on them. The voice data includes sounds made by people under various emotions; for example, it may include sounds collected under anger, calm and other emotions, with corresponding emotion labels marked in advance for the different sounds. For instance, an anger label is marked for voice data with extremely high volume and pitch, and a calm label is marked for voice data at 40 to 50 decibels with little pitch fluctuation. After training, the first emotion recognition model can predict the target emotion category of the user according to the information in the user voice data.
Further, the first emotion recognition model may include a Convolutional Neural Network (CNN), in which functions such as feature extraction and classification prediction are implemented using convolutional layers, pooling layers and fully connected layers. For example, a convolution operation is performed on the user voice data through the convolutional layers to obtain a feature vector, the feature vector is pooled by the pooling layers, and the pooled feature vector is processed by the fully connected layers to obtain the target emotion category of the user.
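A minimal PyTorch sketch of the kind of convolutional classifier described here (convolutional layers, pooling layers and fully connected layers applied to a spectrogram of the user voice data); the layer sizes and the example emotion categories are assumptions of the sketch, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "angry", "calm"]  # example target emotion categories

class SpeechEmotionCNN(nn.Module):
    """Convolution -> pooling -> fully connected layers applied to a
    (batch, 1, n_mels, n_frames) log-mel spectrogram of the voice data."""
    def __init__(self, n_classes: int = len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        x = self.features(spectrogram)
        return self.classifier(x.flatten(1))

# Usage: logits -> predicted target emotion category.
model = SpeechEmotionCNN()
logits = model(torch.randn(1, 1, 64, 128))      # dummy spectrogram
target_emotion = EMOTIONS[int(logits.argmax(dim=1))]
```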
In order to improve the accuracy of the target emotion category determination, in some embodiments, the target emotion category may also be predicted based on both the user speech data and the user physiological parameters. For convenience of understanding, in this embodiment, a specific scenario that the projection device predicts the target emotion category according to the user voice data and the user physiological parameter is given, and step 202 may be specifically implemented by:
(1) and calling a voice processing sub-model in a preset second emotion recognition model to carry out prediction processing on the voice data of the user to obtain the voice emotion category.
(2) And acquiring physiological parameters of the user, and calling a physiological data processing sub-model in a preset second emotion recognition model to perform prediction processing on the physiological parameters of the user to obtain physiological emotion categories.
(3) And calling an emotion fusion sub-model in a preset second emotion recognition model to perform emotion fusion on the voice emotion category and the physiological emotion category to obtain a target emotion category.
Specifically, the projection device first calls the voice processing submodel in the second emotion recognition model to predict the user voice data to obtain the voice emotion category obtained according to the user voice data, and the specific description may refer to the description above of predicting the user voice data by using the first emotion recognition model, which is not described herein again.
In the embodiment of the present disclosure, besides the microphone, the projection device is also connected to external devices, such as a bracelet and a camera, that can collect physiological parameters of the user such as expression, blood pressure and heartbeat. The user physiological parameters, like the user voice data, can also characterize the mood of the user. For example, the activity of the face and eyes is large when a person is flustered, and small when the person is calm. As another example, a person's heart rate increases significantly in a panic, while it stays in the range of 60 to 90 beats per minute when the person is calm. Therefore, the projection device can also predict the physiological emotion category of the user according to the physiological parameters of the user.
Further, if the user physiological parameters include the user images, the user images can be preprocessed to enhance the accuracy of determining the target emotion category. For example, the user image may be preprocessed by filtering or the like, such as denoising or contrast enhancement.
Specifically, the projection device may invoke the physiological data processing sub-model in the second emotion recognition model to perform prediction processing on the physiological parameters of the user to obtain a physiological emotion category. The physiological data processing sub-model is obtained by training on a large amount of labelled data. For example, physiological data of people under different emotions may be collected in advance, and emotion labels then applied to the different physiological data; for instance, a confused emotion label is applied to a heart rate of 130 beats per minute and a calm label is applied to a heart rate of 70 beats per minute. After training, the physiological data processing sub-model can predict the physiological emotion category of the user according to the information in the physiological parameters of the user.
Further, the physiological data processing submodel may also include a convolutional neural network, which is not described in detail herein.
After the physiological emotion category and the voice emotion category are obtained, the projection equipment can call an emotion fusion sub-model in the second emotion recognition model to respectively obtain emotion feature vectors of the voice emotion category and the physiological emotion category, then respectively endow the emotion feature vectors with corresponding weight values to complete combination of the feature vectors, obtain a target feature vector, and finally predict the target emotion category according to the target feature vector.
Because the embodiment of the disclosure combines the user physiological parameters and the user voice data, the judgment accuracy of the target emotion category is higher compared with a method only according to the user voice data.
In order to reduce the amount of calculation, a single recognition model can instead be used to perform prediction processing on the user physiological parameters and the user voice data to obtain the target emotion category.
Specifically, this recognition model also needs to be obtained through training on a large amount of labelled data, which is not repeated here. After the user physiological parameters and the user voice data are obtained, the projection equipment calls the recognition model to extract feature vectors from the user physiological parameters and the user voice data respectively, then assigns corresponding weight values to the two vectors to combine them into a target feature vector, and finally predicts the target emotion category from the target feature vector.
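The weighted fusion step described above might look roughly as follows; the 0.6/0.4 weights and the linear prediction head are illustrative assumptions of this sketch.

```python
import numpy as np

def fuse_emotion_features(speech_vec: np.ndarray, physio_vec: np.ndarray,
                          w_speech: float = 0.6, w_physio: float = 0.4) -> np.ndarray:
    """Weighted combination of the speech and physiological emotion feature
    vectors into one target feature vector; the weights are illustrative,
    not values given in the text."""
    return np.concatenate([w_speech * speech_vec, w_physio * physio_vec])

def predict_target_emotion(target_vec: np.ndarray, class_weights: np.ndarray,
                           emotions: list[str]) -> str:
    """Stand-in for the final prediction head: a linear scoring of the fused
    target feature vector followed by an argmax over emotion categories."""
    scores = class_weights @ target_vec          # shape: (n_emotions,)
    return emotions[int(np.argmax(scores))]
```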
In order to further improve the accuracy of the target emotion category judgment, in some embodiments, the projection device may further obtain the speech emotion category according to the emotion keywords in the text information and the emotion-related audio parameters. For ease of understanding, this embodiment provides a specific scenario in which the projection device obtains the speech emotion category according to the emotion keywords in the text information and the emotion-related audio parameters:
(1) and performing character recognition processing on the user voice data to obtain character information.
(2) And carrying out semantic extraction processing on the character information to obtain emotion keywords.
The projection device may extract the emotion keywords in the text information through an NLP (Natural Language Processing) module. Specifically, algorithms such as TextRank and TF-IDF can be used to judge, from the importance of each character or word in the text information, whether it is an emotion keyword; the higher the importance, the closer the character or word is to a preset emotion-related character or word. The projection device can therefore extract the characters or words with high importance as the emotion keywords.
(3) And extracting audio parameters in the user voice data, wherein the audio parameters comprise at least one of the speed of speech, the volume and the pitch.
The explanation of the audio parameters can refer to the above explanation, and will not be described in detail herein.
Specifically, the projection device may receive user voice data sent by the microphone, acquire information such as a spectrum curve and a time domain curve of the user voice data, and acquire at least one of a speech rate, a volume, and a pitch according to the information.
(4) And calling a voice processing sub-model in a preset second emotion recognition model to carry out prediction processing on the emotion keywords and the audio parameters to obtain the voice emotion category.
The projection equipment can call the voice processing sub-model to obtain feature vectors for the emotion keywords and for the audio parameters respectively, then assign corresponding weight values to the feature vectors to combine them into a voice feature vector, and finally predict the voice emotion category from the voice feature vector.
Compared with a method for directly obtaining the voice emotion category according to the audio parameter, the method for obtaining the voice emotion category by combining the emotion keyword and the audio parameter has higher prediction accuracy, and further enables the obtained target emotion category to be more accurate.
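A toy sketch of combining emotion keywords with audio parameters; librosa is an assumed dependency for the audio parameters, and the keyword lists and thresholds are placeholders, not values from the disclosure.

```python
import numpy as np

def audio_parameters(y: np.ndarray, sr: int) -> dict:
    """Rough volume and pitch features from a mono waveform (librosa assumed)."""
    import librosa
    rms = float(np.mean(librosa.feature.rms(y=y)))                # volume proxy
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr)          # pitch track
    pitch = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0
    return {"volume": rms, "pitch": pitch}

def speech_emotion(keywords: list[str], params: dict) -> str:
    """Toy stand-in for the voice processing sub-model: emotion keywords and
    audio parameters are combined into one decision. The keyword sets and
    thresholds below are illustrative assumptions."""
    happy_words = {"开心", "高兴", "太棒了"}
    angry_words = {"生气", "讨厌"}
    if set(keywords) & angry_words or (params["volume"] > 0.2 and params["pitch"] > 260):
        return "angry"
    if set(keywords) & happy_words:
        return "happy"
    return "calm"
```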
It should be noted that the projection device may determine the target emotion category according to any one of the user physiological parameters, the user image data, the user voice data and the emotion keywords, or according to several of them at the same time. For the specific implementation, reference may be made to the method of obtaining the target emotion category from the user physiological parameters together with the user voice data, and to the method of obtaining it from the user voice data alone, which are not repeated here.
203. And acquiring a target character special effect corresponding to the user voice data according to the target emotion category.
The target character special effect refers to the characteristics of the projected characters after visualization. Illustratively, the target character special effect may include a text color, a text size, a dynamic special effect, and the like. For example, the target character special effect corresponding to the user voice data may include a text color special effect of red, a character size special effect of 18 pt, and a dynamic special effect of jumping. After the red, 18 pt target character special effect is visualized, the projected characters are 18 pt in size and red in color, and they jump.
The projection equipment can look up a corresponding special effect model according to the target emotion category to obtain the target character special effect, or can modify a preset original special effect model according to the target emotion category to obtain the target character special effect.
On one hand, when the projection device searches the corresponding special effect model according to the target emotion category, the projection device can search the model corresponding to the target emotion category from a plurality of preset models respectively corresponding to the emotion categories so as to obtain the target character special effect. The method can be specifically realized by the following steps:
(1) and acquiring a preset model set.
(2) And inquiring a preset model set, and acquiring a target character special effect generation model corresponding to the target emotion type.
For example, a research and development worker may set a preset model corresponding to a happy emotion category and may also set a preset model corresponding to a calm emotion category in the projection device, and the preset model set may include at least one of the preset models.
When visualization is carried out, the projection equipment queries a target character special effect generation model corresponding to the target emotion type in a preset model set. For example, when the target emotion category is happy, the projection device traverses each preset model in the preset model set, determines an emotion category corresponding to each preset model, and then takes the preset model corresponding to the target emotion category as a target character special effect generation model.
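The preset model set lookup can be pictured as a simple mapping from emotion categories to pre-configured special effect generation models; the concrete parameters below (color, period, amplitude) and the calm fallback are assumptions of the sketch.

```python
from typing import Callable

# Each emotion category maps to a preset text special-effect generation model;
# here a model is simplified to a function from text to effect parameters.
PRESET_MODEL_SET: dict[str, Callable[[str], dict]] = {
    "happy": lambda text: {"text": text, "color": "red",  "period_s": 0.5, "amplitude_cm": 4.0},
    "calm":  lambda text: {"text": text, "color": "blue", "period_s": 2.0, "amplitude_cm": 1.0},
}

def select_effect_model(target_emotion: str) -> Callable[[str], dict]:
    """Traverse the preset model set and return the special-effect generation
    model corresponding to the target emotion category."""
    try:
        return PRESET_MODEL_SET[target_emotion]
    except KeyError:
        return PRESET_MODEL_SET["calm"]  # assumed fallback when no model matches
```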
On the other hand, the projection device can adjust the dynamic generation parameters in the preset original special effect model according to the target emotion category. For example, when the target emotion category is happy, the change period may be reduced and the change amplitude increased, so as to present a special effect in which the characters change rapidly and with large amplitude, creating a cheerful atmosphere.
In some embodiments, the dynamic generation parameters may be adjusted according to the user image data of the user to obtain a specific target character special effect generation model. For ease of understanding, a specific scenario for adjusting the dynamic generation parameters is provided in this embodiment. Adjusting the dynamic generation parameters in the preset initial special effect generation model according to the target emotion category to obtain a target character special effect generation model includes:
(1) user image data of a user is collected.
The projection equipment can acquire images or videos through an external camera and other image acquisition devices, and then the acquired images or videos are used as user image data.
(2) And obtaining the user variation amplitude and the user height of the user according to the user image data.
The projection device may obtain the height of the user from an image or video frame of the user's whole body included in the user image data. Specifically, the projection device may detect the height of the human body in the image or the video frame according to the image detection model, and then amplify the height in the image according to the scaling ratio of the image and the reality to obtain the height of the user.
On the other hand, the projection device may obtain the user variation amplitude according to the movement amplitude of the human body part in the plurality of image sequences or videos. Specifically, the projection apparatus may detect the movement amplitude of a part such as an arm or a head in an image by using an image detection method such as an optical flow method, and then enlarge the movement amplitude in the image according to the scaling ratio of the image to reality to obtain the user variation amplitude.
(3) And obtaining a special effect variation amplitude corresponding to the user variation amplitude according to a preset variation amplitude corresponding relation and the user height.
(4) And adjusting target dynamic generation parameters related to the change amplitude in a preset initial special effect generation model to obtain a target character special effect generation model, wherein the change amplitude of characters generated by the target character special effect generation model is the special effect change amplitude.
After obtaining the user variation amplitude and the user height, the projection device can obtain, according to the variation amplitude correspondence, the special effect variation amplitude of the dynamic characters projected on a display surface such as a screen or curtain. Referring to the above example of the variation amplitude, n centimeters is the special effect variation amplitude. Therefore, after obtaining the special effect variation amplitude, the projection equipment can calculate the value that the target dynamic generation parameter should take for the variation amplitude of the dynamic characters to reach the special effect variation amplitude, and then adjust the target dynamic generation parameter accordingly.
Further, the variation amplitude correspondence of formula (1) may be adopted:
Z = (C / H) × N        (1)
Wherein Z is the special effect variation amplitude, C is the factory-default height of the dynamic text obtained after the text information is visualized, H is the height of the user, and N is the variation amplitude of the user.
Therefore, by acquiring the user image data and adjusting the dynamic generation parameters according to it, the projection equipment can intelligently adjust the target dynamic generation parameters according to the user's body movement amplitude and height: for the same person, the larger the body movement amplitude, the larger the variation amplitude of the dynamic characters; for different people, the variation amplitude of the dynamic characters also depends on the user's height.
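A direct transcription of formula (1) as a small helper function, assuming all lengths are expressed in centimetres:

```python
def special_effect_amplitude(default_text_height_cm: float,
                             user_height_cm: float,
                             user_variation_cm: float) -> float:
    """Formula (1): Z = (C / H) x N, where C is the factory-default height of the
    visualized dynamic text, H is the user's height and N is the user's movement
    amplitude, all in the same length unit (centimetres, as an assumption)."""
    return (default_text_height_cm / user_height_cm) * user_variation_cm

# Example: 10 cm text and a 170 cm user raising an arm by 34 cm
# give a special effect variation amplitude of (10 / 170) * 34 = 2 cm.
amplitude = special_effect_amplitude(10.0, 170.0, 34.0)
```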
204. And performing visualization processing on the character information according to the target character special effect, and displaying projection information obtained after visualization processing in a projection surface.
After the target character special effect is obtained, the projection equipment can perform visualization processing on the character information according to the target character special effect to obtain projection information. For example, the projection device may input the text information into a special effect generation model corresponding to the target emotion category to obtain projection information containing dynamic text.
The dynamic text is composed of text characters and a dynamic effect. The text characters are consistent with the text information in the user's voice data. For example, when the text information in the user voice data is "dajiahao" ("hello everyone"), that is, the user says "hello everyone", the dynamic text is "hello everyone" carrying the dynamic effect. The dynamic effect is preset in the projection device, and may include effects such as text jumping and text blinking, which are not limited in the embodiments of the present disclosure.
In the embodiment of the disclosure, the projection device may call a preset special effect generation model to generate projection information including dynamic texts, after the text information is input into the special effect generation model, the special effect generation model determines characters in the text information as text characters, and controls movement of the text characters according to a dynamic generation parameter to obtain the dynamic texts, so as to form the projection information, where the dynamic generation parameter may be a change period in each time period, a change amplitude of the text characters in each change period, and the like.
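A sketch of how a special effect generation model might drive the text characters from the dynamic generation parameters (change period and change amplitude); the per-character phase stagger is an illustrative choice, not part of the disclosure.

```python
import math

def character_offsets(text: str, t: float, period_s: float,
                      amplitude_cm: float) -> list[tuple[str, float]]:
    """One animation frame of the dynamic text: every character oscillates left
    and right around its origin, with the change period and change amplitude
    taken as the dynamic generation parameters."""
    offsets = []
    for i, ch in enumerate(text):
        phase = 2.0 * math.pi * (t / period_s) + i * 0.5   # stagger neighbouring characters
        offsets.append((ch, amplitude_cm * math.sin(phase)))
    return offsets

# Example frame at t = 0.1 s for the phrase "dajiahao" ("hello everyone").
frame = character_offsets("大家好", t=0.1, period_s=0.5, amplitude_cm=2.0)
```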
To sum up, the embodiments of the present disclosure include: acquiring collected data of a user, and converting user voice data in the collected data into text information; acquiring a target emotion category corresponding to the collected data; acquiring a target character special effect corresponding to the user voice data according to the target emotion category; and performing visualization processing on the text information according to the target character special effect, and displaying the projection information obtained after the visualization processing on a projection surface. Therefore, the voice visualization method in the embodiments of the disclosure can visualize the text information and can set the visualization special effect of the text information according to the user's real-time target emotion category, thereby enhancing the user's sense of immersion during projection and increasing the flexibility of projection.
In order to further enhance the immersion of the user during projection, the projection device can add special effects to the projection information according to the keywords in the text information. Referring to fig. 3, before performing visualization processing on the text information according to the target text special effect and displaying projection information obtained after the visualization processing in a projection plane, the method further includes:
301. and extracting target graphic keywords in the text information.
The target graphic keyword is one of a plurality of preset graphic keywords. For example, preset graphic keywords of the weather and action categories may be set in the projection device. When the projection device performs step 301, the target graphic keywords may be extracted through the preset NLP module; reference may be made to the process of extracting the emotion keywords described above. For example, when the preset keywords include "raining", if the user says "raining", that is, "raining" appears in the text information, the projection device extracts "raining" as the target graphic keyword.
302. And determining a target graphic special effect corresponding to the target graphic keyword, wherein the target graphic special effect comprises at least one of a weather special effect and an action special effect.
Accordingly, performing visualization processing on the text information according to the target text special effect and displaying the projection information obtained after visualization processing on a projection surface includes:
303. Perform visualization processing on the text information according to the target text special effect and the target graphic special effect, and display the projection information obtained after visualization processing on a projection surface.
The projection device can traverse a preset special effect database, determine the preset graphic keyword corresponding to each preset graphic special effect in the database, and extract the target graphic special effect corresponding to the target graphic keyword so as to add it to the projection information. For example, if the preset graphic special effect corresponding to the keyword "raining" in the special effect database is a dynamic effect of raindrops falling, then when the target graphic keyword is "raining" the projection device extracts this dynamic effect from the database as the target graphic special effect. For another example, if the preset graphic special effect corresponding to the keyword "bleeding" is a dynamic effect of red liquid falling, then when the target graphic keyword is "bleeding" the projection device extracts that dynamic effect as the target graphic special effect.
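A minimal sketch of that lookup, under the assumption that the preset special effect database is a simple keyword-to-asset mapping; the keywords and asset names below are invented examples, not the patent's actual data.

```python
# Hypothetical preset special effect database: graphic keyword -> effect asset.
SPECIAL_EFFECT_DB = {
    "raining": "raindrops_falling.anim",
    "bleeding": "red_liquid_falling.anim",
}

def find_target_graphic_effect(text_info, effect_db=SPECIAL_EFFECT_DB):
    """Traverse the preset database and return the effect whose preset
    graphic keyword appears in the recognized text, if any."""
    for keyword, effect in effect_db.items():
        if keyword in text_info:  # stand-in for the NLP keyword extraction
            return keyword, effect
    return None, None

keyword, effect = find_target_graphic_effect("it looks like it is raining")
# -> ("raining", "raindrops_falling.anim")
```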
When the projection device adds the target graphic special effect to the projection information, there may be several addition modes. For example, the projection device may add the target graphic special effect to the visualized text, or use it as a background special effect in the projection information. For instance, the projection device may add a dynamic rain effect to the dynamic text characters, creating a raindrop-falling effect on the characters, or it may use the rain effect as the background of the projection information. The embodiments of the present disclosure do not limit the addition mode.
Furthermore, besides weather and action special effects, the projection device can add other special effects to the projection information according to different target graphic keywords in the text information. Illustratively, these may include place-related special effects, animal-related special effects, and the like. For example, the projection device may extract a target graphic keyword related to a target place from the text information, generate a target graphic special effect showing that place, and add it to the projection information. Alternatively, the projection device may extract a target graphic keyword related to a target animal from the text information, generate a target graphic special effect containing that animal, and add it to the projection information.
Therefore, adding a target graphic special effect to the projection information further enriches the projection information, and the projection information changes in real time according to the target graphic keywords mentioned by the user, which improves the flexibility of projection.
In addition to the target graphic special effect, the projection device may add a sound special effect to the projection information. Referring to fig. 4, before performing visualization processing on the text information according to the target text special effect and displaying the projection information obtained after visualization processing on a projection surface, the method further includes:
401. Extract target sound keywords from the text information.
402. Determine a target sound special effect corresponding to the target sound keyword.
Accordingly, performing visualization processing on the text information according to the target text special effect and displaying the projection information obtained after visualization processing on a projection surface includes:
403. Perform visualization processing on the text information according to the target text special effect and the target sound special effect, and display the projection information obtained after visualization processing on a projection surface.
The target sound keyword is one of a plurality of preset sound keywords. For example, preset sound keywords in categories corresponding to environmental sounds or animal sounds may be set in the projection device. When the projection device performs step 401, the target sound keywords may be extracted through a preset NLP module; for details, refer to the process described above for extracting emotion keywords. For example, when the preset keywords include "wind sound", if the user says "listen to the wind sound together", that is, "wind sound" exists in the text information, the projection device may extract "wind sound" as the target sound keyword.
The manner in which the projection device determines and adds the target sound special effect is similar to step 302 and will not be described in detail here.
In some cases, the user speaks in order to control the projection device by voice. Referring to fig. 5, after performing visualization processing on the text information according to the target text special effect and displaying the projection information obtained after visualization processing on a projection surface, the method further includes:
501. Query a preset instruction set and determine whether a target instruction matching the text information exists in the preset instruction set.
Here, the text information matching the target instruction means that the instruction text information contained in the target instruction is the same as the text information in the user voice data. Therefore, when judging whether a target instruction matching the text information exists in the preset instruction set, the text information needs to be compared with the instruction text information.
When the projection device compares the text information with the instruction text information, it may compare either all of the information contained in the text information or only part of it. For example, suppose the text information is "the weather is good today, I want to open XX music software". If the projection device compared all of the text information with the instruction text information, a large number of preset instructions would have to be stored in advance to realize voice control; otherwise the text information might contain the instruction text information while the rest of it fails to match. This would occupy a large amount of storage space and lengthen the comparison time. If the projection device instead first extracts the instruction keywords related to the instruction from the text information and then compares only the extracted instruction keywords, storage space is saved, the comparison time is shortened, and the efficiency is higher. For example, when the text information is "the weather is good today, I want to open XX music software", the projection device may first extract the instruction keyword "open XX music software" and then judge whether it is the same as the instruction text information.
Furthermore, the projection device can perform synonym expansion on the instruction keywords to obtain a plurality of expanded keywords and then compare the instruction keywords and each expanded keyword with the instruction text information respectively, which improves the judgment accuracy. Illustratively, after the projection device extracts the instruction keywords, synonym expansion can be performed through the network or a preset lexicon to form a plurality of expanded keywords, and then the instruction keywords and each expanded keyword are compared with the instruction text information respectively, avoiding the situation in which voice control fails because different users phrase a command differently. For example, when the instruction keyword is "open XX music software", synonym expansion may produce expanded keywords such as "open XX music software" and "start XX music software", and each of these is then compared with the instruction text information. It can be understood that synonym expansion can also be used when extracting other keywords, such as emotion keywords or weather keywords, to improve the intelligence of the projection device.
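As a rough illustration (not part of the patent text), synonym expansion of an instruction keyword might look like the following sketch; the synonym lexicon and the example keyword are invented.

```python
# Hypothetical preset synonym lexicon; in practice this could come from a
# word bank on the device or from an online dictionary query.
SYNONYM_LEXICON = {
    "open": ["start", "launch"],
}

def expand_instruction_keyword(keyword, lexicon=SYNONYM_LEXICON):
    """Return the keyword plus its synonym variants, so each variant can be
    compared against the instruction text information in turn."""
    variants = [keyword]
    for word, synonyms in lexicon.items():
        if word in keyword:
            variants.extend(keyword.replace(word, s) for s in synonyms)
    return variants

expand_instruction_keyword("open XX music software")
# -> ['open XX music software', 'start XX music software', 'launch XX music software']
```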
502. If a target instruction matching the text information exists in the preset instruction set, execute the target instruction.
When comparing, the projection device traverses all the preset instructions in the preset instruction set and determines the instruction text information contained in each preset instruction; when instruction text information identical to the text information is found, the target instruction corresponding to that instruction text information is executed.
To improve the intelligence of voice control, the projection device can also treat a preset instruction whose matching degree with the text information is higher than a preset threshold as the target instruction, as sketched below. For example, a matching threshold of 90% may be set; when target instruction text information whose characters are 90% identical to the text information is found, that is, when the matching degree between the text information and the target instruction reaches 90%, the target instruction is executed.
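The matching-degree check could be sketched as follows, using a simple character-level similarity ratio as a stand-in for whatever matching metric the device actually uses; the 90% threshold follows the example above, and the preset instruction set is invented.

```python
from difflib import SequenceMatcher

# Hypothetical preset instruction set: instruction text -> action to execute.
PRESET_INSTRUCTIONS = {
    "open XX music software": "launch_music_app",
    "turn off projection": "stop_projection",
}

def find_target_instruction(instruction_keyword, threshold=0.9):
    """Traverse the preset instruction set and return the action whose
    instruction text matches the extracted keyword above the threshold."""
    for instruction_text, action in PRESET_INSTRUCTIONS.items():
        ratio = SequenceMatcher(None, instruction_keyword, instruction_text).ratio()
        if ratio >= threshold:
            return action
    return None

find_target_instruction("open XX music software")  # -> 'launch_music_app'
```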
503. When the target instruction is executed, stop displaying the projection information.
When a user sings or recites, the lyrics or the article can be projected onto a curtain or a screen so that the audience or the user can see the next line in advance, or review content that has already been sung or read. Referring to fig. 6, in this case, visualizing the text information in the user voice data according to the target emotion category to obtain dynamic projection information containing dynamic text includes:
601. Detect whether the microphone function is enabled.
There are various ways to detect whether the microphone function is enabled; one of the simplest is to detect the state of the microphone interface. For example, the projection device may detect whether a microphone interface provided on the projection device is in a conducting state, that is, whether a microphone plug is inserted into the interface. If the microphone interface is in the conducting state, the microphone function is enabled, and the user voice data received by the projection device is the voice data input by the user through the microphone.
602. If the microphone function is enabled, acquire the text information in the user voice data and query the article or lyrics corresponding to the text information via the network.
If the microphone function is enabled, the user is probably singing or reciting an article. Querying the complete lyrics or article in advance allows the audience or the user to see the next line ahead of time, or to review content that has already been sung or read; it can also serve as a prompter for the user and avoid situations such as forgetting the words. The query method includes, but is not limited to, a network query; for example, the projection device may search for articles or lyrics containing the text information through a search engine.
In addition, the projection device can determine which article or lyrics correspond to the text information according to the ranking and view count of the search results. Since many articles or lyrics may contain the text information, for example, both a song and a review article may contain a given line of lyrics, the projection device may take the result ranked highest in the search engine, or the result with the largest number of views, as the article or lyrics corresponding to the text information.
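A rough sketch of that selection rule, assuming each search result carries its rank in the search engine and a view count (the field names are invented):

```python
def choose_article_or_lyrics(search_results):
    """search_results: list of dicts like
    {"title": ..., "rank": position in the search engine, "views": browse count}.
    Prefer the result ranked first; break ties by the largest view count."""
    if not search_results:
        return None
    return min(search_results, key=lambda r: (r["rank"], -r["views"]))

best = choose_article_or_lyrics([
    {"title": "song A full lyrics", "rank": 1, "views": 12000},
    {"title": "review article quoting song A", "rank": 2, "views": 50000},
])
# -> the result ranked first in the search engine
```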
603. Perform visualization processing on the article or lyrics according to the target text special effect, and display the projection information obtained after visualization processing on a projection surface.
If the text information corresponds to the lyrics, the projection device can also add information corresponding to the singer in the projection information. For example, the projection device may obtain an image or video of a singer corresponding to the lyrics, and use the image or video of the singer as a background in the dynamic projection information.
During projection, if the decibel value of the projection sound in the dynamic projection information is too high, it may mask the user's voice, and the projection sound may be mistakenly captured as user voice data. To avoid this, the decibel value during projection can be adjusted according to the decibel value of the user voice data while the projection information is played, specifically through the following steps:
(1) Extract the projection decibel value of the dynamic projection information and the user decibel value of the user voice data.
(2) Compare the projection decibel value with the user decibel value.
(3) If the difference between the projection decibel value and the user decibel value is greater than a preset decibel value, use a default decibel value as the projection decibel value and play the projection information.
To judge whether the projection decibel value is too high, the projection device extracts the projection decibel value from the control chip or the storage chip and extracts the user decibel value, that is, the volume at which the user speaks, from the user voice data.
To compare the projection decibel value with the user decibel value, a preset decibel value can be set to judge whether the projection sound would mask the user's voice. If the difference between the projection decibel value and the user decibel value is greater than the preset decibel value, the projection decibel value is too high and may mask the user's voice. To ensure that the adjusted projection decibel value no longer masks the user's voice, the default decibel value may be set to a relatively low level, for example 40 decibels, roughly the level of quiet speech.
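The adjustment rule of steps (1)-(3) could be sketched as below; the 10-decibel gap is an assumed preset value, and 40 decibels is the default mentioned above.

```python
DEFAULT_DECIBEL = 40     # roughly the level of quiet speech (assumed default)
PRESET_DECIBEL_GAP = 10  # hypothetical preset threshold

def adjust_projection_volume(projection_db, user_db,
                             default_db=DEFAULT_DECIBEL,
                             preset_gap=PRESET_DECIBEL_GAP):
    """If the projection sound is louder than the user's voice by more than the
    preset gap, fall back to the default decibel value; otherwise keep it."""
    if projection_db - user_db > preset_gap:
        return default_db
    return projection_db

adjust_projection_volume(projection_db=75, user_db=55)  # -> 40
```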
In addition, the projection device can set a different default decibel value for each user according to the voiceprints of different users. This is described in detail below:
After acquiring the user voice data, the projection device can identify the voiceprint in the user voice data and then, for user voice data containing different voiceprints, separately record the decibel value with the longest accumulated playing time when the projection device performs visualization processing and playback. Take user voice data A and user voice data B, containing different voiceprints A and B, as an example. When visualization processing and playback are performed according to user voice data A, suppose the projection device plays for a total of 60 minutes at 60 decibels, 20 minutes at 40 decibels, and 5 minutes at 80 decibels; the preference of the user corresponding to voiceprint A is therefore 60 decibels, and 60 decibels can be set as that user's default decibel value. When visualization processing and playback are performed according to user voice data B, suppose the projection device plays for a total of 55 minutes at 20 decibels, 20 minutes at 40 decibels, and 5 minutes at 80 decibels; the preference of the user corresponding to voiceprint B is therefore 20 decibels, and 20 decibels can be set as that user's default decibel value. After multiple rounds of collection, the decibel preferences of different users can be obtained. Thus, if the projection device recognizes voiceprint A again and the difference between the projection decibel value and the user decibel value is greater than the preset decibel value, 60 decibels can be used as the projection decibel value when playing the projection information; if the projection device recognizes voiceprint B again under the same condition, 20 decibels can be used as the projection decibel value.
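As a rough sketch (not the patent's implementation), per-voiceprint default decibel values could be accumulated as follows, reproducing the voiceprint A example above; all identifiers are invented.

```python
from collections import defaultdict

# Accumulated playing time (minutes) per decibel level, per voiceprint.
play_time = defaultdict(lambda: defaultdict(float))

def record_playback(voiceprint_id, decibel, minutes):
    """Accumulate how long projection was played at each decibel level
    while this voiceprint's voice data was being visualized."""
    play_time[voiceprint_id][decibel] += minutes

def default_decibel_for(voiceprint_id, fallback=40):
    """Pick the decibel level with the longest accumulated playing time
    as this user's default decibel value."""
    levels = play_time.get(voiceprint_id)
    if not levels:
        return fallback
    return max(levels, key=levels.get)

record_playback("voiceprint_A", 60, 60)
record_playback("voiceprint_A", 40, 20)
record_playback("voiceprint_A", 80, 5)
default_decibel_for("voiceprint_A")  # -> 60
```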
In order to better implement the voice visualization method in the embodiments of the present disclosure, an embodiment of the present disclosure further provides a voice visualization apparatus based on that method. Fig. 7 shows a schematic structural diagram of an embodiment of the voice visualization apparatus; the voice visualization apparatus 700 includes:
an obtaining unit 701, configured to obtain collected data of a user, and convert user voice data in the collected data into text information;
an emotion acquisition unit 702, configured to acquire a target emotion category corresponding to the acquired data;
a special effect obtaining unit 703, configured to obtain a target text special effect corresponding to the user voice data according to the target emotion category;
and the display unit 704 is configured to perform visualization processing on the text information according to the target text special effect, and display projection information obtained after visualization processing in a projection plane.
In some embodiments, emotion acquisition unit 702 is further configured to implement one or more of the following methods:
analyzing user image data in the collected data to obtain the target emotion category;
analyzing the user physiological parameters in the collected data, and determining the target emotion types according to the user physiological parameters;
analyzing the user voice data to obtain the target emotion category, wherein the user voice data comprises at least one audio parameter of volume, tone and speed;
and extracting emotion keywords in the text information, and determining the target emotion category according to the emotion keywords.
In some embodiments, the speech visualization apparatus 700 further comprises a target graphical effect determination unit 705, the target graphical effect determination unit 705 being configured to:
extracting target graphic keywords in the text information;
determining a target graphic special effect corresponding to the target graphic keyword, wherein the target graphic special effect comprises at least one of a weather special effect and an action special effect;
the presentation unit 704 is further configured to:
and performing visualization processing on the character information according to the target character special effect and the target graphic special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, the speech visualization apparatus 700 further comprises a target sound special effect determination unit 706, the target sound special effect determination unit 706 being configured to:
extracting target sound keywords in the text information;
determining a target sound special effect corresponding to the target sound keywords;
the presentation unit 704 is further configured to:
and performing visualization processing on the text information according to the target text special effect and the target sound special effect, and displaying projection information obtained after visualization processing in a projection surface.
In some embodiments, the speech visualization apparatus 700 further comprises an instruction execution unit 707, the instruction execution unit 707 configured to:
inquiring a preset instruction set, and judging whether a target instruction matched with the text information exists in the preset instruction set;
if a target instruction matched with the text information exists in the preset instruction set, executing the target instruction;
and when the target instruction is executed, stopping displaying the projection information.
In some embodiments, the speech visualization apparatus 700 further comprises a networked acquisition unit 708, the networked acquisition unit 708 being configured to:
detecting whether a microphone function is enabled;
if the microphone function is started, acquiring text information in the user voice data, and querying the article or lyrics corresponding to the text information via the network;
and performing visualization processing on the article or the lyrics according to the special effect of the target character, and displaying projection information obtained after visualization processing in a projection plane.
In some embodiments, the obtaining unit 701 is further configured to:
acquiring original audio data and extracting target voiceprint data in the original audio data;
inquiring a preset historical voiceprint database, and judging whether historical voiceprint data matched with the target voiceprint data exist or not;
and if the historical voiceprint data matched with the target voiceprint data exist in the historical voiceprint database, taking the original audio data as the user voice data.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
Since the voice visualization apparatus can execute the steps in the voice visualization method in any embodiment of the present disclosure, the beneficial effects that can be realized by the voice visualization method in any embodiment of the present disclosure can be realized, which are detailed in the foregoing description and will not be described herein again.
In addition, in order to better implement the voice visualization method in the embodiments of the present disclosure, an embodiment of the present disclosure further provides a projection device based on the voice visualization method. Referring to fig. 8, which shows a schematic structural diagram of the projection device, the projection device includes a processor 801 configured to implement the steps of the voice visualization method of any of the above embodiments when executing a computer program stored in a memory 802; alternatively, the processor 801 is configured to implement the functions of the units in the embodiment corresponding to fig. 7 when executing the computer program stored in the memory 802.
Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in the memory 802 and executed by the processor 801 to accomplish the disclosed embodiments. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.
The projection device may include, but is not limited to, the processor 801 and the memory 802. Those skilled in the art will appreciate that the illustration is merely an example of a projection device and does not limit it; the projection device may include more or fewer components than shown, combine certain components, or use different components. For example, the projection device may also include input/output devices, network access devices, a bus, and the like, with the processor 801, the memory 802, the input/output devices, and the network access devices connected via the bus.
The processor 801 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the projection device and connects the various parts of the whole projection device through various interfaces and lines.
The memory 802 may be used to store the computer program and/or modules, and the processor 801 implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory 802 and invoking the data stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created based on the use of the projection device (such as audio data and video data). In addition, the memory may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the voice visualization apparatus, the projection device and the corresponding units thereof described above may refer to descriptions of the voice visualization method in any embodiment, and detailed descriptions thereof are omitted here.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
Therefore, embodiments of the present disclosure provide a computer-readable storage medium, where a plurality of instructions are stored, where the instructions can be loaded by a processor to execute steps in a speech visualization method in any embodiment of the present disclosure, and specific operations may refer to descriptions of the speech visualization method in any embodiment, and are not described herein again.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in the speech visualization method in any embodiment of the present disclosure, the beneficial effects that can be achieved by the speech visualization method in any embodiment of the present disclosure can be achieved, which are detailed in the foregoing description and will not be described herein again.
The voice visualization method and apparatus, the projection device, and the computer-readable storage medium provided by the embodiments of the present disclosure are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present disclosure, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementation and the application scope according to the idea of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (10)

1. A method for speech visualization, the method comprising:
acquiring collected data of a user, and converting user voice data in the collected data into text information;
acquiring a target emotion category corresponding to the acquired data;
acquiring a target character special effect corresponding to the user voice data according to the target emotion category;
and performing visualization processing on the character information according to the target character special effect, and displaying projection information obtained after visualization processing in a projection surface.
2. The method of claim 1, wherein the obtaining of the target emotion classification corresponding to the collected data comprises one or more of the following methods:
analyzing user image data in the collected data to obtain the target emotion category;
analyzing the user physiological parameters in the collected data, and determining the target emotion types according to the user physiological parameters;
analyzing the user voice data to obtain the target emotion category, wherein the user voice data comprises at least one audio parameter of volume, tone and speed;
and extracting emotion keywords in the text information, and determining the target emotion category according to the emotion keywords.
3. The voice visualization method according to claim 1, wherein before visualizing the text information according to the target text special effect and displaying projection information obtained after visualization in a projection plane, the method further comprises:
extracting target graphic keywords in the text information;
determining a target graphic special effect corresponding to the target graphic keyword, wherein the target graphic special effect comprises at least one of a weather special effect and an action special effect;
the process of carrying out visualization processing on the character information according to the target character special effect and displaying projection information obtained after visualization processing in a projection plane comprises the following steps:
and performing visualization processing on the character information according to the target character special effect and the target graphic special effect, and displaying projection information obtained after visualization processing in a projection surface.
4. The voice visualization method according to claim 1, wherein before visualizing the text information according to the target text special effect and displaying projection information obtained after visualization in a projection plane, the method further comprises:
extracting target sound keywords in the text information;
determining a target sound special effect corresponding to the target sound keywords;
the process of carrying out visualization processing on the character information according to the target character special effect and displaying projection information obtained after visualization processing in a projection plane comprises the following steps:
and performing visualization processing on the character information according to the target character special effect and the target sound special effect, and displaying projection information obtained after visualization processing in a projection surface.
5. The voice visualization method according to claim 1, wherein after the text information is visualized according to the target text special effect and the projection information obtained after the visualization is displayed in a projection plane, the method further comprises:
inquiring a preset instruction set, and judging whether a target instruction matched with the text information exists in the preset instruction set;
if a target instruction matched with the text information exists in the preset instruction set, executing the target instruction;
and when the target instruction is executed, stopping displaying the projection information.
6. The voice visualization method according to claim 1, wherein before visualizing the text information according to the target text special effect and displaying projection information obtained after visualization in a projection plane, the method further comprises:
detecting whether a microphone function is enabled;
if the microphone function is started, acquiring text information in the user voice data, and querying an article or lyrics corresponding to the text information via the network;
the process of carrying out visualization processing on the character information according to the target character special effect and displaying projection information obtained after visualization processing in a projection plane comprises the following steps:
and performing visualization processing on the article or the lyrics according to the target character special effect, and displaying projection information obtained after visualization processing in a projection plane.
7. The speech visualization method according to claim 1, wherein the obtaining user speech data comprises:
acquiring original audio data and extracting target voiceprint data in the original audio data;
inquiring a preset historical voiceprint database, and judging whether historical voiceprint data matched with the target voiceprint data exist or not;
and if the historical voiceprint data matched with the target voiceprint data exist in the historical voiceprint database, taking the original audio data as the user voice data.
8. A speech visualization device, comprising:
an acquisition unit configured to acquire user voice data;
the emotion acquisition unit is used for acquiring a target emotion category according to the user voice data;
the visualization unit is used for performing visualization processing on the character information in the user voice data according to the target emotion type to obtain dynamic projection information containing dynamic characters;
and the playing unit is used for playing the dynamic projection information.
9. A projection device, characterized in that it comprises a processor and a memory, wherein a computer program is stored in the memory, and the processor, when calling the computer program in the memory, executes the speech visualization method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which is loaded by a processor for performing the steps of the method for speech visualization according to any of the claims 1 to 7.
CN202110697947.2A 2021-06-23 2021-06-23 Voice visualization method and device, projection equipment and computer readable storage medium Pending CN113450804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110697947.2A CN113450804A (en) 2021-06-23 2021-06-23 Voice visualization method and device, projection equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113450804A true CN113450804A (en) 2021-09-28

Family

ID=77812266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110697947.2A Pending CN113450804A (en) 2021-06-23 2021-06-23 Voice visualization method and device, projection equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113450804A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104836979A (en) * 2014-03-17 2015-08-12 腾讯科技(北京)有限公司 Information interaction method and device
CN105957129A (en) * 2016-04-27 2016-09-21 上海河马动画设计股份有限公司 Television animation manufacturing method based on speech driving and image recognition
CN109857352A (en) * 2017-11-30 2019-06-07 富泰华工业(深圳)有限公司 Cartoon display method and human-computer interaction device
CN109065055A (en) * 2018-09-13 2018-12-21 三星电子(中国)研发中心 Method, storage medium and the device of AR content are generated based on sound
CN109783675A (en) * 2018-12-13 2019-05-21 深圳壹账通智能科技有限公司 A kind of holographic projection methods and relevant device based on data processing
CN109996026A (en) * 2019-04-23 2019-07-09 广东小天才科技有限公司 Special video effect interactive approach, device, equipment and medium based on wearable device
CN110222210A (en) * 2019-05-13 2019-09-10 深圳传音控股股份有限公司 User's smart machine and its mood icon processing method
CN111724880A (en) * 2020-06-09 2020-09-29 百度在线网络技术(北京)有限公司 User emotion adjusting method, device, equipment and readable storage medium
CN112614212A (en) * 2020-12-16 2021-04-06 上海交通大学 Method and system for realizing video-audio driving human face animation by combining tone and word characteristics
CN112714355A (en) * 2021-03-29 2021-04-27 深圳市火乐科技发展有限公司 Audio visualization method and device, projection equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438212A (en) * 2022-08-22 2022-12-06 蒋耘晨 Image projection system, method and equipment
CN115438212B (en) * 2022-08-22 2023-03-31 蒋耘晨 Image projection system, method and equipment
CN115101074A (en) * 2022-08-24 2022-09-23 深圳通联金融网络科技服务有限公司 Voice recognition method, device, medium and equipment based on user speaking emotion
CN115101074B (en) * 2022-08-24 2022-11-11 深圳通联金融网络科技服务有限公司 Voice recognition method, device, medium and equipment based on user speaking emotion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination