CN116501919A - Prompting method, device, equipment and storage medium

Prompting method, device, equipment and storage medium

Info

Publication number
CN116501919A
Authority
CN
China
Prior art keywords
text
target object
audio data
determining
content
Prior art date
Legal status
Pending
Application number
CN202310552144.7A
Other languages
Chinese (zh)
Inventor
Name not published at the inventor's request (请求不公布姓名)
Wang Yong (王勇)
Current Assignee
Hubei Xingji Meizu Technology Co., Ltd.
Original Assignee
Hubei Xingji Meizu Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Hubei Xingji Meizu Technology Co., Ltd.
Priority to CN202310552144.7A
Publication of CN116501919A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval of video data
    • G06F 16/74 Browsing; Visualisation therefor
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval using metadata automatically derived from the content
    • G06F 16/7834 Retrieval using metadata automatically derived from the content, using audio features
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel

Abstract

The application discloses a prompting method, a prompting device, prompting equipment and a storage medium. The method comprises the following steps: acquiring historical audio data of a target object; determining the speech rate text of the target object according to the historical audio data; acquiring real-time audio data of the target object; determining the position information of the text corresponding to the real-time audio data within the speech rate text of the target object; and determining the content to be prompted according to the position information and the speech rate text of the target object, and displaying the content to be prompted through a prompter display screen. With this technical scheme, the content that needs to be prompted can be located quickly within the prerecorded audio according to what the speaker or singer is currently performing, which improves the convenience of the prompter equipment and accurately provides the speaker or singer with personalized prompter content and pacing for any prerecorded material.

Description

Prompting method, device, equipment and storage medium
Technical Field
Embodiments of the present application relate to the technical field of audio processing, and in particular to a prompting method, a prompting device, prompting equipment and a storage medium.
Background
The teleprompter is a common tool in everyday life. For example, singing karaoke requires lyric prompts, and speeches likewise require a prompter. A typical lecture prompter is a display screen: a high-brightness display shows the manuscript and reflects it onto a specially coated glass panel set at 45 degrees in front of the camera lens, so that the lecturer can face the camera while reading the script. Because the prompter is mounted on the same axis as the camera and the tripod, the lecturer appears to be facing the audience at all times, which improves the quality of the lecture. However, every person has a unique speaking pace, and current prompters cannot provide personalized prompting services for either speech or singing.
Disclosure of Invention
In view of this, an embodiment of the present application provides a prompting method, including:
acquiring historical audio data of a target object;
determining the speech rate text of the target object according to the historical audio data;
acquiring real-time audio data of the target object;
determining the position information of the text corresponding to the real-time audio data within the speech rate text of the target object;
and determining the content to be prompted according to the position information and the speech rate text of the target object, and displaying the content to be prompted through a prompter display screen.
The embodiment of the application also provides a prompting device, which comprises:
the first acquisition module is used for acquiring historical audio data of the target object;
the first determining module is used for determining the speech rate text of the target object according to the historical audio data;
the second acquisition module is used for acquiring real-time audio data of the target object;
the second determining module is used for determining the position information of the text corresponding to the real-time audio data within the speech rate text of the target object;
and the acquisition and prompting module is used for determining the content to be prompted according to the position information and the speech rate text of the target object, and displaying the content to be prompted through a prompter display screen.
The embodiment of the application also provides electronic equipment, which comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the prompting method described in any of the embodiments of the present application.
The embodiment of the application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, cause the processor to implement the prompting method according to any embodiment of the application.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of its scope; other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a prompting method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a prompting method in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a prompting device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device implementing the prompting method according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below in detail with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
The prompter is a professional device for prompting a lecturer or performer during a speech or singing performance, and is generally arranged in front of the lecturer or performer. However, the prompter content must be entered manually, and during use a dedicated operator behind the scenes must manually control the display order. In addition, the prompter is bulky and hard to deploy, which makes it inconvenient in some important occasions, such as reporting to superiors. The prior art also offers a caption technology similar to a prompter; the difference is that captions take sound as the input and output the corresponding text in order to inform a viewer, whereas the purpose of the prompter is to help the presenter or singer keep track of the content and rhythm of the speech or song, prompting pace and intonation through which words fade out or are highlighted, which matters greatly in speech, singing and similar scenarios.
The embodiments of the present application provide a prompting method, a prompting device, prompting equipment and a storage medium, so that the content that needs to be prompted is located quickly within prerecorded audio according to what the speaker or singer is currently performing, improving the convenience of the prompter equipment and accurately providing the speaker or singer with personalized prompter content and pacing for any material.
Fig. 1 is a flowchart of a prompting method in an embodiment of the present application, where the embodiment is applicable to the case of prompting, and the method may be performed by a prompting device in the embodiment of the present application, where the device may be implemented in a software and/or hardware manner, as shown in fig. 1, and the method specifically includes the following steps:
s101, acquiring historical audio data of a target object.
In this embodiment, the target object may be a user such as a presenter or singer, and the number of specific target objects may be one or more, which is not limited in this embodiment.
The historical audio data may be pre-recorded audio file data of the target object. It may be, for example, audio of a lecturer reciting speech content such as a poem, or audio of a singer performing song content such as an opera.
Specifically, in one embodiment, the target object may record the historical audio data in advance when performing a lecture or singing, and the recording tool may be any recording device (for example, a recording device of a recording studio or a terminal device such as a smart phone). Historical audio data of the target object recorded by the recording device while the speech or singing is being performed may also be stored in the audio storage device. The historical audio data of the prerecorded target object when the speech or singing is performed can be acquired from the recording device or the audio storage device.
S102, determining the speech rate text of the target object according to the historical audio data.
It should be noted that the speech rate text is text extracted from the audio data together with the speech rate at which the target object speaks or sings it.
Text can be extracted from audio data by any speech-recognition technique. In one embodiment, AI speech recognition (for example, a DNN-HMM deep neural network) may be used to recognize the historical audio data while AI grammar and semantic recognition transcribes the corresponding speech rate text; this simplifies extraction of the prompter content and reduces manual preparation time.
Specifically, in one embodiment, after the pre-recorded historical audio data is obtained from the recording device or the audio storage device, the speech rate text corresponding to the historical audio data is transcribed through AI grammar and semantic recognition and recorded in an lrc file (LRC, short for "lyric", a common extension for lyrics files).
For example, suppose a section of historical audio data contains the two lines of verse "The white sun sets behind the mountains; the Yellow River flows into the sea." The corresponding text in the converted lrc file might be formatted as follows:
[00:00.000] <1050>White <2300>sun <800>behind <500>mountains <2800>sets,
[00:07.450] <1500>Yellow <1200>River <1800>into <750>sea <530>flows.
Here [00:00.000] is the start time of the first line, [00:07.450] is the start time of the second line, and the number before each word is that word's duration. For example, <1050> means the word "White" lasts 1.050 seconds, <2300> means the word "sun" lasts 2.300 seconds, and so on.
In the actual operation, the format of the lrc file may also be represented by time periods divided by intervals. The content of the lrc file may be a song, a lecture, a drama, or the like.
When the lrc file is generated, the spoken content is recognized from the historical audio data and, at the same time, the duration of each word must be recorded separately, so that during the speech each prompt word can change color or be highlighted at the right moment, reminding the user of the progress of the speech.
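Purely as an illustration, and not as part of the application itself, a minimal Python sketch of parsing such a per-word-timed lrc line follows; the exact tag layout is an assumption drawn from the example above, and all names are hypothetical:

```python
import re

# Illustrative parser for the per-word-timed lrc layout sketched above.
# Assumed layout: "[mm:ss.xxx] <dur_ms>word <dur_ms>word ..." where each
# <dur_ms> tag gives the duration in milliseconds of the word that follows.

LINE_TIME = re.compile(r"\[(\d+):(\d+)\.(\d+)\]")
WORD_TAG = re.compile(r"<(\d+)>\s*([^<]+?)(?=\s*<|\s*$)")

def parse_lrc_line(line: str):
    """Return (start_seconds, [(word, duration_seconds), ...]) for one line."""
    m = LINE_TIME.match(line)
    if m is None:
        raise ValueError("missing [mm:ss.xxx] start-time tag")
    minutes, seconds, millis = (int(g) for g in m.groups())
    start = minutes * 60 + seconds + millis / 1000.0
    words = [(word.strip(" ,."), int(dur) / 1000.0)
             for dur, word in WORD_TAG.findall(line[m.end():])]
    return start, words

start, words = parse_lrc_line(
    "[00:00.000] <1050>White <2300>sun <800>behind <500>mountains <2800>sets,")
print(start)  # 0.0
print(words)  # [('White', 1.05), ('sun', 2.3), ('behind', 0.8), ...]
```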
S103, acquiring real-time audio data of the target object.
In the embodiment of the present application, the real-time audio data stands in contrast to the historical audio data: it is audio file data of the target object recorded in real time. Like the historical audio data, it may be audio of a lecturer reciting speech content such as a poem, or audio of a singer performing song content such as an opera.
In particular, in one embodiment, real-time audio data may be collected while the target object is speaking or singing; the collection tool may be, for example, AR glasses or another microphone-equipped device worn by the target object, and the real-time audio data may be obtained from that device.
S104, determining the position information of the text corresponding to the real-time audio data within the speech rate text of the target object.
In various embodiments, the text corresponding to the real-time audio data of the target object may be compared with the speech rate text of the target object, so as to obtain the position information of the text corresponding to the real-time audio data in the speech rate text of the target object.
Specifically, in one embodiment, the recording device or the audio storage device synchronizes the speech rate text of the target object to the AR glasses worn by the target object, and the AR glasses compare the text corresponding to the target object's real-time audio data with the speech rate text of the target object to find the position of that text within the speech rate text.
For example, if the speech rate text of the target object is "<1050>White <2300>sun <800>behind <500>mountains <2800>sets" and the text corresponding to the real-time audio data is "White sun", then the position information of that text within the speech rate text of the target object is the position of the first two words.
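As an illustration only, this alignment step can be sketched as a naive sliding-window comparison over word lists; a real system would likely use fuzzy matching to tolerate recognition errors, and all names here are hypothetical:

```python
def locate(recognized_words, script_words):
    """Return the index just past where recognized_words matches inside
    script_words, or None if there is no match.

    A naive exact sliding-window comparison; a production system would
    likely use fuzzy alignment to tolerate recognition errors."""
    n = len(recognized_words)
    for i in range(len(script_words) - n + 1):
        if script_words[i:i + n] == recognized_words:
            return i + n  # prompting resumes right after the matched words
    return None

script = ["White", "sun", "behind", "mountains", "sets"]
print(locate(["White", "sun"], script))  # 2 -> the first two words matched
```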
S105, determining the content to be prompted according to the position information and the speech rate text of the target object, and displaying the content to be prompted through the prompter display screen.
It should be explained that the content to be prompted may be content to be displayed on the prompter display screen.
In one embodiment of the present application, the prompter display may be a wearable device worn by the target object, for example AR glasses.
AR (augmented reality) is a technology that calculates the position and angle of the camera image in real time and adds corresponding images, video and three-dimensional models: real-environment information is captured by the camera, virtual projected objects (such as images, scenes or system prompt information) are superimposed onto it, and the result is displayed to the user, so that the user perceives the virtual objects as existing in the real environment, i.e., the real scene is "augmented".
Specifically, in one embodiment, the wearable device worn by the target object, for example AR glasses, determines the content to be prompted according to the speech rate text of the target object and the position information of the text corresponding to the real-time audio data within that speech rate text. For example, the speech rate text of the target object may be "The white sun sets behind the mountains; the Yellow River flows into the sea", and the position information of the text corresponding to the real-time audio data may be the first two words, namely "White sun"; the content to be prompted is then the content following "White sun", for example "sets behind the mountains" or "sets behind the mountains; the Yellow River", and the number of prompted words may be preset by the user. After the content to be prompted is determined, it is displayed on the display screen of the AR glasses worn by the target object, accurately providing the presenter or singer with prompter information for any content. This improves the convenience of the prompter equipment; moreover, because the prompter content is projected on the inner screen of the AR glasses and cannot be viewed from outside, privacy of use is improved and the presenter's privacy is protected.
In actual operation, when there are multiple target objects, i.e., multiple users speaking or singing at the same time, one embodiment records the sound of all target objects into a single piece of audio data and displays it on the prompter display of each target: every user can then view the speech or singing content of everyone at once (the text belonging to each user can be distinguished by different colors or other effects), and a user who wants to see only his or her own part can select only that part for display. In another embodiment, each target object's own speech or singing content is made into separate audio data and then displayed on the respective prompter display.
In summary, this application obtains historical audio data of a target object, determines the speech rate text of the target object from that data, obtains real-time audio data of the target object, determines the position information of the text corresponding to the real-time audio data within the speech rate text, determines the content to be prompted according to that position information and the speech rate text, and displays the content to be prompted through a prompter display screen. With this technical scheme, the content that needs to be prompted can be located quickly within the prerecorded audio according to what the speaker or singer is currently performing, improving the convenience of the prompter equipment and accurately providing the speaker or singer with personalized prompter content and pacing for any prerecorded material.
In some embodiments, determining content to be prompted according to the location information and the speech rate text of the target object includes:
and acquiring the original text corresponding to the historical audio data.
It should be noted that the original text is the accurate text, free of misrecognized, extra or missing words, that the presenter or singer spoke or sang when the historical audio data was collected.
For example, the original text corresponding to the historical audio data may be text content such as a lecture of a lecturer or text content such as lyrics of a singer.
Then the content to be prompted is determined according to the position information, the original text and the speech rate text of the target object.
In one embodiment, when the historical audio data and/or the real-time audio data undergo speech, grammar and semantic recognition, recognition errors such as wrong, extra or missing words may occur. For example, the original text corresponding to the historical audio data may be "The white sun sets behind the mountains", but recognition may yield "The hundred suns set behind the mountains" (in Chinese, 白 "white" and 百 "hundred" are near-homophones). In that case, the most suitable position can still be found in the original text corresponding to the historical audio data based on the recognition result of the real-time audio data, and the content to be prompted determined from it.
Specifically, the content to be prompted is determined according to the position information of the text corresponding to the real-time audio data within the speech rate text of the target object, the original text, and the speech rate text of the target object.
In some embodiments, determining the speech rate text of the target object from the historical audio data includes:
audio amplitude information and audio frequency information of the historical audio data are acquired.
Specifically, amplitude information and frequency information of an audio signal in prerecorded historical audio data are obtained.
And identifying the audio amplitude information and the audio frequency information to obtain the text and speech rate information corresponding to the historical audio data.
In one embodiment, an AI voice recognition technology (such as DNN-HMM deep neural network technology) can be adopted to perform voice recognition on the historical audio data, and characters and speech speed information corresponding to the historical audio data are transcribed through AI grammar and semantic recognition, so that the difficulty of extracting the content of the prompter can be simplified, and the time of manual production is reduced. But implementation of the present solution is not limited to AI speech, grammar and semantic recognition.
In one embodiment, pre-recorded historical audio data is obtained; sentence breaks, word-interval positions and the corresponding times are identified from the amplitude and frequency changes of the audio signal in the historical audio data and recorded in an lrc file, and the duration of each sentence, and even of each word or character, is detected by analyzing the amplitude changes of the human voice in the historical audio data.
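Purely as an illustrative sketch of this amplitude-based segmentation idea, assuming a normalized floating-point sample array; the threshold, frame size and minimum-pause values are this sketch's own assumptions, not values from the application:

```python
import numpy as np

def find_pauses(samples, rate, frame_ms=20, silence_db=-40.0, min_pause_ms=300):
    """Return (start_s, end_s) spans where short-time energy stays below a
    silence threshold; such spans approximate sentence and word boundaries."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    # Short-time RMS amplitude per frame, expressed in dB.
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    silent = db < silence_db

    pauses, start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and start is None:
            start = i
        elif not is_silent and start is not None:
            if (i - start) * frame_ms >= min_pause_ms:
                pauses.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None and (n - start) * frame_ms >= min_pause_ms:
        pauses.append((start * frame_ms / 1000.0, n * frame_ms / 1000.0))
    return pauses
```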
Then the speech rate text of the target object is determined according to the text and speech rate information corresponding to the historical audio data.
Specifically, the speech rate text of the target object can be determined from the text corresponding to the historical audio data together with information such as sentence breaks, word-interval positions and the corresponding times.
In some embodiments, the speech rate information includes: at least one of a duration corresponding to each word, a duration corresponding to each sentence, and a start time corresponding to each sentence.
For example, the time interval between two sentences in the historical audio data may be determined from the start time and duration of each sentence. Specifically, the end time of a sentence is the sum of its start time and its duration, and the interval between that sentence and the next is the difference between the next sentence's start time and this end time. In one embodiment, the prompting mode corresponding to the interval between two sentences may be preset: for example, a progress bar, or a row of dots whose color or size changes gradually as time elapses. In another embodiment, any other prompting mode capable of representing the passage of time may be used; this embodiment places no limit on it. The prompting mode corresponding to the interval between sentences is obtained; the prompting mode corresponding to each word in the historical audio data is determined from that word's duration; the prompting mode corresponding to each sentence is determined from the prompting modes of its words; and the prompting mode corresponding to the historical audio data as a whole is determined from the sentence prompting modes together with the prompting modes for the intervals between sentences.
For example, in one embodiment, when a lecturer or singer performs, there may be an intro (pre-music) or a blank period before the first sentence of the speech or song. The start time of the historical audio data is obtained, and this time interval is determined from the start time of the historical audio data and the start time of the speech rate text. The prompting mode corresponding to this interval may be preset, for example a progress bar or a row of dots changing color or size with the passage of time, or any other prompting mode capable of representing time progress; this embodiment places no limit on it. The prompting mode corresponding to the content to be prompted is then determined from the prompting mode for this interval, the prompting mode for each sentence in the historical audio data, and the prompting mode for the interval between each pair of sentences.
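As an illustrative sketch of the timing arithmetic just described (end time = start time + duration; interval = next start time minus end time), with all names hypothetical:

```python
def sentence_intervals(sentences):
    """sentences: [(start_s, duration_s), ...] taken from the lrc timing.

    Implements end_i = start_i + duration_i and
    interval_i = start_{i+1} - end_i; the interval before the first
    sentence covers any pre-music or blank lead-in (audio starts at t=0)."""
    intervals, prev_end = [], 0.0
    for start, duration in sentences:
        intervals.append(start - prev_end)
        prev_end = start + duration
    return intervals

# 2 s of pre-music, then a 7 s line, 1 s of silence, then a 6 s line.
print(sentence_intervals([(2.0, 7.0), (10.0, 6.0)]))  # [2.0, 1.0]
```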
In some embodiments, determining the content to be prompted according to the location information, the original text and the speech rate text of the target object includes:
the text to be pronounced is determined from the location information and the original text.
The text to be pronounced may be the text that a target object such as a presenter or singer is about to utter while speaking or singing.
In one embodiment, the original text may be "The white sun sets behind the mountains; the Yellow River flows into the sea", and the target object such as a presenter or singer has just uttered "White sun", whose position is that of the first two words of the original text. The words to be pronounced are then the words following "White sun", for example "sets behind the mountains" or "sets behind the mountains; the Yellow River"; the number of prompted words may be preset by the user, and this embodiment does not limit it.
Then the prompting mode of the words to be pronounced is determined according to the speech rate text of the target object.
Specifically, after the text to be pronounced is determined, its prompting mode can be determined according to the speech rate text of the target object.
In some embodiments, the prompting means includes: the text to be pronounced is prompted with at least one of a cursor, a highlight, and a ticker.
In one embodiment, the prompting mode corresponding to the text to be pronounced may be, for example, a cursor; in another embodiment it may be, for example, highlighted scrolling; it may also be a ticker-style display, a change of font weight, or the like, and this embodiment does not limit it.
In some embodiments, determining the prompting mode of the words to be pronounced according to the speech rate text of the target object includes:
determining the duration corresponding to each word in the words to be pronounced and the interval time between adjacent words according to the speech rate text of the target object.
In one embodiment, the speech rate text corresponding to the target object is obtained when the historical audio data is recognized; it contains the duration of each word and the interval between adjacent words, and the durations of the words to be pronounced and the intervals between them are read from the speech rate text of the target object.
And determining the prompting mode of the word to be pronounced according to the duration corresponding to each word in the word to be pronounced and the interval time between adjacent words.
In one embodiment, the manner of prompting the words to be pronounced on the prompter display screen can be determined from the duration of each word to be pronounced and the interval between adjacent words. In actual operation, the duration of each word in the historical audio data may differ, so the prompting mode of each word to be pronounced may differ as well. For example, when the prompting mode is highlighted scrolling, the highlight duration of each word to be pronounced also differs: the highlight duration of a word may be set to that word's recorded duration, and the prompting mode for the whole text to be pronounced is finally determined from the prompting modes of its individual words.
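As an illustration of building such a per-word prompting schedule, a minimal sketch follows; the (word, duration, gap) layout mirrors the per-word durations and inter-word intervals described above, and all names are hypothetical:

```python
def highlight_schedule(words, line_start=0.0):
    """words: [(word, duration_s, gap_after_s), ...] from the speech rate text.

    Each word is highlighted for exactly its recorded duration, and the
    gap after it delays the next highlight, so the prompt follows the
    speaker's own recorded pace."""
    schedule, t = [], line_start
    for word, duration, gap in words:
        schedule.append({"word": word, "on": round(t, 3), "off": round(t + duration, 3)})
        t += duration + gap
    return schedule

for entry in highlight_schedule([("White", 1.05, 0.0), ("sun", 2.30, 0.2)]):
    print(entry)  # {'word': 'White', 'on': 0.0, 'off': 1.05} ...
```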
In some embodiments, after displaying the content to be prompted through the prompter display screen, the method further comprises:
and receiving a control instruction for the current display content.
In this embodiment, the control instruction may be an instruction issued by the user to adjust the content being displayed on the display screen of the wearable device. In one embodiment, the control instruction may be an instruction to adjust the playing speed of the content being displayed on the display screen of the wearable device, and in another embodiment, the control instruction may also be an instruction to adjust the playing progress of the content being displayed on the display screen of the wearable device.
The current display content may be content being displayed on a display screen of the wearable device worn by the target object when the control instruction is received.
Specifically, the user may issue the control instruction directly on the wearable device: in one embodiment via a control key on the wearable device or a virtual key on its display screen, and in another embodiment via gesture recognition or the like. The user may also send the control instruction through a terminal device such as a smartphone; in actual operation, the instruction can be transmitted to the wearable device worn by the target object over one or more widely used interconnection technologies, for example WiFi, Bluetooth or a cloud service, which this embodiment does not limit.
Then the current display content is updated according to the control instruction, the current display content and the speech rate text of the target object.
Specifically, after the wearable device receives a control instruction for the current display content, the current display content on the prompter display screen is updated according to the control instruction, the content currently displayed and the speech rate text of the target object. In one embodiment the playing speed may be updated, for example; in another embodiment the playing progress may be updated.
In one embodiment, updating the current display content according to the control instruction, the current display content and the speech rate text of the target object includes:
determining that the control instruction is a play speed adjustment instruction.
It should be noted that the play speed adjustment instruction may be an instruction issued by the user to adjust the play speed of the content being displayed on the display screen of the wearable device.
Specifically, whether a control instruction sent by a user is a play speed adjustment instruction is detected.
And acquiring a speed adjustment parameter corresponding to the play speed adjustment instruction and the current play speed.
In this embodiment, the speed adjustment parameter may be a multiple of the playing speed of the current display content corresponding to the display screen of the wearable device, for example, may be 0.5 times, 0.75 times, 1.5 times, or 2 times. Specifically, the speed adjustment parameter may be preset, and the user may select the playing speed to be adjusted.
The current playing speed may be a playing speed of a current display content corresponding to a display screen of the wearable device worn by the target object when the playing speed adjustment instruction is received.
Specifically, if the received control instruction is a play speed adjustment instruction, a speed adjustment parameter selected by a user and corresponding to the play speed adjustment instruction and a current play speed of current display content corresponding to a display screen of wearable equipment worn by the target object are obtained.
And determining the target playing speed according to the speed adjusting parameter and the current playing speed.
The target playing speed may be a playing speed after adjusting content displayed on a display screen of a wearable device worn by the target object.
Specifically, the result of the product of the current playing speed and the speed adjustment parameter may be taken as the target playing speed. For example, the current playing speed of the current display content corresponding to the display screen of the wearable device worn by the target object is 1 time speed, and the speed adjustment parameter corresponding to the playing speed adjustment instruction is 0.5 time, so that the target playing speed may be 1×0.5=0.5 time speed.
Then the current display content is updated according to the target playing speed, the current display content and the speech rate text of the target object.
Specifically, the current display content corresponding to the display screen of the wearable device worn by the target object is updated to the word to be pronounced, which is played according to the target playing speed, and the word to be pronounced is prompted according to the prompting mode corresponding to the word to be pronounced.
By setting the speed adjustment parameters corresponding to the play speed adjustment instruction, the user can control the acceleration or deceleration prompt of the prompter according to the actual situation, so that the prompter can conveniently and accurately provide prompter information of any content for a presenter or singer.
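As an illustrative sketch of the speed arithmetic described above (target playing speed = current playing speed multiplied by the speed adjustment parameter), applied to a per-word highlight schedule; all names are hypothetical:

```python
def adjust_speed(schedule, current_speed, speed_factor):
    """target_speed = current_speed * speed_factor (e.g. 1.0 * 0.5 = 0.5x).

    Highlight times stretch by 1/speed_factor: at 0.5x everything takes
    twice as long, at 2x half as long."""
    target_speed = current_speed * speed_factor
    scale = current_speed / target_speed
    rescaled = [{"word": e["word"], "on": e["on"] * scale, "off": e["off"] * scale}
                for e in schedule]
    return target_speed, rescaled

speed, plan = adjust_speed([{"word": "White", "on": 0.0, "off": 1.05}], 1.0, 0.5)
print(speed)  # 0.5
print(plan)   # [{'word': 'White', 'on': 0.0, 'off': 2.1}]
```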
In another embodiment, updating the current display content according to the control instruction, the current display content and the speech rate text of the target object includes:
determining that the control instruction is a play progress adjustment instruction.
It should be noted that the play progress adjustment instruction may be an instruction issued by the user to adjust the play progress of the content being displayed on the display screen of the wearable device. For example, the play progress adjustment instruction may be to play a previous sentence or play a next sentence, etc., and the play progress adjustment options such as playing the previous sentence or playing the next sentence may be preset, so that the user may select the play progress to be adjusted.
Specifically, whether a control instruction sent by a user is a play progress adjustment instruction is detected.
If the control instruction is a play progress adjustment instruction, the text to be prompted is determined according to the play progress adjustment instruction, the current display content and the speech rate text of the target object.
The speech rate text of the target object comprises the text to be prompted; that is, the text to be prompted is text included in the speech rate text of the target object.
It should be explained that the text to be prompted may be the text to be prompted after the display screen of the wearable device worn by the target object is updated.
Specifically, if the received control instruction is a play progress adjustment instruction, the current display content corresponding to the display screen of the wearable device is updated to the text to be prompted according to the play progress adjustment instruction and the speech rate text of the target object. For example, if the current display content is "The white sun sets behind the mountains" and the user selects playing the next sentence, the determined play progress adjustment instruction is "play the next sentence", and the text to be prompted on the display screen of the wearable device is determined to be "the Yellow River flows into the sea".
And updating the current display content of the display screen of the wearable device worn by the target object to a text to be prompted.
Specifically, the current display content of the display screen of the wearable device worn by the target object is updated to be the text to be prompted.
By setting the play progress adjusting instruction, the user can control to return to the previous sentence or skip one sentence according to the actual situation, so that the method is convenient and accurate to provide the speeches information of any content for the lecturer or singer.
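A minimal illustrative sketch of such sentence-level progress control; the instruction strings and class name are hypothetical:

```python
class PrompterState:
    """Sentence-level progress control: "previous sentence" returns to the
    prior line, "next sentence" skips ahead, clamped to the script bounds."""

    def __init__(self, sentences):
        self.sentences = sentences  # speech rate text split into sentences
        self.index = 0              # sentence currently displayed

    def handle(self, instruction):
        if instruction == "next sentence":
            self.index = min(self.index + 1, len(self.sentences) - 1)
        elif instruction == "previous sentence":
            self.index = max(self.index - 1, 0)
        return self.sentences[self.index]

state = PrompterState(["The white sun sets behind the mountains,",
                       "the Yellow River flows into the sea."])
print(state.handle("next sentence"))  # the Yellow River flows into the sea.
```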
As an exemplary description of an embodiment of the present application, fig. 2 is a schematic diagram of a prompting method in an embodiment of the present application. As shown in fig. 2, the content displayed on the display screen of the wearable AR glasses worn by the target object is "The white sun sets behind the mountains; the Yellow River flows into the sea". The current progress shows that "The white sun" has already been spoken and the next words to be spoken are "sets behind the mountains"; words already spoken are shown in a heavier font and words not yet spoken in a lighter font, prompting the user accurately and intuitively. Below the display screen are buttons for 0.5x slower, previous sentence, resume/pause, next sentence and 0.5x faster, with which the user can control the playing speed or progress and play or pause according to the actual situation. With this scheme the user can speed up or slow down the prompter, return to the previous sentence, skip a sentence and so on as needed; the wearable AR glasses accurately provide the presenter or singer with prompter information for any content, improving the convenience of the prompter, and since the prompter content is projected on the inner screen of the AR glasses and cannot be viewed from outside, privacy of use is improved and the presenter's privacy is protected.
According to the technical scheme of this embodiment, the duration of each sentence, and even of each word, is detected by analyzing the amplitude and frequency changes of the voice in the audio, the corresponding text is transcribed through grammar and semantic recognition, and after recognition the resulting text and timing information are assembled into an lrc file, simplifying preparation of the prompter content and reducing manual production time. During the formal speech or performance, the user can synchronize the lrc file to the AR glasses through a terminal device such as a smartphone, which solves the problem of the prompter being bulky and hard to deploy. The AR glasses prompt the text content on their display screen in synchrony with the timing information in the lrc file, preventing the prompter content from leaking or being noticed by others; during display the text fades out or is highlighted automatically at the recorded pace and in the recorded order, making it easier to prompt the user to adjust the speed and intonation of the speech or song, and the user can control the prompting speed and progress through a mobile phone or the AR glasses themselves.
Example 2
Fig. 3 is a schematic structural diagram of a prompting device in an embodiment of the present application. The embodiment may be applicable to the case of prompting, and the device may be implemented in a software and/or hardware manner, and may be integrated in any device that provides a prompting function, as shown in fig. 3, where the prompting device specifically includes: a first acquisition module 201, a first determination module 202, a second acquisition module 203, a second determination module 204, and an acquisition and prompt module 205.
Wherein, the first obtaining module 201 is configured to obtain historical audio data of a target object;
a first determining module 202, configured to determine a speech rate text of the target object according to the historical audio data;
a second obtaining module 203, configured to obtain real-time audio data of the target object;
a second determining module 204, configured to determine the position information of the text corresponding to the real-time audio data within the speech rate text of the target object;
the acquiring and prompting module 205 is configured to determine, according to the location information and the speech rate text of the target object, a content to be prompted, and display the content to be prompted through a prompter display screen.
In some embodiments, the acquisition and prompting module 205 includes:
the first acquisition unit is used for acquiring an original text corresponding to the historical audio data;
and the first determining unit is used for determining the content to be prompted according to the position information, the original text and the speech rate text of the target object.
In some embodiments, the first determining module 202 includes:
a second acquisition unit configured to acquire audio amplitude information and audio frequency information of the historical audio data;
the identification unit is used for identifying the audio amplitude information and the audio frequency information to obtain the text and speech rate information corresponding to the historical audio data;
and the second determining unit is used for determining the speech rate text of the target object according to the text and the speech rate information corresponding to the historical audio data.
In some embodiments, the speech rate information includes: at least one of a duration corresponding to each word, a duration corresponding to each sentence, and a start time corresponding to each sentence.
In some embodiments, the first determining unit comprises:
a first sub-determining unit for determining a word to be uttered according to the position information and the original text;
and the second sub-determining unit is used for determining the prompting mode of the words to be pronounced according to the speech rate text of the target object.
In some embodiments, the prompting means includes: the text to be pronounced is prompted with at least one of a cursor, a highlight, and a ticker.
In some embodiments, the second sub-determination unit is specifically configured to:
determining the duration time corresponding to each word in the words to be pronounced and the interval time between adjacent words according to the speech rate text of the target object;
And determining the prompting mode of the word to be pronounced according to the duration time corresponding to each word in the word to be pronounced and the interval time between adjacent words.
In some embodiments, the prompting device further comprises:
the receiving module is used for receiving a control instruction for the current display content after the content to be prompted is displayed through the prompter display screen;
the updating module is configured to update the current display content according to the control instruction, the current display content and the speech speed text of the target object after the content to be prompted is displayed through the prompter display screen, and includes:
determining that the control instruction is a play speed adjustment instruction;
acquiring a speed adjustment parameter and a current playing speed corresponding to the play speed adjustment instruction;
determining a target playing speed according to the speed adjustment parameter and the current playing speed;
and updating the current display content according to the target playing speed, the current display content and the speech rate text of the target object.
The product can execute the prompting method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of executing the prompting method.
Example 3
Fig. 4 shows a schematic diagram of the structure of an electronic device 30 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
In some embodiments, the electronic device is a wearable device worn by the target object, and the prompter display is a screen generated by an optical engine of the wearable device. In one embodiment, the electronic device may be, for example, a wearable device such as head-mounted AR glasses, and the prompter display screen may be, for example, a screen generated by the AR light engine (i.e., a projection device integrating display and optics) of such a device.
As shown in fig. 4, the electronic device 30 includes at least one processor 31, and a memory, such as a Read Only Memory (ROM) 32, a Random Access Memory (RAM) 33, etc., communicatively connected to the at least one processor 31, wherein the memory stores a computer program executable by the at least one processor, and the processor 31 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 32 or the computer program loaded from the storage unit 38 into the Random Access Memory (RAM) 33. In the RAM 33, various programs and data required for the operation of the electronic device 30 may also be stored. The processor 31, the ROM 32 and the RAM 33 are connected to each other via a bus 34. An input/output (I/O) interface 35 is also connected to bus 34.
Various components in electronic device 30 are connected to I/O interface 35, including: an input unit 36 such as a keyboard, a mouse, etc.; an output unit 37 such as various types of displays, speakers, and the like; a storage unit 38 such as a magnetic disk, an optical disk, or the like; and a communication unit 39 such as a network card, modem, wireless communication transceiver, etc. The communication unit 39 allows the electronic device 30 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 31 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 31 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller or microcontroller. The processor 31 performs the various methods and processes described above, such as the prompting method:
acquiring historical audio data of a target object;
determining the speech rate text of the target object according to the historical audio data;
acquiring real-time audio data of the target object;
determining the position information of the text corresponding to the real-time audio data within the speech rate text of the target object;
and determining the content to be prompted according to the position information and the speech rate text of the target object, and displaying the content to be prompted through a prompter display screen.
In some embodiments, the prompting method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 38. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 30 via the ROM 32 and/or the communication unit 39. When the computer program is loaded into RAM 33 and executed by processor 31, one or more steps of the prompting method described above may be performed. Alternatively, in other embodiments, the processor 31 may be configured to perform the prompting method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may be implemented in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions of the present application are achieved; the present application is not limited in this respect.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method of prompting, comprising:
acquiring historical audio data of a target object;
determining the speech rate text of the target object according to the historical audio data;
acquiring real-time audio data of the target object;
determining the position information of the text corresponding to the real-time audio data in the speech rate text of the target object;
and determining the content to be prompted according to the position information and the speech rate text of the target object, and displaying the content to be prompted through a prompter display screen.
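By way of illustration only, the flow of claim 1 might be sketched as follows. The sketch assumes a speech recognizer (not specified in the claims) has already produced word-level timestamps; the names TimedWord, locate_in_text, and content_to_prompt are invented for this example and do not come from the application.

    from dataclasses import dataclass

    @dataclass
    class TimedWord:
        word: str
        start: float     # seconds into the historical audio
        duration: float  # how long the word was held

    def locate_in_text(rate_text, live_words):
        # Position information: find where the words just heard end
        # inside the speech rate text built from the historical audio.
        words = [t.word for t in rate_text]
        n = len(live_words)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == live_words:
                return i + n
        return 0

    def content_to_prompt(rate_text, position, window=5):
        # Content to be prompted: the next `window` words after the
        # located position, shown on the prompter display screen.
        return [t.word for t in rate_text[position:position + window]]

    # Speech rate text built once from the historical audio (stub data).
    rate_text = [TimedWord(w, i * 0.4, 0.3) for i, w in
                 enumerate("row row row your boat gently down the stream".split())]

    # Words recognized from the latest chunk of real-time audio (stub).
    live_words = ["row", "your", "boat"]

    position = locate_in_text(rate_text, live_words)
    print(content_to_prompt(rate_text, position))  # ['gently', 'down', 'the', 'stream']
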
2. The method of claim 1, wherein determining the content to be prompted according to the position information and the speech rate text of the target object comprises:
acquiring an original text corresponding to the historical audio data;
and determining the content to be prompted according to the position information, the original text and the speech rate text of the target object.
3. The method of claim 2, wherein determining the speech rate text of the target object from the historical audio data comprises:
acquiring audio amplitude information and audio frequency information of the historical audio data;
identifying the audio amplitude information and the audio frequency information to obtain text and speech rate information corresponding to the historical audio data;
and determining the speech rate text of the target object according to the text and the speech rate information corresponding to the historical audio data.
4. The method of claim 3, wherein the speech rate information comprises: at least one of a duration corresponding to each word, a duration corresponding to each sentence, and a start time corresponding to each sentence.
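As a non-authoritative sketch of claims 3 and 4, assuming an upstream analyzer has already converted the audio amplitude and frequency information into (word, start, duration) tuples plus sentence boundaries (that conversion is left to a recognizer and not shown), the three enumerated kinds of speech rate information might be assembled like this:

    def speech_rate_info(timed_words, sentence_starts):
        # timed_words: (word, start_seconds, duration_seconds) tuples.
        # sentence_starts: indices into timed_words where sentences begin.
        word_durations = [d for (_, _, d) in timed_words]
        sentence_start_times = [timed_words[i][1] for i in sentence_starts]
        bounds = list(sentence_starts) + [len(timed_words)]
        sentence_durations = []
        for a, b in zip(bounds, bounds[1:]):
            last_word = timed_words[b - 1]
            sentence_durations.append(last_word[1] + last_word[2] - timed_words[a][1])
        return word_durations, sentence_durations, sentence_start_times

    words = [("hello", 0.0, 0.3), ("there", 0.4, 0.3), ("friends", 1.2, 0.5)]
    print(speech_rate_info(words, [0, 2]))
    # -> ([0.3, 0.3, 0.5], [0.7, 0.5], [0.0, 1.2])
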
5. The method of claim 2, wherein determining the content to be prompted according to the position information, the original text and the speech rate text of the target object comprises:
determining the text to be pronounced according to the position information and the original text;
and determining the prompting mode of the text to be pronounced according to the speech rate text of the target object.
6. The method of claim 5, wherein the prompting mode comprises: prompting the text to be pronounced with at least one of a cursor, a highlight, and a ticker.
7. The method of claim 5, wherein determining the prompting mode of the text to be pronounced according to the speech rate text of the target object comprises:
determining the duration corresponding to each word in the text to be pronounced and the interval time between adjacent words according to the speech rate text of the target object;
and determining the prompting mode of the text to be pronounced according to the duration corresponding to each word and the interval time between adjacent words.
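One way the decision of claim 7 could look in code, under our own assumption (not the application's) that short inter-word gaps suit a sweeping highlight, long sustained words suit a ticker, and everything else suits a word-by-word cursor:

    def prompting_mode(word_durations, word_gaps, gap_threshold=0.25):
        # word_gaps: seconds of silence between adjacent words, taken
        # from the speech rate text of the target object.
        avg_gap = sum(word_gaps) / len(word_gaps) if word_gaps else 0.0
        avg_duration = sum(word_durations) / len(word_durations)
        if avg_gap < gap_threshold:
            return "highlight"  # fluid delivery: sweep a highlight
        if avg_duration > 0.6:
            return "ticker"     # slow, sustained words: scroll the text
        return "cursor"         # otherwise step a cursor word by word

    print(prompting_mode([0.3, 0.3, 0.4], [0.1, 0.15]))  # -> highlight
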
8. The method of claim 1, further comprising:
receiving a control instruction for the current display content;
updating the current display content according to the control instruction, the current display content and the speech rate text of the target object, including:
determining that the control instruction is a play speed adjustment instruction;
acquiring a speed adjustment parameter and a current playing speed corresponding to the play speed adjustment instruction;
determining a target playing speed according to the speed adjustment parameter and the current playing speed;
and updating the current display content according to the target playing speed, the current display content and the speech rate text of the target object.
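A minimal sketch of the speed adjustment in claim 8, assuming (our assumption; the claims do not fix the parameter's form) that the control instruction has been parsed into a multiplicative speed parameter:

    def apply_speed_adjustment(display_times, current_speed, speed_param):
        # display_times: seconds at which each upcoming word is scheduled
        # to appear, derived from the speech rate text of the target object.
        target_speed = current_speed * speed_param   # e.g. 1.0 * 1.25
        scale = current_speed / target_speed
        return target_speed, [t * scale for t in display_times]

    speed, times = apply_speed_adjustment([1.0, 2.0, 3.5], 1.0, 1.25)
    print(speed, times)  # -> 1.25 and roughly [0.8, 1.6, 2.8], up to float rounding
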
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the prompting method of any one of claims 1-8.
10. The electronic device of claim 9, wherein the electronic device is a wearable device worn by the target object and the prompter display screen is a screen generated by an optical engine of the wearable device.
CN202310552144.7A 2023-05-16 2023-05-16 Prompting method, device, equipment and storage medium Pending CN116501919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310552144.7A CN116501919A (en) 2023-05-16 2023-05-16 Prompting method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116501919A true CN116501919A (en) 2023-07-28

Family

ID=87326602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310552144.7A Pending CN116501919A (en) 2023-05-16 2023-05-16 Prompting method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116501919A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination