CN115550707A - Method and device for synchronizing voice information and presentation information

Method and device for synchronizing voice information and presentation information

Info

Publication number
CN115550707A
Authority
CN
China
Prior art keywords
information
demonstration
presentation
voice
blocks
Legal status
Pending
Application number
CN202110726144.5A
Other languages
Chinese (zh)
Inventor
夏海荣
伏煦
郭伟胜
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202110726144.5A
Priority to PCT/CN2022/094711 (WO2023273702A1)
Publication of CN115550707A


Classifications

    • H04N21/43072 Synchronising the rendering of multiple content streams or additional data on the same device
    • G06F40/279 Natural language analysis; recognition of textual entities
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations; client middleware
    • H04N21/431 Generation of visual interfaces for content selection or interaction; content or additional data rendering
    • H04N21/4312 Content rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H04N21/485 End-user interface for client configuration
    • H04N21/4858 End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows

Abstract

The application discloses a method and a device for synchronizing voice information and presentation information. In the method, the presentation information in a plurality of presentation information blocks in a presentation file is obtained; a first voice of a user is received and converted to obtain a first voice text; the first voice text is matched with the presentation information corresponding to the plurality of presentation information blocks, and the first presentation information block successfully matched with the first voice text is determined; the first presentation information block is then highlighted. The method determines, from the speaker's current lecture content, the presentation information block corresponding to that content and highlights it, so that the audience can easily tell what the speaker is currently explaining. At the same time, the speaker's burden is greatly reduced: the current content does not need to be indicated through body language or teaching aids during the lecture, and the speaker's production of the presentation file is also simplified.

Description

Method and device for synchronizing voice information and presentation information
Technical Field
The present application relates to the field of computers, and in particular to a method and an apparatus for synchronizing voice information and presentation information.
Background
An oral lecture conveys information to an audience through speech. Compared with a written report, an oral lecture lets the speaker improve information delivery through presentation technique. When the lecture content is complex, the presenter typically prepares a presentation file (e.g., a slide show or another visual object, such as an electronic document or a publication) to prompt both the presenter and the audience with the important information.
During the presentation, the audience needs to follow the speaker's voice and instructions to locate and track content in the presentation file. Taking common slides as an example, a slide presentation file may include a plurality of pages, and each page may include a brief title together with specific content explaining that title, such as text paragraphs, figures, tables, and formulas. While presenting the slides, the speaker usually indicates the position of the current lecture content to the audience through body language, a mouse, a laser pointer, a pointing stick, and the like. To further improve the audience's experience, the presenter can also prepare slide animation effects in advance, such as highlighting, shaking, or zooming, so that the audience can easily determine what the presenter is currently explaining.
Whether the lecturer indicates the current lecture content through body language or teaching aids, or prompts it by configuring display effects of the presentation file in advance, the process is tedious and burdensome for the lecturer.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for synchronizing voice information and presentation information, which synchronously highlight the presentation information block matched with the current lecture content according to the voice information of the speaker.
In a first aspect, an embodiment of the present application provides a method for synchronizing voice information and presentation information, including: acquiring presentation information in a plurality of presentation information blocks in a presentation file; receiving a first voice of a user, and converting the first voice to obtain a first voice text; matching the first voice text with the presentation information corresponding to the plurality of presentation information blocks, and determining a first presentation information block successfully matched with the first voice text; and highlighting the first presentation information block.
In the above embodiment, the presentation information of each presentation information block in the presentation file is read; after the speaker's voice information is received, it is recognized to obtain a voice text; the voice text is matched with the presentation information to determine the presentation information block corresponding to the speaker's current lecture content, and that block is highlighted, so that the audience can easily determine what the speaker is currently explaining. At the same time, the speaker's burden is greatly reduced: the current content does not need to be indicated through body language or teaching aids during the lecture, and the speaker's production of the presentation file is also simplified.
In one possible implementation, the method further includes: acquiring position information of the plurality of presentation information blocks; and the matching the first voice text with the presentation information corresponding to the plurality of presentation information blocks includes: determining candidate presentation information blocks according to the position information of the presentation information blocks, the candidate presentation information blocks being one or more of the plurality of presentation information blocks; and matching the first voice text with the presentation information corresponding to the candidate presentation information blocks. In general, a lecture proceeds from top to bottom and from left to right, so determining candidate presentation information blocks according to position information helps simplify the matching process and shorten the matching time.
In one possible implementation, the method further includes: determining candidate presentation information blocks according to the first presentation information block, the candidate presentation information blocks being one or more of the plurality of presentation information blocks; receiving a second voice of the user, and converting the second voice to obtain a second voice text; matching the second voice text with the presentation information corresponding to the candidate presentation information blocks, and determining a second presentation information block matching the second voice text; and highlighting the second presentation information block. After the first voice is received, the second voice can be received, and the second presentation information block matching it can be determined and highlighted; determining the candidate blocks according to the first presentation information block helps determine the second block quickly, which simplifies the matching process and shortens the matching time.
In one possible implementation, the method further includes: acquiring position information of the plurality of presentation information blocks; and the determining candidate presentation information blocks according to the first presentation information block includes: determining the candidate presentation information blocks according to the position information of the first presentation information block and the position information of the other presentation information blocks, the other presentation information blocks being the blocks other than the first presentation information block among the plurality of presentation information blocks. Determining candidates according to position information helps determine them accurately and simplifies the matching process.
In one possible implementation, the determining candidate presentation information blocks according to the first presentation information block includes: determining the candidate presentation information blocks according to the identification information of the first presentation information block and the identification information in the other presentation information blocks, the other presentation information blocks being the blocks other than the first presentation information block among the plurality of presentation information blocks. Determining candidates according to identification information helps determine them accurately and simplifies the matching process.
In one possible implementation, the method further includes: acquiring position information of the plurality of presentation information blocks in the presentation file; and determining the presentation order of the blocks according to the position information of each block and a preset rule; the determining candidate presentation information blocks according to the first presentation information block includes: taking the presentation information block following the first presentation information block in the presentation order as the candidate presentation information block. In this implementation, an association rule between the position information of the blocks and the lecture order, for example "from top to bottom, left to right", can be preset so that the lecture order of the blocks is determined automatically; determining candidates according to the predicted presentation order helps determine them accurately and also avoids re-matching blocks that have already been matched successfully.
In one possible implementation, the acquiring presentation information in a plurality of presentation information blocks in the presentation file includes: acquiring the presentation information of the plurality of presentation information blocks in the presentation file, and determining a first vector corresponding to each presentation information block according to the presentation information; and the matching the first voice text with the presentation information corresponding to the plurality of presentation information blocks includes: determining a second vector corresponding to the first voice text, and matching the second vector with the first vector corresponding to each presentation information block. Matching the vector corresponding to the presentation information against the vector corresponding to the voice text improves the matching precision.
In one possible implementation, the receiving a voice of the user and converting it into a voice text includes: receiving the voice of the user, the voice including N batches, the ith batch being received before the (i+1)th batch, N being a positive integer and i being any integer from 1 to N-1; and recognizing and converting the voice to obtain N batches of voice texts. The determining the second vector corresponding to the voice text and matching it with the first vector corresponding to each candidate presentation information block includes: determining the second vector corresponding to the first j batches of voice texts, and determining the similarity between this second vector and the first vector corresponding to each candidate block; if some similarity satisfies a preset condition, determining that the candidate block corresponding to that first vector matches the voice; and if no similarity satisfies the preset condition, adding the (j+1)th batch of voice texts and matching again. Matching the voice batch by batch enables the matched presentation information block to be determined more accurately.
In one possible implementation, the similarity satisfying the preset condition includes: the similarity value is the highest among the candidates, and the value is larger than a preset threshold and/or the difference between the value and the other similarities is larger than a preset threshold.
In one possible implementation, the similarity is a cosine similarity.
In one possible implementation, the determining a first vector corresponding to each presentation information block includes: for each presentation information block, determining the weight of each participle in the block, and determining the first vector corresponding to the block according to the participles whose weights are larger than a preset threshold. Extracting the important information from the presentation information block and determining the vector according to the extracted information makes the matching result more accurate.
In one possible implementation, the determining a second vector corresponding to the first voice text includes: determining the weight of each participle in the first voice text, and determining the second vector corresponding to the first voice text according to the participles whose weights are larger than a preset threshold.
In one possible implementation, the weights of the participles are determined based on the term frequency-inverse document frequency (TF-IDF) algorithm.
In one possible implementation, the method further includes: filtering out the presentation information blocks that do not contain important information, i.e., the blocks in which the weight of every participle is smaller than a preset threshold, or the blocks whose accumulated participle weights are smaller than a preset threshold. This avoids highlighting presentation blocks that contain no important information (such as page numbers), and also helps reduce the number of matching operations and shorten the matching time. A sketch of such weighting and screening follows.
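The embodiments do not prescribe a concrete TF-IDF or screening implementation; the following Python sketch shows one possible reading, in which each presentation information block is treated as one document for the IDF statistics. Tokenization is assumed to have been done already, and the function names, the smoothing in the IDF term, and the min_total_weight threshold are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_weights(blocks):
    # One "document" per presentation information block: term frequency within
    # the block times inverse document frequency across all blocks of the file.
    # The +1 smoothing is an assumption, not specified by the application.
    n = len(blocks)
    df = Counter()
    for tokens in blocks:
        df.update(set(tokens))
    weights = []
    for tokens in blocks:
        tf = Counter(tokens)
        total = len(tokens) or 1
        weights.append({t: (c / total) * math.log((1 + n) / (1 + df[t]))
                        for t, c in tf.items()})
    return weights

def keep_informative(blocks, min_total_weight=0.5):
    # Screen out blocks whose accumulated participle weight stays below the
    # threshold (e.g. bare page numbers), so they are never matched or highlighted.
    ws = tfidf_weights(blocks)
    return [i for i, w in enumerate(ws) if sum(w.values()) >= min_total_weight]

blocks = [["scheme", "gist"], ["page"], ["page"], ["page"]]
print(keep_informative(blocks))  # [0]: the page-number-like blocks are screened out
```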
In one possible implementation, the method further includes: establishing an association between the first voice and the first presentation information block, the association being used to: display the first presentation information block according to the association when the first voice is played again; or play the first voice according to the association when the first presentation information block is presented. After the lecture, the established association simplifies the production of presentation courseware; alternatively, when lecturing again on the content of the presentation file, the corresponding information can be highlighted directly according to the association without matching again.
In one possible implementation, the method further includes: generating summary information of the presentation file according to the first voice. This allows a user who obtains the presentation file to learn its content from the summary information, without opening the file and reading through it.
In a second aspect, the present application provides an apparatus for synchronizing speech information and presentation information, the apparatus comprising means for performing the method of the first aspect and any one of the possible designs of the first aspect; these modules/units may be implemented by hardware or by hardware executing corresponding software.
Illustratively, the apparatus may include: an obtaining module, configured to obtain presentation information in a plurality of presentation information blocks in a presentation file; a receiving module, configured to receive a first voice of a user and convert the first voice to obtain a first voice text; a matching module, configured to match the first voice text with the presentation information corresponding to the plurality of presentation information blocks and determine the first presentation information block successfully matched with the first voice text; and a display module, configured to highlight the first presentation information block.
In a third aspect, an embodiment of the present application provides an apparatus for synchronizing voice information and presentation information, including a processor, and a memory coupled to the processor; the processor is configured to execute the instructions or programs in the memory to perform the method for synchronizing the voice information and the presentation information according to the first aspect and any one of the implementation manners.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which computer-readable instructions are stored, and when the computer-readable instructions are executed on a computer, the method according to any one of the possible implementation manners of the first aspect is executed.
In a fifth aspect, embodiments of the present application provide a computer program product containing instructions that, when run on a computer, cause the method according to the first aspect and any one of the possible implementations to be performed.
Drawings
FIG. 1 is a schematic diagram of a slide animation effect provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart illustrating a method for synchronizing voice information and presentation information according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a slide and audio track information provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a voice sequence FIFO queue according to an embodiment of the present application;
FIG. 5 is a schematic diagram of slide content provided by an embodiment of the present application;
FIG. 6 is a diagram illustrating the inclusion and overlap relationship between presentation information blocks according to an embodiment of the present application;
FIG. 7 is a schematic diagram of batch-by-batch matching provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a device for synchronizing voice information and presentation information according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a device for synchronizing voice information and presentation information according to an embodiment of the present application.
Detailed Description
A slide show, also called a presentation or briefing, is a playable file composed of text, pictures, and other elements, to which special dynamic display effects can be added.
When making slides for the lecture content, the presenter often uses animation effects to represent a dynamic process. For example, a separate border may be overlaid on a piece of text to highlight it, or a series of text boxes may be displayed in sequence to convey their order. The actions (display, disappearance, movement, etc.) of the elements (or "objects") in these animations can be triggered by events (e.g., keyboard or mouse actions, timers, etc.).
For example, suppose a slide has three important elements A, B, and C to be introduced with emphasis, as shown in (a) in FIG. 1, and the user wants to highlight text boxes A, B, and C in sequence during the explanation. Generally, the user needs to configure the highlight animation effect of each text box and set the order of the animation effects; furthermore, a trigger condition (for example, a mouse click) may need to be set separately for each text box's animation. When the user presents this page, the picture shown in (a) in FIG. 1 is displayed first; when the user clicks the mouse, text box A is highlighted, as shown in (b) in FIG. 1; when the user clicks again, text box A is no longer highlighted and text box B is highlighted, as shown in (c) in FIG. 1; when the user clicks once more, text box B is de-highlighted and text box C is highlighted, as shown in (d) in FIG. 1.
This highlighting scheme helps the listener focus on the current main information while minimizing the influence of the other contents on the screen, improving the effect of the lecture; highlight settings are therefore widely used in the production of presentation files. However, the user has to spend a lot of time manually designing all the details of the animations, which is cumbersome, and has to operate a mouse, keyboard, or the like to trigger the animations during the presentation.
To simplify the user's operations, the embodiments of the present application provide a method and an apparatus for synchronizing voice information and presentation information, which synchronously highlight the presentation information block matched with the current lecture content according to the speaker's voice, so that the user neither needs to spend a large amount of time animating the presentation file nor needs to indicate the current lecture content through body language or teaching aids during the lecture.
The method for synchronizing voice information and presentation information provided by the embodiments of the application can be implemented by a computer or another device with an information processing function; the following takes a computer as an example. A computer or other device can implement the function of synchronizing voice information with presentation information by running program code that implements the method. For example, an application program implementing the method may be installed on a computer; when the computer runs the application, the synchronization function is realized. Optionally, the computer or other device may include or be connected to a microphone to receive speech.
Referring to fig. 2, a flow chart of a method for synchronizing voice information and presentation information according to an embodiment of the present application is schematically shown, where as shown in the drawing, the method may include the following steps:
step 201, obtaining the demonstration information in a plurality of demonstration information blocks in the demonstration file.
A user first inputs a presentation file; the presentation file may be in a format such as PPT, PDF, XML, or Word, and the format of the presentation file is not limited in this embodiment of the application.
In one possible implementation, the presentation file input by the user may include elements such as text boxes, pictures, tables, and formulas; each element may be regarded as one presentation information block, and the presentation information in each block is obtained. For example, if the presentation file includes a plurality of text boxes and a plurality of pictures, each text box and each picture may be regarded as a presentation information block, and the text information in each text box and each picture is obtained.
Further, the presentation information in a presentation information block may be text information, formula information, or other information. For example, the main information of a picture is image information rather than text information; in this case, the image information (such as pixel information and picture type information) may be taken as the acquired presentation information; or the content of the image may be recognized through an image recognition technique and converted into text information, which is taken as the acquired presentation information; or the image information and the text information may be taken together as the acquired presentation information.
In another possible implementation, the presentation file input by the user may include a text box or chart that is too long. In that case, such an element may be divided into a plurality of presentation information blocks, and the information in each block is obtained. For example, if a slide includes only one text box containing a large amount of text, the text may be divided into a plurality of presentation information blocks each containing part of the text, for example by paragraphs or by sentences. As another example, the text of a text box may be divided into one or more levels according to its format. As shown in FIG. 3, text box 1 contains only first-level content; text box 2 contains two levels, the first level being the "scheme gist" and the hanging-indented main points one, two, and three being the second level; text box 3 likewise contains two levels, the first level being the "scheme cost" and the indented "cost one", "cost two", and "cost three" being the second level. In such cases, text box 2 and/or text box 3 may be divided.
The two implementations above can also be combined. For example, if the number of characters contained in an element exceeds a preset character threshold, the element is divided into two or more presentation information blocks; if not, the element is used as one presentation information block. Alternatively, whether to divide an element may be decided based on its size or other attribute values. A sketch of such a division rule follows.
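The application leaves the concrete division rule open; the following Python sketch illustrates one possible rule under stated assumptions: the 200-character threshold, the paragraph-then-sentence fallback, and the function name are all hypothetical.

```python
CHAR_THRESHOLD = 200  # hypothetical cut-off; the embodiments leave the value open

def element_to_blocks(text):
    # A short element stays one presentation information block; an overly long
    # one is divided by paragraphs, falling back to sentence boundaries.
    if len(text) <= CHAR_THRESHOLD:
        return [text]
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    if len(paragraphs) > 1:
        return paragraphs
    # No paragraph breaks: split on sentence ends (Chinese or Western full stops).
    return [s.strip() for s in text.replace("。", ".").split(".") if s.strip()]
```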
Step 202, receiving a first voice of a user, and converting the first voice to obtain a first voice text.
Specifically, the received first voice may be voice received through a voice input device such as a microphone, that is, real-time voice received while the user is speaking; or the received voice content may come from an audio file, such as one recorded by the user, to support the subsequent production of lecture content.
After receiving the first voice, the first voice is recognized and converted into a first voice text.
Speech recognition is used in more and more scenarios as a means of human-computer interaction. Speech is a one-dimensional physical signal that is collected into a computer through analog-to-digital conversion. The digitized speech signal is sent to a speech recognition engine, which typically returns text in segments, such as a phrase or a sentence, to maintain good readability; factors that affect segment boundaries typically include silence and semantic termination. Major speech recognition providers use deep learning techniques and ever larger amounts of data, thus achieving low word error rates.
Step 203, matching the first voice text with the presentation information corresponding to the plurality of presentation information blocks, and determining the first presentation information block successfully matched with the first voice text.
The first voice text is matched with the presentation information to determine whether there is a presentation information block matching the first voice text; if there is, the successfully matched block is taken as the first presentation information block. For example, the first presentation information block may be the presentation information block corresponding to the speaker's current lecture content.
Step 204, highlighting the first presentation information block.
Highlighting the first presentation information block makes it easier for the audience to determine the current lecture content and to focus on the current main information without being distracted by the other contents on the screen, improving the lecture effect. Highlighting may include highlight coloring, adding a background, changing border colors, shaking, popping out, and the like, all of which make the first presentation information block attract the listeners' attention more easily.
In this embodiment of the application, the presentation information of each presentation information block in the presentation file is acquired; after the speaker's voice is received, it is recognized to obtain a voice text; the voice text is matched with the presentation information to determine the presentation information block corresponding to the speaker's current lecture content, and that block is highlighted, so that the audience can easily determine what the speaker is currently explaining. For example, a page of slides may be as shown in FIG. 3, including four text boxes, and the audio track of the user's voice may be as shown in the lower part of FIG. 3. According to the user's voice, it is determined that the first batch of speech corresponds to the title content, i.e., text box 1, which is then highlighted; it is then determined that the second batch corresponds to the scheme gist, so text box 2 may be highlighted. At the same time, the speaker's burden is greatly reduced: the current lecture content does not need to be indicated through body language or teaching aids, and the speaker's production of the presentation file is also simplified.
Since the speaker may pause during the lecture, or pause briefly after finishing a sentence, the speech can be divided into batches; as shown in FIG. 3, a speech segment is divided into 4 batches. During the lecture, a speech text sequence S = {s1, s2, …, sT} can be obtained through speech recognition, where any element si represents the speech text of one batch of speech; each batch si can be a word sequence of length Ki, i.e., si = {ti1, ti2, …, tiKi}, where each t is a word or a character.
The batched speech text sequence S can be generated by a speech recognition engine; the length and content of S are related to the speaking style. The sequence S is generated continuously, one batch si at a time, and can therefore be managed with a first-in-first-out (FIFO) queue, so that the entire history does not have to be maintained, as shown in FIG. 4.
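A minimal Python sketch of such a FIFO queue follows; the class name and the window size of 8 batches are assumptions, since the embodiments only state that the whole history need not be kept.

```python
from collections import deque

class SpeechTextQueue:
    """First-in-first-out management of the batched speech texts si,
    so the whole history S need not be kept (cf. FIG. 4)."""

    def __init__(self, maxlen=8):          # window size is an assumption
        self._batches = deque(maxlen=maxlen)

    def push(self, batch):
        self._batches.append(batch)        # the oldest batch drops out automatically

    def window(self):
        # Concatenation of the word sequences still held in the queue.
        return [token for batch in self._batches for token in batch]

q = SpeechTextQueue(maxlen=2)
q.push(["the", "scheme", "gist"])
q.push(["cost", "one"])
q.push(["cost", "two"])                    # evicts the first batch
print(q.window())                          # ['cost', 'one', 'cost', 'two']
```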
In one possible case, a plurality of batches si may correspond to the content of one presentation information block. Moreover, the lecturing process has a certain randomness: although the lecturer prepares the presentation file in advance and puts the main contents into it, he or she may also talk about contents that are not in the presentation file. FIG. 3 illustrates the case where the first batch of speech is determined to correspond to text box 1, but other cases are possible: if the first batch of speech does not successfully match any of the plurality of text boxes, the first and second batches together are matched with the text information of the plurality of text boxes; if they successfully match text box 1, the first and second batches of speech may be taken as the first voice and text box 1 as the first presentation information block. If the second batch still fails to match, the subsequently received speech continues to be added for matching.
Further, in addition to the text information of a presentation information block, the position information of the block may be acquired, for example its coordinate information on the current page (e.g., vertex coordinates). The position information of each presentation information block may be acquired together with the presentation information in step 201, or before or after step 201. When performing step 203, candidate presentation information blocks may be determined according to the position information of the presentation information blocks, so that the first voice text is matched with the presentation information corresponding to the candidate blocks. The candidate presentation information blocks are one or more of the plurality of presentation information blocks acquired in step 201.
In general, if a page includes a plurality of presentation information blocks, the lecture usually proceeds from top to bottom and from left to right. For example, in the slide shown in FIG. 3, text box 1 is at the top of the page, so the content in text box 1 is very likely to be spoken first; therefore, when this slide is displayed, text box 1 may be set as the candidate presentation information block first. As another example, in FIG. 5, text box 1.1 and text box 2.1 are both located at the top of the page, but text box 1.1 is to the left of text box 2.1; by the top-to-bottom, left-to-right convention, the content in text box 1.1 is most likely to be spoken first, so text box 1.1 may be set as the candidate presentation information block first (a sketch of this ordering follows). Determining candidate presentation information blocks according to position information thus helps simplify the matching process and shorten the matching time.
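A minimal Python sketch of this "top to bottom, left to right" ordering follows, assuming each block exposes its top-left corner (x1, y1) as in Table 1 below; the names and coordinates are illustrative.

```python
def positional_order(blocks):
    # blocks: block id -> (x1, y1) top-left corner. The usual lecture convention
    # is top-to-bottom first, then left-to-right, so sort by (y1, x1).
    return sorted(blocks, key=lambda b: (blocks[b][1], blocks[b][0]))

layout = {"title": (40, 10), "left": (10, 60), "right": (160, 60)}
print(positional_order(layout))   # ['title', 'left', 'right']
```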
In addition, containment or overlap between presentation information blocks can be determined from their position information. For example, in the case shown in (a) in FIG. 6, text box B and text box C are completely inside text box A; by normal lecturing habit, the speaker will probably explain the content of text box A first and then the contents of text boxes B and C respectively. As another example, in the case shown in (b) in FIG. 6, text box A is at the top, text box B is below text box A and overlaps it, and text box C is below text box B and overlaps it; by the usual practice, the speaker may explain text box A first, then text box B, and finally text box C.
It should be understood that the "top to bottom, left to right" candidate selection policy is only an example; in actual applications, candidate selection policies can be set according to different scenario requirements, and a policy may yield one candidate presentation information block or several, which is not limited in this embodiment of the application.
During the lecture, the speaker's voice is output continuously. After the first voice is received and the first presentation information block matching it is determined, a second voice can be obtained, recognized, and converted into a second voice text. The second voice text is then matched with the presentation information of the presentation information blocks to determine the second presentation information block matching the second voice, which is then highlighted. As described above, the first voice may be received in real time; correspondingly, the second voice may also be acquired in real time through a voice input device such as a microphone, after the first voice is received. Alternatively, the first voice may come from a recorded audio file; correspondingly, the second voice may also come from the audio file, acquired after the first voice.
Optionally, when the second presentation information block is highlighted, the highlighting of the first presentation information block may be canceled, so that only one presentation information block is highlighted at a time and the audience can quickly focus on the lecturer's current content.
To simplify the matching process and shorten the matching time, after the first presentation information block is determined, the candidate presentation information blocks can be determined from it, so that the second voice text corresponding to the second voice is matched with the presentation information of the candidate blocks and the second presentation information block corresponding to the second voice can be determined quickly. The candidate presentation information blocks determined from the first presentation information block are one or more of the plurality of presentation information blocks acquired in step 201, other than the first presentation information block.
It may also be configured whether a presentation information block that has already been matched successfully is allowed to become a candidate again. Because the lecture process has a certain randomness, content A may already have been explained (i.e., presentation information block A has been highlighted), and while content B is being explained (presentation information block B is highlighted), the lecturer may find that content A needs supplementary explanation and describe it again; whether presentation information block A can be highlighted repeatedly is then decided according to the preset candidate setting.
Similarly, when determining candidate presentation information blocks from the first presentation information block, the acquired position information of the first block and of the other blocks may be used. As shown in FIG. 5, after text box 1.1 is matched according to the first voice, the position information of the text boxes shows that text box 1.1.1 is diagonally below text box 1.1 and closest to it, text box 1.2 is directly below it, and text box 2.1 is directly to its right. From the position information it can be presumed that the contents of text box 1.1.1, text box 1.2, and text box 2.1 are all likely to be presented next, so all three may be set as candidate presentation information blocks. The second voice text corresponding to the second voice is then matched with the presentation information of these candidates to determine the second presentation information block.
In one possible design, candidate presentation information blocks may also be determined according to the identification information in each presentation information block. Still taking FIG. 5 as an example, each text box on the slide page contains a sequence-number identifier, such as "1.1", "1.1.1", "2.1", or "3.1". These sequence numbers usually imply the lecture order of the corresponding content; for example, a common order is "1.1" → "1.1.1" → "1.1.2" → "1.2" → "1.3" → "2.1" → "2.2" → "3.1" → "3.2". Alternatively, the speaker may first present the headings such as "1.1", "2.1", and "3.1", and then expand the specific details such as "1.1.1" and "1.1.2". Other common identification information that can imply an order includes "first", "second", "third", "firstly", "secondly", "again", and so on.
Of course, the position information of the presentation information blocks may be combined with the identification information in them to determine candidates. For example, it is possible that only the blocks containing headings include identification information, as with "1.1", "2.1", and "3.1" in FIG. 5, while the other blocks do not; in that case, if block "1.1" is successfully matched and highlighted, then when candidates are determined, both the block located below block "1.1" and closest to it and block "2.1" may be taken as candidate presentation information blocks.
Furthermore, the presentation order of the blocks can be determined according to their position information and/or identification information together with a preset rule, and the candidates can then be determined according to that order. For example, the preset rule may be "from top to bottom, left to right"; or it may be: determine the lecture order of the blocks containing identification information first, and determine the order of the blocks without identification information from their position information. Still taking FIG. 5 as an example, the presentation order of the text boxes is determined from the position and identification information as: "1.1" → "1.1.1" → "1.1.2" → "1.2" → "1.3" → "2.1" → "2.2" → "3.1" → "3.2". When this slide page is switched to, the first text box to be highlighted is determined from the presentation order to be "1.1", so text box "1.1" is taken as the candidate presentation information block; the voice text corresponding to the received first batch of speech is matched with the presentation information of text box "1.1", and if the match succeeds, text box "1.1" is highlighted. Then, following the presentation order, the next text box "1.1.1" is taken as the candidate; the voice text corresponding to the received second batch of speech is matched with its presentation information, and if the match succeeds, text box "1.1.1" is highlighted and text box "1.1.2" becomes the candidate; otherwise the subsequently received speech continues to be matched with text box "1.1.1". A sketch of deriving a lecture order from such identifiers follows.
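The embodiments do not fix how identifiers are parsed; the following Python sketch illustrates one possible reading, in which a leading sequence number determines the order and the regular expression and the fallback for blocks without identifiers are assumptions (in practice such blocks would fall back to positional order).

```python
import re

def identifier_order(block_texts):
    # block_texts: block id -> text. A leading sequence number such as "1.1" or
    # "1.1.2" implies the lecture order; blocks without one sort last here.
    def key(item):
        m = re.match(r"\s*(\d+(?:\.\d+)*)", item[1])
        return tuple(map(int, m.group(1).split("."))) if m else (float("inf"),)
    return [bid for bid, _ in sorted(block_texts.items(), key=key)]

boxes = {"a": "1.1 scheme gist", "b": "2.1 scheme cost", "c": "1.1.1 gist detail"}
print(identifier_order(boxes))    # ['a', 'c', 'b']
```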
In a specific embodiment, a set V of presentation information blocks and a set E of pointing relations between presentation information blocks may be established for each page in the presentation file.
The set of presentation information blocks is V = {v1, v2, …, vn}, where vi corresponds to one presentation information block, i ∈ {1, …, n}, and each vi may correspond to an information table as shown in Table 1.
TABLE 1

Attribute name          Attribute value
Type                    Text, picture, shape, formula, table, or the like
Top-left position       (x1, y1)
Bottom-right position   (x2, y2)
Text content            Text information
Sub-items               Pointers to other contained presentation information blocks
The text content may be all the text information in the presentation information block, a participle (word segmentation) sequence, or a corresponding vector; the sub-items capture the situations shown in FIG. 6, i.e., the containment or overlap relations between this block and other presentation information blocks.
It should be understood that Table 1 is merely illustrative; in actual applications, the information table corresponding to each vi may include more or less attribute information than Table 1.
In the pointing set E, an element eij indicates that presentation information block vj is pointed to from presentation information block vi; that is, if block vi is currently highlighted, block vj may be considered a candidate presentation information block.
After the set V and the set E are established, the candidate presentation information blocks or the presentation order can be determined according to the set E, and the current lecture content can be determined according to the received voice and the set V. A sketch of both structures follows.
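A minimal Python sketch of the sets V and E follows, mirroring the Table 1 attributes; the field names and types are illustrative rather than prescribed by the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class InfoBlock:
    """One element vi of the set V, mirroring the attributes of Table 1."""
    kind: str                              # text, picture, shape, formula, table, ...
    top_left: Tuple[int, int]              # (x1, y1)
    bottom_right: Tuple[int, int]          # (x2, y2)
    text: str = ""                         # text, participle sequence, or vector
    children: List[str] = field(default_factory=list)  # contained blocks (FIG. 6)

# E as an adjacency map: E[vi] lists the blocks vj that become candidates
# once vi has been matched and highlighted.
V: Dict[str, InfoBlock] = {
    "v1": InfoBlock("text", (10, 10), (300, 40), "1.1 scheme gist"),
    "v2": InfoBlock("text", (10, 60), (300, 120), "1.1.1 gist detail"),
}
E: Dict[str, List[str]] = {"v1": ["v2"]}
```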
In one possible implementation, when step 201 is performed, the first vector corresponding to each presentation information block may be determined according to the acquired presentation information of each block; after the first voice text is obtained in step 202, the second vector corresponding to the first voice text may be determined. Accordingly, when step 203 is performed, the second vector is matched with the first vectors corresponding to the presentation information blocks, so as to determine the presentation information block matching the first voice.
For example, if a presentation information block is a text box and the acquired presentation information is the text information in the text box, a corresponding word vector can be determined; if the block is a picture and the acquired presentation information is image information, a vector of the picture is generated from the image information.
When the second vector is matched with the first vector, whether matching is successful or not can be determined according to the similarity between the first vector and the second vector.
In one possible design, the similarity between the first vector and the second vector may be determined according to formula (1):

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

where A represents the first vector, B represents the second vector, and J(A, B) represents the similarity between the two. Computing the similarity according to formula (1) is simple and fast.
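Formula (1) reads as the Jaccard index, taking A and B as the sets (or binary vectors) of words in the presentation information block and in the voice text; a minimal Python sketch with hypothetical example tokens follows.

```python
def jaccard(a, b):
    # Formula (1): J(A, B) = |A ∩ B| / |A ∪ B| over two token sets.
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

block_tokens = ["scheme", "cost", "one"]
speech_tokens = ["the", "scheme", "cost"]
print(jaccard(block_tokens, speech_tokens))   # 2 shared / 4 total = 0.5
```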
In another possible design, the first vector and the second vector may be constructed based on word embedding. Word embedding is a method of representing vocabulary semantics with numeric vectors; it captures word semantics through the context of the modeled words, is accurate and flexible, and has become a standard approach for many natural language tasks. Word embeddings can be extracted at various granularities, such as characters, words, sentences, and documents.
Specifically, the similarity between the first vector and the second vector can also be determined according to equation (2):

cos θ = (Emb_v · Emb_s) / (‖Emb_v‖ × ‖Emb_s‖) (2)

where Emb_v represents the first vector constructed based on word embedding, Emb_s represents the second vector constructed based on word embedding, and cos θ represents the cosine similarity of the first vector and the second vector.
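The cosine of equation (2) can be computed directly from the two embeddings; a minimal sketch follows, with toy placeholder vectors standing in for embeddings produced by whatever word-embedding model is used.

import math

def cosine(emb_v, emb_s):
    # cos θ = (Emb_v · Emb_s) / (‖Emb_v‖ × ‖Emb_s‖); 0.0 if either norm is zero.
    dot = sum(x * y for x, y in zip(emb_v, emb_s))
    norm_v = math.sqrt(sum(x * x for x in emb_v))
    norm_s = math.sqrt(sum(y * y for y in emb_s))
    return dot / (norm_v * norm_s) if norm_v and norm_s else 0.0

print(cosine([0.2, 0.7, 0.1], [0.3, 0.6, 0.0]))   # ≈ 0.97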
That the second vector of the first voice is successfully matched with the first vector of the first presentation information block may mean either of the following: the similarity between the second vector and the first vector of the first presentation information block is the highest and is greater than or equal to a preset threshold; or that similarity is the highest and its difference from the similarities between the second vector and the first vectors of the other presentation information blocks is greater than or equal to a preset threshold.
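The two success criteria can be combined in a small decision helper; the sketch below assumes illustrative threshold values (0.5 absolute, 0.2 margin) rather than values fixed by the embodiment.

def match_block(similarities, abs_threshold=0.5, margin=0.2):
    # similarities: block_id -> similarity between the first voice's second
    # vector and that block's first vector.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best >= abs_threshold or (best - runner_up) >= margin:
        return best_id            # matching succeeded
    return None                   # no presentation information block matched

print(match_block({1: 0.55, 2: 0.15, 3: 0.10}))   # -> 1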
As described above, the received speech may arrive in batches, and during matching the batches may be matched in sequence to determine the presentation information block corresponding to the received speech. In a possible implementation manner, the received speech includes N batches, where the ith batch is received before the (i+1)th batch, N is a positive integer, i is any integer from 1 to N-1, and N batches of speech text corresponding to the N batches of speech are determined accordingly. When step 203 is executed, the second vector corresponding to the 1st batch of speech text is first matched with the first vector of each candidate presentation information block, and the vector similarities are computed to determine whether any candidate presentation information block matches successfully. If not, the second vector corresponding to the 1st and 2nd batches of speech text together is matched with the first vector of each candidate presentation information block in the same way. If that also fails, the second vector corresponding to the 1st, 2nd, and 3rd batches together is matched. If matching still fails, the next batch of speech text is added and matching is repeated, until matching succeeds.
This is illustrated in connection with fig. 7. Suppose there are currently 3 candidate presentation information blocks: candidate block one, candidate block two, and candidate block three. The second vector corresponding to the received first batch of speech (T1) is matched with the first vectors of the 3 candidate blocks, yielding similarities of 0.1, 0.1, and 0.1; since none of the 3 similarities exceeds the threshold 0.5, matching continues. The second vector corresponding to the first and second batches of speech (T1+T2) is then matched with the first vectors of the 3 candidate blocks, yielding similarities of 0.2, 0.15, and 0.1; again none exceeds 0.5, so matching continues. The second vector corresponding to T1+T2+T3 yields similarities of 0.3, 0.15, and 0.1, still all below 0.5, so matching continues. Finally, the second vector corresponding to T1+T2+T3+T4 yields similarities of 0.55, 0.15, and 0.1; the similarity with candidate block one exceeds the preset threshold, so the match with candidate block one succeeds.
In addition, a speaker does not necessarily speak exactly according to the contents of the presentation file: the speaker may say something thought of on the spot, or may expand considerably on the contents of the presentation file. In such cases the voice information over a period of time may be unrelated to the presentation file; if matching continues over ever-growing batches, the matching result may be affected and the computational load increases. In some embodiments, a time threshold or a batch threshold may therefore be set, and voice information beyond that threshold is no longer included in matching. For example, if the batch threshold is set to 5 and the speech texts corresponding to 5 batches have not matched successfully, the 1st batch can be discarded when the 6th batch is added, so that matching proceeds over batches 2 to 6.
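The batch-wise accumulation and the batch threshold can be captured in a few lines; in the sketch below the embed and cosine helpers are assumed to exist (cosine as sketched after equation (2)), and the window size 5 mirrors the example batch threshold.

from collections import deque

def incremental_match(batches, candidate_vectors, embed, cosine,
                      threshold=0.5, max_batches=5):
    # batches arrive in order T1, T2, ...; deque(maxlen=...) discards the
    # oldest batch automatically once the batch threshold is reached.
    window = deque(maxlen=max_batches)
    for batch_text in batches:
        window.append(batch_text)
        second_vector = embed(" ".join(window))
        scores = {block_id: cosine(second_vector, first_vector)
                  for block_id, first_vector in candidate_vectors.items()}
        best_id = max(scores, key=scores.get)
        if scores[best_id] >= threshold:
            return best_id        # matched: highlight this block
    return None                   # the speech never matched a candidate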
Further, important information may be extracted from each presentation information block, and the first vector corresponding to the block determined from the extracted important information.
Likewise, important information can be extracted from the speech text, and the second vector corresponding to the speech text determined from the extracted important information. As described above, the voice information may be received in real time, in which case the requirements on processing speed may be high, so whether to extract important information from the speech text can be decided according to the required processing speed and the processing capability of the computer. Alternatively, the voice information may come from a recorded audio file, i.e. an offline processing scene; in that case important information can be extracted from the speech text to obtain the second vector, which improves matching accuracy.
Taking the obtained presentation information as text information as an example, the extraction of important information can be based on the term frequency-inverse document frequency (TF-IDF) algorithm. The principle of TF-IDF is: on one hand, for a word in a document d, the more frequently the word appears in d, the more important it is; on the other hand, the more frequently the word appears across all documents D, the less important, i.e. the less distinctive, it is. When TF-IDF is used to evaluate the importance of presentation information, the presentation information of each presentation information block can serve as a document d in the standard TF-IDF algorithm, and the set of presentation information of all pages in the presentation file serves as the document set D. Alternatively, the document set D may include other documents besides the presentation file, for example a predetermined corpus or other user-added documents. In this way the weight of each word t in the presentation information of each presentation information block can be calculated, and the weight of a presentation information block can then be calculated cumulatively. For example, a vocabulary may be constructed for each page of the presentation file, or for the entire presentation file, by performing word segmentation on the text information in the presentation information blocks, and a weight is set for each word in the vocabulary according to TF-IDF. Correspondingly, during matching, if higher-weight participles match successfully, the probability that the first vector and the second vector match successfully is higher.
In one particular embodiment, the weight for each participle may be determined according to equation (3):
tfidf(t,d,D)=tf(t,d)×idf(t,D) (3)
where tfidf(t, d, D) represents the weight of the participle t; tf(t, d) represents the term frequency of the participle in document d; and idf(t, D) represents the inverse document frequency of the participle over the document set D.
Alternatively, the idf(t, D) above may be determined by equation (4):

idf(t, D) = log(N / |{d ∈ D : t ∈ d}|) (4)

where N = |D| represents the number of all documents in D, and |{d ∈ D : t ∈ d}| represents the number of documents that contain the participle t.
After the weight of each participle is determined, the first vector corresponding to a presentation information block can be determined from the participles whose weight is greater than or equal to a preset threshold.
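A compact TF-IDF pass over the presentation information blocks might look as follows; each block's participle list is a document d, the whole file is the document set D, and whitespace tokenization is a simplifying assumption.

import math
from collections import Counter

def tfidf_weights(blocks):
    # blocks: one token list per presentation information block.
    n_docs = len(blocks)                                    # N = |D|
    doc_freq = Counter(t for tokens in blocks for t in set(tokens))
    weights = []
    for tokens in blocks:
        tf = Counter(tokens)
        total = len(tokens) or 1
        weights.append({t: (tf[t] / total)                  # tf(t, d)
                           * math.log(n_docs / doc_freq[t]) # idf(t, D)
                        for t in tf})
    return weights

blocks = [["revenue", "growth", "revenue"], ["page", "1"]]
print(tfidf_weights(blocks)[0])   # weights for the first block's participles

The first vector can then be built from the participles whose weight clears the preset threshold.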
It should be noted that the weight of the presentation information block may be obtained by determining the weight of each participle through the TF-IDF algorithm and then accumulating, or may be configured in advance.
In one possible design, after important information is extracted from the presentation information blocks, the blocks that contain no important information may be screened out. For example, the edge of each slide page may carry a page number; the page number usually contains no important information and need not be highlighted, so the presentation information block holding the page number can be screened out, which avoids highlighting it and also helps reduce the number of matches and shorten matching time. Specifically, the weight of each presentation information block can be determined according to the TF-IDF algorithm, and blocks whose weight is below a preset weight threshold are removed as blocks containing no important information; alternatively, if the weight of every participle in a presentation information block is below a preset threshold, that block is treated as containing no important information and screened out.
In addition, unimportant presentation information blocks can be screened out according to their area. For example, the block holding a page number is usually small; by setting a threshold on the block area, or on the ratio of block area to page area, blocks that are too small can be screened out.
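Both screening rules fit in one helper; the sketch below reuses the PresentationBlock fields and TF-IDF weights sketched earlier, and the thresholds are illustrative.

def screen_blocks(blocks, weights, page_area,
                  min_weight=0.1, min_area_ratio=0.005):
    kept = []
    for block, w in zip(blocks, weights):
        (x1, y1), (x2, y2) = block.top_left, block.bottom_right
        area = abs(x2 - x1) * abs(y2 - y1)
        if sum(w.values()) < min_weight:
            continue              # no important information (e.g. a page number)
        if area / page_area < min_area_ratio:
            continue              # too small relative to the page
        kept.append(block)
    return kept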
In some embodiments, page-turn detection may also be performed during the lecture. The method for synchronizing voice information and presentation information can be performed for each presentation page; page-turn detection determines whether the current presentation page has changed, and hence whether to synchronize voice information and presentation information for the next presentation page. In one embodiment, the computer may capture the current picture frame Fj at a specified time interval. If g(Fj-1, Fj) > ε, the current picture frame Fj differs greatly from the previous picture frame Fj-1, and a page turn is considered to have occurred; here g(Fj-1, Fj) is a difference function comparing Fj and Fj-1, and ε is a preset threshold. The difference function g(Fj-1, Fj) may compute the differences between Fj and Fj-1 pixel by pixel and then an overall difference value. Alternatively, for efficiency, the picture frames may be compared block by block.
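A frame-differencing test of this kind is a few lines of numpy; the sketch below assumes grayscale frames of identical shape captured at the specified interval, and an illustrative threshold ε.

import numpy as np

def page_turned(prev_frame, cur_frame, eps=25.0):
    # g(Fj-1, Fj): mean absolute pixel difference; page turn if it exceeds ε.
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float(diff.mean()) > eps

def page_turned_by_blocks(prev_frame, cur_frame, eps=25.0, tile=16):
    # Block comparison for efficiency: average each tile, then compare tiles.
    h, w = prev_frame.shape
    ph, pw = (h // tile) * tile, (w // tile) * tile
    def tiles(f):
        return (f[:ph, :pw].astype(float)
                .reshape(ph // tile, tile, pw // tile, tile)
                .mean(axis=(1, 3)))
    return float(np.abs(tiles(cur_frame) - tiles(prev_frame)).mean()) > eps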
In other embodiments, the first speech text may be matched globally, that is, against the presentation information of the entire presentation file; or it may be matched against the presentation information blocks of the current page together with the next page or other pages. The presentation page and presentation information block matching what the speaker is currently saying are then determined from the matching result, and the presentation file is automatically turned to the next page or to another page.
In a possible implementation manner, after step 203 is executed, that is, after the first speech text of the first voice is determined to match the presentation information of the first presentation information block, an association relationship between the first voice and the first presentation information block may also be established. The association is used as follows: when the first voice is played again, the first presentation information block can be highlighted according to the association; or, when the first presentation information block is highlighted, the first voice can be played according to the association. For example, after a lecture, related courseware can be produced from the recorded audio file of the lecture, the presentation file, and the established associations.
Similarly, after the second speech text of the second voice is determined to match the presentation information of the second presentation information block, an association relationship between the second voice and the second presentation information block may also be established.
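One possible record format for these associations, with illustrative field names, is sketched below: each record ties a stretch of the recorded audio to the presentation information block it matched, so that replay can highlight the block and selecting the block can play the audio.

from dataclasses import dataclass

@dataclass
class Association:
    audio_start_s: float    # offset of the matched voice in the recording
    audio_end_s: float
    page_index: int         # presentation page containing the block
    block_id: int           # matched presentation information block

associations = [Association(12.0, 19.5, page_index=3, block_id=1)]

def block_for_time(t):
    # During replay: which block should be highlighted at audio time t?
    for a in associations:
        if a.audio_start_s <= t <= a.audio_end_s:
            return (a.page_index, a.block_id)
    return None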
In addition, summary information of the presentation file can be generated from the first voice and/or the second voice, so that a user receiving the presentation file can learn its content from the summary without opening the file and reading through it.
Based on the same technical concept, an embodiment of the application further provides an apparatus for synchronizing voice information and presentation information, configured to implement the above method for synchronizing voice information and presentation information and any implementation manner thereof.
Fig. 8 is a schematic diagram illustrating a structure of an apparatus for synchronizing voice information with presentation information, where the apparatus may include: an acquisition module 801, a receiving module 802, a matching module 803, and a display module 804.
An obtaining module 801, configured to obtain presentation information in a plurality of presentation information blocks in a presentation file;
a receiving module 802, configured to receive a first voice of a user, and convert the first voice to obtain a first voice text;
a matching module 803, configured to match the first speech text with the presentation information corresponding to the multiple presentation information blocks, and determine a first presentation information block successfully matched with the first speech text;
a displaying module 804, configured to highlight the first presentation information block.
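How the four modules might cooperate at runtime is sketched below; the class and callable names are illustrative assumptions, not the structure of fig. 8 itself.

class SyncApparatus:
    def __init__(self, acquire, receive, match, display):
        # acquire/receive/match/display correspond to modules 801-804.
        self.acquire, self.receive = acquire, receive
        self.match, self.display = match, display

    def run_once(self, presentation_file, audio_chunk):
        blocks = self.acquire(presentation_file)     # obtaining module 801
        speech_text = self.receive(audio_chunk)      # receiving module 802
        matched = self.match(speech_text, blocks)    # matching module 803
        if matched is not None:
            self.display(matched)                    # displaying module 804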
In a possible implementation manner, the obtaining module 801 is further configured to: acquiring position information of the plurality of demonstration information blocks; the matching module 803 is specifically configured to: determining candidate demonstration information blocks according to the position information of the demonstration information blocks, wherein the candidate demonstration information blocks are one or more of the demonstration information blocks; and matching the first voice text with the demonstration information corresponding to the candidate demonstration information block.
In a possible implementation manner, the receiving module 802 is further configured to: receiving a second voice of a user, and converting the second voice to obtain a second voice text; the matching module 803 is further configured to: determining candidate demonstration information blocks according to the first demonstration information block, wherein the candidate demonstration information blocks are one or more of the plurality of demonstration information blocks; matching the second voice text with the demonstration information corresponding to the candidate demonstration information block, and determining a second demonstration information block matched with the second voice text; the display module 804 is further configured to: highlighting the second demonstration information block.
In a possible implementation manner, the matching module 803 is specifically configured to: determining candidate demonstration information blocks according to the first demonstration information block, wherein the candidate demonstration information blocks are one or more of the plurality of demonstration information blocks; and matching the first voice text with the demonstration information corresponding to the candidate demonstration information block.
In a possible implementation manner, the obtaining module 801 is further configured to: acquiring position information of the plurality of demonstration information blocks; when determining the candidate presentation information block according to the first presentation information block, the matching module 803 is specifically configured to: determining the candidate demonstration information block according to the position information of the first demonstration information block and the position information of other demonstration information blocks; the other presentation information blocks are presentation information blocks other than the first presentation information block among the plurality of presentation information blocks.
In a possible implementation manner, when determining a candidate presentation information block according to the first presentation information block, the matching module 803 is specifically configured to: determining the candidate demonstration information blocks according to the identification information of the first demonstration information block and the identification information in other demonstration information blocks; the other presentation information blocks are presentation information blocks other than the first presentation information block among the plurality of presentation information blocks.
In a possible implementation manner, the obtaining module 801 is further configured to: acquiring position information of a plurality of demonstration information blocks in the demonstration file; determining the demonstration sequence of the demonstration information blocks according to the position information of each demonstration information block and a preset rule; when determining the candidate presentation information block according to the first presentation information block, the matching module 803 is specifically configured to: and taking the next presentation information block of the first presentation information block as the candidate presentation information block according to the presentation sequence.
In a possible implementation manner, the obtaining module 801 is specifically configured to: acquiring the demonstration information of a plurality of demonstration information blocks in the demonstration file; determining a first vector corresponding to each demonstration information block according to the demonstration information; the obtaining module 801 is further configured to: determining a second vector corresponding to the first voice text; the matching module 803 is specifically configured to: matching the second vector with the first vector corresponding to each demonstration information block.
In a possible implementation manner, the obtaining module 801, when determining the first vector corresponding to each presentation information block, is specifically configured to: and determining the weight of each participle in the demonstration information block aiming at each demonstration information block, and determining a first vector corresponding to the demonstration information block according to the participle with the weight larger than a preset threshold value.
In a possible implementation manner, the apparatus further includes an establishing module (not shown in the figure) configured to establish an association relation between the first voice and the first presentation information block; the association relation is used for: when the first voice is played again, the first presentation information block can be highlighted according to the association relation; or, when the first presentation information block is highlighted, the first voice can be played according to the association relation.
In a possible implementation manner, the apparatus further includes a generating module (not shown in the figure) configured to generate summary information of the presentation file according to the first voice.
Based on the same technical concept, an embodiment of the application further provides a device for synchronizing voice information and presentation information, configured to implement the functions of the above method for synchronizing voice information and presentation information and any implementation manner thereof.
Fig. 9 is a schematic diagram illustrating a structure of a device for synchronizing voice information and presentation information, where the device may include: a processor 901, a memory 902 coupled to the processor 901; further, the device may also include a communication bus 903, a communication interface 904.
Specifically, the processor 901 may be a general-purpose CPU, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the present application.
The memory 902 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 902 may be a stand-alone memory, such as an off-chip memory, coupled to the processor 901 via the communication bus 903; the memory 902 may also be integrated with the processor 901.
The communication bus 903 may include a path that transfers information between the above components. The communication bus 903 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
A communication interface 904 for receiving information from an external device. For example, the device may receive voice information from a microphone through communication interface 904.
In a particular implementation, as an embodiment, the processor 901 may include one or more CPUs.
Specifically, the processor 901 can execute instructions or programs in the memory 902 to perform the following steps: acquiring demonstration information in a plurality of demonstration information blocks in a demonstration file; receiving a first voice of a user, and converting the first voice to obtain a first voice text; matching the first voice text with the demonstration information corresponding to the plurality of demonstration information blocks, and determining a first demonstration information block successfully matched with the first voice text; and highlighting the first demonstration information block.
In addition, the above device can also be used to perform the steps of the aforementioned method for synchronizing voice information and presentation information and any implementation manner thereof. For the beneficial effects, refer to the foregoing description; details are not repeated here.
Based on the same technical concept, embodiments of the present application further provide a computer-readable storage medium, in which computer-readable instructions are stored, and when the computer-readable instructions are executed on a computer, the method for synchronizing voice information and presentation information according to any one of the foregoing possible implementations is executed.
Embodiments of the present application provide a computer program product containing instructions that, when executed on a computer, cause the computer to perform embodiments of the above-described method for synchronizing speech information with presentation information.
It is to be understood that the terms "first," "second," "third," and the like in the description of the present application are used for descriptive purposes only and not for purposes of indicating or implying relative importance, nor order. Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (20)

1. A method for synchronizing speech information with presentation information, comprising:
acquiring demonstration information in a plurality of demonstration information blocks in a demonstration file;
receiving a first voice of a user, and converting the first voice to obtain a first voice text;
matching the first voice text with the demonstration information corresponding to the demonstration information blocks, and determining a first demonstration information block successfully matched with the first voice text;
highlighting the first demonstration information block.
2. The method of claim 1, further comprising:
acquiring position information of the plurality of demonstration information blocks;
the matching the first voice text with the presentation information corresponding to the plurality of presentation information blocks includes:
determining candidate demonstration information blocks according to the position information of the demonstration information blocks, wherein the candidate demonstration information blocks are one or more of the demonstration information blocks;
and matching the first voice text with the demonstration information corresponding to the candidate demonstration information block.
3. The method of claim 1, further comprising:
determining candidate demonstration information blocks according to the first demonstration information block, wherein the candidate demonstration information blocks are one or more of the plurality of demonstration information blocks;
receiving a second voice of a user, and converting the second voice to obtain a second voice text;
matching the second voice text with the demonstration information corresponding to the candidate demonstration information block, and determining a second demonstration information block matched with the second voice text;
highlighting the second demonstration information block.
4. The method of claim 3, further comprising:
acquiring position information of the plurality of demonstration information blocks;
the determining candidate presentation information blocks according to the first presentation information block includes:
determining the candidate demonstration information block according to the position information of the first demonstration information block and the position information of other demonstration information blocks; the other presentation information blocks are presentation information blocks other than the first presentation information block among the plurality of presentation information blocks.
5. The method of claim 3, wherein determining candidate presentation information blocks from the first presentation information block comprises:
determining the candidate demonstration information blocks according to the identification information of the first demonstration information block and the identification information in other demonstration information blocks; the other presentation information blocks are presentation information blocks other than the first presentation information block in the plurality of presentation information blocks.
6. The method according to any one of claims 1-5, wherein the obtaining the presentation information in a plurality of presentation information blocks in the presentation file comprises:
acquiring the demonstration information of a plurality of demonstration information blocks in the demonstration file; determining a first vector corresponding to each demonstration information block according to the demonstration information;
the matching the first voice text with the presentation information corresponding to the plurality of presentation information blocks includes:
determining a second vector corresponding to the first voice text; and matching the second vector with the first vector corresponding to each presentation information block.
7. The method of claim 6, wherein determining the first vector for each block of presentation information comprises:
and for each demonstration information block, determining a first vector corresponding to the demonstration information block according to the weight value of the participle in the demonstration information block.
8. The method according to any one of claims 1-7, further comprising:
establishing an association relation between the first voice and the first demonstration information block;
the association relation is used for: when the first voice is played again, the first demonstration information block can be highlighted according to the association relation; or, when the first demonstration information block is highlighted, the first voice can be played according to the association relation.
9. The method according to any one of claims 1-8, further comprising:
and generating summary information of the demonstration file according to the first voice.
10. An apparatus for synchronizing speech information with presentation information, comprising:
the acquisition module is used for acquiring the demonstration information in a plurality of demonstration information blocks in the demonstration file;
the receiving module is used for receiving a first voice of a user and converting the first voice to obtain a first voice text;
the matching module is used for matching the first voice text with the demonstration information corresponding to the demonstration information blocks and determining a first demonstration information block successfully matched with the first voice text;
and the display module is used for highlighting the first demonstration information block.
11. The apparatus of claim 10, wherein the obtaining module is further configured to:
acquiring position information of the plurality of demonstration information blocks;
the matching module is specifically configured to:
determining candidate demonstration information blocks according to the position information of the demonstration information blocks, wherein the candidate demonstration information blocks are one or more of the demonstration information blocks;
and matching the first voice text with the demonstration information corresponding to the candidate demonstration information block.
12. The apparatus of claim 10, wherein the receiving module is further configured to:
receiving a second voice of a user, and converting the second voice to obtain a second voice text;
the matching module is further configured to: determining candidate demonstration information blocks according to the first demonstration information block, wherein the candidate demonstration information blocks are one or more of the plurality of demonstration information blocks; matching the second voice text with the demonstration information corresponding to the candidate demonstration information block, and determining a second demonstration information block matched with the second voice text;
the display module is further configured to: highlighting the second demonstration information block.
13. The apparatus of claim 12, wherein the obtaining module is further configured to:
acquiring position information of the plurality of demonstration information blocks;
when determining the candidate presentation information block according to the first presentation information block, the matching module is specifically configured to:
determining the candidate demonstration information block according to the position information of the first demonstration information block and the position information of other demonstration information blocks; the other presentation information blocks are presentation information blocks other than the first presentation information block among the plurality of presentation information blocks.
14. The apparatus of claim 12, wherein the matching module, when determining the candidate presentation information block based on the first presentation information block, is specifically configured to:
determining the candidate demonstration information blocks according to the identification information of the first demonstration information block and the identification information in other demonstration information blocks; the other presentation information blocks are presentation information blocks other than the first presentation information block among the plurality of presentation information blocks.
15. The apparatus according to any one of claims 10 to 14, wherein the obtaining module is specifically configured to: acquiring the demonstration information of a plurality of demonstration information blocks in the demonstration file; determining a first vector corresponding to each demonstration information block according to the demonstration information;
the acquisition module is further configured to: determining a second vector corresponding to the first voice text;
the matching module is specifically configured to: matching the second vector with the first vector corresponding to each demonstration information block.
16. The apparatus of claim 15, wherein the obtaining module, when determining the first vector corresponding to each block of presentation information, is specifically configured to:
and determining the weight of each participle in the demonstration information block aiming at each demonstration information block, and determining a first vector corresponding to the demonstration information block according to the participle with the weight larger than a preset threshold value.
17. The apparatus according to any of claims 10-16, further comprising an establishing module configured to establish an association relation between the first voice and the first demonstration information block;
the association relation is used for: when the first voice is played again, the first demonstration information block can be highlighted according to the association relation; or, when the first demonstration information block is highlighted, the first voice can be played according to the association relation.
18. The apparatus according to any of claims 10-17, further comprising a generating module configured to generate summary information of the presentation file according to the first speech.
19. An apparatus for synchronizing speech information with presentation information, comprising a processor coupled to a memory; the processor is configured to execute instructions or programs in the memory to perform the method of synchronizing speech information and presentation information according to any one of claims 1-9.
20. A computer-readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method of synchronizing speech information with presentation information according to any one of claims 1-9.
