AU2021104873A4 - An audio-visual analysing system for automated presentation delivery feedback generation - Google Patents


Info

Publication number
AU2021104873A4
Authority
AU
Australia
Prior art keywords: audio, video, presentation, analysis, audio analysis
Prior art date: 2021-02-25
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2021104873A
Inventor
Gail Bower
Tim Kirkman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2021-02-25
Filing date: 2021-08-03
Publication date: 2021-09-30
Priority claimed from AU2021900520A0
Application filed by Individual
Application granted
Publication of AU2021104873A4
Ceased legal status
Anticipated expiration of legal status

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00 Teaching not covered by other main groups of this subclass
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/197 Matching; Classification
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B17/00 Teaching reading
    • G09B17/003 Teaching reading electrically operated apparatus or devices
    • G09B17/006 Teaching reading electrically operated apparatus or devices with audible presentation of the material to be studied
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Abstract

An audio-visual analysing system for automated presentation delivery feedback generation comprising a computing device having a: processor and memory device operably coupled thereto, the memory device comprising computer program code instructions and associated data fetched, interpreted and executed by the processor in use; a video interface interfacing a video camera; an audio interface interfacing a microphone; wherein the computer program code instructions comprise: a video analysis controller which analyses video captured by the video camera to generate a video analysis feedback score; and an audio analysis controller which analyses audio captured by the microphone to generate an audio analysis feedback score.

[Figure 1: block diagram of the computing device showing the profile, display, video and audio interfaces; the presentation structure and data; and the text-to-speech, video analysis, audio analysis and templating controllers coupled to the processor and memory.]

Description

[Figure 1, sheet 1/4: block diagram showing the display, video and audio interfaces; the profile, presentation structure and data; and the text-to-speech, video analysis, audio analysis and templating controllers coupled to the processor and memory.]
An audio-visual analysing system for automated presentation
delivery feedback generation
Field of the Invention
[0001] This invention relates generally to an audio-visual analysing system for automated presentation delivery feedback generation.
Summary of the Disclosure
[0002] There is provided herein an audio-visual analysing system for automated generation of presentation delivery feedback.
[0003] The system comprises a processor and memory device operably coupled thereto. The memory device comprises computer program code instructions and associated data which is fetched, interpreted and executed by the processor in use.
[0004] The computer device has a video interface interfacing a video camera and an audio interface interfacing a microphone.
[0005] The computer program code instructions comprise a video analysis controller which analyses video captured by the video camera to generate a video analysis feedback score. Furthermore, the computer program code instructions comprise an audio analysis controller which analyses audio captured by the microphone to generate an audio analysis feedback score.
[0006] Other aspects of the invention are also disclosed.
Brief Description of the Drawings
[0007] Notwithstanding any other forms which may fall within the scope of the present invention, preferred embodiments of the disclosure will now be described, by way of example only, with reference to the accompanying drawings in which:
[0008] Figure 1 shows an audio-visual analysing system for automated presentation delivery feedback generation in accordance with an embodiment;
[0009] Figure 2 shows exemplary processing by the system of Figure 1 in accordance with an embodiment;
[0010] Figure 3 shows exemplary video processing by the system of Figure 1 in
accordance with an embodiment; and
[0011] Figure 4 shows exemplary audio processing by the system of Figure 1 in
accordance with an embodiment.
Description of Embodiments
[0012] Figure 1 shows an audio-visual analysing system 100 for presentation delivery
feedback.
[0013] The system 100 comprises a computing device 101 having a processor 102 in
operable communication with a memory device 103 across a system bus 104.
[0014] The memory device 103 comprises computer program code instructions and
associated data which are fetched, interpreted and executed by the processor 102 in
use for implementing the functionality described herein.
[0015] The computing device 101 comprises a display interface 105 interfacing a
digital display device 106. The processor 102 controls the display interface 105 to
display a user interface 107 on the digital display 106 comprising digital information
108.
[0016] The computing device 101 further comprises a video interface 109 which
captures video from a video camera 110. The computing device 101 further comprises
an audio interface 111 which captures audio from a microphone 112.
[0017] The memory 103 may comprise data 113 comprising presentation templates
114 which are used to generate a presentation structure 115. The data 113 may
further store parameters 116 in relation to a presentation. The data 113 may further
store a profile 117 generated according to calculated video and audio analysis
feedback scores.
[0018] The memory 103 may further comprise controllers 118, comprising a templating controller 119 for generating the presentation structure 115 from the presentation templates 114.
[0019] The controllers 118 may further comprise an audio analysis controller 120
which analyses audio captured by the microphone 112. Furthermore, the controllers
118 may comprise a video analysis controller 121 which analyses video captured by
the video camera 110. In embodiments, the controllers 118 may further comprise a text-to-speech controller 122 which converts text from the presentation structure 115 to speech.
[0020] Figure 2 illustrates exemplary processing 123 by the system 100.
[0021] The processing 123 may comprise template interface interaction at step 124 wherein the templating controller 119 generates the presentation data structure 115 and associated parameters 116 at step 125 using at least one template 114.
[0022] Specifically, for the generation of a presentation structure 115, a user of the
system 100 may select a template 114 for generating a presentation wherein the
system 100 displays a user interface 107 comprising on-screen controls to generate
the presentation structure 115.
[0023] The interface 107 may request various information such as the objectives of
the presentation, introduction, talking points, conclusions and the like. In
embodiments, the interface 107 may have input fields for what the presenter would
like the audience to think, feel and do.
[0024] The information input may be inserted into placeholders of the relevant
template 114 for the generation of the presentation structure 115.
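By way of non-limiting illustration, the placeholder insertion of this step may be sketched as follows in Python; the template wording and field names (objective, think, feel, do) are illustrative assumptions rather than content taken from the specification:

```python
from string import Template

# Hypothetical template 114: placeholder names are illustrative only.
PERSUASIVE_TEMPLATE = Template(
    "Objective: $objective\n"
    "Introduction: $introduction\n"
    "Talking points: $talking_points\n"
    "Audience should think: $think; feel: $feel; do: $do\n"
    "Conclusion: $conclusion"
)

def generate_presentation_structure(template: Template, inputs: dict) -> str:
    """Insert the interface inputs into the template's placeholders."""
    return template.substitute(inputs)

structure_115 = generate_presentation_structure(PERSUASIVE_TEMPLATE, {
    "objective": "Secure project funding",
    "introduction": "Why this matters now",
    "talking_points": "Cost; Risk; Timeline",
    "think": "The plan is credible",
    "feel": "Confident in the team",
    "do": "Approve the budget",
    "conclusion": "Call to action",
})
```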
[0025] The user may also configure various parameters, such as the intended
audience, style of presentation and the like. These parameters may be stored in
relation to the presentation structure 115.
[0026] Once having generated the presentation at steps 124 and 125, the system
displays the presentation in the interface 107 according to the generated presentation
structure 115. For example, the interface 107 may display key points, tips and content
for each stage of the presentation, such as for the introduction, main part and
conclusion stages of the presentation.
[0027] The user then uses the interface 107 to deliver the presentation while the system 100 records the user.
[0028] Specifically, at step 127 the system 100 captures audio via the microphone
112 and, at step 128, the audio analysis controller 120 analyses the audio to generate
an audio analysis feedback score.
[0029] Simultaneously, at step 126, the system 100 captures video data using the
video camera 110 and, at step 129, the video analysis controller 121 analyses the
video for generating a video analysis feedback score.
[0030] In a preferred embodiment, the system 100 captures both audio and video.
However, in embodiments, the system 100 captures either audio or video, including
depending on the hardware capabilities of the computing device 101.
[0031] At step 130 the system 100 updates the user profile 117 with the audio and
video feedback scores and, at step 131, displays the results thereof on the interface
107.
[0032] Figure 3 illustrates exemplary video analysis processing 132 performed by the
video analysis controller 121 for gaze detection in accordance with an embodiment.
[0033] The processing 132 is used to classify the gaze of the user during presentation and, more specifically, to classify whether the user is looking directly ahead or up, down, left or right during the presentation.
[0034] In accordance with this classification, the video analysis controller 121 may
increase the video feedback score when the user is looking directly ahead as opposed
to the sides or up and down.
[0035] In embodiments, the video analysis controller 121 may further adjust the video
analysis feedback score depending on whether the user is looking to the sides or
down wherein the video analysis controller 121 penalises the user (such as by
decrementing the video analysis feedback score or incrementing the video analysis
feedback score by a smaller amount) when the user is looking down at presentation
material as opposed to looking centre and side-to-side and engaging with the
audience.
[0036] The video analysis processing 132 may comprise facial key point detection at
step 133 wherein facial key points are identified from the video data. The facial key
points detected may comprise nose, chin, facial outline, eyebrow, ear, hairline key
points.
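A sketch of one way step 133 may be realised, using dlib's 68-point facial landmark model; the choice of library and the model file path are assumptions, as the specification prescribes neither:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The 68-point model file must be obtained separately (assumption).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

LEFT_EYE = range(36, 42)   # landmark indices of the left eye contour
RIGHT_EYE = range(42, 48)  # landmark indices of the right eye contour

def eye_key_points(frame):
    """Return ([(x, y)] left eye, [(x, y)] right eye) or None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    return ([points[i] for i in LEFT_EYE],
            [points[i] for i in RIGHT_EYE])
```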
[0037] Steps 134-136 may be used for detecting gaze at step 137. Step 134 may comprise binary masking wherein the video data is converted to black and white. Such conversion may depend on ambient lighting conditions and the like, and the video analysis controller 121 may dynamically adjust a threshold accordingly.
[0038] At step 134, the video analysis controller 121 specifically attempts to binary mask the eye region of the user. As such, the video analysis controller 121 may segment the eye region using the facial key points detected at step 133 and then dynamically adjust the threshold until such time that two independent white regions (i.e. the eyes) are detected within a continuous black background.
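A sketch of this adaptive thresholding using OpenCV (an assumption; the specification names no library), raising the threshold until exactly two white regions appear:

```python
import cv2

def binary_mask_eyes(eye_region_gray):
    """Raise the threshold until exactly two white regions remain."""
    for threshold in range(30, 220, 5):  # sweep range is an assumption
        # Pixels darker than the threshold (iris/pupil) become white.
        _, mask = cv2.threshold(eye_region_gray, threshold, 255,
                                cv2.THRESH_BINARY_INV)
        # Count connected white regions; label 0 is the black background.
        num_labels, _ = cv2.connectedComponents(mask)
        if num_labels - 1 == 2:
            return mask
    return None  # two independent regions could not be isolated
```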
[0039] Step 135 may comprise segmentation wherein the white regions (correlating
to each eye) are segmented.
[0040] Step 136 may comprise contour finding. Specifically, contour finding may
comprise complete or partial circular contour finding to identify the generally circular
iris within the sclera. In embodiments, contour finding may identify the general shape
of the sclera.
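Contour finding at step 136 may, for example, be sketched with OpenCV as fitting a minimum enclosing circle to the largest contour in the eye mask; again the library choice is an assumption:

```python
import cv2

def find_iris_centre(mask):
    """Fit a circle to the largest contour; return (x, y, radius) or None."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    iris = max(contours, key=cv2.contourArea)  # largest blob as the iris
    (x, y), radius = cv2.minEnclosingCircle(iris)
    return x, y, radius
```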
[0041] At step 137, the video analysis controller 121 can therefore detect the gaze.
Specifically, the video analysis controller 121 may identify the centre point of the iris
region identified by the aforedescribed contour finding wherein the respective position
of the centre point with respect to the surrounding segmented region is indicative of
the gaze of the user.
[0042] In other words, if the centre point is substantially within the centre of the
segmented area, the video analysis controller 121 determines that the gaze is directly
forward whereas if the centre point is to one side of the segmented area, the video
analysis controller 121 determines that the gaze is to one side.
[0043] At step 138, the video analysis controller 121 classifies the gaze. Classification may comprise classifying the gaze into five regions comprising centre, up, down, left and right.
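This five-region classification may be sketched as follows; the 30% boundary proportion is an illustrative assumption, as the specification does not state where the regions begin:

```python
def classify_gaze(centre_x, centre_y, region_w, region_h, margin=0.3):
    """Assign the iris centre to one of the five gaze regions."""
    rel_x = centre_x / region_w  # 0.0 = far left of the eye segment
    rel_y = centre_y / region_h  # 0.0 = top of the eye segment
    if rel_x < margin:
        return "left"
    if rel_x > 1.0 - margin:
        return "right"
    if rel_y < margin:
        return "up"
    if rel_y > 1.0 - margin:
        return "down"
    return "centre"
```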
[0044] Each region may comprise an associated weighting/score used for updating
the video analysis feedback score. For example, the central region may comprise a
weighting of 10, the left and right regions may each comprise a weighting of five, the
upper region may comprise a weighting of three and a lower region may comprise a
weighting of zero.
[0045] As such, for each time period, such as one minute, the video analysis controller 121 may detect the duration of the gaze within each of these regions and calculate a video analysis feedback score in proportion to the time within each region and the associated score.
[0046] For example, for a one-minute period, should the gaze be detected as being within the central region for 30 seconds and within the lower region for 30 seconds, the video analysis controller 121 may assign a video analysis feedback score of five, being the time-weighted average of the scores of ten and zero for the respective central and lower regions.
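The time-weighted calculation of paragraphs [0044] to [0046] may be expressed as follows, using the weightings given above:

```python
# Region weightings from paragraph [0044].
REGION_WEIGHTS = {"centre": 10, "left": 5, "right": 5, "up": 3, "down": 0}

def video_feedback_score(durations):
    """Time-weighted average of region weights; durations in seconds."""
    total = sum(durations.values())
    if total == 0:
        return 0.0
    return sum(REGION_WEIGHTS[r] * t for r, t in durations.items()) / total

# Paragraph [0046]: 30 s centre (weight 10) and 30 s down (weight 0)
# over one minute gives (10 * 30 + 0 * 30) / 60 = 5.
assert video_feedback_score({"centre": 30, "down": 30}) == 5.0
```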
[0047] Figure 4 illustrates exemplary audio processing 139 in accordance with an embodiment. An initial score, such as zero, may be set at step 142.
[0048] Step 140 comprises capturing audio from the microphone 112.
[0049] Step 141 may comprise the audio analysis controller 120 performing volume measurement of the captured audio. The audio analysis controller 120 may compare
the volume average or time period volume average against a target threshold region
such as between 70 and 80 dB.
[0050] The audio analysis controller 120 may positively adjust the score at step 143
when the detected volume is within this target range and negatively adjust the score
when the detected volume is outside this target range.
[0051] In embodiments, the audio analysis controller 120 may set the target threshold region depending on the various parameters. For example, the parameters may include the size of the venue (such as a meeting room or theatre), the number of attendees, whether an audio amplifier is being used, distance from the microphone and the like. Depending on these parameters, the audio analysis controller 120 may select the target threshold region from a lookup table. For example, for a meeting room venue, the target threshold region may be between 60 and 70 dB as opposed to a theatre venue wherein the target threshold region would be higher, such as from 70 to 80 dB.
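A sketch of the volume measurement and venue lookup of paragraphs [0049] to [0051]; the decibel reference and the score increments are assumptions, and absolute sound pressure cannot be obtained without microphone calibration:

```python
import numpy as np

# Venue-dependent target ranges (dB) per the lookup table; the mapping
# beyond the 60-70 and 70-80 dB figures given above is an assumption.
TARGET_RANGES_DB = {"meeting_room": (60, 70), "theatre": (70, 80)}

def volume_db(samples, reference=1.0):
    """RMS level of an audio window in decibels relative to `reference`."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(max(rms / reference, 1e-12))

def adjust_score_for_volume(score, samples, venue="meeting_room"):
    """Reward a level inside the venue's target range, else penalise."""
    low, high = TARGET_RANGES_DB[venue]
    return score + 1.0 if low <= volume_db(samples) <= high else score - 1.0
```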
[0052] The processing 139 may comprise speech-to-text at step 144 wherein the audio analysis controller 120 converts the audio to text. Specifically, the speech-to-text conversion converts the audio into a string of words, each having an associated timing marker.
[0053] As such, at step 145, the processing 139 may comprise pace measurement wherein the audio analysis controller 120 adjusts the score at step 146 with reference to a target pace threshold range. Similarly, the audio analysis controller 120 may adjust the target pace threshold range depending on the various parameters. For
example, depending on the type of presentation, audience and/or topic of the
presentation, the target pace may vary accordingly.
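Pace measurement over the timing markers may be sketched as follows; the 110 to 150 words-per-minute target range is an illustrative assumption standing in for the parameter-dependent range:

```python
def pace_wpm(timed_words):
    """Words per minute from (word, start_seconds) speech-to-text output."""
    if len(timed_words) < 2:
        return 0.0
    span = timed_words[-1][1] - timed_words[0][1]
    return len(timed_words) / span * 60.0 if span > 0 else 0.0

def adjust_score_for_pace(score, timed_words, target=(110, 150)):
    """Reward a pace inside the target range, else penalise."""
    wpm = pace_wpm(timed_words)
    return score + 1.0 if target[0] <= wpm <= target[1] else score - 1.0
```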
[0054] Step 147 may comprise pause measurement wherein the pauses between words and between sentences are measured. At step 148, the audio analysis controller 120 may adjust the score accordingly. Similarly, the audio analysis controller 120 may utilise a target pause threshold range depending on the configured parameters 116.
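Pause measurement may be sketched analogously; the 0.5 second pause floor and the target count range are illustrative assumptions:

```python
def count_pauses(timed_words, min_pause=0.5):
    """Count gaps of at least min_pause seconds between timing markers."""
    starts = [t for _, t in timed_words]
    return sum(1 for a, b in zip(starts, starts[1:]) if b - a >= min_pause)

def adjust_score_for_pauses(score, timed_words, target=(4, 12)):
    """Reward a pause count inside the target range, else penalise."""
    pauses = count_pauses(timed_words)
    return score + 1.0 if target[0] <= pauses <= target[1] else score - 1.0
```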
[0055] The processing 139 may comprise filler word detection at step 149. Filler word detection 149 may comprise cross-referencing the detected words with a filler word dictionary 151. The filler word dictionary 151 may comprise filler words such as "umm", "ah" and the like. The audio analysis controller 120 may decrement the score at step 150 proportionate to the number of filler words detected.
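Filler word detection against the dictionary 151 may be sketched as follows; the dictionary contents beyond "umm" and "ah", and the per-word penalty, are assumptions:

```python
# Dictionary 151: entries beyond "umm" and "ah" are assumptions.
FILLER_WORDS = {"umm", "um", "ah", "uh", "er", "like"}

def filler_penalty(words, per_word=0.5):
    """Score decrement proportionate to the detected filler words."""
    return per_word * sum(1 for w in words if w.lower() in FILLER_WORDS)

score = 10.0
score -= filler_penalty(["so", "umm", "we", "ah", "begin"])  # score -> 9.0
```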
[0056] At step 152, the audio analysis controller 120 may assign the score to various
categories.
[0057] The display of the feedback results at step 131 may comprise displaying the
actual video and audio analysis feedback score. Alternatively, the video and audio
analysis feedback score may be assigned to various categories.
[0058] For example, the gaze detection may be used to classify that the user gaze
during presentation is "good", or "looking down too much".
[0059] Furthermore, the audio analysis feedback score may be used to classify that the user's
speech is "a bit rushed for the intended audience" or "needs more pauses for greater
emphasis for this type of topic".
[0060] In embodiments, the text-to-speech controller 122 may convert text of a presentation as defined by the presentation structure 115 to speech. In embodiments, the text-to-speech controller 122 may present the presentation depending on the configured parameters 116. For example, depending on the target pace, pause and volume thresholds which depend on the configured parameters 116, the text-to-speech controller 122 may convert the text of the presentation to speech with volume, pace and pauses within the target thresholds.
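A sketch of such parameter-driven delivery using the pyttsx3 library (an assumption; the specification names no text-to-speech engine):

```python
import pyttsx3

def speak_presentation(text, target_wpm=130, volume=0.9):
    """Deliver the presentation text at the configured pace and volume."""
    engine = pyttsx3.init()
    engine.setProperty("rate", target_wpm)  # rate is roughly words/minute
    engine.setProperty("volume", volume)    # 0.0 to 1.0
    engine.say(text)
    engine.runAndWait()
```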
[0061] In embodiments, the system may convert the presentation structure to a
conventional presentation format, such as Microsoft™ PowerPoint™.
[0062] The foregoing description, for purposes of explanation, used specific
nomenclature to provide a thorough understanding of the invention. However, it will
be apparent to one skilled in the art that specific details are not required in order to
practise the invention. Thus, the foregoing descriptions of specific embodiments of
the invention are presented for purposes of illustration and description. They are not
intended to be exhaustive or to limit the invention to the precise forms disclosed as
obviously many modifications and variations are possible in view of the above
teachings. The embodiments were chosen and described in order to best explain the
principles of the invention and its practical applications, thereby enabling others
skilled in the art to best utilize the invention and various embodiments with various
modifications as are suited to the particular use contemplated. It is intended that the
following claims and their equivalents define the scope of the invention.
[0063] The term "approximately" or similar as used herein should be construed as
being within 20% of the value stated unless otherwise indicated.

Claims (27)

  1. An audio-visual analysing system for automated presentation delivery feedback generation comprising a computing device having a: processor and memory device operably coupled thereto, the memory device comprising computer program code instructions and associated data fetched, interpreted and executed by the processor in use; a video interface interfacing a video camera; an audio interface interfacing a microphone; wherein the computer program code instructions comprise: a video analysis controller which analyses video captured by the video camera to generate a video analysis feedback score; and an audio analysis controller which analyses audio captured by the microphone to generate an audio analysis feedback score.
  2. The system as claimed in claim 1, wherein the computer program code instructions further comprise a templating controller which generates a presentation data structure using at least one presentation template.
  3. The system as claimed in claim 1, wherein at least one parameter is stored in relation to the presentation data structure.
  4. The system as claimed in claim 2, wherein the system further comprises a display interface interfacing a digital display and wherein the system displays a user interface configured according to the presentation data structure whilst the video and audio is captured.
  5. The system as claimed in claim 1, wherein video analysis comprises gaze detection.
  6. The system as claimed in claim 5, wherein gaze detection comprises segmentation to segment eye regions and contour finding to identify an iris within each segment.
  7. The system as claimed in claim 6, wherein segmentation comprises selecting eye regions according to detected facial key points.
  8. The system as claimed in claim 6, wherein segmentation comprises binary masking.
  9. The system as claimed in claim 8, wherein binary masking comprises adaptive binary masking thresholding.
  10. The system as claimed in claim 9, wherein adaptive binary masking thresholding comprises adapting a threshold until two regions are detected.
  11. The system as claimed in claim 5, wherein gaze detection comprises assigning a detected gaze to a plurality of gaze regions.
  12. The system as claimed in claim 11, wherein the gaze regions comprise a central region, upper region, lower region and side regions.
  13. The system as claimed in claim 12, wherein each region is associated with a score and wherein the video analysis comprises adjusting the video analysis feedback score depending on the score associated with each region.
  14. The system as claimed in claim 13, wherein video analysis further comprises adjusting the video analysis feedback score depending on a time period associated with each region.
  15. The system as claimed in claim 1, wherein audio analysis comprises volume measurement.
  16. The system as claimed in claim 15, wherein the audio analysis controller adjusts the audio analysis feedback score with reference to a target volume threshold range.
  17. The system as claimed in claim 16, wherein the target volume threshold range depends on a presentation parameter.
  18. The system as claimed in claim 1, wherein the audio analysis comprises speech-to-text to convert the audio to words, each having a timing marker associated therewith, and pace measurement which determines a pace according to the timing marker associated with each word.
  19. The system as claimed in claim 18, wherein the audio analysis controller adjusts the audio analysis feedback score with reference to a target pace threshold range.
  20. The system as claimed in claim 19, wherein the target pace threshold range depends on a presentation parameter.
  21. The system as claimed in claim 1, wherein the audio analysis comprises speech-to-text to convert the audio to words, each having a timing marker associated therewith, and pause measurement which determines pauses between words according to the timing markers associated with each word.
  22. The system as claimed in claim 21, wherein the audio analysis controller adjusts the audio analysis feedback score with reference to a target pause threshold range.
  23. The system as claimed in claim 22, wherein the target pause threshold range depends on a presentation parameter.
  24. The system as claimed in claim 1, wherein the audio analysis comprises speech-to-text to convert the audio to words and filler word detection which detects filler words from a filler word dictionary.
  25. The system as claimed in claim 24, wherein the audio analysis controller adjusts the audio analysis feedback score proportionate to a number of detected filler words.
  26. The system as claimed in claim 1, wherein the computer program code instructions comprise a text-to-speech controller which converts text of a presentation as defined by a presentation structure to speech.
  27. The system as claimed in claim 26, wherein the text-to-speech controller converts the text depending on at least one parameter configured for the presentation structure.
AU2021104873A 2021-02-25 2021-08-03 An audio-visual analysing system for automated presentation delivery feedback generation Ceased AU2021104873A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2021900520A AU2021900520A0 (en) 2021-02-25 An audio-visual analysing system for automated presentation delivery feedback generation
AU2021900520 2021-02-25

Publications (1)

Publication Number Publication Date
AU2021104873A4 (en) 2021-09-30

Family

ID=77857775

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021104873A Ceased AU2021104873A4 (en) 2021-02-25 2021-08-03 An audio-visual analysing system for automated presentation delivery feedback generation

Country Status (2)

Country Link
AU (1) AU2021104873A4 (en)
WO (1) WO2022178587A1 (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050119894A1 (en) * 2003-10-20 2005-06-02 Cutler Ann R. System and process for feedback speech instruction
DK2012304T3 (en) * 2007-07-06 2012-11-19 Zero To One Technology Comscope Methods for electronic analysis of a dialogue and similar systems
US20170213190A1 (en) * 2014-06-23 2017-07-27 Intervyo R&D Ltd. Method and system for analysing subjects
US10446055B2 (en) * 2014-08-13 2019-10-15 Pitchvantage Llc Public speaking trainer with 3-D simulation and real-time feedback
US20180025303A1 (en) * 2016-07-20 2018-01-25 Plenarium Inc. System and method for computerized predictive performance analysis of natural language
WO2019017922A1 (en) * 2017-07-18 2019-01-24 Intel Corporation Automated speech coaching systems and methods
US10963841B2 (en) * 2019-03-27 2021-03-30 On Time Staffing Inc. Employment candidate empathy scoring system

Also Published As

Publication number Publication date
WO2022178587A1 (en) 2022-09-01


Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry