GB2585334A

GB2585334A - Method for obscuring or encrypting a voice recording

Info

Publication number: GB2585334A
Application number: GB1902423.1A
Authority: GB
Inventors: Coker Tim
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-02-22
Filing date: 2019-02-22
Publication date: 2021-01-13
Also published as: GB201902423D0

Abstract

A speech recording is obfuscated by processing it with a speech recognition means, identifying individual spoken words, extracting those words, reversing them, and then re-inserting the reversed word in place of where it was extracted from the recording. The reinsertion step may be done so as to neither overwrite unprocessed data nor create new gaps in the modified recording. A word table may be generated (Fig. 4). The method may take place substantially in real-time. The method may be used to obscure an actor’s speech, for example in either a motion picture or in a theatre production, with the word table used to generate subtitles. This method allows a recording to be rendered unintelligible by reversing the time order in which each word is stored whilst retaining the word order in the recording. The original recording may be recovered by using information about the word breaks.

Description

Method for obscuring or encrypting a voice recording

This invention relates to a method of obscuring or encrypting a voice message or spoken word recording.

Voice messages and recordings are a common form of communication, for example the voice-mail used in modern telephony, and more generally the spoken word has been recorded since the 19th century. Initially this was done using analogue methods, but it is now invariably done with electronic and digital methods. Other examples of voice recordings include the sound track of motion pictures, radio productions (that are not live) and audio books. In all cases the recorded form will consist of a series of spoken words that together form a speech, an utterance or simply information in its most general sense. In some cases the recording may also include backing music or sound effects, but these are not an essential part of the invention.

In certain cases it may be advantageous or necessary to obscure or even encrypt such a voice recording, for security or possibly just to defeat casual eavesdropping. This can be done in many ways which will be familiar to those skilled in the art, for example by applying an encryption key to the digital message.

One method of obscuring a voice recording is simply to reverse the time order in which the recording is stored; when played back, the recording then becomes completely unintelligible. Note that with this method both the individual words and the sequence in which they are stored are reversed. However, this method has the disadvantage that the original message can very easily be restored by reversing the stored message a second time, and no additional information is required to do so. The invention presently disclosed improves on this simple technique by reversing the time order in which each word is stored whilst retaining the order of the sequence of words in the recording. When played back, the modified recording is still unintelligible, but the original message can no longer be restored simply by reversing the whole message. The original message can only be fully restored if the restoration means also has the information about the positions in time of the word breaks in the original message. Because this additional information is required to recover the original form of the message, the process is in effect a form of encryption, and the information about the word breaks is equivalent to an encryption key.
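The difference between the two schemes can be sketched on a toy list of integers standing in for digitised samples. This is a minimal illustration, not part of the original disclosure; the sample values and the word boundaries (start and end sample indices, end exclusive) are invented for the example.

```python
def reverse_whole(samples):
    # Reverse the entire recording; applying this twice restores the original.
    return samples[::-1]


def reverse_each_word(samples, boundaries):
    # Reverse each word segment in place, keeping the order of the words.
    out = list(samples)
    for start, end in boundaries:
        out[start:end] = out[start:end][::-1]
    return out


original = [1, 2, 3, 0, 4, 5, 6, 7]   # word A = 1, 2, 3; silence = 0; word B = 4..7
boundaries = [(0, 3), (4, 8)]          # hypothetical word breaks (end exclusive)

obscured = reverse_each_word(original, boundaries)
print(obscured)                                             # [3, 2, 1, 0, 7, 6, 5, 4]
print(reverse_whole(obscured))                              # [4, 5, 6, 7, 0, 1, 2, 3]
print(reverse_each_word(obscured, boundaries) == original)  # True
```

Reversing the whole of the obscured list does not recover the original; only re-applying the per-word reversal with the same boundaries does, which is why the word breaks act as a key.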

The invention will now be described, purely by way of example, with reference to the accompanying Figures, in which:

Figure 1 shows a breakdown of the steps required to obscure or encrypt a voice message according to the invention.

Figure 2 shows Steps 2 and 3 of the process in more detail.

Figure 3 indicates how the order of words in the message is retained.

Figure 4 shows an example of a Word Table.

In one embodiment of the invention the process has four steps which are those shown in Figure 1. First the actual voice recording is digitised (Step 1, although this aspect may be carried out by a separate means), using any method known in the art, for example sampling the recorded waveform at a suitable frequency (16kHz is adequate for simple speech) and then digitising each sample, typically with 16 bit precision. These digitised samples can then be packed together in a digital file. By way of example, the ".wav" file format describes how uncompressed, digitised sound can be recorded in this way.
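Step 1 can be sketched with Python's standard `wave`, `struct` and `math` modules. This is a minimal sketch assuming mono audio at 16 kHz with 16-bit signed samples; a sine tone stands in for the analogue speech signal, and the function names `digitise`, `to_wav_bytes` and `from_wav_bytes` are hypothetical.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000   # 16 kHz, adequate for simple speech
SAMPLE_WIDTH = 2       # bytes per sample, i.e. 16-bit precision


def digitise(analogue, duration_s, rate=SAMPLE_RATE):
    """Sample a signal (a function of time in seconds, range -1..1) and
    quantise each sample to a signed 16-bit integer."""
    n = int(duration_s * rate)
    return [max(-32768, min(32767, round(analogue(i / rate) * 32767))) for i in range(n)]


def to_wav_bytes(samples, rate=SAMPLE_RATE):
    """Pack 16-bit mono samples into an uncompressed .wav file held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))  # little-endian
    return buf.getvalue()


def from_wav_bytes(data):
    """Unpack a mono 16-bit .wav file back into (samples, sample rate)."""
    with wave.open(io.BytesIO(data), "rb") as r:
        count = r.getnframes()
        samples = list(struct.unpack(f"<{count}h", r.readframes(count)))
        return samples, r.getframerate()


def tone(t):
    # Stand-in for a recorded voice: a 440 Hz sine wave at half amplitude.
    return 0.5 * math.sin(2 * math.pi * 440 * t)


samples = digitise(tone, 0.25)          # 0.25 s at 16 kHz -> 4000 samples
wav = to_wav_bytes(samples)
restored, rate = from_wav_bytes(wav)
print(len(samples), rate, restored == samples)   # 4000 16000 True
```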

In Step 2, the digitised recording is processed using any suitable speech recognition means, for example CMUSphinx (https://cmusphinx.github.io), an open-source speech recognition library. One function available in CMUSphinx is a means to iterate through a voice message or digitised speech recording and output the start and finish times of each word as it is recognised. With this information the individual words can be extracted from the recording and then individually reversed (Step 3). Once the whole recording has been processed in this way, it is stored (Step 4) as an encrypted or obscured message, using any suitable storage means. Note that this step is not essential to the invention; for example, the modified recording could be broadcast and then discarded.
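Extracting, reversing and re-inserting each word can be sketched as follows, assuming the recogniser's output has already been converted into a list of (start, finish) times in seconds, one per word. That list format, the 16 kHz sample rate and the function names are assumptions for illustration; CMUSphinx reports word timings in its own format, which is not reproduced here.

```python
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16_000  # assumed sample rate of the digitised recording

# Assumed recogniser output after conversion: (start_s, finish_s) per word,
# in time order and non-overlapping.
WordTiming = Tuple[float, float]


def to_sample_range(timing: WordTiming, rate: int = SAMPLE_RATE) -> Tuple[int, int]:
    """Convert a word's start/finish times in seconds to sample indices."""
    start_s, finish_s = timing
    return round(start_s * rate), round(finish_s * rate)


def obscure(samples: Sequence[int], timings: Sequence[WordTiming],
            rate: int = SAMPLE_RATE) -> List[int]:
    """Extract each word, reverse it and re-insert it where it came from.

    Each reversed word occupies exactly the samples it was taken from, so
    the modified recording has the same length as the original."""
    out = list(samples)
    for timing in timings:
        start, end = to_sample_range(timing, rate)
        word = out[start:end]   # Step 2b: extract the word
        word.reverse()          # Step 3a: reverse it
        out[start:end] = word   # Step 3b: re-insert it in place
    return out


recording = list(range(16_000))          # one second of dummy samples
modified = obscure(recording, [(0.0, 0.25), (0.5, 0.75)])
print(modified[0], modified[3999], modified[4000])   # 3999 0 4000
```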

Figure 2 shows Steps 2 and 3 in more detail. Within Step 2, Step 2a is the processing referred to above, whose output is the start/finish time within the recording of each individual word (and, optionally, the text form of the word itself). In Step 2b, this start/finish time is used to extract the individual word from the recording; the extracted word is then reversed (Step 3a) and re-inserted into the recording in place of the original word (Step 3b). An optional output of Step 2 is a "Word Table" (Step 2c) indicating, for each word recognised, its start/finish time within the recording. If this output is stored separately from the modified recording, it becomes an encryption key that can be used to restore the original from the modified recording.
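A word table (Step 2c) might be represented and written out as below. The column layout (`word`, `start_s`, `finish_s`, `text`) is an assumption, since the content of Figure 4 is not reproduced in this text, and the recogniser output is taken to be hypothetical (start, finish, text) tuples.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class WordEntry:
    start_s: float               # start of the word within the recording
    finish_s: float              # finish of the word within the recording
    text: Optional[str] = None   # optional text form of the word


def build_word_table(recognised: Iterable[Tuple[float, float, Optional[str]]]) -> List[WordEntry]:
    """Step 2c: one entry per recognised word, in recording order."""
    return [WordEntry(start, finish, text) for start, finish, text in recognised]


def word_table_csv(table: List[WordEntry]) -> str:
    """Write the word table as CSV, numbering the words from 1."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "start_s", "finish_s", "text"])
    for number, entry in enumerate(table, start=1):
        writer.writerow([number, f"{entry.start_s:.3f}", f"{entry.finish_s:.3f}", entry.text or ""])
    return buf.getvalue()


table = build_word_table([(0.10, 0.42, "hello"), (0.50, 0.95, "world")])
print(word_table_csv(table))
```

For the two invented words above this prints a header row followed by `1,0.100,0.420,hello` and `2,0.500,0.950,world`.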

In Figure 3 we show, diagrammatically, how a recorded message is processed. The reversed spelling of each word is meant to indicate how the individual words are reversed but the order of words within the whole is retained.
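The textual convention of Figure 3 can be mimicked in a few lines: reverse the spelling of each word while keeping the word order, and contrast this with reversing the whole message. The sample sentence is invented.

```python
def reverse_each_word(message: str) -> str:
    """Reverse the spelling of every word but keep the words in order."""
    return " ".join(word[::-1] for word in message.split(" "))


def reverse_everything(message: str) -> str:
    """Reverse the whole message, as in the simple prior technique."""
    return message[::-1]


print(reverse_each_word("the quick brown fox"))   # eht kciuq nworb xof
print(reverse_everything("the quick brown fox"))  # xof nworb kciuq eht
```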

Figure 4 is an example of a word table that might be produced at Step 2c. If this is transmitted to a recipient along with the modified recording, the original recording can be reconstructed using the following process:

A processing means scans through the recording to the point indicated by the first entry in the word table.

An extract of the recording, starting and finishing at the times indicated for the said first entry, is made.

This extract is reversed, then re-inserted into the recording starting at the start time of the said first entry.

This process is repeated for each subsequent entry in the said table.
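For an uncompressed recording, the reconstruction process above can be sketched directly on the raw PCM data of a `.wav` file. This sketch assumes mono, little-endian 16-bit samples and a word table of (start, finish) times in seconds; the names are hypothetical. One detail worth noting is that each extract must be reversed sample by sample rather than byte by byte, otherwise the two bytes of every sample would be swapped.

```python
import struct
from typing import Sequence, Tuple

BYTES_PER_SAMPLE = 2   # assumed: mono, 16-bit little-endian PCM
SAMPLE_RATE = 16_000   # assumed sample rate


def reverse_samples(chunk: bytes) -> bytes:
    """Reverse the time order of whole 16-bit samples (not individual bytes)."""
    count = len(chunk) // BYTES_PER_SAMPLE
    values = struct.unpack(f"<{count}h", chunk)
    return struct.pack(f"<{count}h", *reversed(values))


def restore_pcm(modified: bytes, word_table: Sequence[Tuple[float, float]],
                rate: int = SAMPLE_RATE) -> bytes:
    """Rebuild the original PCM data from the modified data and the word table."""
    out = bytearray(modified)
    for start_s, finish_s in word_table:                  # each entry in turn
        start = round(start_s * rate) * BYTES_PER_SAMPLE  # scan to the start time
        end = round(finish_s * rate) * BYTES_PER_SAMPLE
        extract = bytes(out[start:end])                   # extract start..finish
        out[start:end] = reverse_samples(extract)         # reverse and re-insert
    return bytes(out)


original = struct.pack("<100h", *range(-50, 50))
key = [(1.0, 3.0), (5.0, 8.0)]
modified = restore_pcm(original, key, rate=10)   # the same operation also obscures
print(restore_pcm(modified, key, rate=10) == original)   # True
```

Because reversing a segment twice with the same boundaries is the identity, the one function serves for both obscuring and restoring in this uncompressed case.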

Where necessary, the extracted and/or reversed data would be decompressed and optionally re-compressed, and the same regard for data size as in the encryption methods (see below) should be applied.

Note that, if the word table is transmitted to a recipient as an encryption key, then, clearly, the text form of the actual words is not included.
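Stripping the text before sending the word table as a key might look like the following. The JSON encoding and the function names are illustrative assumptions, not a format defined by the specification.

```python
import json
from typing import List, Optional, Sequence, Tuple

Entry = Tuple[float, float, Optional[str]]   # (start_s, finish_s, text)


def export_key(word_table: Sequence[Entry]) -> str:
    """Keep only the word-break timings; the text of the words is dropped."""
    return json.dumps([[start, finish] for start, finish, _text in word_table])


def import_key(key: str) -> List[Tuple[float, float]]:
    """Read the timings back as (start_s, finish_s) pairs for decryption."""
    return [(float(start), float(finish)) for start, finish in json.loads(key)]


key = export_key([(0.1, 0.42, "hello"), (0.5, 0.95, "world")])
print(key)               # [[0.1, 0.42], [0.5, 0.95]]
print(import_key(key))   # [(0.1, 0.42), (0.5, 0.95)]
```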

In a second embodiment of the invention, the recording might be compressed after it is sampled and digitised, for example according to the well-known MP3 format. In this embodiment, the extracted word would need to be decompressed before being reversed, and then re-compressed before being re-inserted into the message. The means to achieve this would also need to take account of the form of the compression: the compressed recording has something of a "frame" structure (unlike the uncompressed form, which is effectively a continuous stream of samples without internal structure), so when a word is extracted it must be extracted between frame boundaries to ensure that sufficient information is extracted for the word to be decompressed properly. Once decompressed, the word is reversed and then re-compressed before being re-inserted into the original.
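The frame-boundary requirement can be illustrated by widening a word's sample range to whole frames. This sketch assumes MPEG-1 Layer III, in which each frame carries 1152 samples per channel; it only computes the aligned range and does not decode any MP3 data, and the aligned span may include audio on either side of the word. In practice, MP3's bit reservoir means a frame's audio data can begin in an earlier frame, so a real implementation may also need to decode some neighbouring frames.

```python
from typing import Tuple

MP3_FRAME_SAMPLES = 1152   # samples per frame for MPEG-1 Layer III


def frame_aligned_range(start_sample: int, end_sample: int,
                        frame_samples: int = MP3_FRAME_SAMPLES) -> Tuple[int, int, int, int]:
    """Widen a word's sample range [start, end) to whole frames.

    Returns (first_frame, end_frame_exclusive, aligned_start, aligned_end)."""
    first_frame = start_sample // frame_samples
    end_frame = -(-end_sample // frame_samples)   # ceiling division
    return first_frame, end_frame, first_frame * frame_samples, end_frame * frame_samples


# A hypothetical word from sample 22050 to 40572 (0.50 s to 0.92 s at 44.1 kHz)
print(frame_aligned_range(22050, 40572))   # (19, 36, 21888, 41472)
```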

Note that in the first embodiment, the file size of the modified recording should be exactly the same as the original, but in the second embodiment this is not necessarily the case, and the modified file could be the same size as the original, or smaller or indeed larger.

In a third embodiment, being a variation of the second embodiment, the reversed word is not re-compressed before being re-inserted, in which case the file size of the modified recording will definitely be larger than that of the original. In this case, whilst the method is different, the form of the modified recording will actually be the same as in the first embodiment. Where the reversed and possibly re-compressed word does not have the same data size as the extract it is replacing, its re-insertion into the file (i.e. the recording) must be done in such a way as not to over-write information that has not yet been processed or create gaps in the recording that were not present in the original. Means to do this will be known to those skilled in the art.
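One way to re-insert data of a different size without over-writing unprocessed data or leaving gaps is to splice each replacement into the buffer and carry a running offset for the later words. This is a sketch on plain bytes with a placeholder `transform` standing in for the decompress, reverse and (optionally) re-compress steps; the segment positions are assumed to be non-overlapping byte ranges in the original file.

```python
from typing import Callable, List, Sequence, Tuple


def replace_segments(data: bytes, segments: Sequence[Tuple[int, int]],
                     transform: Callable[[bytes], bytes]) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Replace each [start, end) segment of data with transform(segment).

    Splicing lets the buffer grow or shrink, so no unprocessed bytes are
    over-written and no gaps appear; a running offset keeps the positions of
    later segments correct. Returns the new data and each segment's position
    in the new data."""
    out = bytearray(data)
    shift = 0
    new_positions = []
    for start, end in sorted(segments):
        s, e = start + shift, end + shift
        replacement = transform(bytes(out[s:e]))
        out[s:e] = replacement                        # splice, not overwrite
        new_positions.append((s, s + len(replacement)))
        shift += len(replacement) - (end - start)
    return bytes(out), new_positions


def placeholder(segment: bytes) -> bytes:
    # Hypothetical stand-in for decompress/reverse/re-compress: the result
    # is reversed and one byte longer than the input.
    return segment[::-1] + b"!"


data, positions = replace_segments(b"abcd-ef-ghi", [(0, 4), (5, 7), (8, 11)], placeholder)
print(data)        # b'dcba!-fe!-ihg!'
print(positions)   # [(0, 5), (6, 9), (10, 14)]
```

Because later words move when earlier replacements change size, the returned positions describe where each word now sits in the modified file.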

In a fourth embodiment, wherein the original recording is compressed, the whole recording is decompressed before the word recognition and extraction/reversal/re-insertion process is carried out, but the resulting file is not re-compressed. A fifth embodiment, being a minor variation of the fourth embodiment, is to re-compress the recording after the extraction/reversal process has been carried out. A sixth embodiment is a variation of the first embodiment wherein the recording is compressed after being processed.

In terms of security, it should be noted that the second, fifth and sixth embodiments are much more secure than any of the first, third or fourth embodiments. This is because a compressed recording cannot simply be reversed in order to play it back, whereas in the other embodiments it is feasible to reverse a modified recording, although the information in the modified/reversed recording is still likely to be obscured or the speech unintelligible.

In a second application of the invention, which may use any of the six embodiments thus far described although methods that don't use compression will be more straightforward, the invention is used to provide a sound effect that can be referred to as "alien translation". This could be required for a science-fiction film or theatre production, where actors are required to portray aliens who would not be talking in English. In this case the method described herein has the advantage over prior art methods in that the rhythm of the spoken words is retained whilst the sound of each word is rendered unintelligible. This is because each word is reversed individually and not the order of words in the whole, thus the rhythm is retained and as a consequence the overall effect is more believable than simply reversing everything.

This application of the invention has the further advantage that the word table produced at Step 2c can be readily and immediately used to provide sub-titles such that the audience will be able to follow what the "alien" is saying. In the case of a motion picture, the word table information can be used to create a subtitle track to the film in the standard way. However for a theatre production, the word table would be used to display sub-title like information to the audience on additional screens. As an additional advantage to the invention, the method clearly works "on-the-fly" meaning that the subtitle for each word is created in close to real time, whereas a prior art method, based on reversing the whole recording, would neither broadcast the actor's alien words, nor the subtitles, to the audience in anything like real time, thus spoiling the overall effect.
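Generating subtitles from a word table could follow the widely used SubRip (`.srt`) layout of a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range and the text. The sketch below groups a fixed number of words per cue; the grouping rule and the (start, finish, text) entry format are assumptions, and the example words are invented.

```python
from typing import List, Sequence, Tuple

Entry = Tuple[float, float, str]   # (start_s, finish_s, text) from the word table


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def word_table_to_srt(table: Sequence[Entry], max_words: int = 5) -> str:
    """Group consecutive words into cues spanning their first start and last finish."""
    cues: List[str] = []
    for i in range(0, len(table), max_words):
        group = table[i:i + max_words]
        start, finish = group[0][0], group[-1][1]
        text = " ".join(word for _, _, word in group)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(finish)}\n{text}\n")
    return "\n".join(cues)


table = [(0.1, 0.42, "take"), (0.5, 0.8, "me"), (0.85, 1.0, "to"),
         (1.1, 1.6, "your"), (1.7, 2.3, "leader")]
print(word_table_to_srt(table, max_words=3))
```

With `max_words=3` this yields two cues: "take me to" from 00:00:00,100 to 00:00:01,000 and "your leader" from 00:00:01,100 to 00:00:02,300.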

Note that throughout this specification, the term "recording" is used to refer to a voice message or speech that has been processed according to the invention or is undergoing such processing or is about to undergo such processing. It should be understood that the term "recording" does not necessarily imply that the result of the method being applied is used to create a permanent or substantially permanent copy of the speech or message. For example the modified "recording" could be broadcast to an audience (as in the second application) and then discarded creating no permanent copy. Nor should it be inferred that the input to the first processing step is required to be a recording rather than, for example, the live output of a microphone, or any other form of data stream. It should also be understood that as the method is followed, there will exist within the processing means what will in effect be interim copies of the voice message or speech, and these should all be considered as forms of a recording, according to the invention, and interpreted accordingly.

Further note that, whilst the speech recognition aspect of the method (Step 2a) can be accomplished with high accuracy, the method is not itself dependent on achieving 100% accuracy in this step. For the second application of the invention (motion picture or theatre productions wherein the actor is talking from a script) accuracy is more important but the script itself can be used to substantially improve the accuracy of the speech recognition process should it be necessary.

Claims

1. A method of encrypting or obscuring a digitally recorded voice message or passage of speech, including the following steps: processing the digitised message with a speech recognition means in order to identify the individual spoken words, and for each word identified, extracting it from the recording, and reversing the extracted word, and re-inserting it into the recording in place of the part that has been extracted.
2. A method according to Claim 1, in which the original recording is compressed and including the additional steps: where each word is extracted, this is done taking due account of the formatting of the compressed data, and each extracted word is decompressed before being reversed, and each reversed word is re-compressed before being re-inserted into the recording, wherein the re-insertion step is carried out so as to neither over-write unprocessed data nor create gaps in the modified recording that were not present in the original.
3. A method according to Claim 1, in which the original recording is compressed and including the following additional steps: where each word is extracted, this is done taking due account of the formatting of the compressed data, and each extracted word is decompressed before being reversed, and each reversed word is re-inserted into the recording without being re-compressed, wherein the re-insertion step is carried out so as to neither over-write unprocessed data nor create gaps in the modified recording that were not present in the original.
4. A method of encrypting or obscuring a recorded voice message or passage of speech in which the recording is compressed and including the additional steps: de-compressing the entire recording, and processing the said de-compressed recording according to the method described in Claim 1.
5. A method according to Claim 4 including the additional step of re-compressing the recording after processing.
6. A method according to Claim 1 including the additional step of compressing the recording after processing.
7. A method of encrypting or obscuring a recorded voice message or passage of speech according to any of Claims 1 to 6 including the additional step of generating a word table.
8. A method to allow an actor in a motion picture to portray an alien not speaking in an intelligible language, including the steps of: recording the actor's voice, and processing said recording according to Claim 7, and using the processed recording in the sound track of the motion picture.
9. A method according to Claim 8, including the additional step of using "word table" information derived from the method to create sub-titles for the motion picture.
10. A method to allow an actor in a theatre production to portray an alien not speaking in an intelligible language, including the steps of: recording the actor's voice, and processing said recording according to Claim 7, and broadcasting the processed recording to the audience, wherein the processing is executed quickly enough that the overall effect happens substantially in real-time.
11. A method according to Claim 10, including the additional step of using the "word table" information that can be derived from the method to create sub-titles for the theatre audience that can be displayed on suitably placed screens.
12. A method according to any of Claims 8 to 11 wherein the accuracy of the processing step is improved by reference to the script the actor is talking from.
13. A method for decrypting a recording encrypted using a method according to any previous Claim wherein a word table is generated, including the steps of: scanning through the recording to the point indicated in the first entry in the word table, and making an extract from the recording starting and finishing at the times indicated for this entry, and reversing this extract, and re-inserting the reversed extract into the recording starting at the start time of this entry, and repeating these steps for each subsequent entry in the table, wherein the re-insertion step is carried out so as to neither over-write unprocessed data nor create gaps in the modified recording that were not present in the original.
14. A device for modifying or encrypting a voice message or passage of speech including means for: recording said speech, and sampling and digitising said recording, and processing said digital recording according to any of Claims 1 to 7.
15. A device according to Claim 14 in which the processing means is fast enough to process the recording in substantially real-time.
16. A device according to Claim 15 including further means for broadcasting said modified recording to an audience in such a way that it appears that an actor is speaking unintelligibly.
17. A device according to Claim 14, 15 or 16 including further means to create sub-titles from word table information.
18. A device according to Claim 17 including further means to display said sub-titles to a theatre audience.