WO2023141133A2 - Sound isolation - Google Patents

Sound isolation

Info

Publication number
WO2023141133A2
Authority
WO
WIPO (PCT)
Prior art keywords
audio
processing system
vocal
sound
extraction
Application number
PCT/US2023/011012
Other languages
French (fr)
Other versions
WO2023141133A3 (en)
Inventor
Matthew FREED
Jackson BLUME
William Leon
Original Assignee
Malamute, Inc.
Application filed by Malamute, Inc. filed Critical Malamute, Inc.
Publication of WO2023141133A2 publication Critical patent/WO2023141133A2/en
Publication of WO2023141133A3 publication Critical patent/WO2023141133A3/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0272: Voice signal separating

Definitions

  • Embodiments described herein generally relate to processing systems, and more specifically, to sound isolation.
  • Processing systems such as smart phones, laptops, cloud computing nodes, and the like can process audio signals associated with detected sound waves.
  • a processing system can receive a sound signal and can process the sound signal to extract useful information and/or to implement an action based on the sound signal.
  • Sound processing can be performed using analog processors to operate on the sound signal directly and/or digital processing to operate mathematically on a digital representation of the sound signal.
  • a computer-implemented method for vocal extraction includes defining a training dataset.
  • the training dataset includes a ground truth and a training input.
  • the method further includes training a machine learning model to perform vocal extraction using the training dataset.
  • the method further includes performing vocal extraction, using the machine learning model, on an audio stream to extract a vocal aspect of the audio stream.
  • further embodiments of the method may include that the ground truth includes clean speech vocal audio and clean noisy sound audio.
  • further embodiments of the method may include that the training input includes clean speech combined with noisy audio.
  • further embodiments of the method may include performing an action based at least in part on a content of a vocal aspect of the audio stream.
  • further embodiments of the method may include that the audio stream is segmented into chunks, which are stored in an array.
  • a processing system includes a memory having computer readable instructions and a processing device for executing the computer readable instructions, the computer readable instructions controlling the processing device to perform operations.
  • the operations include receiving sound data from a device in communication with the processing system.
  • the operations further include performing natural language processing on the sound data, the natural language processing including performing vocal extraction on the sound data using a trained machine learning model.
  • the operations further include determining an action to implement based at least in part on content of the sound data as determined during the vocal extraction.
  • the operations further include causing the action to be implemented.
  • further embodiments of the system may include that the operations further include performing inbound routing on the sound data to route the sound data within the processing system.
  • further embodiments of the system may include that the operations further include performing outbound routing on the action to implement.
  • further embodiments of the system may include that the natural language processing further includes performing text to speech analysis on the sound data.
  • further embodiments of the system may include that the natural language processing further includes performing speech to text analysis on the sound data.
  • further embodiments of the system may include that the operations further include training the machine learning model to perform the vocal extraction using a training dataset.
  • the training dataset includes a ground truth and a training input.
  • ground truth includes clean speech vocal audio and clean noisy sound audio.
  • training input includes clean speech combined with noisy audio.
  • further embodiments of the system may include that the operations further include causing a device in communication with the processing system to generate an augmented reality interface on the device based at least in part on the content of the sound data as determined during the vocal extraction.
  • further embodiments of the system may include that the operations further include causing a device in communication with the processing system to generate a digital twin interface on the device based at least in part on the content of the sound data as determined during the vocal extraction.
  • an edge device in another exemplary embodiment, includes a memory having computer readable instructions and a processing device for executing the computer readable instructions, the computer readable instructions controlling the processing device to perform operations.
  • the operations include receiving raw audio.
  • the operations further include performing vocal extraction on the raw audio using a trained machine learning model to extract a voice aspect from the raw audio.
  • the operations further include combining the voice aspect with the raw audio at a user-defined ratio.
  • the operations further include generating an output audio signal.
  • further embodiments of the edge device may include an electronic input, wherein the electronic input is used to set the user-defined ratio.
  • the electronic input is selected from the group consisting of a potentiometer, a rotary encoder, and a touch bar.
  • edge device may include an audio input for receiving the raw audio and an audio output to output the output audio signal.
  • a computer-implemented method includes recording audio information from a user while recording tracking information associated with a movement of the user.
  • the method further includes storing the audio information as an audio track.
  • the method further includes storing the tracking information as a spatial track.
  • the method further includes playing the audio track while generating a graphical representation of the tracking information using the spatial track.
  • a computer-implemented method for sound extraction includes defining a training dataset, the training dataset comprising a ground truth and a training input.
  • the method further includes training a machine learning model to perform vocal extraction using the training dataset.
  • the method further includes performing sound extraction, using the machine learning model, on an audio stream to extract a sound aspect of the audio stream.
  • FIG. 1 depicts a block diagram of a processing system for sound isolation according to one or more embodiments described herein;
  • FIG. 2 depicts a block diagram of the processing system of FIG. 1 according to one or more embodiments described herein;
  • FIG. 3 depicts a block diagram of a method for performing vocal extraction according to one or more embodiments described herein;
  • FIG. 4 depicts an edge device according to one or more embodiments described herein;
  • FIGS. 5A and 5B depict views of an example of the edge device of FIG. 4 according to one or more embodiments described herein;
  • FIG. 6 depicts an interface of a communication system according to one or more embodiments described herein;
  • FIG. 7 depicts an interface of a communication system according to one or more embodiments described herein;
  • FIG. 8 depicts a flow diagram of a method for messaging and tracking according to one or more embodiments described herein.
  • FIG. 9 depicts a block diagram of a processing system for implementing the presently described techniques according to one or more embodiments described herein.
  • One or more embodiments described herein provide for sound isolation.
  • Some environments may be noisy due to background noises generated within or near the environment.
  • Examples of such environments include manufacturing environments, agriculture environments, hydrocarbon exploration and recovery environments, warehousing environments, retail environments, medical environments, military environments, and the like.
  • In a manufacturing environment (e.g., a factory), machinery and other equipment in the environment generate noise.
  • Such noise can make it difficult to communicate between individuals and/or can make it difficult for sound to be recorded in the environment and useful information (e.g., spoken commands) to be extracted from the sound.
  • One or more embodiments described herein provide a method for sound extraction.
  • An example of sound extraction is vocal extraction.
  • a machine learning model is trained, using a training dataset, to perform the sound extraction (e.g., vocal extraction).
  • the training dataset includes a ground truth (e.g., clean speech vocal audio and clean noisy sound audio) and a training input (e.g., clean speech combined with noisy audio).
  • the model is trained using the training dataset such that the model is trained to generate sound data that includes audio of interest (e.g., vocal aspects) but removes noise or other portions of the audio that are not of interest (e.g., non-vocal aspects) of the original audio signal of interest.
  • the model can be used to perform vocal extraction on an audio stream (e.g., an audio signal of interest) to extract a sound of interest (e.g., a vocal aspect) of the audio stream. Actions can then be taken depending on the content/context of the sound of interest (e.g., the vocal aspect) of the audio stream, which can be determined using one or more natural language processing techniques for example.
  • FIG. 1 depicts a block diagram of a processing system 100 for sound isolation according to one or more embodiments described herein.
  • the processing system 100 includes a processing device 102 (e.g., one or more of the processors 921 of FIG. 9), a memory 104 (e.g., the RAM 924 and/or the ROM 922 of FIG. 9), a communications adapter 106 (e.g., the network adapter 926 of FIG. 9), a data store 108 (e.g., the storage 934 of FIG. 9), a language processing engine 110, and a routing engine 112.
  • the various engines (e.g., the language processing engine 110, the routing engine 112) described regarding FIG. 1 can be implemented as instructions stored on a computer-readable storage medium, as hardware modules, as special-purpose hardware (e.g., application specific hardware, application specific integrated circuits (ASICs), application specific special processors (ASSPs), field programmable gate arrays (FPGAs), embedded controllers, hardwired circuitry, etc.), or as some combination or combinations of these.
  • the engine(s) described herein can be a combination of hardware and programming.
  • the programming can be processor executable instructions stored on a tangible memory, and the hardware can include the processing device 102 for executing those instructions.
  • a system memory (e.g., the memory 104) can store program instructions that when executed by the processing device 102 implement the engines described herein.
  • Other engines can also be utilized to include other features and functionality described in other examples herein.
  • the language processing engine 110 and/or the routing engine 112 can be implemented using cloud computing technologies. Cloud computing can supplement, support, or replace some or all of the functionality of the elements of the processing system 100. For example, some or all of the functionality of the language processing engine 110 and/or the routing engine 112 can be implemented using a cloud computing node 142 of a cloud computing system 140. As an example, the cloud computing system 140 can be used instead of or in cooperation with the data store 108 to store data.
  • the processing system 100 can communicate with other systems/devices, such as one or more of the devices 120, the cloud computing node 142 of the cloud computing system 140, and the like via the communications adapter 106.
  • the processing system 100 is directly connected (via wired and/or wireless links) to the other systems or devices.
  • the processing system 100 is indirectly connected (via wired and/or wireless links) to the other systems or devices, such as shown in FIG. 1, such as via a network 107.
  • the network 107 represents any one or a combination of different types of suitable communications networks such as, for example, cable networks, public networks (e.g., the Internet), private networks, wireless networks, cellular networks, or any other suitable private and/or public networks. Further, the network 107 can have any suitable communication range associated therewith and may include, for example, global networks (e.g., the Internet), metropolitan area networks (MANs), wide area networks (WANs), local area networks (LANs), or personal area networks (PANs).
  • the network 107 can include any type of medium over which network traffic may be carried including, but not limited to, coaxial cable, twisted-pair wire, optical fiber, a hybrid fiber coaxial (HFC) medium, microwave terrestrial transceivers, radio frequency communication mediums, satellite communication mediums, or any combination thereof.
  • the processing system 100 receives sound data, such as from one or more devices 120 that are disposed in an environment 122 (e.g., a manufacturing environment, an agriculture environment, a hydrocarbon exploration and recovery environment, a warehousing environment, etc.).
  • the sound data can include analog sound recordings and/or digital representations of the sound.
  • the devices 120 can each include one or more microphones to capture sound/audio in and/or around the environment 122, which are then sent to the processing system 100 as the sound data.
  • Such sound data can include noise (e.g., generated by machines in the environment 122), spoken sound (e.g., a person talking), and the like. Due to the noise in the environment 122, it may be difficult to recognize and/or understand the spoken sound.
  • the processing system 100 can also receive other data from the one or more devices 120 and/or from one or more machines (not shown) within the environment 122.
  • one or more of the devices 120 and/or one or more of the machines (not shown) within the environment 122 can provide information such as operating state of a machine, ambient temperature, process updates, position/orientation information about a machine or device, etc.
  • the processing system 100 receives the sound data from one or more of the devices 120 and processes the sound using the language processing engine 110.
  • the language processing engine 110 processes the sound data using one or more data processing algorithms to extract useful information from the sound data.
  • the processing system 100 then performs routing of the useful information based on predefined rules and/or user-defined rules.
  • actions can be implemented, such as causing a machine to perform an operation, causing a delivery to be performed, etc.
  • notifications can be generated, such as generating a notification to alert a manager/supervisor of a problem with a machine. Further features and functions of the processing system 100 are now described with reference to FIG. 2.
  • FIG. 2 depicts a block diagram of the processing system 100 of FIG. 1 according to one or more embodiments described herein.
  • External hardware (e.g., one or more of the devices 120) communicates with the processing system 100. For example, the processing system 100 can receive data/information from one or more of the devices 120.
  • Such information can include raw audio from humans, data from machines, etc. (e.g., any information that needs to be communicated somewhere).
  • the data can be stored locally and/or remotely, such as on a cloud-based server.
  • the data can be processed using natural language processing (NLP) techniques, as will be described in more detail herein.
  • the processing system 100 can then route information extracted from the data based on set or learned rules applied to the content of the information extracted from the data.
  • the processing system 100 can send information back to a database and/or to one or more of the devices 120.
  • the devices 120 refer to any suitable device that can connect to the processing system 100 via the API 203.
  • one or more of the devices 120 can be human-related devices.
  • Human-related devices are devices that can capture audio, can receive audio information, and/or can send/receive information beyond just audio (e.g., text, images, etc.).
  • Human-related devices are generally human interactive. Examples of human-related devices include headphones, tablet computers, virtual reality (VR) and/or augmented reality (AR), heads-up displays (HUD), smart phones, computers, microphones, radios, and the like.
  • one or more of the devices 120 can be non-human-related devices, which are devices that can send non-audio information.
  • Non-human-related devices are generally not human interactive. Examples of non-human-related devices include industrial machines, internet-of-things (loT) sensors, and the like.
  • Non-human related devices may also be capable of performing an action in the environment 122.
  • the processing system 100 receives raw data 201 (e.g., sound data, other data, etc.) from one or more of the devices 120 (including one or more human-related devices and/or one or more non-human related devices). Examples of the devices could include a smart phone 221, an AR HUD 222, a tablet computer 223, or any other suitable device.
  • the raw data 201 is received at an input 202.
  • the input 202 may be based on the type of raw data 201 being received.
  • the input could be a 3.5 mm audio line-in port to receive analog data from one or more of the devices 120.
  • one or more of the devices 120 can communicate with the processing system 100 via the API 203.
  • a digital representation of the raw data 201 can be provided to the processing system 100 by one or more of the devices 120 using the API 203.
  • the processing system 100 can be implemented as a cloud-based processing system (e.g., as the cloud computing node 142), as a local processing system (e.g., the processing system 100), or the like. According to one or more embodiments described herein, the processing system 100 receives the raw data 201 from one or more of the devices 120, stores the data in the data store 108, routes the data using inbound routing 204 and/or outbound routing 209, performs data processing techniques using NLP processing 205, and sends the data back out via the output 211 to one or more of the devices 120.
  • the processing system 100 can analyze the raw data 201 and/or information received via the API 203. To do this, the processing system 100 (e.g., using the routing engine 112) performs inbound routing 204 on the data.
  • the inbound routing 204 can be performed using hard-coded inbound routing rules, for example. In some examples, the inbound routing rules for inbound routing 204 cannot be changed by a user.
  • the inbound routing rules direct how data is routed throughout the processing system 100 (e.g., which data should go where and when). For example, data may be routed to the data store 108 for storage.
  • sound data are routed to the NLP processing 205 for natural language processing. How the raw data 201 are routed can be defined based on the content of the data, as determined using information about the data received via the API 203. For example, raw data received from a specific device could be routed based on being received from that specific device. The content of the raw data 201 can be used in some examples to determine how to route the data; in such cases, some language processing may be performed on the raw data 201 prior to performing inbound routing 204. A toy sketch of such rule-based inbound routing follows.
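The patent does not specify a rule format; purely as an illustration, a hard-coded inbound routing table might be sketched in Python as follows (the device/data-type keys and destination names are invented for this example):

```python
# Toy inbound router: hard-coded rules map (source type, data type) to a destination.
# The keys and destinations below are illustrative, not taken from the disclosure.
INBOUND_RULES = {
    ("human_device", "audio"): "nlp_processing",   # sound data goes to NLP processing 205
    ("machine", "status"): "data_store",           # machine telemetry goes to the data store 108
}

def route_inbound(source_type: str, data_type: str) -> str:
    # Fall back to storage when no rule matches.
    return INBOUND_RULES.get((source_type, data_type), "data_store")
```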
  • the processing system 100 performs NLP processing 205 using, for example, the language processing engine 110.
  • the NLP processing 205 represents a stack of data processing algorithms that can be used to extract useful information from incoming data.
  • NLP processing 205 can also reformat the data.
  • the NLP processing 205 can include one or more of text to speech 206 (e.g., conversion of normal language text into speech), vocal extraction 207, and/or speech to text 208 (e.g., conversion of spoken language into text).
  • the vocal extraction 207, which is an example of sound extraction, relates to extracting or isolating vocal sounds (or any other suitable sound of interest) from audio that may include the vocal sounds (or other suitable sound of interest) along with noise.
  • Vocal extraction 207 is described in more detail with reference to FIG. 3.
  • Other examples of NLP processing 205 can include context analysis (e.g., determining what is occurring based on the content of the data), hot word detection (e.g., detecting a keyword, such as “fire” or “help”), and other suitable language processing.
  • the data store 108 can be any suitable device, system, or schema for storing data, such as a database.
  • the data store 108 can store the raw data 201, data/information received via the API 203, and any other suitable data (e.g., routing rules, etc.).
  • the data store 108 can store raw data, processed data, and/or formatted data.
  • the outbound routing 209 determines how data is routed via the output 211 and which of the one or more devices 120 receives information (e.g., send info 212) from the processing system 100. For example, the outbound routing 209 provides decision making using outbound routing rules that dictate who/where/when to send information.
  • the outbound routing rules can be user-defined.
  • An example of an outbound routing rule is as follows: visual/node scripting of information routing can be performed (e.g., any data received from Machine One goes to a specific person).
  • an outbound routing rule is as follows: online machine learning is used to determine prioritization (using notification prioritization 210) of information when being sent out (e.g., higher priority information is sent to a user’s smart phone to provide a real-time (or near-real-time) alert, while lower priority information may be queued or otherwise stored for later access/analysis).
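As a loose sketch of priority-aware outbound routing, the snippet below replaces the online machine learning prioritization described above with a placeholder scoring function and a simple priority queue; none of the names or scores come from the patent:

```python
import heapq
from typing import Optional

outbound_queue: list = []  # (negative priority, message text) pairs form a max-heap

def score_priority(message: dict) -> float:
    # Placeholder for the learned prioritization (notification prioritization 210).
    return 1.0 if message.get("hot_word") else 0.1

def enqueue(message: dict) -> None:
    heapq.heappush(outbound_queue, (-score_priority(message), message["text"]))

def dispatch_next() -> Optional[str]:
    # Higher-priority information is sent first (e.g., to a user's smart phone);
    # lower-priority items stay queued for later access/analysis.
    return heapq.heappop(outbound_queue)[1] if outbound_queue else None
```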
  • the processing system 100 is in communication with (or includes) a system control center 250.
  • the system control center 250 provides for administrators or other users to change how the processing system 100 performs. For example, an admin/user can modify outbound routing rules (e.g., using the node-based ruleset 254), monitor information stored in the data store 108 (e.g., using the data browser 252), etc.
  • FIG. 3 depicts a block diagram of a method 300 for performing vocal extraction (e.g., the vocal extraction 207 of the NLP processing 205 of FIG. 2) according to one or more embodiments described herein.
  • the method 300 can be performed by any suitable system and/or device, such as the processing system 100 of FIGS. 1 and 2, the edge device 400 of FIGS. 4, 5A, and 5B, and the like.
  • vocal extraction is an example of sound extraction, which involves extracting any sound of interest from an audio signal of interest.
  • an audio signal of interest could be an audio signal captured from a particular machine. Sound extraction can be performed on that audio signal to extract a sound of interest, such as a sound associated with a known event for the machine (e.g., a machine malfunction).
  • sound extraction can be performed on an audio signal associated with a vehicle to extract a sound of interest, such as a sound associated with a seat belt being fastened. Sound extraction can also be performed to analyze other audio signals of interest to extract a sound of interest.
  • the method 300 is now described with reference to vocal extraction but is more generally applicable to sound extraction.
  • vocal extraction can be performed using a trained machine learning model, such as a neural network. More specifically, the present techniques can incorporate and utilize rule-based decision making and artificial intelligence reasoning to accomplish the various operations described herein, namely performing vocal extraction, for example.
  • machine learning broadly describes a function of electronic systems that learn from data.
  • a machine learning system, module, or engine (e.g., the language processing engine 110) can implement such machine learning functionality.
  • machine learning functionality can be implemented using an artificial neural network (ANN) having the capability to be trained to perform a currently unknown function.
  • ANNs are a family of statistical learning models inspired by the biological neural networks of animals, and in particular the brain. ANNs can be used to estimate or approximate systems and functions that depend on a large number of inputs.
  • Recurrent neural networks (RNNs) are a class of ANN that are particularly useful at analyzing audio. In some cases, RNNs implement long short-term memory (LSTM) networks.
  • ANNs can be embodied as so-called “neuromorphic” systems of interconnected processor elements that act as simulated “neurons” and exchange “messages” between each other in the form of electronic signals. Similar to the so-called “plasticity” of synaptic neurotransmitter connections that carry messages between biological neurons, the connections in ANNs that carry electronic messages between simulated neurons are provided with numeric weights that correspond to the strength or weakness of a given connection. The weights can be adjusted and tuned based on experience, making ANNs adaptive to inputs and capable of learning. For example, an ANN for handwriting recognition is defined by a set of input neurons that can be activated by the pixels of an input image.
  • a training dataset is defined for training the machine learning model for vocal extraction.
  • the training dataset includes a ground truth and a training input.
  • the ground truth, for example, can include two components: clean speech vocal audio (that is, audio that contains only speech) and clean “noisy” sound audio (that is, audio that contains background noise without any speech).
  • the training input includes clean speech combined with noisy audio.
  • the dataset can be sampled at 16 kHz and interpolated into int16 format, as sketched below. This provides for less data to stream over networks and less data to compute versus other sampling rates/formats.
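A minimal sketch of that resampling/format step, assuming librosa and NumPy are available (the patent does not name any libraries):

```python
import librosa
import numpy as np

def preprocess_clip(path: str, target_sr: int = 16_000) -> np.ndarray:
    """Load an audio file, resample to 16 kHz, and convert to int16."""
    audio, sr = librosa.load(path, sr=None, mono=True)               # load at native rate
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    # Scale float samples in [-1, 1] to the int16 range.
    return np.clip(audio * 32767.0, -32768, 32767).astype(np.int16)
```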
  • the dataset acts as training data to train the machine learning model for vocal extraction at block 304.
  • the dataset can be generated by randomly selecting a portion of voice audio and a portion of noisy audio.
  • the dataset is used to train the machine learning model.
  • the model is trained to output clean speech alone. That is, the model is trained to remove the background noise from an input of clean speech combined with noisy audio.
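A sketch of how such training pairs could be assembled by randomly cropping and mixing voice and noise clips; the segment length, dtype handling, and function signature are assumptions for illustration:

```python
import numpy as np

def make_training_pair(speech: np.ndarray, noise: np.ndarray,
                       seg_len: int, rng: np.random.Generator):
    """Randomly crop a speech segment and a noise segment, then mix them.

    Returns (training_input, clean_speech_target, clean_noise_target).
    """
    s_start = rng.integers(0, max(1, len(speech) - seg_len))
    n_start = rng.integers(0, max(1, len(noise) - seg_len))
    s = speech[s_start:s_start + seg_len].astype(np.float32)
    n = noise[n_start:n_start + seg_len].astype(np.float32)
    mixture = s + n   # clean speech combined with noisy audio (the training input)
    return mixture, s, n
```

At 16 kHz, the 6-second training segments mentioned below would correspond to seg_len = 96000 samples.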
  • the machine learning model can be trained on longer time-length segments of audio than the model is expected to process during inference. For example, inference may be performed in real-time (or near-real-time) on very short audio segments (e.g., less than 1 second); in such cases, the time length of segments used for training can be longer (e.g., 6 seconds). This provides improved model performance compared to training on the real-world time lengths expected to be processed.
  • a learning rate can be reduced (e.g., halved) after a certain number of epochs, steps, or upon performance stagnation. In one or more examples, the learning rate can be reduced (e.g., halved) multiple times.
  • the machine learning model can be trained on a loss function of signal-to-noise ratio. This is useful for real-time (or near-real-time) applications where normalization of the output of the model will not work.
  • normalization may be used when using loss functions such as Source Invariant Signal to Noise Ratio.
  • normalization cannot be used when processing individual chunks of continuous data for real-time (or near-real-time) applications.
  • the epsilon in the signal-to-noise ratio can be changed from 1e-8 to 1e-7 for the single-source signal-to-noise loss function. This prevents division by zero.
  • Half-precision training can be used to speed up training (e.g., reduce training time).
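The following PyTorch-style fragment illustrates these training details together (SNR loss with epsilon 1e-7, halving the learning rate on stagnation, and half-precision training); PyTorch, the optimizer choice, and the stand-in model are assumptions, not part of the disclosure:

```python
import torch
from torch import nn

def snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Negative signal-to-noise ratio in dB; eps = 1e-7 guards against division by zero.
    noise = target - estimate
    snr = 10 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr.mean()

model = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # stand-in for the real separator model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when validation performance stagnates; this can happen multiple times.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)
scaler = torch.cuda.amp.GradScaler()  # enables half-precision (mixed) training on GPU

def train_step(mixture: torch.Tensor, clean_speech: torch.Tensor) -> float:
    # mixture and clean_speech are shaped (batch, 1, samples) for the Conv1d stand-in.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # half-precision forward pass
        estimate = model(mixture)
        loss = snr_loss(estimate, clean_speech)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()                     # pass a validation loss to scheduler.step() per epoch
```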
  • vocal extraction can be performed, using the machine learning model, on a stream of sound data.
  • the stream of sound data may include the raw data 201 received at the input 202 of the processing system 100 as shown in FIG. 2.
  • the stream of sound data (e.g., the raw data 201) are pre-processed before the model is applied to the sound data to perform vocal extraction.
  • the stream of sound data may be an array that contains an exact or greater amount of audio information than is being requested.
  • a larger array than the audio streaming chunk amount is used. This array accumulates a certain amount of audio information over time. For example, an array size of 2048 bits can be used. Newly received sound data (in the form of audio chunks) are appended to the end of the array and a corresponding amount of old sound data (in the form of audio chunks) are removed from the beginning of the array.
  • the machine learning model may experience increased performance/accuracy on larger amounts of sound data (e.g., the full array) since it has more data to see.
  • if the array is exactly the amount of sound data requested (e.g., not larger), then the sound data are passed directly to the model for vocal extraction without preprocessing.
  • the sound data can be split and batched. For example, the array (e.g., the 2048 bit array) can be split into segments, the segments are passed through the machine learning model, and the outputs are then re-appended together to fill the original array (e.g., 2048 bits). A minimal sketch of this rolling-buffer preprocessing and splitting is shown below.
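Assuming NumPy and the 2048/512 example sizes above, the buffer and split step might look like this (the class design is illustrative):

```python
import numpy as np

class RollingAudioBuffer:
    """Accumulates streaming audio chunks into a fixed-size array for inference."""

    def __init__(self, size: int = 2048):
        self.buffer = np.zeros(size, dtype=np.int16)

    def push(self, chunk: np.ndarray) -> np.ndarray:
        # Append the new chunk at the end and drop the same amount from the beginning.
        self.buffer = np.concatenate([self.buffer[len(chunk):], chunk.astype(np.int16)])
        return self.buffer

def split_for_model(array: np.ndarray, segment: int = 512) -> np.ndarray:
    # Split the accumulated array into equal segments that can be batched through
    # the model; the outputs can later be re-joined with np.concatenate.
    return array.reshape(-1, segment)
```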
  • the machine learning model described herein is optimized for real-time (or near-real-time) performance. This provides for receiving real-time (or near-real-time) sound data, such as of the environment 122 from one or more of the devices 120, and then performing real-time (or near-real-time) vocal extraction.
  • the machine learning model architecture is a modified instance of the Conv-TasNet architecture, which uses a linear encoder to generate representations of speech optimized for separating individual speakers, although other suitable architectures can also be used.
  • the machine learning model uses a leaky ReLU activation function. This provides for the model to be portable to hardware-accelerating and model-simplifying software. Conventional hardware acceleration is not capable of converting such a model with a parametric activation function.
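This is not the Conv-TasNet-derived architecture of the disclosure, but as a rough sketch, a convolutional block using a fixed leaky ReLU instead of a parametric activation (e.g., PReLU) might look like this:

```python
import torch
from torch import nn

class SeparatorBlock(nn.Module):
    """Simplified 1-D convolutional block with a fixed (non-parametric) leaky ReLU.

    Using a fixed activation keeps the model convertible by common
    hardware-acceleration and model-simplification toolchains.
    """

    def __init__(self, channels: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.01)    # instead of nn.PReLU()
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.act(self.conv(x)))
```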
  • the model outputs sound data that includes vocal aspects but removes noise or other non-vocal aspects of the original sound data.
  • the output sound data can be post-processed. For example, if the input sound data are real-time (or near-real-time) streaming sound data, the outbound array is deleted except for a last audio-chunk-size length of data, which can be user defined or pre-set.
  • a 2048 bit outbound array can be used and the audio chunk size length can be defined to be 512 bits.
  • the last 512 bits of sound data in the outbound array are maintained, but the rest of the sound data stored in the outbound array are deleted/removed.
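A one-function sketch of that post-processing step, using the example 512-sample chunk size (the exact sizes are user defined or pre-set):

```python
import numpy as np

def postprocess_stream(outbound: np.ndarray, chunk_size: int = 512) -> np.ndarray:
    # Keep only the most recent chunk of model output; the remainder of the
    # outbound array (e.g., 2048 elements) is discarded.
    return outbound[-chunk_size:].copy()
```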
  • the output data can be used as is. In cases where the sound data were split and batched, the output data can be concatenated.
  • One example use case for using the machine learning model is to monitor (in real-time or near-real-time) sound/audio occurring in the environment 122 of FIG. 1 using one or more of the devices 120 to capture the audio/sound.
  • the processing system 100 can then perform NLP processing 205, including vocal extraction 207 using the trained machine learning model.
  • the extracted voice component of the sound/audio can be analyzed (e.g., using speech to text 208, hot word detection, context analysis, etc.) to understand the vocal component of the sound/audio.
  • One or more actions can then be implemented based on information contained in the vocal component of the sound/audio.
  • the processing system 100 can then process the spoken language using the NLP processing 205 to extract (e.g., using the machine learning model trained at block 304) the vocal component of the captured audio.
  • the phrase “this machine needs more material” can be analyzed, and an action can be taken, such as dispatching more material to the machine automatically using an autonomous robot or other suitable device.
  • FIG. 4 depicts an edge device 400 according to one or more embodiments described herein.
  • edge device 400 includes a memory storing computer readable instructions and a processing device for executing the computer readable instructions.
  • the computer readable instructions control the processing device to perform operations/functions as are now described.
  • the edge device 400 receives audio/sound at an audio input 401 (e.g., a 3.5 mm line in port). Vocal extraction 403 is then performed on the raw audio 402, for example, using block 306 of the method 300.
  • the vocal extraction 403 can include a machine learning model trained according to the method 300 or another suitable training technique, and the raw audio 402 can be input into the model.
  • the output of the model (e.g., the vocal components of the raw audio 402 without other noises/sounds) is then combined with the raw audio 402 at a ratio.
  • the ratio can be defined by a user, for example, using an electronic input, such as a potentiometer, graphical user interface, a rotary encoder, a touch bar, or other input.
  • the ratio is the ratio between the vocal component of the raw audio 402 (e.g., the output from the vocal extraction 403) relative to the original audio/sound received at the audio input 401.
  • the ratio can be increased to increase the voice aspect of the original audio or decreased to decrease the voice aspect of the original audio.
  • the combined audio is then output as output audio 405 via an audio output 406 (e.g., a 3.5 mm line out port).
  • An example use case of the edge device 400 is to increase/decrease dialog in a movie, television show, etc.
  • the edge device 400 can receive audio from a television, extract the voice component, and change the ratio between voice to non-voice aspects of the original audio.
  • the output audio 405 either represents an increase or decrease of the ratio between voice to non-voice aspects of the original audio based on a user preference/selection.
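As an illustration of the ratio-based mixing in the edge device 400, the linear blend below assumes a 0-to-1 ratio convention and int16 audio; the patent specifies only that the voice aspect and raw audio are combined at a user-defined ratio:

```python
import numpy as np

def mix_voice_and_raw(voice: np.ndarray, raw: np.ndarray, ratio: float) -> np.ndarray:
    """Blend the extracted voice component with the original audio.

    ratio = 1.0 emphasizes the extracted voice; ratio = 0.0 leaves the raw audio as-is.
    The ratio could be set by a potentiometer, rotary encoder, touch bar, or GUI.
    """
    ratio = float(np.clip(ratio, 0.0, 1.0))
    mixed = ratio * voice.astype(np.float32) + (1.0 - ratio) * raw.astype(np.float32)
    return np.clip(mixed, -32768, 32767).astype(np.int16)
```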
  • FIGS. 5A and 5B depict views of an example of the edge device 400 of FIG. 4 according to one or more embodiments described herein. Particularly, FIG. 5A depicts an isometric view of the edge device 400, and FIG. 5B depicts a top view of the edge device 400. As shown, the edge device 400 includes the audio input 401, the audio output 406, and a potentiometer 501 that can be used to change the ratio between voice and non-voice aspects of the original audio.
  • FIG. 6 depicts an interface 600 of a communication system according to one or more embodiments described herein.
  • the communication system can be the processing system 100 or another suitable device.
  • the communication system provides for storing, modifying, and communicating information by representing it in three-dimensional space through different hardware mediums. This provides a spatial understanding of incoming/outgoing communications (e.g., the inputs/outputs of the processing system 100).
  • the interface represents an augmented reality interface.
  • the output 211 of the processing system 100 can send augmented reality information (e.g., send info 212) to the AR HUD 222 or another suitable device for receiving AR information.
  • the interface 600 represents a real-time (or near-real-time) video stream (e.g., captured by a camera associated with the AR HUD 222).
  • Two AR elements 601, 602 are overlaid on the video stream.
  • the AR element 601 is a label associated with a station (e.g., “STATION 6”) within an environment (e.g., the environment 122).
  • the AR element 602 provides a text-based message that provides a status for the station (e.g., “STATION 6 IS OUT OF MATERIAL, WE HAVE NOT BEEN ABLE TO GET AHOLD OF MAX TO RESUPPLY”).
  • This text-based message is generated using the NLP processing 205 of the processing system by analyzing sound/audio captured by one or more of the devices 120 within the environment 122.
  • the vocal extraction 207 can be used to extract vocal components of the input audio/sound, and the speech to text 208 can be used to generate the text-based message as shown.
  • the message can include an option to play audio associated with the text-based message (e.g., the raw data 201 received by the processing system 100 and/or audio that is processed by the NLP processing 205).
  • the message can include an option to respond to the message, such as by text, audio, video, etc.
  • FIG. 7 depicts an interface 700 of a communication system according to one or more embodiments described herein.
  • the processing system 100 causes one or more of the devices 120 (e.g., the smart phone 221, the tablet computer 223, etc.) to generate the interface 700 based on the analyzed contents of the received audio/sound (e.g., the raw data 201).
  • the interface 700 is a digital twin representation 710 of the environment 122 as shown and is created in an augmented reality environment.
  • the interface 700 includes elements 701, 702, 703, 704, 705, which represent stations (e.g., machines, work areas, etc.) within the environment 122.
  • the interface also includes element 706, which provides instructions regarding the element 704.
  • each of the elements 701-705 can have information (e.g., status, efficiency, warning, etc.) associated therewith. Such information can be generated based on the content of audio received in the environment 122 as described herein.
  • the vocal extraction 207 can be used to extract vocal components of the input audio/sound from the environment 122
  • the speech to text 208 can be used to generate the text-based messages as shown by elements 701, 702, 703.
  • one or more of the elements 701, 702, 703 can be generated using, in part or in whole, data received from one or more machines in the environment 122.
  • a machine can send data about itself (e.g., status information) to the processing system 100 via the API 203.
  • Such information can be used, either alone or with the audio/sound information received, to generate one or more of the elements 701, 702, 703.
  • FIG. 8 depicts a flow diagram of a method 800 for messaging and tracking according to one or more embodiments described herein.
  • the method 800 provides for synchronous/asynchronous messaging, such as voicemail or streaming of augmented reality elements.
  • the method 800 provides for tracking the positions and/or movements of objects and users (or portions thereof, e.g., a user’s hands) within the environment 122, and storing tracking data associated with those positions and/or movements together with audio/sound captured by one or more of the devices 120.
  • the method 800 provides for recording audio and tracking motion for playback.
  • an individual can record a voice message while performing a task, and data associated with tracking motion of the user during the recording of the voice message can be stored.
  • This provides for the recording to be played by others such that the voice message is played audibly while the tracking information can be displayed visually.
  • This enables the listener to view processes/messages/information as if the individual who had left the message was with the listener. In examples, this can be performed remotely where one or more individuals are shown a digital twin and/or AR representation of the environment 122, for example.
  • a machine expert 810 uses a machine 811.
  • the user has an associated device, which can be any suitable device such as the smart phone 221, the AR HUD 222, or the tablet computer 223.
  • the associated device can have body- and hand-pose tracking capabilities according to one or more embodiments.
  • the associated device includes an AR/VR device.
  • a rough alignment is performed to align audio and movement recording.
  • the machine expert 810 aligns an AR/VR displayed digital twin (e.g., the digital twin representation 710 of FIG. 7) with the machine 811.
  • the machine expert 810 performs one or more actions while recording an audio message.
  • the user begins recording the user’s movements and voice while describing the interactions with the machine 811.
  • the user uses the user’s hand to flip a switch on the machine 811 while saying “First, I flipped the power switch from OFF to ON.”
  • a microphone associated with the user’s device records the voice message while one or more sensors (e.g., an accelerometer, an inertial measurement unit, a LiDAR sensor, and/or a camera) record the user’s actions/movements as tracking data.
  • the voice message is stored as an audio track, and the tracking data are stored as a spatial track.
  • this can include collecting and recording the data as audio track data, device tracking data, pose estimation from track body/hands/controllers, and the like.
  • the stored data are pushed or otherwise made available to another device, such as a server.
  • this can include making available the audio track data, device tracking data, pose estimation from track body/hands/controllers, and the like.
  • the received data are replayed. This can depend on a desired or available format for replaying the data.
  • the audio track and/or spatial track are played back using any suitable interface/system depending on what system is available and/or user preferences.
  • the audio track can be replayed in a web browser.
  • the spatial track can be replayed on a web browser with three-dimensional graphics and viewpoint control, such as to provide field-of-view and/or top views, among others.
  • the spatial track can be replayed within an AR/VR capable device with viewpoint tracking capabilities.
  • a user can record/stream what they wish to document on the current device (either in real life or digital twin).
  • the user can provide and record visual highlights of locations of interest, and the user’s movements and spoken audio are recorded.
  • the recordings of the audio and movement are stored together and can be played back locally and/or remotely, such as on different hardware than the user used to record the audio/tracking.
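A small sketch of how the audio track and spatial track might be bundled for storage and later playback; the field layout, timestamps, and pose representation are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Recording:
    """Bundles a voice message (audio track) with movement data (spatial track)."""
    audio_track: List[int] = field(default_factory=list)                # e.g., int16 samples
    spatial_track: List[Tuple[float, Tuple[float, ...]]] = field(default_factory=list)

    def add_audio(self, samples: List[int]) -> None:
        self.audio_track.extend(samples)

    def add_pose(self, timestamp: float, pose: Tuple[float, ...]) -> None:
        # pose could hold head/hand positions from an IMU, LiDAR sensor, or camera tracker.
        self.spatial_track.append((timestamp, pose))
```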
  • FIG. 9 depicts a block diagram of a processing system 900 for implementing the techniques described herein.
  • the processing system 900 is an example of the processing system 100 and/or is an example of the cloud computing node 142.
  • processing system 900 has one or more central processing units (“processors” or “processing resources” or “processing devices”) 921a, 921b, 921c, etc. (collectively or generically referred to as processor(s) 921 and/or as processing device(s)).
  • processors 921 can include a reduced instruction set computer (RISC) microprocessor.
  • Processors 921 are coupled to system memory (e.g., random access memory (RAM) 924) and various other components via a system bus 933.
  • Read only memory (ROM) 922 is coupled to the system bus 933 and may include a basic input/output system (BIOS), which controls certain basic functions of the processing system 900.
  • I/O adapter 927 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 923 and/or a storage device 925 or any other similar component.
  • I/O adapter 927, hard disk 923, and storage device 925 are collectively referred to herein as mass storage 934.
  • Operating system 940 for execution on processing system 900 may be stored in mass storage 934.
  • the network adapter 926 interconnects system bus 933 with an outside network 936 enabling processing system 900 to communicate with other such systems.
  • a display 935 is connected to system bus 933 by display adapter 932, which may include a graphics adapter to improve the performance of graphics intensive applications and a video controller.
  • adapters 926, 927, and/or 932 may be connected to one or more I/O busses that are connected to system bus 933 via an intermediate bus bridge (not shown).
  • Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI).
  • PCI Peripheral Component Interconnect
  • Additional input/output devices are shown as connected to system bus 933 via user interface adapter 928 and display adapter 932.
  • a keyboard 929, mouse 930, and speaker 931 may be interconnected to system bus 933 via user interface adapter 928, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit.
  • processing system 900 includes a graphics processing unit 937.
  • Graphics processing unit 937 is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display.
  • Graphics processing unit 937 is very efficient at manipulating computer graphics and image processing, and has a highly parallel structure that makes it more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.
  • processing system 900 includes processing capability in the form of processors 921, storage capability including system memory (e.g., RAM 924), and mass storage 934, input means such as keyboard 929 and mouse 930, and output capability including speaker 931 and display 935.
  • a portion of system memory (e.g., RAM 924) and mass storage 934 collectively store the operating system 940 to coordinate the functions of the various components shown in processing system 900.
  • the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs.
  • the terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc.
  • the term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc.
  • the term “connection” may include both an indirect “connection” and a direct “connection.”
  • One or more embodiments described herein may be a system, a method, and/or a computer program product at any possible technical detail level of integration.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out one or more embodiments described herein.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiberoptic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of one or more embodiments described herein may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user’s computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field- programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instruction by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of one or more embodiments described herein.
  • Embodiments described herein are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to one or more embodiments described herein. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the Figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Filters That Use Time-Delay Elements (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Machine Translation (AREA)

Abstract

Examples described herein provide a computer-implemented method that includes defining a training dataset. The training dataset includes a ground truth and a training input. The method further includes training a machine learning model to perform vocal extraction using the training dataset. The method further includes performing vocal extraction, using the machine learning model, on an audio stream to extract a vocal aspect of the audio stream.

Description

SOUND ISOLATION
BACKGROUND
[0001] Embodiments described herein generally relate to processing systems, and more specifically, to sound isolation.
[0002] Processing systems, such as smart phones, laptops, cloud computing nodes, and the like can process audio signals associated with detected sound waves. For example, a processing system can receive a sound signal and can process the sound signal to extract useful information and/or to implement an action based on the sound signal. Sound processing can be performed using analog processors to operate on the sound signal directly and/or digital processing to operate mathematically on a digital representation of the sound signal.
SUMMARY
[0003] In one exemplary embodiment, a computer-implemented method for vocal extraction is provided. The method includes defining a training dataset. The training dataset includes a ground truth and a training input. The method further includes training a machine learning model to perform vocal extraction using the training dataset. The method further includes performing vocal extraction, using the machine learning model, on an audio stream to extract a vocal aspect of the audio stream.
[0004] In addition to one or more of the features described herein, or as an alternative, further embodiments of the method may include that the ground truth includes clean speech vocal audio and clean noisy sound audio.
[0005] In addition to one or more of the features described herein, or as an alternative, further embodiments of the method may include that the training input includes clean speech combined with noisy audio.
[0006] In addition to one or more of the features described herein, or as an alternative, further embodiments of the method may include performing an action based at least in part on a content of a vocal aspect of the audio stream.
[0007] In addition to one or more of the features described herein, or as an alternative, further embodiments of the method may include that the audio stream is segmented into chunks, which are stored in an array.
[0008] In another exemplary embodiment a processing system is provided. The processing system includes a memory having computer readable instructions and a processing device for executing the computer readable instructions, the computer readable instructions controlling the processing device to perform operations. The operations include receiving sound data from a device in communication with the processing system. The operations further include performing natural language processing on the sound data, the natural language processing including performing vocal extraction on the sound data using a trained machine learning model. The operations further include determining an action to implement based at least in part on content of the sound data as determined during the vocal extraction. The operations further include causing the action to be implemented.
[0009] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the operations further include performing inbound routing on the sound data to route the sound data within the processing system.
[0010] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the operations further include performing outbound routing on the action to implement.
[0011] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the natural language processing further includes performing text to speech analysis on the sound data.
[0012] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the natural language processing further includes performing speech to text analysis on the sound data.
[0013] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the operations further include training the machine learning model to perform the vocal extraction using a training dataset.
[0014] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the training dataset includes a ground truth and a training input.
[0015] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the ground truth includes clean speech vocal audio and clean noisy sound audio.
[0016] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the training input includes clean speech combined with noisy audio.
[0017] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the operations further include causing a device in communication with the processing system to generate an augmented reality interface on the device based at least in part on the content of the sound data as determined during the vocal extraction.
[0018] In addition to one or more of the features described herein, or as an alternative, further embodiments of the system may include that the operations further include causing a device in communication with the processing system to generate a digital twin interface on the device based at least in part on the content of the sound data as determined during the vocal extraction.
[0019] In another exemplary embodiment an edge device is provided. The edge device includes a memory having computer readable instructions and a processing device for executing the computer readable instructions, the computer readable instructions controlling the processing device to perform operations. The operations include receiving raw audio. The operations further include performing vocal extraction on the raw audio using a trained machine learning model to extract a voice aspect from the raw audio. The operations further include combining the voice aspect with the raw audio at a user-defined ratio. The operations further include generating an output audio signal.
[0020] In addition to one or more of the features described herein, or as an alternative, further embodiments of the edge device may include an electronic input, wherein the electronic input is used to set the user-defined ratio.
[0021] In addition to one or more of the features described herein, or as an alternative, further embodiments of the edge device may include that the electronic input is selected from the group consisting of a potentiometer, a rotary encoder, and a touch bar.
[0022] In addition to one or more of the features described herein, or as an alternative, further embodiments of the edge device may include an audio input for receiving the raw audio and an audio output to output the output audio signal.
[0023] In another exemplary embodiment, a computer-implemented method is provided. The method includes recording audio information from a user while recording tracking information associated with a movement of the user. The method further includes storing the audio information as an audio track. The method further includes storing the tracking information as a spatial track. The method further includes playing the audio track while generating a graphical representation of the tracking information using the spatial track.
[0024] In another exemplary embodiment, a computer-implemented method for sound extraction is provided. The method includes defining a training dataset, the training dataset comprising a ground truth and a training input. The method further includes training a machine learning model to perform vocal extraction using the training dataset. The method further includes performing sound extraction, using the machine learning model, on an audio stream to extract a sound aspect of the audio stream.
[0025] The above features and advantages, and other features and advantages, of the disclosure are readily apparent from the following detailed description when taken in connection with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
[0027] FIG. 1 depicts a block diagram of a processing system for sound isolation according to one or more embodiments described herein;
[0028] FIG. 2 depicts a block diagram of the processing system of FIG. 1 according to one or more embodiments described herein;
[0029] FIG. 3 depicts a block diagram of a method for performing vocal extraction according to one or more embodiments described herein;
[0030] FIG. 4 depicts an edge device according to one or more embodiments described herein;
[0031] FIGS. 5A and 5B depict views of an example of the edge device of FIG. 4 according to one or more embodiments described herein;
[0032] FIG. 6 depicts an interface of a communication system according to one or more embodiments described herein;
[0033] FIG. 7 depicts an interface of a communication system according to one or more embodiments described herein;
[0034] FIG. 8 depicts a flow diagram of a method for messaging and tracking according to one or more embodiments described herein; and
[0035] FIG. 9 depicts a block diagram of a processing system for implementing the presently described techniques according to one or more embodiments described herein.
[0036] The diagrams depicted herein are illustrative. There can be many variations to the diagram or the operations described therein without departing from the scope of one or more embodiments of the invention. For instance, the actions can be performed in a differing order or actions can be added, deleted or modified. Also, the term “coupled” and variations thereof describes having a communications path between two elements and does not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.
DETAILED DESCRIPTION
[0037] One or more embodiments described herein provide for sound isolation.
[0038] Some environments may be noisy due to background noises generated within or near the environment. Examples of such environments include manufacturing environments, agriculture environments, hydrocarbon exploration and recovery environments, warehousing environments, retail environments, medical environments, military environments, and the like. For example, considering the case of a manufacturing environment (e.g., a factory), machinery and other equipment in the environment generate noise. Such noise can make it difficult to communicate between individuals and/or can make it difficult for sound to be recorded in the environment and useful information (e.g., spoken commands) to be extracted from the sound.
[0039] The above-described aspects address the shortcomings of the prior art by extracting vocal information from audio recordings and/or streaming audio using a trained machine learning model. One or more embodiments described herein provide technological improvements over current methods of language processing by providing real-time (or near- real-time) voice extraction. An analysis can be performed on the extracted voice information to determine whether to implement an action (e.g., to cause a machine to shut down, to cause a delivery to be made, etc.).
[0040] According to one or more embodiments described herein, a method is provided for sound extraction. An example of sound extraction is vocal extraction. First, a machine learning model is trained, using a training dataset, to perform the sound extraction (e.g., vocal extraction). The training dataset includes a ground truth (e.g., clean speech vocal audio and clean noisy sound audio) and a training input (e.g., clean speech combined with noisy audio). The model is trained using the training dataset such that the model learns to generate sound data that retains the audio of interest (e.g., vocal aspects) while removing noise and other portions of the original audio signal that are not of interest (e.g., non-vocal aspects). After training, the model can be used to perform vocal extraction on an audio stream (e.g., an audio signal of interest) to extract a sound of interest (e.g., a vocal aspect) of the audio stream. Actions can then be taken depending on the content/context of the sound of interest (e.g., the vocal aspect) of the audio stream, which can be determined using one or more natural language processing techniques, for example.
[0041] FIG. 1 depicts a block diagram of a processing system 100 for sound isolation according to one or more embodiments described herein. In this example, the processing system 100 includes a processing device 102 (e.g., one or more of the processors 921 of FIG. 9), a memory 104 (e.g., the RAM 924 and/or the ROM 922 of FIG. 9), a communications adapter 106 (e.g., the network adapter 926 of FIG. 9), a data store 108 (e.g., the storage 934 of FIG. 9), a language processing engine 110, and a routing engine 112.
[0042] The various engines (e.g., the language processing engine 110, the routing engine 112) described regarding FIG. 1 can be implemented as instructions stored on a computer-readable storage medium, as hardware modules, as special-purpose hardware (e.g., application specific hardware, application specific integrated circuits (ASICs), application specific special processors (ASSPs), field programmable gate arrays (FPGAs)), as embedded controllers, hardwired circuitry, etc., or as some combination or combinations of these. According to aspects of the present disclosure, the engine(s) described herein can be a combination of hardware and programming. The programming can be processor executable instructions stored on a tangible memory, and the hardware can include the processing device 102 for executing those instructions. Thus, a system memory (e.g., the memory 104) can store program instructions that when executed by the processing device 102 implement the engines described herein. Other engines can also be utilized to include other features and functionality described in other examples herein.
[0043] In one or more embodiments, the language processing engine 110 and/or the routing engine 112 can be implemented using cloud computing technologies. Cloud computing can supplement, support, or replace some or all of the functionality of the elements of the processing system 100. For example, some or all of the functionality of the language processing engine 110 and/or the routing engine 112 can be implemented using a cloud computing node 142 of a cloud computing system 140. As an example, the cloud computing system 140 can be used instead of or in cooperation with the data store 108 to store data.
[0044] The processing system 100 can communicate with other systems/devices, such as one or more of the devices 120, the cloud computing node 142 of the cloud computing system 140, and the like via the communications adapter 106. In some examples, the processing system 100 is directly connected (via wired and/or wireless links) to the other systems or devices. In other examples, the processing system 100 is indirectly connected (via wired and/or wireless links) to the other systems or devices, such as shown in FIG. 1, such as via a network 107.
[0045] The network 107 represents any one or a combination of different types of suitable communications networks such as, for example, cable networks, public networks (e.g., the Internet), private networks, wireless networks, cellular networks, or any other suitable private and/or public networks. Further, the network 107 can have any suitable communication range associated therewith and may include, for example, global networks (e.g., the Internet), metropolitan area networks (MANs), wide area networks (WANs), local area networks (LANs), or personal area networks (PANs). In addition, the network 107 can include any type of medium over which network traffic may be carried including, but not limited to, coaxial cable, twisted-pair wire, optical fiber, a hybrid fiber coaxial (HFC) medium, microwave terrestrial transceivers, radio frequency communication mediums, satellite communication mediums, or any combination thereof.
[0046] The processing system 100 receives sound data, such as from one or more devices 120 that are disposed in an environment 122 (e.g., a manufacturing environment, an agriculture environment, a hydrocarbon exploration and recovery environment, a warehousing environment, etc.). The sound data can include analog sound recordings and/or digital representations of the sound. The devices 120 can each include one or more microphones to capture sound/audio in and/or around the environment 122, which are then sent to the processing system 100 as the sound data. Such sound data can include noise (e.g., generated by machines in the environment 122), spoken sound (e.g., a person talking), and the like. Due to the noise in the environment 122, it may be difficult to recognize and/or understand the spoken sound.
[0047] The processing system 100 can also receive other data from the one or more devices 120 and/or from one or more machines (not shown) within the environment 122. For example, one or more of the devices 120 and/or one or more of the machines (not shown) within the environment 122 can provide information such as operating state of a machine, ambient temperature, process updates, position/orientation information about a machine or device, etc.
[0048] The processing system 100 receives the sound data from one or more of the devices 120 and processes the sound using the language processing engine 110. Particularly, the language processing engine 110 processes the sound data using one or more data processing algorithms to extract useful information from the sound data. The processing system 100 then performs routing of the useful information based on predefined rules and/or user-defined rules. Using the information extracted by the language processing engine 110, actions can be implemented, such as causing a machine to perform an operation, causing a delivery to be performed, etc. Further, using the information extracted by the language processing engine 110, notifications can be generated, such as generating a notification to alert a manager/supervisor of a problem with a machine. Further features and functions of the processing system 100 are now described with reference to FIG. 2.
[0049] Particularly, FIG. 2 depicts a block diagram of the processing system 100 of FIG. 1 according to one or more embodiments described herein. External hardware (e.g., one or more of the devices 120) can connect to the processing system 100, such as through an application programming interface (API) 203. The processing system 100 can receive data/information from one or more of the devices 120. Such information can include raw audio from humans, data from machines, etc. (e.g., any information that needs to be communicated somewhere). The data can be stored locally and/or remotely, such as on a cloud-based server. The data can be processed using natural language processing (NLP) techniques, as will be described in more detail herein. The processing system 100 can then route information extracted from the data based on rules that are set and/or learned and that are applied to the content of the information extracted from the data. The processing system 100 can send information back to a database and/or to one or more of the devices 120.
[0050] The devices 120 refer to any suitable device that can connect to the processing system 100 via the API 203. In some examples, one or more of the devices 120 can be human-related devices. Human-related devices are devices that can capture audio, can receive audio information, and/or can send/receive information beyond just audio (e.g., text, images, etc.). Human-related devices are generally human interactive. Examples of human-related devices include headphones, tablet computers, virtual reality (VR) and/or augmented reality (AR) devices, heads-up displays (HUDs), smart phones, computers, microphones, radios, and the like. In some examples, one or more of the devices 120 can be non-human-related devices, which are devices that can send non-audio information. Non-human-related devices are generally not human interactive. Examples of non-human-related devices include industrial machines, internet-of-things (IoT) sensors, and the like. Non-human-related devices may also be capable of performing an action in the environment 122.
[0051] The processing system 100 receives raw data 201 (e.g., sound data, other data, etc.) from one or more of the devices 120 (including one or more human-related devices and/or one or more non-human related devices). Examples of the devices could include a smart phone 221, an AR HUD 222, a tablet computer 223, or any other suitable device. The raw data 201 is received at an input 202. The input 202 may be based on the type of raw data 201 being received. For example, the input could be a 3.5 mm audio line-in port to receive analog data from one or more of the devices 120. According to one or more embodiments described herein, one or more of the devices 120 can communicate with the processing system 100 via the API 203. Thus, a digital representation of the raw data 201 can be provided to the processing system 100 by one or more of the devices 120 using the API 203.
[0052] The processing system 100 can be implemented as a cloud-based processing system (e.g., as the cloud computing node 142), as a local processing system (e.g., the processing system 100), or the like. According to one or more embodiments described herein, the processing system 100 receives the raw data 201 from one or more of the devices 120, stores the data in the data store 108, routes the data using inbound routing 204 and/or outbound routing 209, performs data processing techniques using NLP processing 205, and sends the data back out via an output 211 to one or more of the devices 120.
[0053] Once the processing system 100 receives the sound data (e.g., the raw data 201), the processing system 100 can analyze the raw data 201 and/or information received via the API 203. To do this, the processing system 100 (e.g., using the routing engine 112) performs inbound routing 204 on the data. The inbound routing 204 can be performed using hard-coded inbound routing rules, for example. In some examples, the inbound routing rules for inbound routing 204 cannot be changed by a user. The inbound routing rules direct how data is routed throughout the processing system 100 (e.g., which data should go where and when). For example, data may be routed to the data store 108 for storage. As another example, sound data are routed to the NLP processing 205 for natural language processing. How to route the raw data 201 can be defined based on the content of the data as determined using information about the data received from the API 203. For example, raw data received from a specific device could be routed based on being received from that specific device. The content of the raw data 201 can be used in some examples to understand how to route the data. In such cases, some language processing may be performed on the raw data 201 prior to performing inbound routing 204.
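For illustration only, the following Python sketch shows one way hard-coded inbound routing rules of this kind could be expressed. The payload fields, rule structure, and helper objects are assumptions introduced for the example and are not drawn from the embodiments described above.

```python
from queue import Queue

def route_inbound(payload: dict, data_store: list, nlp_queue: Queue) -> None:
    """Apply fixed (non-user-editable) inbound routing rules to one payload."""
    # Rule 1: persist every payload for later access/analysis.
    data_store.append(payload)
    # Rule 2: sound data is handed to the NLP processing stack
    # (e.g., vocal extraction, speech to text).
    if payload.get("kind") == "audio":
        nlp_queue.put(payload)
    # Rule 3: data from a specific device can be routed based on its origin alone.
    if payload.get("device_id") == "machine-one":
        data_store.append({"tag": "machine-one-telemetry", "payload": payload})

# Example: an audio chunk reported by a device in the environment.
store, nlp = [], Queue()
route_inbound({"kind": "audio", "device_id": "headset-7", "samples": b""}, store, nlp)
```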
[0054] The processing system 100 performs NLP processing 205 using, for example, the language processing engine 110. The NLP processing 205 represents a stack of data processing algorithms that can be used to extract useful information from incoming data. The
NLP processing 205 can also reformat the data. In the example of FIG. 2, the NLP processing 205 can include one or more of text to speech 206 (e.g., conversion of normal language text into speech), vocal extraction 207, and/or speech to text 208 (e.g., conversion of spoken language into text). The vocal extraction 207, which is an example of sound extraction, relates to extracting or isolating vocal sounds (or any other suitable sound of interest) from audio that may include the vocal sounds (or other suitable sound of interest) along with noise. Vocal extraction 207 is described in more detail with reference to FIG. 3. Other examples of NLP processing 205 can include context analysis (e.g., determining what is occurring based on the content of the data), hot word detection (e.g., detecting a keyword, such as “fire” or “help”), and other suitable language processing.
[0055] The data store 108 can be any suitable device, system, or schema for storing data, such as a database. The data store 108 can store the raw data 201, data/information received via the API 203, and any other suitable data (e.g., routing rules, etc.). For example, the data store 108 can store raw data, processed data, and/or formatted data.
[0056] The outbound routing 209 (also referred to as “federated routing”) determines how data is routed via the output 211 and which of the one or more devices 120 receives information (e.g., send info 212) from the processing system 100. For example, the outbound routing 209 provides decision making using outbound routing rules that dictate who/where/when to send information. The outbound routing rules can be user-defined. An example of an outbound routing rule is as follows: visual/node scripting of information routing can be performed (e.g., any data received from Machine One goes to a specific person). Another example of an outbound routing rule is as follows: online machine learning is used to determine prioritization (using notification prioritization 210) of information when being sent out (e.g., higher priority information is sent to a user's smart phone to provide a real-time (or near-real-time) alert, while lower priority information may be queued or otherwise stored for later access/analysis).
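As a non-limiting sketch, user-defined outbound routing rules and notification prioritization could be modeled as shown below. The rule format, priority scoring, and destination names are illustrative assumptions rather than the routing engine's actual implementation.

```python
import heapq

# Each rule pairs a condition on the extracted content with a destination.
OUTBOUND_RULES = [
    (lambda msg: msg.get("source") == "Machine One", "supervisor_smart_phone"),
    (lambda msg: "fire" in msg.get("text", "").lower(), "all_devices"),
]

def route_outbound(msg: dict, outbox: list) -> None:
    """Enqueue the message for every matching rule; lower numbers pop first."""
    priority = 0 if "fire" in msg.get("text", "").lower() else 1  # stand-in scoring
    for condition, destination in OUTBOUND_RULES:
        if condition(msg):
            heapq.heappush(outbox, (priority, destination, msg.get("text", "")))

outbox: list = []
route_outbound({"source": "Machine One", "text": "Station 6 is out of material"}, outbox)
print(heapq.heappop(outbox))  # (1, 'supervisor_smart_phone', 'Station 6 is out of material')
```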
[0057] In one or more examples, such as shown in FIG. 2, the processing system 100 is in communication with (or includes) a system control center 250. The system control center 250 provides for administrators or other users to change how the processing system 100 performs. For example, an admin/user can modify outbound routing rules (e.g., using the node-based ruleset 254), monitor information stored in the data store 108 (e.g., using the data browser 252), etc.
[0058] FIG. 3 depicts a block diagram of a method 300 for performing vocal extraction (e.g., the vocal extraction 207 of the NLP processing 205 of FIG. 2) according to one or more embodiments described herein. The method 300 can be performed by any suitable system and/or device, such as the processing system 100 of FIGS. 1 and 2, the edge device 400 of FIGS. 4, 5A, and 5B, and the like.
[0059] It should be appreciated that vocal extraction is an example of sound extraction, which involves extracting any sound of interest from an audio signal of interest. For example, an audio signal of interest could be an audio signal captured from a particular machine. Sound extraction can be performed on that audio signal to extract a sound of interest, such as a sound associated with a known event for the machine (e.g., a machine malfunction). Similarly, sound extraction can be performed on an audio signal associated with a vehicle to extract a sound of interest, such as a sound associated with a seat belt being fastened. Sound extraction can also be performed to analyze other audio signals of interest to extract a sound of interest. The method 300 is now described with reference to vocal extraction but is more generally applicable to sound extraction.
[0060] According to one or more embodiments described herein, vocal extraction can be performed using a trained machine learning model, such as a neural network. More specifically, the present techniques can incorporate and utilize rule-based decision making and artificial intelligence reasoning to accomplish the various operations described herein, namely performing vocal extraction, for example. The phrase “machine learning” broadly describes a function of electronic systems that learn from data. A machine learning system, module, or engine (e.g., the language processing engine 110) can include a trainable machine learning algorithm that can be trained, such as in an external cloud environment, to learn functional relationships between inputs and outputs that are currently unknown, and the resulting model can be used for performing vocal extraction.
[0061] In one or more embodiments, machine learning functionality can be implemented using an artificial neural network (ANN) having the capability to be trained to perform a currently unknown function. In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by the biological neural networks of animals, and in particular the brain. ANNs can be used to estimate or approximate systems and functions that depend on a large number of inputs. Recurrent neural networks (RNN) are a class of ANN that are particularly useful at analyzing audio. In some cases, RNNs implement long short-term memory networks (LSTMs).
[0062] ANNs can be embodied as so-called “neuromorphic” systems of interconnected processor elements that act as simulated “neurons” and exchange “messages” between each other in the form of electronic signals. Similar to the so-called “plasticity” of synaptic neurotransmitter connections that carry messages between biological neurons, the connections in ANNs that carry electronic messages between simulated neurons are provided with numeric weights that correspond to the strength or weakness of a given connection. The weights can be adjusted and tuned based on experience, making ANNs adaptive to inputs and capable of learning. For example, an ANN for handwriting recognition is defined by a set of input neurons that can be activated by the pixels of an input image. After being weighted and transformed by a function determined by the network's designer, the activations of these input neurons are then passed to other downstream neurons, which are often referred to as “hidden” neurons. This process is repeated until an output neuron is activated. The activated output neuron determines which character was read. It should be appreciated that these same techniques can be applied in the case of performing vocal extraction as described herein.
[0063] With reference now to FIG. 3, at block 302, a training dataset is defined for training the machine learning model for vocal extraction. The training dataset includes a ground truth and a training input. The ground truth, for example, can include two components: clean speech vocal audio (that is, audio that contains only speech) and clean “noisy” sound audio (that is, audio that contains background noise without any speech). The training input includes clean speech combined with noisy audio. According to one or more embodiments described herein, the dataset can be sampled to 16 kHz and interpolated into int16 format. This provides for less data to stream over networks and less data to compute versus other sampling rates/formats. The dataset acts as training data to train the machine learning model for vocal extraction at block 304. According to one or more embodiments described herein, the dataset can be generated by randomly selecting a portion of voice audio and a portion of noisy audio and combining them.
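The dataset construction described above might be sketched as follows. The window length, clipping behavior, and function shape are assumptions for illustration; the 16 kHz/int16 format and the random voice/noise pairing follow the description.

```python
import numpy as np

SAMPLE_RATE = 16_000  # dataset resampled to 16 kHz and stored as int16

def make_training_example(clean_speech: np.ndarray, noise: np.ndarray,
                          seconds: float = 6.0):
    """Return (training_input, ground_truth_speech, ground_truth_noise).

    A randomly selected window of clean speech is mixed with a randomly
    selected window of noisy audio; the two unmixed windows serve as the
    ground truth components.
    """
    n = int(seconds * SAMPLE_RATE)
    s0 = np.random.randint(0, len(clean_speech) - n)
    n0 = np.random.randint(0, len(noise) - n)
    speech = clean_speech[s0:s0 + n].astype(np.int32)
    noisy = noise[n0:n0 + n].astype(np.int32)
    mixture = np.clip(speech + noisy, -32768, 32767).astype(np.int16)
    return mixture, speech.astype(np.int16), noisy.astype(np.int16)

# Example with placeholder audio buffers (10 minutes of silence each).
voice = np.zeros(SAMPLE_RATE * 600, dtype=np.int16)
background = np.zeros(SAMPLE_RATE * 600, dtype=np.int16)
x, y_speech, y_noise = make_training_example(voice, background)
```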
[0064] Particularly, at block 304, the dataset is used to train the machine learning model. The model is trained to output clean speech alone. That is, the model is trained to remove the background noise from an input of clean speech combined with noisy audio. According to one or more embodiments described herein, the machine learning model can be trained on longer time length segments of audio than the model is expected to process during inference. For example, inference may be performed in real-time (or near-real-time) on very short audio segments (e.g., less than 1 second); in such cases, the time length of segments used for training can be longer (e.g., 6 seconds). This provides improved model performance compared to training on the real-world time lengths that would be processed during inference. In one or more examples, a learning rate can be reduced (e.g., halved) after a certain number of epochs, steps, or upon performance stagnation. In one or more examples, the learning rate can be reduced (e.g., halved) multiple times.
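A minimal training-loop sketch of the learning-rate halving described above, assuming a PyTorch setup with torch.optim.lr_scheduler.ReduceLROnPlateau; the stand-in model, optimizer choice, patience, and placeholder loss are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the vocal extraction model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate after the monitored metric stagnates for 5 epochs;
# the scheduler can trigger multiple times over a long run.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(20):
    # ... forward pass on 6-second training segments, loss computation,
    # backward pass, and optimizer.step() would go here ...
    val_loss = 1.0  # placeholder validation metric (constant, so the lr halves)
    scheduler.step(val_loss)
```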
[0065] According to one or more embodiments described herein, the machine learning model can be trained on a loss function of signal to noise ratio. This is useful for real-time (or near-real-time) applications where normalization of the output of the model will not work. For example, normalization may be used when using loss functions such as Source Invariant Signal to Noise Ratio. However, normalization cannot be used when processing individual chunks of continuous data in real-time (or near-real-time). For half-precision training capabilities, the epsilon in the signal to noise ratio can be changed from 1e-8 to 1e-7 for the single-source signal to noise loss function. This prevents division by zero. Half-precision training can be used to speed up training (e.g., reduce training time).
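One way such an un-normalized signal-to-noise-ratio loss with an epsilon of 1e-7 could be written is sketched below. The exact algebraic form is an assumption; only the epsilon value and the absence of per-chunk normalization follow the description.

```python
import torch

def snr_loss(estimate: torch.Tensor, target: torch.Tensor,
             eps: float = 1e-7) -> torch.Tensor:
    """Negative signal-to-noise ratio averaged over the batch.

    No normalization is applied to the model output, so the same loss can be
    evaluated on individual chunks of a continuous stream. The epsilon keeps
    the ratio finite (no division by zero), including under half-precision
    training.
    """
    noise = target - estimate
    snr = 10.0 * torch.log10(
        (target.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps))
    return -snr.mean()

loss = snr_loss(torch.randn(4, 2048), torch.randn(4, 2048))
```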
[0066] Once the machine learning model is trained to perform vocal extraction, inference can be performed using the model. Specifically, at block 306, vocal extraction can be performed, using the machine learning model, on a stream of sound data. The stream of sound data may include the raw data 201 received at the input 202 of the processing system 100 as shown in FIG. 2.
[0067] According to one or more embodiments described herein, the stream of sound data (e.g., the raw data 201) is pre-processed before the model is applied to the sound data to perform vocal extraction. For example, the stream of sound data may be an array that contains an exact or greater amount of audio information than is being requested. For real-time (or near-real-time) purposes, a larger array than the audio streaming chunk amount is used. This array accumulates a certain amount of audio information over time. For example, an array size of 2048 bits can be used. Newly received sound data (in the form of audio chunks) are appended to the end of the array and a corresponding amount of old sound data (in the form of audio chunks) is removed from the beginning of the array. For example, if 512 bits of new sound data are received, these 512 bits are added to the end of the 2048 bit array, and the first 512 bits in the array (e.g., old sound data) are removed. Thus, the array maintains 2048 bits of the most recent sound data. The full array (e.g., the full 2048 bits) is then passed into the machine learning model, which is used to perform vocal extraction. It should be appreciated that the machine learning model may experience increased performance/accuracy on larger amounts of sound data (e.g., the full array) since it has more data to see.
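The rolling pre-processing buffer described in this paragraph might look like the following sketch. The description gives the sizes in bits, while the sketch treats 2048 and 512 as element counts for simplicity, which is an assumption, as is the placeholder model call.

```python
import numpy as np

CONTEXT_SIZE = 2048  # accumulating array size (example value stated above)
CHUNK_SIZE = 512     # size of each newly streamed audio chunk (example value)

def push_chunk(context: np.ndarray, chunk: np.ndarray) -> np.ndarray:
    """Append the new chunk at the end and drop the same amount of old audio
    from the beginning, so the array always holds the most recent context."""
    return np.concatenate([context[len(chunk):], chunk])

context = np.zeros(CONTEXT_SIZE, dtype=np.int16)
new_chunk = np.zeros(CHUNK_SIZE, dtype=np.int16)   # newly received sound data
context = push_chunk(context, new_chunk)
# The full context array would then be passed into the trained model, e.g.:
# voice = model(torch.from_numpy(context).float()[None, None, :])
```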
[0068] According to one or more embodiments described herein, if the array is the exact amount of sound data requested (e.g., not larger), then the sound data are passed directly to the model for vocal extraction without preprocessing.
[0069] According to one or more embodiments described herein, the sound data can be split and batched. For example, the array (e.g., the 2048 bit array) can be split into multiple smaller arrays (e.g., four 512-bit arrays). In such cases, the sound data are passed through the machine learning model and then re-appended together to fill the original array (e.g., 2048 bits).
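A sketch of this split-and-batch variant, under the assumption that the array divides evenly and with an identity function standing in for the trained model:

```python
import numpy as np

def split_and_rejoin(context: np.ndarray, parts: int = 4, model=lambda x: x) -> np.ndarray:
    """Split the array into equal pieces, pass each through the model, and
    re-append the outputs to fill the original length."""
    pieces = np.split(context, parts)             # e.g., 2048 -> four arrays of 512
    outputs = [model(piece) for piece in pieces]  # could also be run as one batch
    return np.concatenate(outputs)

rejoined = split_and_rejoin(np.zeros(2048, dtype=np.int16))
assert rejoined.shape == (2048,)
```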
[0070] The machine learning model described herein is optimized for real-time (or near-real-time) performance. This provides for receiving real-time (or near-real-time) sound data, such as of the environment 122 from one or more of the devices 120, and then performing real-time (or near-real-time) vocal extraction.
[0071] According to one or more embodiments described herein, the machine learning model architecture is a modified instance of the Conv-TasNet architecture, which uses a linear encoder to generate representations of speech optimized for separating individual speakers, although other suitable architectures can also be used. According to one or more embodiments described herein, in the case of the Conv-TasNet architecture, the machine learning model described herein is modified using the following parameters: ConvTasNet(number_sources = 1, encoder_kernel_size = 16, encoder_number_features = 512, mask_kernel_size = 3, mask_number_features = 64, mask_number_hidden_features = 256, mask_number_layers = 7, mask_number_stacks = 2).
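The listed parameters resemble the constructor arguments of torchaudio's ConvTasNet implementation; the mapping below from the parameter names above onto torchaudio's argument names is an assumption made for illustration only.

```python
import torch
from torchaudio.models import ConvTasNet

model = ConvTasNet(
    num_sources=1,             # number_sources = 1
    enc_kernel_size=16,        # encoder_kernel_size = 16
    enc_num_feats=512,         # encoder_number_features = 512
    msk_kernel_size=3,         # mask_kernel_size = 3
    msk_num_feats=64,          # mask_number_features = 64
    msk_num_hidden_feats=256,  # mask_number_hidden_features = 256
    msk_num_layers=7,          # mask_number_layers = 7
    msk_num_stacks=2,          # mask_number_stacks = 2
)

# Inference on one single-channel chunk: input shape (batch, 1, time),
# output shape (batch, num_sources, time).
chunk = torch.zeros(1, 1, 2048)
voice = model(chunk)
```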
[0072] According to one or more embodiments described herein, the machine learning model uses a leaky ReLU activation function. This provides for the model to be portable to hardware-accelerating and model-simplifying software. Conventional hardware acceleration is not capable of converting such a model with a parametric activation function.
[0073] According to one or more embodiments described herein, the model outputs sound data that includes vocal aspects but removes noise or other non-vocal aspects of the original sound data. In some cases, the output sound data can be post-processed. For example, if the input sound data are real-time (or near-real-time) streaming sound data, an outbound array is deleted except a last audio chunk size length, which can be user defined or pre-set. For example, a 2048 bit outbound array can be used and the audio chunk size length can be defined to be 512 bits. For streaming, the last 512 bits of sound data in the outbound array are maintained but the rest of the sound data stored in the outbound array are deleted/removed. For non-real-time streaming, the output data can be used as is. In cases where the sound data were split and batched, the output data can be concatenated.
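For the streaming post-processing just described, keeping only the newest chunk of the outbound array can be as simple as the following; treating the sizes as element counts is again an assumption.

```python
import numpy as np

def postprocess_stream(outbound: np.ndarray, chunk_size: int = 512) -> np.ndarray:
    """Keep only the last chunk_size samples of the model's outbound array and
    discard the rest (real-time / near-real-time streaming case)."""
    return outbound[-chunk_size:]

latest = postprocess_stream(np.zeros(2048, dtype=np.int16))
assert latest.shape == (512,)
```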
[0074] One example use case for using the machine learning model (i.e., for performing inference using the model) is to monitor (in real-time or near-real-time) sound/audio occurring in the environment 122 of FIG. 1 using one or more of the devices 120 to capture the audio/sound. The processing system 100 can then perform NLP processing 205, including vocal extraction 207 using the trained machine learning model. The extracted voice component of the sound/audio can be analyzed (e.g., using speech to text 208, hot word detection, context analysis, etc.) to understand the vocal component of the sound/audio. One or more actions can then be implemented based on information contained in the vocal component of the sound/audio. For example, if a machine operator within the environment 122 states “this machine needs more material,” then one or more of the devices 120 can capture this audio, which may include other noises such as background noise generated by the machine. The processing system 100 can then process the spoken language using the NLP processing 205 to extract (e.g., using the machine learning model trained at block 304) the vocal component of the captured audio. The phrase “this machine needs more material” can be analyzed, and an action can be taken, such as dispatching more material to the machine automatically using an autonomous robot or other suitable device.
[0075] Additional processes also may be included, and it should be understood that the process depicted in FIG. 3 represents an illustration, and that other processes may be added or existing processes may be removed, modified, or rearranged without departing from the scope of the present disclosure.
[0076] Another use case for the machine learning model trained at block 304 is to implement vocal extraction inference in an edge-based device as shown in FIGS. 4, 5A, and 5B. For example, FIG. 4 depicts an edge device 400 according to one or more embodiments described herein. According to one or more embodiments described herein, edge device 400 includes a memory storing computer readable instructions and a processing device for executing the computer readable instructions. The computer readable instructions control the processing device to perform operations/functions as are now described.
[0077] The edge device 400 receives audio/sound at an audio input 401 (e.g., a 3.5 mm line in port). Vocal extraction 403 is then performed on the raw audio 402, for example, using block 306 of the method 300. In examples, the vocal extraction 403 can include a machine learning model trained according to the method 300 or another suitable training technique, and the raw audio 402 can be input into the model. The output of the model (e.g., the vocal components of the raw audio 402 without other noises/sounds) is combined with the original audio input (received at audio input 401) at a suitable ratio. The ratio can be defined by a user, for example, using an electronic input, such as a potentiometer, graphical user interface, a rotary encoder, a touch bar, or other input. The ratio is the ratio between the vocal component of the raw audio 402 (e.g., the output from the vocal extraction 403) relative to the original audio/sound received at the audio input 401. The ratio can be increased to increase the voice aspect of the original audio or decreased to decrease the voice aspect of the original audio. The combined audio is then output as output audio 405 via an audio output 406 (e.g., a 3.5 mm line out port).
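A sketch of combining the extracted voice aspect with the raw audio at the user-defined ratio follows; the linear crossfade and int16 handling are illustrative assumptions about how the combination is performed.

```python
import numpy as np

def mix_output(raw_audio: np.ndarray, extracted_voice: np.ndarray,
               ratio: float) -> np.ndarray:
    """Blend the voice aspect with the original audio.

    `ratio` is the user-defined balance (e.g., set via the potentiometer):
    higher values increase the voice aspect, lower values decrease it.
    """
    ratio = float(np.clip(ratio, 0.0, 1.0))
    mixed = (ratio * extracted_voice.astype(np.float32)
             + (1.0 - ratio) * raw_audio.astype(np.float32))
    return np.clip(mixed, -32768, 32767).astype(np.int16)

# Example: emphasize the voice aspect at 80% of the output.
out = mix_output(np.zeros(512, dtype=np.int16), np.zeros(512, dtype=np.int16), 0.8)
```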
[0078] An example use case of the edge device 400 is to increase/decrease dialog in a movie, television show, etc. For example, the edge device 400 can receive audio from a television, extract the voice component, and change the ratio between voice and non-voice aspects of the original audio. Thus, the output audio 405 represents either an increase or a decrease of the ratio between voice and non-voice aspects of the original audio based on a user preference/selection.
[0079] FIGS. 5A and 5B depict views of an example of the edge device 400 of FIG. 4 according to one or more embodiments described herein. Particularly, FIG. 5A depicts an isometric view of the edge device 400, and FIG. 5B depicts a top view of the edge device 400. As shown, the edge device 400 includes the audio input 401, the audio output 406, and a potentiometer 501 that can be used to change the ratio between voice and non-voice aspects of the original audio.
[0080] FIG. 6 depicts an interface 600 of a communication system according to one or more embodiments described herein. The communication system can be the processing system 100 or another suitable device. The communication system provides for storing, modifying, and communicating information by representing it in three-dimensional space through different hardware mediums. This provides a spatial understanding of incoming/outgoing communications (e.g., the inputs/outputs of the processing system 100).
[0081] In the example of FIG. 6, the interface represents an augmented reality interface. For example, the output 211 of the processing system 100 can send augmented reality information (e.g., send info 212) to the AR HUD 222 or another suitable device for receiving AR information. In this case, the interface 600 represents a real-time (or near-real-time) video stream (e.g., captured by a camera associated with the AR HUD 222). Two AR elements 601, 602 are overlaid on the video stream. The AR element 601 is a label associated with a station (e.g., “STATION 6”) within an environment (e.g., the environment 122). The AR element 602 provides a text-based message that provides a status for the station (e.g., “STATION 6 IS OUT OF MATERIAL, WE HAVE NOT BEEN ABLE TO GET AHOLD OF MAX TO RESUPPLY”). This text-based message is generated using the NLP processing 205 of the processing system by analyzing sound/audio captured by one or more of the devices 120 within the environment 122. The vocal extraction 207 can be used to extract vocal components of the input audio/sound, and the speech to text 208 can be used to generate the text-based message as shown. According to one or more embodiments described herein, the message can include an option to play audio associated with the text-based message (e.g., the raw data 201 received by the processing system 100 and/or audio that is processed by the NLP processing 205). According to one or more embodiments described herein, the message can include an option to respond to the message, such as by text, audio, video, etc.
[0082] FIG. 7 depicts an interface 700 of a communication system according to one or more embodiments described herein. In this example, the processing system 100 causes one or more of the devices 120 (e.g., the smart phone 221, the tablet computer 223, etc.) to generate the interface 700 based on the analyzed contents of the received audio/sound (e.g., the raw data 201). In this example, the interface 700 is a digital twin representation 710 of the environment 122 as shown and is created in an augmented reality environment. The interface 700 includes elements 701, 702, 703, 704, 705, which represent stations (e.g., machines, work areas, etc.) within the environment 122. The interface also includes element 706, which provides instructions regarding the element 704. As shown, each of the elements 701-705 can have information (e.g., status, efficiency, warning, etc.) associated therewith. Such information can be generated based on the content of audio received in the environment 122 as described herein. For example, the vocal extraction 207 can be used to extract vocal components of the input audio/sound from the environment 122, and the speech to text 208 can be used to generate the text-based messages as shown by elements 701, 702, 703. As another example, one or more of the elements 701, 702, 703 can be generated using, in part or in whole, data received from one or more machines in the environment 122. For example, a machine can send data about itself (e.g., status information) to the processing system 100 via the API 203. Such information can be used, either alone or with the audio/sound information received, to generate one or more of the elements 701, 702, 703.
[0083] FIG. 8 depicts a flow diagram of a method 800 for messaging and tracking according to one or more embodiments described herein. For example, the method 800 provides for synchronous/asynchronous messaging, such as voicemail or streaming of augmented reality elements. The method 800 provides for tracking the positions and/or movements of objects, users (or portions thereof (e.g., a user’s hands)) within the environment 122 and storing tracking data associated with the positions and/or movements within the environment 122 with audio/sound captured by one or more of the devices 120.
[0084] For example, the method 800 provides for recording audio and tracking motion for playback. As an example, an individual can record a voice message while performing a task, and data associated with tracking motion of the user during the recording of the voice message can be stored. This provides for the recording to be played by others such that the voice message is played audibly while the tracking information can be displayed visually. This enables the listener to view processes/messages/information as if the individual who had left the message was with the listener. In examples, this can be performed remotely where one or more individuals are shown a digital twin and/or AR representation of the environment 122, for example.
[0085] For example, at step 801, a machine expert 810 uses a machine 811. The user has an associated device, which can be any suitable device such as the smart phone 221, the AR HUD 222, or the tablet computer 223. The associated device can have body- and hand-pose tracking capabilities according to one or more embodiments. In some examples, the associated device includes an AR/VR device.
[0086] At step 802, a rough alignment is performed to align audio and movement recording. For example, the machine expert 810 aligns an AR/VR displayed digital twin (e.g., the digital twin representation 710 of FIG. 7) with the machine 811.
[0087] At step 803, the machine expert 810 performs one or more actions while recording an audio message. According to one or more embodiments described herein, the user begins recording the user's movements and voice while describing the interactions with the machine 811. For example, the user uses the user's hand to flip a switch on the machine 811 while saying “First, I flipped the power switch from OFF to ON.” A microphone associated with the user's device records the voice message while one or more sensors (e.g., an accelerometer, an inertial measurement unit, a LiDAR sensor, and/or a camera) record the user's actions/movements as tracking data.
[0088] At step 804, the voice message (e.g., an audio track) and tracking data (e.g., a spatial track) are collected, such as at the processing system 100 or another suitable location. In examples, this can include collecting and recording the data as audio track data, device tracking data, pose estimation from track body/hands/controllers, and the like.
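One possible way to store the paired audio track and spatial track so they can later be replayed together is sketched below; the field names, pose representation, and PCM format are assumptions rather than a format defined above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SpatialSample:
    timestamp_s: float                               # time offset into the recording
    position: Tuple[float, float, float]             # device/hand position
    orientation: Tuple[float, float, float, float]   # pose as a quaternion

@dataclass
class Recording:
    audio_track: bytes                               # e.g., 16 kHz int16 PCM
    spatial_track: List[SpatialSample] = field(default_factory=list)

rec = Recording(
    audio_track=b"",
    spatial_track=[SpatialSample(0.0, (1.2, 0.9, 0.4), (0.0, 0.0, 0.0, 1.0))],
)
```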
[0089] At step 805, the stored data are pushed or otherwise made available to another device, such as a server. In examples, this can include making available the audio track data, device tracking data, pose estimation from track body/hands/controllers, and the like.
[0090] At block 806, the received data are replayed. This can depend on a desired or available format for replaying the data. The audio track and/or spatial track are played back using any suitable interface/system depending on what system is available and/or user preferences. As a one-dimensional example, the audio track can be replayed in a web browser. As a two-dimensional example, the spatial track can be replayed on a web browser with three-dimensional graphics and viewpoint control, such as to provide field-of-view and/or top views, among others. As a three-dimensional example, the spatial track can be replayed within an AR/VR capable device with viewpoint tracking capabilities.
[0091] In this way, a user can record/stream what they wish to document on the current device (either in real life or digital twin). The user can provide and record visual highlights of locations of interest, and the user’s movements and spoken audio are recorded. The recordings of the audio and movement are stored together and can be played back locally and/or remotely, such as on different hardware than the user used to record the audio/tracking.
[0092] It is understood that one or more embodiments described herein are capable of being implemented in conjunction with any other type of computing environment now known or later developed. For example, FIG. 9 depicts a block diagram of a processing system 900 for implementing the techniques described herein. In accordance with one or more embodiments described herein, the processing system 900 is an example of the processing system 100 and/or is an example of the cloud computing node 142. In examples, processing system 900 has one or more central processing units (“processors” or “processing resources” or “processing devices”) 921a, 921b, 921c, etc. (collectively or generically referred to as processor(s) 921 and/or as processing device(s)). In aspects of the present disclosure, each processor 921 can include a reduced instruction set computer (RISC) microprocessor.
Processors 921 are coupled to system memory (e.g., random access memory (RAM) 924) and various other components via a system bus 933. Read only memory (ROM) 922 is coupled to system bus 933 and may include a basic input/output system (BIOS), which controls certain basic functions of processing system 900.
[0093] Further depicted are an input/output (I/O) adapter 927 and a network adapter 926 coupled to system bus 933. I/O adapter 927 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 923 and/or a storage device 925 or any other similar component. I/O adapter 927, hard disk 923, and storage device 925 are collectively referred to herein as mass storage 934. Operating system 940 for execution on processing system 900 may be stored in mass storage 934. The network adapter 926 interconnects system bus 933 with an outside network 936 enabling processing system 900 to communicate with other such systems.
[0094] A display (e.g., a display monitor) 935 is connected to system bus 933 by display adapter 932, which may include a graphics adapter to improve the performance of graphics intensive applications and a video controller. In one aspect of the present disclosure, adapters 926, 927, and/or 932 may be connected to one or more I/O busses that are connected to system bus 933 via an intermediate bus bridge (not shown). Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Additional input/output devices are shown as connected to system bus 933 via user interface adapter 928 and display adapter 932. A keyboard 929, mouse 930, and speaker 931 may be interconnected to system bus 933 via user interface adapter 928, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit.
[0095] In some aspects of the present disclosure, processing system 900 includes a graphics processing unit 937. Graphics processing unit 937 is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. In general, graphics processing unit 937 is very efficient at manipulating computer graphics and image processing, and has a highly parallel structure that makes it more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.
[0096] Thus, as configured herein, processing system 900 includes processing capability in the form of processors 921, storage capability including system memory (e.g., RAM 924), and mass storage 934, input means such as keyboard 929 and mouse 930, and output capability including speaker 931 and display 935. In some aspects of the present disclosure, a portion of system memory (e.g., RAM 924) and mass storage 934 collectively store the operating system 940 to coordinate the functions of the various components shown in processing system 900.
[0097] Various embodiments are described herein with reference to the related drawings. Alternative embodiments can be devised without departing from the scope described herein. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and embodiments described herein are not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.
[0098] The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains” or “containing,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.
[0099] Additionally, the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc. The term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc. The term “connection” may include both an indirect “connection” and a direct “connection.”
[0100] The terms “about,” “substantially,” “approximately,” and variations thereof are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ±8%, ±5%, or ±2% of a given value.
[0101] For the sake of brevity, conventional techniques related to making and using embodiments described herein may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.
[0102] One or more embodiments described herein may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out one or more embodiments described herein.

[0103] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiberoptic cable), or electrical signals transmitted through a wire.
[0104] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0105] Computer readable program instructions for carrying out operations of one or more embodiments described herein may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user’s computer, partly on the user’s computer as a stand-alone software package, partly on the user’s computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of one or more embodiments described herein.
[0106] Embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to one or more embodiments described herein. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0107] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0108] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0109] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments described herein. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0110] The descriptions of the various embodiments described herein have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments described herein.

Claims

CLAIMS

What is claimed is:
1. A computer-implemented method for vocal extraction, the method comprising:
   defining a training dataset, the training dataset comprising a ground truth and a training input;
   training a machine learning model to perform vocal extraction using the training dataset; and
   performing vocal extraction, using the machine learning model, on an audio stream to extract a vocal aspect of the audio stream.
2. The computer-implemented method of claim 1, wherein the ground truth comprises clean speech vocal audio and clean noisy sound audio.
3. The computer-implemented method of claim 1, wherein the training input comprises clean speech combined with noisy audio.
4. The computer-implemented method of claim 1, further comprising performing an action based at least in part on a content of a vocal aspect of the audio stream.
5. The computer-implemented method of claim 1, wherein the audio stream is segmented into chunks, which are stored in an array.
6. A processing system comprising:
   a memory comprising computer readable instructions; and
   a processing device for executing the computer readable instructions, the computer readable instructions controlling the processing device to perform operations comprising:
      receiving sound data from a device in communication with the processing system;
      performing natural language processing on the sound data, the natural language processing comprising performing vocal extraction on the sound data using a trained machine learning model;
      determining an action to implement based at least in part on content of the sound data as determined during the vocal extraction; and
      causing the action to be implemented.
7. The processing system of claim 6, wherein the operations further comprise performing inbound routing on the sound data to route the sound data within the processing system.
8. The processing system of claim 6, wherein the operations further comprise performing outbound routing on the action to implement.
9. The processing system of claim 6, wherein the natural language processing further comprises performing text to speech analysis on the sound data.
10. The processing system of claim 6, wherein the natural language processing further comprises performing speech to text analysis on the sound data.
11. The processing system of claim 6, wherein the operations further comprise training the machine learning model to perform the vocal extraction using a training dataset.
12. The processing system of claim 11, wherein the training dataset comprises a ground truth and a training input.
13. The processing system of claim 12, wherein the ground truth comprises clean speech vocal audio and clean noisy sound audio.
14. The processing system of claim 12, wherein the training input comprises clean speech combined with noisy audio.
15. The processing system of claim 6, wherein the operations further comprise causing a device in communication with the processing system to generate an augmented reality interface on the device based at least in part on the content of the sound data as determined during the vocal extraction.
16. The processing system of claim 6, wherein the operations further comprise causing a device in communication with the processing system to generate a digital twin interface on the device based at least in part on the content of the sound data as determined during the vocal extraction.
17. An edge device comprising:
   a memory comprising computer readable instructions; and
   a processing device for executing the computer readable instructions, the computer readable instructions controlling the processing device to perform operations comprising:
      receiving raw audio;
      performing vocal extraction on the raw audio using a trained machine learning model to extract a voice aspect from the raw audio;
      combining the voice aspect with the raw audio at a user-defined ratio; and
      generating an output audio signal.
18. The edge device of claim 17, further comprising an electronic input, wherein the electronic input is used to set the user-defined ratio.
19. The edge device of claim 18, wherein the electronic input is selected from the group consisting of a potentiometer, a rotary encoder, and a touch bar.
20. The edge device of claim 17, further comprising an audio input for receiving the raw audio and an audio output to output the output audio signal.
21. A method comprising:
   recording audio information from a user while recording tracking information associated with a movement of the user;
   storing the audio information as an audio track;
   storing the tracking information as a spatial track; and
   playing the audio track while generating a graphical representation of the tracking information using the spatial track.
22. A computer-implemented method for sound extraction, the method comprising:
   defining a training dataset, the training dataset comprising a ground truth and a training input;
   training a machine learning model to perform vocal extraction using the training dataset; and
   performing sound extraction, using the machine learning model, on an audio stream to extract a sound aspect of the audio stream.
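The dataset construction and chunked processing recited in claims 1-5 can be sketched in a few lines. The sketch below is illustrative only: the mixing of clean speech with noisy audio at a chosen signal-to-noise ratio, the function names make_training_pair and chunk_stream, and the zero-padding of the final chunk are assumptions, since the claims do not specify a mixing rule, chunk size, or model framework.

```python
from typing import Tuple

import numpy as np

def make_training_pair(clean_speech: np.ndarray,
                       noise: np.ndarray,
                       snr_db: float) -> Tuple[np.ndarray, np.ndarray]:
    """Combine clean speech with noisy audio at a given SNR.

    The noisy mixture serves as the training input; the clean speech is
    kept as the ground truth the model learns to recover.
    """
    noise = noise[: len(clean_speech)]
    speech_power = np.mean(clean_speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean_speech + scale * noise, clean_speech

def chunk_stream(audio: np.ndarray, chunk_size: int) -> np.ndarray:
    """Segment an audio stream into fixed-size chunks stored in an array
    (claim 5); the final partial chunk is zero-padded."""
    n_chunks = int(np.ceil(len(audio) / chunk_size))
    padded = np.zeros(n_chunks * chunk_size, dtype=audio.dtype)
    padded[: len(audio)] = audio
    return padded.reshape(n_chunks, chunk_size)
```

At training time each (mixture, clean speech) pair would be fed to whatever model is chosen; at inference time the chunk array can be passed through the trained model chunk by chunk and the outputs concatenated.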
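The operations of claim 6 — receiving sound data, performing vocal extraction as part of natural language processing, determining an action from the content, and causing that action to be implemented — could be wired together as in the skeleton below. The callables extract_vocal and transcribe and the keyword-to-action mapping are hypothetical placeholders standing in for the trained machine learning model and the speech-to-text analysis of claim 10; none of them is defined by the claims.

```python
from typing import Callable, Dict

def handle_sound_data(sound_data: bytes,
                      extract_vocal: Callable[[bytes], bytes],
                      transcribe: Callable[[bytes], str],
                      actions: Dict[str, Callable[[], None]]) -> None:
    """Process one block of sound data received from a connected device."""
    vocal = extract_vocal(sound_data)        # vocal extraction via the trained model
    text = transcribe(vocal).lower()         # speech-to-text on the extracted vocal aspect
    for keyword, action in actions.items():  # determine an action from the content
        if keyword in text:
            action()                         # cause the action to be implemented
            return
```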
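Claims 17-20 blend the extracted voice aspect back into the raw audio at a user-defined ratio set through an electronic input such as a potentiometer. A minimal sketch, assuming a linear cross-fade driven by a ratio already normalized to the range 0-1, is:

```python
import numpy as np

def mix_voice_and_raw(voice: np.ndarray, raw: np.ndarray, ratio: float) -> np.ndarray:
    """Cross-fade the extracted voice aspect with the raw audio.

    ratio = 1.0 outputs only the extracted voice, ratio = 0.0 passes the
    raw audio through unchanged; intermediate values blend the two.
    """
    ratio = float(np.clip(ratio, 0.0, 1.0))
    n = min(len(voice), len(raw))
    return ratio * voice[:n] + (1.0 - ratio) * raw[:n]
```

In an edge device the ratio would be refreshed from the potentiometer, rotary encoder, or touch bar reading each time an output buffer is generated; an equal-power cross-fade could be substituted for the linear one shown here.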
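Claim 21 records an audio track and a spatial track together and later replays the audio while drawing the movement. One way to keep the two tracks aligned at playback, assuming positions are simply interpolated at the current playback timestamp (the class name and fields below are hypothetical), is:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class SessionRecording:
    """Audio and spatial tracks captured during the same session."""
    sample_rate: int               # audio samples per second
    audio_track: np.ndarray        # shape (n_samples,)
    spatial_times: np.ndarray      # shape (n_fixes,), seconds from the start of recording
    spatial_positions: np.ndarray  # shape (n_fixes, 3), x/y/z position fixes

    def position_at(self, playback_time_s: float) -> np.ndarray:
        """Interpolate the recorded position at a playback timestamp so the
        graphical representation stays in sync with the audio track."""
        return np.array([
            np.interp(playback_time_s, self.spatial_times, self.spatial_positions[:, k])
            for k in range(3)
        ])
```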
PCT/US2023/011012 2022-01-20 2023-01-18 Sound isolation WO2023141133A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263301141P 2022-01-20 2022-01-20
US63/301,141 2022-01-20

Publications (2)

Publication Number Publication Date
WO2023141133A2 true WO2023141133A2 (en) 2023-07-27
WO2023141133A3 WO2023141133A3 (en) 2023-08-24

Family

ID=87348981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/011012 WO2023141133A2 (en) 2022-01-20 2023-01-18 Sound isolation

Country Status (1)

Country Link
WO (1) WO2023141133A2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7243052B2 (en) * 2018-06-25 2023-03-22 カシオ計算機株式会社 Audio extraction device, audio playback device, audio extraction method, audio playback method, machine learning method and program
US11475909B2 (en) * 2020-02-07 2022-10-18 Google Llc Separating speech by source in audio recordings by predicting isolated audio signals conditioned on speaker representations
US11678120B2 (en) * 2020-05-14 2023-06-13 Nvidia Corporation Audio noise determination using one or more neural networks

Also Published As

Publication number Publication date
WO2023141133A3 (en) 2023-08-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23743662

Country of ref document: EP

Kind code of ref document: A2