WO2020176813A1 - System and method for neural network orchestration - Google Patents
System and method for neural network orchestration
- Publication number
- WO2020176813A1 (PCT/US2020/020246)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- engine
- classification
- image
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
Definitions
- One of the methods for classifying a first media segment of a first data type having a corresponding media segment of a second data type includes: extracting a first set of media features of the first media segment of the first data type; generating, using an engine prediction neural network, a best candidate neural network based at least on the first set of media features; determining whether a predicted value of accuracy of the best candidate neural network is above a predetermined accuracy threshold; when the predicted value of accuracy of the best candidate neural network is below the predetermined accuracy threshold, classifying the corresponding media segment of a second data type using a second classification neural network; and selecting, based at least on results of the classification of the corresponding media segment of a second data type, a third classification neural network to classify the first media segment of the first data type.
- the best candidate neural network can be a neural network having a highest predicted value of accuracy.
- the first and second data types can be data of different classes.
- the first type of data can be audio data and the second type of data can be image data or metadata.
- the third classification neural network and the best candidate neural network are different.
- extracting the first set of media features of the first data type includes extracting audio features of the first media segment using outputs of one or more layers of a speech-to-text classification neural network.
- the first data type can be audio data and the corresponding media segment of the second data type can be image data, transcription data, or metadata.
- the corresponding media segment can span a duration of 30 seconds before and after the time when the first media segment occurs within a multimedia file.
- the first segment can occur, for example, at the 9-minute playtime timestamp.
- the process of extracting the first set of media features of the first data type can comprise extracting image features of the first media segment by using outputs of one or more layers of an image classification neural network.
- the first data type can be image data and the corresponding media segment of second data type can be audio data, transcription data, or metadata.
- the process of classifying the corresponding media segment of a second data type using the second classification neural network can include: extracting a second set of media features of the corresponding media segment of the second data type; and generating, using the engine prediction neural network, a best candidate neural network based at least on the second set of media features.
- the second classification neural network can be the best candidate neural network.
- the process of extracting the first set of media features of the first data type can include extracting the first set of media features of the first media segment using outputs of one or more hidden layers of a fourth classification neural network trained to classify data of the first data type.
- the first data type can be audio data, image data, transcription data, or metadata.
- the process of selecting the third image classification neural network can include: determining context of the corresponding media segment based at least on the classification result of the corresponding media segment received from the second classification neural network; and selecting the third classification neural network based on the determined context.
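As an illustration of the flow described in these aspects, the following Python sketch shows how a best-candidate engine could be chosen, checked against an accuracy threshold, and, on a miss, replaced by a context-selected engine driven by the corresponding segment of the second data type. This is not the disclosed implementation; the engine registry, feature extractors, context selector, and threshold value are placeholder assumptions.

```python
# Hypothetical sketch of the interclass fallback described above.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Candidate:
    engine_id: str
    predicted_accuracy: float

def classify_with_best_engine(segment, extract_features, predict_candidates, engines):
    features = extract_features(segment)
    best = max(predict_candidates(features), key=lambda c: c.predicted_accuracy)
    return engines[best.engine_id](segment)

def classify_with_interclass_fallback(
    segment,                      # first media segment (e.g., an audio segment)
    corresponding_segment,        # co-occurring segment of a second data type (e.g., an image)
    extract_features: Callable,   # e.g., hidden-layer outputs of a pretrained classifier
    predict_candidates: Callable[[Sequence[float]], list[Candidate]],  # engine prediction network
    engines: dict,                # engine_id -> classifier callable (hypothetical registry)
    select_by_context: Callable,  # maps the second-class result to a specialized engine_id
    threshold: float = 0.65,      # assumed accuracy threshold
):
    features = extract_features(segment)
    best = max(predict_candidates(features), key=lambda c: c.predicted_accuracy)
    if best.predicted_accuracy >= threshold:
        # Confidence is high enough: classify directly with the best candidate.
        return engines[best.engine_id](segment)
    # Otherwise classify the corresponding segment of the second data type first...
    context_result = classify_with_best_engine(
        corresponding_segment, extract_features, predict_candidates, engines)
    # ...and use that result to pick a third, context-specialized engine.
    third_engine_id = select_by_context(context_result)
    return engines[third_engine_id](segment)
```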
- a second method for classifying a portion of an image includes: segmenting the image into a plurality of image portions; extracting image features of a first image portion of the plurality of image portions using a first image classification engine; inputting the extracted image features into an engine prediction neural network to generate a list of one or more best candidate engines; determining a predicted accuracy value for each best candidate engine; if none of the best candidate engines has a predicted accuracy value above a predetermined accuracy threshold, identifying an alternate data set associated with the first image portion, the alternate data set comprising non-image data; requesting a second classification engine to classify the alternate data set; receiving, from the second classification engine, a second classification result of the alternate data set; and selecting, based at least on the second classification result of the alternate data set, a third image classification engine to re-classify the portion of the image.
- the second classification engine is trained to classify data in a same class as the alternate data set.
- the first and third image classification engines can be different.
- a system for classifying a first media segment of a first data type having a corresponding media segment of a second data type includes: a memory; and one or more processors coupled to the memory.
- the one or more processors are configured to: extract a first set of media features of the first media segment of the first data type; generate, using an engine prediction neural network, a best candidate neural network based at least on the first set of media features; determine whether a predicted value of accuracy of the best candidate neural network is above a predetermined accuracy threshold; when the predicted value of accuracy of the best candidate neural network is below the predetermined accuracy threshold, classify the corresponding media segment of a second data type using a second classification neural network; and select, based at least on results of the classification of the corresponding media segment of a second data type, a third classification neural network to classify the first media segment of the first data type.
- the first and second data types can be different.
- the best candidate neural network can be a neural network having a highest predicted value of accuracy, and the third classification neural
- Figures 3A and 3B are flow diagrams of training processes in accordance with some aspects of the disclosure.
- Figure 6 is a flow diagram of a classification process in accordance with an aspect of the disclosure.
- Figure 10 is a chart illustrating empirical data of loss vs the channel size of the autoencoder neural network in accordance with some aspects of the disclosure.
- SRC 105 can also use the classification result from the corresponding audio data to tailor the field of engines to be orchestrated.
- transcription of the corresponding audio data can be “touchdown, what a pass!”
- SRC 105 can orchestrate only facial classification engines specialized for sports. This can be done by using an engine prediction neural network that is trained to only orchestrate sport-specialized image classification engines. In this way, the next list of best candidate engines can have a much higher rate of accuracy.
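A minimal sketch of this context-based narrowing of the engine field, assuming a toy keyword heuristic in place of a real topic classification engine and a purely hypothetical engine registry:

```python
# Tailoring the field of engines by context; keyword heuristic stands in
# for a topic classification engine, and ENGINE_TAGS is a made-up registry.
SPORTS_KEYWORDS = {"touchdown", "pass", "quarterback", "goal"}

ENGINE_TAGS = {                # hypothetical: engine_id -> specialization tags
    "face-generic": {"general"},
    "face-sports": {"sports"},
    "face-politics": {"politics"},
}

def infer_topic(transcript: str) -> str:
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return "sports" if words & SPORTS_KEYWORDS else "general"

def tailor_engine_field(transcript: str) -> list[str]:
    topic = infer_topic(transcript)
    matches = [eid for eid, tags in ENGINE_TAGS.items() if topic in tags]
    return matches or list(ENGINE_TAGS)   # fall back to the full field

print(tailor_engine_field("touchdown, what a pass!"))   # ['face-sports']
```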
- the smart router conductor can be configured to use outputs of one or more hidden layers of the speech recognition neural network to extract relevant (e.g., dominant) features of the audio file.
- the smart router conductor can be configured to use outputs of one or more layers of a Deep Speech neural network by Mozilla Research, which has five hidden layers.
- outputs of one or more hidden layers of the deep speech neural network can be used as inputs of an engine prediction neural network.
- outputs from the last hidden layer of a deep neural network (e.g., Deep Speech) can be used as inputs to an engine prediction neural network, which can be a fully-layered convolutional neural network.
- the engine prediction neural network is configured to predict one or more best-candidate engines (engines with the best predicted results) based at least on the audio features of an audio spectrogram of the segment.
- the engine prediction neural network is configured to predict one or more best-candidate engines based at least on outputs of one or more layers of a deep neural network trained to perform speech recognition.
- the outputs of one or more layers of a speech recognition deep neural network are representative of dominant audio features of a media (e.g., audio) segment.
- the engine prediction neural network (e.g., the backend neural network of the hybrid deep neural network) can be a neural network such as, but not limited to, a deep neural network (e.g., an RNN), a feedforward neural network, a convolutional neural network (CNN), a Faster R-CNN, a Mask R-CNN, an SSD neural network, a hybrid neural network, etc.
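One way to realize this pattern is to capture a hidden layer's activations with a forward hook and feed them to the engine prediction network. The PyTorch sketch below uses stand-in layer sizes and a toy frontend; the actual frontend would be a pretrained speech or image classifier, not the network shown here.

```python
# Sketch (not the patent's implementation): capture hidden-layer activations
# from a frontend with a forward hook, then feed them to a backend predictor.
import torch
import torch.nn as nn

frontend = nn.Sequential(              # stand-in for a pretrained speech/image classifier
    nn.Linear(161, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),    # hidden layer whose outputs are reused as features
    nn.Linear(512, 29),
)

captured = {}
def save_activation(_module, _inputs, output):
    captured["features"] = output.detach()

frontend[2].register_forward_hook(save_activation)   # hook the second Linear layer

backend = nn.Sequential(               # engine prediction network: features -> per-engine score
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 4),                 # e.g., predicted accuracy/WER for 4 engines (assumed)
)

spectrogram_frame = torch.randn(1, 161)
frontend(spectrogram_frame)            # populates captured["features"]
per_engine_scores = backend(captured["features"])
print(per_engine_scores.shape)         # torch.Size([1, 4])
```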
- one or more subprocesses of process 300 can be performed interchangeably.
- one or more subprocesses such as subprocesses 305, 310, 315, and 320 can be performed in different orders or in parallel.
- subprocesses 315 and 320 can be performed prior to subprocesses 305 and 310.
- the backend neural network (e.g., the engine prediction neural network) can be a deep neural network (e.g., an RNN), a feedforward neural network, a convolutional neural network (CNN), a Faster R-CNN, a Mask R-CNN, an SSD neural network, a hybrid neural network, etc.
- Outputs of the frontend image classification neural network can be used as inputs to the backend neural network to generate a list of engines with a predicted classification accuracy.
- Process 350 starts at 355 where an input media file (e.g., a multimedia file, an image file) is segmented into a plurality of segments.
- the input media file can be an image file.
- the image file is not segmented.
- the plurality of segments of an image file can be portions of the image at various locations of the image (e.g., middle portion, upper-left corner portion, upper-right corner portion).
- the image features of the image or of the plurality of segments of the image are extracted. In some embodiments, this can be accomplished by analyzing each segment using an image classification engine such as, but not limited to, a VGG image neural network, and then extracting the outputs (e.g., activations) of one or more layers of the image classification engine.
- the outputs of the last hidden layer of the image classification engine can be used to represent the dominant image features of each segment.
- the outputs of the second and last hidden layers of the image classification engine can be combined and used to represent the dominant image features of each segment (or the entire image).
- each engine to be orchestrated by SRC 105 in the ecosystem of engines is tasked to classify the one or more segments of the image.
- the classification results from each engine for each segment are scored to generate a classification accuracy score for each segment and engine. For example, given four image segments, each engine will have four different classification accuracy scores, one for each segment.
- the engine prediction neural network (e.g., backend neural network) of SRC 105 is trained to associate the image features of each image segment with the classification accuracy score of each engine for that particular image segment.
- the classification accuracy score of each engine for a segment can be obtained by comparing the classification results of the segment with the ground truth data of the segment.
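A hedged sketch of that training step, with assumed tensor shapes and a simple MSE regression from segment features to per-engine accuracy scores:

```python
# Associate extracted features of each segment with the per-engine accuracy
# scores measured against ground truth. Shapes and sizes are assumptions.
import torch
import torch.nn as nn

num_engines, feature_dim = 4, 512
engine_predictor = nn.Sequential(
    nn.Linear(feature_dim, 256), nn.ReLU(),
    nn.Linear(256, num_engines), nn.Sigmoid(),   # predicted accuracy in [0, 1] per engine
)
optimizer = torch.optim.Adam(engine_predictor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in training data: features per image segment, and the measured accuracy
# of each orchestrated engine on that segment (from comparison with ground truth).
segment_features = torch.randn(64, feature_dim)
engine_accuracy = torch.rand(64, num_engines)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(engine_predictor(segment_features), engine_accuracy)
    loss.backward()
    optimizer.step()
```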
- FIG. 4 illustrates a process 400 for transcribing an input media file using a hybrid neural network that can preemptively orchestrate a group of engines of an engine ecosystem in accordance with some embodiments of the present invention.
- Process 400 starts at 405 where the input media file is segmented into a plurality of segments.
- the media file can be segmented based on a time duration (segments with a fixed time duration), audio features, topic, scene, and/or metadata of the input media file.
- the input media file can also be segmented using a combination of the above variables (e.g., duration, topic, scene, etc.).
- the media file (e.g., audio file, video file) can be segmented by duration of 2-10 seconds.
- the audio file can be segmented into a plurality of segments having an approximate duration of 5 seconds.
- the input media file can be segmented by duration and only at locations where no speech is detected. In this way, the input media file is not segmented such that a word sound is broken between two segments.
- the input media file can also be segmented based on two or more variables such as topic and duration, scene and duration, metadata and duration, etc.
- subprocess 105 can use a segmentation module (see item 8515 of FIG. 8) to segment the input media file by scenes and then by duration to yield 5-second segments of various scenes.
- process 400 can segment by a duration of 10-second segments and then further segment each 10-second segment by certain dominant audio feature(s) or scene(s).
- the scene of various segments of the input media file can be identified using metadata of the input media file or using a neural network trained to identify scenes, using metadata and/or images, from the input media file. Each segment can be preprocessed and transformed into an appropriate format for use as input to a neural network.
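For illustration, the sketch below segments audio by target duration while only cutting at low-energy (non-speech) positions; the frame-energy threshold is a stand-in assumption for a real voice-activity or scene detector.

```python
# Duration-based segmentation that only cuts where no speech is detected;
# a simple frame-energy threshold stands in for a voice-activity detector.
import numpy as np

def segment_audio(samples: np.ndarray, sr: int, target_s: float = 5.0,
                  frame_ms: float = 25.0, energy_thresh: float = 1e-4):
    frame = int(sr * frame_ms / 1000)
    target = int(sr * target_s)
    cuts, start = [0], 0
    while start + target < len(samples):
        # Search forward from the target boundary for a low-energy (non-speech) frame.
        pos = start + target
        while pos + frame < len(samples):
            if np.mean(samples[pos:pos + frame] ** 2) < energy_thresh:
                break
            pos += frame
        cuts.append(pos)
        start = pos
    cuts.append(len(samples))
    return [samples[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]

sr = 16000
# Stand-in recording: four 6-second "speech" bursts separated by half-second silences.
audio = np.concatenate([np.random.randn(sr * 6) * 0.1, np.zeros(sr // 2)] * 4)
print([round(len(s) / sr, 2) for s in segment_audio(audio, sr)])
```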
- the image can be segmented into a plurality of image portions (e.g., a facial portion, an object portion).
- a neural network (e.g., a DNN, i.e., a deep neural network, or a hybrid deep neural network) can be used to preemptively orchestrate (e.g., pair) media segments with candidate engines.
- the hybrid deep neural network can include two or more neural networks of different architectures (e.g., RNN, CNN).
- the hybrid deep neural network can include an RNN frontend and a CNN backend. The RNN frontend can be trained to ingest speech spectrograms and generate text (speech-to-text classification).
- an image classification neural network is used to extract image features of the image. This can be done by extracting outputs of one or more hidden layers of the image classification neural network. Similar to the transcription case, any combination of outputs from two or more hidden layers can be used as inputs to the engine prediction neural network. Additionally, outputs only from the last hidden layer of the image classification neural network can be used as inputs to the engine prediction neural network.
- the CNN backend can be an engine prediction neural network trained to identify a list of best-candidate engines for transcribing each segment based on at least audio features (e.g., outputs of RNN frontend) of the segment and the predicted WER of each engine for the segment.
- the list of best-candidate engines can have one or more engines identified for each segment.
- a best-candidate engine is an engine that is predicted to provide results having a certain level of accuracy (e.g., WER of 15% or less).
- a best-candidate engine can also be an engine that is predicted to provide the most accurate results compared to other engines in the ecosystem.
- the engines can be ranked by accuracy.
- each engine can have multiple WERs. Each WER of an engine is associated with one set of audio features of a segment of the audio file.
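For reference, WER here is the standard word-level edit distance normalized by the reference length; a small self-contained implementation:

```python
# Word error rate (WER) used to score each engine per segment: Levenshtein
# distance over words between the engine output and the ground truth transcript.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("touchdown what a pass", "touchdown what pass"))  # 0.25
```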
- transcription outputs from the best-candidate engines sourced at 115 are combined to generate a combined transcription result.
- Feature extraction is a process that is performed during both the training stage and the production stage.
- In the training stage, as in process 100, feature extraction is performed at 110, where the audio features of the input media file are extracted by extracting outputs of one or more layers of a neural network trained to ingest audio and generate text.
- the audio feature extraction process can be performed on a segment of an audio file or on the entire input file (which is then segmented into portions).
- In the production stage, feature extraction is performed on an audio segment to be transcribed so that the engine prediction neural network can use the extracted audio features to predict the WER of one or more engines in the engine ecosystem (for the audio segment). In this way, the engine with the lowest predicted WER (highest predicted accuracy) for an audio segment can be selected to transcribe the audio segment. This can save a significant amount of resources by eliminating the need to perform transcription using a trial-and-error or random approach to engine selection.
- FIG. 5 graphically illustrates a hybrid deep neural network 500 used to extract audio features and to preemptively orchestrate audio segments to best candidate transcription engines in accordance with some embodiments of the present disclosure.
- hybrid deep neural network 500 includes an RNN frontend 550 and a CNN backend 560.
- RNN frontend 550 can be a pre-trained speech recognition network
- CNN backend 560 can be an engine prediction neural network trained to predict the WERs of one or more engines in the engine ecosystem based at least on outputs from RNN frontend 550.
- an audio signal can be segmented into small time segments 505, 510, and 515.
- Each of segments 505, 510, and 515 has its respective audio features 520, 525, and 530.
- audio features of each segment are just audio spectrograms and the dominant features of the spectrograms are not yet known.
- neural network 550 can be a recurrent neural network with five hidden layers.
- the five hidden layers can be configured to encode phoneme(s) of the audio input file or phoneme(s) of a waveform across one or more of the five layers.
- the LSTM units are designed to remember values of one or more layers over a period of time such that one or more audio features of the input media file can be mapped to the entire phoneme, which can spread over multiple layers and/or multiple segments.
- the outputs of the fifth layer of the RNN are then used as inputs to engine-prediction layer 560, which can be a regression-based analyzer configured to learn the relationship between the dominant audio features of the segment and the WER of the engine for that segment (which was established at 120).
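A PyTorch sketch in the spirit of FIG. 5, with illustrative sizes: a recurrent frontend over spectrogram frames whose last-layer outputs feed a small regression head that predicts a WER per engine. This is an assumption-laden stand-in, not the Deep Speech model itself.

```python
# Hybrid frontend/backend sketch: LSTM frontend, regression-based engine-prediction head.
import torch
import torch.nn as nn

class HybridOrchestrator(nn.Module):
    def __init__(self, n_feats=161, hidden=512, num_layers=5, num_engines=4):
        super().__init__()
        self.frontend = nn.LSTM(n_feats, hidden, num_layers=num_layers, batch_first=True)
        self.engine_prediction = nn.Sequential(   # regression-based analyzer
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, num_engines),
        )

    def forward(self, spectrogram):               # (batch, time, n_feats)
        outputs, _ = self.frontend(spectrogram)   # outputs of the last recurrent layer
        pooled = outputs.mean(dim=1)              # summarize the segment over time
        return self.engine_prediction(pooled)     # predicted WER per engine

model = HybridOrchestrator()
segment = torch.randn(2, 500, 161)               # two segments, ~500 time steps each
print(model(segment).shape)                      # torch.Size([2, 4])
```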
- each engine that is to be orchestrated must be trained using training data with ground truth transcription data.
- the WER can be calculated based on the comparison of the engine outputs with the ground truth transcription data.
- the trained collection of engines can be orchestrated such that subprocess 215 (for example) can select one or more of the orchestrated engines (engines in the ecosystem that have been used to train the engine prediction neural network) that can best transcribe a given media segment.
- FIG. 6 illustrates a process 600 for performing engine orchestration between different classes of data (i.e., interclass orchestration).
- Process 600 starts at subprocess 605 where one or more classification results of a first group of one or more segments are received from a first classification engine.
- the one or more classification results can be, but are not limited to, transcription results, image classification (e.g., tagging) results, or object classification results.
- a first group of one or more segments can have one segment or many segments.
- a transcription engine can output transcription results of a first group of segments having ten audio segments.
- an object classification engine can output object recognition results of objects at several portions of an image. In other words, objects at multiple portions of the image can be recognized (e.g., classified) by the object classification engine.
- the first classification engine can be a transcription engine, an object or image recognition engine, a color classification engine, an animal classification engine, a facial classification engine, etc.
- the first group of segments can be audio segments, portions of an image, portions of a larger transcript, or portions of metadata.
- when the first classification engine is a transcription engine, the first group of segments can be segments of an audio file.
- when the first classification engine is an object recognition engine, the first group of segments can be segments (e.g., portions) of an image or the entire image.
- Each of the classification results can include a confidence of accuracy value provided by the classification engine.
- a low confidence segment is a segment having a low confidence of accuracy value below a certain accuracy threshold, which can be dynamically selected.
- a third classification neural network can be selected, based at least on the second classification result (from 620) to reclassify the segment with a low confidence of accuracy, which was identified at subprocess 610.
- SRC 105 can use the second classification result that identifies jersey number 16 associated with “Jared Goff” or recognizes Jared Goff's face using facial recognition to select a transcription engine specialized in sports, pronouns, or football.
- the one or more classification results received at 605 can be a facial recognition result of image 225 of FIG. 2.
- the facial recognition result received from a facial recognition engine can be “Clint Eastwood.”
- this result is identified to have a low confidence of accuracy (an accuracy value of 35%).
- a second group of one or more segments (of a multimedia file) relating to image 225 is identified. This can be audio data or metadata occurring contemporaneously with image 225 in the multimedia file.
- the second group of one or more segments can be one or more of audio segments 115a-115e having timestamps generally around (e.g., within 30 seconds before and after) the timestamp of image 225.
- one or more of the audio segments 115a-115e are transcribed (e.g., speech-to-text classification).
- SRC 105 can determine, using a topic classification engine on the transcription results, that the topic is sports or football. SRC 105 can then select a facial recognition engine that is specialized in sports (e.g., trained with sports personalities).
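The step of gathering the second group of contemporaneous segments can be pictured as a timestamp-window query; the segment registry and timestamps below are hypothetical.

```python
# Identify the "second group": segments of another data type whose timestamps
# fall within a window around the low-confidence segment.
from dataclasses import dataclass

@dataclass
class Segment:
    segment_id: str
    kind: str            # "audio", "image", "metadata", ...
    start_s: float       # timestamp within the multimedia file

def contemporaneous_segments(segments, anchor: Segment, kind: str, window_s: float = 30.0):
    return [s for s in segments
            if s.kind == kind and abs(s.start_s - anchor.start_s) <= window_s]

timeline = [Segment("115a", "audio", 530.0), Segment("115b", "audio", 535.0),
            Segment("205", "image", 120.0), Segment("225", "image", 540.0)]
image_225 = timeline[-1]
print([s.segment_id for s in contemporaneous_segments(timeline, image_225, "audio")])
# ['115a', '115b']
```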
- the predicted confidence of accuracy of the best engine among the list of best candidate engines is determined.
- the list of best candidate engines can have two engines.
- the first engine can have a predicted accuracy of 25% and the second engine can have a predicted accuracy of 37%.
- the best engine is the second engine, with a 37% predicted accuracy.
- the accuracy threshold can be set at 65%. If the accuracy threshold is met, the best candidate engine is requested to classify the media file.
- the media file can be a segment or the entire media file. For example, if the media file is an audio segment, then the best candidate transcription engine is requested to transcribe the audio segment at 720. In another example, if the media file is an image, then the best candidate image classification engine is requested to classify the image and/or objects within the image.
- the outputs of the best candidate engine are received at subprocess 605 and interclass orchestration process continues through subprocess 625.
- the interclass orchestration process can select a more appropriate engine using process 600.
- the process proceeds (at 730) directly to subprocess 615 of process 600 and continues through subprocess 625.
- when the media file is an audio file, a second group of one or more segments relating to the audio file is identified.
- the second group of segments can be image 205 and/or image 225.
- the second group of segments can also be metadata contemporaneous with the audio segment or within a certain time span (e.g., 5 minutes before and after) of the timestamp of the audio segment.
- by skipping to subprocess 615 when the accuracy threshold is not met at 715, resources can be conserved by not requesting engine(s) with a low predicted value of accuracy, identified prior to subprocess 605 (e.g., at subprocess 415 of process 400, where the best candidate engine is predicted for each segment), to classify the media file. Skipping to subprocess 615 (from subprocess 715) enables SRC 105 to perform interclass orchestration immediately after determining that none of the best candidate engines among the list of best candidate engines has a sufficiently high confidence of accuracy value. This saves both time and resources and enables SRC 105 to be more efficient.
- FIG. 8 is a bar chart illustrating the improvement in engine outputs when using the smart router conductor with preemptive orchestration. As shown in FIG. 8, a typical baseline accuracy for any single engine is 57% to 65%. However, using SRC 105, the accuracy of the resulting transcription can be dramatically improved. In one scenario, the improvement is 19% better than the next best transcription engine working alone.
- Orchestration can include a process that classifies how accurately each engine of a collection of engines transcribes an audio segment based on the raw audio features of the audio segment.
- preemptive orchestration can involve the pairing of a plurality of media segments with corresponding best transcription engines based at least on extracted audio features of each segment. For instance, each audio segment can be paired with one or more best transcription engines by the backend CNN (e.g., the orchestrator).
- outputs from the last layer of frontend neural network are used as inputs to the backend CNN.
- outputs from the fifth layer of the deep speech neural network can be used as inputs to the backend CNN.
- Outputs from the fifth layer of the deep speech neural network can have 2048 features per time step. The number of channels (one for each of the 2048 features) between the two layers is a free parameter. Accordingly, there can be a very large number of parameters due to the 2048 input channels, which leads to a CNN with very large dimensions.
- a dimension reduction layer is used.
- the dimension reduction layer can be a CNN layer with a filter size of 1. This is equivalent to a fully connected layer that operates independently on each time step.
- the number of parameters can scale as n_in × n_out. This can be beneficial because the number of parameters is not multiplied by the filter size.
- the backend CNN can be a three-layer CNN with one dimension-reduction layer followed by a layer with filter size 3 and a layer with filter size 5.
- the number of parameters of this backend CNN can be computed from the channel counts and filter sizes, as in the illustrative sketch below.
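Using standard 1-D convolution parameter counting (in_channels × out_channels × kernel_size, plus biases) and purely illustrative channel counts, since the disclosure's exact sizes are not reproduced here:

```python
# Illustrative parameter count for the three-layer backend CNN described above
# (dimension reduction with filter size 1, then filter sizes 3 and 5). The channel
# counts below are assumptions for illustration, not values from the disclosure.
def conv1d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    return in_ch * out_ch * kernel + (out_ch if bias else 0)

n_in, n_red, c1, c2 = 2048, 64, 64, 4           # 4 = one output channel per engine (assumed)
total = (conv1d_params(n_in, n_red, 1)           # dimension-reduction layer, filter size 1
         + conv1d_params(n_red, c1, 3)           # filter size 3
         + conv1d_params(c1, c2, 5))             # filter size 5
print(total)                                     # 144772 with these illustrative sizes
```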
- outputs from one or more layers of the frontend neural network can be used as inputs to the backend neural network (e.g., engine prediction neural network).
- only outputs from the last hidden layer are used as inputs to the backend neural network.
- the first and last hidden layers can be used as inputs for the backend neural network.
- outputs from the second and the penultimate hidden layers can be used as inputs to the backend neural network.
- Outputs from other combinations of layers are contemplated and are within the scope of this disclosure. For example, outputs from the first and fourth layers can be used as inputs. In another example, outputs from the second and fifth layers can also be used.
- outputs from layer 5 appear to provide the best results (though the results are within a margin of error). Additionally, the last point shows the combined features of layers 1 and 5 as inputs. Here, the larger gap between the training loss and the test loss can imply that there exists some overfitting.
- overfitting can be an issue. Overfitting occurs when the training data set is small relative to the number of model parameters. In transcription, the number of outputs is effectively reduced to a single number (the word error rate) per engine per audio segment. With a large number of input features being extracted from the frontend neural network, the number of parameters in the backend neural network is similarly large, since the number of parameters in a layer scales as the product of its input and output features. In other words, the number of input channels can be very large and can approach an impractically large value.
- the number of input channels can be reduced by using an autoencoder, without the need to re-train the entire frontend neural network and while keeping the 2048 features per time step produced by the frontend layer unchanged.
- Autoencoders can be trained using the signal itself; no external ground truth is required. Furthermore, the effective amount of training data scales well with the dimensionality of the signal.
- the orchestration network reduces the 2048 input features per time step (and roughly 500 time steps per audio file) to a single number per engine. During its training process, the autoencoder instead starts with, and reconstructs, 2048 features per time step. This translates to roughly five orders of magnitude more training data from the same quantity of raw audio for the autoencoder as compared to the orchestrator. With that much training data, overfitting is not an issue.
- the autoencoder can be trained independently, and accurately, apart from the training of the backend neural network. A good autoencoder can reduce the dimensionality of the signal without losing much information, and this reduced dimensionality translates directly into fewer parameters in the orchestration model, which reduces the potential for overfitting.
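A minimal autoencoder sketch for this dimensionality reduction, trained on the frontend features themselves; the bottleneck width and training schedule are assumptions.

```python
# Shrink 2048-feature frontend outputs to fewer channels before orchestration;
# trained on the features themselves, so no external ground truth is required.
import torch
import torch.nn as nn

bottleneck = 64                                  # assumed bottleneck width
autoencoder = nn.Sequential(
    nn.Linear(2048, bottleneck), nn.ReLU(),      # encoder: 2048 -> 64 channels per time step
    nn.Linear(bottleneck, 2048),                 # decoder: reconstruct the original features
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

features = torch.randn(500, 2048)                # ~500 time steps of frontend features
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(features), features)   # reconstruction loss
    loss.backward()
    optimizer.step()

encoder = autoencoder[:2]                        # keep only the encoder at orchestration time
reduced = encoder(features)                      # (500, 64) inputs for the backend CNN
print(reduced.shape)
```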
- FIG. 10 shows losses on the test set for various autoencoder sizes versus the number of channels in the output of the first layer of the orchestration network. It should be noted that the number of parameters (and therefore the potential for overfitting) scales roughly as the product of the input and output channels. Further, more output channels mean more information is carried over to the rest of the network, which can potentially lead to more accurate predictions.
- FIG. 11 shows the losses for the training and testing trial runs for various numbers of channels.
- training the autoencoder for longer can translate to lower losses for the autoencoder, and there was no sign of overfitting.
- An autoencoder that was trained for longer produced a more accurate representation of the signal.
- it is not clear, however, that autoencoder accuracy itself is the limiting factor.
- noise itself can be thought of as a kind of regularization that prevents overfitting, much in the same way as a dropout term does.
- the results do seem to indicate that there is little value to having a finely tuned autoencoder model.
- FIG. 12 is a system diagram of an exemplary smart router conductor system 1200 for training one or more neural networks and performing transcription using the trained one or more neural networks in accordance with some embodiments of the present disclosure.
- System 1200 may include a database 1205, file segmentation module 1210, neural networks module 1215, feature extraction module 1220, training module 1225, communication module 1230, and conductor 1250.
- System 1200 may reside on a single server or may be distributed at various locations on a network.
- one or more components or modules (e.g., 1205, 1210, 1215, etc.) of system 1200 may be distributed across various locations throughout a network.
- Each component or module of system 1200 may communicate with each other and with external entities via communication module 1230.
- Each component or module of system 1200 may include its own sub-communication module to further facilitate intra- and/or inter-system communication.
- Database 1205 can include training data sets and customer-ingested data.
- Database 1205 can also include data collected by a data aggregator (not shown) that automatically collects and indexes data from various sources such as the Internet, broadcast radio stations, broadcast TV stations, etc.
- File segmentation module 1210 includes algorithms and instructions that, when executed by a processor, cause the processor to segment a media file into a plurality of segments as described above with respect to at least subprocess 305 of FIG. 3A, subprocess 355 of process 350, and subprocess 405 of FIG. 4.
- Neural networks module 1215 can be an ecosystem of neural networks that includes a hybrid deep neural network (e.g., neural network 500), pre-trained speech recognition neural networks (e.g., neural network 550), an engine prediction neural network (e.g., neural network 560), transcription neural networks (e.g., engines), and other classification neural networks of varying architectures.
- Transcription engines can include local transcription engine(s) and third-party transcription engines such as engines provided by IBM®, Microsoft®, and Nuance®, for example.
- Feature extraction module 1220 includes algorithms and instructions that, when executed by a processor, cause the processor to extract audio features of each media segment as described above with respect to at least subprocesses 310 and 410 of FIGS. 3 and 4, respectively.
- Feature extraction module 1220 can work in conjunction with other modules of system 1200 to perform the audio feature extraction as described in subprocesses 110 and 210.
- feature extraction module 1220 and neural networks module 1215 can be configured to cooperatively perform the functions of subprocesses 310 and 410.
- neural networks module 1215 and feature extraction module 1220 can share or have overlapping responsibilities and functions.
- Training module 1225 includes algorithms and instructions that, when executed by a processor, cause the processor to perform the respective functions and features of at least subprocesses 415, 420, and 425 of FIG. 4.
- training module 1225 can be configured to train a neural network to predict the WER of an engine for each segment based at least on audio features of each segment by mapping the engine WER of each segment to the audio features of the segment.
- Training module 1225 can also be configured to train an engine prediction neural network to associate image features with an engine's classification performance.
- Conductor 1250 includes algorithms and instructions that, when executed by a processor, cause the processor to perform the respective functions and features of the smart router conductor as described above with respect to, but not limited to, processes 100, 200, 300, 400, 600, and 700.
- conductor 1250 includes algorithms and instructions that, when executed by a processor, cause the processor to: segment the media file into a plurality of segments; extract, using a first neural network, audio features of a first and second segment of the plurality of segments; and identify, using a second neural network, a best-candidate engine for each of the first and second segments based at least on audio features of the first and second segments.
- conductor 1250 includes algorithms and instructions that, when executed by a processor, cause the processor to: segment the audio file into a plurality of audio segments; use a first audio segment of the plurality of audio segments as inputs to a deep neural network; and use outputs of one or more hidden layers of the deep neural network as inputs to a second neural network that is trained to identify a first transcription engine having a highest predicted transcription accuracy among a group of transcription engines for the first audio segment based at least on the outputs of the one or more hidden layers of the deep neural network.
- conductor 1250 includes algorithms and instructions that, when executed by a processor, cause the processor to: segment a ground truth image file into one or more image portions; extract image features of the one or more image portions using outputs of one or more hidden layers of an image classification neural network; classify the ground truth image using a plurality of image classification engines; and train an engine prediction neural network to associate the extracted image features of the ground truth image with the classification performance (e.g., accuracy score) of each of the plurality of image classification engines.
- conductor 1250 includes algorithms and instructions that, when executed by a processor, cause the processor to: segment an image file into one or more image portions; extract image features of the one or more image portions using outputs of one or more hidden layers of an image classification neural network; use the extracted image features as input to a trained engine prediction neural network to generate a list of best candidate image classification engines.
- one or more functions of each of the modules (e.g., 1205, 1210, 1215, 1220, 1225, 1230) can be shared with other modules within system 1200.
- FIG. 13 illustrates an exemplary system or apparatus 1300 in which processes 100 and 200 can be implemented.
- an element, or any portion of an element, or any combination of elements may be implemented with a processing system 1314 that includes one or more processing circuits 1304.
- Processing circuits 1304 may include micro-processing circuits, microcontrollers, digital signal processing circuits (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionalities described throughout this disclosure. That is, the processing circuit 1304 may be used to implement any one or more of the processes described above and illustrated in FIGS. 1-4, 6, and 7.
- the processing system 1314 may be implemented with a bus architecture, represented generally by the bus 1302.
- the bus 1302 may include any number of interconnecting buses and bridges depending on the specific application of the processing system 1314 and the overall design constraints.
- the bus 1302 may link various circuits including one or more processing circuits (represented generally by the processing circuit 1304), the storage device 1305, and a machine-readable, processor-readable, processing circuit-readable or computer-readable media (represented generally by a non-transitory machine-readable medium 1306).
- the bus 1302 may also link various other circuits such as, but not limited to, timing sources, peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further.
- the bus interface 1308 may provide an interface between bus 1302 and a transceiver 1310.
- the transceiver 1310 may provide a means for communicating with various other apparatus over a transmission medium.
- a user interface 1312 (e.g., keypad, display, speaker, microphone, touchscreen, motion sensor) may also be provided.
- the processing circuit 1304 may be responsible for managing the bus 1302 and for general processing, including the execution of software stored on the machine-readable medium 1306.
- the software when executed by processing circuit 1304, causes processing system 1314 to perform the various functions described herein for any particular apparatus.
- Machine-readable medium 1306 may also be used for storing data that is manipulated by processing circuit 1304 when executing software.
- One or more processing circuits 1304 in the processing system may execute software or software components.
- Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
- a processing circuit may perform the tasks.
- a code segment may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
- a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory or storage contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
- the software may reside on machine-readable medium 1306.
- the machine-readable medium 1306 may be a non-transitory machine-readable medium.
- a non-transitory processing circuit-readable, machine-readable or computer-readable medium includes, by way of example, a magnetic storage device (e.g., solid state drive, hard disk, floppy disk, magnetic strip), an optical disk (e.g., digital versatile disc (DVD), Blu-Ray disc), a smart card, a flash memory device (e.g., a card, a stick, or a key drive), RAM, ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, a removable disk, a hard disk, a CD-ROM and any other suitable medium for storing software and/or instructions that may be accessed and read by a machine or computer.
- a machine-readable medium may include, but is not limited to, non-transitory media such as, but not limited to, portable or fixed storage devices, optical storage devices, and various other media capable of storing, containing or carrying instruction(s) and/or data.
- the various methods described herein may be fully or partially implemented by instructions and/or data that may be stored in a “machine-readable medium,” “computer-readable medium,” “processing circuit-readable medium” and/or “processor-readable medium” and executed by one or more processing circuits, machines and/or devices.
- the machine-readable medium may also include, by way of example, a carrier wave, a transmission line, and any other suitable medium for transmitting software and/or instructions that may be accessed and read by a computer.
- the machine-readable medium 1306 may reside in the processing system 1314, external to the processing system 1314, or distributed across multiple entities including the processing system 1314.
- the machine-readable medium 1306 may be embodied in a computer program product.
- a computer program product may include a machine-readable medium in packaging materials.
- a first example method for classifying a first media segment of a first data type having a corresponding media segment of a second data type includes: extracting a first set of media features of the first media segment of the first data type; generating, using an engine prediction neural network, a best candidate neural network based at least on the first set of media features; determining whether a predicted value of accuracy of the best candidate neural network is above a predetermined accuracy threshold; when the predicted value of accuracy of the best candidate neural network is below the predetermined accuracy threshold, classifying the corresponding media segment of a second data type using a second classification neural network; and selecting, based at least on results of the classification of the corresponding media segment of a second data type, a third classification neural network to classify the first media segment of the first data type.
- the best candidate neural network can be a neural network having a highest predicted value of accuracy.
- the first and second data types can be data of different classes.
- the first type of data can be audio data and the second type of data can be image data or metadata.
- the third classification neural network and the best candidate neural network are different.
- extracting the first set of media features of the first data type comprises extracting audio features of the first media segment using outputs of one or more layers of a speech-to-text classification neural network.
- the first data type comprises audio data and the corresponding media segment of second data type comprises image data, transcription data, or metadata.
- the corresponding media segment can span a duration of 30 seconds before and after from when the first media segment occurs within a multimedia file.
- extracting the first set of media features of the first data type comprises extracting image features of the first media segment using outputs of one or more layers of an image classification neural network.
- the first data type comprises image data and the corresponding media segment of second data type comprises audio data, transcription data, or metadata.
- the second aspect of the first example method may be implemented in combination with the first aspect of the first example method, though the example embodiments are not limited in this respect.
- the first example method can further include: classifying the corresponding media segment of a second data type using the second classification neural network comprising: extracting a second set of media features of the corresponding media segment of the second data type; and generating, using the engine prediction neural network, a best candidate neural network based at least on the second set of media features.
- the second classification neural network comprises the best candidate neural network.
- extracting the first set of media features of the first data type comprises extracting the first set of media features of the first media segment using outputs of one or more hidden layers of a fourth classification neural network trained to classify data of the first data type.
- the first data type comprises audio data, image data, transcription data, or metadata.
- the engine prediction neural network can be pre-trained to associate outputs of the one or more hidden layers of the fourth classification neural network with predicted classification performances of a plurality of neural networks.
- the fourth aspect of the first example method may be implemented in combination with the first, second, and/or third aspect of the first example method, though the example embodiments are not limited in this respect.
- selecting the third image classification neural network can include: determining context of the corresponding media segment based at least on the classification result of the corresponding media segment received from the second classification neural network; and selecting the third classification neural network based on the determined context.
- the fifth aspect of the first example method may be implemented in combination with the first, second, third, and/or fourth aspect of the first example method, though the example embodiments are not limited in this respect.
- One or more of the components, processes, features, and/or functions illustrated in the figures may be rearranged and/or combined into a single component, block, feature or function or embodied in several components, steps, or functions. Additional elements, components, processes, and/or functions may also be added without departing from the disclosure.
- the apparatus, devices, and/or components illustrated in the Figures may be configured to perform one or more of the methods, features, or processes described in the Figures.
- the algorithms described herein may also be efficiently implemented in software and/or embedded in hardware.
- the term “and/or” placed between a first entity and a second entity means one of (1) the first entity, (2) the second entity, and (3) the first entity and the second entity.
- Multiple entities listed with “and/or” should be construed in the same manner, i.e., “one or more” of the entities so conjoined.
- Other entities may optionally be present other than the entities specifically identified by the “and/or” clause, whether related or unrelated to those entities specifically identified.
- a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising,” can refer, in one embodiment, to A only (optionally including entities other than B); in another embodiment, to B only (optionally including entities other than A); in yet another embodiment, to both A and B (optionally including other entities).
- These entities may refer to elements, actions, structures, processes, operations, values, and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
Methods and systems are provided for classifying a multimedia file using interclass data. One of the methods can use classification results from one or more engines of different classes to select a different engine for the original classification task. For example, given an audio segment having associated metadata and image data, the described interclass method can use the classification results from a topic classification of the metadata and/or an image classification result of the image data as inputs to select a new transcription engine to transcribe the audio segment.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/287,892 US11017780B2 (en) | 2017-08-02 | 2019-02-27 | System and methods for neural network orchestration |
US16/287,892 | 2019-02-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020176813A1 (fr) | 2020-09-03 |
Family
ID=72238529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2020/020246 WO2020176813A1 (fr) | 2019-02-27 | 2020-02-27 | Système et procédé d'orchestration de réseau neuronal |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020176813A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000063837A1 (fr) * | 1999-04-20 | 2000-10-26 | Textwise, Llc | Systeme pour extraire des informations multimedia de l'internet en utilisant de multiples agents intelligents evolutifs |
US20170206431A1 (en) * | 2016-01-20 | 2017-07-20 | Microsoft Technology Licensing, Llc | Object detection and classification in images |
US9892344B1 (en) * | 2015-11-30 | 2018-02-13 | A9.Com, Inc. | Activation layers for deep learning networks |
2020
- 2020-02-27: WO PCT/US2020/020246 (WO2020176813A1), active, Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000063837A1 (fr) * | 1999-04-20 | 2000-10-26 | Textwise, Llc | Systeme pour extraire des informations multimedia de l'internet en utilisant de multiples agents intelligents evolutifs |
US9892344B1 (en) * | 2015-11-30 | 2018-02-13 | A9.Com, Inc. | Activation layers for deep learning networks |
US20170206431A1 (en) * | 2016-01-20 | 2017-07-20 | Microsoft Technology Licensing, Llc | Object detection and classification in images |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240312184A1 (en) | System and method for neural network orchestration | |
US20200075019A1 (en) | System and method for neural network orchestration | |
US11017780B2 (en) | System and methods for neural network orchestration | |
Xu et al. | Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset | |
US20190043487A1 (en) | Methods and systems for optimizing engine selection using machine learning modeling | |
Lozano-Diez et al. | An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition | |
US20190385610A1 (en) | Methods and systems for transcription | |
Tran et al. | Ensemble application of ELM and GPU for real-time multimodal sentiment analysis | |
US20200286485A1 (en) | Methods and systems for transcription | |
US11176947B2 (en) | System and method for neural network orchestration | |
Yang et al. | Multi-scale semantic feature fusion and data augmentation for acoustic scene classification | |
US20210279525A1 (en) | Hierarchy-preserving learning for multi-label classification | |
US11711558B2 (en) | User classification based on user content viewed | |
Wang | Polyphonic sound event detection with weak labeling | |
Hosseini et al. | Multimodal modelling of human emotion using sound, image and text fusion | |
Xia et al. | Learning salient segments for speech emotion recognition using attentive temporal pooling | |
Punithavathi et al. | [Retracted] Empirical Investigation for Predicting Depression from Different Machine Learning Based Voice Recognition Techniques | |
WO2024093578A1 (fr) | Procédé et appareil de reconnaissance vocale, et dispositif électronique, support de stockage et produit programme d'ordinateur | |
Morrison et al. | Voting ensembles for spoken affect classification | |
Vlasenko et al. | Fusion of acoustic and linguistic information using supervised autoencoder for improved emotion recognition | |
Ntalampiras | Directed acyclic graphs for content based sound, musical genre, and speech emotion classification | |
Rana et al. | Multi-task semisupervised adversarial autoencoding for speech emotion | |
WO2020176813A1 (fr) | Système et procédé d'orchestration de réseau neuronal | |
Goossens et al. | To invest or not to invest: Using vocal behavior to predict decisions of investors in an entrepreneurial context | |
US20230070957A1 (en) | Methods and systems for detecting content within media streams |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20763618 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 20763618 Country of ref document: EP Kind code of ref document: A1 |