-
Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks
Authors:
Joji Joseph,
Bharadwaj Amrutur,
Shalabh Bhatnagar
Abstract:
3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at https://jojijoseph.github.io/3dgs-segmentation.
Submitted 17 September, 2024;
originally announced September 2024.
-
Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations
Authors:
Samyak Rawlekar,
Shubhang Bhatnagar,
Narendra Ahuja
Abstract:
Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
Submitted 12 September, 2024;
originally announced September 2024.
-
Chemical Reaction Neural Networks for Fitting Accelerating Rate Calorimetry Data
Authors:
Saakaar Bhatnagar,
Andrew Comerford,
Zelu Xu,
Davide Berti Polato,
Araz Banaeizadeh,
Alessandro Ferraris
Abstract:
As the demand for lithium-ion batteries rapidly increases, there is a need to design these cells in a safe manner to mitigate thermal runaway. Thermal runaway in batteries leads to an uncontrollable temperature rise and potentially fires, which is a major safety concern. Typically, when modelling the chemical kinetics of thermal runaway, calorimetry data (e.g., Accelerating Rate Calorimetry (ARC)) is needed to determine the temperature-driven decomposition kinetics. Conventional methods of fitting Arrhenius Ordinary Differential Equation (ODE) thermal runaway models to ARC data make several assumptions that reduce the fidelity and generalizability of the obtained model. In this paper, Chemical Reaction Neural Networks (CRNNs) are trained to fit the kinetic parameters of N-equation Arrhenius ODEs to ARC data obtained from a Molicel 21700 P45B. The models are found to be better approximations of the experimental data. The flexibility of the method is demonstrated by experimenting with two-equation and four-equation models. Thermal runaway simulations are conducted in 3D using the obtained kinetic parameters, showing the applicability of the obtained thermal runaway models to large-scale simulations.
Submitted 3 September, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
NL2OR: Solve Complex Operations Research Problems Using Natural Language Inputs
Authors:
Junxuan Li,
Ryan Wickman,
Sahil Bhatnagar,
Raj Kumar Maity,
Arko Mukherjee
Abstract:
Operations research (OR) uses mathematical models to enhance decision-making, but developing these models requires expert knowledge and can be time-consuming. Automated mathematical programming (AMP) has emerged to simplify this process, but existing systems have limitations. This paper introduces a novel methodology that uses recent advances in Large Language Models (LLMs) to create and edit OR solutions from non-expert user queries expressed in natural language. This reduces the need for domain expertise and the time to formulate a problem. The paper presents an end-to-end pipeline, named NL2OR, that generates solutions to OR problems from natural language input, and shares experimental results on several important OR problems.
Submitted 13 August, 2024;
originally announced August 2024.
-
Potential Field Based Deep Metric Learning
Authors:
Shubhang Bhatnagar,
Narendra Ahuja
Abstract:
Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model, inspired by electrostatic fields in physics, that, instead of working with tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real-world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks (the Cars-196, CUB-200-2011, and SOP datasets), where it outperforms state-of-the-art baselines.
Submitted 28 May, 2024;
originally announced May 2024.
-
Open-Source Assessments of AI Capabilities: The Proliferation of AI Analysis Tools, Replicating Competitor Models, and the Zhousidun Dataset
Authors:
Ritwik Gupta,
Leah Walker,
Eli Glickman,
Raine Koizumi,
Sarthak Bhatnagar,
Andrew W. Reddie
Abstract:
The integration of artificial intelligence (AI) into military capabilities has become a norm for major military powers across the globe. Understanding how these AI models operate is essential for maintaining strategic advantages and ensuring security. This paper demonstrates an open-source methodology for analyzing military AI models through a detailed examination of the Zhousidun dataset, a Chinese-originated dataset that exhaustively labels critical components on American and Allied destroyers. By demonstrating the replication of a state-of-the-art computer vision model on this dataset, we illustrate how open-source tools can be leveraged to assess and understand key military AI capabilities. This methodology offers a robust framework for evaluating the performance and potential of AI-enabled military capabilities, thus enhancing the accuracy and reliability of strategic assessments.
Submitted 24 May, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
On Streaming Codes for Simultaneously Correcting Burst and Random Erasures
Authors:
Shobhit Bhatnagar,
Biswadip Chakraborty,
P. Vijay Kumar
Abstract:
Streaming codes are packet-level codes that recover dropped packets within a strict decoding-delay constraint. We study streaming codes over a sliding-window (SW) channel model which admits only those erasure patterns which allow either a single burst erasure of $\le b$ packets along with $\le e$ random packet erasures, or else, $\le a$ random packet erasures, in any sliding window of $w$ time slots. We determine the optimal rate of a streaming code constructed via the popular diagonal embedding (DE) technique over such an SW channel under delay constraint $τ=(w-1)$ and provide an $O(w)$ field size code construction. For the case $e>1$, we show that it is not possible to significantly reduce this field size requirement, assuming the well-known MDS conjecture. We then provide a block code construction whose DE yields a streaming code achieving the rate derived above, over a field of size sub-linear in $w$, for a family of parameters having $e=1$. We show the field size optimality of this construction for some parameters, and near-optimality for others under a sparsity constraint. Additionally, we derive an upper bound on the $d_{\text{min}}$ of a cyclic code and characterize cyclic codes which achieve this bound via their ability to simultaneously recover from burst and random erasures.
Submitted 10 May, 2024;
originally announced May 2024.
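The sliding-window channel described in the abstract above can be illustrated with a small admissibility check. This is only a sketch under stated assumptions: the function names are hypothetical, and any erasures falling inside $b$ consecutive slots are counted as part of the burst, which is a simplification and may differ from the paper's exact definition.

```python
from typing import List


def window_admissible(erased: List[int], a: int, b: int, e: int) -> bool:
    """Check one window, given the sorted time slots that are erased in it.

    Assumption: a window is admissible if it has at most `a` erasures, or if
    some run of `b` consecutive slots covers all but at most `e` erasures.
    """
    total = len(erased)
    if total <= a:
        return True
    # Try a burst of b consecutive slots starting at each erased slot.
    for start in erased:
        inside = sum(1 for t in erased if start <= t < start + b)
        if total - inside <= e:
            return True
    return False


def pattern_admissible(pattern: List[bool], w: int, a: int, b: int, e: int) -> bool:
    """Check every sliding window of w slots in an erasure pattern."""
    n = len(pattern)
    for s in range(max(1, n - w + 1)):
        erased = [t for t in range(s, min(s + w, n)) if pattern[t]]
        if not window_admissible(erased, a, b, e):
            return False
    return True
```

For example, `pattern_admissible([True, True, True, False, True], w=5, a=2, b=3, e=1)` treats slots 0 to 2 as the burst and slot 4 as the single random erasure.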
-
On Streaming Codes for Burst and Random Errors
Authors:
Shobhit Bhatnagar,
P. Vijay Kumar
Abstract:
Streaming codes (SCs) are packet-level codes that recover erased packets within a strict decoding-delay deadline. Streaming codes for various packet erasure channel models such as sliding-window (SW) channel models that admit random or burst erasures in any SW of a fixed length have been studied in the literature, and the optimal rate as well as rate-optimal code constructions of SCs over such channel models are known. In this paper, we study error-correcting streaming codes ($\text{SC}_{\text{ERR}}$s), i.e., packet-level codes which recover erroneous packets within a delay constraint. We study $\text{SC}_{\text{ERR}}$s for two classes of SW channel models, one that admits random packet errors, and another that admits multiple bursts of packet errors, in any SW of a fixed length. For the case of random packet errors, we establish the equivalence of an $\text{SC}_{\text{ERR}}$ and a corresponding SC that recovers from random packet erasures, thus determining the optimal rate of an $\text{SC}_{\text{ERR}}$ for this setting, and providing a rate-optimal code construction for all parameters. We then focus on SCs that recover from multiple erasure bursts and derive a rate-upper-bound for such SCs. We show the necessity of a divisibility constraint for the existence of an SC constructed by the popular diagonal embedding technique, that achieves this rate-bound under a stringent delay requirement. We then show that a construction known in the literature achieves this rate-bound when the divisibility constraint is met. We further show the equivalence of the SCs considered and $\text{SC}_{\text{ERR}}$s for the setting of multiple error bursts, under a stringent delay requirement.
Submitted 10 May, 2024;
originally announced May 2024.
-
Improving Multi-label Recognition using Class Co-Occurrence Probabilities
Authors:
Samyak Rawlekar,
Shubhang Bhatnagar,
Vishnuvardhan Pogunulu Srinivasulu,
Narendra Ahuja
Abstract:
Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-image datasets for the task. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between a pair of classes. We propose a framework to extend the independent classifiers by incorporating the co-occurrence information for object pairs to improve the performance of independent classifiers. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes, by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.
Submitted 19 September, 2024; v1 submitted 24 April, 2024;
originally announced April 2024.
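The abstract above mentions capturing class co-occurrences from training data as conditional probabilities between pairs of classes. A minimal sketch of that counting step follows, assuming binary multi-label annotations; the function name and the data layout are hypothetical, and the GCN refinement is not shown.

```python
from typing import List


def cooccurrence_probs(labels: List[List[int]]) -> List[List[float]]:
    """Return P where P[i][j] estimates P(class j present | class i present).

    `labels` holds one row per image, with a 0/1 entry per class.
    Rows of P for classes that never occur are left at 0.0.
    """
    num_classes = len(labels[0])
    count = [0] * num_classes
    joint = [[0] * num_classes for _ in range(num_classes)]
    for row in labels:
        present = [c for c, v in enumerate(row) if v == 1]
        for i in present:
            count[i] += 1
            for j in present:
                joint[i][j] += 1
    return [
        [joint[i][j] / count[i] if count[i] else 0.0 for j in range(num_classes)]
        for i in range(num_classes)
    ]
```

Note that the resulting matrix is generally asymmetric, since P(j | i) and P(i | j) are normalized by different class counts.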
-
Piecewise-Linear Manifolds for Deep Metric Learning
Authors:
Shubhang Bhatnagar,
Narendra Ahuja
Abstract:
Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.
Submitted 22 March, 2024;
originally announced March 2024.
-
Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation
Authors:
Prashansa Panda,
Shalabh Bhatnagar
Abstract:
In recent years, there has been a lot of research activity focused on carrying out non-asymptotic convergence analyses for actor-critic algorithms. Recently, a two-timescale critic-actor algorithm was presented for the discounted cost setting in the look-up table case, where the timescales of the actor and the critic are reversed and only asymptotic convergence is shown. In our work, we present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting and present the first finite-time non-asymptotic as well as asymptotic convergence analysis for such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(ε^{-2.08})$ for the mean squared error of the critic to be upper bounded by $ε$, which is better than the one obtained for two-timescale actor-critic in a similar setting. A notable feature of our analysis is that, unlike recent single-timescale actor-critic algorithms, we present a complete asymptotic convergence analysis of our scheme in addition to the finite-time bounds that we obtain, and show that the (slower) critic recursion converges asymptotically to the attractor of an associated differential inclusion with actor parameters corresponding to local maxima of a perturbed average reward objective. We also show the results of numerical experiments on three benchmark settings and observe that our critic-actor algorithm performs on par with, and in fact better than, the other algorithms considered.
Submitted 24 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Investigating the Surrogate Modeling Capabilities of Continuous Time Echo State Networks
Authors:
Saakaar Bhatnagar
Abstract:
Continuous Time Echo State Networks (CTESNs) are a promising yet under-explored surrogate modeling technique for dynamical systems, particularly those governed by stiff Ordinary Differential Equations (ODEs). A key determinant of the generalization accuracy of a CTESN surrogate is the method of projecting the reservoir state to the output. This paper shows that of the two common projection methods (linear and nonlinear), the surrogates developed via the nonlinear projection consistently outperform those developed via the linear method. CTESN surrogates are developed for several challenging benchmark cases governed by stiff ODEs, and for each case, the performance of the linear and nonlinear projections is compared. The results of this paper demonstrate the applicability of CTESNs to a variety of problems while serving as a reference for important algorithmic and hyper-parameter choices for CTESNs.
Submitted 5 January, 2024; v1 submitted 2 December, 2023;
originally announced December 2023.
-
Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes
Authors:
Lakshmi Mandal,
Chandrashekar Lakshminarayanan,
Shalabh Bhatnagar
Abstract:
In this work, we consider a cooperative multi-agent Markov decision process (MDP) involving m agents. At each decision epoch, all the m agents independently select actions in order to maximize a common long-term objective. In the policy iteration process of the multi-agent setup, the number of actions grows exponentially with the number of agents, incurring huge computational costs. Thus, recent works consider decentralized policy improvement, where each agent improves its decisions unilaterally, assuming that the decisions of the other agents are fixed. However, exact value functions are considered in the literature, which is computationally expensive for a large number of agents with high-dimensional state-action space. Thus, we propose approximate decentralized policy iteration algorithms, using approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. Further, we consider both finite-horizon and infinite-horizon discounted cooperative multi-agent MDPs and propose suitable algorithms in each case. Moreover, we provide theoretical guarantees for our algorithms and also demonstrate their advantages over existing state-of-the-art algorithms in the literature.
Submitted 29 April, 2024; v1 submitted 20 November, 2023;
originally announced November 2023.
-
Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms
Authors:
Prashansa Panda,
Shalabh Bhatnagar
Abstract:
Actor-critic methods have found wide application across a range of Reinforcement Learning tasks, especially when the state-action space is large. In this paper, we consider actor-critic and natural actor-critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d. (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(θ,γ)\Vert_2^2 \leq ε$) of the performance (Lagrange) function $L(θ,γ)$, with a sample complexity of $\tilde{\mathcal{O}}(ε^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.
Submitted 29 May, 2024; v1 submitted 25 October, 2023;
originally announced October 2023.
-
The Reinforce Policy Gradient Algorithm Revisited
Authors:
Shalabh Bhatnagar
Abstract:
We revisit the Reinforce policy gradient algorithm from the literature. Note that this algorithm typically works with cost returns obtained over random-length episodes that result either from termination upon reaching a goal state (as with episodic tasks) or from instants of visit to a prescribed recurrent state (in the case of continuing tasks). We propose a major enhancement to the basic algorithm. We estimate the policy gradient using a function measurement over a perturbed parameter by appealing to a class of random search approaches. This has advantages in the case of systems with infinite state and action spaces, as it relaxes some of the regularity requirements that would otherwise be needed for proving convergence of the Reinforce algorithm. Nonetheless, we observe that even though we estimate the gradient of the performance objective using the performance objective itself (and not via the sample gradient), the algorithm converges to a neighborhood of a local minimum. We also provide a proof of convergence for this new algorithm.
Submitted 8 October, 2023;
originally announced October 2023.
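The idea in the abstract above of estimating a gradient from a function measurement at a perturbed parameter can be sketched with a one-measurement Gaussian smoothed-functional estimator. This is an assumption about the general family of random search estimators, not necessarily the exact form used in the paper; the function names, the perturbation size `delta`, and the toy quadratic objective in the usage note are all hypothetical.

```python
import random
from typing import Callable, List


def sf_gradient_estimate(
    f: Callable[[List[float]], float],
    theta: List[float],
    delta: float,
    rng: random.Random,
) -> List[float]:
    """One-sample estimate: (direction / delta) * f(theta + delta * direction)."""
    direction = [rng.gauss(0.0, 1.0) for _ in theta]
    perturbed = [t + delta * d for t, d in zip(theta, direction)]
    value = f(perturbed)  # the single function measurement
    return [d * value / delta for d in direction]


def averaged_estimate(
    f: Callable[[List[float]], float],
    theta: List[float],
    delta: float,
    num_samples: int,
    seed: int = 0,
) -> List[float]:
    """Average many one-sample estimates to reduce their (large) variance."""
    rng = random.Random(seed)
    total = [0.0] * len(theta)
    for _ in range(num_samples):
        g = sf_gradient_estimate(f, theta, delta, rng)
        total = [s + gi for s, gi in zip(total, g)]
    return [s / num_samples for s in total]
```

As a usage example, `averaged_estimate(lambda x: sum(v * v for v in x), [1.0, -2.0], 0.5, 100000)` should land near the true gradient `[2.0, -4.0]` of the quadratic, though individual samples are very noisy.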
-
Physics Informed Neural Networks for Modeling of 3D Flow-Thermal Problems with Sparse Domain Data
Authors:
Saakaar Bhatnagar,
Andrew Comerford,
Araz Banaeizadeh
Abstract:
Successfully training Physics Informed Neural Networks (PINNs) for highly nonlinear PDEs on complex 3D domains remains a challenging task. In this paper, PINNs are employed to solve the 3D incompressible Navier-Stokes (NS) equations at moderate to high Reynolds numbers for complex geometries. The presented method utilizes very sparsely distributed solution data in the domain. A detailed investigation on the effect of the amount of supplied data and the PDE-based regularizers is presented. Additionally, a hybrid data-PINNs approach is used to generate a surrogate model of a realistic flow-thermal electronics design problem. This surrogate model provides near real-time sampling and was found to outperform standard data-driven neural networks when tested on unseen query points. The findings of the paper show how PINNs can be effective when used in conjunction with sparse data for solving 3D nonlinear PDEs or for surrogate modeling of design spaces governed by them.
Submitted 3 November, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Long-Distance Gesture Recognition using Dynamic Neural Networks
Authors:
Shubhang Bhatnagar,
Sharath Gopal,
Narendra Ahuja,
Liu Ren
Abstract:
Gestures form an important medium of communication between humans and machines. An overwhelming majority of existing gesture recognition methods are tailored to a scenario where humans and machines are located very close to each other. This short-distance assumption does not hold true for several types of interactions, for example gesture-based interactions with a floor cleaning robot or with a drone. Methods made for short-distance recognition are unable to perform well on long-distance recognition because gestures occupy only a small portion of the input data. Their performance is especially poor in resource-constrained settings, where they are not able to effectively focus their limited compute on the gesturing subject. We propose a novel, accurate and efficient method for the recognition of gestures from longer distances. It uses a dynamic neural network to select features from gesture-containing spatial regions of the input sensor data for further processing. This helps the network focus on features important for gesture recognition while discarding background features early on, thus making it more compute-efficient than other techniques. We demonstrate the performance of our method on the LD-ConGR long-distance dataset, where it outperforms previous state-of-the-art methods on recognition accuracy and compute efficiency.
Submitted 8 August, 2023;
originally announced August 2023.
-
Off-Policy Average Reward Actor-Critic with Deterministic Policy Search
Authors:
Naman Saxena,
Subhojyoti Khastigir,
Shishir Kolathaya,
Shalabh Bhatnagar
Abstract:
The average reward criterion is relatively less studied, as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. There are a few recent works that present on-policy average reward actor-critic algorithms, but average reward off-policy actor-critic is relatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm. We first show asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite-time analysis of the resulting stochastic approximation scheme with a linear function approximator and obtain an $ε$-optimal stationary policy with a sample complexity of $Ω(ε^{-2.5})$. We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments.
Submitted 19 July, 2023; v1 submitted 20 May, 2023;
originally announced May 2023.
-
A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks
Authors:
Arunselvan Ramaswamy,
Shalabh Bhatnagar,
Naman Saxena
Abstract:
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios. This algorithm combines the standard stochastic gradient descent and the gradient clipping method. The output layer is updated using clipped gradients, the rest of the neural network is updated using standard gradients. Updating the output layer using clipped gradient stabilizes it. We show that the remaining layers are automatically stabilized provided the neural network is only composed of squashing (compact range) activations. We also present a novel squashing activation function - it is obtained by modifying a Gaussian Error Linear Unit (GELU) to have compact range - we call it Truncated GELU (tGELU). Unlike other squashing activations, such as sigmoid, the range of tGELU can be explicitly specified. As a consequence, the problem of vanishing gradients that arise due to a small range, e.g., in the case of a sigmoid activation, is eliminated. We prove that a NN composed of squashing activations (tGELU, sigmoid, etc.), when updated using the algorithm presented herein, is numerically stable and has consistent performance (low variance). The theory is supported by extensive experiments. Within reinforcement learning, as a consequence of our study, we show that target networks in Deep Q-Learning can be omitted, greatly speeding up learning and alleviating memory requirements. Cross-entropy based classification algorithms that suffer from high variance issues are more consistent when trained using our framework. One symptom of numerical instability in training is the high variance of the neural network update values. We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
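As a rough illustration of the update rule described in this abstract, the sketch below clips only the output-layer gradient and applies the remaining gradients unchanged. The clipping-by-norm rule, the flat parameter layout, and the names `clip_by_norm` and `sgd_step` are assumptions made for illustration; the abstract does not specify them.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def sgd_step(params, grads, lr, max_norm):
    """One SGD step: clipped gradient for the 'output' layer, plain gradient elsewhere."""
    new_params = {}
    for name, weights in params.items():
        g = grads[name]
        if name == "output":
            g = clip_by_norm(g, max_norm)
        new_params[name] = [w - lr * gi for w, gi in zip(weights, g)]
    return new_params

# Hypothetical two-layer parameter set with identical gradients on both layers.
params = {"hidden": [1.0, 1.0], "output": [1.0, 1.0]}
grads = {"hidden": [3.0, 4.0], "output": [3.0, 4.0]}
updated = sgd_step(params, grads, lr=0.1, max_norm=1.0)
```

With identical gradients on both layers, the hidden layer moves by the full step while the output layer moves by a step whose norm is capped at `lr * max_norm`.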
Submitted 20 May, 2023;
originally announced May 2023.
-
A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning
Authors:
Mizhaan Prajit Maniyar,
Akash Mondal,
Prashanth L. A.,
Shalabh Bhatnagar
Abstract:
We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $ε$-SOSP is $O(ε^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(ε^{-4.5})$.
Submitted 21 April, 2023;
originally announced April 2023.
-
n-Step Temporal Difference Learning with Optimal n
Authors:
Lakshmi Mandal,
Shalabh Bhatnagar
Abstract:
We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. Our objective function for the optimization problem is the average root mean squared error (RMSE). We find the optimal n by resorting to a model-free optimization technique involving a one-simulation simultaneous perturbation stochastic approximation (SPSA) based procedure. Whereas SPSA is a zeroth-order continuous optimization procedure, we adapt it to the discrete optimization setting by using a random projection operator. We prove the asymptotic convergence of the recursion by showing that the sequence of n-updates obtained using zeroth-order stochastic gradient search converges almost surely to an internally chain transitive invariant set of an associated differential inclusion. This results in convergence of the discrete parameter sequence to the optimal n in n-step TD. Through experiments, we show that the optimal value of n is achieved with our SDPSA algorithm for arbitrary initial values. We further show using numerical evaluations that SDPSA outperforms the state-of-the-art discrete parameter stochastic optimization algorithm Optimal Computing Budget Allocation (OCBA) on benchmark RL tasks.
Submitted 17 July, 2024; v1 submitted 13 March, 2023;
originally announced March 2023.
-
Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions
Authors:
Jesse Islam,
Maxime Turgeon,
Robert Sladek,
Sahir Bhatnagar
Abstract:
In the context of survival analysis, data-driven neural network-based methods have been developed to model complex covariate effects. While these methods may provide better predictive performance than regression-based approaches, not all can model time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNNs) as a new approach that combines the case-base sampling framework with flexible neural network architectures. Using a novel sampling scheme and data augmentation to naturally account for censoring, we construct a feed-forward neural network that includes time as an input. CBNNs predict the probability of an event occurring at a given moment to estimate the full hazard function. We compare the performance of CBNNs to regression and neural network-based survival methods in a simulation and three case studies using two time-dependent metrics. First, we examine performance on a simulation involving a complex baseline hazard and time-varying interactions to assess all methods, with CBNN outperforming competitors. Then, we apply all methods to three real data applications, with CBNNs outperforming the competing models in two studies and showing similar performance in the third. Our results highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible framework for data-driven modeling of single event survival outcomes that estimates time-varying effects and a complex baseline hazard by design. An R package is available at https://github.com/Jesse-Islam/cbnn.
Submitted 9 January, 2024; v1 submitted 16 January, 2023;
originally announced January 2023.
-
Generalized Simultaneous Perturbation-based Gradient Search with Reduced Estimator Bias
Authors:
Soumen Pachal,
Shalabh Bhatnagar,
L. A. Prashanth
Abstract:
We present in this paper a family of generalized simultaneous perturbation-based gradient search (GSPGS) estimators that use noisy function measurements. The number of function measurements required by each estimator is guided by the desired level of accuracy. We first present in detail unbalanced generalized simultaneous perturbation stochastic approximation (GSPSA) estimators and later present the balanced versions (B-GSPSA) of these. We extend this idea further and present the generalized smoothed functional (GSF) and generalized random directions stochastic approximation (GRDSA) estimators, respectively, as well as their balanced variants. We show that, within any specified class, estimators requiring a larger number of function measurements have lower estimator bias. We present a detailed analysis of both the asymptotic and non-asymptotic convergence of the resulting stochastic approximation schemes. We further present a series of experimental results with the various GSPGS estimators on the Rastrigin and quadratic function objectives. Our experiments are seen to validate our theoretical findings.
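For orientation, the sketch below shows the standard two-measurement SPSA gradient estimate that this family of estimators builds on. It is not one of the generalized estimators from the paper. The function names, the noise-free quadratic objective, and the exhaustive averaging over all ±1 perturbations are assumptions chosen so the expectation can be checked without sampling noise.

```python
import itertools

def spsa_gradient(f, theta, delta, perturbation):
    """Two-measurement SPSA estimate of the gradient of f at theta."""
    plus = [t + delta * p for t, p in zip(theta, perturbation)]
    minus = [t - delta * p for t, p in zip(theta, perturbation)]
    diff = f(plus) - f(minus)
    # Every coordinate reuses the same two function evaluations.
    return [diff / (2.0 * delta * p) for p in perturbation]

def quadratic(x):
    """Illustrative noise-free objective with gradient 2 * x."""
    return sum(v * v for v in x)

theta = [1.0, -2.0, 0.5]
# Average over every +/-1 perturbation vector to obtain the exact expectation.
all_perturbations = list(itertools.product([-1.0, 1.0], repeat=len(theta)))
estimates = [spsa_gradient(quadratic, theta, 0.1, p) for p in all_perturbations]
mean_estimate = [sum(col) / len(estimates) for col in zip(*estimates)]
```

For this quadratic, the averaged estimate matches the true gradient `2 * theta`. A single random perturbation gives a noisy estimate, and in practice the function measurements themselves are noisy.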
Submitted 12 November, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
CASPR: Customer Activity Sequence-based Prediction and Representation
Authors:
Pin-Jung Chen,
Sahil Bhatnagar,
Sagar Goyal,
Damian Konrad Kowalczyk,
Mayank Shrivastava
Abstract:
Tasks critical to enterprise profitability, such as customer churn prediction, fraudulent account detection or customer lifetime value estimation, are often tackled by models trained on features engineered from customer data in tabular format. Application-specific feature engineering adds development, operationalization and maintenance costs over time. Recent advances in representation learning present an opportunity to simplify and generalize feature engineering across applications. When applying these advancements to tabular data researchers deal with data heterogeneity, variations in customer engagement history or the sheer volume of enterprise datasets. In this paper, we propose a novel approach to encode tabular data containing customer transactions, purchase history and other interactions into a generic representation of a customer's association with the business. We then evaluate these embeddings as features to train multiple models spanning a variety of applications. CASPR, Customer Activity Sequence-based Prediction and Representation, applies Transformer architecture to encode activity sequences to improve model performance and avoid bespoke feature engineering across applications. Our experiments at scale validate CASPR for both small and large enterprise applications.
Submitted 28 November, 2022; v1 submitted 16 November, 2022;
originally announced November 2022.
-
Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm
Authors:
Ashish Kumar Jayant,
Shalabh Bhatnagar
Abstract:
During initial iterations of training in most Reinforcement Learning (RL) algorithms, agents perform a significant number of random exploratory steps. In the real world, this can limit the practicality of these algorithms as it can lead to potentially dangerous behavior. Hence safe exploration is a critical issue in applying RL algorithms in the real world. This problem has been recently well studied under the Constrained Markov Decision Process (CMDP) Framework, where in addition to single-stage rewards, an agent receives single-stage costs or penalties as well depending on the state transitions. The prescribed cost functions are responsible for mapping undesirable behavior at any given time-step to a scalar value. The goal then is to find a feasible policy that maximizes reward returns while constraining the cost returns to be below a prescribed threshold during training as well as deployment.
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner as well as find a feasible optimal policy using the Lagrangian Relaxation-based Proximal Policy Optimization. We use an ensemble of neural networks with different initializations to tackle epistemic and aleatoric uncertainty issues faced during environment model learning. We compare our approach with relevant model-free and model-based approaches in Constrained RL using the challenging Safe Reinforcement Learning benchmark, the OpenAI Safety Gym. We demonstrate that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches. Further, our approach shows better reward performance than other constrained model-based approaches in the literature.
Submitted 14 October, 2022;
originally announced October 2022.
-
A policy gradient approach for Finite Horizon Constrained Markov Decision Processes
Authors:
Soumyajit Guin,
Shalabh Bhatnagar
Abstract:
The infinite horizon setting is widely adopted for problems of reinforcement learning (RL). These invariably result in stationary policies that are optimal. In many situations, finite horizon control problems are of interest and for such problems, the optimal policies are time-varying in general. Another setting that has become popular in recent times is of Constrained Reinforcement Learning, where the agent maximizes its rewards while it also aims to satisfy some given constraint criteria. However, this setting has only been studied in the context of infinite horizon MDPs where stationary policies are optimal. We present an algorithm for constrained RL in the Finite Horizon Setting where the horizon terminates after a fixed (finite) time. We use function approximation in our algorithm which is essential when the state and action spaces are large or continuous and use the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also compare and analyze the performance of our algorithm through experiments and show that our algorithm performs better than some other well known algorithms.
Submitted 14 October, 2024; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Actor-Critic or Critic-Actor? A Tale of Two Time Scales
Authors:
Shalabh Bhatnagar,
Vivek S. Borkar,
Soumyajit Guin
Abstract:
We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
Submitted 13 June, 2024; v1 submitted 10 October, 2022;
originally announced October 2022.
-
An Agent-Based Fleet Management Model for First- and Last-Mile Services
Authors:
Saumya Bhatnagar,
Tarun Rambha,
Gitakrishnan Ramadurai
Abstract:
With the growth of cars and car-sharing applications, commuters in many cities, particularly developing countries, are shifting away from public transport. These shifts have affected two key stakeholders: transit operators and first- and last-mile (FLM) services. Although most cities continue to invest heavily in bus and metro projects to make public transit attractive, ridership in these systems has often failed to reach targeted levels. FLM service providers also experience lower demand and revenues in the wake of shifts to other means of transport. Effective FLM options are required to prevent this phenomenon and make public transport attractive for commuters. One possible solution is to forge partnerships between public transport and FLM providers that offer competitive joint mobility options. Such solutions require prudent allocation of supply and optimised strategies for FLM operations and ride-sharing. To this end, we build an agent- and event-based simulation model which captures interactions between passengers and FLM services using statecharts, vehicle routing models, and other trip matching rules. An optimisation model for allocating FLM vehicles at different transit stations is proposed to reduce unserved requests. Using real-world metro transit demand data from Bengaluru, India, the effectiveness of our approach in improving FLM connectivity and quantifying the benefits of sharing trips is demonstrated.
Submitted 4 December, 2022; v1 submitted 9 August, 2022;
originally announced August 2022.
-
A Gradient Smoothed Functional Algorithm with Truncated Cauchy Random Perturbations for Stochastic Optimization
Authors:
Akash Mondal,
Prashanth L. A.,
Shalabh Bhatnagar
Abstract:
In this paper, we present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples, and only the latter are observed for any given parameter. Our algorithm employs a gradient estimation scheme with random perturbations, which are formed using the truncated Cauchy distribution from the delta sphere. We analyze the bias and variance of the proposed gradient estimator. Our algorithm is found to be particularly useful in the case when the objective function is non-convex, and the parameter dimension is high. From an asymptotic convergence analysis, we establish that our algorithm converges almost surely to the set of stationary points of the objective function and obtain the asymptotic convergence rate. We also show that our algorithm avoids unstable equilibria, implying convergence to local minima. Further, we perform a non-asymptotic convergence analysis of our algorithm. In particular, we establish here a non-asymptotic bound for finding an epsilon-stationary point of the non-convex objective function. Finally, we demonstrate numerically through simulations that our algorithm outperforms GSF, SPSA, and RDSA by a significant margin over a few non-convex settings and further validate its performance over convex (noisy) objectives.
Submitted 30 June, 2023; v1 submitted 30 July, 2022;
originally announced August 2022.
-
Reinforcement Learning for Task Specifications with Action-Constraints
Authors:
Arun Raman,
Keerthan Shagrithaya,
Shalabh Bhatnagar
Abstract:
In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe are given in terms of a finite-state automaton; and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequence are satisfied. Then we present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning and show the results of simulations in this setting.
Submitted 1 January, 2022;
originally announced January 2022.
-
Dynamic Mirror Descent based Model Predictive Control for Accelerating Robot Learning
Authors:
Utkarsh A. Mishra,
Soumya R. Samineni,
Prakhar Goel,
Chandravaran Kunjeti,
Himanshu Lodha,
Aman Singh,
Aditya Sagi,
Shalabh Bhatnagar,
Shishir Kolathaya
Abstract:
Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best from both: asymptotic performance of Mf-RL and high sample-efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm: Mirror-Descent Model Predictive RL (M-DeMoRL), which uses Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking in a 2R leg and hardware transfer for robust walking in a quadruped. We show that the inner-loop Mb-RL significantly decreases the number of training iterations required in the real system, thereby validating the proposed approach.
Submitted 4 November, 2021;
originally announced December 2021.
-
Schedule Based Temporal Difference Algorithms
Authors:
Rohan Deb,
Meet Gandhi,
Shalabh Bhatnagar
Abstract:
Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD($λ$) is a popular class of algorithms to solve this problem. However, the weights assigned to different $n$-step returns in TD($λ$), controlled by the parameter $λ$, decrease exponentially with increasing $n$. In this paper, we present a $λ$-schedule procedure that generalizes the TD($λ$) algorithm to the case when the parameter $λ$ could vary with time-step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{λ_t\}_{t \geq 1}$. Based on this procedure, we propose an on-policy algorithm - TD($λ$)-schedule, and two off-policy algorithms - GTD($λ$)-schedule and TDC($λ$)-schedule, respectively. We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
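To make the weighting argument concrete, the sketch below computes the weights placed on successive $n$-step returns. For a constant $λ$ it reproduces the familiar $(1-λ)λ^{n-1}$ decay. For a sequence $\{λ_t\}$ it uses the form $w_n = (1-λ_n)\prod_{i<n} λ_i$, which is an assumed generalization chosen for illustration and may differ from the paper's exact definition. The function name `lambda_return_weights` is likewise hypothetical.

```python
def lambda_return_weights(lambdas):
    """Weight on the n-step return, n = 1..len(lambdas), for lambda_1, lambda_2, ...

    Assumed form: w_n = (1 - lambda_n) * prod_{i < n} lambda_i, which reduces to
    (1 - lambda) * lambda**(n - 1) when every lambda_i equals lambda.
    """
    weights = []
    carried = 1.0  # running product of lambda_i for i < n
    for lam in lambdas:
        weights.append((1.0 - lam) * carried)
        carried *= lam
    return weights

# Constant lambda: weights decay exponentially with n.
constant = lambda_return_weights([0.5, 0.5, 0.5, 0.5])
# Time-varying schedule: the weights no longer follow an exponential decay.
scheduled = lambda_return_weights([0.9, 0.5, 0.0])
```

Under this assumed form, a schedule ending in $λ_n = 0$ places all remaining weight on the $n$-step return, so the weights sum to one over a finite number of returns.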
Submitted 23 November, 2021;
originally announced November 2021.
-
Gradient Temporal Difference with Momentum: Stability and Convergence
Authors:
Rohan Deb,
Shalabh Bhatnagar
Abstract:
Gradient temporal difference (Gradient TD) algorithms are a popular class of stochastic approximation (SA) algorithms used for policy evaluation in reinforcement learning. Here, we consider Gradient TD algorithms with an additional heavy ball momentum term and provide choice of step size and momentum parameter that ensures almost sure convergence of these algorithms asymptotically. In doing so, we decompose the heavy ball Gradient TD iterates into three separate iterates with different step sizes. We first analyze these iterates under one-timescale SA setting using results from current literature. However, the one-timescale case is restrictive and a more general analysis can be provided by looking at a three-timescale decomposition of the iterates. In the process, we provide the first conditions for stability and convergence of general three-timescale SA. We then prove that the heavy ball Gradient TD algorithm is convergent using our three-timescale SA analysis. Finally, we evaluate these algorithms on standard RL problems and report improvement in performance over the vanilla algorithms.
Submitted 22 November, 2021;
originally announced November 2021.
-
Finite Horizon Q-learning: Stability, Convergence, Simulations and an application on Smart Grids
Authors:
Vivek VP,
Dr. Shalabh Bhatnagar
Abstract:
Q-learning is a popular reinforcement learning algorithm. This algorithm has however been studied and analysed mainly in the infinite horizon setting. There are several important applications which can be modeled in the framework of finite horizon Markov decision processes. We develop a version of the Q-learning algorithm for finite horizon Markov decision processes (MDPs) and provide a full proof of its stability and convergence. Our analysis of stability and convergence of finite horizon Q-learning is based entirely on the ordinary differential equations (ODE) method. We also demonstrate the performance of our algorithm on a setting of random MDPs as well as on an application to smart grids.
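A minimal tabular sketch of the finite-horizon idea follows: one Q-table is kept per stage, and the target for stage $h$ bootstraps from stage $h+1$. This is not the paper's algorithm. The zero terminal value, the constant step size, the systematic sweeps over all state-action pairs (in place of sampled trajectories), and the toy two-state MDP are all assumptions made for illustration.

```python
def finite_horizon_q_learning(step, n_states, n_actions, horizon, alpha, sweeps):
    """Tabular Q-learning with one Q-table per stage h = 0..horizon-1.

    q[horizon] stays at zero (assumed zero terminal reward).
    """
    q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(horizon + 1)]
    for _ in range(sweeps):
        for h in range(horizon):
            for s in range(n_states):
                for a in range(n_actions):
                    reward, s_next = step(s, a)
                    # Bootstrap from the next stage's table, not the current one.
                    target = reward + max(q[h + 1][s_next])
                    q[h][s][a] += alpha * (target - q[h][s][a])
    return q

def toy_step(state, action):
    """Hypothetical deterministic MDP: the action picks the next state; entering state 1 pays 1."""
    return (1.0 if action == 1 else 0.0), action

q = finite_horizon_q_learning(toy_step, n_states=2, n_actions=2,
                              horizon=3, alpha=0.5, sweeps=200)
```

In this toy problem the learned values depend on the stage: with three steps remaining, the best action is worth about 3, and with one step remaining it is worth about 1. This is why the resulting policy is indexed by stage in general.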
Submitted 6 August, 2022; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Memory Efficient Adaptive Attention For Multiple Domain Learning
Authors:
Himanshu Pradeep Aswani,
Abhiraj Sunil Kanse,
Shubhang Bhatnagar,
Amit Sethi
Abstract:
Training CNNs from scratch on new domains typically demands large numbers of labeled images and computations, which is not suitable for low-power hardware. One way to reduce these requirements is to modularize the CNN architecture and freeze the weights of the heavier modules, that is, the lower layers after pre-training. Recent studies have proposed alternative modular architectures and schemes that lead to a reduction in the number of trainable parameters needed to match the accuracy of fully fine-tuned CNNs on new domains. Our work suggests that a further reduction in the number of trainable parameters by an order of magnitude is possible. Furthermore, we propose that new modularization techniques for multiple domain learning should also be compared on other realistic metrics, such as the number of interconnections needed between the fixed and trainable modules, the number of training samples needed, the order of computations required and the robustness to partial mislabeling of the training data. On all of these criteria, the proposed architecture demonstrates advantages over or matches the current state-of-the-art.
Submitted 21 October, 2021;
originally announced October 2021.
-
Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm
Authors:
Raghuram Bharadwaj Diddigi,
Prateek Jain,
Prabuchandran K. J.,
Shalabh Bhatnagar
Abstract:
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL). This is known as "off-policy control" in RL where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy). As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the "off-policy" setting compared to the "on-policy" setting where new data from the policy updates will be utilized in learning. This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. The existing natural gradient-based actor-critic algorithms with convergence guarantees require fixed features for approximating both policy and value functions. This often leads to sub-optimal learning in many RL applications. On the other hand, our proposed algorithm utilizes compatible features that enable one to use arbitrary neural networks to approximate the policy and the value function and guarantee convergence to a locally optimal policy. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.
Submitted 15 June, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
Analyzing Cross Validation In Compressed Sensing With Mixed Gaussian And Impulse Measurement Noise With L1 Errors
Authors:
Chinmay Gurjarpadhye,
Shubhang Bhatnagar,
Ajit Rajwade
Abstract:
Compressed sensing (CS) involves sampling signals at rates less than their Nyquist rates and attempting to reconstruct them after sample acquisition. Most such algorithms have parameters, for example the regularization parameter in LASSO, which need to be chosen carefully for optimal performance. These parameters can be chosen based on assumptions on the noise level or signal sparsity, but this knowledge may often be unavailable. In such cases, cross validation (CV) can be used to choose these parameters in a purely data-driven fashion. Previous work analysing the use of CV in CS has been based on the $\ell_2$ cross-validation error with Gaussian measurement noise. But it is well known that the $\ell_2$ error is not robust to impulse noise and provides a poor estimate of the recovery error, failing to choose the best parameter. Here we propose using the $\ell_1$ CV error which provides substantial performance benefits given impulse measurement noise. Most importantly, we provide a detailed theoretical analysis and error bounds for the use of $\ell_1$ CV error in CS reconstruction. We show that with high probability, choosing the parameter that yields the minimum $\ell_1$ CV error is equivalent to choosing the minimum recovery error (which is not observable in practice). To our best knowledge, this is the first paper which theoretically analyzes $\ell_1$-based CV in CS.
Submitted 19 February, 2021;
originally announced February 2021.
-
Attention Actor-Critic algorithm for Multi-Agent Constrained Co-operative Reinforcement Learning
Authors:
P. Parnika,
Raghuram Bharadwaj Diddigi,
Sai Koti Reddy Danda,
Shalabh Bhatnagar
Abstract:
In this work, we consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting, where the objective is to optimize a common goal. However, in many real-life applications, in addition to optimizing the goal, the agents are required to satisfy certain constraints specified on their actions. Under this setting, the objective of the agents is to not only learn the actions that optimize the common objective but also meet the specified constraints. In recent times, the Actor-Critic algorithm with an attention mechanism has been successfully applied to obtain optimal actions for RL agents in multi-agent environments. In this work, we extend this algorithm to the constrained multi-agent RL setting. The idea here is that optimizing the common goal and satisfying the constraints may require different modes of attention. By incorporating different attention modes, the agents can select useful information required for optimizing the objective and satisfying the constraints separately, thereby yielding better actions. Through experiments on benchmark multi-agent environments, we show the effectiveness of our proposed algorithm.
Submitted 6 January, 2021;
originally announced January 2021.
-
Robust Quadrupedal Locomotion on Sloped Terrains: A Linear Policy Approach
Authors:
Kartik Paigwar,
Lokesh Krishna,
Sashank Tirumala,
Naman Khetan,
Aditya Sagi,
Ashish Joglekar,
Shalabh Bhatnagar,
Ashitava Ghosal,
Bharadwaj Amrutur,
Shishir Kolathaya
Abstract:
In this paper, with a view toward fast deployment of locomotion gaits in low-cost hardware, we use a linear policy for realizing end-foot trajectories in the quadruped robot, Stoch 2. In particular, the parameters of the end-foot trajectories are shaped via a linear feedback policy that takes the torso orientation and the terrain slope as inputs. The corresponding desired joint angles are obtained via an inverse kinematics solver and tracked via a PID control law. Augmented Random Search, a model-free and gradient-free learning algorithm, is used to train this linear policy. Simulation results show that the resulting walking is robust to terrain slope variations and external pushes. This methodology is not only computationally light-weight but also uses minimal sensing and actuation capabilities in the robot, thereby justifying the approach.
Submitted 10 November, 2020; v1 submitted 30 October, 2020;
originally announced October 2020.
-
PAL : Pretext-based Active Learning
Authors:
Shubhang Bhatnagar,
Sachin Goyal,
Darshan Tank,
Amit Sethi
Abstract:
The goal of pool-based active learning is to judiciously select a fixed-sized subset of unlabeled samples from a pool to query an oracle for their labels, in order to maximize the accuracy of a supervised learner. However, the unsaid requirement that the oracle should always assign correct labels is unreasonable for most situations. We propose an active learning technique for deep neural networks that is more robust to mislabeling than the previously proposed techniques. Previous techniques rely on the task network itself to estimate the novelty of the unlabeled samples, but learning the task (generalization) and selecting samples (out-of-distribution detection) can be conflicting goals. We use a separate network to score the unlabeled samples for selection. The scoring network relies on self-supervision for modeling the distribution of the labeled samples to reduce the dependency on potentially noisy labels. To counter the paucity of data, we also deploy another head on the scoring network for regularization via multi-task learning and use an unusual self-balancing hybrid scoring function. Furthermore, we divide each query into sub-queries before labeling to ensure that the query has diverse samples. In addition to having a higher tolerance to mislabeling of samples by the oracle, the resultant technique also produces competitive accuracy in the absence of label noise. The technique also handles the introduction of new classes on-the-fly well by temporarily increasing the sampling rate of these classes.
Submitted 28 March, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
Hindsight Experience Replay with Kronecker Product Approximate Curvature
Authors:
Dhuruva Priyan G M,
Abhik Singla,
Shalabh Bhatnagar
Abstract:
Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks in sparse-reward environments. However, due to its reduced sample efficiency and slower convergence, HER fails to perform effectively. Natural gradients address these challenges by converging the model parameters better. They avoid taking bad actions that collapse the training performance. However, updating parameters in neural networks requires expensive computation and thus increases training time. Our proposed method solves the above-mentioned challenges with better sample efficiency and faster convergence with an increased success rate. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking because it exploits the errors in the Q-function. We solve this issue by including Twin Delayed Deep Deterministic Policy Gradients (TD3) in HER. TD3 learns two Q-functions instead of one and adds noise to the target action, making it harder for the policy to exploit Q-function errors. The experiments are done with the help of OpenAI's MuJoCo environments. Results on these environments show that our algorithm (TDHER+KFAC) performs better in most of the scenarios.
Submitted 9 October, 2020;
originally announced October 2020.
-
A reinforcement learning approach to hybrid control design
Authors:
Meet Gandhi,
Atreyee Kundu,
Shalabh Bhatnagar
Abstract:
In this paper we design hybrid control policies for hybrid systems whose mathematical models are unknown. Our contributions are threefold. First, we propose a framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP). This result facilitates the application of off-the-shelf algorithms from Reinforcement Learning (RL) literature towards designing optimal control policies. Second, we model a set of benchmark examples of hybrid control design problem in the proposed MDP framework. Third, we adapt the recently proposed Proximal Policy Optimisation (PPO) algorithm for the hybrid action space and apply it to the above set of problems. It is observed that in each case the algorithm converges and finds the optimal policy.
Submitted 2 September, 2020;
originally announced September 2020.
-
Computer Model Calibration with Time Series Data using Deep Learning and Quantile Regression
Authors:
Saumya Bhatnagar,
Won Chang,
Seonjin Kim,
Jiali Wang
Abstract:
Computer models play a key role in many scientific and engineering problems. One major source of uncertainty in computer model experiments is input parameter uncertainty. Computer model calibration is a formal statistical procedure to infer input parameters by combining information from model runs and observational data. The existing standard calibration framework suffers from inferential issues when the model output and observational data are high-dimensional dependent data, such as large time series, due to the difficulty in building an emulator and the non-identifiability between effects from input parameters and data-model discrepancy. To overcome these challenges, we propose a new calibration framework based on a deep neural network (DNN) with long short-term memory layers that directly emulates the inverse relationship between the model output and input parameters. Adopting the 'learning with noise' idea, we train our DNN model to filter out the effects of data-model discrepancy on input parameter inference. We also formulate a new way to construct interval predictions for DNNs using quantile regression to quantify the uncertainty in input parameter estimates. Through a simulation study and a real data application with the WRF-Hydro model, we show that our approach can yield accurate point estimates and well-calibrated interval estimates for input parameters.
Submitted 8 September, 2020; v1 submitted 29 August, 2020;
originally announced August 2020.
-
Learning Stable Manoeuvres in Quadruped Robots from Expert Demonstrations
Authors:
Sashank Tirumala,
Sagar Gubbi,
Kartik Paigwar,
Aditya Sagi,
Ashish Joglekar,
Shalabh Bhatnagar,
Ashitava Ghosal,
Bharadwaj Amrutur,
Shishir Kolathaya
Abstract:
With research into the development of quadruped robots picking up pace, learning-based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities in a stable manner. In this paper, we propose a two-pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radii. These policies are then augmented using a higher-level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in the target velocity and radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires far fewer expert demonstrations than standard neural network architectures. Finally, we demonstrate these results experimentally on the in-house quadruped Stoch 2.
Submitted 28 July, 2020;
originally announced July 2020.
-
A Stochastic Game Framework for Efficient Energy Management in Microgrid Networks
Authors:
Shravan Nayak,
Chanakya Ajit Ekbote,
Annanya Pratap Singh Chauhan,
Raghuram Bharadwaj Diddigi,
Prishita Ray,
Abhinava Sikdar,
Sai Koti Reddy Danda,
Shalabh Bhatnagar
Abstract:
We consider the problem of energy management in microgrid networks. A microgrid is capable of generating a limited amount of energy from a renewable resource and is responsible for handling the demands of its dedicated customers. Owing to the variable nature of renewable generation and the demands of the customers, it becomes imperative that each microgrid optimally manages its energy. This involves intelligently scheduling the demands at the customer side, selling (when there is a surplus) and buying (when there is a deficit) the power from its neighboring microgrids depending on its current and future needs. Typically, the transaction of power among the microgrids happens at a pre-decided price by the central grid. In this work, we formulate the problems of demand and battery scheduling, energy trading and dynamic pricing (where we allow the microgrids to decide the price of the transaction depending on their current configuration of demand and renewable energy) in the framework of stochastic games. Subsequently, we propose a novel approach that makes use of independent learners Deep Q-learning algorithm to solve this problem. Through extensive empirical evaluation, we show that our proposed framework is more beneficial to the majority of the microgrids and we provide a detailed analysis of the results.
Submitted 15 November, 2020; v1 submitted 5 February, 2020;
originally announced February 2020.
-
Gait Library Synthesis for Quadruped Robots via Augmented Random Search
Authors:
Sashank Tirumala,
Aditya Sagi,
Kartik Paigwar,
Ashish Joglekar,
Shalabh Bhatnagar,
Ashitava Ghosal,
Bharadwaj Amrutur,
Shishir Kolathaya
Abstract:
In this paper, with a view toward fast deployment of learned locomotion gaits in low-cost hardware, we generate a library of walking trajectories, namely, forward trot, backward trot, side-step, and turn in our custom-built quadruped robot, Stoch 2, using reinforcement learning. There are existing approaches that determine optimal policies for each time step, whereas we determine an optimal policy, in the form of end-foot trajectories, for each half walking step i.e., swing phase and stance phase. The way-points for the foot trajectories are obtained from a linear policy, i.e., a linear function of the states of the robot, and cubic splines are used to interpolate between these points. Augmented Random Search, a model-free and gradient-free learning algorithm is used to learn the policy in simulation. This learned policy is then deployed on hardware, yielding a trajectory in every half walking step. Different locomotion patterns are learned in simulation by enforcing a preconfigured phase shift between the trajectories of different legs. The transition from one gait to another is achieved by using a low-pass filter for the phase, and the sim-to-real transfer is improved by a linear transformation of the states obtained through regression.
Submitted 30 December, 2019;
originally announced December 2019.
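The abstract above trains a linear policy with Augmented Random Search (ARS). The sketch below shows only the basic ARS parameter update: perturb the parameters along random directions, compare the rewards of the plus and minus perturbations, and scale the step by the standard deviation of the collected rewards. It omits state normalization and top-direction selection. The reward function here is a hypothetical stand-in (negative squared distance to a made-up target vector), not a robot simulation, and all names and hyperparameters are assumptions.

```python
import random
import statistics


def ars(reward_fn, dim, iters=200, n_dirs=4, step=0.05, noise=0.1, seed=0):
    """Basic Augmented Random Search over a flat parameter vector."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(iters):
        # Sample random search directions.
        deltas = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_dirs)]
        # Evaluate the reward at theta + noise*delta and theta - noise*delta.
        r_plus = [reward_fn([t + noise * d for t, d in zip(theta, delta)])
                  for delta in deltas]
        r_minus = [reward_fn([t - noise * d for t, d in zip(theta, delta)])
                   for delta in deltas]
        # Normalize by the spread of the rewards; fall back to 1.0 if it is zero.
        sigma = statistics.pstdev(r_plus + r_minus) or 1.0
        for i in range(dim):
            direction = sum((rp - rm) * delta[i]
                            for rp, rm, delta in zip(r_plus, r_minus, deltas))
            theta[i] += step / (n_dirs * sigma) * direction
    return theta


TARGET = [1.0, -2.0, 0.5]  # hypothetical "good" parameters for the toy reward


def toy_reward(theta):
    # Stand-in for an episode return: higher is better, maximal at TARGET.
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


theta = ars(toy_reward, dim=3)
```

In a locomotion setting, `reward_fn` would instead roll out the linear policy in simulation and return the episode reward. The update itself needs no gradients of the policy or the model.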
-
Hierarchical Average Reward Policy Gradient Algorithms
Authors:
Akshay Dharmavaram,
Matthew Riemer,
Shalabh Bhatnagar
Abstract:
Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.
Submitted 20 November, 2019;
originally announced November 2019.
-
A Convergent Off-Policy Temporal Difference Algorithm
Authors:
Raghuram Bharadwaj Diddigi,
Chandramouli Kamanchi,
Shalabh Bhatnagar
Abstract:
Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm.
Submitted 13 November, 2019;
originally announced November 2019.
-
Generalized Speedy Q-learning
Authors:
Indu John,
Chandramouli Kamanchi,
Shalabh Bhatnagar
Abstract:
In this paper, we derive a generalization of the Speedy Q-learning (SQL) algorithm that was proposed in the Reinforcement Learning (RL) literature to handle slow convergence of Watkins' Q-learning. In most RL algorithms such as Q-learning, the Bellman equation and the Bellman operator play an important role. It is possible to generalize the Bellman operator using the technique of successive relaxation. We use the generalized Bellman operator to derive a simple and efficient family of algorithms called Generalized Speedy Q-learning (GSQL-w) and analyze its finite time performance. We show that GSQL-w has an improved finite time performance bound compared to SQL for the case when the relaxation parameter w is greater than 1. This improvement is a consequence of the contraction factor of the generalized Bellman operator being less than that of the standard Bellman operator. Numerical experiments are provided to demonstrate the empirical performance of the GSQL-w algorithm.
Submitted 12 February, 2020; v1 submitted 1 November, 2019;
originally announced November 2019.
-
A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games
Authors:
Raghuram Bharadwaj Diddigi,
Chandramouli Kamanchi,
Shalabh Bhatnagar
Abstract:
We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff, starting from a given state is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation that has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques, under an assumption on the boundedness of iterates. Through experiments, we demonstrate the effectiveness of our proposed algorithm.
Submitted 18 March, 2022; v1 submitted 16 June, 2019;
originally announced June 2019.