Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks

Authors: Joji Joseph, Bharadwaj Amrutur, Shalabh Bhatnagar

Abstract: 3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at https://jojijoseph.github.io/3dgs-segmentation.

Submitted 17 September, 2024; originally announced September 2024.

Comments: Preprint, Under review for ICRA 2025
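
The voting idea described in this abstract can be illustrated with a toy sketch. It assumes that the influence of each Gaussian on each pixel (for instance, the magnitude of the gradient of a rendered pixel with respect to that Gaussian) has already been computed and is passed in as a nested list. The names `vote_segmentation`, `influence`, and `masks`, and the simple majority rule, are illustrative assumptions rather than the authors' implementation.

```python
def vote_segmentation(influence, masks):
    """Label each Gaussian as foreground (True) or background (False).

    influence[v][p][g]: assumed non-negative gradient magnitude of pixel p
        in view v with respect to Gaussian g (hypothetical precomputed input).
    masks[v][p]: 1 if pixel p of view v lies inside the 2D mask, else 0.
    """
    num_gaussians = len(influence[0][0])
    votes_in = [0.0] * num_gaussians   # votes from pixels inside the mask
    votes_out = [0.0] * num_gaussians  # votes from pixels outside the mask
    for view, mask in zip(influence, masks):
        for pixel_grads, inside in zip(view, mask):
            target = votes_in if inside else votes_out
            for g, grad in enumerate(pixel_grads):
                target[g] += grad
    # Assumed decision rule: a Gaussian is foreground when masked votes dominate.
    return [vin > vout for vin, vout in zip(votes_in, votes_out)]


# Two views, two pixels each, three Gaussians (made-up numbers).
influence = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.8, 0.0, 0.2], [0.0, 0.3, 0.6]],
]
masks = [[1, 0], [1, 0]]
labels = vote_segmentation(influence, masks)
```

In this toy input only the first Gaussian receives most of its influence from masked pixels, so it is the only one labeled foreground.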

arXiv:2409.08381 [pdf, ps, other]

Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations

Authors: Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja

Abstract: Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2408.11984 [pdf, other]

Chemical Reaction Neural Networks for Fitting Accelerating Rate Calorimetry Data

Authors: Saakaar Bhatnagar, Andrew Comerford, Zelu Xu, Davide Berti Polato, Araz Banaeizadeh, Alessandro Ferraris

Abstract: As the demand for lithium-ion batteries rapidly increases, there is a need to design these cells in a safe manner to mitigate thermal runaway. Thermal runaway in batteries leads to an uncontrollable temperature rise and potentially fires, which is a major safety concern. Typically, when modelling the chemical kinetics of thermal runaway, calorimetry data (e.g., Accelerating Rate Calorimetry (ARC)) is needed to determine the temperature-driven decomposition kinetics. Conventional methods of fitting Arrhenius Ordinary Differential Equation (ODE) thermal runaway models to Accelerating Rate Calorimetry (ARC) data make several assumptions that reduce the fidelity and generalizability of the obtained model. In this paper, Chemical Reaction Neural Networks (CRNNs) are trained to fit the kinetic parameters of N-equation Arrhenius ODEs to ARC data obtained from a Molicel 21700 P45B. The models are found to be better approximations of the experimental data. The flexibility of the method is demonstrated by experimenting with two-equation and four-equation models. Thermal runaway simulations are conducted in 3D using the obtained kinetic parameters, showing the applicability of the obtained thermal runaway models to large-scale simulations.

Submitted 3 September, 2024; v1 submitted 21 August, 2024; originally announced August 2024.
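
For readers unfamiliar with the Arrhenius ODE models mentioned in this abstract, the following is a minimal single-reaction sketch, not the CRNN fitting procedure itself. The rate constant follows the standard Arrhenius law k = A exp(-Ea / (R T)). The first-order kinetics, the adiabatic self-heating term, the forward Euler integrator, and every parameter value are assumptions chosen only for illustration.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)


def arrhenius_rate(pre_exp, activation_energy, temperature):
    """Arrhenius rate constant k = A * exp(-Ea / (R * T))."""
    return pre_exp * math.exp(-activation_energy / (R_GAS * temperature))


def simulate_runaway(t0=400.0, c0=1.0, pre_exp=1e10, activation_energy=1.0e5,
                     adiabatic_rise=200.0, dt=0.01, steps=50_000):
    """Forward Euler integration of one first-order exothermic reaction.

    dc/dt = -k(T) * c
    dT/dt = adiabatic_rise * k(T) * c   (adiabatic cell, assumed)
    All parameter values are hypothetical.
    """
    temperature, conc = t0, c0
    for _ in range(steps):
        k = arrhenius_rate(pre_exp, activation_energy, temperature)
        consumed = min(conc, k * conc * dt)  # never consume more than remains
        conc -= consumed
        temperature += adiabatic_rise * consumed
    return temperature, conc


final_temperature, final_conc = simulate_runaway()
```

Because the heat released is proportional to the reactant consumed, the temperature in this sketch always equals `t0 + adiabatic_rise * (c0 - conc)`, which gives a quick sanity check on the integrator.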

arXiv:2408.07272 [pdf, other]

NL2OR: Solve Complex Operations Research Problems Using Natural Language Inputs

Authors: Junxuan Li, Ryan Wickman, Sahil Bhatnagar, Raj Kumar Maity, Arko Mukherjee

Abstract: Operations research (OR) uses mathematical models to enhance decision-making, but developing these models requires expert knowledge and can be time-consuming. Automated mathematical programming (AMP) has emerged to simplify this process, but existing systems have limitations. This paper introduces a novel methodology that uses recent advances in Large Language Models (LLMs) to create and edit OR solutions from non-expert user queries expressed in natural language. This reduces the need for domain expertise and the time to formulate a problem. The paper presents an end-to-end pipeline, named NL2OR, that generates solutions to OR problems from natural language input, and shares experimental results on several important OR problems.

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2405.18560 [pdf, other]

Potential Field Based Deep Metric Learning

Authors: Shubhang Bhatnagar, Narendra Ahuja

Abstract: Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model, inspired by electrostatic fields in physics, that, instead of in tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks (Cars-196, CUB-200-2011, and SOP), where it outperforms state-of-the-art baselines.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.12167 [pdf, other]

Open-Source Assessments of AI Capabilities: The Proliferation of AI Analysis Tools, Replicating Competitor Models, and the Zhousidun Dataset

Authors: Ritwik Gupta, Leah Walker, Eli Glickman, Raine Koizumi, Sarthak Bhatnagar, Andrew W. Reddie

Abstract: The integration of artificial intelligence (AI) into military capabilities has become a norm for major military powers across the globe. Understanding how these AI models operate is essential for maintaining strategic advantages and ensuring security. This paper demonstrates an open-source methodology for analyzing military AI models through a detailed examination of the Zhousidun dataset, a Chinese-originated dataset that exhaustively labels critical components on American and Allied destroyers. By demonstrating the replication of a state-of-the-art computer vision model on this dataset, we illustrate how open-source tools can be leveraged to assess and understand key military AI capabilities. This methodology offers a robust framework for evaluating the performance and potential of AI-enabled military capabilities, thus enhancing the accuracy and reliability of strategic assessments.

Submitted 24 May, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.06621 [pdf, other]

On Streaming Codes for Simultaneously Correcting Burst and Random Erasures

Authors: Shobhit Bhatnagar, Biswadip Chakraborty, P. Vijay Kumar

Abstract: Streaming codes are packet-level codes that recover dropped packets within a strict decoding-delay constraint. We study streaming codes over a sliding-window (SW) channel model which admits only those erasure patterns which allow either a single burst erasure of $\le b$ packets along with $\le e$ random packet erasures, or else, $\le a$ random packet erasures, in any sliding-window of $w$ time slots. We determine the optimal rate of a streaming code constructed via the popular diagonal embedding (DE) technique over such an SW channel under delay constraint $\tau=(w-1)$ and provide an $O(w)$ field size code construction. For the case $e>1$, we show that it is not possible to significantly reduce this field size requirement, assuming the well-known MDS conjecture. We then provide a block code construction whose DE yields a streaming code achieving the rate derived above, over a field of size sub-linear in $w,$ for a family of parameters having $e=1.$ We show the field size optimality of this construction for some parameters, and near-optimality for others under a sparsity constraint. Additionally, we derive an upper-bound on the $d_{\text{min}}$ of a cyclic code and characterize cyclic codes which achieve this bound via their ability to simultaneously recover from burst and random erasures.

Submitted 10 May, 2024; originally announced May 2024.
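
The sliding-window channel model in this abstract can be made concrete with a small admissibility check. The sketch below reads the definition literally: within every window of `w` slots, the erasures must either number at most `a`, or be coverable by one contiguous burst of length at most `b` with at most `e` erasures left outside it. The function names and the brute-force search over burst positions are illustrative assumptions, and the paper's formal definition may differ in edge cases.

```python
def window_ok(window, a, b, e):
    """Check one window (sequence of 0/1 flags, 1 = erased packet)."""
    erased = [i for i, flag in enumerate(window) if flag]
    if len(erased) <= a:
        return True  # at most a random erasures
    # Otherwise try every placement of a single burst of length b.
    for start in range(len(window)):
        outside = sum(1 for i in erased if not (start <= i < start + b))
        if outside <= e:
            return True  # one burst plus at most e random erasures
    return False


def admissible(pattern, w, a, b, e):
    """Check every sliding window of length w in the erasure pattern."""
    if len(pattern) <= w:
        return window_ok(pattern, a, b, e)
    return all(
        window_ok(pattern[s:s + w], a, b, e)
        for s in range(len(pattern) - w + 1)
    )


# Hypothetical parameters: w = 8, a = 2, b = 3, e = 1.
burst_plus_one = [1, 1, 1, 0, 0, 1, 0, 0]  # burst of 3 plus 1 isolated erasure
scattered = [1, 0, 1, 0, 1, 0, 1, 0]       # 4 scattered erasures
```

With these parameters the first pattern is admitted (a burst of three plus one extra erasure), while the second is not, since no burst of length three covers all but one of its four erasures.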

arXiv:2405.06606 [pdf, other]

On Streaming Codes for Burst and Random Errors

Authors: Shobhit Bhatnagar, P. Vijay Kumar

Abstract: Streaming codes (SCs) are packet-level codes that recover erased packets within a strict decoding-delay deadline. Streaming codes for various packet erasure channel models such as sliding-window (SW) channel models that admit random or burst erasures in any SW of a fixed length have been studied in the literature, and the optimal rate as well as rate-optimal code constructions of SCs over such channel models are known. In this paper, we study error-correcting streaming codes ($\text{SC}_{\text{ERR}}$s), i.e., packet-level codes which recover erroneous packets within a delay constraint. We study $\text{SC}_{\text{ERR}}$s for two classes of SW channel models, one that admits random packet errors, and another that admits multiple bursts of packet errors, in any SW of a fixed length. For the case of random packet errors, we establish the equivalence of an $\text{SC}_{\text{ERR}}$ and a corresponding SC that recovers from random packet erasures, thus determining the optimal rate of an $\text{SC}_{\text{ERR}}$ for this setting, and providing a rate-optimal code construction for all parameters. We then focus on SCs that recover from multiple erasure bursts and derive a rate-upper-bound for such SCs. We show the necessity of a divisibility constraint for the existence of an SC constructed by the popular diagonal embedding technique, that achieves this rate-bound under a stringent delay requirement. We then show that a construction known in the literature achieves this rate-bound when the divisibility constraint is met. We further show the equivalence of the SCs considered and $\text{SC}_{\text{ERR}}$s for the setting of multiple error bursts, under a stringent delay requirement.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.16193 [pdf, other]

Improving Multi-label Recognition using Class Co-Occurrence Probabilities

Authors: Samyak Rawlekar, Shubhang Bhatnagar, Vishnuvardhan Pogunulu Srinivasulu, Narendra Ahuja

Abstract: Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large image-text datasets for the task. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between a pair of classes. We propose a framework to extend the independent classifiers by incorporating the co-occurrence information for object pairs to improve the performance of independent classifiers. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes, by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.

Submitted 19 September, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

Comments: Accepted to ICPR 2024, CVPR workshops 2024
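
The abstract above notes that class co-occurrences can be captured from training data as conditional probabilities between pairs of classes. A minimal counting sketch of that one step is shown here. The function name, the dictionary layout, and the toy labels are assumptions for illustration, and the GCN refinement described in the abstract is not covered.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(label_sets):
    """Estimate P(target present | given present) from multi-label data.

    Returns a dict keyed by (given, target). Pairs that never co-occur
    are simply absent from the dict.
    """
    class_counts = Counter()
    pair_counts = Counter()
    for labels in label_sets:
        labels = set(labels)
        class_counts.update(labels)
        # Count every ordered pair of distinct classes in this image.
        pair_counts.update(permutations(labels, 2))
    return {
        (given, target): n / class_counts[given]
        for (given, target), n in pair_counts.items()
    }


# Toy training labels (made up).
training_labels = [
    {"person", "bike"},
    {"person", "dog"},
    {"person", "bike", "helmet"},
    {"dog"},
]
probs = cooccurrence_probabilities(training_labels)
```

Note that the resulting matrix is not symmetric: in the toy data a bike always appears with a person, but a person appears with a bike in only two of three images.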

arXiv:2403.14977 [pdf, other]

Piecewise-Linear Manifolds for Deep Metric Learning

Authors: Shubhang Bhatnagar, Narendra Ahuja

Abstract: Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.

Submitted 22 March, 2024; originally announced March 2024.

Comments: Accepted at CPAL 2024 (Oral)

arXiv:2402.01371 [pdf, other]

Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation

Authors: Prashansa Panda, Shalabh Bhatnagar

Abstract: In recent years, there has been a lot of research activity focused on carrying out non-asymptotic convergence analyses for actor-critic algorithms. Recently, a two-timescale critic-actor algorithm has been presented for the discounted cost setting in the look-up table case, where the timescales of the actor and the critic are reversed and only asymptotic convergence is shown. In our work, we present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting and present the first finite-time non-asymptotic as well as asymptotic convergence analysis for such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $\epsilon$, which is better than the one obtained for two-timescale actor-critic in a similar setting. A notable feature of our analysis is that, unlike recent single-timescale actor-critic algorithms, we present a complete asymptotic convergence analysis of our scheme in addition to the finite-time bounds that we obtain and show that the (slower) critic recursion converges asymptotically to the attractor of an associated differential inclusion with actor parameters corresponding to local maxima of a perturbed average reward objective. We also show the results of numerical experiments on three benchmark settings and observe that our critic-actor algorithm performs on par with, and is in fact better than, the other algorithms considered.

Submitted 24 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2312.01056 [pdf, other]

doi 10.3390/mca29010009

Investigating the Surrogate Modeling Capabilities of Continuous Time Echo State Networks

Authors: Saakaar Bhatnagar

Abstract: Continuous Time Echo State Networks (CTESNs) are a promising yet under-explored surrogate modeling technique for dynamical systems, particularly those governed by stiff Ordinary Differential Equations (ODEs). A key determinant of the generalization accuracy of a CTESN surrogate is the method of projecting the reservoir state to the output. This paper shows that of the two common projection methods (linear and nonlinear), the surrogates developed via the nonlinear projection consistently outperform those developed via the linear method. CTESN surrogates are developed for several challenging benchmark cases governed by stiff ODEs, and for each case, the performance of the linear and nonlinear projections is compared. The results of this paper demonstrate the applicability of CTESNs to a variety of problems while serving as a reference for important algorithmic and hyper-parameter choices for CTESNs.

Submitted 5 January, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

arXiv:2311.11789 [pdf, other]

Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes

Authors: Lakshmi Mandal, Chandrashekar Lakshminarayanan, Shalabh Bhatnagar

Abstract: In this work, we consider a cooperative multi-agent Markov decision process (MDP) involving m agents. At each decision epoch, all the m agents independently select actions in order to maximize a common long-term objective. In the policy iteration process of the multi-agent setup, the number of actions grows exponentially with the number of agents, incurring huge computational costs. Thus, recent works consider decentralized policy improvement, where each agent improves its decisions unilaterally, assuming that the decisions of the other agents are fixed. However, exact value functions are considered in the literature, which is computationally expensive for a large number of agents with high dimensional state-action space. Thus, we propose approximate decentralized policy iteration algorithms, using approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. Further, we consider (both) cooperative multi-agent finite and infinite horizon discounted MDPs and propose suitable algorithms in each case. Moreover, we provide theoretical guarantees for our algorithms and also demonstrate their advantages over existing state-of-the-art algorithms in the literature.

Submitted 29 April, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

arXiv:2310.16363 [pdf, other]

Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms

Authors: Prashansa Panda, Shalabh Bhatnagar

Abstract: Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks, especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d. (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.

Submitted 29 May, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.05000 [pdf, ps, other]

The Reinforce Policy Gradient Algorithm Revisited

Authors: Shalabh Bhatnagar

Abstract: We revisit the Reinforce policy gradient algorithm from the literature. Note that this algorithm typically works with cost returns obtained over random length episodes obtained from either termination upon reaching a goal state (as with episodic tasks) or from instants of visit to a prescribed recurrent state (in the case of continuing tasks). We propose a major enhancement to the basic algorithm. We estimate the policy gradient using a function measurement over a perturbed parameter by appealing to a class of random search approaches. This has advantages in the case of systems with infinite state and action spaces as it relaxes some of the regularity requirements that would otherwise be needed for proving convergence of the Reinforce algorithm. Nonetheless, we observe that even though we estimate the gradient of the performance objective using the performance objective itself (and not via the sample gradient), the algorithm converges to a neighborhood of a local minimum. We also provide a proof of convergence for this new algorithm.

Submitted 8 October, 2023; originally announced October 2023.
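
The abstract above describes estimating a gradient from a function measurement at a perturbed parameter, in the spirit of random search. One generic estimator of that kind is the one-measurement Gaussian smoothing form, (Δ/δ) J(θ + δΔ) with Δ drawn from a standard normal. The sketch below uses that form purely as an assumption, since the abstract does not give the paper's exact estimator. A deterministic quadratic stands in for the episodic cost return, and all names are hypothetical.

```python
import random


def perturbed_gradient_estimate(objective, theta, delta, rng):
    """One-measurement estimate (direction / delta) * J(theta + delta * direction)."""
    direction = [rng.gauss(0.0, 1.0) for _ in theta]
    perturbed = [t + delta * d for t, d in zip(theta, direction)]
    value = objective(perturbed)  # single function measurement
    return [d * value / delta for d in direction]


def averaged_estimate(objective, theta, delta, num_samples, seed=0):
    """Average many single-measurement estimates to reduce variance."""
    rng = random.Random(seed)
    total = [0.0] * len(theta)
    for _ in range(num_samples):
        estimate = perturbed_gradient_estimate(objective, theta, delta, rng)
        total = [acc + est for acc, est in zip(total, estimate)]
    return [acc / num_samples for acc in total]


def quadratic(x):
    """Stand-in objective J(x) = x_0^2, whose true gradient at 1.0 is 2.0."""
    return x[0] ** 2


grad = averaged_estimate(quadratic, [1.0], delta=0.5, num_samples=100_000)
```

Each individual estimate is very noisy, which is why the sketch averages many of them. The estimator needs only objective values and never differentiates the objective.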

arXiv:2309.03374 [pdf, other]

doi 10.1615/JMachLearnModelComput.2024051540

Physics Informed Neural Networks for Modeling of 3D Flow-Thermal Problems with Sparse Domain Data

Authors: Saakaar Bhatnagar, Andrew Comerford, Araz Banaeizadeh

Abstract: Successfully training Physics Informed Neural Networks (PINNs) for highly nonlinear PDEs on complex 3D domains remains a challenging task. In this paper, PINNs are employed to solve the 3D incompressible Navier-Stokes (NS) equations at moderate to high Reynolds numbers for complex geometries. The presented method utilizes very sparsely distributed solution data in the domain. A detailed investigation into the effect of the amount of supplied data and the PDE-based regularizers is presented. Additionally, a hybrid data-PINNs approach is used to generate a surrogate model of a realistic flow-thermal electronics design problem. This surrogate model provides near real-time sampling and was found to outperform standard data-driven neural networks when tested on unseen query points. The findings of the paper show how PINNs can be effective when used in conjunction with sparse data for solving 3D nonlinear PDEs or for surrogate modeling of design spaces governed by them.

Submitted 3 November, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

arXiv:2308.04643 [pdf, other]

doi 10.1109/IROS55552.2023.10342147

Long-Distance Gesture Recognition using Dynamic Neural Networks

Authors: Shubhang Bhatnagar, Sharath Gopal, Narendra Ahuja, Liu Ren

Abstract: Gestures form an important medium of communication between humans and machines. An overwhelming majority of existing gesture recognition methods are tailored to a scenario where humans and machines are located very close to each other. This short-distance assumption does not hold true for several types of interactions, for example gesture-based interactions with a floor cleaning robot or with a drone. Methods made for short-distance recognition are unable to perform well on long-distance recognition due to gestures occupying only a small portion of the input data. Their performance is especially poor in resource-constrained settings where they are not able to effectively focus their limited compute on the gesturing subject. We propose a novel, accurate and efficient method for the recognition of gestures from longer distances. It uses a dynamic neural network to select features from gesture-containing spatial regions of the input sensor data for further processing. This helps the network focus on features important for gesture recognition while discarding background features early on, thus making it more compute efficient compared to other techniques. We demonstrate the performance of our method on the LD-ConGR long-distance dataset where it outperforms previous state-of-the-art methods on recognition accuracy and compute efficiency.

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

Journal ref: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 2023, pp. 1307-1312

arXiv:2305.12239 [pdf, other]

Off-Policy Average Reward Actor-Critic with Deterministic Policy Search

Authors: Naman Saxena, Subhojyoti Khastigir, Shishir Kolathaya, Shalabh Bhatnagar

Abstract: The average reward criterion is relatively less studied as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. There are a few recent works that present on-policy average reward actor-critic algorithms, but average reward off-policy actor-critic is relatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first show asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite time analysis of the resulting stochastic approximation scheme with linear function approximator and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments.

Submitted 19 July, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

Comments: Accepted at ICML 2023

arXiv:2305.12125 [pdf, other]

A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar, Naman Saxena

Abstract: We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios. This algorithm combines the standard stochastic gradient descent and the gradient clipping method. The output layer is updated using clipped gradients, the rest of the neural network is updated using standard gradients. Updating the output layer using clipped gradient stabilizes it. We show that the remaining layers are automatically stabilized provided the neural network is only composed of squashing (compact range) activations. We also present a novel squashing activation function - it is obtained by modifying a Gaussian Error Linear Unit (GELU) to have compact range - we call it Truncated GELU (tGELU). Unlike other squashing activations, such as sigmoid, the range of tGELU can be explicitly specified. As a consequence, the problem of vanishing gradients that arise due to a small range, e.g., in the case of a sigmoid activation, is eliminated. We prove that a NN composed of squashing activations (tGELU, sigmoid, etc.), when updated using the algorithm presented herein, is numerically stable and has consistent performance (low variance). The theory is supported by extensive experiments. Within reinforcement learning, as a consequence of our study, we show that target networks in Deep Q-Learning can be omitted, greatly speeding up learning and alleviating memory requirements. Cross-entropy based classification algorithms that suffer from high variance issues are more consistent when trained using our framework. One symptom of numerical instability in training is the high variance of the neural network update values. We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.

Submitted 20 May, 2023; originally announced May 2023.

Comments: 30 pages, 12 figures

MSC Class: 90B05; 90C40; 90C90

arXiv:2304.10951 [pdf, ps, other]

A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning

Authors: Mizhaan Prajit Maniyar, Akash Mondal, Prashanth L. A., Shalabh Bhatnagar

Abstract: We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $ε$-SOSP is $O(ε^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(ε^{-4.5})$.

Submitted 21 April, 2023; originally announced April 2023.

arXiv:2303.07068 [pdf, other]

n-Step Temporal Difference Learning with Optimal n

Authors: Lakshmi Mandal, Shalabh Bhatnagar

Abstract: We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. Our objective function for the optimization problem is the average root mean squared error (RMSE). We find the optimal n by resorting to a model-free optimization technique involving a one-simulation simultaneous perturbation stochastic approximation (SPSA) based procedure. Whereas SPSA is a zeroth-order continuous optimization procedure, we adapt it to the discrete optimization setting by using a random projection operator. We prove the asymptotic convergence of the recursion by showing that the sequence of n-updates obtained using zeroth-order stochastic gradient search converges almost surely to an internally chain transitive invariant set of an associated differential inclusion. This results in convergence of the discrete parameter sequence to the optimal n in n-step TD. Through experiments, we show that the optimal value of n is achieved with our SDPSA algorithm for arbitrary initial values. We further show using numerical evaluations that SDPSA outperforms the state-of-the-art discrete parameter stochastic optimization algorithm Optimal Computing Budget Allocation (OCBA) on benchmark RL tasks.

Submitted 17 July, 2024; v1 submitted 13 March, 2023; originally announced March 2023.
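
For readers unfamiliar with the one-simulation SPSA idea named in the abstract above, the following is a minimal sketch of the standard continuous-parameter, one-measurement SPSA gradient estimate. It is not the SDPSA algorithm from the paper (the random projection to discrete values of n is not shown), and the function name `one_sim_spsa_gradient`, its arguments, and the Rademacher perturbation choice are assumptions made for illustration.

```python
import numpy as np


def one_sim_spsa_gradient(f, theta, delta, rng):
    """Illustrative one-measurement SPSA gradient estimate of f at theta.

    Hypothetical helper, not taken from the paper: it uses a single
    evaluation of f per estimate and symmetric +/-1 perturbations.
    """
    theta = np.asarray(theta, dtype=float)
    # Rademacher (+1 / -1) perturbation vector, one entry per coordinate.
    perturbation = rng.choice([-1.0, 1.0], size=theta.shape)
    # The only function evaluation: f at the randomly perturbed point.
    value = f(theta + delta * perturbation)
    # Every coordinate reuses the same measurement, divided by its own
    # perturbation component scaled by delta.
    return value / (delta * perturbation), perturbation
```

A single estimate is very noisy; it is the average over many independent perturbations that tracks the gradient, which is why such estimates are normally used inside a stochastic approximation recursion with diminishing step sizes.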

arXiv:2301.06535 [pdf, other]

doi 10.1016/j.mlwa.2024.100535

Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions

Authors: Jesse Islam, Maxime Turgeon, Robert Sladek, Sahir Bhatnagar

Abstract: In the context of survival analysis, data-driven neural network-based methods have been developed to model complex covariate effects. While these methods may provide better predictive performance than regression-based approaches, not all can model time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNNs) as a new approach that combines the case-base sampling framework with flexible neural network architectures. Using a novel sampling scheme and data augmentation to naturally account for censoring, we construct a feed-forward neural network that includes time as an input. CBNNs predict the probability of an event occurring at a given moment to estimate the full hazard function. We compare the performance of CBNNs to regression and neural network-based survival methods in a simulation and three case studies using two time-dependent metrics. First, we examine performance on a simulation involving a complex baseline hazard and time-varying interactions to assess all methods, with CBNN outperforming competitors. Then, we apply all methods to three real data applications, with CBNNs outperforming the competing models in two studies and showing similar performance in the third. Our results highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible framework for data-driven modeling of single event survival outcomes that estimates time-varying effects and a complex baseline hazard by design. An R package is available at https://github.com/Jesse-Islam/cbnn.

Submitted 9 January, 2024; v1 submitted 16 January, 2023; originally announced January 2023.

arXiv:2212.10477 [pdf, ps, other]

Generalized Simultaneous Perturbation-based Gradient Search with Reduced Estimator Bias

Authors: Soumen Pachal, Shalabh Bhatnagar, L. A. Prashanth

Abstract: We present in this paper a family of generalized simultaneous perturbation-based gradient search (GSPGS) estimators that use noisy function measurements. The number of function measurements required by each estimator is guided by the desired level of accuracy. We first present in detail unbalanced generalized simultaneous perturbation stochastic approximation (GSPSA) estimators and later present the balanced versions (B-GSPSA) of these. We extend this idea further and present the generalized smoothed functional (GSF) and generalized random directions stochastic approximation (GRDSA) estimators, respectively, as well as their balanced variants. We show that estimators within any specified class requiring a larger number of function measurements result in lower estimator bias. We present a detailed analysis of both the asymptotic and non-asymptotic convergence of the resulting stochastic approximation schemes. We further present a series of experimental results with the various GSPGS estimators on the Rastrigin and quadratic function objectives. Our experiments are seen to validate our theoretical findings.

Submitted 12 November, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: The material in this paper was presented in part at the Conference on Information Sciences and Systems (CISS) in March 2023

arXiv:2211.09174 [pdf, other]

CASPR: Customer Activity Sequence-based Prediction and Representation

Authors: Pin-Jung Chen, Sahil Bhatnagar, Sagar Goyal, Damian Konrad Kowalczyk, Mayank Shrivastava

Abstract: Tasks critical to enterprise profitability, such as customer churn prediction, fraudulent account detection or customer lifetime value estimation, are often tackled by models trained on features engineered from customer data in tabular format. Application-specific feature engineering adds development, operationalization and maintenance costs over time. Recent advances in representation learning present an opportunity to simplify and generalize feature engineering across applications. When applying these advancements to tabular data researchers deal with data heterogeneity, variations in customer engagement history or the sheer volume of enterprise datasets. In this paper, we propose a novel approach to encode tabular data containing customer transactions, purchase history and other interactions into a generic representation of a customer's association with the business. We then evaluate these embeddings as features to train multiple models spanning a variety of applications. CASPR, Customer Activity Sequence-based Prediction and Representation, applies Transformer architecture to encode activity sequences to improve model performance and avoid bespoke feature engineering across applications. Our experiments at scale validate CASPR for both small and large enterprise applications.

Submitted 28 November, 2022; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: Presented at the Table Representation Learning Workshop, NeurIPS 2022, New Orleans. Authors listed in random order

arXiv:2210.07573 [pdf, other]

Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

Authors: Ashish Kumar Jayant, Shalabh Bhatnagar

Abstract: During initial iterations of training in most Reinforcement Learning (RL) algorithms, agents perform a significant number of random exploratory steps. In the real world, this can limit the practicality of these algorithms as it can lead to potentially dangerous behavior. Hence safe exploration is a critical issue in applying RL algorithms in the real world. This problem has been recently well studied under the Constrained Markov Decision Process (CMDP) Framework, where in addition to single-stage rewards, an agent receives single-stage costs or penalties as well depending on the state transitions. The prescribed cost functions are responsible for mapping undesirable behavior at any given time-step to a scalar value. The goal then is to find a feasible policy that maximizes reward returns while constraining the cost returns to be below a prescribed threshold during training as well as deployment. We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner as well as find a feasible optimal policy using the Lagrangian Relaxation-based Proximal Policy Optimization. We use an ensemble of neural networks with different initializations to tackle epistemic and aleatoric uncertainty issues faced during environment model learning. We compare our approach with relevant model-free and model-based approaches in Constrained RL using the challenging Safe Reinforcement Learning benchmark - the Open AI Safety Gym. We demonstrate that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches. Further, our approach shows better reward performance than other constrained model-based approaches in the literature.

Submitted 14 October, 2022; originally announced October 2022.

Comments: Proceedings of NeurIPS 2022

arXiv:2210.04527 [pdf, other]

doi 10.1109/CDC49753.2023.10383413

A policy gradient approach for Finite Horizon Constrained Markov Decision Processes

Authors: Soumyajit Guin, Shalabh Bhatnagar

Abstract: The infinite horizon setting is widely adopted for problems of reinforcement learning (RL). These invariably result in stationary policies that are optimal. In many situations, finite horizon control problems are of interest and for such problems, the optimal policies are time-varying in general. Another setting that has become popular in recent times is of Constrained Reinforcement Learning, where the agent maximizes its rewards while it also aims to satisfy some given constraint criteria. However, this setting has only been studied in the context of infinite horizon MDPs where stationary policies are optimal. We present an algorithm for constrained RL in the Finite Horizon Setting where the horizon terminates after a fixed (finite) time. We use function approximation in our algorithm which is essential when the state and action spaces are large or continuous and use the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also compare and analyze the performance of our algorithm through experiments and show that our algorithm performs better than some other well known algorithms.

Submitted 14 October, 2024; v1 submitted 10 October, 2022; originally announced October 2022.

arXiv:2210.04470 [pdf, other]

doi 10.1109/LCSYS.2023.3288931

Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Authors: Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

Abstract: We revisit the standard formulation of tabular actor-critic algorithm as a two time-scale stochastic approximation with value function computed on a faster time-scale and policy computed on a slower time-scale. This emulates policy iteration. We observe that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.

Submitted 13 June, 2024; v1 submitted 10 October, 2022; originally announced October 2022.

arXiv:2208.04563 [pdf, other]

doi 10.1007/s11116-022-10363-z

An Agent-Based Fleet Management Model for First- and Last-Mile Services

Authors: Saumya Bhatnagar, Tarun Rambha, Gitakrishnan Ramadurai

Abstract: With the growth of cars and car-sharing applications, commuters in many cities, particularly developing countries, are shifting away from public transport. These shifts have affected two key stakeholders: transit operators and first- and last-mile (FLM) services. Although most cities continue to invest heavily in bus and metro projects to make public transit attractive, ridership in these systems has often failed to reach targeted levels. FLM service providers also experience lower demand and revenues in the wake of shifts to other means of transport. Effective FLM options are required to prevent this phenomenon and make public transport attractive for commuters. One possible solution is to forge partnerships between public transport and FLM providers that offer competitive joint mobility options. Such solutions require prudent allocation of supply and optimised strategies for FLM operations and ride-sharing. To this end, we build an agent- and event-based simulation model which captures interactions between passengers and FLM services using statecharts, vehicle routing models, and other trip matching rules. An optimisation model for allocating FLM vehicles at different transit stations is proposed to reduce unserved requests. Using real-world metro transit demand data from Bengaluru, India, the effectiveness of our approach in improving FLM connectivity and quantifying the benefits of sharing trips is demonstrated.

Submitted 4 December, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

arXiv:2208.00290 [pdf, ps, other]

A Gradient Smoothed Functional Algorithm with Truncated Cauchy Random Perturbations for Stochastic Optimization

Authors: Akash Mondal, Prashanth L. A., Shalabh Bhatnagar

Abstract: In this paper, we present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples, and only the latter are observed for any given parameter. Our algorithm employs a gradient estimation scheme with random perturbations, which are formed using the truncated Cauchy distribution from the delta sphere. We analyze the bias and variance of the proposed gradient estimator. Our algorithm is found to be particularly useful in the case when the objective function is non-convex, and the parameter dimension is high. From an asymptotic convergence analysis, we establish that our algorithm converges almost surely to the set of stationary points of the objective function and obtain the asymptotic convergence rate. We also show that our algorithm avoids unstable equilibria, implying convergence to local minima. Further, we perform a non-asymptotic convergence analysis of our algorithm. In particular, we establish here a non-asymptotic bound for finding an epsilon-stationary point of the non-convex objective function. Finally, we demonstrate numerically through simulations that our algorithm outperforms GSF, SPSA, and RDSA by a significant margin over a few non-convex settings and further validate its performance over convex (noisy) objectives.

Submitted 30 June, 2023; v1 submitted 30 July, 2022; originally announced August 2022.

arXiv:2201.00286 [pdf, ps, other]

Reinforcement Learning for Task Specifications with Action-Constraints

Authors: Arun Raman, Keerthan Shagrithaya, Shalabh Bhatnagar

Abstract: In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe are given in terms of a finite-state automaton; and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequence are satisfied. Then we present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning and show the results of simulations in this setting.

Submitted 1 January, 2022; originally announced January 2022.

arXiv:2112.02999 [pdf, other]

Dynamic Mirror Descent based Model Predictive Control for Accelerating Robot Learning

Authors: Utkarsh A. Mishra, Soumya R. Samineni, Prakhar Goel, Chandravaran Kunjeti, Himanshu Lodha, Aman Singh, Aditya Sagi, Shalabh Bhatnagar, Shishir Kolathaya

Abstract: Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best from both: asymptotic performance of Mf-RL and high sample-efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm: Mirror-Descent Model Predictive RL (M-DeMoRL), which uses Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking in a 2R leg and hardware transfer for robust walking in a quadruped. We show that the inner-loop Mb-RL significantly decreases the number of training iterations required in the real system, thereby validating the proposed approach.

Submitted 4 November, 2021; originally announced December 2021.

Comments: 8 pages, 4 figures. arXiv admin note: substantial text overlap with arXiv:2110.12239

arXiv:2111.11768 [pdf, other]

Schedule Based Temporal Difference Algorithms

Authors: Rohan Deb, Meet Gandhi, Shalabh Bhatnagar

Abstract: Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD($λ$) is a popular class of algorithms to solve this problem. However, the weights assigned to different $n$-step returns in TD($λ$), controlled by the parameter $λ$, decrease exponentially with increasing $n$. In this paper, we present a $λ$-schedule procedure that generalizes the TD($λ$) algorithm to the case when the parameter $λ$ could vary with time-step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{λ_t\}_{t \geq 1}$. Based on this procedure, we propose an on-policy algorithm - TD($λ$)-schedule, and two off-policy algorithms - GTD($λ$)-schedule and TDC($λ$)-schedule, respectively. We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.

Submitted 23 November, 2021; originally announced November 2021.
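
To make the exponential weight decay mentioned in the abstract above concrete: in standard TD($λ$) with a constant $λ$, the $n$-step return receives weight $(1-λ)λ^{n-1}$ in the λ-return. The sketch below computes only these textbook constant-$λ$ weights; it does not reproduce the paper's $λ$-schedule weights, and the function name `lambda_return_weights` is an illustrative assumption.

```python
def lambda_return_weights(lam, max_n):
    """Weights of the 1..max_n step returns in the standard lambda-return.

    Textbook constant-lambda case only (hypothetical helper); the
    schedule-based generalization from the paper is not implemented here.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    # Weight of the n-step return: (1 - lam) * lam ** (n - 1).
    return [(1.0 - lam) * lam ** (n - 1) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    # Each weight is the previous one multiplied by lam.
    print(lambda_return_weights(0.5, 5))
```

With a fixed $λ$ the user cannot, for example, put most of the weight on the 5-step return, since consecutive weights always differ by the factor $λ$; this is the restriction that a time-varying sequence of $λ$ values is meant to relax.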

arXiv:2111.11004 [pdf, other]

Gradient Temporal Difference with Momentum: Stability and Convergence

Authors: Rohan Deb, Shalabh Bhatnagar

Abstract: Gradient temporal difference (Gradient TD) algorithms are a popular class of stochastic approximation (SA) algorithms used for policy evaluation in reinforcement learning. Here, we consider Gradient TD algorithms with an additional heavy ball momentum term and provide choice of step size and momentum parameter that ensures almost sure convergence of these algorithms asymptotically. In doing so, we decompose the heavy ball Gradient TD iterates into three separate iterates with different step sizes. We first analyze these iterates under one-timescale SA setting using results from current literature. However, the one-timescale case is restrictive and a more general analysis can be provided by looking at a three-timescale decomposition of the iterates. In the process, we provide the first conditions for stability and convergence of general three-timescale SA. We then prove that the heavy ball Gradient TD algorithm is convergent using our three-timescale SA analysis. Finally, we evaluate these algorithms on standard RL problems and report improvement in performance over the vanilla algorithms.

Submitted 22 November, 2021; originally announced November 2021.
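
As background for the heavy ball momentum term named in the abstract above, here is a minimal sketch of the generic deterministic heavy ball update, $x_{k+1} = x_k - α ∇f(x_k) + β (x_k - x_{k-1})$, on a scalar problem. It is not the heavy ball Gradient TD algorithm from the paper, which applies momentum to stochastic Gradient TD iterates; the function name and parameters are assumptions for illustration.

```python
def heavy_ball_minimize(grad, x0, step_size, momentum, num_steps):
    """Generic scalar heavy ball iteration (illustrative, not from the paper).

    x_next = x - step_size * grad(x) + momentum * (x - x_prev)
    """
    x_prev = float(x0)
    x = float(x0)
    for _ in range(num_steps):
        # Gradient step plus a fraction of the previous displacement.
        x_next = x - step_size * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x


if __name__ == "__main__":
    # Minimize (x - 3)^2, whose gradient is 2 * (x - 3).
    print(heavy_ball_minimize(lambda x: 2.0 * (x - 3.0), 0.0, 0.1, 0.5, 200))
```

Because each step depends on both the current and the previous iterate, the recursion is second order, which is one reason an analysis may rewrite it as several coupled first-order iterates.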

arXiv:2110.15093 [pdf, other]

Finite Horizon Q-learning: Stability, Convergence, Simulations and an application on Smart Grids

Authors: Vivek VP, Dr. Shalabh Bhatnagar

Abstract: Q-learning is a popular reinforcement learning algorithm. This algorithm has however been studied and analysed mainly in the infinite horizon setting. There are several important applications which can be modeled in the framework of finite horizon Markov decision processes. We develop a version of Q-learning algorithm for finite horizon Markov decision processes (MDP) and provide a full proof of its stability and convergence. Our analysis of stability and convergence of finite horizon Q-learning is based entirely on the ordinary differential equations (O.D.E) method. We also demonstrate the performance of our algorithm on a setting of random MDP as well as on an application on smart grids.

Submitted 6 August, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

arXiv:2110.10969 [pdf, other]

Memory Efficient Adaptive Attention For Multiple Domain Learning

Authors: Himanshu Pradeep Aswani, Abhiraj Sunil Kanse, Shubhang Bhatnagar, Amit Sethi

Abstract: Training CNNs from scratch on new domains typically demands large numbers of labeled images and computations, which is not suitable for low-power hardware. One way to reduce these requirements is to modularize the CNN architecture and freeze the weights of the heavier modules, that is, the lower layers after pre-training. Recent studies have proposed alternative modular architectures and schemes that lead to a reduction in the number of trainable parameters needed to match the accuracy of fully fine-tuned CNNs on new domains. Our work suggests that a further reduction in the number of trainable parameters by an order of magnitude is possible. Furthermore, we propose that new modularization techniques for multiple domain learning should also be compared on other realistic metrics, such as the number of interconnections needed between the fixed and trainable modules, the number of training samples needed, the order of computations required and the robustness to partial mislabeling of the training data. On all of these criteria, the proposed architecture demonstrates advantages over or matches the current state-of-the-art.

Submitted 21 October, 2021; originally announced October 2021.

Comments: 13 pages, 3 figures, 4 graphs, 3 tables

arXiv:2110.10017 [pdf, other]

Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm

Authors: Raghuram Bharadwaj Diddigi, Prateek Jain, Prabuchandran K. J., Shalabh Bhatnagar

Abstract: Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL). This is known as "off-policy control" in RL where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy). As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the "off-policy" setting compared to the "on-policy" setting where new data from the policy updates will be utilized in learning. This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. The existing natural gradient-based actor-critic algorithms with convergence guarantees require fixed features for approximating both policy and value functions. This often leads to sub-optimal learning in many RL applications. On the other hand, our proposed algorithm utilizes compatible features that enable one to use arbitrary neural networks to approximate the policy and the value function and guarantee convergence to a locally optimal policy. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.

Submitted 15 June, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: This paper has been accepted for presentation at the IJCNN at IEEE WCCI 2022 and for publication in the conference proceedings published by IEEE

arXiv:2102.10165 [pdf, other]

Analyzing Cross Validation In Compressed Sensing With Mixed Gaussian And Impulse Measurement Noise With L1 Errors

Authors: Chinmay Gurjarpadhye, Shubhang Bhatnagar, Ajit Rajwade

Abstract: Compressed sensing (CS) involves sampling signals at rates less than their Nyquist rates and attempting to reconstruct them after sample acquisition. Most such algorithms have parameters, for example the regularization parameter in LASSO, which need to be chosen carefully for optimal performance. These parameters can be chosen based on assumptions on the noise level or signal sparsity, but this knowledge may often be unavailable. In such cases, cross validation (CV) can be used to choose these parameters in a purely data-driven fashion. Previous work analysing the use of CV in CS has been based on the $\ell_2$ cross-validation error with Gaussian measurement noise. But it is well known that the $\ell_2$ error is not robust to impulse noise and provides a poor estimate of the recovery error, failing to choose the best parameter. Here we propose using the $\ell_1$ CV error which provides substantial performance benefits given impulse measurement noise. Most importantly, we provide a detailed theoretical analysis and error bounds for the use of $\ell_1$ CV error in CS reconstruction. We show that with high probability, choosing the parameter that yields the minimum $\ell_1$ CV error is equivalent to choosing the minimum recovery error (which is not observable in practice). To our best knowledge, this is the first paper which theoretically analyzes $\ell_1$-based CV in CS.

Submitted 19 February, 2021; originally announced February 2021.

arXiv:2101.02349 [pdf, other]

Attention Actor-Critic algorithm for Multi-Agent Constrained Co-operative Reinforcement Learning

Authors: P. Parnika, Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, Shalabh Bhatnagar

Abstract: In this work, we consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting, where the objective is to optimize a common goal. However, in many real-life applications, in addition to optimizing the goal, the agents are required to satisfy certain constraints specified on their actions. Under this setting, the objective of the agents is to not only learn the actions that optimize the common objective but also meet the specified constraints. In recent times, the Actor-Critic algorithm with an attention mechanism has been successfully applied to obtain optimal actions for RL agents in multi-agent environments. In this work, we extend this algorithm to the constrained multi-agent RL setting. The idea here is that optimizing the common goal and satisfying the constraints may require different modes of attention. By incorporating different attention modes, the agents can select useful information required for optimizing the objective and satisfying the constraints separately, thereby yielding better actions. Through experiments on benchmark multi-agent environments, we show the effectiveness of our proposed algorithm.

Submitted 6 January, 2021; originally announced January 2021.

arXiv:2010.16342 [pdf, other]

Robust Quadrupedal Locomotion on Sloped Terrains: A Linear Policy Approach

Authors: Kartik Paigwar, Lokesh Krishna, Sashank Tirumala, Naman Khetan, Aditya Sagi, Ashish Joglekar, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, Shishir Kolathaya

Abstract: In this paper, with a view toward fast deployment of locomotion gaits in low-cost hardware, we use a linear policy for realizing end-foot trajectories in the quadruped robot, Stoch $2$. In particular, the parameters of the end-foot trajectories are shaped via a linear feedback policy that takes the torso orientation and the terrain slope as inputs. The corresponding desired joint angles are obtained via an inverse kinematics solver and tracked via a PID control law. Augmented Random Search, a model-free and gradient-free learning algorithm, is used to train this linear policy. Simulation results show that the resulting walking is robust to terrain slope variations and external pushes. This methodology is not only computationally light-weight but also uses minimal sensing and actuation capabilities in the robot, thereby justifying the approach.

Submitted 10 November, 2020; v1 submitted 30 October, 2020; originally announced October 2020.

Comments: Accepted in 4th Conference on Robot Learning 2020, MIT, USA

arXiv:2010.15947 [pdf, other]

PAL : Pretext-based Active Learning

Authors: Shubhang Bhatnagar, Sachin Goyal, Darshan Tank, Amit Sethi

Abstract: The goal of pool-based active learning is to judiciously select a fixed-sized subset of unlabeled samples from a pool to query an oracle for their labels, in order to maximize the accuracy of a supervised learner. However, the unsaid requirement that the oracle should always assign correct labels is unreasonable for most situations. We propose an active learning technique for deep neural networks that is more robust to mislabeling than the previously proposed techniques. Previous techniques rely on the task network itself to estimate the novelty of the unlabeled samples, but learning the task (generalization) and selecting samples (out-of-distribution detection) can be conflicting goals. We use a separate network to score the unlabeled samples for selection. The scoring network relies on self-supervision for modeling the distribution of the labeled samples to reduce the dependency on potentially noisy labels. To counter the paucity of data, we also deploy another head on the scoring network for regularization via multi-task learning and use an unusual self-balancing hybrid scoring function. Furthermore, we divide each query into sub-queries before labeling to ensure that the query has diverse samples. In addition to having a higher tolerance to mislabeling of samples by the oracle, the resultant technique also produces competitive accuracy in the absence of label noise. The technique also handles the introduction of new classes on-the-fly well by temporarily increasing the sampling rate of these classes.

Submitted 28 March, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

arXiv:2010.06142 [pdf, other]

Hindsight Experience Replay with Kronecker Product Approximate Curvature

Authors: Dhuruva Priyan G M, Abhik Singla, Shalabh Bhatnagar

Abstract: Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks in sparse-reward environments. But due to its reduced sample efficiency and slower convergence, HER fails to perform effectively. Natural gradients address these challenges by converging the model parameters better and by avoiding bad actions that collapse the training performance. However, updating parameters in neural networks requires expensive computation and thus increases training time. Our proposed method solves the above-mentioned challenges with better sample efficiency and faster convergence with increased success rate. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. We solve this issue by including Twin Delayed Deep Deterministic Policy Gradients (TD3) in HER. TD3 learns two Q-functions instead of one and adds noise to the target action, to make it harder for the policy to exploit Q-function errors. The experiments are done with the help of OpenAI's MuJoCo environments. Results on these environments show that our algorithm (TDHER+KFAC) performs better in most of the scenarios.

Submitted 9 October, 2020; originally announced October 2020.

Comments: arXiv admin note: text overlap with arXiv:1708.05144 by other authors
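
The two TD3 ingredients named in the abstract above (two Q-functions, and noise added to the target action) combine in the critic's bootstrap target. The sketch below shows the standard TD3 target computation for a scalar action; it is not taken from the paper's code, and the function name `td3_target`, the callables passed in, and the default hyperparameters are illustrative assumptions. The HER relabelling and K-FAC natural-gradient parts of the method are not shown.

```python
import random

def td3_target(reward, next_state, done, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Clipped double-Q target with target policy smoothing (scalar action).

    `actor_target`, `q1_target` and `q2_target` are hypothetical callables
    standing in for the target networks.
    """
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = actor_target(next_state) + noise
    next_action = max(action_low, min(action_high, next_action))
    # Clipped double-Q: take the smaller of the two critics' estimates
    # to counter overestimation of Q-values.
    q_min = min(q1_target(next_state, next_action),
                q2_target(next_state, next_action))
    # No bootstrapping past a terminal transition.
    return reward + (0.0 if done else gamma * q_min)
```

Taking the minimum of the two critics biases the target downward, which is what makes it harder for the policy to exploit an overestimated Q-value.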

arXiv:2009.00821 [pdf, other]

A reinforcement learning approach to hybrid control design

Authors: Meet Gandhi, Atreyee Kundu, Shalabh Bhatnagar

Abstract: In this paper we design hybrid control policies for hybrid systems whose mathematical models are unknown. Our contributions are threefold. First, we propose a framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP). This result facilitates the application of off-the-shelf algorithms from Reinforcement Learning (RL) literature towards designing optimal control policies. Second, we model a set of benchmark examples of hybrid control design problem in the proposed MDP framework. Third, we adapt the recently proposed Proximal Policy Optimisation (PPO) algorithm for the hybrid action space and apply it to the above set of problems. It is observed that in each case the algorithm converges and finds the optimal policy.

Submitted 2 September, 2020; originally announced September 2020.

Comments: 9 pages

arXiv:2008.13066 [pdf, other]

Computer Model Calibration with Time Series Data using Deep Learning and Quantile Regression

Authors: Saumya Bhatnagar, Won Chang, Seonjin Kim, Jiali Wang

Abstract: Computer models play a key role in many scientific and engineering problems. One major source of uncertainty in computer model experiment is input parameter uncertainty. Computer model calibration is a formal statistical procedure to infer input parameters by combining information from model runs and observational data. The existing standard calibration framework suffers from inferential issues when the model output and observational data are high-dimensional dependent data such as large time series due to the difficulty in building an emulator and the non-identifiability between effects from input parameters and data-model discrepancy. To overcome these challenges we propose a new calibration framework based on a deep neural network (DNN) with long-short term memory layers that directly emulates the inverse relationship between the model output and input parameters. Adopting the 'learning with noise' idea we train our DNN model to filter out the effects from data model discrepancy on input parameter inference. We also formulate a new way to construct interval predictions for DNN using quantile regression to quantify the uncertainty in input parameter estimates. Through a simulation study and real data application with WRF-hydro model we show that our approach can yield accurate point estimates and well calibrated interval estimates for input parameters.

Submitted 8 September, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

arXiv:2007.14290 [pdf, other]

Learning Stable Manoeuvres in Quadruped Robots from Expert Demonstrations

Authors: Sashank Tirumala, Sagar Gubbi, Kartik Paigwar, Aditya Sagi, Ashish Joglekar, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, Shishir Kolathaya

Abstract: With the research into development of quadruped robots picking up pace, learning based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radius. These policies are then augmented using a higher level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in target velocity, radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires far fewer expert demonstrations compared to standard neural network architectures. Finally, we demonstrate these results experimentally on the in-house quadruped Stoch 2.

Submitted 28 July, 2020; originally announced July 2020.

Comments: 6 pages, Robot and Human Interaction Conference Italy 2020

arXiv:2002.02084 [pdf, other]

doi 10.1109/ISGT-Europe47291.2020.9248952

A Stochastic Game Framework for Efficient Energy Management in Microgrid Networks

Authors: Shravan Nayak, Chanakya Ajit Ekbote, Annanya Pratap Singh Chauhan, Raghuram Bharadwaj Diddigi, Prishita Ray, Abhinava Sikdar, Sai Koti Reddy Danda, Shalabh Bhatnagar

Abstract: We consider the problem of energy management in microgrid networks. A microgrid is capable of generating a limited amount of energy from a renewable resource and is responsible for handling the demands of its dedicated customers. Owing to the variable nature of renewable generation and the demands of the customers, it becomes imperative that each microgrid optimally manages its energy. This involves intelligently scheduling the demands at the customer side, selling (when there is a surplus) and buying (when there is a deficit) the power from its neighboring microgrids depending on its current and future needs. Typically, the transaction of power among the microgrids happens at a pre-decided price by the central grid. In this work, we formulate the problems of demand and battery scheduling, energy trading and dynamic pricing (where we allow the microgrids to decide the price of the transaction depending on their current configuration of demand and renewable energy) in the framework of stochastic games. Subsequently, we propose a novel approach that makes use of independent learners Deep Q-learning algorithm to solve this problem. Through extensive empirical evaluation, we show that our proposed framework is more beneficial to the majority of the microgrids and we provide a detailed analysis of the results.

Submitted 15 November, 2020; v1 submitted 5 February, 2020; originally announced February 2020.

arXiv:1912.12907 [pdf, other]

Gait Library Synthesis for Quadruped Robots via Augmented Random Search

Authors: Sashank Tirumala, Aditya Sagi, Kartik Paigwar, Ashish Joglekar, Shalabh Bhatnagar, Ashitava Ghosal, Bharadwaj Amrutur, Shishir Kolathaya

Abstract: In this paper, with a view toward fast deployment of learned locomotion gaits in low-cost hardware, we generate a library of walking trajectories, namely, forward trot, backward trot, side-step, and turn in our custom-built quadruped robot, Stoch 2, using reinforcement learning. There are existing approaches that determine optimal policies for each time step, whereas we determine an optimal policy, in the form of end-foot trajectories, for each half walking step, i.e., swing phase and stance phase. The way-points for the foot trajectories are obtained from a linear policy, i.e., a linear function of the states of the robot, and cubic splines are used to interpolate between these points. Augmented Random Search, a model-free and gradient-free learning algorithm, is used to learn the policy in simulation. This learned policy is then deployed on hardware, yielding a trajectory in every half walking step. Different locomotion patterns are learned in simulation by enforcing a preconfigured phase shift between the trajectories of different legs. The transition from one gait to another is achieved by using a low-pass filter for the phase, and the sim-to-real transfer is improved by a linear transformation of the states obtained through regression.

Submitted 30 December, 2019; originally announced December 2019.

Comments: 7 pages, 11 figures, 1 table
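
The gradient-free training mentioned in the abstract above can be illustrated with the basic random search update that Augmented Random Search builds on: perturb the policy parameters in random directions, evaluate each perturbation in both signs, and move along the directions weighted by the reward difference. The sketch below is a simplified version under stated assumptions. It omits the reward-standard-deviation scaling, top-direction selection and state normalization of full ARS, it treats the policy parameters as a flat list, and the names `random_search_step` and `toy_reward` as well as the quadratic toy reward (standing in for a simulated rollout) are hypothetical.

```python
import random

def random_search_step(theta, rollout_reward, rng,
                       n_dirs=8, step_size=0.05, noise=0.1):
    """One basic random-search update on flat policy parameters `theta`."""
    dim = len(theta)
    update = [0.0] * dim
    for _ in range(n_dirs):
        # Sample a random direction and evaluate both perturbations.
        delta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [t + noise * d for t, d in zip(theta, delta)]
        minus = [t - noise * d for t, d in zip(theta, delta)]
        diff = rollout_reward(plus) - rollout_reward(minus)
        # Accumulate the direction, weighted by the reward difference.
        for i in range(dim):
            update[i] += diff * delta[i]
    return [t + step_size / n_dirs * u for t, u in zip(theta, update)]

def toy_reward(theta, target=(1.0, -1.0)):
    """Toy stand-in for an episode return: peak reward at `target`."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(500):
    theta = random_search_step(theta, toy_reward, rng)
```

No gradient of the reward is ever computed; only reward evaluations are needed, which is what makes this family of methods usable with a black-box simulator.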

arXiv:1911.08826 [pdf, other]

Hierarchical Average Reward Policy Gradient Algorithms

Authors: Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar

Abstract: Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.

Submitted 20 November, 2019; originally announced November 2019.

Comments: 6 pages, 3 figures, to be published in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence

arXiv:1911.05697 [pdf, other]

A Convergent Off-Policy Temporal Difference Algorithm

Authors: Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm.

Submitted 13 November, 2019; originally announced November 2019.

arXiv:1911.00397 [pdf, ps, other]

doi 10.1109/LCSYS.2020.2970555

Generalized Speedy Q-learning

Authors: Indu John, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: In this paper, we derive a generalization of the Speedy Q-learning (SQL) algorithm that was proposed in the Reinforcement Learning (RL) literature to handle slow convergence of Watkins' Q-learning. In most RL algorithms such as Q-learning, the Bellman equation and the Bellman operator play an important role. It is possible to generalize the Bellman operator using the technique of successive relaxation. We use the generalized Bellman operator to derive a simple and efficient family of algorithms called Generalized Speedy Q-learning (GSQL-w) and analyze its finite time performance. We show that GSQL-w has an improved finite time performance bound compared to SQL for the case when the relaxation parameter w is greater than 1. This improvement is a consequence of the contraction factor of the generalized Bellman operator being less than that of the standard Bellman operator. Numerical experiments are provided to demonstrate the empirical performance of the GSQL-w algorithm.

Submitted 12 February, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

Journal ref: in IEEE Control Systems Letters, vol. 4, no. 3, pp. 524-529, July 2020
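
The successive relaxation idea mentioned in the abstract above can be sketched on a known model. The abstract does not state the operator, so the form used here is an assumption taken from the successive over-relaxation literature: (T_w Q)(s,a) = w (r(s,a) + γ Σ_s' p(s'|s,a) max_b Q(s',b)) + (1 - w) max_c Q(s,c). For w = 1 this is the standard Bellman operator. For w > 1 it remains a contraction, with factor 1 - w + wγ, only while w ≤ 1 / (1 - γ min_{s,a} p(s|s,a)), which requires every state-action pair to have a positive self-loop probability. The sketch is model-based value iteration, not the sample-based GSQL-w algorithm, and the function name and toy MDP are hypothetical.

```python
def relaxed_value_iteration(P, R, gamma, w, iters=2000):
    """Iterate the (assumed) relaxed Bellman operator T_w on a known MDP.

    P[s][a][t] is the probability of moving from s to t under action a,
    R[s][a] is the expected reward, and w is the relaxation parameter.
    """
    n_s, n_a = len(R), len(R[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        v = [max(row) for row in q]  # v[s] = max_a q[s][a]
        q = [[w * (R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_s)))
              + (1.0 - w) * v[s]
              for a in range(n_a)]
             for s in range(n_s)]
    return q

# Toy 2-state, 2-action MDP; every self-loop probability is at least 0.5,
# so with gamma = 0.9 any w up to 1 / (1 - 0.45), about 1.82, is admissible.
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.4, 0.6], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
q_standard = relaxed_value_iteration(P, R, gamma=0.9, w=1.0)
q_relaxed = relaxed_value_iteration(P, R, gamma=0.9, w=1.5)
```

The two Q-tables differ entry by entry, but under the assumed operator their row-wise maxima agree, so both runs recover the same optimal state values and greedy policy.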

arXiv:1906.06659 [pdf, ps, other]

doi 10.1109/TAC.2022.3159453

A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games

Authors: Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff, starting from a given state is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation that has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques, under an assumption on the boundedness of iterates. Through experiments, we demonstrate the effectiveness of our proposed algorithm.

Submitted 18 March, 2022; v1 submitted 16 June, 2019; originally announced June 2019.

Showing 1–50 of 91 results for author: Bhatnagar, S