Showing 1–13 of 13 results for author: Goiri, Í

  1. arXiv:2409.17264  [pdf, other]

    cs.LG cs.DC

    Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

    Authors: Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, Esha Choukse

    Abstract: As large language models (LLMs) evolve to handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the unique challenges of inference, such as varying prefill and decode phases and their associated latency constraints - like Time to Fi…

    Submitted 25 September, 2024; originally announced September 2024.

  2. arXiv:2408.13510  [pdf, other]

    cs.DC eess.SY

    Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling

    Authors: Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

    Abstract: Large Language Model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements which should ideally be accounted for when scheduling input queries across different LLM instances in a cluster. However existing scheduling algorithms treat LLM workloads as monolithic jobs without considering the distinct characteristics of the two phases in each workload.…

    Submitted 24 August, 2024; originally announced August 2024.

    Comments: 16 pages, 8 figures

  3. arXiv:2408.00741  [pdf, other]

    cs.AI cs.AR cs.DC

    DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

    Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse

    Abstract: The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs causing the inference clusters to consume large amount of energy an…

    Submitted 1 August, 2024; originally announced August 2024.

  4. arXiv:2405.07250  [pdf]

    cs.DC

    Towards Cloud Efficiency with Large-scale Workload Characterization

    Authors: Anjaly Parayil, Jue Zhang, Xiaoting Qin, Íñigo Goiri, Lexiang Huang, Timothy Zhu, Chetan Bansal

    Abstract: Cloud providers introduce features (e.g., Spot VMs, Harvest VMs, and Burstable VMs) and optimizations (e.g., oversubscription, auto-scaling, power harvesting, and overclocking) to improve efficiency and reliability. To effectively utilize these features, it's crucial to understand the characteristics of workloads running in the cloud. However, workload characteristics can be complex and depend on…

    Submitted 12 May, 2024; originally announced May 2024.

    Comments: 6 figures, 13 Tables

  5. arXiv:2404.19143  [pdf, other]

    cs.DC

    Workload Intelligence: Punching Holes Through the Cloud Abstraction

    Authors: Lexiang Huang, Anjaly Parayil, Jue Zhang, Xiaoting Qin, Chetan Bansal, Jovan Stojkovic, Pantea Zardoshti, Pulkit Misra, Eli Cortez, Raphael Ghelman, Íñigo Goiri, Saravan Rajmohan, Jim Kleewein, Rodrigo Fonseca, Timothy Zhu, Ricardo Bianchini

    Abstract: Today, cloud workloads are essentially opaque to the cloud platform. Typically, the only information the platform receives is the virtual machine (VM) type and possibly a decoration to the type (e.g., the VM is evictable). Similarly, workloads receive little to no information from the platform; generally, workloads might receive telemetry from their VMs or exceptional signals (e.g., shortly before…

    Submitted 29 April, 2024; originally announced April 2024.

  6. arXiv:2403.20306  [pdf, other]

    cs.AI cs.AR cs.DC

    Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference

    Authors: Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas

    Abstract: With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models.…

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: 6 pages, 15 figures

    ACM Class: C.0; I.2

  7. arXiv:2403.03377  [pdf, other]

    cs.DC

    Junctiond: Extending FaaS Runtimes with Kernel-Bypass

    Authors: Enrique Saurez, Joshua Fried, Gohar Irfan Chaudhry, Esha Choukse, Íñigo Goiri, Sameh Elnikety, Adam Belay, Rodrigo Fonseca

    Abstract: This report explores the use of kernel-bypass networking in FaaS runtimes and demonstrates how using Junction, a novel kernel-bypass system, as the backend for executing components in faasd can enhance performance and isolation. Junction achieves this by reducing network and compute overheads and minimizing interactions with the host operating system. Junctiond, the integration of Junction with fa…

    Submitted 7 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

  8. arXiv:2401.07033  [pdf, other]

    cs.HC

    Risk-aware Adaptive Virtual CPU Oversubscription in Microsoft Cloud via Prototypical Human-in-the-loop Imitation Learning

    Authors: Lu Wang, Mayukh Das, Fangkai Yang, Junjie Sheng, Bo Qiao, Hang Dong, Si Qin, Victor Rühle, Chetan Bansal, Eli Cortez, Íñigo Goiri, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

    Abstract: Oversubscription is a prevalent practice in cloud services where the system offers more virtual resources, such as virtual cores in virtual machines, to users or applications than its available physical capacity for reducing revenue loss due to unused/redundant capacity. While oversubscription can potentially lead to significant enhancement in efficient resource utilization, the caveat is that it…

    Submitted 13 January, 2024; originally announced January 2024.

    Comments: 9 pages, 3 figures

  9. arXiv:2311.18677  [pdf, other]

    cs.AR cs.DC

    Splitwise: Efficient generative LLM inference using phase splitting

    Authors: Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini

    Abstract: Recent innovations in generative large language models (LLMs) have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main…

    Submitted 20 May, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: 12 pages, 19 figures

    MSC Class: I.2.0; I.3.1; C.4

  10. arXiv:2308.12908  [pdf, other]

    cs.DC cs.AR cs.LG

    POLCA: Power Oversubscription in LLM Cloud Providers

    Authors: Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini

    Abstract: Recent innovation in large language models (LLMs), and their myriad use-cases have rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud providers and other enterprises have made substantial plans of growth in their datacenters to support these new workloads. One of the key bottleneck resources in datacenters is power, and given the increasing model sizes of LLMs, they a…

    Submitted 24 August, 2023; originally announced August 2023.

  11. arXiv:2104.13869  [pdf, other]

    cs.DC

    Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications

    Authors: Francisco Romero, Gohar Irfan Chaudhry, Íñigo Goiri, Pragna Gopa, Paul Batum, Neeraja J. Yadwadkar, Rodrigo Fonseca, Christos Kozyrakis, Ricardo Bianchini

    Abstract: Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: 18 pages, 15 figures

  12. arXiv:2003.03423  [pdf, other]

    cs.DC

    Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider

    Authors: Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini

    Abstract: Function as a Service (FaaS) has been gaining popularity as a way to deploy computations to serverless backends in the cloud. This paradigm shifts the complexity of allocating and provisioning resources to the cloud provider, which has to provide the illusion of always-available resources (i.e., fast function invocations without cold starts) at the lowest possible resource cost. Doing so requires…

    Submitted 5 June, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

    Comments: 14 pages, 20 figures. Corrected and published in USENIX ATC, July 2020. For accompanying dataset, see https://github.com/Azure/AzurePublicDataset

  13. C2MS: Dynamic Monitoring and Management of Cloud Infrastructures

    Authors: Gary A. McGilvary, Josep Rius, Íñigo Goiri, Francesc Solsona, Adam Barker, Malcolm Atkinson

    Abstract: Server clustering is a common design principle employed by many organisations who require high availability, scalability and easier management of their infrastructure. Servers are typically clustered according to the service they provide whether it be the application(s) installed, the role of the server or server accessibility for example. In order to optimize performance, manage load and maintain…

    Submitted 3 October, 2013; originally announced October 2013.

    Comments: Proceedings of the 5th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2013), 8 pages