- research-article, October 2023
Optimizing K-Mer Fingerprint Generation for Machine Learning
BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Article No.: 101, Pages 1–5, https://doi.org/10.1145/3584371.3612946
With the increasing availability of genomic data obtained through Whole-Genome Sequencing (WGS), Machine Learning (ML) algorithms are being used to analyze this data. However, processing large datasets or files poses challenges. One approach is to ...
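As a hedged illustration of the idea in the entry above (not the paper's actual pipeline), a k-mer fingerprint can be sketched in plain Python: slide a length-k window over a DNA sequence and count occurrences of every possible k-mer, yielding a fixed-length feature vector that ML models can consume. The function name and alphabet handling here are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def kmer_fingerprint(seq, k=2):
    """Count every length-k window of `seq` over all 4**k DNA k-mers,
    returning a fixed-length vector in lexicographic k-mer order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    alphabet = (''.join(p) for p in product('ACGT', repeat=k))
    return [counts.get(kmer, 0) for kmer in alphabet]
```

Because the vector length depends only on k, fingerprints of different genomes are directly comparable as ML features.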
- abstract, June 2023
Brief Announcement: Optimized GPU-accelerated Feature Extraction for ORB-SLAM Systems
SPAA '23: Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, Pages 299–302, https://doi.org/10.1145/3558481.3591310
Reducing the execution time of the ORB-SLAM algorithm is a crucial aspect of autonomous vehicles since it is computationally intensive for embedded boards. We propose a parallel GPU-based implementation, able to run on embedded boards, of the Tracking part ...
- research-article, September 2021
Warp-centric K-Nearest Neighbor Graphs construction on GPU
ICPP Workshops '21: 50th International Conference on Parallel Processing Workshop, Article No.: 5, Pages 1–10, https://doi.org/10.1145/3458744.3474053
Recent advances and applications of machine learning algorithms are becoming more common in different fields. It is expected that some applications require the processing of large datasets with those algorithms, which leads to high computational costs. ...
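To make the data structure above concrete, here is a minimal brute-force sketch of k-nearest-neighbor graph construction. The paper's contribution is the warp-centric GPU parallelization of the distance computations; this sketch shows only what the graph is, with hypothetical names.

```python
import math

def knn_graph(points, k):
    """Brute-force KNN graph: for each point, the indices of its k
    closest other points by Euclidean distance."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        graph.append([j for _, j in dists[:k]])
    return graph
```

The O(n²) distance step is exactly the part that maps well onto thousands of GPU threads.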
- poster, June 2021
CharminG: A Scalable GPU-resident Runtime System
HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, Pages 261–262, https://doi.org/10.1145/3431379.3464454
Host-driven execution of applications on modern GPU-accelerated systems suffers from frequent host-device synchronizations, data movement, and limited flexibility in scheduling user tasks. We present CharminG, a runtime system designed to run entirely on ...
- research-article, June 2021
SnuRHAC: A Runtime for Heterogeneous Accelerator Clusters with CUDA Unified Memory
HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, Pages 107–120, https://doi.org/10.1145/3431379.3460647
This paper proposes a framework called SnuRHAC, which provides an illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code ...
- research-article, June 2021
TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes
HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, Pages 95–106, https://doi.org/10.1145/3431379.3460645
MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development ...
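As a rough sketch of what "canonicalizing" a derived datatype means (illustrative only, not TEMPI's actual API): an MPI vector datatype describes count blocks of blocklength elements separated by a stride, and expanding that recursive description into a flat list of element offsets is what lets a library reason uniformly about packing.

```python
def vector_offsets(count, blocklength, stride):
    """Flatten an MPI_Type_vector-style description (count blocks of
    blocklength elements, stride elements apart) into element offsets."""
    return [b * stride + e for b in range(count) for e in range(blocklength)]

def pack(buffer, offsets):
    """Gather the non-contiguous elements into a contiguous list,
    as a pack routine would before sending."""
    return [buffer[o] for o in offsets]
```

On a GPU, the gather loop becomes one thread per offset, which is why a canonical flat form is convenient.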
- research-article, October 2020
SphericRTC: A System for Content-Adaptive Real-Time 360-Degree Video Communication
MM '20: Proceedings of the 28th ACM International Conference on Multimedia, Pages 3595–3603, https://doi.org/10.1145/3394171.3413999
We present the SphericRTC system for real-time 360-degree video communication. 360-degree video allows the viewer to observe the environment in any direction from the camera location. This more-immersive streaming experience allows users to more-...
- research-article, September 2020
cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data
- Jiannan Tian,
- Sheng Di,
- Kai Zhao,
- Cody Rivera,
- Megan Hickman Fulp,
- Robert Underwood,
- Sian Jin,
- Xin Liang,
- Jon Calhoun,
- Dingwen Tao,
- Franck Cappello
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 3–15, https://doi.org/10.1145/3410463.3414624
Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC ...
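A minimal sketch of the error-bound guarantee behind compressors like cuSZ (the real framework adds prediction and entropy coding; this shows only the bound): quantizing each value to the nearest integer multiple of 2·eb keeps the reconstruction error within eb.

```python
def quantize(values, eb):
    """Map each value to the nearest integer multiple of 2*eb."""
    return [round(v / (2 * eb)) for v in values]

def dequantize(codes, eb):
    """Reconstruct values; |original - reconstructed| <= eb by construction."""
    return [c * 2 * eb for c in codes]
```

The small-integer codes are then highly compressible, which is where the storage savings come from.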
- research-article, August 2020
Detailed Analysis and Optimization of CUDA K-means Algorithm
ICPP '20: Proceedings of the 49th International Conference on Parallel Processing, Article No.: 69, Pages 1–11, https://doi.org/10.1145/3404397.3404426
K-means is one of the most frequently used algorithms for unsupervised clustering data analysis. Individual steps of the k-means algorithm include nearest neighbor finding, efficient distance computation, and cluster-wise reduction, which may be ...
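The three steps named in the abstract can be sketched in one plain-Python Lloyd iteration (a toy serial version, not the paper's CUDA kernels): distance computation, nearest-centroid assignment, and cluster-wise reduction to new means.

```python
def kmeans_step(points, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then reduce each cluster to its mean."""
    k, dim = len(centroids), len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # distance computation + nearest-centroid assignment
        c = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
        counts[c] += 1
        for d in range(dim):  # cluster-wise reduction
            sums[c][d] += p[d]
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
            for j in range(k)]
```

On a GPU each of these steps parallelizes differently, which is why the paper analyzes them individually.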
- research-article, August 2020
Massively parallel rendering of complex closed-form implicit surfaces
ACM Transactions on Graphics (TOG), Volume 39, Issue 4, Article No.: 141, Pages 141:1–141:10, https://doi.org/10.1145/3386569.3392429
We present a new method for directly rendering complex closed-form implicit surfaces on modern GPUs, taking advantage of their massive parallelism. Our model representation is unambiguously solid, can be sampled at arbitrary resolution, and supports both ...
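To illustrate what "unambiguously solid" means for a closed-form implicit surface (a toy example, not the paper's representation): the model is a function f(x, y, z) that is negative inside the solid and positive outside, so rendering reduces to evaluating f at many sample points in parallel.

```python
def sphere(x, y, z, r=1.0):
    """Closed-form implicit function: negative inside, positive outside."""
    return x * x + y * y + z * z - r * r

def classify(f, samples):
    """Evaluate f at each sample point; True means inside the solid."""
    return [f(*p) < 0 for p in samples]
```

Because every sample is independent, this evaluation maps directly onto the GPU's massive parallelism.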
- short-paper, April 2019
Simultaneous Solving of Batched Linear Programs on a GPU
ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, Pages 59–66, https://doi.org/10.1145/3297663.3310308
Linear Programs (LPs) appear in a large number of applications. Offloading the LP solving tasks to a GPU is viable to accelerate an application's performance. Existing work on offloading and solving an LP on a GPU shows that performance can be ...
Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects
ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, Pages 209–218, https://doi.org/10.1145/3297663.3310299
Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide ...
- research-article, March 2018
Fast and accurate volume data curvature determination using GPGPU computation
ACMSE '18: Proceedings of the 2018 ACM Southeast Conference, Article No.: 19, Pages 1–8, https://doi.org/10.1145/3190645.3190681
A methodology for fast determination of a key shape feature in volume datasets using a GPU is described. The shape feature, surface curvature, which is a valuable descriptor for structure classification and dataset registration applications, can be time-...
- short-paper, August 2017
An Out-of-Core GPU based Dimensionality Reduction Algorithm for Big Mass Spectrometry Data and Its Application in Bottom-up Proteomics
ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Pages 550–555, https://doi.org/10.1145/3107411.3107466
Modern high resolution Mass Spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks but only a small number of peaks actively contribute to deduction of peptides. ...
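A hedged sketch of the dimensionality-reduction idea in the abstract (the peak-count threshold and function name are illustrative assumptions): each spectrum is a list of (m/z, intensity) peaks, and only the highest-intensity peaks are kept before downstream peptide deduction.

```python
def reduce_spectrum(peaks, keep=3):
    """Keep only the `keep` most intense (m/z, intensity) peaks,
    returned in ascending m/z order."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:keep]
    return sorted(top)  # tuples sort by m/z, the first element
```

Applied across millions of spectra, this per-spectrum selection is the embarrassingly parallel workload the out-of-core GPU algorithm targets.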
- poster, September 2016
POSTER: hVISC: A Portable Abstraction for Heterogeneous Parallel Systems
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 443–445, https://doi.org/10.1145/2967938.2976039
Programming heterogeneous parallel systems can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. ...
- poster, September 2016
POSTER: Collective Dynamic Parallelism for Directive Based GPU Programming Languages and Compilers
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 423–424, https://doi.org/10.1145/2967938.2974056
Early programs for GPU (Graphics Processing Units) acceleration were based on a flat, bulk parallel programming model, in which programs had to perform a sequence of kernel launches from the host CPU. In the latest releases of these devices, dynamic (or ...
- poster, July 2016
A CUDA Implementation of an Improved Decomposition Based Evolutionary Algorithm for Multi-Objective Optimization
GECCO '16 Companion: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, Pages 71–72, https://doi.org/10.1145/2908961.2908971
In the last few years, the concept of decomposition has been extensively used in a number of evolutionary algorithms, wherein a multiobjective problem is solved as a set of single objective sub-problems. Such algorithms have demonstrated significant break-...
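The decomposition concept mentioned above can be sketched with the simplest scalarization, a weighted sum (one common choice among several; the function name is illustrative): each weight vector turns the multi-objective vector into one single-objective sub-problem.

```python
def decompose(objectives, weight_vectors):
    """Weighted-sum decomposition: scalarize a multi-objective vector into
    one single-objective value per weight vector (sub-problem)."""
    return [sum(w * f for w, f in zip(ws, objectives)) for ws in weight_vectors]
```

Since every sub-problem is evaluated independently, the set of scalarizations is a natural unit of GPU parallelism.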
- research-article, May 2016
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs
HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, Pages 219–230, https://doi.org/10.1145/2907294.2907297
Matrix factorization (MF) is used by many popular algorithms such as collaborative filtering. A GPU, with massive cores and high memory bandwidth, sheds light on accelerating MF much further when its architectural characteristics are appropriately exploited. ...
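To show what is being parallelized, here is a toy serial SGD matrix factorization for collaborative filtering (hyperparameters and names are assumptions; the paper's contribution is batching such updates efficiently on GPUs): learn user and item factor vectors whose dot products approximate observed ratings.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, lr=0.05, epochs=2000, seed=0):
    """Learn factors so dot(U[u], V[i]) approximates rating r for each
    observed (u, i, r) triple, via plain stochastic gradient descent."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(rank):  # gradient step on both factor vectors
                U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                                    V[i][f] + lr * err * U[u][f])
    return U, V
```

The inner update touches only one user row and one item row, which is what makes many such updates runnable concurrently on a GPU.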
- research-article, May 2016
GPU Delegation: Toward a Generic Approach for Developping MABS using GPU Programming
AAMAS '16: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, Pages 1249–1258
When using Multi-Agent Based Simulation (MABS), computing resource requirements often limit the extent to which a model can be experimented with. As the number of agents and the size of the environment are constantly growing in these simulations, using General-...
- research-article, June 2015
Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications
HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, Pages 259–270, https://doi.org/10.1145/2749246.2749255
This paper proposes an end-to-end framework for automatically transforming stencil-based CUDA programs to exploit inter-kernel data locality. The CUDA-to-CUDA transformation collectively replaces the user-written kernels by auto-generated kernels ...
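The inter-kernel data locality idea can be illustrated with a toy 1D example (periodic boundaries and names are assumptions, not the framework's transformation): two back-to-back 3-point stencil passes compose into a single 5-point stencil, so the intermediate array never round-trips through memory.

```python
def blur(a):
    """3-point periodic averaging stencil."""
    n = len(a)
    return [(a[(i - 1) % n] + a[i] + a[(i + 1) % n]) / 3 for i in range(n)]

def two_kernels(a):
    return blur(blur(a))  # intermediate array is written out, then read back

def fused_kernel(a):
    """Composing the two 3-point stencils yields one 5-point stencil
    with weights [1, 2, 3, 2, 1] / 9, eliminating the intermediate array."""
    n = len(a)
    w = [1, 2, 3, 2, 1]
    return [sum(w[k] * a[(i + k - 2) % n] for k in range(5)) / 9
            for i in range(n)]
```

Both versions compute the same result; the fused form trades redundant arithmetic for far less memory traffic, which is the winning trade on GPUs.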