AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment

Authors: Nan Sun, Bo Mao, Yongchang Li, Lumeng Ma, Di Guo, Huaping Liu

Abstract: The increasing demand for intelligent assistants in human-populated environments has motivated significant research in autonomous robotic systems. Traditional service robots and virtual assistants, however, struggle with real-world task execution due to their limited capacity for dynamic reasoning and interaction, particularly when human collaboration is required. Recent developments in Large Language Models have opened new avenues for improving these systems, enabling more sophisticated reasoning and natural interaction capabilities. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed to operate autonomously in a physical office environment. Unlike conventional service robots, AssistantX leverages a novel multi-agent architecture, PPDR4X, which provides advanced inference capabilities and comprehensive collaboration awareness. By effectively bridging the gap between virtual operations and physical interactions, AssistantX demonstrates robust performance in managing complex real-world scenarios. Our evaluation highlights the architecture's effectiveness, showing that AssistantX can respond to clear instructions, actively retrieve supplementary information from memory, and proactively seek collaboration from team members to ensure successful task completion. More details and videos can be found at https://assistantx-agent.github.io/AssistantX/.

Submitted 26 September, 2024; originally announced September 2024.

Comments: 6 pages, 8 figures, 4 tables

arXiv:2409.16730 [pdf, ps, other]

Non-stationary BERT: Exploring Augmented IMU Data For Robust Human Activity Recognition

Authors: Ning Sun, Yufei Wang, Yuwei Zhang, Jixiang Wan, Shenyue Wang, Ping Liu, Xudong Zhang

Abstract: Human Activity Recognition (HAR) has gained great attention from researchers due to the popularity of mobile devices and the need to observe users' daily activity data for better human-computer interaction. In this work, we collect a human activity recognition dataset called OPPOHAR consisting of phone IMU data. To facilitate the deployment of HAR systems on mobile phones and to achieve user-specific activity recognition, we propose a novel light-weight network called Non-stationary BERT with a two-stage training method. We also propose a simple yet effective data augmentation method to explore the deeper relationship between the accelerometer and gyroscope data from the IMU. The network achieves state-of-the-art performance when tested on various activity recognition datasets, and the data augmentation method demonstrates its wide applicability.

Submitted 25 September, 2024; originally announced September 2024.

arXiv:2407.02763 [pdf, other]

ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

Authors: Yanfeng Jiang, Ning Sun, Xueshuo Xie, Fei Yang, Tao Li

Abstract: Vision Transformers (ViTs) have exhibited exceptional performance across diverse computer vision tasks, while their substantial parameter size incurs significantly increased memory and computational demands, impeding effective inference on resource-constrained devices. Quantization has emerged as a promising solution to mitigate these challenges, yet existing methods still suffer from significant accuracy loss at low-bit. We attribute this issue to the distinctive distributions of post-LayerNorm and post-GELU activations within ViTs, rendering conventional hardware-friendly quantizers ineffective, particularly in low-bit scenarios. To address this issue, we propose a novel framework called Activation-Distribution-Friendly post-training Quantization for Vision Transformers, ADFQ-ViT. Concretely, we introduce the Per-Patch Outlier-aware Quantizer to tackle irregular outliers in post-LayerNorm activations. This quantizer refines the granularity of the uniform quantizer to a per-patch level while retaining a minimal subset of values exceeding a threshold at full-precision. To handle the non-uniform distributions of post-GELU activations between positive and negative regions, we design the Shift-Log2 Quantizer, which shifts all elements to the positive region and then applies log2 quantization. Moreover, we present the Attention-score enhanced Module-wise Optimization which adjusts the parameters of each quantizer by reconstructing errors to further mitigate quantization error.
Extensive experiments demonstrate ADFQ-ViT provides significant improvements over various baselines in image classification, object detection, and instance segmentation tasks at 4-bit. Specifically, when quantizing the ViT-B model to 4-bit, we achieve a 10.23% improvement in Top-1 accuracy on the ImageNet dataset.

Submitted 14 October, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: 29 pages, 9 figures
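
The shift-then-log2 idea named in the ADFQ-ViT abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `eps` offset, the choice of the maximum as the scale, and the rounding and clamping rules are all assumptions made for illustration, since the abstract only states that elements are shifted to the positive region before log2 quantization.

```python
import math


def shift_log2_quantize(values, bits=4, eps=1e-6):
    """Hypothetical shift-then-log2 quantizer (illustrative only)."""
    # Assumption: shift so that the smallest element becomes eps (> 0),
    # which makes the logarithm well defined for every element.
    shift = eps - min(values)
    shifted = [v + shift for v in values]
    # Assumption: use the largest shifted value as the scale.
    scale = max(shifted)
    max_code = 2 ** bits - 1
    # Code q represents the value scale * 2**(-q); clamp q to the bit budget.
    codes = [
        min(max_code, max(0, round(-math.log2(x / scale))))
        for x in shifted
    ]
    return codes, scale, shift


def shift_log2_dequantize(codes, scale, shift):
    """Map integer codes back to approximate real values."""
    return [scale * 2.0 ** (-c) - shift for c in codes]


if __name__ == "__main__":
    # Made-up activations with a small negative tail, loosely GELU-like.
    acts = [-0.17, 0.0, 0.5, 2.0]
    codes, scale, shift = shift_log2_quantize(acts, bits=4)
    print(codes)
    print(shift_log2_dequantize(codes, scale, shift))
```

In this sketch the largest activation always receives code 0 and the smallest receives the largest code, so resolution is densest near the maximum, which is the usual trade-off of a log2 grid.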

arXiv:2406.17565 [pdf, other]

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Authors: Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

Abstract: Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.15192 [pdf, ps, other]

Setting Targets is All You Need: Improved Order Competitive Ratio for Online Selection

Authors: Liyan Chen, Nuozhou Sun, Zhihao Gavin Tang

Abstract: There is a rising interest in studying the online benchmark as an alternative to the classical offline benchmark in online stochastic settings. Ezra, Feldman, Gravin, and Tang (SODA 2023) introduced the notion of order-competitive ratio, defined as the worst-case ratio between the performance of the best order-unaware algorithm and the best order-aware algorithm, to quantify the loss incurred by the lack of knowledge of the arrival order. They showed that in the online single selection setting (a.k.a. the prophet problem), the optimal order-competitive ratio achieved by deterministic algorithms is $1/\varphi \approx 0.618$, and left open the question of whether randomized algorithms can do better. We answer the open question firmly by introducing a novel family of algorithms called \emph{targeted value algorithms}. We show that the task of online selection is as easy as guessing the optimal online benchmark. Specifically, we provide 1) an alternative optimal $1/\varphi$ order-competitive algorithm by setting the targeted value deterministically, and 2) a $0.732$ order-competitive algorithm by setting the targeted value randomly. We further provide a $0.758$ upper bound on the order-competitive ratio of our algorithm, showing that our analysis is close to the best possible, and establish an upper bound of $0.829$ on the order-competitive ratio for general randomized order-unaware algorithms.

Submitted 21 June, 2024; originally announced June 2024.
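
For readers checking the constants quoted in the abstract above: assuming $\varphi$ denotes the golden ratio (the abstract does not say so explicitly, but the value $0.618$ matches), the deterministic ratio can be written out as follows.

```latex
\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618,
\qquad
\frac{1}{\varphi} = \varphi - 1 = \frac{\sqrt{5}-1}{2} \approx 0.618 .
```

Under that reading, the randomized ratio of $0.732$ reported in the abstract lies strictly between the deterministic optimum $1/\varphi \approx 0.618$ and the general upper bound of $0.829$.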

arXiv:2405.07608 [pdf, other]

FNCC: Fast Notification Congestion Control in Data Center Networks

Authors: Jing Xu, Zhan Wang, Fan Yang, Ning Kang, Zhenlong Ma, Guojun Yuan, Guangming Tan, Ninghui Sun

Abstract: Congestion control plays a pivotal role in large-scale data centers, facilitating ultra-low latency, high bandwidth, and optimal utilization. Even with the deployment of data center congestion control mechanisms such as DCQCN and HPCC, these algorithms often respond to congestion sluggishly. This sluggishness is primarily due to the slow notification of congestion. It takes almost one round-trip time (RTT) for the congestion information to reach the sender. In this paper, we introduce the Fast Notification Congestion Control (FNCC) mechanism, which achieves sub-RTT notification. FNCC leverages the acknowledgment packet (ACK) from the return path to carry in-network telemetry (INT) information of the request path, offering the sender more timely and accurate INT. To further accelerate the responsiveness of last-hop congestion control, we propose that the receiver notifies the sender of the number of concurrent congested flows, which can be used to adjust the congested flows to a fair rate quickly. Our experimental results demonstrate that FNCC reduces flow completion time by 27.4% and 88.9% compared to HPCC and DCQCN, respectively. Moreover, FNCC triggers minimal pause frames and maintains high utilization even at 400Gbps.

Submitted 26 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.05170 [pdf, other]

Picking watermarks from noise (PWFN): an improved robust watermarking model against intensive distortions

Authors: Sijing Xie, Chengxin Zhao, Nan Sun, Wei Li, Hefei Ling

Abstract: Digital watermarking is the process of embedding secret information by altering images in a way that is undetectable to the human eye. To increase the robustness of the model, many deep learning-based watermarking methods use the encoder-noise-decoder architecture by adding different noises to the noise layer. The decoder then extracts the watermarked information from the distorted image. However, this method can only resist weak noise attacks. To improve the robustness of the decoder against stronger noise, this paper proposes to introduce a denoise module between the noise layer and the decoder. The module aims to reduce noise and recover some of the information lost to distortion. Additionally, the paper introduces the SE module to fuse the watermarking information pixel-wise and channel dimensions-wise, improving the encoder's efficiency. Experimental results show that our proposed method is comparable to existing models and outperforms the state of the art under different noise intensities. In addition, ablation experiments show the superiority of our proposed module.

Submitted 17 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.03458 [pdf, other]

SSyncOA: Self-synchronizing Object-aligned Watermarking to Resist Cropping-paste Attacks

Authors: Chengxin Zhao, Hefei Ling, Sijing Xie, Han Fang, Yaokun Fang, Nan Sun

Abstract: Modern image processing tools have made it easy for attackers to crop the region or object of interest in images and paste it into other images. The challenge this cropping-paste attack poses to the watermarking technology is that it breaks the synchronization of the image watermark, introducing multiple superimposed desynchronization distortions, such as rotation, scaling, and translation. However, current watermarking methods can only resist a single type of desynchronization and cannot be applied to protect the object's copyright under the cropping-paste attack. With the finding that the key to resisting the cropping-paste attack lies in robust features of the object to protect, this paper proposes a self-synchronizing object-aligned watermarking method, called SSyncOA. Specifically, we first constrain the watermarked region to be aligned with the protected object, and then synchronize the watermark's translation, rotation, and scaling distortions by normalizing the object invariant features, i.e., its centroid, principal orientation, and minimum bounding square, respectively. To make the watermark embedded in the protected object, we introduce the object-aligned watermarking model, which incorporates the real cropping-paste attack into the encoder-noise layer-decoder pipeline and is optimized end-to-end. Besides, we illustrate the effect of different desynchronization distortions on the watermark training, which confirms the necessity of the self-synchronization process.
Extensive experiments demonstrate the superiority of our method over other state-of-the-art methods.

Submitted 6 May, 2024; originally announced May 2024.

Comments: 7 pages, 5 figures (accepted by ICME 2024)

arXiv:2405.03436 [pdf, other]

DBDH: A Dual-Branch Dual-Head Neural Network for Invisible Embedded Regions Localization

Authors: Chengxin Zhao, Hefei Ling, Sijing Xie, Nan Sun, Zongyi Li, Yuxuan Shi, Jiazhong Chen

Abstract: Embedding invisible hyperlinks or hidden codes in images to replace QR codes has become a hot topic recently. This technology requires first localizing the embedded region in the captured photos before decoding. Existing methods that train models to find the invisible embedded region struggle to obtain accurate localization results, leading to degraded decoding accuracy. This limitation is primarily because the CNN network is sensitive to low-frequency signals, while the embedded signal is typically in the high-frequency form. Based on this, this paper proposes a Dual-Branch Dual-Head (DBDH) neural network tailored for the precise localization of invisible embedded regions. Specifically, DBDH uses a low-level texture branch containing 62 high-pass filters to capture the high-frequency signals induced by embedding. A high-level context branch is used to extract discriminative features between the embedded and normal regions. DBDH employs a detection head to directly detect the four vertices of the embedding region. In addition, we introduce an extra segmentation head to segment the mask of the embedding region during training. The segmentation head provides pixel-level supervision for model learning, facilitating better learning of the embedded signals. Based on two state-of-the-art invisible offline-to-online messaging methods, we construct two datasets and augmentation strategies for training and testing localization models. Extensive experiments demonstrate the superior performance of the proposed DBDH over existing methods.

Submitted 6 May, 2024; originally announced May 2024.

Comments: 7 pages, 6 figures (accepted by IJCNN 2024)

arXiv:2404.17174 [pdf, other]

Optimizing Cycle Life Prediction of Lithium-ion Batteries via a Physics-Informed Model

Authors: Constantin-Daniel Nicolae, Sara Sameer, Nathan Sun, Karena Yan

Abstract: Accurately measuring the cycle lifetime of commercial lithium-ion batteries is crucial for performance and technology development. We introduce a novel hybrid approach combining a physics-based equation with a self-attention model to predict the cycle lifetimes of commercial lithium iron phosphate graphite cells via early-cycle data. After fitting capacity loss curves to this physics-based equation, we then use a self-attention layer to reconstruct entire battery capacity loss curves. Our model exhibits comparable performance to existing models while predicting more information: the entire capacity loss curve instead of cycle life. This provides more robustness and interpretability: our model does not need to be retrained for a different notion of end-of-life and is backed by physical intuition.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.12674 [pdf, other]

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Authors: Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Abstract: Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of DLRM models with random configurations with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based NLP models with a geomean error of 3.00%.
Moreover, even without actually running ML workloads like DLRMs on the hardware, it is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).

Submitted 27 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

Comments: 12 pages, 11 figures, 4 tables

arXiv:2404.00904 [pdf]

A Fast Percolation-Dijkstra Routing Method for Mega-Constellation Backbone Network

Authors: Shenshen Luan, Luyuan Wang, Yepeng Liu, Ninghan Sun, Ran Zhang

Abstract: Real-time routing for satellite communication in mega-constellations is challenging due to the large scale of network nodes, especially on devices with limited computation such as onboard embedded systems. In this paper, a fast routing method is proposed for mega-constellation backbone networks. Firstly, inspired by the regularity and sparse characteristics of mega-constellations, the 4-degree percolation theory is proposed to describe the node search process. Then, dynamic minimum search and mapping methods are used to narrow down the traversal range. The proposed method performs as well as the heap-optimized Dijkstra algorithm with less memory space and dynamic access. The experimental results show that the method proposed in this paper can significantly reduce routing computation time, especially on onboard, edge-computing, or other computation-limited devices.

Submitted 2 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.
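
The abstract above uses the heap-optimized Dijkstra algorithm as its baseline. A minimal sketch of that baseline on a grid where every node has at most four neighbors is given below. It does not implement the paper's percolation-based method, and the grid model, the function names, and the `cost` callback are assumptions chosen only to mirror the "4-degree" regularity the abstract mentions.

```python
import heapq


def grid_neighbors(node, rows, cols):
    """Yield the up-to-four neighbors of a node on a rows x cols grid."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield (nr, nc)


def dijkstra(rows, cols, cost, source, target):
    """Return the cheapest path cost from source to target.

    `cost(u, v)` is a hypothetical callback giving the non-negative
    link cost between adjacent nodes u and v.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        # Skip stale heap entries left behind by earlier relaxations.
        if d > dist.get(node, float("inf")):
            continue
        for nxt in grid_neighbors(node, rows, cols):
            nd = d + cost(node, nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")


if __name__ == "__main__":
    # Uniform link cost on a 3 x 3 grid, corner to corner.
    print(dijkstra(3, 3, lambda u, v: 1.0, (0, 0), (2, 2)))
```

The heap and the distance dictionary both grow with the number of visited nodes, which is the memory and dynamic-access overhead the abstract says the proposed method reduces.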

arXiv:2403.15779 [pdf, other]

The Frontier of Data Erasure: Machine Unlearning for Large Language Models

Authors: Youyang Qu, Ming Ding, Nan Sun, Kanchana Thilakarathna, Tianqing Zhu, Dusit Niyato

Abstract: Large Language Models (LLMs) are foundational to AI advancements, facilitating applications like predictive text generation. Nonetheless, they pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information from their vast datasets. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns, offering techniques for LLMs to selectively discard certain data. This paper reviews the latest in machine unlearning for LLMs, introducing methods for the targeted forgetting of information to address privacy, ethical, and legal challenges without necessitating full model retraining. It divides existing research into unlearning from unstructured/textual data and structured/classification data, showcasing the effectiveness of these approaches in removing specific data while maintaining model efficacy. Highlighting the practicality of machine unlearning, this analysis also points out the hurdles in preserving model integrity, avoiding excessive or insufficient data removal, and ensuring consistent outputs, underlining the role of machine unlearning in advancing responsible, ethical AI.

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2401.11181 [pdf, other]

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Authors: Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

Abstract: Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in TetriInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.

Submitted 20 January, 2024; originally announced January 2024.
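
The first pillar named in the TetriInfer abstract above, partitioning prompts into fixed-size chunks, can be sketched in a few lines. This is an illustrative assumption of what such a partition might look like: the function name `chunk_prompt` is hypothetical, and the abstract does not say how a short final chunk is handled, so this sketch simply leaves it shorter than the others.

```python
def chunk_prompt(token_ids, chunk_size):
    """Split a prompt's token IDs into consecutive fixed-size chunks.

    Hypothetical helper: the last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        token_ids[i:i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]


if __name__ == "__main__":
    # Ten made-up token IDs split into chunks of four.
    print(chunk_prompt(list(range(10)), 4))
```

A fixed chunk size keeps the amount of prefill work per step roughly constant, which is one way to read the abstract's goal of keeping the accelerator near its computation-saturated limit.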

arXiv:2312.03549 [pdf, other]

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Authors: Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Yuanyuan Wang, Fu Wu, Jiezhong Qiu, Aimin Pan

Abstract: Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over the heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments that involved various scenarios in the heterogeneous NIC environment.
In most cases, our framework achieves performance levels close to those achievable with homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding training efficiency within the pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks in the heterogeneous NIC environment in terms of training efficiency and can be seamlessly integrated with them.

Submitted 29 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: 12 pages

arXiv:2312.02673 [pdf, other]

Robust Backdoor Detection for Deep Learning via Topological Evolution Dynamics

Authors: Xiaoxing Mo, Yechao Zhang, Leo Yu Zhang, Wei Luo, Nan Sun, Shengshan Hu, Shang Gao, Yang Xiang

Abstract: A backdoor attack in deep learning inserts a hidden backdoor in the model to trigger malicious behavior upon specific input patterns. Existing detection approaches assume a metric space (for either the original inputs or their latent representations) in which normal samples and malicious samples are separable. We show that this assumption has a severe limitation by introducing a novel SSDT (Source-Specific and Dynamic-Triggers) backdoor, which obscures the difference between normal samples and malicious samples. To overcome this limitation, we move beyond looking for a perfect metric space that would work for different deep-learning models, and instead resort to more robust topological constructs. We propose TED (Topological Evolution Dynamics) as a model-agnostic basis for robust backdoor detection. The main idea of TED is to view a deep-learning model as a dynamical system that evolves inputs to outputs. In such a dynamical system, a benign input follows a natural evolution trajectory similar to other benign inputs. In contrast, a malicious sample displays a distinct trajectory, since it starts close to benign samples but eventually shifts towards the neighborhood of attacker-specified target samples to activate the backdoor. Extensive evaluations are conducted on vision and natural language datasets across different network architectures. The results demonstrate that TED not only achieves a high detection rate, but also significantly outperforms existing state-of-the-art detection approaches, particularly in addressing the sophisticated SSDT attack.
The code to reproduce the results is made public on GitHub.

Submitted 5 December, 2023; originally announced December 2023.

Comments: 18 pages. To appear in IEEE Symposium on Security and Privacy 2024

arXiv:2310.19624 [pdf, other]

Exploring Post-Training Quantization of Protein Language Models

Authors: Shuang Peng, Fei Yang, Ning Sun, Sheng Chen, Yanfeng Jiang, Aimin Pan

Abstract: Recent advancements in unsupervised protein language models (ProteinLMs), like ESM-1b and ESM-2, have shown promise in different protein prediction tasks. However, these models face challenges due to their high computational demands, significant memory needs, and latency, restricting their usage on devices with limited resources. To tackle this, we explore post-training quantization (PTQ) for ProteinLMs, focusing on ESMFold, a simplified version of AlphaFold based on ESM-2 ProteinLM. Our study is the first attempt to quantize all weights and activations of ProteinLMs. We observed that the typical uniform quantization method performs poorly on ESMFold, causing a significant drop in TM-Score when using 8-bit quantization. We conducted extensive quantization experiments, uncovering unique challenges associated with ESMFold, particularly highly asymmetric activation ranges before Layer Normalization, making representation difficult using low-bit fixed-point formats. To address these challenges, we propose a new PTQ method for ProteinLMs, utilizing piecewise linear quantization for asymmetric activation values to ensure accurate approximation. We demonstrated the effectiveness of our method in protein structure prediction tasks, showing that ESMFold can be accurately quantized to low-bit widths without compromising accuracy. Additionally, we applied our method to the contact prediction task, showcasing its versatility. In summary, our study introduces an innovative PTQ method for ProteinLMs, addressing specific quantization challenges and potentially leading to the development of more efficient ProteinLMs with significant implications for various protein-related applications.

Submitted 30 October, 2023; originally announced October 2023.

Comments: 8 pages, 4 figures

arXiv:2309.07581 [pdf, ps, other]

A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives

Authors: Zhengyang Lv, Mingyu Yan, Xin Liu, Mengyao Dong, Xiaochun Ye, Dongrui Fan, Ninghui Sun

Abstract: Graph-related applications have experienced significant growth in academia and industry, driven by the powerful representation capabilities of graphs. However, efficiently executing these applications faces various challenges, such as load imbalance, random memory access, etc. To address these challenges, researchers have proposed various acceleration systems, including software frameworks and hardware accelerators, all of which incorporate graph pre-processing (GPP). GPP serves as a preparatory step before the formal execution of applications, involving techniques such as sampling, reordering, etc. However, GPP execution often remains overlooked, as the primary focus is directed towards enhancing graph applications themselves. This oversight is concerning, especially considering the explosive growth of real-world graph data, where GPP becomes essential and even dominates system running overhead. Furthermore, GPP methods exhibit significant variations across devices and applications due to high customization. Unfortunately, no comprehensive work systematically summarizes GPP. To address this gap and foster a better understanding of GPP, we present a comprehensive survey dedicated to this area. We propose a double-level taxonomy of GPP, considering both algorithmic and hardware perspectives. Through listing relevant works, we illustrate our taxonomy and conduct a thorough analysis and summary of diverse GPP techniques. Lastly, we discuss challenges in GPP and potential future directions.

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2309.05630 [pdf, other]

Boundary Peeling: Outlier Detection Method Using One-Class Peeling

Authors: Sheikh Arafat, Na Sun, Maria L. Weese, Waldyn G. Martinez

Abstract: Unsupervised outlier detection constitutes a crucial phase within data analysis and remains a dynamic realm of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce One-Class Boundary Peeling, an unsupervised outlier detection algorithm. One-class Boundary Peeling uses the average signed distance from iteratively-peeled, flexible boundaries generated by one-class support vector machines. One-class Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In synthetic data simulations One-Class Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers, as compared to benchmark methods. One-Class Boundary Peeling performs competitively in terms of correct classification, AUC, and processing time using common benchmark data sets.

Submitted 20 September, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2308.11138 [pdf, ps, other]

NLP-based detection of systematic anomalies among the narratives of consumer complaints

Authors: Peiheng Gao, Ning Sun, Xuefeng Wang, Chen Yang, Ričardas Zitikis

Abstract: We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.

Submitted 26 March, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

arXiv:2305.05938 [pdf, other]

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

Authors: Haibao Yu, Wenxian Yang, Hongzhi Ruan, Zhenwei Yang, Yingjuan Tang, Xu Gao, Xin Hao, Yifeng Shi, Yifeng Pan, Ning Sun, Juan Song, Jirui Yuan, Ping Luo, Zaiqing Nie

Abstract: Utilizing infrastructure and vehicle-side information to track and forecast the behaviors of surrounding traffic participants can significantly improve decision-making and safety in autonomous driving. However, the lack of real-world sequential datasets limits research in this area. To address this issue, we introduce V2X-Seq, the first large-scale sequential V2X dataset, which includes data frames, trajectories, vector maps, and traffic lights captured from natural scenery. V2X-Seq comprises two parts: the sequential perception dataset, which includes more than 15,000 frames captured from 95 scenarios, and the trajectory forecasting dataset, which contains about 80,000 infrastructure-view scenarios, 80,000 vehicle-view scenarios, and 50,000 cooperative-view scenarios captured from 28 intersections' areas, covering 672 hours of data. Based on V2X-Seq, we introduce three new tasks for vehicle-infrastructure cooperative (VIC) autonomous driving: VIC3D Tracking, Online-VIC Forecasting, and Offline-VIC Forecasting. We also provide benchmarks for the introduced tasks. Find data, code, and more up-to-date information at https://github.com/AIR-THU/DAIR-V2X-Seq.

Submitted 10 May, 2023; originally announced May 2023.

Comments: CVPR2023

arXiv:2302.14830 [pdf, other]

Sharp thresholds in inference of planted subgraphs

Authors: Elchanan Mossel, Jonathan Niles-Weed, Youngtak Sohn, Nike Sun, Ilias Zadik

Abstract: A major question in the study of the Erdős--Rényi random graph is to understand the probability that it contains a given subgraph. This study originated in classical work of Erdős and Rényi (1960). More recent work studies this question both in building a general theory of sharp versus coarse transitions (Friedgut and Bourgain 1999; Hatami, 2012) and in results on the location of the transition (Kahn and Kalai, 2007; Talagrand, 2010; Frankston, Kahn, Narayanan, Park, 2019; Park and Pham, 2022). In inference problems, one often studies the optimal accuracy of inference as a function of the amount of noise. In a variety of sparse recovery problems, an "all-or-nothing (AoN) phenomenon" has been observed: Informally, as the amount of noise is gradually increased, at some critical threshold the inference problem undergoes a sharp jump from near-perfect recovery to near-zero accuracy (Gamarnik and Zadik, 2017; Reeves, Xu, Zadik, 2021). We can regard AoN as the natural inference analogue of the sharp threshold phenomenon in random graphs. In contrast with the general theory developed for sharp thresholds of random graph properties, the AoN phenomenon has only been studied so far in specific inference settings. In this paper we study the general problem of inferring a graph $H=H_n$ planted in an Erdős--Rényi random graph, thus naturally connecting the two lines of research mentioned above. We show that questions of AoN are closely connected to first moment thresholds, and to a generalization of the so-called Kahn--Kalai expectation threshold that scans over subgraphs of $H$ of edge density at least $q$. In a variety of settings we characterize AoN, by showing that AoN occurs if and only if this "generalized expectation threshold" is roughly constant in $q$. Our proofs combine techniques from random graph theory and Bayesian inference.

Submitted 28 February, 2023; originally announced February 2023.

Comments: 41 pages

arXiv:2212.10432 [pdf, other]

AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices

Authors: Zhen Du, Jiajia Li, Yinshan Wang, Xueqi Li, Guangming Tan, Ninghui Sun

Abstract: Sparse Matrix-Vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress the memory storage and speed up SpMV performance. We develop AlphaSparse, a superset of all existing works that goes beyond the scope of human-designed format(s) and implementation(s). AlphaSparse automatically creates novel machine-designed formats and SpMV kernel implementations entirely from the knowledge of input sparsity patterns and hardware architectures. Based on our proposed Operator Graph that expresses the path of SpMV format and kernel design, AlphaSparse consists of three main components: Designer, Format & Kernel Generator, and Search Engine. It takes an arbitrary sparse matrix as input and outputs a performant machine-designed format and SpMV implementation. By extensively evaluating 843 matrices from SuiteSparse Matrix Collection, AlphaSparse achieves significant performance improvement by 3.2× on average compared to five state-of-the-art artificial formats and 1.5× on average (up to 2.7×) over the up-to-date implementation of traditional auto-tuning philosophy.

Submitted 21 December, 2022; v1 submitted 7 November, 2022; originally announced December 2022.

arXiv:2212.06385 [pdf, other]

TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities

Authors: Zhe Zhao, Yudong Li, Cheng Hou, Jing Zhao, Rong Tian, Weijie Liu, Yiren Chen, Ningyuan Sun, Haoyan Liu, Weiquan Mao, Han Guo, Weigang Guo, Taiqiang Wu, Tao Zhu, Wenhang Shi, Chen Chen, Shan Huang, Sihong Chen, Liqun Liu, Feifei Li, Xiaoshuai Chen, Xingwu Sun, Zhanhui Kang, Xiaoyong Du, Linlin Shen, et al. (1 additional author not shown)

Abstract: Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities are showing a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is the modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.

Submitted 11 July, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

arXiv:2210.17411 [pdf, other]

Offset-Guided Attention Network for Room-Level Aware Floor Plan Segmentation

Authors: Zhangyu Wang, Ningyuan Sun

Abstract: Recognition of floor plans has been a challenging and popular task. Although many recent approaches have been proposed for this task, they typically fail to make room-level unified predictions. Specifically, multiple semantic categories can be assigned in a single room, which seriously limits their visual quality and applicability. In this paper, we propose a novel approach to recognize the floor plan layouts with a newly proposed Offset-Guided Attention mechanism to improve the semantic consistency within a room. In addition, we present a Feature Fusion Attention module that leverages the channel-wise attention to encourage the consistency of the room, wall, and door predictions, further enhancing the room-level semantic consistency. Experimental results show that our approach is able to improve the room-level semantic consistency and outperforms the existing works both qualitatively and quantitatively.

Submitted 22 October, 2022; originally announced October 2022.

Comments: Under review at IEEE Access (3 accepts and 1 reject)

arXiv:2210.07311 [pdf, other]

doi 10.1145/3578360.3580256

Linker Code Size Optimization for Native Mobile Applications

Authors: Gai Liu, Umar Farooq, Chengyan Zhao, Xia Liu, Nian Sun

Abstract: Modern mobile applications have grown rapidly in binary size, which restricts user growth and hinders updates for existing users. Thus, reducing the binary size is important for application developers. Recent studies have shown the possibility of using link-time code size optimizations by re-invoking certain compiler optimizations on the linked intermediate representation of the program. However, such methods often incur significant build time overhead and require intrusive changes to the existing build pipeline. In this paper, we propose several novel optimization techniques that do not require significant customization to the build pipeline and reduce binary size with low build time overhead. As opposed to re-invoking the compiler during link time, we perform true linker optimization directly as optimization passes within the linker. This enables more optimization opportunities such as pre-compiled libraries that prior work often could not optimize. We evaluate our techniques on several commercial iOS applications including NewsFeedApp, ShortVideoApp, and CollaborationSuiteApp, each with hundreds of millions of daily active users. Our techniques on average achieve 18.4% binary size reduction across the three commercial applications without any user-perceivable performance degradations.

Submitted 18 January, 2023; v1 submitted 13 September, 2022; originally announced October 2022.

Journal ref: In Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, 2023

arXiv:2209.11347 [pdf, ps, other]

A second moment proof of the spread lemma

Authors: Elchanan Mossel, Jonathan Niles-Weed, Nike Sun, Ilias Zadik

Abstract: This note concerns a well-known result which we term the "spread lemma," which establishes the existence (with high probability) of a desired structure in a random set. The spread lemma was central to two recent celebrated results: (a) the improved bounds of Alweiss, Lovett, Wu, and Zhang (2019) on the Erdős-Rado sunflower conjecture; and (b) the proof of the fractional Kahn--Kalai conjecture by Frankston, Kahn, Narayanan and Park (2019). While the lemma was first proved (and later refined) by delicate counting arguments, alternative proofs have also been given, via Shannon's noiseless coding theorem (Rao, 2019), and also via manipulations of Shannon entropy bounds (Tao, 2020). In this note we present a new proof of the spread lemma, that takes advantage of an explicit recasting of the proof in the language of Bayesian statistical inference. We show that from this viewpoint the proof proceeds in a straightforward and principled probabilistic manner, leading to a truncated second moment calculation which concludes the proof. The proof can also be viewed as a demonstration of the "planting trick" introduced by Achlioptas and Coja-Oghlan (2008) in the study of random constraint satisfaction problems.

Submitted 10 October, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

Comments: Corrected a mistake in the proof of Theorem 2.1. and updated the literature review

arXiv:2209.03326 [pdf, ps, other]

On the Second Kahn--Kalai Conjecture

Authors: Elchanan Mossel, Jonathan Niles-Weed, Nike Sun, Ilias Zadik

Abstract: For any given graph $H$, we are interested in $p_\mathrm{crit}(H)$, the minimal $p$ such that the Erdős-Rényi graph $G(n,p)$ contains a copy of $H$ with probability at least $1/2$. Kahn and Kalai (2007) conjectured that $p_\mathrm{crit}(H)$ is given up to a logarithmic factor by a simpler "subgraph expectation threshold" $p_\mathrm{E}(H)$, which is the minimal $p$ such that for every subgraph $H'\subseteq H$, the Erdős-Rényi graph $G(n,p)$ contains in expectation at least $1/2$ copies of $H'$. It is trivial that $p_\mathrm{E}(H) \le p_\mathrm{crit}(H)$, and the so-called "second Kahn-Kalai conjecture" states that $p_\mathrm{crit}(H) \lesssim p_\mathrm{E}(H) \log e(H)$ where $e(H)$ is the number of edges in $H$. In this article, we present a natural modification $p_\mathrm{E, new}(H)$ of the Kahn--Kalai subgraph expectation threshold, which we show is sandwiched between $p_\mathrm{E}(H)$ and $p_\mathrm{crit}(H)$. The new definition $p_\mathrm{E, new}(H)$ is based on the simple observation that if $G(n,p)$ contains a copy of $H$ and $H$ contains many copies of $H'$, then $G(n,p)$ must also contain many copies of $H'$. We then show that $p_\mathrm{crit}(H) \lesssim p_\mathrm{E, new}(H) \log e(H)$, thus proving a modification of the second Kahn--Kalai conjecture. The bound follows by a direct application of the set-theoretic "spread" property, which led to recent breakthroughs in the sunflower conjecture by Alweiss, Lovett, Wu and Zhang and the first fractional Kahn--Kalai conjecture by Frankston, Kahn, Narayanan and Park.

Submitted 7 September, 2022; originally announced September 2022.

Comments: 4 pages

arXiv:2207.06412 [pdf]

RobustAnalog: Fast Variation-Aware Analog Circuit Design Via Multi-task RL

Authors: Wei Shi, Hanrui Wang, Jiaqi Gu, Mingjie Liu, David Pan, Song Han, Nan Sun

Abstract: Analog/mixed-signal circuit design is one of the most complex and time-consuming stages in the whole chip design process. Due to various process, voltage, and temperature (PVT) variations from chip manufacturing, analog circuits inevitably suffer from performance degradation. Although there has been plenty of work on automating analog circuit design under the typical condition, limited research has been done on exploring robust designs under real and unpredictable silicon variations. Automatic analog design against variations requires prohibitive computation and time costs. To address the challenge, we present RobustAnalog, a robust circuit design framework that involves the variation information in the optimization process. Specifically, circuit optimizations under different variations are considered as a set of tasks. Similarities among tasks are leveraged and competitions are alleviated to realize a sample-efficient multi-task training. Moreover, RobustAnalog prunes the task space according to the current performance in each iteration, leading to a further simulation cost reduction. In this way, RobustAnalog can rapidly produce a set of circuit parameters that satisfies diverse constraints (e.g. gain, bandwidth, noise...) across variations. We compare RobustAnalog with Bayesian optimization, Evolutionary algorithm, and Deep Deterministic Policy Gradient (DDPG) and demonstrate that RobustAnalog can significantly reduce required optimization time by 14-30 times. Therefore, our study provides a feasible method to handle various real silicon conditions.

Submitted 13 July, 2022; originally announced July 2022.

arXiv:2203.01526 [pdf, other]

doi 10.1109/ACCESS.2022.3187211

How Do Organizations Seek Cyber Assurance? Investigations on the Adoption of the Common Criteria and Beyond

Authors: Nan Sun, Chang-Tsun Li, Hin Chan, Md Zahidul Islam, Md Rafiqul Islam, Warren Armstrong

Abstract: Cyber assurance, which is the ability to operate under the onslaught of cyber attacks and other unexpected events, is essential for organizations facing inundating security threats on a daily basis. Organizations usually employ multiple strategies to conduct risk management to achieve cyber assurance. Utilizing cybersecurity standards and certifications can provide guidance for vendors to design and manufacture secure Information and Communication Technology (ICT) products as well as provide a level of assurance of the security functionality of the products for consumers. Hence, employing security standards and certifications is an effective strategy for risk management and cyber assurance. In this work, we begin with investigating the adoption of cybersecurity standards and certifications by surveying 258 participants from organizations across various countries and sectors. Specifically, we identify adoption barriers of the Common Criteria through the designed questionnaire. Taking into account the seven identified adoption barriers, we show the recommendations for promoting cybersecurity standards and certifications. Moreover, beyond cybersecurity standards and certifications, we shed light on other risk management strategies devised by our participants, which provides directions on cybersecurity approaches for enhancing cyber assurance in organizations.

Submitted 5 March, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

arXiv:2201.07417 [pdf, other]

doi 10.1109/ACCESS.2022.3168716

Defining Security Requirements with the Common Criteria: Applications, Adoptions, and Challenges

Authors: Nan Sun, Chang-Tsun Li, Hin Chan, Ba Dung Le, MD Zahidul Islam, Leo Yu Zhang, MD Rafiqul Islam, Warren Armstrong

Abstract: Advances of emerging Information and Communications Technology (ICT) technologies push the boundaries of what is possible and open up new markets for innovative ICT products and services. The adoption of ICT products and systems with security properties depends on consumers' confidence and markets' trust in the security functionalities and whether the assurance measures applied to these products meet the inherent security requirements. Such confidence and trust are primarily gained through the rigorous development of security requirements, validation criteria, evaluation, and certification. Common Criteria for Information Technology Security Evaluation (often referred to as Common Criteria or CC) is an international standard (ISO/IEC 15408) for cyber security certification. In this paper, we conduct a systematic review of the CC standards and its adoptions. Adoption barriers of the CC are also investigated based on the analysis of current trends in security evaluation. Specifically, we share the experiences and lessons gained through the recent Development of Australian Cyber Criteria Assessment (DACCA) project that promotes the CC among stakeholders in ICT security products related to specification, development, evaluation, certification and approval, procurement, and deployment. Best practices on developing Protection Profiles, recommendations, and future directions for trusted cybersecurity advancement are presented.

Submitted 2 April, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

arXiv:2201.01446 [pdf, other]

doi 10.1145/3503221.3508425

Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms

Authors: Zhuoqiang Guo, Denghui Lu, Yujin Yan, Siyu Hu, Rongrong Liu, Guangming Tan, Ninghui Sun, Wanrun Jiang, Lijun Liu, Yixiao Chen, Linfeng Zhang, Mohan Chen, Han Wang, Weile Jia

Abstract: High-performance computing, together with a neural network model trained from data generated with first-principles methods, has greatly boosted applications of ab initio molecular dynamics in terms of spatial and temporal scales on modern supercomputers. Previous state-of-the-art can achieve $1-2$ nanoseconds molecular dynamics simulation per day for 100-million atoms on the entire Summit supercomputer. In this paper, we have significantly reduced the memory footprint and computational time by a comprehensive approach with both algorithmic and system innovations. The neural network model is compressed by model tabulation, kernel fusion, and redundancy removal. Then optimizations such as acceleration of customized kernel, tabulation of activation function, MPI+OpenMP parallelization are implemented on GPU and ARM architectures. Testing results of the copper system show that the optimized code can scale up to the entire machine of both Fugaku and Summit, and the corresponding system size can be extended by a factor of $134$ to an unprecedented $17$ billion atoms. The strong scaling of a $13.5$-million atom copper system shows that the time-to-solution can be 7 times faster, reaching $11.2$ nanoseconds per day. This work opens the door for unprecedentedly large-scale molecular dynamics simulations based on ab initio accuracy and can be potentially utilized in studying more realistic applications such as mechanical properties of metals, semiconductor devices, batteries, etc. The optimization techniques detailed in this paper also provide insight for relevant high-performance computing applications.

Submitted 4 January, 2022; originally announced January 2022.

Comments: 13 pages, 11 figures, conference: Principles and Practice of Parallel Programming 2022

arXiv:2110.00211 [pdf, other]

DNN-Opt: An RL Inspired Optimization for Analog Circuit Sizing using Deep Neural Networks

Authors: Ahmet F. Budak, Prateek Bhansali, Bo Liu, Nan Sun, David Z. Pan, Chandramouli V. Kashyap

Abstract: Analog circuit sizing takes a significant amount of manual effort in a typical design cycle. With rapidly developing technology and tight schedules, bringing automated solutions for sizing has attracted great attention. This paper presents DNN-Opt, a Reinforcement Learning (RL) inspired Deep Neural Network (DNN) based black-box optimization framework for analog circuit sizing. The key contributions of this paper are a novel sample-efficient two-stage deep learning optimization framework leveraging RL actor-critic algorithms, and a recipe to extend it to large industrial circuits using critical device identification. Our method shows 5-30x sample efficiency compared to other black-box optimization methods both on small building blocks and on large industrial circuits with better performance metrics. To the best of our knowledge, this is the first application of DNN-based circuit sizing on industrial scale circuits.

Submitted 1 October, 2021; originally announced October 2021.

Comments: Accepted to 58th Design Automation Conference (DAC 2021), 6 pages, 5 figures

arXiv:2107.02283 [pdf, other]

Clustering Structure of Microstructure Measures

Authors: Liao Zhu, Ningning Sun, Martin T. Wells

Abstract: This paper builds a clustering model of measures of market microstructure features which are popular in predicting stock returns. At a 10-second time frequency, we study the clustering structure of different measures to find out the best ones for predicting. In this way, we can predict more accurately with a limited number of predictors, which removes the noise and makes the model more interpretable.

Submitted 25 December, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

arXiv:2104.10415 [pdf, other]

Tackling Variabilities in Autonomous Driving

Authors: Yuqiong Qi, Yang Hu, Haibin Wu, Shen Li, Haiyu Mao, Xiaochun Ye, Dongrui Fan, Ninghui Sun

Abstract: The state-of-the-art driving automation system demands extreme computational resources to meet rigorous accuracy and latency requirements. Though emerging driving automation computing platforms are based on ASICs to provide better performance and power guarantees, building such an accelerator-based computing platform for driving automation still presents challenges. First, the workload mix and performance requirements exposed to the driving automation system present significant variability. Second, with more cameras/sensors integrated in a future fully autonomous driving vehicle, a heterogeneous multi-accelerator architecture substrate is needed that requires a design space exploration for a new form of parallelism. In this work, we aim to extensively explore the above system design challenges, and these challenges motivate us to propose a comprehensive framework that synergistically handles the heterogeneous hardware accelerator design principles, system design criteria, and task scheduling mechanism. Specifically, we propose a novel heterogeneous multi-core AI accelerator (HMAI) to provide the hardware substrate for the driving automation tasks with variability. We also define system design criteria to better utilize hardware resources and achieve increased throughput while satisfying the performance and energy restrictions. Finally, we propose a deep reinforcement learning (RL)-based task scheduling mechanism, FlexAI, to resolve the task mapping issue. Experimental results show that with FlexAI scheduling, essentially 100% of tasks in each driving route can be processed by HMAI within their required period to ensure safety, and FlexAI can also reduce the braking distance by up to 96% as compared to typical heuristics and guided random-search-based algorithms.

Submitted 21 April, 2021; originally announced April 2021.

arXiv:2103.12393 [pdf, other]

RISC-NN: Use RISC, NOT CISC as Neural Network Hardware Infrastructure

Authors: Taoran Xiang, Lunkai Zhang, Shuqian An, Xiaochun Ye, Mingzhe Zhang, Yanhuan Liu, Mingyu Yan, Da Wang, Hao Zhang, Wenming Li, Ninghui Sun, Dongrui Fan

Abstract: Neural Networks (NN) have been proven to be powerful tools to analyze Big Data. However, traditional CPUs cannot achieve the desired performance and/or energy efficiency for NN applications. Therefore, numerous NN accelerators have been used or designed to meet these goals. These accelerators all fall into three categories: GPGPUs, ASIC NN Accelerators and CISC NN Accelerators. Though CISC NN Accelerators can achieve a considerably smaller memory footprint than GPGPUs and thus improve energy efficiency, they still fail to provide the same level of data reuse optimization achieved by ASIC NN Accelerators because of the inherited poor programmability of their CISC architecture. We argue that, for NN Accelerators, RISC is a better design choice than CISC, as is the case with general purpose processors. We propose RISC-NN, a novel many-core RISC-based NN accelerator that achieves high expressiveness and high parallelism and features strong programmability and low control-hardware costs. We show that RISC-NN can implement all the necessary instructions of state-of-the-art CISC NN Accelerators; in the meantime, RISC-NN manages to achieve advanced optimizations such as multiple-level data reuse and support for Sparse NN applications, which previously only existed in ASIC NN Accelerators. Experiment results show that RISC-NN achieves on average 11.88X performance efficiency compared with the state-of-the-art Nvidia TITAN Xp GPGPU for various NN applications. RISC-NN also achieves on average 1.29X, 8.37X and 21.71X performance efficiency over the CISC-based TPU in CNN, MLP and LSTM applications, respectively. Finally, RISC-NN can achieve an additional 26.05% performance improvement and 33.13% energy reduction after applying pruning for Sparse NN applications.

Submitted 23 March, 2021; originally announced March 2021.

arXiv:2011.01022

Depth Ranging Performance Evaluation and Improvement for RGB-D Cameras on Field-Based High-Throughput Phenotyping Robots

Authors: Zhengqiang Fan, Na Sun, Quan Qiu, Chunjiang Zhao

Abstract: RGB-D cameras have been successfully used for indoor High-ThroughpuT Phenotyping (HTTP). However, their capability and feasibility for in-field HTTP still need to be evaluated, due to the noise and disturbances generated by unstable illumination, specular reflection, diffuse reflection, etc. To solve these problems, we evaluated the depth-ranging performance of two consumer-level RGB-D cameras (RealSense D435i and Kinect V2) under in-field HTTP scenarios, and proposed a strategy to compensate for the depth measurement error. For performance evaluation, we focused on determining their optimal ranging areas for different crop organs. Based on the evaluation results, we proposed a brightness-and-distance-based Support Vector Regression Strategy to compensate for the ranging error. Furthermore, we analyzed the depth filling rate of the two RGB-D cameras under different lighting intensities. Experimental results showed that: 1) For RealSense D435i, the effective ranging area is [0.160, 1.400] m, and the in-field filling rate is approximately 90%. 2) Kinect V2 has a high ranging accuracy in the range [0.497, 1.200] m, but its in-field filling rate is less than 24.9%. 3) Our error compensation model can effectively reduce the influences of lighting intensity and target distance. The maximum MSE and minimum R2 of this model are 0.029 and 0.867, respectively. To sum up, RealSense D435i has better ranging performance than Kinect V2 on in-field HTTP.

Submitted 27 April, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: We want to improve the work of this paper before publishing it publicly

arXiv:2006.06434 [pdf, other]

TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation

Authors: Ningyuan Sun, Xuefeng Yang, Yunfeng Liu

Abstract: Parsing natural language to corresponding SQL (NL2SQL) with data-driven approaches like deep neural networks has attracted much attention in recent years. Existing NL2SQL datasets assume that condition values should appear exactly in natural language questions and the queries are answerable given the table. However, these assumptions may fail in practical scenarios, because users may use different expressions for the same content in the table, and query information outside the table without the full picture of the contents in the table. Therefore we present TableQA, a large-scale cross-domain Natural Language to SQL dataset in the Chinese language consisting of 64,891 questions and 20,311 unique SQL queries on over 6,000 tables. Different from existing NL2SQL datasets, TableQA requires models to generalize well not only to SQL skeletons of different questions and table schemas, but also to the various expressions for condition values. Experiment results show that the state-of-the-art model with 95.1% condition value accuracy on WikiSQL only gets 46.8% condition value accuracy and 43.0% logic form accuracy on TableQA, indicating the proposed dataset is challenging and necessary to handle. Two table-aware approaches are proposed to alleviate the problem; the end-to-end approaches obtain 51.3% and 47.4% accuracy on the condition value and logic form tasks, with improvements of 4.7% and 3.4% respectively.

Submitted 9 June, 2020; originally announced June 2020.

arXiv:2005.00406 [pdf, other]

doi 10.1109/DAC18072.2020.9218757

GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Authors: Hanrui Wang, Kuan Wang, Jiacheng Yang, Linxiao Shen, Nan Sun, Hae-Seung Lee, Song Han

Abstract: Automatic transistor sizing is a challenging problem in circuit design due to the large design space, complex performance trade-offs, and fast technological advancements. Although there has been plenty of work on transistor sizing targeting a single circuit, limited research has been done on transferring the knowledge from one circuit to another to reduce the re-design overhead. In this paper, we present GCN-RL Circuit Designer, leveraging reinforcement learning (RL) to transfer the knowledge between different technology nodes and topologies. Moreover, inspired by the simple fact that a circuit is a graph, we learn on the circuit topology representation with graph convolutional neural networks (GCN). The GCN-RL agent extracts features of the topology graph, whose vertices are transistors and whose edges are wires. Our learning-based optimization consistently achieves the highest Figures of Merit (FoM) on four different circuits compared with conventional black-box optimization methods (Bayesian Optimization, Evolutionary Algorithms), random search, and human expert designs. Experiments on transfer learning between five technology nodes and two circuit topologies demonstrate that RL with transfer learning can achieve much higher FoMs than methods without knowledge transfer. Our transferable optimization method makes transistor sizing and design porting more effective and efficient.

Submitted 30 April, 2020; originally announced May 2020.

Comments: Accepted to the 57th Design Automation Conference (DAC 2020); 6 pages, 8 figures

arXiv:1805.10503 [pdf, other]

doi 10.1103/PhysRevB.98.085402

Deep Learning Topological Invariants of Band Insulators

Authors: Ning Sun, Jinmin Yi, Pengfei Zhang, Huitao Shen, Hui Zhai

Abstract: In this work we design and train deep neural networks to predict topological invariants for one-dimensional four-band insulators in AIII class whose topological invariant is the winding number, and two-dimensional two-band insulators in A class whose topological invariant is the Chern number. Given Hamiltonians in the momentum space as the input, neural networks can predict topological invariants for both classes with accuracy close to or higher than 90%, even for Hamiltonians whose invariants are beyond the training data set. Despite the complexity of the neural network, we find that the output of certain intermediate hidden layers resembles either the winding angle for models in AIII class or the solid angle (Berry curvature) for models in A class, indicating that neural networks essentially capture the mathematical formula of topological invariants. Our work demonstrates the ability of neural networks to predict topological invariants for complicated models with local Hamiltonians as the only input, and offers an example that even a deep neural network is understandable.

Submitted 9 June, 2018; v1 submitted 26 May, 2018; originally announced May 2018.

Comments: 8 pages, 5 figures

Journal ref: Phys. Rev. B 98, 085402 (2018)

arXiv:1707.00323 [pdf, other]

An improved isogeometric analysis method for trimmed geometries

Authors: Jinlan Xu, Ningning Sun, Laixin Shu, Timon Rabczuk, Gang Xu

Abstract: Trimming techniques are efficient ways to generate complex geometries in Computer-Aided Design (CAD). In this paper, an improved isogeometric analysis (IGA) method for trimmed geometries is proposed. We show that the proposed method reduces the numerical error of the physical solution by 50% for simple trimmed geometries, and the condition number of the stiffness matrix is also decreased. Furthermore, the number of integration elements and integration points involved in the solving process can be significantly reduced compared to previous approaches, drastically improving the computational efficiency for IGA problems on the trimmed geometry. Several examples are illustrated to show the effectiveness of the proposed approach.

Submitted 2 July, 2017; originally announced July 2017.

arXiv:1612.07866 [pdf, other]

Spectral algorithms for tensor completion

Authors: Andrea Montanari, Nike Sun

Abstract: In the tensor completion problem, one seeks to estimate a low-rank tensor based on a random sample of revealed entries. In terms of the required sample size, earlier work revealed a large gap between estimation with unbounded computational resources (using, for instance, tensor nuclear norm minimization) and polynomial-time algorithms. Among the latter, the best statistical guarantees have been proved, for third-order tensors, using the sixth level of the sum-of-squares (SOS) semidefinite programming hierarchy (Barak and Moitra, 2014). However, the SOS approach does not scale well to large problem instances. By contrast, spectral methods, based on unfolding or matricizing the tensor, are attractive for their low complexity, but have been believed to require a much larger sample size. This paper presents two main contributions. First, we propose a new unfolding-based method, which outperforms naive ones for symmetric $k$-th order tensors of rank $r$. For this result we make a study of singular space estimation for partially revealed matrices of large aspect ratio, which may be of independent interest. For third-order tensors, our algorithm matches the SOS method in terms of sample size (requiring about $rd^{3/2}$ revealed entries), subject to a worse rank condition ($r\ll d^{3/4}$ rather than $r\ll d^{3/2}$). We complement this result with a different spectral algorithm for third-order tensors in the overcomplete ($r\ge d$) regime. Under a random model, this second approach succeeds in estimating tensors of rank $d\le r \ll d^{3/2}$ from about $rd^{3/2}$ revealed entries.

Submitted 22 December, 2016; originally announced December 2016.

arXiv:1602.01428 [pdf]

"Draw My Topics": Find Desired Topics fast from large scale of Corpus

Authors: Jason Dou, Ni Sun, Xiaojun Zou

Abstract: We develop the "Draw My Topics" toolkit, which provides a fast way to incorporate social scientists' interest into standard topic modelling. Instead of using a raw corpus with primitive processing as input, an algorithm based on the Vector Space Model and Conditional Entropy is used to connect social scientists' willingness and unsupervised topic models' output. Space for users' adjustment on a specific corpus of their interest is also accommodated. We demonstrate the toolkit's use on the Diachronic People's Daily Corpus in Chinese.

Submitted 3 February, 2016; originally announced February 2016.

arXiv:1504.04974 [pdf, other]

Understanding Big Data Analytic Workloads on Modern Processors

Authors: Zhen Jia, Lei Wang, Jianfeng Zhan, Lixin Zhang, Chunjie Luo, Ninghui Sun

Abstract: Big data analytics applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems, in which characterizing representative workloads is a key practical problem. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we chose 11 representative data analytics workloads and characterized their micro-architectural behaviors by using hardware performance counters, so as to understand the impacts and implications of data analytics workloads on systems equipped with modern superscalar out-of-order processors. Our study reveals that big data analytics applications themselves share many inherent characteristics, which place them in a different class from traditional workloads and scale-out services. To further understand the characteristics of big data analytics workloads, we performed a correlation analysis of CPI (cycles per instruction) with other micro-architecture level characteristics and an investigation of the big data software stack's impacts on application behaviors. Our correlation analysis showed that even though big data analytics workloads exhibit notable pipeline front end stalls, the main factors affecting the CPI performance are long latency data accesses rather than the front end stalls. Our software stack investigation found that the typical big data software stack significantly contributes to the front end stalls and incurs a bigger working set. Finally, we gave several recommendations for architects, programmers and big data system designers with the knowledge acquired from this paper.

Submitted 20 April, 2015; originally announced April 2015.

Comments: arXiv admin note: substantial text overlap with arXiv:1307.8013

arXiv:1411.0650 [pdf, other]

Proof of the satisfiability conjecture for large k

Authors: Jian Ding, Allan Sly, Nike Sun

Abstract: We establish the satisfiability threshold for random $k$-SAT for all $k\ge k_0$, with $k_0$ an absolute constant. That is, there exists a limiting density $α_*(k)$ such that a random $k$-SAT formula of clause density $α$ is with high probability satisfiable for $α<α_*$, and unsatisfiable for $α>α_*$. We show that the threshold $α_*(k)$ is given explicitly by the one-step replica symmetry breaking prediction from statistical physics. The proof develops a new analytic method for moment calculations on random graphs, mapping a high-dimensional optimization problem to a more tractable problem of analyzing tree recursions. We believe that our method may apply to a range of random CSPs in the 1-RSB universality class.

Submitted 15 April, 2021; v1 submitted 3 November, 2014; originally announced November 2014.

arXiv:1310.5603 [pdf, other]

GRE: A Graph Runtime Engine for Large-Scale Distributed Graph-Parallel Applications

Authors: Jie Yan, Guangming Tan, Ninghui Sun

Abstract: Large-scale distributed graph-parallel computing is challenging. On one hand, due to the irregular computation pattern and lack of locality, it is hard to express parallelism efficiently. On the other hand, due to their scale-free nature, real-world graphs are hard to partition in balance with low cut. To address these challenges, several graph-parallel frameworks including Pregel and GraphLab (PowerGraph) have been developed recently. In this paper, we present an alternative framework, Graph Runtime Engine (GRE). While retaining the vertex-centric programming model, GRE proposes two new abstractions: 1) a Scatter-Combine computation model based on active messages to exploit massive fine-grained edge-level parallelism, and 2) an Agent-Graph data model based on vertex factorization to partition and represent directed graphs. GRE is implemented on a commercial off-the-shelf multi-core cluster. We experimentally evaluate GRE with three benchmark programs (PageRank, Single Source Shortest Path and Connected Components) on real-world and synthetic graphs of millions to billions of vertices. Compared to PowerGraph, GRE shows 2.5~17 times better performance on 8~16 machines (192 cores). Specifically, PageRank in GRE is the fastest when compared to counterparts in other frameworks (PowerGraph, Spark, Twister) reported in the public literature. Besides, GRE significantly optimizes memory usage so that it can process a large graph of 1 billion vertices and 17 billion edges on our cluster with a total of 768GB of memory, while PowerGraph can only process less than half of this graph scale.

Submitted 21 October, 2013; originally announced October 2013.

Comments: 12 pages, also submitted to PVLDB

arXiv:1208.5542 [pdf, ps, other]

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems

Authors: Huiwei Lv, Guangming Tan, Mingyu Chen, Ninghui Sun

Abstract: For the parallel breadth first search (BFS) algorithm on large-scale distributed memory systems, communication often costs significantly more than arithmetic and limits the scalability of the algorithm. In this paper we sufficiently reduce the communication cost in distributed BFS by compressing and sieving the messages. First, we leverage a bitmap compression algorithm to reduce the size of messages before communication. Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data in messages. Experiments on a 6,144-core SMP cluster show our algorithm outperforms the baseline implementation in Graph500 by 2.2 times, reduces its communication time by 79.0%, and achieves a performance rate of 12.1 GTEPS (billion edge visits per second).

Submitted 27 August, 2012; originally announced August 2012.

Comments: 10 pages, 10 figures

arXiv:1203.2602 [pdf, ps, other]

The computational hardness of counting in two-spin models on d-regular graphs

Authors: Allan Sly, Nike Sun

Abstract: The class of two-spin systems contains several important models, including random independent sets and the Ising model of statistical physics. We show that for both the hard-core (independent set) model and the anti-ferromagnetic Ising model with arbitrary external field, it is NP-hard to approximate the partition function or approximately sample from the model on d-regular graphs when the model has non-uniqueness on the d-regular tree. Together with results of Jerrum--Sinclair, Weitz, and Sinclair--Srivastava--Thurley giving FPRAS's for all other two-spin systems except at the uniqueness threshold, this gives an almost complete classification of the computational complexity of two-spin systems on bounded-degree graphs. Our proof establishes that the normalized log-partition function of any two-spin system on bipartite locally tree-like graphs converges to a limiting "free energy density" which coincides with the (non-rigorous) Bethe prediction of statistical physics. We use this result to characterize the local structure of two-spin systems on locally tree-like bipartite expander graphs, which then become the basic gadgets in a randomized reduction to approximate MAX-CUT. Our approach is novel in that it makes no use of the second moment method employed in previous works on these questions.

Submitted 12 March, 2012; originally announced March 2012.

Comments: 23 pages

arXiv:1202.6134 [pdf, ps, other]

doi 10.1109/IPDPSW.2012.213

High Volume Computing: Identifying and Characterizing Throughput Oriented Workloads in Data Centers

Authors: Jianfeng Zhan, Lixin Zhang, Ninghui Sun, Lei Wang, Zhen Jia, Chunjie Luo

Abstract: For the first time, this paper systematically identifies three categories of throughput oriented workloads in data centers: services, data processing applications, and interactive real-time applications, whose targets are to increase the volume of throughput in terms of processed requests or data, or supported maximum number of simultaneous subscribers, respectively, and we coin a new term high volume computing (in short HVC) to describe those workloads and data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also preliminarily report our ongoing work on the metrics and benchmarks for HVC systems, which is the foundation of designing innovative data center computer systems for HVC workloads.

Submitted 14 January, 2013; v1 submitted 28 February, 2012; originally announced February 2012.

Comments: 10 pages

Journal ref: Workshop on Large-Scale Parallel Processing in conjunction with 26th IEEE International Parallel and Distributed Processing Symposium, 2012, Shanghai, China

arXiv:1110.4821 [pdf, ps, other]

doi 10.1214/12-AOP828

Factor models on locally tree-like graphs

Authors: Amir Dembo, Andrea Montanari, Nike Sun

Abstract: We consider homogeneous factor models on uniformly sparse graph sequences converging locally to a (unimodular) random tree $T$, and study the existence of the free energy density $φ$, the limit of the log-partition function divided by the number of vertices $n$ as $n$ tends to infinity. We provide a new interpolation scheme and use it to prove existence of, and to explicitly compute, the quantity $φ$ subject to uniqueness of a relevant Gibbs measure for the factor model on $T$. By way of example we compute $φ$ for the independent set (or hard-core) model at low fugacity, for the ferromagnetic Ising model at all parameter values, and for the ferromagnetic Potts model with both weak enough and strong enough interactions. Even beyond uniqueness regimes our interpolation provides useful explicit bounds on $φ$. In the regimes in which we establish existence of the limit, we show that it coincides with the Bethe free energy functional evaluated at a suitable fixed point of the belief propagation (Bethe) recursions on $T$. In the special case that $T$ has a Galton-Watson law, this formula coincides with the nonrigorous "Bethe prediction" obtained by statistical physicists using the "replica" or "cavity" methods. Thus our work is a rigorous generalization of these heuristic calculations to the broader class of sparse graph sequences converging locally to trees. We also provide a variational characterization for the Bethe prediction in this general setting, which is of independent interest.

Submitted 16 December, 2013; v1 submitted 21 October, 2011; originally announced October 2011.

Comments: Published at http://dx.doi.org/10.1214/12-AOP828 in the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOP-AOP828

Journal ref: Annals of Probability 2013, Vol. 41, No. 6, 4162-4213

Showing 1–50 of 50 results for author: Sun, N