Showing 1–50 of 321 results for author: Han, T

Searching in archive cs.
  1. arXiv:2410.11702

    cs.CV

    It's Just Another Day: Unique Video Captioning by Discriminative Prompting

    Authors: Toby Perrett, Tengda Han, Dima Damen, Andrew Zisserman

    Abstract: Long videos contain many repeating actions, events and shots. These repetitions are frequently given identical captions, which makes it difficult to retrieve the exact desired clip using a text search. In this paper, we formulate the problem of unique captioning: Given multiple clips with the same caption, we generate a new caption for each clip that uniquely identifies it. We propose Captioning b…

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: ACCV 2024 Oral. Project page: https://tobyperrett.github.io/its-just-another-day/

  2. arXiv:2410.07113

    cs.CV

    Personalized Visual Instruction Tuning

    Authors: Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang

    Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as "face blindness". Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting specific individuals. This deficiency hinders the application of MLLMs in personalized setting…

    Submitted 9 October, 2024; originally announced October 2024.

  3. arXiv:2409.19862

    cs.LG cs.CV

    Learning Multimodal Latent Generative Models with Energy-Based Prior

    Authors: Shiyu Yuan, Jiali Cui, Hanao Li, Tian Han

    Abstract: Multimodal generative models have recently gained significant attention for their ability to learn representations across various modalities, enhancing joint and cross-generation coherence. However, most existing works use standard Gaussian or Laplacian distributions as priors, which may struggle to capture the diverse information inherent in multiple data types due to their unimodal and less info…

    Submitted 29 September, 2024; originally announced September 2024.

    Comments: The 18th European Conference on Computer Vision ECCV 2024

  4. arXiv:2409.16321

    cs.AI cs.LG physics.ao-ph

    WeatherFormer: Empowering Global Numerical Weather Forecasting with Space-Time Transformer

    Authors: Junchao Gong, Tao Han, Kang Chen, Lei Bai

    Abstract: Numerical Weather Prediction (NWP) system is an infrastructure that exerts considerable impacts on modern society. Traditional NWP system, however, resolves it by solving complex partial differential equations with a huge computing cluster, resulting in tons of carbon emission. Exploring efficient and eco-friendly solutions for NWP attracts interest from Artificial Intelligence (AI) and earth scien…

    Submitted 21 September, 2024; originally announced September 2024.

  5. arXiv:2409.11365

    cs.CL

    CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

    Authors: Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li

    Abstract: The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in engaging in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). Those MLLMs are typically built based on the LLMs, with an image encoder to process images into the token embedding space of the LLMs. However, the integration of visual modality has introd…

    Submitted 9 October, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

    Comments: 10 pages, COLM-2024

  6. arXiv:2409.08712

    cs.LG cs.AI cs.CL cs.CV

    Layerwise Change of Knowledge in Neural Networks

    Authors: Xu Cheng, Lei Cheng, Zhaoran Peng, Yang Xu, Tian Han, Quanshi Zhang

    Abstract: This paper aims to explain how a deep neural network (DNN) gradually extracts new knowledge and forgets noisy features through layers in forward propagation. Up to now, although the definition of knowledge encoded by the DNN has not reached a consensus, previous studies have derived a series of mathematical evidence to take interactions as symbolic primitive inference patterns encoded by a DNN. We…

    Submitted 13 September, 2024; originally announced September 2024.

  7. arXiv:2409.07995

    cs.CV

    Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes

    Authors: Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai

    Abstract: RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interacti…

    Submitted 12 September, 2024; originally announced September 2024.

  8. Hevelius Report: Visualizing Web-Based Mobility Test Data For Clinical Decision and Learning Support

    Authors: Hongjin Lin, Tessa Han, Krzysztof Z. Gajos, Anoopum S. Gupta

    Abstract: Hevelius, a web-based computer mouse test, measures arm movement and has been shown to accurately evaluate severity for patients with Parkinson's disease and ataxias. A Hevelius session produces 32 numeric features, which may be hard to interpret, especially in time-constrained clinical settings. This work aims to support clinicians (and other stakeholders) in interpreting and connecting Hevelius…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted to the 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '24)

  9. arXiv:2409.03319

    cs.ET

    Semantic Communication for Efficient Point Cloud Transmission

    Authors: Shangzhuo Xie, Qianqian Yang, Yuyi Sun, Tianxiao Han, Zhaohui Yang, Zhiguo Shi

    Abstract: As three-dimensional acquisition technologies like LiDAR cameras advance, the need for efficient transmission of 3D point clouds is becoming increasingly important. In this paper, we present a novel semantic communication (SemCom) approach for efficient 3D point cloud transmission. Different from existing methods that rely on downsampling and feature extraction for compression, our approach utiliz…

    Submitted 5 September, 2024; originally announced September 2024.

  10. arXiv:2408.13833

    cs.CL

    Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data

    Authors: Felix J. Dorfner, Amin Dada, Felix Busch, Marcus R. Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Jacqueline Lammert, Lisa C. Adams, Keno K. Bressem

    Abstract: Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks. We evaluated their performance on clinical case challen…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: 10 pages, 3 tables, 1 figure

  11. arXiv:2408.11438

    cs.LG cs.CV physics.ao-ph

    DABench: A Benchmark Dataset for Data-Driven Weather Data Assimilation

    Authors: Wuxin Wang, Weicheng Ni, Tao Han, Lei Bai, Boheng Duan, Kaijun Ren

    Abstract: Recent advancements in deep learning (DL) have led to the development of several Large Weather Models (LWMs) that rival state-of-the-art (SOTA) numerical weather prediction (NWP) systems. Up to now, these models still rely on traditional NWP-generated analysis fields as input and are far from being an autonomous system. While researchers are exploring data-driven data assimilation (DA) models to g…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: 37 pages, 12 figures, 6 tables

  12. arXiv:2408.10467

    cs.LG cs.CV

    Learning Multimodal Latent Space with EBM Prior and MCMC Inference

    Authors: Shiyu Yuan, Carlo Lipizzi, Tian Han

    Abstract: Multimodal generative models are crucial for various applications. We propose an approach that combines an expressive energy-based model (EBM) prior with Markov Chain Monte Carlo (MCMC) inference in the latent space for multimodal generation. The EBM prior acts as an informative guide, while MCMC inference, specifically through short-run Langevin dynamics, brings the posterior distribution closer…

    Submitted 19 August, 2024; originally announced August 2024.

  13. arXiv:2408.05373

    math.DS cs.AI cs.GT cs.MA nlin.AO

    Evolutionary mechanisms that promote cooperation may not promote social welfare

    Authors: The Anh Han, Manh Hong Duong, Matjaz Perc

    Abstract: Understanding the emergence of prosocial behaviours among self-interested individuals is an important problem in many scientific disciplines. Various mechanisms have been proposed to explain the evolution of such behaviours, primarily seeking the conditions under which a given mechanism can induce highest levels of cooperation. As these mechanisms usually involve costs that alter individual payoff…

    Submitted 11 September, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: 21 pages, 5 figures

  14. arXiv:2408.02191

    cs.CV

    Dense Feature Interaction Network for Image Inpainting Localization

    Authors: Ye Yao, Tingfeng Han, Shan Jia, Siwei Lyu

    Abstract: Image inpainting, which is the task of filling in missing areas in an image, is a common image editing technique. Inpainting can be used to conceal or alter image contents in malicious manipulation of images, driving the need for research in image inpainting detection. Existing methods mostly rely on a basic encoder-decoder structure, which often results in a high number of false positives or miss…

    Submitted 4 August, 2024; originally announced August 2024.

  15. arXiv:2407.21497

    cs.CV

    Mitral Regurgitation Recogniton based on Unsupervised Out-of-Distribution Detection with Residual Diffusion Amplification

    Authors: Zhe Liu, Xiliang Zhu, Tong Han, Yuhao Huang, Jian Wang, Lian Liu, Fang Wang, Dong Ni, Zhongshan Gou, Xin Yang

    Abstract: Mitral regurgitation (MR) is a serious heart valve disease. Early and accurate diagnosis of MR via ultrasound video is critical for timely clinical decision-making and surgical intervention. However, manual MR diagnosis heavily relies on the operator's experience, which may cause misdiagnosis and inter-observer variability. Since MR data is limited and has large intra-class variability, we propose…

    Submitted 17 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by MICCAI MLMI 2024, 11 pages, 3 figures

  16. arXiv:2407.19468

    cs.CV cs.MM

    MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability

    Authors: Buyu Liu, Kai Wang, Yansong Liu, Jun Bao, Tingting Han, Jun Yu

    Abstract: This work aims to address the multi-view perspective RGB generation from text prompts given Bird-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, lack the ability to handle detailed text prompts, or are incapable of generalizing to unseen view points, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, a…

    Submitted 28 July, 2024; originally announced July 2024.

    Comments: Accepted by ACM MM24

  17. arXiv:2407.15850

    cs.CV

    AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

    Authors: Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

    Abstract: Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly promp…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Project Page: https://www.robots.ox.ac.uk/~vgg/research/autoad-zero/

  18. arXiv:2407.13268

    cs.AI cs.LG

    Mixture of Experts based Multi-task Supervise Learning from Crowds

    Authors: Tao Han, Huaixuan Shi, Xinyi Ding, Xiao Ma, Huamao Gu, Yili Fang

    Abstract: Existing truth inference methods in crowdsourcing aim to map redundant labels and items to the ground truth. They treat the ground truth as hidden variables and use statistical or deep learning-based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise…

    Submitted 18 July, 2024; originally announced July 2024.

  19. arXiv:2407.11194

    astro-ph.IM astro-ph.EP astro-ph.GA astro-ph.SR cs.AI cs.CL

    AstroMLab 1: Who Wins Astronomy Jeopardy!?

    Authors: Yuan-Sen Ting, Tuan Dung Nguyen, Tirthankar Ghosal, Rui Pan, Hardik Arora, Zechang Sun, Tijmen de Haan, Nesar Ramachandra, Azton Wells, Sandeep Madireddy, Alberto Accomazzi

    Abstract: We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and asse…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 45 pages, 12 figures, 7 tables. Submitted to ApJ. Comments welcome. AstroMLab homepage: https://astromlab.org/

  20. arXiv:2407.04916

    cs.CV

    Completed Feature Disentanglement Learning for Multimodal MRIs Analysis

    Authors: Tianling Liu, Hongying Liu, Fanhua Shang, Lequan Yu, Tong Han, Liang Wan

    Abstract: Multimodal MRIs play a crucial role in clinical diagnosis and treatment. Feature disentanglement (FD)-based methods, aiming at learning superior feature representations for multimodal data analysis, have achieved significant success in multimodal learning (MML). Typically, existing FD-based methods separate multimodal data into modality-shared and modality-specific features, and employ concatenati…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Submitted to IEEE JBHI in April 2024

  21. arXiv:2407.04675

    eess.AS cs.SD

    Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

    Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

    Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor…

    Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  22. arXiv:2407.04619

    cs.CV

    CountGD: Multi-Modal Open-World Counting

    Authors: Niki Amini-Naieni, Tengda Han, Andrew Zisserman

    Abstract: The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve the generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and also extend its capabilities by introducing modules to enable specifying the target object to count by visual exemplars. In turn, these new capabilities - being…

    Submitted 5 July, 2024; originally announced July 2024.

  23. RISC-V R-Extension: Advancing Efficiency with Rented-Pipeline for Edge DNN Processing

    Authors: Won Hyeok Kim, Hyeong Jin Kim, Tae Hee Han

    Abstract: The proliferation of edge devices necessitates efficient computational architectures for lightweight tasks, particularly deep neural network (DNN) inference. Traditional NPUs, though effective for such operations, face challenges in power, cost, and area when integrated into lightweight edge devices. The RISC-V architecture, known for its modularity and open-source nature, offers a viable alternat…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 6 pages, 6 figures, ICAIIC 2024

  24. arXiv:2406.17272

    cs.LG

    A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

    Authors: Van Tung Pham, Yist Lin, Tao Han, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

    Abstract: Recent works have shown promising results in connecting speech encoders to large language models (LLMs) for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. We…

    Submitted 25 June, 2024; originally announced June 2024.

  25. arXiv:2406.16983

    eess.IV cs.AI cs.LG

    On Instabilities of Unsupervised Denoising Diffusion Models in Magnetic Resonance Imaging Reconstruction

    Authors: Tianyu Han, Sven Nebelung, Firas Khader, Jakob Nikolas Kather, Daniel Truhn

    Abstract: Denoising diffusion models offer a promising approach to accelerating magnetic resonance imaging (MRI) and producing diagnostic-level images in an unsupervised manner. However, our study demonstrates that even tiny worst-case potential perturbations transferred from a surrogate model can cause these models to generate fake tissue structures that may mislead clinicians. The transferability of such…

    Submitted 23 June, 2024; originally announced June 2024.

  26. arXiv:2406.14399

    cs.LG cs.CV physics.ao-ph stat.ML

    How far are today's time-series models from real-world weather forecasting applications?

    Authors: Tao Han, Song Guo, Zhenghao Chen, Wanghan Xu, Lei Bai

    Abstract: The development of Time-Series Forecasting (TSF) techniques is often hindered by the lack of comprehensive datasets. This is particularly problematic for time-series weather forecasting, where commonly used datasets suffer from significant limitations such as small size, limited temporal coverage, and sparse spatial distribution. These constraints severely impede the optimization and evaluation of…

    Submitted 11 October, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 29 pages, 14 figures

  27. arXiv:2406.11654

    cs.CL

    Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

    Authors: Vernon Toh Yan Han, Rishabh Bhardwaj, Soujanya Poria

    Abstract: We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory cache as its third dimension. The memory dimension provides cues to the mutator to yield better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms of quality diversity…

    Submitted 17 June, 2024; originally announced June 2024.

  28. arXiv:2406.10655

    cs.CR

    E-SAGE: Explainability-based Defense Against Backdoor Attacks on Graph Neural Networks

    Authors: Dingqiang Yuan, Xiaohua Xu, Lei Yu, Tongchang Han, Rongchang Li, Meng Han

    Abstract: Graph Neural Networks (GNNs) have recently been widely adopted in multiple domains. Yet, they are notably vulnerable to adversarial and backdoor attacks. In particular, backdoor attacks based on subgraph insertion have been shown to be effective in graph classification tasks while being stealthy, successfully circumventing various existing defense methods. In this paper, we propose E-SAGE, a novel…

    Submitted 15 June, 2024; originally announced June 2024.

  29. arXiv:2406.03508

    cs.LG cs.AI cs.CR

    Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders

    Authors: Tingxu Han, Weisong Sun, Ziqi Ding, Chunrong Fang, Hanwei Qian, Jiaxun Li, Zhenyu Chen, Xiangyu Zhang

    Abstract: Self-supervised learning (SSL) is increasingly attractive for pre-training encoders without requiring labeled data. Downstream tasks built on top of those pre-trained encoders can achieve nearly state-of-the-art performance. The pre-trained encoders by SSL, however, are vulnerable to backdoor attacks as demonstrated by existing studies. Numerous backdoor mitigation techniques are designed for down…

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  30. arXiv:2406.01645

    cs.LG cs.AI

    FNP: Fourier Neural Processes for Arbitrary-Resolution Data Assimilation

    Authors: Kun Chen, Tao Chen, Peng Ye, Hao Chen, Kang Chen, Tao Han, Wanli Ouyang, Lei Bai

    Abstract: Data assimilation is a vital component in modern global medium-range weather forecasting systems to obtain the best estimation of the atmospheric state by combining the short-term forecast and observations. Recently, AI-based data assimilation approaches have attracted increasing attention for their significant advantages over traditional techniques in terms of computational consumption. However,…

    Submitted 3 June, 2024; originally announced June 2024.

  31. arXiv:2406.01314

    cs.CV cs.AI

    Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization

    Authors: Firas Khader, Omar S. M. El Nahhas, Tianyu Han, Gustav Müller-Franzes, Sven Nebelung, Jakob Nikolas Kather, Daniel Truhn

    Abstract: The Transformer model has been pivotal in advancing fields such as natural language processing, speech recognition, and computer vision. However, a critical limitation of this model is its quadratic computational and memory complexity relative to the sequence length, which constrains its application to longer sequences. This is especially crucial in medical imaging where high-resolution images can…

    Submitted 3 June, 2024; originally announced June 2024.

  32. arXiv:2405.16487

    cs.RO

    Dynamics Models in the Aggressive Off-Road Driving Regime

    Authors: Tyler Han, Sidharth Talia, Rohan Panicker, Preet Shah, Neel Jawale, Byron Boots

    Abstract: Current developments in autonomous off-road driving are steadily increasing performance through higher speeds and more challenging, unstructured environments. However, this operating regime subjects the vehicle to larger inertial effects, where consideration of higher-order states is necessary to avoid failures such as rollovers or excessive impact forces. Aggressive driving through Model Predicti…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: Accepted to ICRA 2024 Workshop on Resilient Off-road Autonomy

  33. arXiv:2405.14672

    cs.CV

    Towards Imperceptible Backdoor Attack in Self-supervised Learning

    Authors: Hanrong Zhang, Zhenting Wang, Tingxu Han, Mingyu Jin, Chenlu Zhan, Mengnan Du, Hongwei Wang, Shiqing Ma

    Abstract: Self-supervised learning models are vulnerable to backdoor attacks. Existing backdoor attacks that are effective in self-supervised learning often involve noticeable triggers, like colored patches, which are vulnerable to human inspection. In this paper, we propose an imperceptible and effective backdoor attack against self-supervised models. We first find that existing imperceptible triggers desi…

    Submitted 23 May, 2024; originally announced May 2024.

  34. arXiv:2405.13910

    cs.LG cs.CV stat.ML

    Learning Latent Space Hierarchical EBM Diffusion Models

    Authors: Jiali Cui, Tian Han

    Abstract: This work studies the learning problem of the energy-based prior model and the multi-layer generator model. The multi-layer generator model, which contains multiple layers of latent variables organized in a top-down hierarchical structure, typically assumes the Gaussian prior model. Such a prior model can be limited in modelling expressivity, which results in a gap between the generator posterior…

    Submitted 27 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  35. arXiv:2405.13796

    cs.LG cs.AI

    Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling

    Authors: Wanghan Xu, Fenghua Ling, Wenlong Zhang, Tao Han, Hao Chen, Wanli Ouyang, Lei Bai

    Abstract: Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mapping rather than fine-grained physical evolution in the time dimension. Consequently, the limitations in the temporal scale of datasets preven…

    Submitted 29 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  36. arXiv:2405.06246

    cs.CV

    Comparative Analysis of Advanced Feature Matching Algorithms in Challenging High Spatial Resolution Optical Satellite Stereo Scenarios

    Authors: Qiyan Luo, Jidan Zhang, Yuzhen Xie, Xu Huang, Ting Han

    Abstract: Feature matching determines the orientation accuracy for the High Spatial Resolution (HSR) optical satellite stereos, subsequently impacting several significant applications such as 3D reconstruction and change detection. However, the matching of off-track HSR optical satellite stereos often encounters challenging conditions including wide-baseline observation, significant radiometric differences,…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: The manuscript is accepted as Oral Presentation in IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2024)

  37. arXiv:2405.03376

    cs.LG cs.CV

    CRA5: Extreme Compression of ERA5 for Portable Global Climate and Weather Research via an Efficient Variational Transformer

    Authors: Tao Han, Zhenghao Chen, Song Guo, Wanghan Xu, Lei Bai

    Abstract: The advent of data-driven weather forecasting models, which learn from hundreds of terabytes (TB) of reanalysis data, has significantly advanced forecasting capabilities. However, the substantial costs associated with data storage and transmission present a major challenge for data providers and users, affecting resource-constrained researchers and limiting their accessibility to participate in AI…

    Submitted 7 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: Main text and supplementary, 22 pages, 13 figures

  38. arXiv:2405.00945

    cs.IT eess.SP

    Can FSK Be Optimised for Integrated Sensing and Communications?

    Authors: Tian Han, Peter J Smith, Urbashi Mitra, Jamie S Evans, Rajitha Senanayake

    Abstract: Motivated by the ideal peak-to-average-power ratio and radar sensing capability of traditional frequency-coded radar waveforms, this paper considers the frequency shift keying (FSK) based waveform for joint communications and radar (JCR). An analysis of the probability distributions of its ambiguity function (AF) sidelobe levels (SLs) and peak sidelobe level (PSL) is conducted to study the radar s…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: Submitted to IEEE Transactions on Wireless Communications, 13 pages, 6 figures

  39. arXiv:2404.14412

    cs.CV

    AutoAD III: The Prequel -- Back to the Pixels

    Authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

    Abstract: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three c…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: CVPR2024. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad/

  40. arXiv:2404.09515

    cs.CV

    Revealing the structure-property relationships of copper alloys with FAGC

    Authors: Yuexing Han, Guanxin Wan, Tao Han, Bing Wang, Yi Liu

    Abstract: Understanding how the structure of materials affects their properties is a cornerstone of materials science and engineering. However, traditional methods have struggled to accurately describe the quantitative structure-property relationships for complex structures. In our study, we bridge this gap by leveraging machine learning to analyze images of materials' microstructures, thus offering a novel…

    Submitted 18 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  41. arXiv:2404.09140

    cs.LG cs.IT eess.SP

    RF-Diffusion: Radio Signal Generation via Time-Frequency Diffusion

    Authors: Guoxuan Chi, Zheng Yang, Chenshu Wu, Jingao Xu, Yuchong Gao, Yunhao Liu, Tony Xiao Han

    Abstract: As AIGC shines in CV and NLP, its potential in the wireless domain has also emerged in recent years. Yet, existing RF-oriented generative solutions are ill-suited for generating high-quality, time-series RF data due to limited representation capabilities. In this work, inspired by the stellar achievements of the diffusion model in CV and NLP, we adapt it to the RF domain and propose RF-Dif…

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: Accepted by MobiCom 2024

    ACM Class: I.2.0

  42. Design and Optimization of Cooperative Sensing With Limited Backhaul Capacity

    Authors: Wenrui Li, Min Li, An Liu, Tony Xiao Han

    Abstract: This paper introduces a cooperative sensing framework designed for integrated sensing and communication cellular networks. The framework comprises one base station (BS) functioning as the sensing transmitter, while several nearby BSs act as sensing receivers. The primary objective is to facilitate cooperative target localization by enabling each receiver to share specific information with a fusion…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: This paper has been published in 2023 IEEE 98th Vehicular Technology Conference (VTC2023-Fall)

  43. arXiv:2404.01079

    cs.CV

    Stale Diffusion: Hyper-realistic 5D Movie Generation Using Old-school Methods

    Authors: Joao F. Henriques, Dylan Campbell, Tengda Han

    Abstract: Two years ago, Stable Diffusion achieved super-human performance at generating images with super-human numbers of fingers. Following the steady decline of its technical novelty, we propose Stale Diffusion, a method that solidifies and ossifies Stable Diffusion in a maximum-entropy state. Stable Diffusion works analogously to a barn (the Stable) from which an infinite set of horses have escaped (th…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: SIGBOVIK 2024

  44. arXiv:2403.13315

    cs.CV

    PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

    Authors: Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, Soujanya Poria

    Abstract: Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of 2000 puzzle instances based on abstract…

    Submitted 17 August, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: ACL 2024 Camera Ready

  45. arXiv:2403.09510

    cs.AI cs.CY cs.GT cs.MA math.DS

    Trust AI Regulation? Discerning users are vital to build trust and effective AI regulation

    Authors: Zainab Alalawi, Paolo Bova, Theodor Cimpeanu, Alessandro Di Stefano, Manh Hong Duong, Elias Fernandez Domingos, The Anh Han, Marcus Krellner, Bianca Ogbo, Simon T. Powers, Filippo Zimmaro

    Abstract: There is general agreement that some form of regulation is necessary both for AI creators to be incentivised to develop trustworthy systems, and for users to actually trust those systems. But there is much debate about what form these regulations should take and how they should be implemented. Most work in this area has been qualitative, and has not been able to make formal predictions. Here, we p…

    Submitted 14 March, 2024; originally announced March 2024.

  46. arXiv:2403.08730

    cs.CL cs.CV

    Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

    Authors: Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, Tong Zhang

    Abstract: Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs. However, they often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information. We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input. To mitigate this issue, we pro…

    Submitted 3 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  47. arXiv:2403.08506

    cs.LG cs.AI cs.CV

    DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning

    Authors: Sikai Bai, Jie Zhang, Shuaicheng Li, Song Guo, Jingcai Guo, Jun Hou, Tao Han, Xiaocheng Lu

    Abstract: Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the numb…

    Submitted 11 March, 2024; originally announced March 2024.

    Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  48. arXiv:2403.06328

    cs.LG

    Transferable Reinforcement Learning via Generalized Occupancy Models

    Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta

    Abstract: Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features…

    Submitted 28 May, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

  49. arXiv:2403.03864

    cs.CV cs.AI

    Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning

    Authors: Deepanway Ghosal, Vernon Toh Yan Han, Chia Yew Ken, Soujanya Poria

    Abstract: This paper introduces the novel task of multimodal puzzle solving, framed within the context of visual question-answering. We present a new dataset, AlgoPuzzleVQA designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate both visual understanding, language understanding, and complex algorithmic reasoning. We create the puzzles…

    Submitted 12 March, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

  50. arXiv:2403.03846

    cs.LG

    On the Effectiveness of Distillation in Mitigating Backdoors in Pre-trained Encoder

    Authors: Tingxu Han, Shenghan Huang, Ziqi Ding, Weisong Sun, Yebo Feng, Chunrong Fang, Jun Li, Hanwei Qian, Cong Wu, Quanjun Zhang, Yang Liu, Zhenyu Chen

    Abstract: In this paper, we study a defense against poisoned encoders in SSL called distillation, which is a defense used in supervised learning originally. Distillation aims to distill knowledge from a given model (a.k.a the teacher net) and transfer it to another (a.k.a the student net). Now, we use it to distill benign knowledge from poisoned pre-trained encoders and transfer it to a new encoder, resulti…

    Submitted 6 March, 2024; originally announced March 2024.