Publications
Publications by category, in reverse chronological order. 1 denotes co-first author.
2026
- [NSDI] FlexLLM: Token-Level Co-Serving of LLM Inference and Fine-Tuning with SLO Guarantees. Proceedings of NSDI Conference 2026
- [EuroSys] AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding. Proceedings of EuroSys Conference 2026
2025
- [OSDI] Mirage: A Multi-Level Superoptimizer for Tensor Programs. Proceedings of OSDI Conference 2025
- [ASPLOS] Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. Proceedings of ASPLOS Conference 2025
- [SIGMOD] PQCache: Product Quantization-based KVCache for Long Context LLM Inference. Proceedings of SIGMOD Conference 2025
- [ICLR] NetMoE: Accelerating MoE Training through Dynamic Sample Placement (Spotlight). Proceedings of ICLR Conference 2025
2024
- [SOSP] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. Proceedings of SOSP Conference 2024
- [ASPLOS] SpotServe: Serving Generative Large Language Models on Preemptible Instances (Distinguished Artifact Award; IEEE Micro Top Picks Honorable Mention). Proceedings of ASPLOS Conference 2024
- [ASPLOS] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification. Proceedings of ASPLOS Conference 2024
- [ASPLOS] Optimal Kernel Orchestration for Tensor Programs with Korch. Proceedings of ASPLOS Conference 2024
2023
- [OSDI] EinNet: Optimizing Tensor Programs with Derivation-Based Transformations. Proceedings of OSDI Conference 2023