I am currently a Senior Researcher on the Microsoft DeepSpeed team, working on improving the performance and efficiency of deep learning training and inference (deepspeed.ai, github repo). I have a mixed research background spanning systems and AI, and I have worked in many areas including deep learning, similarity search, distributed caching systems, networking, and computer architecture. Regardless of the area, my approach to research is the same: identify inefficiencies through in-depth analysis, then fix them with algorithm and policy designs. In the deep learning area in particular, over the past few years on the DeepSpeed team I have worked on improving communication efficiency via compression, computation efficiency via Mixture-of-Experts (MoE) modeling, and data efficiency via curriculum learning.
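
For readers unfamiliar with DeepSpeed, below is a minimal sketch of what training with the library typically looks like; the model, data, and configuration values are illustrative placeholders rather than a recommended recipe (see deepspeed.ai for the full configuration reference).

    # Minimal, illustrative sketch of training with DeepSpeed (not a tuned recipe).
    # Requires a CUDA GPU and is typically run via the `deepspeed` launcher;
    # the model, data, and config values below are placeholders.
    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)  # stand-in for a real network

    ds_config = {
        "train_batch_size": 32,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "fp16": {"enabled": True},            # mixed-precision training
        "zero_optimization": {"stage": 1},    # ZeRO optimizer-state partitioning
    }

    # deepspeed.initialize wraps the model in an engine that manages data
    # parallelism, mixed precision, and optimizer-state partitioning.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    for step in range(10):
        x = torch.randn(32, 1024, device=engine.device, dtype=torch.half)
        loss = engine(x).float().pow(2).mean()  # dummy objective on random data
        engine.backward(loss)                   # engine-managed backward pass (loss scaling)
        engine.step()                           # optimizer step + gradient clearing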

I received my Ph.D. in Computer Science from Carnegie Mellon University in 2020, advised by Professor David G. Andersen. I received my B.S. (2013) and M.S. (2014) in Computer Science from Rice University, where I was advised by Professor Alan L. Cox and supported by the Graduate Research Fellowship.

Publications

  1. DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
  2. DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
    • Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He.
    • arXiv preprint arXiv:2308.01320. [tutorial][blog]
  3. Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers.
  4. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.
    • Teven Le Scao et al. (391 authors). As a member of the Engineering team, I contributed code and infrastructure for training BLOOM on the Jean Zay supercomputer.
    • arXiv preprint arXiv:2211.05100.
  5. DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing.
  6. DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies.
  7. Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam.
  8. 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed.
  9. The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models.
    • Conglong Li, Minjia Zhang, Yuxiong He.
    • In NeurIPS 2022. [tutorial][arxiv]
    • (This paper was previously titled “Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training” in early arXiv preprint versions.)
  10. XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient.
  11. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers.
  12. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
    • Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He.
    • In ICML 2022. [tutorial][arxiv]
  13. 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed.
    • Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He.
    • In ICML 2021. [tutorial][arxiv]
  14. Learned Adaptive Accuracy-Cost Optimization for Machine Learning Systems.
  15. Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination.
  16. Scaling Video Analytics on Constrained Edge Nodes.
    • Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, Hyeontaek Lim, David G. Andersen, Michael Kaminsky, Subramanya R. Dulloor.
    • In SysML 2019. (This conference was renamed MLSys in 2020.) [source code]
  17. Better Caching in Search Advertising Systems with Rapid Refresh Predictions.
    • Conglong Li, David G. Andersen, Qiang Fu, Sameh Elnikety, Yuxiong He.
    • In WWW 2018.
  18. Workload Analysis and Caching Strategies for Search Advertising Systems.
    • Conglong Li, David G. Andersen, Qiang Fu, Sameh Elnikety, Yuxiong He.
    • In ACM SoCC 2017.
  19. Using Indirect Routing to Recover from Network Traffic Scheduling Estimation Error.
    • Conglong Li, Matthew K. Mukerjee, David G. Andersen, Srinivasan Seshan, Michael Kaminsky, George Porter, Alex C. Snoeren.
    • In ACM/IEEE ANCS 2017.
  20. Scheduling Techniques for Hybrid Circuit/Packet Networks.
    • He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas Feltman, George Papen, Stefan Savage, Srinivasan Seshan, Geoffrey M. Voelker, David G. Andersen, Michael Kaminsky, George Porter, Alex C. Snoeren.
    • In ACM CoNEXT 2015.
  21. GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores.
  22. Reducing DRAM Row Activations with Eager Read/Write Clustering.
    • Myeongjae Jeon, Conglong Li, Alan L. Cox, Scott Rixner.
    • In ACM TACO 2013.
  23. GD-Wheel: A Cost-Aware Replacement Policy for Key-Value Stores.
    • Conglong Li, Alan L. Cox.
    • In 7th Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2013).

Last updated: 2024/02/26