School of Mathematics & Statistics

Efficient Large-Scale Optimization for Machine Learning

Mher Safaryan (Lancaster University)

Thursday 5th March 14:00-15:00
Maths 311B

Abstract

Given the scale of current machine learning models, efficiency has become as critical as accuracy. Beyond improving optimization quality, we must rethink how to reduce memory footprint and computational overhead across both training and inference. In this talk, I will present two complementary perspectives on efficiency: memory-efficient optimization for large-scale training and compute-efficient inference through knowledge distillation.

The first part introduces a memory-efficient adaptive optimizer designed for training large models. The algorithm performs adaptive updates within dynamically changing low-dimensional subspaces while still exploring the full parameter space over the course of training. This reduces the optimizer's memory footprint to a small fraction of the model size. The method relies on a novel projection-aware update rule that enables consistent transitions across subspaces by properly estimating projected gradient statistics.
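
To make this concrete, below is a minimal sketch of what such a subspace update can look like, in the spirit of low-rank adaptive optimizers. The rank, the periodic SVD refresh schedule, and the change-of-basis re-estimation of the moments are illustrative assumptions for exposition, not the exact algorithm presented in the talk.

    import numpy as np

    def subspace_adam_step(param, grad, state, rank=4, beta1=0.9, beta2=0.999,
                           lr=1e-3, eps=1e-8, refresh_every=200):
        """One adaptive step with moments kept in a rank-r subspace (sketch).

        For an (m, n) parameter matrix, optimizer state costs O(rank * n)
        instead of the O(m * n) that full Adam would require.
        """
        t = state.get("t", 0)
        if t % refresh_every == 0:
            # Refresh the projection from the current gradient's top singular
            # vectors; changing the subspace periodically is what lets the
            # optimizer explore the full parameter space over time.
            U, _, _ = np.linalg.svd(grad, full_matrices=False)
            P_new = U[:, :rank]                      # orthonormal basis, (m, rank)
            if "P" in state:
                # Re-express the old moment estimates in the new basis so the
                # transition between subspaces stays consistent (a crude
                # stand-in for the projection-aware rule in the abstract).
                R = P_new.T @ state["P"]
                state["m"] = R @ state["m"]
                state["v"] = (R ** 2) @ state["v"]
            else:
                state["m"] = np.zeros((rank, grad.shape[1]))
                state["v"] = np.zeros((rank, grad.shape[1]))
            state["P"] = P_new
        P = state["P"]
        g = P.T @ grad                               # projected gradient, (rank, n)
        state["m"] = beta1 * state["m"] + (1 - beta1) * g
        state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
        state["t"] = t + 1
        return param - lr * (P @ (state["m"] / (np.sqrt(state["v"]) + eps)))

The same state dict is threaded through successive calls; on the first call the refresh branch initializes the projection and the moment estimates.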

In the second part of the talk, I revisit knowledge distillation (KD) from an optimization viewpoint. While KD is traditionally understood as transferring information from a large “teacher” model to a smaller “student,” we show that it can be interpreted as a form of stochastic variance reduction. For linear and deep linear models, we establish that KD acts as a partial variance reduction mechanism: it reduces stochastic gradient noise without necessarily eliminating it, depending on the properties of the teacher model and the weighting of the distillation loss.
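
As a rough numerical illustration of this effect, the sketch below compares per-sample gradient variance for a linear regression student trained on noisy labels versus a lambda-weighted mixture of labels and teacher predictions. The least-squares stand-in teacher and the mixing weight lam are assumptions made for illustration, not the exact setting analyzed in the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 5000, 20, 0.7            # samples, features, distillation weight (assumed)

    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)          # noisy hard labels

    w_teacher = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares stand-in teacher
    t = X @ w_teacher                                  # teacher predictions (soft targets)

    w = np.zeros(d)                                    # student at initialization
    g_labels = (X @ w - y)[:, None] * X                # per-sample gradients, hard labels
    y_mix = (1 - lam) * y + lam * t                    # lambda-weighted distillation target
    g_kd = (X @ w - y_mix)[:, None] * X                # per-sample gradients, with KD

    total_var = lambda G: G.var(axis=0).sum()
    print(f"gradient variance, labels only: {total_var(g_labels):.2f}")
    print(f"gradient variance, with KD:     {total_var(g_kd):.2f}")   # smaller, not zero

Because the teacher is itself only an estimate, the distilled gradients come out less noisy but not noise-free; pushing lam toward 1 trades label noise for teacher error, which is one way to read the “partial” in partial variance reduction.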
