Using Autogluon natively in Snowflake


Dated: Aug-2022

During the recent AWS re:MARS summit, AutoGluon was briefly introduced in the keynotes. For a beginner ML engineer like myself, AutoML seems an easier way to get started. In particular, it acknowledges that multiple ML algorithms could solve a given problem, and that choosing the right algorithm and hyper-tuning it is a daunting task.

As reflected in www.automl.org:

Automated Machine Learning provides methods and processes to make Machine Learning available for non-Machine Learning experts, to improve efficiency of Machine Learning and to accelerate research on Machine Learning.

I was surprised to see that frameworks like Keras and PyTorch already provide AutoML functionality via AutoKeras and AutoPyTorch, and many other frameworks offer the same.

Of the various AutoML frameworks, I wanted to investigate how to use AutoGluon natively in Snowflake. I chose AutoGluon because it is backed by AWS and also seems the simplest when it comes to tabular predictions.

Running natively in Snowflake

While Snowflake has proven itself as offering data-lake functionality, structured and semi-structured data make up the bulk of its usage. In modern data pipelines, it is not uncommon to see ML functionality being invoked. This functionality could include:

  • Clustering
  • Predicting specific values (Regression)
  • Classification

With regards to Snowflake, a typical pattern is to invoke an external function that infers the predictions and then stores the results in a table for further processing. While this route has worked very well and has been widely adopted, we still want to see whether we can run these inferences natively in Snowflake.

The main reason for running natively is to avoid transferring data outside of Snowflake's secure environment. It also helps the client clearly govern data access and usage.

Scenario: Room Occupancy predictions

Workplace occupancy is a widely adopted use case in almost all sectors. In a typical office building, occupancy is used to determine whether lights, air-conditioning, etc. should be running, saving on energy consumption. The same scenario exists in hospitals and industrial settings too.

Using the sample dataset from Kaggle: Smart Building System, which is essentially readings from various IoT sensors, we want to predict whether a room is occupied or not.
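To make the framing concrete, here is a minimal sketch of what such sensor data looks like as a labeled table. The column names (`temperature`, `co2`, `light`, `occupied`) are illustrative assumptions, not the exact schema of the Kaggle dataset:

```python
import pandas as pd

# Hypothetical sensor readings in the spirit of the Smart Building System
# data: one row per timestamped reading, with a binary occupancy label.
readings = pd.DataFrame({
    "ts":          pd.date_range("2022-08-01", periods=6, freq="5min"),
    "temperature": [22.1, 22.4, 23.0, 23.8, 23.5, 22.9],
    "co2":         [410, 415, 560, 730, 690, 520],
    "light":       [120, 118, 460, 510, 495, 130],
    "occupied":    [0, 0, 1, 1, 1, 0],   # label column to predict
})

features = readings.drop(columns=["ts", "occupied"])
label = readings["occupied"]
print(features.shape, int(label.sum()))  # (6, 3) 3
```

The classifier's job is to map the feature columns to the `occupied` label.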

We use AutoGluon to develop a series of ML models, derived from various algorithms, for this classification use case.

  1. An ML engineer uses the AutoGluon libraries to train and derive models.
  2. The models are packaged and stored in a stage.
  3. A Python UDF is defined, which uses the AutoGluon library and the ML models.
  4. Inference is done using the UDF.

The code for this project is available at: Github: AutoGluon_on_Snowflake

Execution

Training the model

Model training is done in the context of Snowpark, executing outside of the secure Snowflake environment; in other words, I did not implement it using stored procedures. The reason is:

  • Secured managed space: As part of training, AutoGluon tends to download various algorithms and packages, for example pytorch, fastai, catboost, etc., from the internet. Hence a network connection is needed.

Python stored procedures run in a secure environment with no network connectivity to the outside world. Hence an implementation using stored procedures is not possible.
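The training-and-packaging step can be sketched as follows. The `TabularPredictor` usage follows AutoGluon's documented tabular API; the stage name and the `session.file.put` upload line in the usage comment are illustrative assumptions about your Snowpark setup:

```python
import shutil

def train_predictor(train_df, label="occupied", out_dir="ag_models"):
    """Train an AutoGluon predictor (sketch; assumes autogluon is installed)."""
    from autogluon.tabular import TabularPredictor  # heavyweight import kept local
    return TabularPredictor(label=label, path=out_dir).fit(train_df)

def package_model_dir(model_dir: str, archive_base: str = "ag_models") -> str:
    """Zip the trained model directory so it can be PUT into a Snowflake stage."""
    return shutil.make_archive(archive_base, "zip", model_dir)

# Usage sketch (run outside Snowflake, where network access is available):
#   predictor = train_predictor(train_df)
#   archive = package_model_dir(predictor.path)
#   session.file.put(archive, "@ml_models_stage", auto_compress=False)  # hypothetical stage
```

The archive is then referenced in the UDF's IMPORTS so the models travel with the function.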

Inference

Inferring the class is pretty much like running any other UDF call.
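A minimal sketch of what the UDF handler module can look like. `TabularPredictor.load` is AutoGluon's documented way to restore a trained predictor; the caching dict, function names, and feature columns are illustrative assumptions. In a real UDF, `model_dir` would point at the files extracted from the stage import (Snowflake exposes that path via `sys._xoptions["snowflake_import_directory"]`):

```python
import pandas as pd

_cached = {}

def load_predictor(model_dir: str):
    """Load the unzipped AutoGluon predictor once per UDF process (sketch)."""
    if "predictor" not in _cached:
        from autogluon.tabular import TabularPredictor
        _cached["predictor"] = TabularPredictor.load(model_dir)
    return _cached["predictor"]

def predict_row(predictor, temperature: float, co2: float, light: float) -> int:
    """Handler body: wrap the scalar UDF arguments into a one-row frame."""
    row = pd.DataFrame([{"temperature": temperature, "co2": co2, "light": light}])
    return int(predictor.predict(row).iloc[0])
```

Caching the predictor matters because a scalar UDF handler is invoked once per row, while the model only needs to be loaded once per process.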

Limitations

Q: Can I use any of the models trained by AutoGluon?

A: Unfortunately, no. Not all models/algorithms can be used. The reason is that third-party libraries (e.g., autogluon.core-0.5.2-py3-none-any.whl) can be extracted and imported only as long as they have no native components/libraries. CatBoost and NeuralNetFastAI are examples of algorithms that cannot be used.

In the case of CatBoost, it requires a native library, '_catboost.so', which cannot be loaded. In the case of NeuralNetFastAI, it requires FastAI, which has a dependency on MatPlotLib; MatPlotLib uses a native library, hence it can't be loaded either.
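Whether a wheel carries such native components can be checked up front by scanning its contents for compiled-extension files. This helper is a hypothetical sketch, not part of AutoGluon or Snowflake:

```python
import zipfile

# File suffixes that indicate compiled extensions (Linux, Windows, macOS).
NATIVE_SUFFIXES = (".so", ".pyd", ".dylib")

def has_native_components(wheel_path: str) -> bool:
    """Return True if the wheel bundles compiled extensions, which a
    Snowflake Python UDF cannot load from an extracted import."""
    with zipfile.ZipFile(wheel_path) as wf:
        return any(name.endswith(NATIVE_SUFFIXES) for name in wf.namelist())
```

Running this against a CatBoost wheel would flag `_catboost.so`, while a pure-Python wheel like autogluon.core passes.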

There are also certain algorithms that cannot be used currently, for example NeuralNetTorch. We need the PyTorch 1.12 version used by AutoGluon, not the one from the Snowflake Anaconda channel, which is version 1.10. The PyTorch library is 750 MB+ in size, so when we extract it we run out of disk space: the temp folder, where libraries are extracted locally, is currently limited to 500 MB.

Q: What are the various models that AutoGluon currently supports?

A: Refer to Doc: autogluon.tabular.models

Venkatesh Sekar is a Senior Data Cloud Architect at Snowflake. He helps Snowflake GSI partners succeed in their client solutions and implementations of #Snowflake, the Data Cloud.
