Just the Gist: Parsing GeoTiffs natively in Snowflake

Dated: Feb-2024

Source: https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2

When it comes to remote sensing, GeoTiff is a common file format for the data generated from satellite scans. Raster/GeoTiff files have been used in a wide range of public and commercial use cases.

Depending on the satellite and the sensors onboard, the generated GeoTiff can contain one or multiple bands of data. Each of these bands offers different insights, and calculations across bands can derive further ones. A very common scenario is calculating the NDVI (Normalized Difference Vegetation Index) from the red and near-infrared (NIR) bands: NDVI = (NIR − Red) / (NIR + Red).
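As an illustration, here is a minimal NDVI sketch with rasterio and NumPy. The file name and band indexes are placeholders; check your product's band layout (for Sentinel-2, for example, red is band B04 and NIR is band B08).

```python
import numpy as np
import rasterio

# Band indexes are an assumption -- adjust to your sensor's band layout.
with rasterio.open("scene.tif") as src:  # placeholder file name
    red = src.read(4).astype("float32")
    nir = src.read(8).astype("float32")

# NDVI = (NIR - Red) / (NIR + Red); guard against division by zero.
ndvi = np.where((nir + red) == 0, 0, (nir - red) / (nir + red))
print(ndvi.min(), ndvi.max())  # NDVI values fall in [-1, 1]
```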

The generated GeoTiff typically also contains metadata such as the coordinate reference system (CRS) of its bands. The raster records are often held as NumPy arrays, typically via xarray/NetCDF formats, which are highly compressed. When uncompressed and converted to a DataFrame, the raster data series will often run into millions of records.
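A quick sketch of what that looks like in practice, assuming a local file named scene.tif: rasterio exposes the CRS and dimensions as metadata, and flattening a single band into one-row-per-pixel form shows how fast the record count grows.

```python
import numpy as np
import pandas as pd
import rasterio

with rasterio.open("scene.tif") as src:  # placeholder file name
    print(src.crs, src.count, src.width, src.height)  # CRS, band count, dimensions
    band = src.read(1)

# One row per pixel: a 10,000 x 10,000 scene is already 100 million rows.
rows, cols = band.shape
df = pd.DataFrame({
    "row": np.repeat(np.arange(rows), cols),
    "col": np.tile(np.arange(cols), rows),
    "value": band.ravel(),
})
print(len(df))
```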

Dask and other distributed computing frameworks are typically used to parse and process these files. With these distributed platforms comes the hurdle of maintaining and securing a separate infrastructure, and the derived data then needs to be exported and imported into a data lake for broader use.

Snowflake, as the Data Cloud platform, has been a favorite amongst our customers, as it reduces the pain of managing cloud-based infrastructure and offers seamless scaling to match use-case needs. With Snowpark Python and its ability to import third-party libraries, customers have asked us whether GeoTiff files could be processed natively in Snowflake.

Over the past year, I have had the opportunity to present the possibility of parsing GeoTiff files to many of our customers. One customer even took the code, prototyped with it, and enhanced it further. So I am finally taking the time to share the gist of how it is possible.

Gist

Link: geotiff_snowflake_native_public.ipynb
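The notebook carries the full details. As a rough sketch of the overall flow, and not the notebook's actual code (the stage path, file handling, and table name below are placeholders), a Snowpark Python stored procedure can pull the GeoTiff from a stage, parse it with rasterio, and land the pixels in a table:

```python
import numpy as np
import pandas as pd
import rasterio
from snowflake.snowpark import Session

def parse_geotiff(session: Session, stage_path: str, target_table: str) -> str:
    # Pull the GeoTiff from the stage into local scratch space.
    local_dir = "/tmp"
    session.file.get(stage_path, local_dir)
    local_file = f"{local_dir}/{stage_path.split('/')[-1]}"

    with rasterio.open(local_file) as src:
        band = src.read(1).astype("float32")

    # Flatten the band to one row per pixel.
    rows, cols = band.shape
    df = pd.DataFrame({
        "ROW_ID": np.repeat(np.arange(rows), cols),
        "COL_ID": np.tile(np.arange(cols), rows),
        "VALUE": band.ravel(),
    })

    # Land the flattened raster in a Snowflake table.
    session.write_pandas(df, target_table, auto_create_table=True)
    return f"Loaded {len(df)} pixels into {target_table}"

# To run inside Snowflake, register it with the required packages, e.g.:
# session.sproc.register(parse_geotiff, packages=["rasterio", "pandas", "numpy"])
```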

Learnings and Limitations

  • The original prototype was done using rioxarray, as it offered the capability to transform the raster into a DataFrame. I have since learned that the same is possible via rasterio, which is present in the Snowpark Anaconda channel.
  • As noted earlier, GeoTiffs vary in size and number of records. The above gist was done using an X-Small warehouse, which may not work in all scenarios; sometimes you will need to increase the warehouse size, especially for multi-band raster files, and possibly scale up to a Snowpark-optimized warehouse.
  • rasterio also offers a window-based chunking strategy for processing large files, which can be adopted when you do not want to scale to larger warehouse instances (a minimal sketch appears after this list). I leave the option of trying these out to the reader. Refer to the links below:

https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html

https://stackoverflow.com/questions/54501232/iteratively-load-image-block-by-block-where-blocks-are-partially-overlapped

  • The raster will contain millions of data points, and not all of them are worthwhile to extract. In such cases, it is better to define an area of interest (AOI) and filter to only the data points within it (see the clipping sketch after this list). This also reduces the dataset you ingest, saving compute and storage costs.

https://rasterio.readthedocs.io/en/stable/cli.html#clip

  • You could also check out our partner Carto.
  • Once the raster data has landed in Snowflake, it opens up further geospatial use cases based on your business needs.
  • With Snowflake Data Sharing and a data-mesh approach, it would be wiser to parse the GeoTiff files once and share the results with other lines of business.
  • A golden opportunity to parse, process, and share data via the Snowflake Marketplace, and to monetize it, too.
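As referenced in the windowed-read bullet above, here is a minimal sketch of chunked processing with rasterio's block windows (the file name is a placeholder; see the rasterio documentation linked above for the full API):

```python
import rasterio

# Process a large GeoTiff chunk by chunk with windowed reads, keeping
# memory flat instead of loading the whole raster at once.
with rasterio.open("scene.tif") as src:  # placeholder file name
    for _, window in src.block_windows(1):
        chunk = src.read(1, window=window)
        # Filter / flatten / append each chunk here instead of in one pass.
        print(window, chunk.shape)
```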
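And a sketch of clipping to an area of interest with rasterio.mask; the polygon coordinates are made up for illustration and must be expressed in the raster's CRS:

```python
import rasterio
from rasterio.mask import mask

# A made-up AOI polygon (GeoJSON-like); use your real area of interest,
# in the same CRS as the raster.
aoi = [{
    "type": "Polygon",
    "coordinates": [[
        [11.1, 48.0], [11.9, 48.0], [11.9, 48.4], [11.1, 48.4], [11.1, 48.0],
    ]],
}]

with rasterio.open("scene.tif") as src:  # placeholder file name
    clipped, transform = mask(src, aoi, crop=True)

print(clipped.shape)  # only the pixels inside the AOI remain
```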

For now: Get Inspired -> Learn -> Develop -> Share -> ☺️
