Overview

Author: Zhonghua Zheng (zzheng25@illinois.edu)
GitHub: zzheng93

Motivation: This notebook shows the workflow of training a distributed XGBoost (Dask-XGBoost) regression model on CESM-LE data stored on AWS S3.

Task: How to predict maximum reference height temperature (TREFHTMX) based on related features?

  • Response (Y):
    "TREFHTMX": Maximum reference height temperature over output period

  • Features (X):
    "PRECT": Total (convective and large-scale) precipitation rate (liq + ice)
    "WSPDSRFAV": Horizontal total wind speed average at the surface
    "TS": Surface temperature (radiative)
    "TREFHT": Reference height temperature

Prerequisite
How to create an environment for this workflow?
Recommendation:

git clone https://github.com/dask/dask-tutorial
cd dask-tutorial
conda env create -f binder/environment.yml 
conda activate dask-tutorial
conda install xarray cftime=1.0.3.4
conda install -c conda-forge dask-ml dask-xgboost

Load libraries

Here we load the necessary libraries, connect to the Dask cluster, and set up anonymous access to S3.

In [1]:
import numpy as np
import s3fs
import xarray as xr
import matplotlib.pyplot as plt
import dask
import dask.array as da
import dask.dataframe as dd
import dask_xgboost
from dask.distributed import Client
from dask_ml.model_selection import train_test_split
from sklearn import metrics
import xgboost

client = Client() # connect to cluster
s3 = s3fs.S3FileSystem(anon=True) # anonymous access to s3
In [2]:
client
Out[2]:

Client

Cluster

  • Workers: 4
  • Cores: 16
  • Memory: 32.66 GB
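
The default Client() starts a local cluster sized from the machine's resources. If you want to control the worker layout yourself, a minimal sketch (the worker counts and memory limit below are only illustrative, not the values used in this tutorial):

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=4, memory_limit="8GB")  # explicit local cluster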

Load Data

Here we load the CESM-LE RCP8.5 data from AWS S3.
The original data are stored in Zarr format, but Dask-XGBoost only accepts Dask collections (dask.array or dask.dataframe).

So we use xarray to load the data and then convert it to a Dask DataFrame.

As an illustration, this tutorial only uses member_id=1 and time="2010-01-01".

In [3]:
%%time
zarr_ls = [xr.open_zarr(s3fs.S3Map(root="ncar-cesm-lens/atm/daily/cesmLE-RCP85-"+var+".zarr", s3=s3))[var]
           .sel(member_id=1,time="2010-01-01") 
           for var in ["TREFHT","TREFHTMX","PRECT","TS","WSPDSRFAV"]]  # read data from s3
CPU times: user 6.55 s, sys: 308 ms, total: 6.85 s
Wall time: 1min 34s
In [4]:
dddf=xr.merge(zarr_ls).to_dask_dataframe()  # merge arrays and convert to dask dataframe
In [5]:
X=dddf[["TREFHT","PRECT","TS","WSPDSRFAV"]]  # features for Dask-XGBoost
Y=dddf["TREFHTMX"]  # response for Dask-XGBoost
In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15)  # split into training and testing

Train the XGB model

These parameters are described in the XGBoost documentation under Learning Task Parameters.

  • reg:squarederror: regression with squared loss.
  • eta (learning_rate): Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
  • num_boost_round: Number of boosting rounds.
In [7]:
%%time
params = {'objective': 'reg:squarederror',
          'max_depth': 6, 'eta': 0.01}

bst = dask_xgboost.train(client, params, X_train, y_train, num_boost_round=500)
CPU times: user 306 ms, sys: 57.8 ms, total: 364 ms
Wall time: 13 s
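
dask_xgboost.train returns a regular xgboost.Booster, so you can sanity-check the trained model locally. For example (a quick check, not part of the original workflow):

print(type(bst))            # <class 'xgboost.core.Booster'>
print(len(bst.get_dump()))  # number of trees; should equal num_boost_round (500)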

Determine feature importance

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance (reference: Feature Importance and Feature Selection With XGBoost in Python).

In [8]:
%matplotlib inline
ax = xgboost.plot_importance(bst)
ax.grid(False, axis="y")
ax.set_title('Estimated feature importance')
plt.show()
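
If you also want the raw numbers behind the plot, the booster exposes them through the standard xgboost Booster API; a small sketch:

scores = bst.get_score(importance_type="weight")  # {feature name: number of splits using it}
for feat, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(feat, score)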

Test model performance

Here we test the performance of our XGBoost emulator using the testing data.

In [9]:
y_hat = dask_xgboost.predict(client, bst, X_test).persist()
y_test, y_hat = dask.compute(y_test, y_hat) # get the predicted data and true data
In [10]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_hat,y_test,s=0.1)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], c="black")  # 1:1 line
ax.set(
    xlabel="Predicted",
    ylabel="True",
)
plt.show()
print("rmse:",np.sqrt(metrics.mean_squared_error(y_test, y_hat)))
print("r2_score:",metrics.r2_score(y_test,y_hat))
rmse: 2.5904884
r2_score: 0.98544579149015
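
Finally, if you want to reuse the emulator outside this Dask session, the trained booster can be saved and reloaded with the standard xgboost API (the file name below is just an example):

bst.save_model("trefhtmx_emulator.model")   # persist the trained booster to disk

bst_reloaded = xgboost.Booster()
bst_reloaded.load_model("trefhtmx_emulator.model")  # reload for later prediction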