Overview

Author: Zhonghua Zheng (zzheng25@illinois.edu)
GitHub: zzheng93

Motivation: This notebook shows the workflow of training a distributed XGBoost (Dask-XGBoost) regression model on CESM-LE data stored on AWS S3.

Task: How to predict maximum reference height temperature (TREFHTMX) based on related features?

  • Response (Y):
    "TREFHTMX": Maximum reference height temperature over output period

  • Features (X):
    "PRECT": Total (convective and large-scale) precipitation rate (liq + ice)
    "WSPDSRFAV": Horizontal total wind speed average at the surface
    "TS": Surface temperature (radiative)
    "TREFHT": Reference height temperature

Prerequisite
How to create an environment for this workflow?
Recommendation:

git clone https://github.com/dask/dask-tutorial
cd dask-tutorial
conda env create -f binder/environment.yml 
conda activate dask-tutorial
conda install xarray cftime=1.0.3.4
conda install -c conda-forge dask-ml dask-xgboost

Load libraries

Here we load the necessary libraries, connect to the Dask cluster, and set up anonymous access to S3.

In [1]:
import numpy as np
import s3fs
import xarray as xr
import matplotlib.pyplot as plt
import dask
import dask.array as da
import dask.dataframe as dd
import dask_xgboost
from dask.distributed import Client
from dask_ml.model_selection import train_test_split
from sklearn import metrics
import xgboost

client = Client() # connect to cluster
s3 = s3fs.S3FileSystem(anon=True) # anonymous access to s3
In [2]:
client
Out[2]:

Client

Cluster

  • Workers: 4
  • Cores: 16
  • Memory: 32.66 GB
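
The default Client() starts a local cluster sized from the machine's resources. If you want to control the worker layout yourself, a minimal sketch (the worker counts and memory limit below are only illustrative, not the values used in this tutorial):

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=4, memory_limit="8GB")  # explicit local cluster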

Load Data

Here we load the CESM-LE RCP8.5 data from AWS S3.
The original data are stored in Zarr format, but Dask-XGBoost only accepts Dask collections (dask.array or dask.dataframe).

So we use xarray to load the data and then convert it to a Dask DataFrame.

As an illustration, this tutorial only uses member_id=1 and time="2010-01-01".

In [3]:
%%time
zarr_ls = [xr.open_zarr(s3fs.S3Map(root="ncar-cesm-lens/atm/daily/cesmLE-RCP85-"+var+".zarr", s3=s3))[var]
           .sel(member_id=1,time="2010-01-01") 
           for var in ["TREFHT","TREFHTMX","PRECT","TS","WSPDSRFAV"]]  # read data from s3
CPU times: user 6.55 s, sys: 308 ms, total: 6.85 s
Wall time: 1min 34s
In [4]:
dddf=xr.merge(zarr_ls).to_dask_dataframe()  # merge arrays and convert to dask dataframe
In [5]:
X=dddf[["TREFHT","PRECT","TS","WSPDSRFAV"]]  # features for Dask-XGBoost
Y=dddf["TREFHTMX"]  # response for Dask-XGBoost
In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15)  # split into training and testing

Train the XGB model

These parameters are described in the XGBoost documentation under Learning Task Parameters.

  • reg:squarederror: regression with squared loss.
  • eta (learning_rate): Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
  • max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree.
  • num_boost_round: Number of boosting rounds.
In [7]:
%%time
params = {'objective': 'reg:squarederror',
          'max_depth': 6, 'eta': 0.01}

bst = dask_xgboost.train(client, params, X_train, y_train, num_boost_round=500)
CPU times: user 306 ms, sys: 57.8 ms, total: 364 ms
Wall time: 13 s
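
dask_xgboost.train returns a regular xgboost.Booster, so you can sanity-check the trained model locally. For example (a quick check, not part of the original workflow):

print(type(bst))            # <class 'xgboost.core.Booster'>
print(len(bst.get_dump()))  # number of trees; should equal num_boost_round (500)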

Determine feature importance

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance (reference: Feature Importance and Feature Selection With XGBoost in Python).

In [8]:
%matplotlib inline
ax = xgboost.plot_importance(bst)
ax.grid(False, axis="y")
ax.set_title('Estimated feature importance')
plt.show()
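
If you also want the raw numbers behind the plot, the booster exposes them through the standard xgboost Booster API; a small sketch:

scores = bst.get_score(importance_type="weight")  # {feature name: number of splits using it}
for feat, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(feat, score)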

Test model performance

Here we test the performance of our XGBoost emulator using the testing data.

In [9]:
y_hat = dask_xgboost.predict(client, bst, X_test).persist()
y_test, y_hat = dask.compute(y_test, y_hat) # get the predicted data and true data
In [10]:
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_hat,y_test,s=0.1)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], c="black")  # 1:1 line
ax.set(
    xlabel="Predicted",
    ylabel="True",
)
plt.show()
print("rmse:",np.sqrt(metrics.mean_squared_error(y_test, y_hat)))
print("r2_score:",metrics.r2_score(y_test,y_hat))
rmse: 2.5904884
r2_score: 0.98544579149015
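
Finally, if you want to reuse the emulator outside this Dask session, the trained booster can be saved and reloaded with the standard xgboost API (the file name below is just an example):

bst.save_model("trefhtmx_emulator.model")   # persist the trained booster to disk

bst_reloaded = xgboost.Booster()
bst_reloaded.load_model("trefhtmx_emulator.model")  # reload for later prediction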