Author:
Zhonghua Zheng (zzheng25@illinois.edu)
GitHub: zzheng93
Motivation: This notebook shows the workflow of using Dask-XGBoost to train a machine-learning emulator on CESM-LENS data hosted on AWS S3.
Task: How to predict maximum reference height temperature (TREFHTMX) based on related features?
Response (Y):
"TREFHTMX": Maximum reference height temperature over output period
Features (X):
"PRECT": Total (convective and large-scale) precipitation rate (liq + ice)
"WSPDSRFAV": Horizontal total wind speed average at the surface
"TS": Surface temperature (radiative)
"TREFHT": Reference height temperature
Prerequisite
How to create an environment for this workflow?
Recommendation:
git clone https://github.com/dask/dask-tutorial
cd dask-tutorial
conda env create -f binder/environment.yml
conda activate dask-tutorial
conda install xarray cftime=1.0.3.4
conda install -c conda-forge dask-ml dask-xgboost
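To verify the environment was created correctly, a quick import check helps (a minimal sketch; it only confirms that the key packages import and prints their versions):
import dask
import xarray
import xgboost
import dask_xgboost  # imported only to confirm it is installed
print("dask:", dask.__version__)
print("xarray:", xarray.__version__)
print("xgboost:", xgboost.__version__)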
Here we load the necessary libraries, connect to the cluster, and set up anonymous access to S3.
import numpy as np
import s3fs
import xarray as xr
import matplotlib.pyplot as plt
import dask
import dask.array as da
import dask.dataframe as dd
import dask_xgboost
from dask.distributed import Client
from dask_ml.model_selection import train_test_split
from sklearn import metrics
import xgboost
client = Client() # connect to cluster
s3 = s3fs.S3FileSystem(anon=True) # anonymous access to s3
client
Here we load the CESM-LE RCP8.5 data from AWS S3.
The data are stored in Zarr format, but Dask-XGBoost only accepts Dask Array or Dask DataFrame collections, so we use xarray to load the data and convert the result to a Dask DataFrame.
As an illustration, this tutorial only uses member_id=1 and time="2010-01-01".
%%time
zarr_ls = [xr.open_zarr(s3fs.S3Map(root="ncar-cesm-lens/atm/daily/cesmLE-RCP85-" + var + ".zarr", s3=s3))[var]
           .sel(member_id=1, time="2010-01-01")
           for var in ["TREFHT", "TREFHTMX", "PRECT", "TS", "WSPDSRFAV"]]  # read data from S3
dddf = xr.merge(zarr_ls).to_dask_dataframe()  # merge arrays and convert to a Dask DataFrame
X = dddf[["TREFHT", "PRECT", "TS", "WSPDSRFAV"]]  # features for Dask-XGBoost
Y = dddf["TREFHTMX"]  # response for Dask-XGBoost
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.15)  # split into training and testing sets
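Before training, it can be useful to peek at the lazy Dask DataFrame; this optional check (not required by the workflow) confirms the columns and partitioning:
print(dddf.columns.tolist())            # feature and response columns
print("partitions:", dddf.npartitions)  # how the data is chunked for Dask
print(dddf.head())                      # computes only the first partition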
These parameters are documented under Learning Task Parameters in the XGBoost documentation.
%%time
params = {'objective': 'reg:squarederror',
'max_depth': 6, 'eta': 0.01}
bst = dask_xgboost.train(client, params, X_train, y_train, num_boost_round=500)
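dask_xgboost.train returns a regular xgboost.Booster (which is why xgboost.plot_importance works on it below), so the standard XGBoost save/load API should apply; a minimal sketch, with an illustrative filename:
bst.save_model("trefhtmx_emulator.model")  # persist the trained booster to disk
reloaded = xgboost.Booster()
reloaded.load_model("trefhtmx_emulator.model")  # restore it later without retraining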
Generally, feature importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions within the decision trees, the higher its relative importance (reference: Feature Importance and Feature Selection With XGBoost in Python).
%matplotlib inline
ax = xgboost.plot_importance(bst)
ax.grid(False, axis="y")
ax.set_title('Estimated feature importance')
plt.show()
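To see the numbers behind the plot, the Booster exposes the raw scores directly; by default, plot_importance and get_score both use the "weight" importance type (how often a feature is used to split):
print(bst.get_score(importance_type="weight"))  # raw split counts per feature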
Here we evaluate the performance of our XGBoost emulator on the test data.
y_hat = dask_xgboost.predict(client, bst, X_test).persist()
y_test, y_hat = dask.compute(y_test, y_hat) # get the predicted data and true data
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_hat, y_test, s=0.1)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], c="black")  # 1:1 reference line
ax.set(
xlabel="Predicted",
ylabel="True",
)
plt.show()
print("rmse:",np.sqrt(metrics.mean_squared_error(y_test, y_hat)))
print("r2_score:",metrics.r2_score(y_test,y_hat))