Example for CMIP6

Here, CESM2 data serves as the training data, and the ML model trained on CESM2 data is applied to CMIP6 data.

Reference:
- GitHub: https://github.com/NCAR/cesm2-le-aws
- Data/Variables Information: - https://ncar.github.io/cesm2-le-aws/model_documentation.html#data-catalog
- https://registry.opendata.aws/cmip6/
- Reproduce CESM-LENS: https://github.com/NCAR/cesm2-le-aws/blob/main/notebooks/kay_et_al_lens2.ipynb
- Finding CMIP6 data using intake-esm and plotting time series for points by Zac Flamig

Step 0: load necessary packages and define parameters (no need to change)

[1]:

%%time
# Display output of plots directly in Notebook
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import json
import intake
from flaml import AutoML
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
import util
import math
import seaborn as sns

with open("./config_cesm2_cmip6.json",'r') as load_f:
#     param = json.loads(json.load(load_f))
    param = json.load(load_f)

    model = param["model"] # cesm2
    urban_type = param["urban_type"] # md
    city_loc = param["city_loc"] # {"lat": 40.0150, "lon": -105.2705}
    l_component = param["l_component"]
    a_component = param["a_component"]
    experiment = param["experiment"]
    frequency = param["frequency"]
    cam_ls = param["cam_ls"]
    clm_ls = param["clm_ls"]
    forcing_variant = param["forcing_variant"]
    time = slice(param["time_start"],param["time_end"])
    member_id = param["member_id"]
    estimator_list = param["estimator_list"]
    time_budget = param["time_budget"]
    features = param["features"]
    label = param["label"]
    clm_var_mask = param["label"][0]
    CMIP6_url = param["CMIP6_url"]
    activity_id = param["activity_id"]
    experiment_id = param["experiment_id"]
    institution_id = param["institution_id"]
    table_id = param["table_id"]

CPU times: user 1.84 s, sys: 350 ms, total: 2.19 s
Wall time: 2.19 s

/glade/work/zhonghua/miniconda3/envs/aws_urban/lib/python3.8/site-packages/xgboost/compat.py:31: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index

Step 1: load CESM2 data

[2]:

# get a dataset
ds = util.get_data(model, city_loc, experiment, frequency, member_id, time, cam_ls, clm_ls,
                   forcing_variant=forcing_variant, urban_type=urban_type)
ds['time'] = ds.indexes['time'].to_datetimeindex()


--> The keys in the returned dictionary of datasets are constructed as follows:
        'component.experiment.frequency.forcing_variant'

100.00% [2/2 00:01<00:00]

different lat between CAM and CLM subgrid info, adjust subgrid info's lat

split into training and testing data

[3]:

mapping =  {
    "PRSN":"prsn",
    "PRECT":"pr",
    "PSL":"psl",
    "TREFHT":"tas",
    "TREFHTMN":"tasmin",
    "TREFHTMX":"tasmax"
}

# create a dataframe
df = ds.to_dataframe().reset_index().dropna()
df["PRSN"] = (df["PRECSC"] + df["PRECSL"])*1000
df["PRECT"] = (df["PRECC"] + df["PRECL"])*1000

df_cesm = df.rename(columns=mapping)

# split the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    df_cesm[features], df_cesm[label], test_size=0.1, random_state=42)
display(X_train.head())
display(y_train.head())

	pr	prsn	psl	tas	tasmax	tasmin
4253	5.120080e-05	1.210190e-18	102141.148438	297.131683	305.057770	291.910309
712	8.689989e-06	4.476514e-06	102655.820312	273.535492	277.919128	270.931824
4174	5.766846e-05	2.315565e-14	101171.445312	297.524048	305.175293	290.375336
1811	8.906457e-07	7.931703e-07	101316.953125	271.924591	277.630005	269.025482
5097	4.478946e-08	4.477884e-08	101735.164062	270.995270	275.632812	269.006317

	TREFMXAV
4253	305.300262
712	279.708527
4174	306.675201
1811	278.521729
5097	276.521973

Step 2: load CMIP6 data

[4]:

%%time
features = ['pr', 'prsn', 'psl','tas', 'tasmax', 'tasmin']

catalog = intake.open_esm_datastore('https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json')
catalog_subset = catalog.search(
    activity_id=activity_id,
    experiment_id=experiment_id,
    institution_id=institution_id,
    variable_id=features,
    table_id=table_id
)
datasets = catalog_subset.to_dataset_dict(zarr_kwargs={'consolidated': True, 'decode_times': True})
datasets


--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

100.00% [1/1 00:00<00:00]

CPU times: user 8.71 s, sys: 973 ms, total: 9.69 s
Wall time: 17 s

[4]:

{'ScenarioMIP.NOAA-GFDL.GFDL-ESM4.ssp370.day.gr1': <xarray.Dataset>
 Dimensions:    (bnds: 2, lat: 180, lon: 288, member_id: 1, time: 31390)
 Coordinates:
   * bnds       (bnds) float64 1.0 2.0
   * lat        (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
     lat_bnds   (lat, bnds) float64 dask.array<chunksize=(180, 2), meta=np.ndarray>
   * lon        (lon) float64 0.625 1.875 3.125 4.375 ... 355.6 356.9 358.1 359.4
     lon_bnds   (lon, bnds) float64 dask.array<chunksize=(288, 2), meta=np.ndarray>
   * time       (time) object 2015-01-01 12:00:00 ... 2100-12-31 12:00:00
     time_bnds  (time, bnds) object dask.array<chunksize=(15695, 2), meta=np.ndarray>
   * member_id  (member_id) <U8 'r1i1p1f1'
     height     float64 2.0
 Data variables:
     pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 592, 180, 288), meta=np.ndarray>
     prsn       (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 671, 180, 288), meta=np.ndarray>
     psl        (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 452, 180, 288), meta=np.ndarray>
     tas        (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 420, 180, 288), meta=np.ndarray>
     tasmax     (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 420, 180, 288), meta=np.ndarray>
     tasmin     (member_id, time, lat, lon) float32 dask.array<chunksize=(1, 415, 180, 288), meta=np.ndarray>
 Attributes: (12/48)
     references:              see further_info_url attribute
     realm:                   atmos
     parent_time_units:       days since 1850-1-1
     sub_experiment:          none
     forcing_index:           1
     grid_label:              gr1
     ...                      ...
     parent_variant_label:    r1i1p1f1
     license:                 CMIP6 model data produced by NOAA-GFDL is licens...
     parent_mip_era:          CMIP6
     parent_activity_id:      CMIP
     title:                   NOAA GFDL GFDL-ESM4 model output prepared for CM...
     intake_esm_dataset_key:  ScenarioMIP.NOAA-GFDL.GFDL-ESM4.ssp370.day.gr1}

[5]:

%%time
# define the dataset name in the dictionary and the "member_id"
df_cmip = datasets['ScenarioMIP.NOAA-GFDL.GFDL-ESM4.ssp370.day.gr1']\
          .sel(member_id = 'r1i1p1f1',
               time = slice(param["time_start"], param["time_end"]))\
          .sel(lat = param["city_loc"]["lat"],
               lon = util.lon_to_360(param["city_loc"]["lon"]),
               method="nearest")\
          [features].load()\
          .to_dataframe().reset_index()
df_cmip.head()

CPU times: user 20.3 s, sys: 11 s, total: 31.3 s
Wall time: 18.5 s

[5]:

	time	pr	prsn	psl	tas	tasmax	tasmin	lat	lon	member_id	height
0	2081-01-02 12:00:00	4.284881e-11	1.210498e-13	101005.351562	267.040985	274.020386	264.221344	40.5	254.375	r1i1p1f1	2.0
1	2081-01-03 12:00:00	1.998000e-06	6.424573e-07	100812.898438	272.238739	275.491943	270.209229	40.5	254.375	r1i1p1f1	2.0
2	2081-01-04 12:00:00	1.482927e-05	8.858435e-06	100455.796875	273.498535	274.920624	271.907166	40.5	254.375	r1i1p1f1	2.0
3	2081-01-05 12:00:00	6.613790e-06	5.915619e-06	101060.375000	271.265442	273.210297	269.065277	40.5	254.375	r1i1p1f1	2.0
4	2081-01-06 12:00:00	9.097347e-06	6.341099e-06	100795.867188	272.010651	276.695343	269.123260	40.5	254.375	r1i1p1f1	2.0

Step 3: compare CESM2 training and CMIP6 data

[6]:

fig = plt.figure(figsize=(12,3))
idx = 1
for var in features:
    ax = fig.add_subplot(math.ceil(math.ceil(len(features)/3)), 3, idx)
    X_train[var].plot.kde(ax=ax, label="CESM")
    df_cmip[var].plot.kde(ax=ax, label="CMIP")
    idx+=1
    ax.set_title(var)
plt.legend()
plt.tight_layout()
plt.show()

../_images/notebooks_demo_CESM2_to_CMIP6_12_0.png

Step 4: automated machine learning

train a model (emulator)

[7]:

%%time
# setup for automl
automl = AutoML()
automl_settings = {
    "time_budget": time_budget,  # in seconds
    "metric": 'r2',
    "task": 'regression',
    "estimator_list":estimator_list,
}

# train the model
automl.fit(X_train=X_train, y_train=y_train.values,
           **automl_settings, verbose=-1)
print(automl.model.estimator)

ExtraTreesRegressor(max_features=0.9591245274694429, max_leaf_nodes=426,
                    n_estimators=125, n_jobs=-1)
CPU times: user 2min 1s, sys: 2.9 s, total: 2min 4s
Wall time: 15.2 s

evaluate the model

[8]:

y_pred = automl.predict(X_test)
print("root mean square error:",
      round(mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False),3))
print("r2:",
      round(r2_score(y_true=y_test, y_pred=y_pred),3))

root mean square error: 0.636
r2: 0.997

apply and test the machine learning model

use automl.predict(X) to apply the model

[9]:

df_cmip[label] = automl.predict(df_cmip[features]).reshape(df_cmip.shape[0],-1)

Step 5: visualization

[10]:

fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(12,6))
fig.suptitle('comparison between CESM2 and cmip')
df_cesm["tasmax"].plot(label="CESM2",c="k",ax=ax1)
df_cmip["tasmax"].plot(label="cmip",c="r",ax=ax1)
ax1.set_title("near-surface air temperature (tasmax)")
ax1.set_ylabel("tasmax, K")
ax1.set_xlabel("time, day since 2081-01-02")
ax1.legend(["CESM2","CMIP6"])

df_cesm["TREFMXAV"].plot(label="CESM2",c="k",ax=ax2)
df_cmip["TREFMXAV"].plot(label="cmip",c="r",ax=ax2)
ax2.set_title("urban daily maximum of average 2-m temperature")
ax2.set_ylabel("TREFMXAV, K")
ax2.set_xlabel("time, day since 2081-01-02")
ax2.legend(["CESM2","CMIP6"])

(df_cesm["TREFMXAV"]-df_cesm["tasmax"]).plot(label="CESM2",c="k",ax=ax3)
(df_cmip["TREFMXAV"]-df_cmip["tasmax"]).plot(label="cmip",c="r",ax=ax3)
ax3.set_title("urban daily maximum of average 2-m temperature")
ax3.set_ylabel("TREFMXAV, K")
ax3.set_xlabel("time, day since 2081-01-02")
ax3.legend(["CESM2","CMIP6"])

plt.tight_layout()
plt.show()

../_images/notebooks_demo_CESM2_to_CMIP6_21_0.png

[11]:

# reference: https://stackoverflow.com/questions/53964485/seaborn-jointplot-color-by-density
x = df_cesm["tasmax"]-df_cmip["tasmax"]
y = df_cesm["TREFMXAV"]-df_cmip["TREFMXAV"]

plot = sns.jointplot(x, y, kind="kde", cmap='hot_r', n_levels=60, fill=True)
plot.ax_joint.set_xlim(-2.5,15)
plot.ax_joint.set_ylim(-2.5,15)
plot.ax_joint.plot([-2.5,15], [-2.5,15], 'b-', linewidth = 2)
plt.show()

../_images/notebooks_demo_CESM2_to_CMIP6_22_0.png

[12]:

# reference: https://stackoverflow.com/questions/53964485/seaborn-jointplot-color-by-density
x = df_cesm["TREFMXAV"]-df_cesm["tasmax"]
y = df_cmip["TREFMXAV"]-df_cmip["tasmax"]

plot = sns.jointplot(x, y, kind="kde", cmap='hot_r', n_levels=60, fill=True)
plot.ax_joint.set_xlim(-0.5,2.5)
plot.ax_joint.set_ylim(-0.5,2.5)
plot.ax_joint.set_xlabel("CESM")
plot.ax_joint.set_ylabel("CMIP")
plot.ax_joint.plot([-0.5,2.5], [-0.5,2.5], 'b-', linewidth = 2)
plt.show()

../_images/notebooks_demo_CESM2_to_CMIP6_23_0.png