{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import Libraries\n",
    "Since a lot of the functions and classes we need are not part of Python's core we need to import them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some basic libraries\n",
    "import pandas as pd # for data manipulation\n",
    "import numpy as np # for array operations\n",
    "from matplotlib import pyplot as plt # to plot stuff\n",
    "\n",
    "# ML and statistics\n",
    "from sklearn.linear_model import LinearRegression, HuberRegressor, Lasso, Ridge, LassoCV, RidgeCV\n",
    "import sklearn.linear_model\n",
    "from sklearn.metrics import mean_squared_error as MSE\n",
    "from sklearn.metrics import mean_absolute_error as MAE\n",
    "from sklearn.metrics import r2_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Statistics\n",
    "import statsmodels.api as sm\n",
    "import scipy.stats as stats\n",
    "\n",
    "# Some helper libraries\n",
    "from scipy.special import binom\n",
    "from itertools import combinations"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datasets\n",
    "At some point we have to separate training data from test data, since during training the model should not know any information about the test set. In the ideal case you split the data at the beginning into a test and a train set before you even look at it. Then you put the test set into a safe, lock it and send the key to a relative who lives far away 😉. This way you don't run into the problem of data leakage. However, this approach is rather unpractical because you need to do the same data cleaning and preprocessing on both sets.\n",
    "\n",
    "In the following we will process the data as a whole, but we only use \"legal\" operations, in the context of data leakage. We will use no specific validation set, which you are probably familiar with from deep learning. Instead we use cross-validation to tune our hyperparameters."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Loading\n",
    "In a first step we need to load the data and quickly examine it. Depending on the format of the data we must use different load commands. In reality the data can come from multiple sources which means we have to stitch the data blocks together reasonably. However in this example this has already been done and we can use the `data_raw.csv` directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_name = \"Gewicht\"\n",
    "data_raw = pd.read_csv(\"Data/data_raw.csv\") # to have a save point\n",
    "data = data_raw.copy() # To compare to the original if needed\n",
    "data.head() # displays the first five entries"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Cleaning / Data Analysis\n",
    "As we can see the data contains some NaN-values (Not a Number). In general, also non numeric data can be useful to gain insight about the process. However in this case we do not see any specific information, just NaN. It follows that we can omit these columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "col_set = set(data.columns)\n",
    "data.dropna(axis=1, how=\"all\", inplace=True)\n",
    "print(f\"removed following features: {col_set-set(data.columns)}\")\n",
    "\n",
    "ind_set = set(data.index)\n",
    "data.dropna(axis=0, how=\"any\", inplace=True)\n",
    "print(f\"removed following indices: {ind_set-set(data.index)}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Why are two different commands needed to drop the NaN values?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the basic cleaning we would like to gain some insights about the data's statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = data.describe() # Returns a statistical summary of the data\n",
    "summary.T"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`What does the individual column '25%' mean?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In case you need a refresher of the statistical measures produced by the .describe()-method, you can run the following code which visualizes these measures for a gaussian distributed random variable (RV) with the same *mean* and *std*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_n_get_distribution(data: pd.DataFrame, feature_name: str, resolution=10)-> tuple:\n",
    "    \"\"\"Plots the distribution of the data as a histogram and compares it to a gaussian distribution with similar statistics.\n",
    "\n",
    "    Args:\n",
    "        data (pd.DataFrame): Data which contains the feature name.\n",
    "        feature_name (str): String which specifies the feature name.\n",
    "        resolution (int, optional): Number of bins in which the data should be packed. Defaults to 10.\n",
    "\n",
    "    Returns:\n",
    "        tuple: Tuple containing the number of values in each bin and the edges of the bins generated by the hist command.\n",
    "    \"\"\"\n",
    "    summary = data.describe()\n",
    "    feature_number = summary.columns.get_loc(feature_name)\n",
    "    n, bins, patches = plt.hist(data[feature_name], label=\"data distribution (histogram)\", bins=resolution)\n",
    "    mean = summary.loc[\"mean\", feature_name]\n",
    "    std = summary.loc[\"std\", feature_name]\n",
    "    x = np.linspace(mean-3*std, mean+3*std, 1000)\n",
    "    gaussian = stats.norm.pdf(x, mean, std)\n",
    "    scalar = max(n)/max(gaussian)\n",
    "    plt.vlines(summary.iloc[4:-1, feature_number], 0, max(n), colors=\"k\", label=\" \".join(summary.index[4:-1]))\n",
    "    plt.vlines(mean, 0, max(n), colors=\"red\", label=\"mean\")\n",
    "    plt.vlines([mean+std, mean-std], 0, max(n), colors=\"orange\", label=\"stds\")\n",
    "    plt.plot(x, gaussian*scalar, color=\"lightgreen\", label=\"gaussian distribution with same mean and std\")\n",
    "    plt.ylim(0, max(n))\n",
    "    plt.title = feature_name\n",
    "    plt.legend(bbox_to_anchor=(1, 1))\n",
    "    plt.show()\n",
    "    return n, bins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_n_get_distribution(data, \"avg_Nachdruck\", resolution=10) # Put in the feature name"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Why is the 25% line at the very bottom of the first bin?`\n",
    "\n",
    "\n",
    "<span style=\"font-size:0.6em;\"> Hint: play with the resolution </span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to quickly check whether the dimensions of the summary and the data are the same..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"{len(summary.T)=}\")\n",
    "print(f\"{len(data.columns)=}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Yikes! What happened here? How come the summary describes only 40 features although the dataset contains 42 features?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features `Datum` and `Startzeit` are unfortunately not in a common number format. Now we can decide if we want to keep the features or if we choose to discard them. If we wanted to keep them we would have to transform them in a meaningful way. However for our application we do not need to do it. Therefore we can drop them manually. Or even better: We can only include columns with a common number format in our data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`*How could these two features be relevant?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = data.select_dtypes(include=np.number)\n",
    "print(f\"{len(summary.T)=}\")\n",
    "print(f\"{len(data.columns)=}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are some values which have a std of 0.0000. This is probably due to an error during the data collection process. Even if it wasn't an error, columns with a standard deviation of 0 still contain no valuable information. Therefore we can drop them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = data.loc[:, data.std() > 0]\n",
    "summary = data.describe()\n",
    "summary.T\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`We removed features with a `*`std`*`  too low. But are there also  `*`std`*`s which are too high?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general this step needs some knowledge about the features. In some settings the feature names are not known to the data scientist. However we know a little bit about the features, their physics and the injection moulding process itself..."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the features which need special attention are:\n",
    "- `Gewicht`\n",
    "- `Seriennummer`\n",
    "- `Vorlauf_Temperiergeraet_AS`\n",
    "- `Ruecklauf_Temperiergeraet_AS`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first start with the bottom two:\n",
    "`Vorlauf_Temperiergeraet_AS` and `Ruecklauf_Temperiergeraet_AS` both have a *std* of more than $10^{16}$. If we look more closely on these values we see that their maximum value is above $10^{18}$. As a reference, the core temperature of the sun is about $1.5 \\cdot 10^7$ degrees Celsius. Since these features are measured in reasonable units we can conclude that we are either indirect witnesses of a historical event, or that there is an error in the data...\n",
    "\n",
    "We need to decide whether we want to completely exclude these features or if we just want to drop corrupt data points. In order to do that let's plot their distribution like before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n, bins = plot_n_get_distribution(data, \"Vorlauf_Temperiergeraet_AS\", resolution=20)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the 25, 50 and 75% lines are all at the same location. Therefore we have a comparibly small amount of so-called outliers. In order to detect them we could use many fancy methods or we can just check the bins generated and do some \"manual\" data wrangling..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(bins)\n",
    "print(n)\n",
    "type(n)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that there are 409 values in the first bin. However the right edge of this bin is at $8.1055400\\cdot 10^{18}$ degrees. It follows that even in the good bin there could be some outliers...\n",
    "\n",
    "We can sort the data and drop some rows until we are satisfied."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop these values\n",
    "data = data.sort_values(\"Vorlauf_Temperiergeraet_AS\", ascending=True)[:-10]\n",
    "data\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's look at the distribution again..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_n_get_distribution(data, \"Vorlauf_Temperiergeraet_AS\", resolution=20)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can see there are still some potential outliers, but they are in a temperature range which is technically possible. It is up to you to decide whether to keep them or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional drop or don't drop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_n_get_distribution(data, \"Ruecklauf_Temperiergeraet_AS\", resolution=20)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see also here are some outliers. Therefore we repeat the step we have previously done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = data.sort_values(\"Ruecklauf_Temperiergeraet_AS\", ascending=True)[:-6]\n",
    "plot_n_get_distribution(data, \"Ruecklauf_Temperiergeraet_AS\", resolution=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional drop or don't drop"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the data again..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.describe().T"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are satisfied you can continue. Otherwise drop some more rows. If you believe to know something about the features you can filter the data using maximal and minimal values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_name = \"\"\n",
    "value = 100\n",
    "\n",
    "data = data[data[feature_name] < value] # generic command which can easily be altered to only include values above or below a certain value"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Now that we got rid of outliers we need to continue with other special variables:\n",
    "\n",
    "`Seriennummer` is most definetly not relevant for the process, therefore we can drop it."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Could the feature Seriennummer give valuable insight or is it always useless?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = data.drop([\"Seriennummer\"], axis=1)\n",
    "summary = data.describe().T\n",
    "summary"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Gewicht` is the response variable. This means we have to separate it from the rest of the data at some point. We could do it right before the training or we can do it now. Both variants have pros and cons. Nevertheless we split it now."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`What could be the pros and cons of detaching the data later?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = data.pop(\"Gewicht\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we want to compare the features with eachother to detect collinearity. This means that some features contain more or less the same information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_mat = data.corr()\n",
    "corr_mat"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see in the diagonal we have a 1 which is not surprising, since a variable is 100% correlated with itself. We can also observe a correlation coefficient of > 0.9 between `Umschaltpunkt` and `Fliesszahl`. There might be other features with a high correllation coefficient. Let's filter those out! \n",
    "\n",
    "<span style=\"font-size:0.8em;\"> Note: The factor which can be used to detect high correlation can be different from 0.9, but if there is no prior knowledge we can leave it like that. </span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some code which gets all features with an absolute correlation coefficient > 0.9\n",
    "threshold = 0.9\n",
    "condition = (corr_mat.abs() > threshold) & (corr_mat.abs() < 1)\n",
    "corr_features = corr_mat.where(condition).stack()\n",
    "corr_features = corr_features.sort_values(ascending=False)[::2] # since the entire correlation matrix would be redundant we take every second entry\n",
    "corr_features\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Why do we use .abs() in the filter condition?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have every pair of features which show a significant correlation with each other. Since there are features which are correlated with multiple other features, e.g. `Nachdruckarbeit`, `Integral_Nachdruck`, `Massepolster`, we could assign groups and decide what to do with the entire group instead of just individual features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_list = corr_features.index.to_list()\n",
    "groups = [set(t) for t in corr_list]\n",
    "already_used = []\n",
    "for g in groups.copy():\n",
    "    new_group = g\n",
    "    for gr in groups.copy():\n",
    "        if new_group.intersection(gr):\n",
    "            new_group.update(gr)\n",
    "            groups.remove(gr)\n",
    "    groups.append(new_group)\n",
    "    # if new_group not in groups: groups.append(new_group)\n",
    "groups \n",
    "\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we identified these buddy-variables we need to decide what to do with them. Should we combine them? Should we drop them? Should we consider the fact that these are just so highly correlated by chance?"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`* How would you cope with these groups?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "!!!\n",
    "\n",
    " At this point we can decide whether to use a PCA or to combine features manually. The first approach can always be used regardless of the features. The second approach needs more knowledge about the process. We will continue manually but we save the data here as `data_before_PCA` to have a save point which you can use later if you want to improve the model.\n",
    " \n",
    " !!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_before_PCA = data.copy()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the easiest but nevertheless a good way to handle this collinearity is to just drop variables from a group and keep only one representative feature. However, since we have a little bit of knowledge about the process we can argue what we want to do with the features."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A possible approach for these six groups would be:\n",
    "1. Use only `Temperatur_Duese` since it is the nearest temperature to the material out of this group\n",
    "2. Use  `Einspritzzeit` $\\cdot$ `avg_Einspritzgeschwindigkeit`  since it gives an indication for the volume pumped into the form.\n",
    "3. see above\n",
    "4. Use only `Kuehlzeit` since it carries probably more unique information (experience)\n",
    "5. Use the average of the variables `Ruecklauf_Temperiergeraet_AS`, `Ruecklauf_Temperiergeraet_DS` and `Vorlauf_Temperiergeraet_DS`\n",
    "6. Use only `avg_Nachdruck` since it carries probably more unique information (experience)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vars_to_keep = ['Temperatur_Duese', 'Einspritzzeit', 'avg_Einspritzgeschwindigkeit', 'avg_Nachdruck', 'Kuehlzeit', 'Ruecklauf_Temperiergeraet_AS', 'Ruecklauf_Temperiergeraet_DS', 'Vorlauf_Temperiergeraet_DS']\n",
    "vars_to_drop = set.union(*groups) - set(vars_to_keep)\n",
    "data.drop(vars_to_drop, axis=1, inplace=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After Dropping the \"useless\" variables we can proceed with combining the two features `Einspritzzeit` and `avg_Einspritzgeschwindigkeit` into the feature `Einspritzvolumen` and make some kind of average variable for the three correllated `Temperiergeraet` features.\n",
    "\n",
    "<span style=\"font-size:0.8em;\"> Note: This is not the actual volume, since we only use averages, however it serves as a reference. </span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"Einspritzvolumen\"] = data[\"Einspritzzeit\"] * data[\"avg_Einspritzgeschwindigkeit\"]\n",
    "data[\"Temperiergeraet_avg\"] = (data[\"Ruecklauf_Temperiergeraet_AS\"]+data['Ruecklauf_Temperiergeraet_DS']+data['Vorlauf_Temperiergeraet_DS'])/3\n",
    "data.drop([\"Einspritzzeit\", \"avg_Einspritzgeschwindigkeit\", \"Ruecklauf_Temperiergeraet_AS\", 'Ruecklauf_Temperiergeraet_DS', 'Vorlauf_Temperiergeraet_DS'], axis=1, inplace=True)\n",
    "data.describe().T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_mat = data.corr().abs()\n",
    "corr_mat"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the data seems clean, so let's split it into a training and a test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(data, y, shuffle=True, test_size=0.25, random_state=7)\n",
    "X_train"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we could train a linear regression model with X_train and y_train, but we have 300 data points and 17 features."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Is this reasonable for a linear regression?`\n",
    "\n",
    "<span style=\"font-size:0.6em;\"> Hint: Think about the underlying assumption of a linear regression.</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "2**17"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No, it is not, even with the assumption of perfectly measured data and perfectly linear relations we still would need 131072 data points.  But this is a common problem in Machine Learning and there are some possible solutions to this problem:\n",
    "- **Subset Selection**\n",
    "- **Dimension Reduction**\n",
    "- **Shrinkage**"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Subset Selection\n",
    "In this section we want to get a group of parameters which is hopefully the most relevant and just ignore the rest. Since we have around 300 rows we would like to get a subset which contains at maximum 8 predictor variables, since $$ 8 < log_2(300) < 9 $$. This way we have an equivalent point density like we would have with two measuring points in one dimension.\n",
    "\n",
    "### Best Subset Selection\n",
    "This method is the version of the subset selection with the potential best outcome. However it can be rather computationally intensive, therefore we check how many different combinations we would have to try:\n",
    "$$ n_{combs} = \\sum_{k=1}^{8}\\binom{17}{k} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k = np.arange(8) + 1\n",
    "combs = np.sum(binom(17, k))\n",
    "combs "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Seems like a little bit too many combinations to try..."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stepwise Subset Selection\n",
    "Since we know that testing every possible combination is rather unpractical, we trade accuracy with practicality. Therefore it suffices to just get a good model instead of the best. To actually do this subset selection we can choose between various common methods: **Forward**, **Backward** and **Mixed** stepwise subset selection. With these methods we have an initial model and either add or drop the feature depending on its potential improvement or deterioration. We continue this method until our model contains the specified number of predictor variables."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Why does a stepwise selection not always lead to the best solution?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import SequentialFeatureSelector\n",
    "estimator = LinearRegression()\n",
    "\n",
    "nf = 5 # Play with this value (max: 8)\n",
    "\n",
    "selector1 = SequentialFeatureSelector(estimator, n_features_to_select=nf, direction=\"forward\")\n",
    "selector1.fit(X_train, y_train)\n",
    "mask = selector1.support_\n",
    "data.columns[mask]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "selector2 = SequentialFeatureSelector(estimator, n_features_to_select=nf, direction=\"backward\")\n",
    "selector2.fit(X_train, y_train)\n",
    "mask = selector1.support_\n",
    "data.columns[mask]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both the forward and the backward method returned the same features necessary for a fit. Therefore we run with it."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Play with the factor nf and see what the different features are. What are the pros and cons of using higher or lower nfs?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_sfs = selector2.transform(X_train)\n",
    "X_test_sfs = selector2.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X_train_sfs.shape)\n",
    "print(X_test_sfs.shape)\n",
    "X_train_sfs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Dimension Reduction\n",
    "### PCA\n",
    "In contrast to the trial and error methods of subset selection. PCA has a more mathematical approach which tries to map the data onto imaginary axes which are orthogonal to each other. These axes are chosen in a way that they maximize the variance of the data along them. A huge benefit of it is that it uses all the information available and does not omit certain features. If there is no knowlege of the features it is probably the best method to use in general, but since we know the process and the features a little bit, we use it in addition to the other methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "n_components = 3 # <- Fill in here\n",
    "pca = PCA(n_components=n_components) # value can also be altered\n",
    "pca.fit(X_train)\n",
    "pca.explained_variance_ratio_\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that with the first component alone we can capture < 95% of the total variance in the data. Let's fit a model with only this one and watch how it performs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_pca = pca.transform(X_train)\n",
    "X_val_pca = pca.transform(X_test)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Shrinkage\n",
    "The last but not least approach is called shrinkage. Here the variables themselves are not affected but there influence is. The two main methods of the family of the shrinkage methods are the *Lasso* and the *Ridge Regression* algorithms, which are explained in the next section."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regression\n",
    "Now that we prepared our data it is time regress $\\mathbf{y}$ on $\\mathbf{X}$ to see how good of a model we can get to describe the relation between input and output."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder, in a one dimensional setting a linear regression looks like this\n",
    "$$\n",
    "\\hat{y} = \\beta_0 + \\beta_1 x\n",
    "$$\n",
    "If we take vectors instead of scalars we can change the notation to this:\n",
    "$$\n",
    "\\mathbf{\\hat{y}} = \\beta_0 + \\beta_1 \\mathbf{x}\n",
    "$$\n",
    "$\\mathbf{x}$ now represents a matrix and $\\mathbf{\\hat{y}}$ a vector. This is more like our setting, because we have multiple measured points. If we add more $\\beta\\!$ s we can write it like this that it captures exactly our problem.\n",
    "$$\n",
    "\\mathbf{\\hat{y}} = \\beta_0 + \\boldsymbol{\\beta} \\mathbf{X}\n",
    "$$\n",
    "$\\mathbf{X}$ is now a matrix and $\\boldsymbol{\\beta}$ a vector but the idea stays the same.\n",
    "\n",
    "In all of these problems we try to find the $\\beta$ s in such a way that the calculated $\\mathbf{\\hat{y}}(\\mathbf{X},\\boldsymbol{\\beta})$ matches the measured $\\mathbf{y}$ as well as possible. This is done by minimising the *Residual Sum Squared* (RSS).\n",
    "$$\n",
    "RSS = \\sum_{j=1}^{p}(y-\\hat{y})^2 = (\\mathbf{y}-\\mathbf{\\hat{y}})^\\top(\\mathbf{y}-\\mathbf{\\hat{y}})"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now create a linear regression model for each transformed dataset in $\\mathbf{X}$ we have:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linreg1 = LinearRegression()\n",
    "linreg1.fit(X_train_sfs, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linreg2 = LinearRegression()\n",
    "linreg2.fit(X_train_pca, y_train)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally we can use the shrinkage methods mentioned in the previous section, which slightly alter the optimisation problem of linear regression to give weights to the individual $\\beta$ s:\n",
    "We now not only try to minimise the RSS but we also give the condition that the sum of all $\\beta$ s should also be relatively small. More specifically, the *Lasso* algorithm tries to minimise:\n",
    "$$\n",
    "RSS + \\lambda \\sum_{j=1}^{p}|\\beta_j|\n",
    "$$\n",
    "and the *Ridge* regression this:\n",
    "$$\n",
    "RSS + \\lambda \\sum_{j=1}^{p}\\beta_j^2\n",
    "$$\n",
    "$\\lambda$ is a so-called hyperparameter which can be tuned with the validation set."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a Lasso and a Ridge model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso = LassoCV(fit_intercept=True)\n",
    "lasso.fit(X_train, y_train)\n",
    "\n",
    "ridge = Ridge(alpha=10, fit_intercept=True)\n",
    "ridge.fit(X_train, y_train)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation\n",
    "Now that we have fitted several models it is time to check their accuracy using the following metrics for comparison:\n",
    "$$\n",
    "MSE = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2\n",
    "$$\n",
    "$$\n",
    "MAE = \\frac{1}{n} \\sum_{i=1}^{n} |(y_i - \\hat{y}_i)|\n",
    "$$\n",
    "$$\n",
    "R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar{y})^2},  \\quad   \\bar{y} = \\frac{1}{n} \\sum_{i=1}^n y_i\n",
    "$$"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Which metric is good for what?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def check_accuracy(X_test: np.array, y_test: pd.DataFrame, model: sklearn.linear_model)-> np.array:\n",
    "    \"\"\"Prints the accuracy of a model with the three metrics MSE, MAE and R²-score and returns the predicted data.\n",
    "\n",
    "    Args:\n",
    "        X_test (np.array): Numpy array of input test data.\n",
    "        y_test (pd.DataFrame): Pandas DataFrame of output test data.\n",
    "        model (sklearn.linear_model): Linear model from sklearn.\n",
    "\n",
    "    Returns:\n",
    "        np.array: Numpy array of predicted data.\n",
    "    \"\"\"\n",
    "    pred = model.predict(X_test)\n",
    "    mse = MSE(pred, y_test)\n",
    "    mae = MAE(pred, y_test)\n",
    "    r2 = r2_score(pred, y_test)\n",
    "    print(f\"{mse=}\")\n",
    "    print(f\"{mae=}\")\n",
    "    print(f\"{r2=}\")\n",
    "    return pred\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Linear Regression with Subset Selection:\")\n",
    "y_pred_sfs = check_accuracy(X_test_sfs, y_test, linreg1)\n",
    "print(\"\\nLinear Regression with PCA:\")\n",
    "y_pred_pca = check_accuracy(X_val_pca, y_test, linreg2)\n",
    "print(\"\\nLasso:\")\n",
    "y_pred_lasso =check_accuracy(X_test, y_test, lasso)\n",
    "print(\"\\nRidge:\")\n",
    "y_pred_ridge = check_accuracy(X_test, y_test, ridge)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the predicted vs. the actual values to get a better understanding of how well our data performs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_data(y_train: np.array, y_test: pd.DataFrame, y_pred: np.array)-> None:\n",
    "    \"\"\"Plots the training, test and predicted data in one plt.Axes object.\n",
    "\n",
    "    Args:\n",
    "        y_train (np.array): Numpy Array of training data.\n",
    "        y_test (pd.DataFrame): Pandas DataFrame of test data.\n",
    "        y_pred (np.array): Numpy Array of predicted data.\n",
    "    \"\"\"\n",
    "    fig, ax = plt.subplots()\n",
    "    ys = np.c_[y_test, y_pred]\n",
    "    # Sorting the values to better display them\n",
    "    ys = ys[ys[:, 0].argsort()]\n",
    "    x_y_train = np.arange(len(y_train))\n",
    "    x_y_test = np.arange(len(y_train), len(y_train)+len(ys))\n",
    "    ax.plot(x_y_train, y_train.reindex(y_train.index.sort_values()), label=\"Training Data\")\n",
    "    ax.plot(x_y_test, ys[:, 0], label=\"Test Data\")\n",
    "    ax.plot(x_y_test, ys[:, 1], label=\"Predicted Data\")\n",
    "    ax.set_ylabel(f\"{response_name}g\")\n",
    "    ax.legend()\n",
    "    fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_data(y_train, y_test, y_pred_sfs) # You can use different y_pred values to change the view"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`How would you describe the performance of your models? Is there a clear winner? Does it make sense?`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have a few options to improve the model:\n",
    "- We can alter the number of features which should be chosen.\n",
    "- We can mix the subset selection approach with other approaches like the PCA or the Lasso algorithm.\n",
    "- We can omit the potential outliers we saw during preprocessing. \n",
    "- We can use a PCA earlier since our decisions for exclusions of features may not be optimal.\n",
    "- We can add transformed features like `avg_Nachdruck`$^2$ or `Dosierzeit`$\\cdot$`Temperatur_Duese`. Of course it makes sense to look at the physical relations of the features and the process for this, but trial and error is also a method, which can lead to success.\n",
    "\n",
    "Now try to improve the model."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`What is the best way to improve your model without needing to know much about the process?`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.10 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
