# **Introduction to Google Colaboratory (short: Colab)**

Create a folder in your Google Drive where you save your Colab notebooks and your datasets (.csv files).

- Show how to create "Text" sections to give yourself an overview.
- Show how to create sections (using the capital-letter function).
- Show how to create "headlines".
- Show how to create a "Code box".

# **Introduction to DataFrames**

**Import NumPy and pandas modules**

In [None]:
import numpy as np # A tool for working with arrays
import pandas as pd # A tool for data analysis and manipulation - we will load the dataset into something called a DataFrame.

# **Import a .csv-file into a DataFrame**

In [None]:
from google.colab import drive # Provides access to your Google Drive
drive.mount('/content/drive') # Mount your Drive at /content/drive

# Create a DataFrame with the cereal dataset.
df = pd.read_csv('/content/drive/MyDrive/DVD/cereal.csv') # Remember to find the specific path and name of the .csv file.
# /content/drive/MyDrive/Colab Notebooks/archive/cereal.csv
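
If the path does not match the file's real location in your Drive, `pd.read_csv` raises a `FileNotFoundError`. The sketch below is one possible way to get a more helpful message: it checks the path first and lists what is actually in the folder. The helper name `load_csv_checked` and the temporary demo file are assumptions made for illustration, not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def load_csv_checked(path):
    """Read a CSV into a DataFrame, with a clearer error if the path is wrong.

    Hypothetical helper for illustration only.
    """
    if not os.path.exists(path):
        folder = os.path.dirname(path) or "."
        # List what is actually in the folder (if it exists) to help spot typos.
        found = sorted(os.listdir(folder)) if os.path.isdir(folder) else []
        raise FileNotFoundError(f"No file at {path!r}. Files in {folder!r}: {found}")
    return pd.read_csv(path)


# Demo with a throwaway file, so the sketch runs without Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("name,rating\nA,50.0\nB,30.0\n")

    demo_df = load_csv_checked(demo_path)  # works: the file exists

    try:
        load_csv_checked(os.path.join(tmp, "missing.csv"))
        error_message = None
    except FileNotFoundError as err:
        error_message = str(err)  # mentions demo.csv, the file that is there
```

In Colab you would call the helper with your own Drive path in place of `demo_path`.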


**Specifying a subset of a DataFrame**

In [None]:
# Print the first 5 rows of the DataFrame.
df.head()

In [None]:
# Print rows #0, #1 and #2
df.head(3)

In [None]:
# Print rows #1, #2 and #3
df[1:4]

In [None]:
# Print row #2
df.iloc[[2]]

In [None]:
# Print the last 10 rows of the DataFrame.
df.tail(10)

In [None]:
# Print a specified column (double brackets return a DataFrame with one column)
df[['rating']]
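
The selections above differ in two ways that are easy to miss: slices exclude their end position, and single versus double brackets decide whether you get a Series or a DataFrame. Here is a small sketch on toy data (the values are made up for illustration and are not the cereal dataset):

```python
import pandas as pd

# Toy data, made up for illustration (not the cereal dataset).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "rating": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

first_three = toy.head(3)    # rows 0, 1 and 2
middle = toy[1:4]            # rows 1, 2 and 3 (the end position 4 is excluded)
one_row = toy.iloc[[2]]      # double brackets: a DataFrame with one row
row_series = toy.iloc[2]     # single brackets: the same row as a Series
last_two = toy.tail(2)       # the last two rows
col_frame = toy[["rating"]]  # double brackets: a DataFrame with one column
col_series = toy["rating"]   # single brackets: the column as a Series
```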

**Calculating and doing statistics**

Find the mean of the "rating" variable

In [None]:
df['rating'].mean()

Show the number of unique values in each column

In [None]:
df.nunique()

Show the number of unique values in a specific column

In [None]:
df['mfr'].nunique()
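
To see what these calls return, here is a small sketch on toy data (values made up for illustration, not the cereal dataset). `value_counts()` is an extra that does not appear in the cells above. It is included because it shows how often each value occurs, which `nunique()` does not tell you.

```python
import pandas as pd

# Toy data, made up for illustration (not the cereal dataset).
toy = pd.DataFrame({
    "mfr": ["K", "K", "G", "N", "G", "K"],
    "rating": [40.0, 60.0, 50.0, 70.0, 30.0, 50.0],
})

mean_rating = toy["rating"].mean()  # average of the six ratings
unique_per_column = toy.nunique()   # a Series: one count per column
mfr_unique = toy["mfr"].nunique()   # a single number for one column
counts = toy["mfr"].value_counts()  # how many rows each mfr value has
```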

# **Introduction to Altair**

**Start by installing and importing the Altair library**

In [None]:
!pip install altair==5.0.0rc1

In [None]:
import altair as alt

**Create a scatterplot with the sodium and rating variables**

In [None]:
alt.Chart(df).mark_point().encode(
    x = "sodium:Q",
    y = "rating:Q"
)

**Color code based on "name"** - Not a good variable to color code by: nearly every cereal has its own name, so nearly every point gets its own color

In [None]:
alt.Chart(df).mark_point().encode(
    x = "sodium:Q",
    y = "rating:Q",
    color = "name:N"
)

**Color code based on "mfr"** - A better variable to color code by: several cereals share each manufacturer, so the colors form a few readable groups

In [None]:
alt.Chart(df).mark_point().encode(
    x = "sodium:Q",
    y = "rating:Q",
    color = "mfr:N"
)

**Color code based on "mfr" - try data types other than "Nominal" (here "Ordinal", written `:O`)**

In [None]:
alt.Chart(df).mark_point().encode(
    x = "sodium:Q",
    y = "rating:Q",
    color = "mfr:O"
)

**Make the visualization interactive** - Make it possible to "move around" (pan and zoom) in the visualization

In [None]:
chart = alt.Chart(df).mark_point().encode(
    x = "sodium:Q",
    y = "rating:Q",
    color = "mfr:N"
)

chart.interactive()

**Add a tooltip to the visualization to show the "name" of each data point**

In [None]:
chart = alt.Chart(df).mark_point().encode(
    alt.X("sodium:Q"),
    alt.Y("rating:Q"),
    color = "mfr:N",
    tooltip = ["name"]
)

chart.interactive()

**Make a bar chart with the "mfr" and "rating" variables**

In [None]:
bar = alt.Chart(df).mark_bar().encode(
    x='mfr:N',
    y='rating:Q'
)

bar
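
The bar chart above specifies no aggregation, so every cereal draws its own bar at its manufacturer's x position. If what you want is one summary value per manufacturer, Altair accepts an aggregate in the shorthand, for example `y='mean(rating):Q'`. The sketch below uses pandas on toy data (values made up for illustration) to show the per-manufacturer numbers such an aggregate would be based on:

```python
import pandas as pd

# Toy data, made up for illustration (not the cereal dataset).
toy = pd.DataFrame({
    "mfr": ["K", "K", "G", "G", "N"],
    "rating": [40.0, 60.0, 30.0, 50.0, 70.0],
})

# One value per manufacturer: the total and the average rating.
summed = toy.groupby("mfr")["rating"].sum()
averaged = toy.groupby("mfr")["rating"].mean()
```

Comparing `summed` and `averaged` makes it clear why the choice of aggregate matters: a manufacturer with many cereals gets a large total even when its average rating is low.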

# **Optional**

**Create a DataFrame**
- The DataFrame is similar to a spreadsheet as the data is contained in cells with named columns and numbered rows.

In [None]:
# Create and populate a 7x2 NumPy array (= 14 cells).
my_data = np.array([[0,3], [10,7], [15,8], [20,9], [25,12], [30,14], [40,15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names) # The first argument = data, second argument = names of columns.

# Print the entire DataFrame.
my_dataframe

**Adding a new column to a DataFrame**

In [None]:
# Create a new column named "adjusted".
my_dataframe['adjusted'] = my_dataframe['activity'] + 2 # Each value is the value of the "activity" column plus 2.

# Print the entire DataFrame.
print(my_dataframe)

**Show the row of the cereal with the highest calories**

In [None]:
df.loc[df['calories'].idxmax()]
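
This lookup works in two steps: `idxmax()` returns the index label of the largest value, and `.loc[...]` then selects the row with that label. The sketch below uses toy data (values made up for illustration) with a non-default index, so the difference between a label and a row position is visible:

```python
import pandas as pd

# Toy data, made up for illustration. The index labels are 10, 11, 12 on purpose.
toy = pd.DataFrame(
    {"name": ["a", "b", "c"], "calories": [110, 160, 90]},
    index=[10, 11, 12],
)

label = toy["calories"].idxmax()  # index label of the row with the most calories
row = toy.loc[label]              # select that row by label; the result is a Series
```

If several rows share the maximum, `idxmax()` returns the label of the first one.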

**Create a new DataFrame with filtered values**

In [None]:
# Keep only the rows where "calories" is above 100 (the threshold is just an example).
filtered_df = df[df['calories'] > 100]
print(filtered_df)
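
Filtering like this relies on a boolean mask: the comparison produces one True/False value per row, and indexing the DataFrame with that mask keeps only the True rows. Here is a minimal sketch on toy data (column names and values are made up for illustration), including how to combine two conditions:

```python
import pandas as pd

# Toy data, made up for illustration (not the cereal dataset).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "calories": [70, 110, 120, 150],
    "rating": [68.0, 40.0, 55.0, 30.0],
})

mask = toy["calories"] > 100  # a boolean Series, one value per row
filtered = toy[mask]          # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
both = toy[(toy["calories"] > 100) & (toy["rating"] > 50)]
```

The filtered result is a new DataFrame and the original `toy` is left unchanged.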