A high-level overview of what Kaggle is, plus some beginner data science topics.
Kaggle is a place to learn about data science, find data sets and get involved in competitions.
If you are new to data science, Kaggle has its own learning section, which is completely free. I will cover some of the topics briefly below; however, if you want to learn data science properly, I would recommend checking out their material.
The information in these sections is taken from Kaggle Learn. The website has its own custom way of running kernels and interactive lessons. The information below is for reference only.
Python is the main programming language for solving these problems. You can check out Kaggle's Python learning, which covers the basics, here.
The most popular Python library for data analysis.
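All the pandas examples below assume the library's conventional import alias:
import pandas as pd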
A DataFrame is a table containing an array of individual entries, each of which has a certain value.
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Yes No
0 50 131
1 21 2
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Bob Sue
0 I liked it. Pretty good.
1 It was awful. Bland.
The list of row labels used in a DataFrame is known as an index. Indexes can be assigned as a parameter in the constructor.
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
A Series is a sequence of data values. If a DataFrame is a table, a Series is a list; essentially, it's a single column of a DataFrame. You can assign row labels to a Series using the index parameter. However, a Series does not have a column name; it only has one overall name.
pd.Series([30, 35, 40], index=['2015 sales', '2016 sales', '2017 sales'], name="Product A")
Output
2015 sales 30
2016 sales 35
2017 sales 40
Name: Product A, dtype: int64
You can use the shape attribute to check how large the DataFrame is.
wine_reviews.shape
Output
(129971, 14)
head() grabs the first five rows.
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
The data type of a column in a DataFrame or a Series is known as its dtype. You can use dtype to get the type of a specific column.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)
reviews.price.dtype
Output
dtype('float64')
Entries with missing values are given the value NaN, short for “Not a Number”. For technical reasons, these NaN values are always of the float64 dtype.
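As a minimal sketch (with a made-up Series, not from the original notes), a single NaN upcasts an otherwise integer column to float64:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, np.nan])  # the ints are upcast to float64 so NaN can be represented
print(s.dtype)  # float64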
Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull().
reviews[pd.isnull(reviews.country)]
Replacing missing values is a common operation. Pandas provides the method fillna(), which here changes the values from NaN to "Unknown".
reviews.region_2.fillna("Unknown")
Output
0 Unknown
1 Unknown
...
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
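Note that fillna() returns a new Series rather than modifying the DataFrame in place; to keep the change, assign the result back:
reviews.region_2 = reviews.region_2.fillna("Unknown")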
You can also use replace().
reviews.example_data_set.replace("examplesss", "examples")
Okay, so you know how to program in Python, but what is machine learning?
Machine learning means training a model on data so that it can make predictions and draw conclusions about new data.
In the code block below, the CSV file is read in and the columns of the dataset are listed.
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns
Once this is done, you need to select the prediction target. What’s the purpose of the model you’re trying to build?
You can use dot notation to select the column you want to predict, which is called the prediction target. By convention, the prediction target is called y.
y = melbourne_data.Price
The columns inputted into the model are called features. By convention, this data is called X.
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
X.head()
X.describe() shows summary statistics for the features, while X.head() shows the top few rows. The head() output for the house-price features looks like this:
Rooms Bathroom Landsize Lattitude Longtitude
1 2 1.0 156.0 -37.8079 144.9934
2 3 2.0 134.0 -37.8093 144.9944
4 4 1.0 120.0 -37.8072 144.9941
6 3 2.0 245.0 -37.8024 144.9993
7 2 1.0 256.0 -37.8060 144.9954
This section is about making a model more accurate by avoiding two common failure modes: overfitting and underfitting.
Overfitting: the model matches the training data almost perfectly but does poorly on validation and other new data. It captures spurious patterns that won't recur in the future, leading to less accurate predictions.
Underfitting: the model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data. It misses relevant patterns, again leading to less accurate predictions.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # train/validation splits come from train_test_split, shown further down
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
The random forest is a more sophisticated modelling technique: it uses many decision trees and makes a prediction by averaging the predictions of each component tree.
import pandas as pd
# load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# filter rows (missing vals)
melbourne_data = melbourne_data.dropna(axis=0)
# choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
# split data into training and validation (for both features and target)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
To build the random forest model:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
Output
191669.7536453626
Missing values in datasets are very common, and there are a number of ways they can be tackled.
The simplest way to deal with missing values is to drop the columns that contain them. This approach isn't the best, because those columns could hold valuable information for the model you are building.
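The snippets below call score_dataset(), which isn't defined in these notes. In the Kaggle lesson it is a small helper that fits a random forest and returns the validation mean absolute error; a sketch along those lines (the exact hyperparameters are an assumption):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # fit a random forest on the training data, then score it on the validation data
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)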
# get names of cols with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# drop those columns in both the training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE:")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
Output
MAE:
183550.22137772635
Imputation fills in the missing values with another number. For example, where there is an empty value you might substitute zero, or the mean of the column (which is what SimpleImputer does by default).
from sklearn.impute import SimpleImputer
# imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# put back removed column names
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
Output
MAE:
178166.46269899711
Imputed values can skew the model. As an extension to imputation, here we impute the values and also add a column showing whether each value was imputed or not (True or False).
# make a copy (to avoid changing the original data)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# make new cols indicating which values will be imputed
for col in cols_with_missing:
    X_train_plus[col + ' was missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + ' was missing'] = X_valid_plus[col].isnull()
# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# imputation removed the column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
print("MAE:")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
Output
MAE:
178927.503183954
In this tutorial, the Kaggle test set is used as the holdout set:
holdout = test
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
import numpy as np
test = pd.read_csv("test.csv")
test_shape = test.shape
print(test_shape)
train = pd.read_csv("train.csv")
train_shape = train.shape
print(train_shape)
train.head(10)
sex_pivot = train.pivot_table(index="Sex",values="Survived")
sex_pivot
pclass_pivot = train.pivot_table(index="Pclass",values="Survived")
pclass_pivot.plot.bar()
plt.show()
train['Age'].describe()
train[train["Survived"] == 1]
survived = train[train["Survived"] == 1]
died = train[train["Survived"] == 0]
survived["Age"].plot.hist(alpha=0.5,color='red',bins=50)
died["Age"].plot.hist(alpha=0.5,color='blue',bins=50)
plt.legend(['Survived','Died'])
plt.show()
def process_age(df, cut_points, label_names):
    # fill missing ages with a sentinel value, then bin ages into labelled categories
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df
cut_points = [-1,0,18,100]
label_names = ["Missing","Child","Adult"]
train = process_age(train,cut_points,label_names)
test = process_age(test,cut_points,label_names)
cut_points = [-1,0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", 'Infant', "Child", 'Teenager', "Young Adult", 'Adult', 'Senior']
train = process_age(train,cut_points,label_names)
test = process_age(test,cut_points,label_names)
age_cat_pivot = train.pivot_table(index="Age_categories",values="Survived")
age_cat_pivot.plot.bar()
plt.show()
train['Pclass'].value_counts()
column_name = "Pclass"
df = train
dummies = pd.get_dummies(df[column_name],prefix=column_name)
dummies.head()
def create_dummies(df, column_name):
    # one-hot encode the column and append the dummy columns to the DataFrame
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df
train = create_dummies(train,"Pclass")
test = create_dummies(test,"Pclass")
train.head()
train = create_dummies(train,"Sex")
test = create_dummies(test,"Sex")
train = create_dummies(train,"Age_categories")
test = create_dummies(test,"Age_categories")
train.head()
# train_X, train_y and test_X come from the train/test split shown further down
lr = LogisticRegression()
lr.fit(train_X, train_y)
predictions = lr.predict(test_X)
columns = ['Pclass_2', 'Pclass_3', 'Sex_male']
lr.fit(train[columns], train['Survived'])
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
'Age_categories_Missing','Age_categories_Infant',
'Age_categories_Child', 'Age_categories_Teenager',
'Age_categories_Young Adult', 'Age_categories_Adult',
'Age_categories_Senior']
lr.decision_function(train[columns])
lr.coef_
lr.fit(train[columns], train['Survived'])
columns = ['Pclass_2', 'Pclass_3', 'Sex_male']
all_X = train[columns]
all_y = train['Survived']
train_X, test_X, train_y, test_y = train_test_split(
    all_X, all_y, test_size=0.2, random_state=0)
train_X.shape
accuracy = accuracy_score(test_y, predictions)
accuracy
conf_matrix = confusion_matrix(test_y, predictions)
pd.DataFrame(conf_matrix, columns=['Survived', 'Died'], index=['Survived', 'Died'])
cross_val_score(estimator, X, y, cv=None)
The cv parameter sets the number of folds; below, 10-fold cross validation is used.
lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
np.mean(scores)
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
'Age_categories_Missing','Age_categories_Infant',
'Age_categories_Child', 'Age_categories_Teenager',
'Age_categories_Young Adult', 'Age_categories_Adult',
'Age_categories_Senior']
holdout.head()
lr = LogisticRegression()
lr.fit(all_X, all_y)
holdout_predictions = lr.predict(holdout[columns])
holdout_predictions
holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids,
"Survived": holdout_predictions}
submission = pd.DataFrame(submission_df)
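Finally, the submission DataFrame can be written out as a CSV ready to upload to Kaggle (the filename submission.csv here is an assumption):
submission.to_csv("submission.csv", index=False)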