A high-level overview of what Kaggle is, plus some beginner data science topics.
Kaggle is a place to learn about data science, find data sets and get involved in competitions.
If you are new to data science, Kaggle has its own learning section, which is completely free. I will cover some of the topics briefly below; however, if you want to learn data science properly, I would recommend checking out their material.
The information in these sections is taken from Kaggle Learn. The website has its own custom way of running kernels and interactive lessons. The information below is for reference only.
Python is the main programming language for solving these problems. You can check out Kaggle's Python learning, which covers the basics, here.
The most popular Python library for data analysis.
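All the pandas examples below assume the library's conventional import alias:
import pandas as pd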
A DataFrame is a table containing an array of individual entries, each of which has a certain value.
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Yes No
0 50 131
1 21 2
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Bob Sue
0 I liked it. Pretty good.
1 It was awful. Bland.
The list of row labels used in a DataFrame is known as an index. Indexes can be assigned as a parameter in the constructor.
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
A Series is a sequence of data values. If a DataFrame is a table, a Series is a list; essentially, it's a single column of a DataFrame. You can assign row labels to a Series using the index parameter. However, a Series does not have a column name; it only has one overall name.
pd.Series([30, 35, 40], index=['2015 sales', '2016 sales', '2017 sales'], name="Product A")
Output
2015 sales 30
2016 sales 35
2017 sales 40
Name: Product A, dtype: int64
You can use the shape attribute to check how large the DataFrame is.
wine_reviews.shape
Output
(129971, 14)
head() grabs the first five rows.
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
The data type of a column in a DataFrame or a Series is known as its dtype. You can use dtype to get the type of a specific column.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)
reviews.price.dtype
Output
dtype('float64')
Entries with missing values are given the value NaN, short for “Not a Number”. For technical reasons, these NaN values are always of the float64 dtype.
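As a minimal sketch (with a made-up Series, not from the original notes), a single NaN upcasts an otherwise integer column to float64:
import pandas as pd
import numpy as np
s = pd.Series([1, 2, np.nan])  # the ints are upcast to float64 so NaN can be represented
print(s.dtype)  # float64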
Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull().
reviews[pd.isnull(reviews.country)]
Replacing missing values is a common operation. Pandas provides the method fillna(), which here changes the values from NaN to "Unknown".
reviews.region_2.fillna("Unknown")
Output
0 Unknown
1 Unknown
...
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
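Note that fillna() returns a new Series rather than modifying the DataFrame in place; to keep the change, assign the result back:
reviews.region_2 = reviews.region_2.fillna("Unknown")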
You can also use replace().
reviews.example_data_set.replace("examplesss", "examples")
Okay, so you know how to program in Python, but what is machine learning?
Machine learning means training a model on data so that it can make predictions and draw conclusions about new data.
In the code block below, the CSV file is read in and the columns of the dataset are listed.
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns
Once this is done, you need to select the prediction target. What’s the purpose of the model you’re trying to build?
You can use dot notation to select the column you want to predict, which is called the prediction target. By convention, the prediction target is called y.
y = melbourne_data.Price
The columns inputted into the model are called features. By convention, this data is called X.
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
X.describe()
X.head()
X.describe() shows summary statistics for the features, while X.head() shows the top few rows. The head() output for the house-price features looks like this:
Rooms Bathroom Landsize Lattitude Longtitude
1 2 1.0 156.0 -37.8079 144.9934
2 3 2.0 134.0 -37.8093 144.9944
4 4 1.0 120.0 -37.8072 144.9941
6 3 2.0 245.0 -37.8024 144.9993
7 2 1.0 256.0 -37.8060 144.9954
This section is about making a model more accurate by avoiding two common failure modes: overfitting and underfitting.
Overfitting: the model matches the training data almost perfectly but does poorly on validation and other new data. It captures spurious patterns that won't recur in the future, leading to less accurate predictions.
Underfitting: the model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data. It misses relevant patterns, again leading to less accurate predictions.
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # train/validation splits come from train_test_split, shown further down
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
The random forest is a more sophisticated modelling technique: it uses many decision trees and makes a prediction by averaging the predictions of each component tree.
import pandas as pd
# load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# filter rows (missing vals)
melbourne_data = melbourne_data.dropna(axis=0)
# choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
from sklearn.model_selection import train_test_split
# split data into training and validation (for both features and target)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
To build the random forest model:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
Output
191669.7536453626
Missing values in datasets are very common, and there are a number of ways they can be tackled.
The simplest way to deal with missing values is to drop the columns that contain them. This approach isn't the best, because those columns could hold valuable information for the model you are building.
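The snippets below call score_dataset(), which isn't defined in these notes. In the Kaggle lesson it is a small helper that fits a random forest and returns the validation mean absolute error; a sketch along those lines (the exact hyperparameters are an assumption):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # fit a random forest on the training data, then score it on the validation data
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)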
# get names of cols with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# drop those columns in both the training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE:")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
Output
MAE:
183550.22137772635
Imputation fills in the missing values with another number. For example, where there is an empty value you might substitute zero, or the mean of the column (which is what SimpleImputer does by default).
from sklearn.impute import SimpleImputer
# imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# put back removed column names
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
Output
MAE:
178166.46269899711
Imputed values can skew the model. As an extension to imputation, here we impute the values and also add a column showing whether each value was imputed or not (True or False).
# make a copy (to avoid changing the original data)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# make new cols indicating which values will be imputed
for col in cols_with_missing:
    X_train_plus[col + ' was missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + ' was missing'] = X_valid_plus[col].isnull()
# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# imputation removed the column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
print("MAE:")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
Output
MAE:
178927.503183954
In this tutorial, the Kaggle test set is used as the holdout set:
holdout = test
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
import numpy as np
test = pd.read_csv("test.csv")
test_shape = test.shape
print(test_shape)
train = pd.read_csv("train.csv")
train_shape = train.shape
print(train_shape)
train.head(10)
sex_pivot = train.pivot_table(index="Sex",values="Survived")
sex_pivot
pclass_pivot = train.pivot_table(index="Pclass",values="Survived")
pclass_pivot.plot.bar()
plt.show()
train['Age'].describe()
train[train["Survived"] == 1]
survived = train[train["Survived"] == 1]
died = train[train["Survived"] == 0]
survived["Age"].plot.hist(alpha=0.5,color='red',bins=50)
died["Age"].plot.hist(alpha=0.5,color='blue',bins=50)
plt.legend(['Survived','Died'])
plt.show()
def process_age(df, cut_points, label_names):
    # fill missing ages with a sentinel value, then bin ages into labelled categories
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df
cut_points = [-1,0,18,100]
label_names = ["Missing","Child","Adult"]
train = process_age(train,cut_points,label_names)
test = process_age(test,cut_points,label_names)
cut_points = [-1,0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", 'Infant', "Child", 'Teenager', "Young Adult", 'Adult', 'Senior']
train = process_age(train,cut_points,label_names)
test = process_age(test,cut_points,label_names)
age_cat_pivot = train.pivot_table(index="Age_categories",values="Survived")
age_cat_pivot.plot.bar()
plt.show()
train['Pclass'].value_counts()
column_name = "Pclass"
df = train
dummies = pd.get_dummies(df[column_name],prefix=column_name)
dummies.head()
def create_dummies(df, column_name):
    # one-hot encode the column and append the dummy columns to the DataFrame
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df
train = create_dummies(train,"Pclass")
test = create_dummies(test,"Pclass")
train.head()
train = create_dummies(train,"Sex")
test = create_dummies(test,"Sex")
train = create_dummies(train,"Age_categories")
test = create_dummies(test,"Age_categories")
train.head()
# train_X, train_y and test_X come from the train/test split shown further down
lr = LogisticRegression()
lr.fit(train_X, train_y)
predictions = lr.predict(test_X)
columns = ['Pclass_2', 'Pclass_3', 'Sex_male']
lr.fit(train[columns], train['Survived'])
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
'Age_categories_Missing','Age_categories_Infant',
'Age_categories_Child', 'Age_categories_Teenager',
'Age_categories_Young Adult', 'Age_categories_Adult',
'Age_categories_Senior']
lr.decision_function(train[columns])
lr.coef_
lr.fit(train[columns], train['Survived'])
columns = ['Pclass_2', 'Pclass_3', 'Sex_male']
all_X = train[columns]
all_y = train['Survived']
train_X, test_X, train_y, test_y = train_test_split(
    all_X, all_y, test_size=0.2, random_state=0)
train_X.shape
accuracy = accuracy_score(test_y, predictions)
accuracy
conf_matrix = confusion_matrix(test_y, predictions)
pd.DataFrame(conf_matrix, columns=['Survived', 'Died'], index=['Survived', 'Died'])
cross_val_score(estimator, X, y, cv=None)
The cv parameter sets the number of folds; below, 10-fold cross validation is used.
lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
np.mean(scores)
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
'Age_categories_Missing','Age_categories_Infant',
'Age_categories_Child', 'Age_categories_Teenager',
'Age_categories_Young Adult', 'Age_categories_Adult',
'Age_categories_Senior']
holdout.head()
lr = LogisticRegression()
lr.fit(all_X, all_y)
holdout_predictions = lr.predict(holdout[columns])
holdout_predictions
holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids,
"Survived": holdout_predictions}
submission = pd.DataFrame(submission_df)
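Finally, the submission DataFrame can be written out as a CSV ready to upload to Kaggle (the filename submission.csv here is an assumption):
submission.to_csv("submission.csv", index=False)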