After reimplementing K-Nearest Neighbors (KNN) from scratch, I wanted to take things a step further: build a cross-validation system entirely in base R. This meant re-creating folds, evaluating accuracy across them, and tuning both k
(the number of neighbors) and the training proportion. No caret, no external frameworks — just my own code.
The first step in cross-validation is creating training/test splits. For this, I wrote v_Rmach_fold, which generates n_fold pairs of train/test datasets. Each sample is stored in a small S4 class sample_Rmach with:
- train: the training dataframe
- test: the testing dataframe
- train_ids: row indices used for training
- test_ids: row indices used for testing

lst_test <- v_Rmach_fold(inpt_datf = iris[1:25,],
                         train_prop = 0.7,
                         n_fold = 4)
print(lst_test$sample1@train) # Training set
print(lst_test$sample1@test)  # Test set
This function essentially recreates what libraries like caret do internally, but in a transparent and lightweight way.
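For intuition, here is how such a fold generator could look in base R. This is only a sketch under my own naming (the fold_sample class and make_folds_sketch function are hypothetical stand-ins); the actual v_Rmach_fold implementation in Rmach may differ in its details.

# Sketch of a fold generator in base R, assuming repeated random
# train/test resampling; hypothetical names, not the Rmach code.
setClass("fold_sample", slots = c(
  train = "data.frame",
  test = "data.frame",
  train_ids = "numeric",
  test_ids = "numeric"
))

make_folds_sketch <- function(inpt_datf, train_prop = 0.7, n_fold = 5) {
  n <- nrow(inpt_datf)
  n_train <- floor(train_prop * n)
  rtn_l <- vector("list", n_fold)
  names(rtn_l) <- paste0("sample", seq_len(n_fold))
  for (i in seq_len(n_fold)) {
    # each fold is an independent random split of the rows
    train_ids <- sample(seq_len(n), size = n_train)
    test_ids <- setdiff(seq_len(n), train_ids)
    rtn_l[[i]] <- new("fold_sample",
                      train = inpt_datf[train_ids, , drop = FALSE],
                      test  = inpt_datf[test_ids, , drop = FALSE],
                      train_ids = train_ids,
                      test_ids  = test_ids)
  }
  rtn_l
}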
With knn_Rmach_cross_validation_k, I can test multiple values of k (neighbors) across multiple folds, returning the mean accuracy for each candidate value. Example:
iris[, 5] <- as.character(iris[, 5])
print(knn_Rmach_cross_validation_k(
inpt_datf = iris,
col_vars = c(1:4),
n_fold = 5,
knn_v = c(3, 5, 7, 9, 11),
class_col = 5,
train_prop = 0.7
))
# [1] 0.9333333 0.9200000 0.9333333 0.9466667 0.9288889
# Optimal k = 9
Here, k = 9 gave the best cross-validated accuracy (~94.7%).
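Conceptually, the k-tuning loop reduces to: for each candidate k, average the test accuracy over all folds. Below is a rough sketch of that loop, reusing the hypothetical make_folds_sketch() from above and class::knn as a stand-in classifier; it is not the actual knn_Rmach_cross_validation_k code.

# Sketch of tuning k by mean accuracy over folds (illustrative only).
library(class)  # class::knn stands in for knn_Rmach here

tune_k_sketch <- function(inpt_datf, col_vars, class_col,
                          knn_v, n_fold = 5, train_prop = 0.7) {
  folds <- make_folds_sketch(inpt_datf, train_prop = train_prop, n_fold = n_fold)
  sapply(knn_v, function(k) {
    # mean test accuracy of this candidate k over all folds
    mean(sapply(folds, function(f) {
      pred <- knn(train = f@train[, col_vars],
                  test  = f@test[, col_vars],
                  cl    = f@train[[class_col]],
                  k     = k)
      mean(pred == f@test[[class_col]])
    }))
  })
}

knn_v <- c(3, 5, 7, 9, 11)
accs <- tune_k_sketch(iris, col_vars = 1:4, class_col = 5, knn_v = knn_v)
knn_v[which.max(accs)]  # candidate k with the highest mean accuracy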
I also implemented knn_Rmach_cross_validation_train, which instead of tuning k tunes the training proportion. This answers: how much data should I use for training vs testing to maximize accuracy?
iris[, 5] <- as.character(iris[, 5])
print(knn_Rmach_cross_validation_train(
inpt_datf = iris,
col_vars = c(1:4),
n_fold = 15,
k = 7,
class_col = 5,
train_prop_v = c(0.7, 0.75, 0.8)
))
# [1] 0.4057143 0.3273810 0.2400000
# Optimal training proportion = 0.7
In this run, using 70% of the data for training gave the best results.
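The same pattern covers the training proportion: keep k fixed and sweep train_prop instead. Again only a sketch on top of the hypothetical helpers above, not the actual knn_Rmach_cross_validation_train code.

# Sketch of tuning the training proportion with k held fixed.
tune_train_prop_sketch <- function(inpt_datf, col_vars, class_col,
                                   k, train_prop_v, n_fold = 15) {
  sapply(train_prop_v, function(p) {
    # regenerate the folds for each candidate proportion
    folds <- make_folds_sketch(inpt_datf, train_prop = p, n_fold = n_fold)
    mean(sapply(folds, function(f) {
      pred <- class::knn(train = f@train[, col_vars],
                         test  = f@test[, col_vars],
                         cl    = f@train[[class_col]],
                         k     = k)
      mean(pred == f@test[[class_col]])
    }))
  })
}

tune_train_prop_sketch(iris, col_vars = 1:4, class_col = 5,
                       k = 7, train_prop_v = c(0.7, 0.75, 0.8))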
Increasing n_fold improves the accuracy estimates, but at the cost of execution time.

By writing my own fold generator and cross-validation functions, I rebuilt the skeleton of what machine learning frameworks provide. It’s slower than optimized libraries, but it gave me a much deeper understanding of how cross-validation really works under the hood — and it integrates seamlessly with my home-grown knn_Rmach implementation.
repo: https://github.com/julienlargetpiet/Rmach
or here as a zip: Rmach