Quick start guide

Let's build a classifier for the classic iris dataset. If you don't have RDatasets installed, add it with Pkg.add("RDatasets").

using RDatasets: dataset

iris = dataset("datasets", "iris")

# ScikitLearn.jl expects arrays, but DataFrames can also be used - see
# the corresponding section of the manual
X = Matrix(iris[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
y = Vector(iris[:, :Species])

Next, we load the LogisticRegression model from Python's scikit-learn library.

using ScikitLearn

# This model requires scikit-learn. See
# http://scikitlearnjl.readthedocs.io/en/latest/models/#installation
@sk_import linear_model: LogisticRegression

Every model's constructor accepts hyperparameters (such as regularization strength, whether to fit the intercept, the penalty type, etc.) as keyword arguments. Check out ?LogisticRegression for details.

model = LogisticRegression(fit_intercept=true)
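As a quick sanity check, a model's hyperparameters can be inspected and updated after construction with get_params and set_params! from the ScikitLearn.jl API. A minimal sketch (the exact set of reported parameters depends on the scikit-learn version):

```julia
using ScikitLearn: get_params, set_params!

# Inspect the hyperparameters the model was constructed with
println(get_params(model))

# Hyperparameters can also be changed after construction
set_params!(model; fit_intercept=false)
```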

Then we train the model and evaluate its accuracy on the training set:

fit!(model, X, y)

accuracy = sum(predict(model, X) .== y) / length(y)
println("accuracy: $accuracy")

> accuracy: 0.96
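Besides hard labels, logistic regression can also return per-class probabilities through predict_proba. A sketch, assuming the fitted model from above; the column order follows the model's classes_ attribute:

```julia
using ScikitLearn: predict_proba

# Probability of each species for the first three samples
probs = predict_proba(model, X[1:3, :])
println(size(probs))   # one row per sample, one column per class
```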

Cross-validation

This will train five models on five train/test splits of X and y, and return the test-set accuracy of each:

using ScikitLearn.CrossValidation: cross_val_score

cross_val_score(LogisticRegression(), X, y; cv=5)  # 5-fold

> 5-element Array{Float64,1}:
>  1.0     
>  0.966667
>  0.933333
>  0.9     
>  1.0     
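To summarize the five fold scores into a single number, take their mean (and optionally the standard deviation). Using the scores printed above:

```julia
using Statistics: mean, std

scores = [1.0, 0.966667, 0.933333, 0.9, 1.0]
println("mean accuracy: $(round(mean(scores), digits=3)) ± $(round(std(scores), digits=3))")
```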

See this tutorial for more information.

Hyperparameter tuning

LogisticRegression has a regularization-strength parameter C (smaller means stronger regularization). We can use grid search to find an optimal C.

GridSearchCV will try all values of C in 0.1:0.1:2.0 and will return the one with the highest cross-validation performance.

using ScikitLearn.GridSearch: GridSearchCV

gridsearch = GridSearchCV(LogisticRegression(), Dict(:C => 0.1:0.1:2.0))
fit!(gridsearch, X, y)
println("Best parameters: $(gridsearch.best_params_)")

> Best parameters: Dict{Symbol,Any}(:C=>1.1)
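Mirroring scikit-learn, the fitted GridSearchCV object also keeps the best cross-validation score and a model refit with the best parameters. A sketch, assuming the best_score_ and best_estimator_ fields of ScikitLearn.GridSearch:

```julia
println("Best CV accuracy: $(gridsearch.best_score_)")

# The refit best model can be used directly for prediction
best_model = gridsearch.best_estimator_
y_pred = predict(best_model, X)
```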

Finally, we plot cross-validation accuracy vs. C:

using PyPlot
using Statistics: mean   # on Julia 1.x, `mean` lives in the Statistics stdlib

plot([cv_res.parameters[:C] for cv_res in gridsearch.grid_scores_],
     [mean(cv_res.cv_validation_scores) for cv_res in gridsearch.grid_scores_])

Saving the model to disk

Both Python and Julia models can be saved to disk:

import JLD, PyCallJLD

JLD.save("my_model.jld", "model", model)
model = JLD.load("my_model.jld", "model")    # Load it back
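A quick way to verify the round-trip is to re-score the reloaded model on the training data and check that accuracy is unchanged. A sketch, assuming model, X, and y from above:

```julia
# Sanity check: the reloaded model should score the same as before saving
accuracy = sum(predict(model, X) .== y) / length(y)
println("accuracy after reload: $accuracy")
```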