Introduction
{mlr3spatial} adds [mlr3::DataBackend]s for spatial classes ([terra::SpatRaster], [raster::brick], [stars::stars]). The package is capable of making predictions to objects of these classes via predict_raster()
. The return is a spatial object in the class specified in argument format
.
Essentially, {mlr3spatial} takes of the burden of converting spatial objects into a plain data.table
and then coercing the predicted values back into the spatial object while making sure to not loose the spatial reference.
There are some more goodies in the bag though:
- Thanks to mlr3’s ability to predict in parallel with any learner, {mlr3spatial} predictions can also make use of future-based parallelization and speed up the predictions of spatial objects. Often enough, spatial predictions are quite large (in the millions of values) and efficient parallelization can save some time here. See the vignette on “Benchmarking parallel predictions” for details.
-
predict_raster()
can be executed in chunks (argumentchunksize
), making it possible to execute predictions on large raster files on consumer grade machines. This chunked execution comes with some overhead compared to execution in one block since the prediction heuristic is executed multiple times and the results need to be merged in the end.
In the following, we showcase a step-by-step example how to handle a multi-layer raster object from package {stars}.
Use Case - Landsat7 data as {stars} object
Data Preparation
First, the TIFF files is read via stars::read_stars()
and put into a DataBackendRaster
. The DataBackend is then used to create a regression task with the response being band1
.
tif = system.file("tif/L7_ETMs.tif", package = "stars")
stack = stars::read_stars(tif)
backend = as_data_backend(stack)
task = as_task_regr(backend, target = "band1")
print(task)
#> <TaskRegr:backend> (122848 x 6)
#> * Target: band1
#> * Properties: -
#> * Features (5):
#> - dbl (5): band2, band3, band4, band5, band6
For large raster files with millions of values it helps to predict in parallel. To enable this, set learner$parallel_predict = TRUE
and initiate a parallel plan via {future}. Since this is only an example, parallelization is not enabled here. Here we will use a simple regression tree as an example learner. In practice you might want to use a different learner - you can find an overview of available learners here.
learner = lrn("regr.rpart")
set.seed(42)
row_ids = sample(1:task$nrow, 500)
learner$train(task, row_ids = row_ids)
print(learner)
#> <LearnerRegrRpart:regr.rpart>: Regression Tree
#> * Model: rpart
#> * Parameters: xval=0
#> * Packages: mlr3, rpart
#> * Predict Type: response
#> * Feature types: logical, integer, numeric, factor, ordered
#> * Properties: importance, missings, selected_features, weights
Prediction
For prediction predict_spatial()
is used. It will return a raster file which contains the predictions. Users can select which R spatial format the returned raster should have.
In the following, we will compare the way to conduct the prediction using {mlr3spatial} with the “native” way of fitting an e1071::svm()
model and predicting with terra::predict()
.
mlr3spatial
ras = predict_spatial(task, learner, format = "stars")
names(ras) = "cadmium"
print(ras)
#> stars object with 2 dimensions and 1 attribute
#> attribute(s):
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> cadmium 62.3629 70.30233 77.01695 79.05135 89.2809 118.1429
#> dimension(s):
#> from to offset delta refsys point values x/y
#> x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [x]
#> y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [y]
stars
Since the layers are merged in a {stars} object, one first need to split them up and convert them into a regular data.table. Next, the column names need to be adjusted to match the ones of the training data. Afterwards, the data.frame
generic of predict()
can be called. Finally, the predictions need to be injected into a stars object again.
(All of these steps are happening internally in {mlr3spatial}).
rpart_learner = rpart::rpart(band1 ~ ., data = task$data(rows = row_ids))
stars_stack = as.data.table(split(stack, "band"))
stars_stack[, c("x", "y", "X1")] = NULL
colnames(stars_stack) = task$feature_names
stars_pred = predict(rpart_learner, stars_stack)
# subset stars object to one band only
stars_pred_ras = stack[, , , 1]
# rename the layer name
names(stars_pred_ras) = "pred"
# assign predictions
stars_pred_ras$pred = stars_pred
print(stars_pred_ras)
#> stars object with 3 dimensions and 1 attribute
#> attribute(s):
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> pred 62.3629 70.30233 77.01695 79.05135 89.2809 118.1429
#> dimension(s):
#> from to offset delta refsys point values x/y
#> x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [x]
#> y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [y]
#> band 1 1 NA NA NA NA NULL
Output consistency
Now that we have executed two predictions, we would like to verify that these are actually identical.
all.equal(as.numeric(stars_pred_ras$pred), as.numeric(ras$cadmium))
#> [1] TRUE