WS22/23 ULG Data Science

How to measure performance in Julia

For that, we recall the vector summation example from the introduction to function and include the simple @time macro.

function mysum(V)
    s = zero(eltype(V))

    for i in eachindex(V)
        @inbounds s += V[i]
    end

    return s
end

V = rand(100_000)
@time mysum(V)
@time mysum(V)

  0.009129 seconds (2.31 k allocations: 158.897 KiB, 98.24% compilation time)
  0.000138 seconds (1 allocation: 16 bytes)
49942.94589387533

In order to optimize the loop call we use the @inbounds macro to eliminate inbound checks - does the index exist - for the array access. Note that this macro is potentially very dangerous since the program will silently fail if we try to index into a smaller array.

The downside with the @time macro is, that it really just measures the execution time of what is given to it. This means, if the function is not already compiled this might include compiling or if the CPU is busy with something else it is often not accurate.

Therefore, if we are serious about measuring performance we should stick to the BenchmarkTools. It comes with a couple of macros that we should test out:

In order to use the BenchmarkTools we need to include it with using BenchmarkTools, as any other package. Benchmark our mysum function with the following macros:

@benchmark
@btime
Look at the detailed output of your benchmark with dump(t), where t is the output result of a @benchmark run.

and compare the output and results.

Solution

To measure the performance of the above code we do the following:

using BenchmarkTools

@benchmark mysum($V)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  103.800 μs …  8.502 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     115.801 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   122.362 μs ± 93.224 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▄ ▅  █▁▂▁▁▂▂▂▃▃▂                                           ▁
  █▃█▅█▅▇███████████▇▇▇▇▇█▇███▆▆▅▆▅▇▆▆▆▆▅▆▅▆▄▆▄▅▅▅▄▃▅▃▄▃▄▅▄▄▄▄ █
  104 μs        Histogram: log(frequency) by time       202 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

the full details with

t = @benchmark mysum($V)
dump(t)

BenchmarkTools.Trial
  params: BenchmarkTools.Parameters
    seconds: Float64 5.0
    samples: Int64 10000
    evals: Int64 1
    overhead: Float64 0.0
    gctrial: Bool true
    gcsample: Bool false
    time_tolerance: Float64 0.05
    memory_tolerance: Float64 0.01
  times: Array{Float64}((10000,)) [163701.0, 129801.0, 173201.0, 115801.0, 148401.0, 131001.0, 123201.0, 107501.0, 126801.0, 103801.0  …  107501.0, 107500.0, 107501.0, 142302.0, 115801.0, 115800.0, 115801.0, 115701.0, 146702.0, 115700.0]
  gctimes: Array{Float64}((10000,)) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  memory: Int64 0
  allocs: Int64 0

and the often used sanity check, that actually also shows you the output of your code.

@btime mysum($V)

  100.300 μs (0 allocations: 0 bytes)
49942.94589387533

Note that for benchmarking with BenchmarkTools we often use the $ literal for variables to tell the Julia interpreter to use interpolation. This will make sure that the variable is not allocated inside the function and the measurement is more accurate, or more likely what we actually want to know.

We can also use the Profiler package to really dig into profiling the code but this is a bit too much of a deep dive for this class.

One should also take a look at the performance tips section of the manual. Especially the parts about avoiding untyped global variables and putting code inside functions are sometimes very performance critical.

Performance Optimization - Case Study

We give an example of how one can optimize the code of a simple program. The goal is to calculate pairwise distances of 3D points and save the results inside a matrix. That is given $x\in\mathbb{R}^{3\times N}, y\in\mathbb{R}^{3\times M}$ we want to calculate $z_{i,j} = \Vert x_i - y_j \Vert_2^2$ . for $i=1,\ldots,N, \; j=1,\ldots,M$ .

First we have a simple one-liner

calc1(x, y) = [norm(x[:,i] - y[:,j])^2 for i in 1:size(x, 2), j in 1:size(y, 2)]

A first improvement can be achieved by preallocation the result and using views since accessing slices in julia always performs a copy and therefore allocates.

function calc2!(z::Matrix{T}, x::Matrix{T}, y::Matrix{T}) where {T<:AbstractFloat}
    # check dimensions since we later use `@inbounds`
    @assert size(x, 1) == size(y, 1)
    @assert size(z) == (size(x, 2), size(y, 2))

    for j in axes(z, 2), i in axes(z, 1)
        # allocates 2 times
        # 1 from - and 1 from .^
        @inbounds @views z[i,j] = sum((x[:, i] - y[:, j]).^2)
    end
    return z
end

function calc3!(z::Matrix{T}, x::Matrix{T}, y::Matrix{T}) where {T<:AbstractFloat}
    @assert size(x, 1) == size(y, 1)
    @assert size(z) == (size(x, 2), size(y, 2))

    for j in axes(z, 2), i in axes(z, 1)
        for k in 1:size(x, 1)
            @inbounds z[i,j] += (x[k, i] - y[k, j])^2
        end
    end
    return z
end

using Base.Threads

function calc4!(z::Matrix{T}, x::Matrix{T}, y::Matrix{T}) where {T<:AbstractFloat}
    @assert size(x, 1) == size(y, 1)
    @assert size(z) == (size(x, 2), size(y, 2))

    @threads for j in axes(z, 2)
        for i in axes(z, 1)
            for k in 1:size(x, 1)
                @inbounds z[i,j] += (x[k, i] - y[k, j])^2
            end
        end
    end
    return z
end

using BenchmarkTools, LinearAlgebra, Test

@testset begin
    x, y = rand(3, 10), rand(3, 5)
    ref = calc1(x, y)
    for f in [calc2!, calc3!, calc4!]
        out = zeros(size(ref))
        f(out, x, y)
        @test ref ≈ out
    end
end

# Time code
x, y = rand(3, 100), rand(3, 200)
z = zeros(size(x, 2), size(y, 2))

println("\nUsing $(nthreads()) threads\n")

@btime calc1(x, y)
@btime calc2!(z, x, y)
@btime calc3!(z, x, y)
@btime calc4!(z, x, y);

Test Summary: | Pass  Total  Time
test set      |    3      3  1.0s

Using 1 threads

  3.294 ms (60002 allocations: 4.73 MiB)
  1.823 ms (40000 allocations: 3.05 MiB)
  64.200 μs (0 allocations: 0 bytes)
  73.500 μs (7 allocations: 656 bytes)

CC BY-SA 4.0 - Stephan Antholzer, Gregor Ehrensperger, Johannes Sappl. Last modified: August 31, 2023. Website built with Franklin.jl and the Julia programming language.