Load and save DataFrames#

We do not cover all features of these packages; please refer to their documentation to learn more.

Here we’ll load CSV.jl to read and write CSV files, Arrow.jl and JLD2.jl to work with binary formats, and JSONTables.jl for JSON interaction.

using DataFrames
using Arrow
using CSV
using JSONTables
using CodecZlib
using ZipFile
using JLD2
using StatsPlots ## for charts
using Mmap ## for memory mapping

Let’s create a simple DataFrame for testing purposes,

x = DataFrame(
    A=[true, false, true], B=[1, 2, missing],
    C=[missing, "b", "c"], D=['a', missing, 'c']
)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c

and use a broadcasted eltype call to look at the column-wise element types.

eltype.(eachcol(x))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

CSV.jl#

Let’s use CSV.jl to save x to disk. We write it to a temporary directory so that x1.csv does not conflict with any file in your working directory.

tmpdir = mktempdir()
"/tmp/jl_QvUHvL"
location = joinpath(tmpdir, "x1.csv")
CSV.write(location, x)
"/tmp/jl_QvUHvL/x1.csv"

Now we can see how it was saved by reading x1.csv back as text.

print(read(location, String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c

We can also load it back as a data frame.

y = CSV.read(location, DataFrame)
3×4 DataFrame
 Row │ A      B        C         D
     │ Bool   Int64?   String1?  String1?
─────┼────────────────────────────────────
   1 │  true        1  missing   a
   2 │ false        2  b         missing
   3 │  true  missing  c         c

Note that when loading a CSV into a DataFrame, the element types of columns :C and :D have changed to special string types defined in the InlineStrings.jl package.

eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, InlineStrings.String1}
 Union{Missing, InlineStrings.String1}
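
If you prefer plain Base String columns, CSV.jl accepts a stringtype keyword. A minimal sketch (the file name is hypothetical):

```julia
using CSV, DataFrames

# A sketch: write a tiny CSV, then read it back with `stringtype=String`,
# which asks CSV.jl to materialize string columns as plain `String`
# instead of the inline string types used by default for short strings.
path = joinpath(mktempdir(), "inline_demo.csv")
write(path, "C,D\n,a\nb,\nc,c\n")
y_plain = CSV.read(path, DataFrame; stringtype=String)
eltype.(eachcol(y_plain))  # both columns are Union{Missing, String}
```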

JSONTables.jl#

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.

location1 = joinpath(tmpdir, "x1.json")
open(io -> arraytable(io, x), location1, "w")
106
location2 = joinpath(tmpdir, "x2.json")
open(io -> objecttable(io, x), location2, "w")
76

Read them back.

print(read(location1, String))
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
print(read(location2, String))
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
y1 = open(jsontable, location1) |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  String?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y1))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}
y2 = open(jsontable, location2) |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  String?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y2))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}

JLD2.jl#

JLD2.jl is a high-performance, pure-Julia library for saving and loading arbitrary Julia data structures, using an HDF5-compatible format.

Documentation: https://juliaio.github.io/JLD2.jl/dev/basic_usage/

  • save() and load(): general save and load using the FileIO.jl interface

  • jldsave() and jldopen(): advanced save and load with more options

  • save_object() and load_object(): single-object save and load
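
For example, the jldsave/load pair stores several named objects in one file. A minimal sketch (the file and variable names are hypothetical):

```julia
using DataFrames, JLD2

# jldsave stores each keyword argument under its own name in the file;
# load with a key retrieves a single object, and without a key it
# returns a Dict of everything in the file.
path = joinpath(mktempdir(), "several.jld2")
df = DataFrame(a=1:3, b=["x", "y", "z"])
jldsave(path; df, note="saved with jldsave")
df_back = load(path, "df")
note_back = load(path, "note")
```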

using JLD2
location = joinpath(tmpdir, "x.jld2")

save_object(location, x)

Read it back.

load_object(location)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c

Arrow.jl#

Finally, we use the Apache Arrow format, which, in particular, allows for data interchange with R and Python.

location = joinpath(tmpdir, "x.arrow")
Arrow.write(location, x)
"/tmp/jl_QvUHvL/x.arrow"
y = Arrow.Table(location) |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

Note that the columns of y are immutable:

try
    y.A[1] = false
catch e
    show(e)
end
ReadOnlyMemoryError()

This is because Arrow.Table uses memory mapping and thus uses custom vector types:

y.A
3-element Arrow.BoolVector{Bool}:
 1
 0
 1
y.B
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
 1
 2
  missing

You can get standard Julia Base vectors by copying the data frame.

y2 = copy(y)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
y2.A
3-element Vector{Bool}:
 1
 0
 1
y2.B
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
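
Arrow.write can also compress the buffers it writes. A sketch using the compress keyword (supported codecs include :lz4 and :zstd; the file name is hypothetical):

```julia
using Arrow, DataFrames

# Write an Arrow file with Zstd-compressed buffers and read it back;
# decompression is transparent on the reading side.
df = DataFrame(a=1:1_000, b=repeat(["x", "y"], 500))
path = joinpath(mktempdir(), "compressed.arrow")
Arrow.write(path, df; compress=:zstd)
df_back = Arrow.Table(path) |> DataFrame
```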

Basic benchmarking#

Next, we’ll create some files in the temp directory.

In particular, we’ll time how long it takes us to write a DataFrame with 10,000 rows and 1,000 columns.

bigdf = DataFrame(rand(Bool, 10^4, 1000), :auto)

bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

tmpdir = mktempdir()
"/tmp/jl_OGUw62"
println("First run")
First run
println("CSV.jl")
fname = joinpath(tmpdir, "bigdf1.csv.gz")
csvwrite1 = @elapsed @time CSV.write(fname, bigdf; compress=true)

println("Arrow.jl")
fname = joinpath(tmpdir, "bigdf.arrow")
arrowwrite1 = @elapsed @time Arrow.write(fname, bigdf)

println("JSONTables.jl arraytable")
fname = joinpath(tmpdir, "bigdf1.json")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), fname, "w")

println("JSONTables.jl objecttable")
fname = joinpath(tmpdir, "bigdf2.json")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), fname, "w")

println("JLD2.jl")
fname = joinpath(tmpdir, "bigdf.jld2")
jld2write1 = @elapsed @time save_object(fname, bigdf; compress = ZstdFilter())

println("Second run")

println("CSV.jl")
fname = joinpath(tmpdir, "bigdf1.csv.gz")
csvwrite2 = @elapsed @time CSV.write(fname, bigdf; compress=true)

println("Arrow.jl")
fname = joinpath(tmpdir, "bigdf.arrow")
arrowwrite2 = @elapsed @time Arrow.write(fname, bigdf)

println("JSONTables.jl arraytable")
fname = joinpath(tmpdir, "bigdf1.json")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), fname, "w")

println("JSONTables.jl objecttable")
fname = joinpath(tmpdir, "bigdf2.json")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), fname, "w")

println("JLD2.jl")
fname = joinpath(tmpdir, "bigdf.jld2")
jld2write2 = @elapsed @time save_object(fname, bigdf; compress = ZstdFilter());
CSV.jl
 10.194651 seconds (45.04 M allocations: 1.590 GiB, 1.62% gc time, 75.65% compilation time: <1% of which was recompilation)
Arrow.jl
  6.296806 seconds (6.64 M allocations: 325.542 MiB, 0.31% gc time, 98.50% compilation time)
JSONTables.jl arraytable
 11.581848 seconds (229.63 M allocations: 5.497 GiB, 17.02% gc time, 0.14% compilation time: <1% of which was recompilation)
JSONTables.jl objecttable
  0.522971 seconds (106.17 k allocations: 309.447 MiB, 48.09% gc time, 16.29% compilation time)
JLD2.jl
  0.986526 seconds (1.62 M allocations: 87.149 MiB, 90.91% compilation time: 11% of which was recompilation)
Second run
CSV.jl
  2.644907 seconds (44.41 M allocations: 1.560 GiB, 6.36% gc time)
Arrow.jl
  0.093175 seconds (80.86 k allocations: 5.164 MiB)
JSONTables.jl arraytable
 11.233587 seconds (229.63 M allocations: 5.497 GiB, 15.04% gc time, 0.10% compilation time)
JSONTables.jl objecttable
  0.528392 seconds (20.83 k allocations: 305.238 MiB, 59.36% gc time, 1.93% compilation time)
JLD2.jl
  0.083237 seconds (134.50 k allocations: 13.779 MiB)
groupedbar(
    repeat(["CSV.jl (gz)", "Arrow.jl", "JSONTables.jl\nobjecttable", "JLD2.jl"],
        inner=2),
    [csvwrite1, csvwrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2, jld2write1, jld2write2],
    group=repeat(["1st", "2nd"], outer=4),
    ylab="Seconds",
    title="Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))",
    permute = (:x, :y)
)
_images/fe74bb4036defeae0e52a6d0d1a772fd02ee997aee861924ce53c6ac9e5bdd89.png
data_files = ["bigdf1.csv.gz", "bigdf.arrow", "bigdf1.json", "bigdf2.json", "bigdf.jld2"] .|> (f -> joinpath(tmpdir, f))
df = DataFrame(file=["CSV.jl (gz)", "Arrow.jl", "arraytable", "objecttable", "JLD2.jl"], size=getfield.(stat.(data_files), :size))
sort!(df, :size)
5×2 DataFrame
 Row │ file         size
     │ String       Int64
─────┼────────────────────────
   1 │ Arrow.jl       1742866
   2 │ CSV.jl (gz)    2470461
   3 │ JLD2.jl        2740095
   4 │ objecttable   55090189
   5 │ arraytable   124031296
@df df plot(:file, :size / 1024^2, seriestype=:bar, title="Format File Size (MB)", label="Size", ylab="MB")
_images/8aee418ff366cf780eb595efc0fef4ce262b05305b00d0de9e0b65cd6dc9fa20.png
println("First run")

println("CSV.jl")
fname = joinpath(tmpdir, "bigdf1.csv.gz")
csvread1 = @elapsed @time CSV.read(fname, DataFrame)

println("Arrow.jl")
fname = joinpath(tmpdir, "bigdf.arrow")
arrowread1 = @elapsed @time df_tmp = Arrow.Table(fname) |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)

println("JSONTables.jl arraytable")
fname = joinpath(tmpdir, "bigdf1.json")
jsontablesaread1 = @elapsed @time open(jsontable, fname)

println("JSONTables.jl objecttable")
fname = joinpath(tmpdir, "bigdf2.json")
jsontablesoread1 = @elapsed @time open(jsontable, fname)

println("JLD2.jl")
fname = joinpath(tmpdir, "bigdf.jld2")
jld2read1 = @elapsed @time load_object(fname)

println("Second run")
fname = joinpath(tmpdir, "bigdf1.csv.gz")
csvread2 = @elapsed @time CSV.read(fname, DataFrame)

println("Arrow.jl")
fname = joinpath(tmpdir, "bigdf.arrow")
arrowread2 = @elapsed @time df_tmp = Arrow.Table(fname) |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)

println("JSONTables.jl arraytable")
fname = joinpath(tmpdir, "bigdf1.json")
jsontablesaread2 = @elapsed @time open(jsontable, fname)

println("JSONTables.jl objecttable")
fname = joinpath(tmpdir, "bigdf2.json")
jsontablesoread2 = @elapsed @time open(jsontable, fname)

println("JLD2.jl")
fname = joinpath(tmpdir, "bigdf.jld2")
jld2read2 = @elapsed @time load_object(fname);
First run
CSV.jl
  3.073050 seconds (4.40 M allocations: 223.480 MiB, 1.63% gc time, 2 lock conflicts, 98.58% compilation time)
Arrow.jl
  0.548137 seconds (573.49 k allocations: 26.969 MiB, 98.58% compilation time)
  0.051251 seconds (14.02 k allocations: 10.298 MiB)
JSONTables.jl arraytable
  6.864094 seconds (271.15 k allocations: 1.772 GiB, 9.96% gc time, 0.10% compilation time)
JSONTables.jl objecttable
  0.378929 seconds (7.39 k allocations: 566.955 MiB, 3.83% gc time, 0.03% compilation time)
JLD2.jl
  0.337572 seconds (427.99 k allocations: 39.102 MiB, 88.52% compilation time: 7% of which was recompilation)
Second run
  1.028571 seconds (637.06 k allocations: 43.698 MiB)
Arrow.jl
  0.007000 seconds (84.09 k allocations: 3.594 MiB)
  0.052443 seconds (14.02 k allocations: 10.298 MiB)
JSONTables.jl arraytable
  6.331693 seconds (271.10 k allocations: 1.772 GiB, 10.21% gc time)
JSONTables.jl objecttable
  0.362434 seconds (7.08 k allocations: 566.943 MiB, 1.89% gc time)
JLD2.jl
  0.040156 seconds (121.76 k allocations: 24.189 MiB, 8.58% gc time)

We exclude the JSONTables.jl arraytable timings from the chart because they are much longer than the rest.

groupedbar(
    repeat(["CSV.jl (gz)", "Arrow.jl", "Arrow.jl\ncopy", ##"JSON\narraytable",
            "JSON\nobjecttable", "JLD2.jl"], inner=2),
    [csvread1, csvread2, arrowread1, arrowread2, arrowread1 + arrowread1copy, arrowread2 + arrowread2copy,
        # jsontablesaread1, jsontablesaread2,
        jsontablesoread1, jsontablesoread2, jld2read1, jld2read2],
    group=repeat(["1st", "2nd"], outer=5),
    ylab="Seconds",
    title="Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))",
    permute = (:x, :y)
)
_images/cb2ad968192e6c2878e6e43fe45772dd5ebc720873a6edfc16a175e380496e33.png

Using gzip compression#

A common user requirement is to be able to load and save CSV files that are compressed using gzip. Below we show how this can be accomplished with CSV.jl, which relies on CodecZlib.jl for the compression.

Again, make sure that you do not have a file named df_compress_test.csv.gz in your working directory. We first generate a random data frame.

df = DataFrame(rand(1:10, 10, 1000), :auto)
10×1000 DataFrame
(wide output omitted; columns x1–x1000 contain random integers in 1:10)

Use the compress=true option to compress the CSV using the gzip format.

tmpdir = mktempdir()
fname = joinpath(tmpdir, "df_compress_test.csv.gz")
CSV.write(fname, df; compress=true)
"/tmp/jl_Exm8tJ/df_compress_test.csv.gz"

Read the CSV file back.

df2 = CSV.File(fname) |> DataFrame
10×1000 DataFrame
(wide output omitted; identical to the data frame written above)
df == df2
true
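
CodecZlib.jl can also be used directly by wrapping the file handle in a compressor or decompressor stream. A sketch (the file name is hypothetical):

```julia
using CSV, CodecZlib, DataFrames

# An alternative to CSV.jl's `compress=true`: wrap the file handle in a
# GzipCompressorStream when writing and a GzipDecompressorStream when reading.
df = DataFrame(a=1:3, b=["x", "y", "z"])
path = joinpath(mktempdir(), "manual.csv.gz")

io = GzipCompressorStream(open(path, "w"))  # compress on the fly
CSV.write(io, df)
close(io)                                   # flushes and closes the file

io = GzipDecompressorStream(open(path))     # decompress on the fly
df_back = CSV.read(io, DataFrame)
close(io)
```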

Working with zip files#

Sometimes you may have files compressed inside a ZIP file. In such a situation you may use ZipFile.jl in conjunction with an appropriate reader to read the files. Here we first create a ZIP file and then read its contents back into a DataFrame.

df1 = DataFrame(rand(1:10, 3, 4), :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     4     10      5      8
   2 │     5      9      9      4
   3 │     8      8     10      8
df2 = DataFrame(rand(1:10, 3, 4), :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     3      6      1     10
   2 │    10      5      6      6
   3 │     6      8      5      6

We also show yet another way to write a DataFrame to CSV: writing the CSV data directly into a ZIP file.

w = ZipFile.Writer(joinpath(tmpdir, "x.zip"))

f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))
46

Write a second CSV file into the ZIP file, this time using Deflate compression.

f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))
46
close(w)

Now we read back the CSV files we have written:

r = ZipFile.Reader(joinpath(tmpdir, "x.zip"))
# find the index of the file called x1.csv
index_xcsv = findfirst(x -> x.name == "x1.csv", r.files)
# read the x1.csv file from the zip file
df1_2 = CSV.read(read(r.files[index_xcsv]), DataFrame)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     4     10      5      8
   2 │     5      9      9      4
   3 │     8      8     10      8
df1_2 == df1
true
# find the index of the file called x2.csv
index_xcsv = findfirst(x -> x.name == "x2.csv", r.files)
# read the x2.csv file from the zip file
df2_2 = CSV.read(read(r.files[index_xcsv]), DataFrame)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     3      6      1     10
   2 │    10      5      6      6
   3 │     6      8      5      6
df2_2 == df2
true

Note that once you read a given file from the r object, its stream is used up (it has reached its end). Therefore, to read it again you need to close r and open the archive again. Also, do not forget to close the ZIP file once you are done.

close(r)
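
The point above can be sketched as follows (the archive and entry names are hypothetical):

```julia
using ZipFile

# Reading an entry a second time yields nothing until the reader is reopened.
path = joinpath(mktempdir(), "demo.zip")
w = ZipFile.Writer(path)
write(ZipFile.addfile(w, "a.txt"), "hello")
close(w)

r = ZipFile.Reader(path)
first_read = read(r.files[1], String)   # "hello"
second_read = read(r.files[1], String)  # "" — the stream is exhausted
close(r)

r = ZipFile.Reader(path)                # reopen to read the entry again
again = read(r.files[1], String)
close(r)
```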

This notebook was generated using Literate.jl.