Load and save DataFrames

We do not cover all features of the packages. Please refer to their documentation to learn them.

Here we load CSV.jl to read and write CSV files; Arrow.jl, JLSO.jl, and the Serialization standard library to work with binary formats; and JSONTables.jl for JSON interaction. Finally, we consider the custom JDF.jl format.

using DataFrames
using Arrow
using CSV
using Serialization
using JLSO
using JSONTables
using CodecZlib
using ZipFile
using JDF
using StatsPlots ## for charts
using Mmap ## for memory-mapping the compressed file

Let’s create a simple DataFrame for testing purposes:

x = DataFrame(
    A=[true, false, true], B=[1, 2, missing],
    C=[missing, "b", "c"], D=['a', missing, 'c']
)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c

and use eltype to look at the element type of each column.

eltype.(eachcol(x))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

CSV.jl

Let’s use CSV to save x to disk; make sure x1.csv does not conflict with some file in your working directory.

CSV.write("x1.csv", x)
"x1.csv"

Now we can see how it was saved by reading x1.csv.

print(read("x1.csv", String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c

We can also load it back as a data frame:

y = CSV.read("x1.csv", DataFrame)
3×4 DataFrame
 Row │ A      B        C         D
     │ Bool   Int64?   String1?  String1?
─────┼────────────────────────────────────
   1 │  true        1  missing   a
   2 │ false        2  b         missing
   3 │  true  missing  c         c

Note that when a DataFrame is loaded from a CSV file, the element types of columns :C and :D change to the special string types defined in the InlineStrings.jl package.

eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, InlineStrings.String1}
 Union{Missing, InlineStrings.String1}
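
If plain String columns are preferred over the InlineStrings types, the sketch below shows one possible approach. It assumes a CSV.jl version that supports the stringtype keyword of CSV.read, and it writes to a temporary path (an illustrative choice, not part of the tutorial) so that no file in the working directory is touched. Note that the Char column :D is still read back as strings, not as Char.

```julia
using CSV
using DataFrames

x = DataFrame(
    A=[true, false, true], B=[1, 2, missing],
    C=[missing, "b", "c"], D=['a', missing, 'c']
)

# Temporary file used only for this illustration.
path = tempname() * ".csv"
CSV.write(path, x)

# Ask CSV.jl to materialize string columns as Base String values.
y_plain = CSV.read(path, DataFrame; stringtype=String)
string_coltypes = eltype.(eachcol(y_plain))

rm(path)
```

Here string_coltypes holds the element types of the reloaded columns, which can be compared with the output shown above.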

Serialization by JDF.jl and JLSO.jl

Now we use serialization to save x.

There are two ways to perform serialization. The first is to use Serialization.serialize from the standard library, as shown below.

Note that, in general, this process will not work if the reading and writing are done by different versions of Julia, or by an instance of Julia with a different system image.

open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable y. Again, y is identical to x. However, beware that if your session does not have DataFrames.jl loaded, it may not recognize the content as a DataFrame.

y = open(deserialize, "x.bin")
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}
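
As a shorter variant of the round trip above, serialize and deserialize also accept a file name directly (available since Julia 1.1), so the explicit open call can be dropped. The sketch below is illustrative only: it uses a smaller data frame and a temporary path, both of which are assumptions made for the example.

```julia
using DataFrames
using Serialization

x = DataFrame(A=[true, false, true], B=[1, 2, missing])

# Temporary file used only for this illustration.
path = tempname() * ".bin"

serialize(path, x)       # opens, writes, and closes the file
y = deserialize(path)    # reads the object back

# isequal is used because the data contain missing values.
roundtrip_ok = isequal(x, y)

rm(path)
```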

JDF.jl

JDF.jl is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the JDF.save function. For more details about the design assumptions and limitations of JDF.jl, please check out xiaodaigh/JDF.jl.

JDF.save("x.jdf", x);

To load the saved JDF file, use the JDF.load function:

x_loaded = JDF.load("x.jdf") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c

You can see that they are the same

isequal(x_loaded, x)
true
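
The comparison above uses isequal because x contains missing values, and == follows three-valued logic when missing is involved. A minimal Base-only sketch of the difference, using plain vectors as stand-ins for data frame columns:

```julia
a = [1, 2, missing]
b = [1, 2, missing]

# == propagates missing, so the result is neither true nor false.
eq_result = (a == b)

# isequal treats missing as equal to missing and always returns a Bool.
isequal_result = isequal(a, b)
```

With data that contain no missing values, as in the later sections, == and isequal agree.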

JDF.jl can load only selected columns from disk, which helps when working with large files. To do so, first set up a JDFFile, which is an on-disk representation of x backed by JDF.jl.

x_ondisk = jdf"x.jdf"
JDF.JDFFile{String}("x.jdf")

We can see all the column names of x without loading it into memory:

names(x_ondisk)
4-element Vector{Symbol}:
 :A
 :B
 :C
 :D

Below is an example of how to load only columns :A and :D:

xd = JDF.load(x_ondisk; cols=["A", "D"]) |> DataFrame
3×2 DataFrame
 Row │ A      D
     │ Bool   Char?
─────┼────────────────
   1 │  true  a
   2 │ false  missing
   3 │  true  c

JLSO.jl

Another way to perform serialization is by using the JLSO.jl library:

JLSO.save("x.jlso", :data => x)

Now we can load the file back into y:

y = JLSO.load("x.jlso")[:data]
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.

open(io -> arraytable(io, x), "x1.json", "w")
106
open(io -> objecttable(io, x), "x2.json", "w")
76
print(read("x1.json", String))
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
print(read("x2.json", String))
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
y1 = open(jsontable, "x1.json") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  String?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y1))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}
y2 = open(jsontable, "x2.json") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  String?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y2))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}
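
The two layouts can also be produced without touching the disk. The sketch below assumes that arraytable and objecttable return a JSON string when called without an IO argument, and that jsontable accepts such a string; it uses a reduced data frame for brevity. As in the output above, the Char column comes back as strings because JSON has no character type.

```julia
using DataFrames
using JSONTables

x = DataFrame(A=[true, false, true], B=[1, 2, missing], D=['a', missing, 'c'])

row_json = arraytable(x)    # row-oriented: an array of objects
col_json = objecttable(x)   # column-oriented: an object of arrays

# Parse both strings back into data frames.
y_rows = jsontable(row_json) |> DataFrame
y_cols = jsontable(col_json) |> DataFrame
```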

Arrow.jl

Finally, we use the Apache Arrow format, which in particular allows for data interchange with R or Python.

Arrow.write("x.arrow", x)
"x.arrow"
y = Arrow.Table("x.arrow") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

Note that the columns of y are immutable:

try
    y.A[1] = false
catch e
    show(e)
end
ReadOnlyMemoryError()

This is because Arrow.Table uses memory mapping and thus uses custom vector types:

y.A
3-element Arrow.BoolVector{Bool}:
 1
 0
 1
y.B
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
 1
 2
  missing

You can get standard Julia Base vectors by copying the data frame:

y2 = copy(y)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼──────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
y2.A
3-element Vector{Bool}:
 1
 0
 1
y2.B
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
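
The copy step can be tried without creating a file. The sketch below is an illustration under stated assumptions: it writes the Arrow data to an in-memory buffer and reads it back from the resulting bytes (rather than from a memory-mapped file as above), and it uses a reduced data frame. After copy, the columns are ordinary vectors and can be modified.

```julia
using Arrow
using DataFrames

x = DataFrame(A=[true, false, true], B=[1, 2, missing])

# Write the Arrow representation into memory instead of a file.
io = IOBuffer()
Arrow.write(io, x)
bytes = take!(io)

y = Arrow.Table(bytes) |> DataFrame   # columns are Arrow-specific vector types
y2 = copy(y)                          # columns become standard Julia vectors

y2.A[1] = false                       # mutating the copy works
```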

Basic benchmarking

Next, we’ll create some files, so be careful that you don’t already have these files in your working directory! In particular, we’ll time how long it takes us to write a DataFrame with 10,000 rows and 1,000 columns.

bigdf = DataFrame(rand(Bool, 10^4, 1000), :auto)

bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

println("First run")
First run
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite1 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite1 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite2 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
CSV.jl
  7.534559 seconds (44.69 M allocations: 1.127 GiB, 7.03% gc time, 64.55% compilation time)
Serialization
  0.275333 seconds (224.29 k allocations: 10.824 MiB, 29.30% compilation time)
JDF.jl
  0.148966 seconds (68.08 k allocations: 147.969 MiB, 19.38% gc time, 50.04% compilation time)
JLSO.jl
  1.317175 seconds (284.60 k allocations: 20.857 MiB, 0.69% gc time, 8.39% compilation time)
Arrow.jl
  7.033441 seconds (5.44 M allocations: 275.072 MiB, 0.45% gc time, 97.94% compilation time)
JSONTables.jl arraytable
 21.034314 seconds (229.62 M allocations: 5.422 GiB, 14.40% gc time, 0.08% compilation time)
JSONTables.jl objecttable
  0.776130 seconds (96.64 k allocations: 309.103 MiB, 51.28% gc time, 16.77% compilation time)
Second run
CSV.jl
  2.521896 seconds (44.40 M allocations: 1.113 GiB, 6.78% gc time)
Serialization
  0.209128 seconds (15.01 k allocations: 401.492 KiB, 6.81% compilation time)
JDF.jl
  0.108518 seconds (35.13 k allocations: 146.246 MiB, 16.84% gc time)
JLSO.jl
  1.194867 seconds (33.17 k allocations: 7.950 MiB, 0.19% gc time)
Arrow.jl
  0.145497 seconds (81.35 k allocations: 5.427 MiB)
JSONTables.jl arraytable
 16.981553 seconds (229.62 M allocations: 5.422 GiB, 13.12% gc time, 0.08% compilation time)
JSONTables.jl objecttable
  0.469217 seconds (20.71 k allocations: 305.234 MiB, 52.78% gc time, 2.16% compilation time)
0.469381036
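
The gap between the first and second runs comes largely from compilation, as the compilation-time percentages reported by @time indicate. The measure-twice pattern can be sketched with a small helper; time_twice is a hypothetical name introduced here for illustration, and the workload is a stand-in rather than one of the writers benchmarked above.

```julia
# Hypothetical helper: run a zero-argument function twice and time both calls.
function time_twice(f)
    first_run = @elapsed f()    # may include compilation of newly used methods
    second_run = @elapsed f()   # mostly measures the work itself
    return (first=first_run, second=second_run)
end

# Stand-in workload used only for the demonstration.
timings = time_twice(() -> sum(sqrt, rand(10^6)))
```

For more careful measurements a dedicated benchmarking package would be the usual choice; this sketch only mirrors the two-run approach used in this section.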
groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
        inner=2),
    [csvwrite1, csvwrite2, serializewrite1, serializewrite2, jdfwrite1, jdfwrite2,
        jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2],
    group=repeat(["1st", "2nd"], outer=6),
    ylab="Second",
    title="Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
[Figure: grouped bar chart "Write Performance" comparing first-run and second-run write times in seconds for each format]
data_files = ["bigdf1.csv", "bigdf.bin", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file=data_files, size=getfield.(stat.(data_files), :size))
append!(df, DataFrame(file="bigdf.jdf", size=reduce((x, y) -> x + y.size,
    stat.(joinpath.("bigdf.jdf", readdir("bigdf.jdf"))),
    init=0)))
sort!(df, :size)
6×2 DataFrame
 Row │ file         size
     │ String       Int64
─────┼────────────────────────
   1 │ bigdf.arrow    1742850
   2 │ bigdf.bin      5199556
   3 │ bigdf.jdf      5217601
   4 │ bigdf1.csv    55085598
   5 │ bigdf2.json   55089599
   6 │ bigdf1.json  124030706
@df df plot(:file, :size / 1024^2, seriestype=:bar, title="Format File Size (MB)", label="Size", ylab="MB")
[Figure: bar chart "Format File Size (MB)" showing the size of each generated file]
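
JDF.jl stores a data frame as a directory of files, which is why the size computation above sums over the directory contents. A small Base-only sketch of that idea follows; total_size is a hypothetical helper name, and the temporary directory with two dummy files exists only for the demonstration.

```julia
# Hypothetical helper: size in bytes of a file, or of all files under a directory.
function total_size(path)
    isfile(path) && return filesize(path)
    total = 0
    for (root, _, files) in walkdir(path)
        for f in files
            total += filesize(joinpath(root, f))
        end
    end
    return total
end

# Self-contained demonstration in a temporary directory.
demo_dir = mktempdir()
write(joinpath(demo_dir, "a.bin"), zeros(UInt8, 100))
write(joinpath(demo_dir, "b.bin"), zeros(UInt8, 250))
demo_total = total_size(demo_dir)
```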
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
  3.042355 seconds (4.30 M allocations: 227.431 MiB, 0.76% gc time, 125.87% compilation time)
Serialization
  0.415734 seconds (9.50 M allocations: 155.505 MiB, 7.90% gc time, 9.54% compilation time)
JDF.jl
  0.263473 seconds (157.93 k allocations: 157.877 MiB, 6.70% gc time, 95.78% compilation time)
JLSO.jl
  0.353615 seconds (9.52 M allocations: 158.178 MiB, 5.99% gc time, 7.55% compilation time)
Arrow.jl
  0.457610 seconds (550.37 k allocations: 26.342 MiB, 98.35% compilation time)
  0.049680 seconds (14.50 k allocations: 10.266 MiB)
JSONTables.jl arraytable
  6.316101 seconds (271.09 k allocations: 1.839 GiB, 9.64% gc time)
JSONTables.jl objecttable
  0.347034 seconds (7.43 k allocations: 403.810 MiB, 2.98% gc time, 0.02% compilation time)
Second run
  0.931827 seconds (631.61 k allocations: 43.593 MiB)
Serialization
  0.351505 seconds (9.48 M allocations: 154.596 MiB, 5.33% gc time)
JDF.jl
  0.050261 seconds (77.27 k allocations: 153.765 MiB, 13.48% gc time)
JLSO.jl
  0.295219 seconds (9.50 M allocations: 157.309 MiB, 3.79% gc time)
Arrow.jl
  0.010170 seconds (86.57 k allocations: 3.732 MiB)
  0.049740 seconds (14.50 k allocations: 10.266 MiB)
JSONTables.jl arraytable
  6.280235 seconds (271.09 k allocations: 1.839 GiB, 9.63% gc time)
JSONTables.jl objecttable
  0.350620 seconds (7.08 k allocations: 403.794 MiB, 2.62% gc time)

We exclude the JSONTables.jl arraytable results from the plot because their timings are much longer.

groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", ##"JSON\narraytable",
            "JSON\nobjecttable"], inner=2),
    [csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
        arrowread1, arrowread2, arrowread1 + arrowread1copy, arrowread2 + arrowread2copy,
        # jsontablesaread1, jsontablesaread2,
        jsontablesoread1, jsontablesoread2],
    group=repeat(["1st", "2nd"], outer=7),
    ylab="Second",
    title="Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
[Figure: grouped bar chart "Read Performance" comparing first-run and second-run read times in seconds for each format]

Using gzip compression

A common user requirement is to be able to load and save CSV files that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl. The same pattern is applicable to JSONTables.jl compression/decompression. Again, make sure that you do not have a file named df_compress_test.csv.gz in your working directory. We first generate a random data frame.

df = DataFrame(rand(1:10, 10, 1000), :auto)
10×1000 DataFrame
900 columns omitted
(The preview shows columns x1 through x100, all of type Int64, for rows 1 through 10; the random cell values are not reproduced here.)

GzipCompressorStream comes from CodecZlib.jl. To read the file back, we memory-map it with Mmap and decompress the bytes with transcode.

open("df_compress_test.csv.gz", "w") do io
    stream = GzipCompressorStream(io)
    CSV.write(stream, df)
    close(stream)
end
df2 = CSV.File(transcode(GzipDecompressor, Mmap.mmap("df_compress_test.csv.gz"))) |> DataFrame
10×1000 DataFrame
900 columns omitted
(The preview shows columns x1 through x100, all of type Int64, for rows 1 through 10; the random cell values are not reproduced here.)
df == df2
true
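
A streaming variant of the same round trip is sketched below. It assumes that the stream types exported by CodecZlib.jl can be passed to open together with a do-block (a convenience provided through TranscodingStreams.jl), which closes the stream automatically. The smaller data frame and the temporary path are illustrative choices.

```julia
using CSV
using CodecZlib
using DataFrames

df = DataFrame(rand(1:10, 10, 5), :auto)

# Temporary file used only for this illustration.
path = tempname() * ".csv.gz"

# Compress while writing; the do-block form closes the stream for us.
open(GzipCompressorStream, path, "w") do stream
    CSV.write(stream, df)
end

# Decompress while reading, then parse the resulting bytes as CSV.
df_back = open(GzipDecompressorStream, path) do stream
    CSV.read(read(stream), DataFrame)
end

rm(path)
```

Compared with transcode on a memory-mapped file, this form avoids the explicit close call on the write side, at the cost of relying on the open method for stream types.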

Using zip files

Sometimes you may have files compressed inside a zip file. In such a situation you may use ZipFile.jl in conjunction with an appropriate reader to read the files. Here we first create a ZIP file and then read back its contents into a DataFrame.

df1 = DataFrame(rand(1:10, 3, 4), :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     3     10      3      2
   2 │     3      5      3      7
   3 │     5     10      5      2
df2 = DataFrame(rand(1:10, 3, 4), :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     9      9      8     10
   2 │     9      8      7      2
   3 │     9      1     10     10

We now show yet another way to write a DataFrame as CSV: writing it directly into a file inside the zip archive.

w = ZipFile.Writer("x.zip")

f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))

# write a second CSV file into zip file
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))

close(w)

Now we read the compressed CSV file we have written:

z = ZipFile.Reader("x.zip");
# find the index of the file called x1.csv
index_xcsv = findfirst(x -> x.name == "x1.csv", z.files)
# read the x1.csv file from the zip file
df1_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     3     10      3      2
   2 │     3      5      3      7
   3 │     5     10      5      2
df1_2 == df1
true
# find the index of the file called x2.csv
index_xcsv = findfirst(x -> x.name == "x2.csv", z.files)
# read the x2.csv file from the zip file
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     9      9      8     10
   2 │     9      8      7      2
   3 │     9      1     10     10
df2_2 == df2
true

Note that once you read a given file from the z object, its stream is used up (it has reached its end). Therefore, to read it again you need to close the z object and open it again. Also, do not forget to close the zip file once you are done.

close(z)
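
Because a stream inside the archive can be read only once per open reader, one option is to wrap the open, read, and close steps in a function so that every call starts from a fresh reader. The sketch below is illustrative: read_csv_from_zip is a hypothetical helper, and the archive, entry name, and data frame are made up for the demonstration.

```julia
using CSV
using DataFrames
using ZipFile

# Hypothetical helper: open the archive, read one CSV entry, and always close it.
function read_csv_from_zip(zip_path, entry_name)
    reader = ZipFile.Reader(zip_path)
    try
        idx = findfirst(f -> f.name == entry_name, reader.files)
        idx === nothing && error("no entry named $entry_name in $zip_path")
        return CSV.read(read(reader.files[idx]), DataFrame)
    finally
        close(reader)
    end
end

# Build a small archive in a temporary location for the demonstration.
df = DataFrame(a=[1, 2, 3], b=[4, 5, 6])
zip_path = tempname() * ".zip"
w = ZipFile.Writer(zip_path)
f = ZipFile.addfile(w, "demo.csv"; method=ZipFile.Deflate)
write(f, sprint(show, "text/csv", df))
close(w)

# Each call opens a fresh reader, so reading the same entry twice works.
first_read = read_csv_from_zip(zip_path, "demo.csv")
second_read = read_csv_from_zip(zip_path, "demo.csv")

rm(zip_path)
```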

Remove generated files

rm("x.arrow")
rm("x.bin")
rm("x.zip")
rm("x.jlso")
rm("x1.csv")
rm("x1.json")
rm("x2.json")
rm("x.jdf", recursive=true)
rm("bigdf.jdf", recursive=true)
rm("df_compress_test.csv.gz")
rm("bigdf1.json")
rm("bigdf1.csv")
rm("bigdf2.json")
rm("bigdf.jlso")
rm("bigdf.bin")
rm("bigdf.arrow")
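
If some of these files may not exist, for example because a cell was skipped, each plain rm call above would throw an error. A tolerant variant can be sketched with the force and recursive keywords of rm; cleanup is a hypothetical helper name, and the temporary file is created only for the demonstration.

```julia
# Hypothetical helper: remove files or directories, ignoring paths that do not exist.
function cleanup(paths)
    for p in paths
        rm(p; force=true, recursive=true)
    end
end

# Demonstration with one existing and one non-existing path.
demo_file = tempname()
write(demo_file, "scratch")
cleanup([demo_file, demo_file * ".does-not-exist"])
```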

This notebook was generated using Literate.jl.