Load and save DataFrames#

We do not cover all features of these packages. Please refer to their documentation to learn more.

Here we’ll load CSV.jl to read and write CSV files; Arrow.jl, JLSO.jl, and the Serialization standard library, which allow us to work with binary formats; and JSONTables.jl for JSON interaction. Finally, we consider the custom JDF.jl format.

using DataFrames
using Arrow
using CSV
using Serialization
using JLSO
using JSONTables
using CodecZlib
using ZipFile
using JDF
using StatsPlots ## for charts
using Mmap ## for memory mapping

Let’s create a simple DataFrame for testing purposes,

x = DataFrame(
    A=[true, false, true], B=[1, 2, missing],
    C=[missing, "b", "c"], D=['a', missing, 'c']
)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c

and broadcast eltype over its columns to look at the column-wise types.

eltype.(eachcol(x))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

CSV.jl#

Let’s use CSV.jl to save x to disk; make sure that x1.csv does not conflict with an existing file in your working directory.

CSV.write("x1.csv", x)
"x1.csv"

Now we can see how it was saved by reading x1.csv back in as a string.

print(read("x1.csv", String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
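
CSV.write also accepts keyword arguments that control the output format. A minimal sketch using the delim keyword (the file name x1.tsv is just an example):

CSV.write("x1.tsv", x, delim='\t')
print(read("x1.tsv", String)) # same data, tab-separated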

We can also load x1.csv back:

y = CSV.read("x1.csv", DataFrame)
3×4 DataFrame
 Row │ A      B        C         D
     │ Bool   Int64?   String1?  String1?
─────┼─────────────────────────────────────
   1 │  true        1  missing   a
   2 │ false        2  b         missing
   3 │  true  missing  c         c

Note that when loading a DataFrame from a CSV file, the types of columns :C and :D have changed to special string types defined in the InlineStrings.jl package.

eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String1}
 Union{Missing, String1}
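
If you prefer plain String columns instead, recent versions of CSV.jl accept a stringtype keyword argument. A minimal sketch:

y_str = CSV.read("x1.csv", DataFrame; stringtype=String)
eltype.(eachcol(y_str)) # :C and :D are now Union{Missing, String}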

Serialization with the Serialization standard library, JDF.jl, and JLSO.jl#

Now we use serialization to save x.

There are several ways to perform serialization. The first is to use Serialization.serialize, as below:

Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.

open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable y. Again, y is identical to x. However, beware that if your session does not have DataFrames.jl loaded, deserialization may not recognise the content as a DataFrame.

y = open(deserialize, "x.bin")
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

JDF.jl#

JDF.jl is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the JDF.save function.

For more details about design assumptions and limitations of JDF.jl please check out xiaodaigh/JDF.jl.

JDF.save("x.jdf", x);

To load the saved JDF file, one can use the JDF.load function:

x_loaded = JDF.load("x.jdf") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c

You can see that they are the same:

isequal(x_loaded, x)
true

JDF.jl offers the ability to load only certain columns from disk, which helps when working with large files:

# set up a JDFFile, which is an on-disk representation of `x` backed by JDF.jl
x_ondisk = jdf"x.jdf"
JDFFile{String}("x.jdf")

We can see all the column names of x without loading it into memory:

names(x_ondisk)
4-element Vector{Symbol}:
 :A
 :B
 :C
 :D

Below is an example of how to load only columns :A and :D:

xd = JDF.load(x_ondisk; cols=["A", "D"]) |> DataFrame
3×2 DataFrame
 Row │ A      D
     │ Bool   Char?
─────┼────────────────
   1 │  true  a
   2 │ false  missing
   3 │  true  c

JLSO.jl#

Another way to perform serialization is by using the JLSO.jl library:

JLSO.save("x.jlso", :data => x)

Now we can load the file back into y:

y = JLSO.load("x.jlso")[:data]
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}
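
JLSO.jl can also store several objects in a single file, each under its own key. A minimal sketch (the file name xy.jlso and the key names are just examples):

JLSO.save("xy.jlso", :x => x, :note => "stored alongside the data frame")
JLSO.load("xy.jlso")[:note] # each object is retrieved by its key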

JSONTables.jl#

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process it in either a row-oriented layout (arraytable) or a column-oriented layout (objecttable). We present both options below. Note that JSON has no character type, so after a round trip the :D column holds strings.

open(io -> arraytable(io, x), "x1.json", "w")
106
open(io -> objecttable(io, x), "x2.json", "w")
76
print(read("x1.json", String))
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
print(read("x2.json", String))
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
y1 = open(jsontable, "x1.json") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  String?
─────┼───────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y1))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}
y2 = open(jsontable, "x2.json") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  String?
─────┼───────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y2))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}

Arrow.jl#

Finally, we use the Apache Arrow format, which allows, in particular, for data interchange with R or Python.

Arrow.write("x.arrow", x)
"x.arrow"
y = Arrow.Table("x.arrow") |> DataFrame
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
eltype.(eachcol(y))
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

Note that the columns of y are immutable:

try
    y.A[1] = false
catch e
    show(e)
end
ReadOnlyMemoryError()

This is because Arrow.Table uses memory mapping and thus uses custom vector types:

y.A
3-element Arrow.BoolVector{Bool}:
 1
 0
 1
y.B
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
 1
 2
  missing

You can get standard Julia Base vectors by copying the data frame:

y2 = copy(y)
3×4 DataFrame
 Row │ A      B        C        D
     │ Bool   Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │  true        1  missing  a
   2 │ false        2  b        missing
   3 │  true  missing  c        c
y2.A
3-element Vector{Bool}:
 1
 0
 1
y2.B
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
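
Arrow.write can also compress the data it writes. A minimal sketch, assuming the compress keyword argument available in recent Arrow.jl versions (the file name is just an example):

Arrow.write("x_compressed.arrow", x, compress=:zstd)
# decompression is transparent when the table is read back
Arrow.Table("x_compressed.arrow") |> DataFrame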

Basic benchmarking#

Next, we’ll create some files, so be careful that you don’t already have these files in your working directory!

In particular, we’ll time how long it takes us to write a DataFrame with 10,000 rows and 1,000 columns.

bigdf = DataFrame(rand(Bool, 10^4, 1000), :auto)

bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

println("First run")
First run
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite1 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite1 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite2 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
CSV.jl
  4.520346 seconds (44.60 M allocations: 1.126 GiB, 4.27% gc time, 63.40% compilation time)
Serialization
  0.179294 seconds (148.06 k allocations: 9.352 MiB, 26.73% compilation time)
JDF.jl
  0.095084 seconds (46.82 k allocations: 147.657 MiB, 11.32% gc time, 111.51% compilation time)
JLSO.jl
  1.029119 seconds (174.96 k allocations: 14.867 MiB, 5.98% compilation time)
Arrow.jl
  3.256308 seconds (3.02 M allocations: 208.636 MiB, 1.15% gc time, 96.99% compilation time)
JSONTables.jl arraytable
 11.686247 seconds (229.61 M allocations: 5.423 GiB, 11.29% gc time, 0.07% compilation time)
JSONTables.jl objecttable
  0.372464 seconds (60.98 k allocations: 308.136 MiB, 35.64% gc time, 18.88% compilation time)
Second run
CSV.jl
  1.618924 seconds (44.40 M allocations: 1.113 GiB, 12.99% gc time)
Serialization
  0.141854 seconds (14.32 k allocations: 397.188 KiB, 5.90% compilation time)
JDF.jl
  0.081677 seconds (28.12 k allocations: 146.393 MiB, 9.51% gc time)
JLSO.jl
  0.968076 seconds (30.11 k allocations: 4.701 MiB)
Arrow.jl
  0.098084 seconds (52.79 k allocations: 5.619 MiB)
JSONTables.jl arraytable
 11.693286 seconds (229.61 M allocations: 5.423 GiB, 10.47% gc time, 0.07% compilation time)
JSONTables.jl objecttable
  0.337134 seconds (15.94 k allocations: 305.186 MiB, 38.97% gc time, 2.39% compilation time)
0.337387923

We exclude the JSONTables.jl arraytable timings from the chart because they are much longer than the rest:

groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
        inner=2),
    [csvwrite1, csvwrite2, serializewrite1, serializewrite2, jdfwrite1, jdfwrite2,
        jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2],
    group=repeat(["1st", "2nd"], outer=6),
    ylab="Second",
    title="Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
data_files = ["bigdf1.csv", "bigdf.bin", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file=data_files, size=getfield.(stat.(data_files), :size))
append!(df, DataFrame(file="bigdf.jdf", size=reduce((x, y) -> x + y.size,
    stat.(joinpath.("bigdf.jdf", readdir("bigdf.jdf"))),
    init=0)))
sort!(df, :size)
6×2 DataFrame
 Row │ file         size
     │ String       Int64
─────┼────────────────────────
   1 │ bigdf.arrow    1742770
   2 │ bigdf.bin      5201790
   3 │ bigdf.jdf      5222771
   4 │ bigdf1.csv    55085173
   5 │ bigdf2.json   55089174
   6 │ bigdf1.json  124030281
@df df plot(:file, :size / 1024^2, seriestype=:bar, title="Format File Size (MB)", label="Size", ylab="MB")
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
  1.979127 seconds (1.93 M allocations: 155.763 MiB, 0.96% gc time, 244.67% compilation time)
Serialization
  0.433973 seconds (9.49 M allocations: 155.130 MiB, 10.95% gc time, 5.89% compilation time)
JDF.jl
  0.180052 seconds (123.89 k allocations: 161.520 MiB, 122.57% compilation time)
JLSO.jl
  0.359403 seconds (9.50 M allocations: 157.726 MiB, 7.37% gc time, 5.84% compilation time)
Arrow.jl
  0.363759 seconds (347.75 k allocations: 22.543 MiB, 98.50% compilation time)
  0.053692 seconds (12.50 k allocations: 10.319 MiB)
JSONTables.jl arraytable
 13.347633 seconds (30.24 M allocations: 2.530 GiB, 4.69% gc time)
JSONTables.jl objecttable
  0.347721 seconds (7.05 k allocations: 282.602 MiB, 5.33% gc time, 0.04% compilation time)
Second run
CSV.jl
  0.392585 seconds (160.96 k allocations: 35.388 MiB)
Serialization
  0.399444 seconds (9.48 M allocations: 154.650 MiB, 6.76% gc time)
JDF.jl
  0.039274 seconds (69.69 k allocations: 157.945 MiB)
JLSO.jl
  0.336919 seconds (9.50 M allocations: 157.323 MiB)
Arrow.jl
  0.004663 seconds (67.53 k allocations: 3.292 MiB)
  0.053299 seconds (12.50 k allocations: 10.319 MiB)
JSONTables.jl arraytable
 12.999409 seconds (30.24 M allocations: 2.530 GiB, 2.74% gc time)
JSONTables.jl objecttable
  0.328286 seconds (6.06 k allocations: 282.532 MiB, 2.58% gc time)

We again exclude the JSON arraytable timings because they are much longer than the rest:

groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", #"JSON\narraytable",
            "JSON\nobjecttable"], inner=2),
    [csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
        arrowread1, arrowread2, arrowread1 + arrowread1copy, arrowread2 + arrowread2copy,
        # jsontablesaread1, jsontablesaread2,
        jsontablesoread1, jsontablesoread2],
    group=repeat(["1st", "2nd"], outer=7),
    ylab="Second",
    title="Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)

Using gzip compression#

A common user requirement is to be able to load and save CSV files that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl. The same pattern is applicable to compressing and decompressing JSON with JSONTables.jl, as sketched at the end of this section.

Again, make sure that you do not have a file named df_compress_test.csv.gz in your working directory.

We first generate a random data frame.

df = DataFrame(rand(1:10, 10, 1000), :auto)
10×1000 DataFrame (900 columns omitted; Int64 columns x1 … x1000; wide random output elided)
# GzipCompressorStream comes from CodecZlib
open("df_compress_test.csv.gz", "w") do io
    stream = GzipCompressorStream(io)
    CSV.write(stream, df)
    close(stream)
end
df2 = CSV.File(transcode(GzipDecompressor, Mmap.mmap("df_compress_test.csv.gz"))) |> DataFrame
10×1000 DataFrame (900 columns omitted; same contents as df above; wide output elided)
df == df2
true
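
As noted above, the same pattern works for JSON via JSONTables.jl. A minimal sketch (the file name df_compress_test.json.gz is just an example):

open("df_compress_test.json.gz", "w") do io
    stream = GzipCompressorStream(io)
    objecttable(stream, df) # column-oriented JSON written through the compressor
    close(stream)
end
df3 = jsontable(String(transcode(GzipDecompressor, Mmap.mmap("df_compress_test.json.gz")))) |> DataFrame
# df3 == df should hold, just like in the CSV example above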

Using zip files#

Sometimes you may have files compressed inside a zip file.

In such a situation you may use ZipFile.jl in conjunction with an appropriate reader to read the files.

Here we first create a ZIP file and then read back its contents into a DataFrame.

df1 = DataFrame(rand(1:10, 3, 4), :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     8      3      6      1
   2 │     6      3      9      1
   3 │     5      4      6      6
df2 = DataFrame(rand(1:10, 3, 4), :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     6      3      4      1
   2 │     9      2      3      7
   3 │    10      3      2      1

And we show yet another way to write a DataFrame into a CSV file: writing the CSV data directly into a zip file.

w = ZipFile.Writer("x.zip")

f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))

# write a second CSV file into the zip file, this time compressed with Deflate
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))

close(w)

Now we read back the CSV files we have written:

z = ZipFile.Reader("x.zip");
# find the index of the file called x1.csv
index_xcsv = findfirst(x -> x.name == "x1.csv", z.files)
# read the x1.csv file from the zip file
df1_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     8      3      6      1
   2 │     6      3      9      1
   3 │     5      4      6      6
df1_2 == df1
true
# find the index of the file called x2.csv
index_xcsv = findfirst(x -> x.name == "x2.csv", z.files)
# read the x2.csv file from the zip file
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     6      3      4      1
   2 │     9      2      3      7
   3 │    10      3      2      1
df2_2 == df2
true

Note that once you read a given file from the z object, its stream is exhausted (it is at its end). Therefore, to read it again you need to close z and open it again, as sketched below.

Also do not forget to close the zip file once done.

close(z)
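
For example, to read x1.csv a second time, open a fresh reader. A minimal sketch (df1_again is just an illustrative name):

z = ZipFile.Reader("x.zip")
df1_again = CSV.read(read(z.files[findfirst(f -> f.name == "x1.csv", z.files)]), DataFrame)
close(z)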