Load and save DataFrames#
We do not cover all features of the packages. Please refer to their documentation to learn them.
Here we’ll load CSV.jl
to read and write CSV files and Arrow.jl
, JLSO.jl
, and serialization, which allow us to work with a binary format and JSONTables.jl
for JSON interaction. Finally we consider a custom JDF.jl
format.
using DataFrames
using Arrow
using CSV
using Serialization
using JLSO
using JSONTables
using CodecZlib
using ZipFile
using JDF
using StatsPlots ## for charts
using Mmap ## for compression
Let’s create a simple DataFrame
for testing purposes,
x = DataFrame(
A=[true, false, true], B=[1, 2, missing],
C=[missing, "b", "c"], D=['a', missing, 'c']
)
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
and use eltypes
to look at the columnwise types.
eltype.(eachcol(x))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
CSV.jl#
Let’s use CSV
to save x
to disk; make sure x1.csv
does not conflict with some file in your working directory.
CSV.write("x1.csv", x)
"x1.csv"
Now we can see how it was saved by reading x.csv
.
print(read("x1.csv", String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
We can also load it back as a data frame
y = CSV.read("x1.csv", DataFrame)
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String1? | String1? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
Note that when loading in a DataFrame
from a CSV
the column type for columns :C
:D
have changed to use special strings defined in the InlineStrings.jl package.
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, InlineStrings.String1}
Union{Missing, InlineStrings.String1}
Serialization by JDF.jl and JLSO.jl#
Now we use serialization to save x
.
There are two ways to perform serialization. The first way is to use the Serialization.serialize
as below:
Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.
open("x.bin", "w") do io
serialize(io, x)
end
Now we load back the saved file to y
variable. Again y
is identical to x
. However, please beware that if you session does not have DataFrames.jl loaded, then it may not recognize the content as DataFrames.jl
y = open(deserialize, "x.bin")
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
JDF.jl#
JDF.jl
is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the savejdf
function.
For more details about design assumptions and limitations of JDF.jl
please check out xiaodaigh/JDF.jl.
JDF.save("x.jdf", x);
To load the saved JDF file, one can use the loadjdf
function
x_loaded = JDF.load("x.jdf") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
You can see that they are the same
isequal(x_loaded, x)
true
JDF.jl offers the ability to load only certain columns from disk to help with working with large files.
set up a JDFFile which is a on disk representation of x
backed by JDF.jl
x_ondisk = jdf"x.jdf"
JDF.JDFFile{String}("x.jdf")
We can see all the names of x
without loading it into memory
names(x_ondisk)
4-element Vector{Symbol}:
:A
:B
:C
:D
The below is an example of how to load only columns :A
and :D
xd = JDF.load(x_ondisk; cols=["A", "D"]) |> DataFrame
Row | A | D |
---|---|---|
Bool | Char? | |
1 | true | a |
2 | false | missing |
3 | true | c |
JLSO.jl#
Another way to perform serialization is by using the JLSO.jl library:
JLSO.save("x.jlso", :data => x)
Now we can load back the file to y
y = JLSO.load("x.jlso")[:data]
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
JSONTables.jl#
Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.
open(io -> arraytable(io, x), "x1.json", "w")
106
open(io -> objecttable(io, x), "x2.json", "w")
76
print(read("x1.json", String))
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
print(read("x2.json", String))
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
y1 = open(jsontable, "x1.json") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | String? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y1))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, String}
y2 = open(jsontable, "x2.json") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | String? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y2))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, String}
Arrow.jl#
Finally we use Apache Arrow format that allows, in particular, for data interchange with R or Python.
Arrow.write("x.arrow", x)
"x.arrow"
y = Arrow.Table("x.arrow") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
Note that columns of y
are immutable
try
y.A[1] = false
catch e
show(e)
end
ReadOnlyMemoryError()
This is because Arrow.Table
uses memory mapping and thus uses a custom vector types:
y.A
3-element Arrow.BoolVector{Bool}:
1
0
1
y.B
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
1
2
missing
You can get standard Julia Base vectors by copying a data frame
y2 = copy(y)
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
y2.A
3-element Vector{Bool}:
1
0
1
y2.B
3-element Vector{Union{Missing, Int64}}:
1
2
missing
Basic benchmarking#
Next, we’ll create some files, so be careful that you don’t already have these files in your working directory!
In particular, we’ll time how long it takes us to write a DataFrame
with 1000 rows and 100000 columns.
bigdf = DataFrame(rand(Bool, 10^4, 1000), :auto)
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
First run
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite1 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite1 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite2 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
CSV.jl
4.425211 seconds (44.69 M allocations: 1.127 GiB, 3.55% gc time, 69.26% compilation time)
Serialization
0.196434 seconds (224.29 k allocations: 10.825 MiB, 4.88% gc time, 28.97% compilation time)
JDF.jl
0.128338 seconds (68.08 k allocations: 147.945 MiB, 14.16% gc time, 49.25% compilation time)
JLSO.jl
1.044408 seconds (284.79 k allocations: 20.878 MiB, 0.75% gc time, 6.64% compilation time)
Arrow.jl
3.952950 seconds (5.45 M allocations: 275.525 MiB, 0.64% gc time, 97.68% compilation time)
JSONTables.jl arraytable
12.152417 seconds (229.62 M allocations: 5.422 GiB, 18.49% gc time, 0.08% compilation time)
JSONTables.jl objecttable
0.542825 seconds (96.64 k allocations: 309.085 MiB, 51.63% gc time, 15.24% compilation time)
Second run
CSV.jl
1.387532 seconds (44.40 M allocations: 1.113 GiB, 9.94% gc time)
Serialization
0.139032 seconds (15.01 k allocations: 403.086 KiB, 6.71% compilation time)
JDF.jl
0.100345 seconds (35.13 k allocations: 146.222 MiB, 13.05% gc time)
JLSO.jl
0.971269 seconds (33.37 k allocations: 7.969 MiB)
Arrow.jl
0.091576 seconds (81.35 k allocations: 5.426 MiB)
JSONTables.jl arraytable
11.900882 seconds (229.62 M allocations: 5.422 GiB, 15.14% gc time, 0.09% compilation time)
JSONTables.jl objecttable
0.487132 seconds (20.71 k allocations: 305.221 MiB, 54.21% gc time, 2.02% compilation time)
0.487298284
groupedbar(
repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
inner=2),
[csvwrite1, csvwrite2, serializewrite1, serializewrite1, jdfwrite1, jdfwrite2,
jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite2, jsontablesowrite2],
group=repeat(["1st", "2nd"], outer=6),
ylab="Second",
title="Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)

data_files = ["bigdf1.csv", "bigdf.bin", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file=data_files, size=getfield.(stat.(data_files), :size))
append!(df, DataFrame(file="bigdf.jdf", size=reduce((x, y) -> x + y.size,
stat.(joinpath.("bigdf.jdf", readdir("bigdf.jdf"))),
init=0)))
sort!(df, :size)
Row | file | size |
---|---|---|
String | Int64 | |
1 | bigdf.arrow | 1742898 |
2 | bigdf.bin | 5199455 |
3 | bigdf.jdf | 5224275 |
4 | bigdf1.csv | 55084448 |
5 | bigdf2.json | 55088449 |
6 | bigdf1.json | 124029556 |
@df df plot(:file, :size / 1024^2, seriestype=:bar, title="Format File Size (MB)", label="Size", ylab="MB")

println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
2.615539 seconds (3.55 M allocations: 208.428 MiB, 1.24% gc time, 139.11% compilation time)
Serialization
0.402124 seconds (9.50 M allocations: 155.498 MiB, 8.63% gc time, 8.83% compilation time)
JDF.jl
0.178533 seconds (169.92 k allocations: 158.470 MiB, 9.39% gc time, 80.53% compilation time)
JLSO.jl
0.350620 seconds (9.52 M allocations: 158.179 MiB, 7.58% gc time, 7.29% compilation time)
Arrow.jl
0.442189 seconds (550.37 k allocations: 26.348 MiB, 98.34% compilation time)
0.050265 seconds (14.50 k allocations: 10.259 MiB)
JSONTables.jl arraytable
6.264694 seconds (271.09 k allocations: 1.838 GiB, 9.71% gc time)
JSONTables.jl objecttable
0.356244 seconds (7.43 k allocations: 403.796 MiB, 4.57% gc time, 0.02% compilation time)
Second run
0.623085 seconds (152.55 k allocations: 34.426 MiB)
Serialization
0.357145 seconds (9.48 M allocations: 154.588 MiB, 5.51% gc time)
JDF.jl
0.362624 seconds (77.27 k allocations: 153.747 MiB, 86.63% gc time)
JLSO.jl
0.332968 seconds (9.50 M allocations: 157.310 MiB, 5.01% gc time)
Arrow.jl
0.006500 seconds (86.57 k allocations: 3.731 MiB)
0.049625 seconds (14.50 k allocations: 10.259 MiB)
JSONTables.jl arraytable
6.256591 seconds (271.09 k allocations: 1.838 GiB, 10.13% gc time)
JSONTables.jl objecttable
0.351383 seconds (7.08 k allocations: 403.780 MiB, 1.07% gc time)
Exclude JSONTables due to much longer timing
groupedbar(
repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", ##"JSON\narraytable",
"JSON\nobjecttable"], inner=2),
[csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
arrowread1, arrowread2, arrowread1 + arrowread1copy, arrowread2 + arrowread2copy,
# jsontablesaread1, jsontablesaread2,
jsontablesoread1, jsontablesoread2],
group=repeat(["1st", "2nd"], outer=7),
ylab="Second",
title="Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)

Using gzip compression#
A common user requirement is to be able to load and save CSV that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl
. The same pattern is applicable to JSONTables.jl
compression/decompression.
Again make sure that you do not have file named df_compress_test.csv.gz
in your working directory.
We first generate a random data frame.
df = DataFrame(rand(1:10, 10, 1000), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | ⋯ | |
1 | 9 | 9 | 9 | 7 | 2 | 9 | 10 | 4 | 6 | 4 | 8 | 8 | 3 | 8 | 3 | 10 | 4 | 6 | 4 | 6 | 2 | 4 | 3 | 10 | 4 | 1 | 1 | 6 | 5 | 6 | 8 | 4 | 5 | 7 | 5 | 10 | 1 | 7 | 7 | 10 | 7 | 1 | 8 | 6 | 4 | 9 | 4 | 6 | 8 | 6 | 1 | 1 | 2 | 8 | 3 | 1 | 6 | 2 | 8 | 4 | 6 | 7 | 7 | 5 | 4 | 8 | 4 | 1 | 1 | 8 | 7 | 8 | 5 | 3 | 4 | 9 | 10 | 2 | 10 | 4 | 4 | 2 | 3 | 9 | 8 | 9 | 9 | 10 | 9 | 8 | 7 | 4 | 9 | 1 | 9 | 3 | 1 | 7 | 2 | 10 | ⋯ |
2 | 8 | 3 | 3 | 1 | 10 | 7 | 9 | 7 | 2 | 8 | 8 | 7 | 4 | 3 | 4 | 3 | 9 | 2 | 4 | 8 | 5 | 6 | 1 | 1 | 1 | 10 | 4 | 1 | 10 | 1 | 4 | 6 | 5 | 1 | 8 | 8 | 2 | 3 | 1 | 5 | 3 | 1 | 7 | 7 | 7 | 1 | 8 | 2 | 10 | 4 | 4 | 9 | 7 | 10 | 4 | 3 | 1 | 6 | 2 | 1 | 9 | 9 | 3 | 1 | 9 | 6 | 7 | 2 | 4 | 2 | 6 | 4 | 3 | 2 | 1 | 9 | 2 | 10 | 2 | 10 | 2 | 8 | 9 | 1 | 7 | 7 | 2 | 9 | 3 | 2 | 6 | 4 | 5 | 4 | 7 | 10 | 10 | 6 | 2 | 7 | ⋯ |
3 | 4 | 9 | 3 | 4 | 9 | 6 | 3 | 3 | 10 | 9 | 10 | 8 | 10 | 2 | 7 | 6 | 10 | 5 | 6 | 2 | 8 | 1 | 7 | 1 | 9 | 9 | 10 | 3 | 8 | 2 | 2 | 7 | 8 | 8 | 2 | 1 | 3 | 2 | 9 | 7 | 2 | 2 | 9 | 9 | 3 | 5 | 2 | 4 | 9 | 5 | 7 | 2 | 2 | 6 | 9 | 7 | 10 | 3 | 6 | 9 | 7 | 4 | 1 | 10 | 5 | 5 | 9 | 5 | 4 | 6 | 2 | 10 | 10 | 4 | 7 | 10 | 7 | 1 | 3 | 9 | 6 | 5 | 1 | 3 | 1 | 1 | 1 | 9 | 1 | 7 | 7 | 6 | 6 | 1 | 1 | 1 | 2 | 9 | 5 | 2 | ⋯ |
4 | 5 | 8 | 6 | 1 | 6 | 3 | 9 | 6 | 4 | 7 | 9 | 6 | 3 | 4 | 5 | 9 | 6 | 10 | 5 | 2 | 8 | 9 | 10 | 3 | 7 | 8 | 4 | 5 | 2 | 5 | 2 | 1 | 8 | 1 | 10 | 1 | 2 | 10 | 1 | 2 | 3 | 2 | 4 | 2 | 5 | 6 | 3 | 6 | 7 | 5 | 7 | 7 | 3 | 9 | 3 | 7 | 2 | 4 | 2 | 1 | 7 | 4 | 1 | 10 | 7 | 6 | 9 | 8 | 8 | 8 | 6 | 3 | 4 | 3 | 2 | 3 | 3 | 3 | 7 | 3 | 1 | 6 | 8 | 9 | 9 | 8 | 4 | 3 | 7 | 7 | 10 | 7 | 9 | 8 | 10 | 1 | 7 | 2 | 10 | 8 | ⋯ |
5 | 2 | 2 | 7 | 6 | 2 | 9 | 10 | 3 | 4 | 8 | 7 | 7 | 2 | 5 | 2 | 5 | 8 | 8 | 8 | 3 | 8 | 3 | 4 | 1 | 1 | 9 | 8 | 8 | 4 | 8 | 7 | 10 | 9 | 2 | 7 | 7 | 6 | 3 | 5 | 10 | 6 | 2 | 8 | 2 | 5 | 9 | 5 | 1 | 10 | 7 | 7 | 1 | 10 | 5 | 10 | 4 | 9 | 1 | 3 | 3 | 3 | 5 | 1 | 9 | 2 | 1 | 3 | 2 | 5 | 4 | 2 | 10 | 5 | 10 | 10 | 8 | 3 | 2 | 8 | 1 | 10 | 6 | 2 | 1 | 9 | 4 | 9 | 3 | 7 | 5 | 4 | 8 | 8 | 9 | 1 | 9 | 1 | 4 | 4 | 10 | ⋯ |
6 | 8 | 7 | 9 | 7 | 4 | 9 | 8 | 7 | 4 | 3 | 2 | 4 | 8 | 1 | 7 | 8 | 6 | 9 | 6 | 1 | 7 | 6 | 6 | 4 | 8 | 9 | 3 | 2 | 6 | 4 | 3 | 7 | 4 | 10 | 4 | 1 | 5 | 3 | 9 | 8 | 2 | 4 | 3 | 1 | 6 | 10 | 10 | 8 | 1 | 2 | 3 | 8 | 3 | 3 | 9 | 8 | 1 | 6 | 9 | 7 | 10 | 4 | 5 | 4 | 7 | 7 | 7 | 9 | 9 | 3 | 5 | 2 | 5 | 6 | 4 | 5 | 3 | 5 | 3 | 8 | 3 | 10 | 2 | 7 | 6 | 4 | 1 | 2 | 4 | 8 | 5 | 8 | 5 | 6 | 9 | 5 | 3 | 2 | 4 | 8 | ⋯ |
7 | 1 | 6 | 8 | 8 | 10 | 1 | 8 | 7 | 6 | 1 | 7 | 7 | 1 | 9 | 1 | 8 | 8 | 6 | 8 | 2 | 2 | 7 | 3 | 2 | 8 | 10 | 8 | 7 | 9 | 4 | 3 | 5 | 8 | 7 | 4 | 3 | 9 | 5 | 3 | 2 | 5 | 2 | 8 | 2 | 8 | 6 | 1 | 9 | 9 | 7 | 4 | 7 | 5 | 4 | 9 | 7 | 4 | 7 | 5 | 10 | 5 | 9 | 10 | 4 | 5 | 10 | 9 | 7 | 10 | 10 | 9 | 10 | 4 | 6 | 8 | 8 | 4 | 4 | 7 | 7 | 4 | 1 | 5 | 6 | 10 | 10 | 7 | 9 | 10 | 10 | 9 | 5 | 7 | 5 | 5 | 9 | 10 | 9 | 1 | 8 | ⋯ |
8 | 3 | 7 | 5 | 9 | 4 | 2 | 2 | 4 | 4 | 1 | 4 | 2 | 10 | 4 | 6 | 10 | 2 | 9 | 9 | 3 | 4 | 1 | 1 | 6 | 1 | 6 | 8 | 10 | 8 | 6 | 3 | 2 | 1 | 3 | 5 | 6 | 8 | 3 | 2 | 9 | 5 | 7 | 1 | 3 | 2 | 10 | 7 | 6 | 10 | 1 | 6 | 4 | 6 | 9 | 2 | 4 | 1 | 4 | 8 | 8 | 6 | 3 | 7 | 1 | 6 | 4 | 6 | 1 | 4 | 3 | 8 | 2 | 8 | 10 | 10 | 8 | 4 | 2 | 10 | 3 | 2 | 9 | 9 | 3 | 6 | 10 | 6 | 5 | 8 | 9 | 7 | 7 | 3 | 6 | 9 | 1 | 3 | 4 | 7 | 1 | ⋯ |
9 | 2 | 6 | 3 | 7 | 6 | 4 | 5 | 1 | 10 | 8 | 6 | 7 | 10 | 9 | 8 | 5 | 7 | 2 | 5 | 1 | 10 | 1 | 5 | 3 | 8 | 10 | 5 | 8 | 10 | 1 | 5 | 9 | 8 | 5 | 4 | 8 | 8 | 10 | 4 | 6 | 3 | 10 | 7 | 5 | 9 | 3 | 5 | 10 | 4 | 8 | 7 | 9 | 5 | 2 | 5 | 6 | 8 | 7 | 2 | 1 | 8 | 5 | 5 | 5 | 9 | 10 | 1 | 1 | 7 | 3 | 1 | 4 | 6 | 3 | 4 | 10 | 5 | 10 | 1 | 1 | 3 | 2 | 3 | 1 | 1 | 2 | 2 | 4 | 2 | 6 | 5 | 2 | 9 | 3 | 1 | 4 | 6 | 1 | 5 | 3 | ⋯ |
10 | 1 | 2 | 6 | 5 | 8 | 2 | 3 | 3 | 4 | 5 | 5 | 6 | 2 | 1 | 5 | 4 | 1 | 10 | 1 | 8 | 2 | 4 | 6 | 7 | 9 | 6 | 1 | 10 | 3 | 1 | 3 | 7 | 3 | 1 | 8 | 4 | 2 | 9 | 5 | 10 | 1 | 5 | 10 | 8 | 6 | 3 | 2 | 6 | 10 | 4 | 9 | 10 | 9 | 3 | 1 | 3 | 2 | 9 | 10 | 7 | 5 | 3 | 2 | 5 | 10 | 8 | 3 | 3 | 4 | 10 | 9 | 8 | 5 | 2 | 8 | 7 | 8 | 9 | 10 | 3 | 8 | 5 | 7 | 10 | 4 | 1 | 8 | 1 | 3 | 3 | 6 | 1 | 1 | 8 | 1 | 1 | 9 | 10 | 2 | 7 | ⋯ |
GzipCompressorStream comes from CodecZlib
open("df_compress_test.csv.gz", "w") do io
stream = GzipCompressorStream(io)
CSV.write(stream, df)
close(stream)
end
df2 = CSV.File(transcode(GzipDecompressor, Mmap.mmap("df_compress_test.csv.gz"))) |> DataFrame
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | ⋯ | |
1 | 9 | 9 | 9 | 7 | 2 | 9 | 10 | 4 | 6 | 4 | 8 | 8 | 3 | 8 | 3 | 10 | 4 | 6 | 4 | 6 | 2 | 4 | 3 | 10 | 4 | 1 | 1 | 6 | 5 | 6 | 8 | 4 | 5 | 7 | 5 | 10 | 1 | 7 | 7 | 10 | 7 | 1 | 8 | 6 | 4 | 9 | 4 | 6 | 8 | 6 | 1 | 1 | 2 | 8 | 3 | 1 | 6 | 2 | 8 | 4 | 6 | 7 | 7 | 5 | 4 | 8 | 4 | 1 | 1 | 8 | 7 | 8 | 5 | 3 | 4 | 9 | 10 | 2 | 10 | 4 | 4 | 2 | 3 | 9 | 8 | 9 | 9 | 10 | 9 | 8 | 7 | 4 | 9 | 1 | 9 | 3 | 1 | 7 | 2 | 10 | ⋯ |
2 | 8 | 3 | 3 | 1 | 10 | 7 | 9 | 7 | 2 | 8 | 8 | 7 | 4 | 3 | 4 | 3 | 9 | 2 | 4 | 8 | 5 | 6 | 1 | 1 | 1 | 10 | 4 | 1 | 10 | 1 | 4 | 6 | 5 | 1 | 8 | 8 | 2 | 3 | 1 | 5 | 3 | 1 | 7 | 7 | 7 | 1 | 8 | 2 | 10 | 4 | 4 | 9 | 7 | 10 | 4 | 3 | 1 | 6 | 2 | 1 | 9 | 9 | 3 | 1 | 9 | 6 | 7 | 2 | 4 | 2 | 6 | 4 | 3 | 2 | 1 | 9 | 2 | 10 | 2 | 10 | 2 | 8 | 9 | 1 | 7 | 7 | 2 | 9 | 3 | 2 | 6 | 4 | 5 | 4 | 7 | 10 | 10 | 6 | 2 | 7 | ⋯ |
3 | 4 | 9 | 3 | 4 | 9 | 6 | 3 | 3 | 10 | 9 | 10 | 8 | 10 | 2 | 7 | 6 | 10 | 5 | 6 | 2 | 8 | 1 | 7 | 1 | 9 | 9 | 10 | 3 | 8 | 2 | 2 | 7 | 8 | 8 | 2 | 1 | 3 | 2 | 9 | 7 | 2 | 2 | 9 | 9 | 3 | 5 | 2 | 4 | 9 | 5 | 7 | 2 | 2 | 6 | 9 | 7 | 10 | 3 | 6 | 9 | 7 | 4 | 1 | 10 | 5 | 5 | 9 | 5 | 4 | 6 | 2 | 10 | 10 | 4 | 7 | 10 | 7 | 1 | 3 | 9 | 6 | 5 | 1 | 3 | 1 | 1 | 1 | 9 | 1 | 7 | 7 | 6 | 6 | 1 | 1 | 1 | 2 | 9 | 5 | 2 | ⋯ |
4 | 5 | 8 | 6 | 1 | 6 | 3 | 9 | 6 | 4 | 7 | 9 | 6 | 3 | 4 | 5 | 9 | 6 | 10 | 5 | 2 | 8 | 9 | 10 | 3 | 7 | 8 | 4 | 5 | 2 | 5 | 2 | 1 | 8 | 1 | 10 | 1 | 2 | 10 | 1 | 2 | 3 | 2 | 4 | 2 | 5 | 6 | 3 | 6 | 7 | 5 | 7 | 7 | 3 | 9 | 3 | 7 | 2 | 4 | 2 | 1 | 7 | 4 | 1 | 10 | 7 | 6 | 9 | 8 | 8 | 8 | 6 | 3 | 4 | 3 | 2 | 3 | 3 | 3 | 7 | 3 | 1 | 6 | 8 | 9 | 9 | 8 | 4 | 3 | 7 | 7 | 10 | 7 | 9 | 8 | 10 | 1 | 7 | 2 | 10 | 8 | ⋯ |
5 | 2 | 2 | 7 | 6 | 2 | 9 | 10 | 3 | 4 | 8 | 7 | 7 | 2 | 5 | 2 | 5 | 8 | 8 | 8 | 3 | 8 | 3 | 4 | 1 | 1 | 9 | 8 | 8 | 4 | 8 | 7 | 10 | 9 | 2 | 7 | 7 | 6 | 3 | 5 | 10 | 6 | 2 | 8 | 2 | 5 | 9 | 5 | 1 | 10 | 7 | 7 | 1 | 10 | 5 | 10 | 4 | 9 | 1 | 3 | 3 | 3 | 5 | 1 | 9 | 2 | 1 | 3 | 2 | 5 | 4 | 2 | 10 | 5 | 10 | 10 | 8 | 3 | 2 | 8 | 1 | 10 | 6 | 2 | 1 | 9 | 4 | 9 | 3 | 7 | 5 | 4 | 8 | 8 | 9 | 1 | 9 | 1 | 4 | 4 | 10 | ⋯ |
6 | 8 | 7 | 9 | 7 | 4 | 9 | 8 | 7 | 4 | 3 | 2 | 4 | 8 | 1 | 7 | 8 | 6 | 9 | 6 | 1 | 7 | 6 | 6 | 4 | 8 | 9 | 3 | 2 | 6 | 4 | 3 | 7 | 4 | 10 | 4 | 1 | 5 | 3 | 9 | 8 | 2 | 4 | 3 | 1 | 6 | 10 | 10 | 8 | 1 | 2 | 3 | 8 | 3 | 3 | 9 | 8 | 1 | 6 | 9 | 7 | 10 | 4 | 5 | 4 | 7 | 7 | 7 | 9 | 9 | 3 | 5 | 2 | 5 | 6 | 4 | 5 | 3 | 5 | 3 | 8 | 3 | 10 | 2 | 7 | 6 | 4 | 1 | 2 | 4 | 8 | 5 | 8 | 5 | 6 | 9 | 5 | 3 | 2 | 4 | 8 | ⋯ |
7 | 1 | 6 | 8 | 8 | 10 | 1 | 8 | 7 | 6 | 1 | 7 | 7 | 1 | 9 | 1 | 8 | 8 | 6 | 8 | 2 | 2 | 7 | 3 | 2 | 8 | 10 | 8 | 7 | 9 | 4 | 3 | 5 | 8 | 7 | 4 | 3 | 9 | 5 | 3 | 2 | 5 | 2 | 8 | 2 | 8 | 6 | 1 | 9 | 9 | 7 | 4 | 7 | 5 | 4 | 9 | 7 | 4 | 7 | 5 | 10 | 5 | 9 | 10 | 4 | 5 | 10 | 9 | 7 | 10 | 10 | 9 | 10 | 4 | 6 | 8 | 8 | 4 | 4 | 7 | 7 | 4 | 1 | 5 | 6 | 10 | 10 | 7 | 9 | 10 | 10 | 9 | 5 | 7 | 5 | 5 | 9 | 10 | 9 | 1 | 8 | ⋯ |
8 | 3 | 7 | 5 | 9 | 4 | 2 | 2 | 4 | 4 | 1 | 4 | 2 | 10 | 4 | 6 | 10 | 2 | 9 | 9 | 3 | 4 | 1 | 1 | 6 | 1 | 6 | 8 | 10 | 8 | 6 | 3 | 2 | 1 | 3 | 5 | 6 | 8 | 3 | 2 | 9 | 5 | 7 | 1 | 3 | 2 | 10 | 7 | 6 | 10 | 1 | 6 | 4 | 6 | 9 | 2 | 4 | 1 | 4 | 8 | 8 | 6 | 3 | 7 | 1 | 6 | 4 | 6 | 1 | 4 | 3 | 8 | 2 | 8 | 10 | 10 | 8 | 4 | 2 | 10 | 3 | 2 | 9 | 9 | 3 | 6 | 10 | 6 | 5 | 8 | 9 | 7 | 7 | 3 | 6 | 9 | 1 | 3 | 4 | 7 | 1 | ⋯ |
9 | 2 | 6 | 3 | 7 | 6 | 4 | 5 | 1 | 10 | 8 | 6 | 7 | 10 | 9 | 8 | 5 | 7 | 2 | 5 | 1 | 10 | 1 | 5 | 3 | 8 | 10 | 5 | 8 | 10 | 1 | 5 | 9 | 8 | 5 | 4 | 8 | 8 | 10 | 4 | 6 | 3 | 10 | 7 | 5 | 9 | 3 | 5 | 10 | 4 | 8 | 7 | 9 | 5 | 2 | 5 | 6 | 8 | 7 | 2 | 1 | 8 | 5 | 5 | 5 | 9 | 10 | 1 | 1 | 7 | 3 | 1 | 4 | 6 | 3 | 4 | 10 | 5 | 10 | 1 | 1 | 3 | 2 | 3 | 1 | 1 | 2 | 2 | 4 | 2 | 6 | 5 | 2 | 9 | 3 | 1 | 4 | 6 | 1 | 5 | 3 | ⋯ |
10 | 1 | 2 | 6 | 5 | 8 | 2 | 3 | 3 | 4 | 5 | 5 | 6 | 2 | 1 | 5 | 4 | 1 | 10 | 1 | 8 | 2 | 4 | 6 | 7 | 9 | 6 | 1 | 10 | 3 | 1 | 3 | 7 | 3 | 1 | 8 | 4 | 2 | 9 | 5 | 10 | 1 | 5 | 10 | 8 | 6 | 3 | 2 | 6 | 10 | 4 | 9 | 10 | 9 | 3 | 1 | 3 | 2 | 9 | 10 | 7 | 5 | 3 | 2 | 5 | 10 | 8 | 3 | 3 | 4 | 10 | 9 | 8 | 5 | 2 | 8 | 7 | 8 | 9 | 10 | 3 | 8 | 5 | 7 | 10 | 4 | 1 | 8 | 1 | 3 | 3 | 6 | 1 | 1 | 8 | 1 | 1 | 9 | 10 | 2 | 7 | ⋯ |
df == df2
true
Using zip files#
Sometimes you may have files compressed inside a zip file.
In such a situation you may use ZipFile.jl in conjunction an an appropriate reader to read the files.
Here we first create a ZIP file and then read back its contents into a DataFrame
.
df1 = DataFrame(rand(1:10, 3, 4), :auto)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 7 | 5 | 3 | 3 |
2 | 3 | 9 | 9 | 7 |
3 | 9 | 5 | 4 | 4 |
df2 = DataFrame(rand(1:10, 3, 4), :auto)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 8 | 1 | 7 | 5 |
2 | 2 | 8 | 8 | 6 |
3 | 9 | 5 | 8 | 3 |
And we show yet another way to write a DataFrame
into a CSV file:
Writing a CSV file into the zip file
w = ZipFile.Writer("x.zip")
f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))
# write a second CSV file into zip file
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))
close(w)
Now we read the compressed CSV file we have written:
z = ZipFile.Reader("x.zip");
# find the index index of file called x1.csv
index_xcsv = findfirst(x -> x.name == "x1.csv", z.files)
# to read the x1.csv file in the zip file
df1_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 7 | 5 | 3 | 3 |
2 | 3 | 9 | 9 | 7 |
3 | 9 | 5 | 4 | 4 |
df1_2 == df1
true
# find the index index of file called x2.csv
index_xcsv = findfirst(x -> x.name == "x2.csv", z.files)
# to read the x2.csv file in the zip file
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 8 | 1 | 7 | 5 |
2 | 2 | 8 | 8 | 6 |
3 | 9 | 5 | 8 | 3 |
df2_2 == df2
true
Note that once you read a given file from z
object its stream is all used-up (reaching its end). Therefore to read it again you need to close the file object z
and open it again.
Also do not forget to close the zip file once you are done.
close(z)
Remove generated files
rm("x.arrow")
rm("x.bin")
rm("x.zip")
rm("x.jlso")
rm("x1.csv")
rm("x1.json")
rm("x2.json")
rm("x.jdf", recursive=true)
rm("bigdf.jdf", recursive=true)
rm("df_compress_test.csv.gz")
rm("bigdf1.json")
rm("bigdf1.csv")
rm("bigdf2.json")
rm("bigdf.jlso")
rm("bigdf.bin")
rm("bigdf.arrow")
This notebook was generated using Literate.jl.