Load and save DataFrames#
We do not cover all features of the packages. Please refer to their documentation to learn them.
Here we’ll load CSV.jl to read and write CSV files and Arrow.jl, JLSO.jl, and serialization, which allow us to work with a binary format and JSONTables.jl for JSON interaction. Finally we consider a custom JDF.jl format.
using DataFrames
using Arrow
using CSV
using JSONTables
using CodecZlib
using ZipFile
using StatsPlots ## for charts
using Mmap ## for compression
Let’s create a simple DataFrame for testing purposes,
x = DataFrame(
A=[true, false, true], B=[1, 2, missing],
C=[missing, "b", "c"], D=['a', missing, 'c']
)
| Row | A | B | C | D |
|---|---|---|---|---|
| Bool | Int64? | String? | Char? | |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |
and use eltypes to look at the columnwise types.
eltype.(eachcol(x))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
CSV.jl#
Let’s use CSV to save x to disk; make sure x1.csv does not conflict with some file in your working directory.
CSV.write("x1.csv", x)
"x1.csv"
Now we can see how it was saved by reading x.csv.
print(read("x1.csv", String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
We can also load it back as a data frame
y = CSV.read("x1.csv", DataFrame)
| Row | A | B | C | D |
|---|---|---|---|---|
| Bool | Int64? | String1? | String1? | |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |
Note that when loading in a DataFrame from a CSV the column type for columns :C :D have changed to use special strings defined in the InlineStrings.jl package.
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, InlineStrings.String1}
Union{Missing, InlineStrings.String1}
Clean the generated file
rm("x1.csv")
JSONTables.jl#
Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.
open(io -> arraytable(io, x), "x1.json", "w")
106
open(io -> objecttable(io, x), "x2.json", "w")
76
print(read("x1.json", String))
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
print(read("x2.json", String))
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
y1 = open(jsontable, "x1.json") |> DataFrame
| Row | A | B | C | D |
|---|---|---|---|---|
| Bool | Int64? | String? | String? | |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |
eltype.(eachcol(y1))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, String}
y2 = open(jsontable, "x2.json") |> DataFrame
| Row | A | B | C | D |
|---|---|---|---|---|
| Bool | Int64? | String? | String? | |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |
eltype.(eachcol(y2))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, String}
Clean the generated files
rm("x1.json")
rm("x2.json")
Arrow.jl#
Finally we use Apache Arrow format that allows, in particular, for data interchange with R or Python.
Arrow.write("x.arrow", x)
"x.arrow"
y = Arrow.Table("x.arrow") |> DataFrame
| Row | A | B | C | D |
|---|---|---|---|---|
| Bool | Int64? | String? | Char? | |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
Note that columns of y are immutable
try
y.A[1] = false
catch e
show(e)
end
ReadOnlyMemoryError()
This is because Arrow.Table uses memory mapping and thus uses a custom vector types:
y.A
3-element Arrow.BoolVector{Bool}:
1
0
1
y.B
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
1
2
missing
You can get standard Julia Base vectors by copying a data frame
y2 = copy(y)
| Row | A | B | C | D |
|---|---|---|---|---|
| Bool | Int64? | String? | Char? | |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |
y2.A
3-element Vector{Bool}:
1
0
1
y2.B
3-element Vector{Union{Missing, Int64}}:
1
2
missing
Clean the generated file
rm("x.arrow")
Basic benchmarking#
Next, we’ll create some files, so be careful that you don’t already have these files in your working directory!
In particular, we’ll time how long it takes us to write a DataFrame with 1000 rows and 100000 columns.
bigdf = DataFrame(rand(Bool, 10^4, 1000), :auto)
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
First run
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv.gz", bigdf; compress=true)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv.gz", bigdf; compress=true)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
CSV.jl
6.002204 seconds (45.04 M allocations: 1.590 GiB, 6.94% gc time, 58.26% compilation time: <1% of which was recompilation)
Arrow.jl
3.963022 seconds (6.64 M allocations: 325.486 MiB, 0.48% gc time, 97.57% compilation time)
JSONTables.jl arraytable
10.926450 seconds (229.63 M allocations: 5.497 GiB, 16.31% gc time, 0.13% compilation time: <1% of which was recompilation)
JSONTables.jl objecttable
0.300350 seconds (106.19 k allocations: 309.453 MiB, 6.04% gc time, 26.38% compilation time)
Second run
CSV.jl
2.282952 seconds (44.41 M allocations: 1.560 GiB, 5.57% gc time)
Arrow.jl
0.096491 seconds (80.86 k allocations: 5.164 MiB)
JSONTables.jl arraytable
10.895230 seconds (229.63 M allocations: 5.497 GiB, 15.29% gc time, 0.08% compilation time)
JSONTables.jl objecttable
0.239073 seconds (20.83 k allocations: 305.241 MiB, 6.80% gc time, 3.79% compilation time)
0.239252714
groupedbar(
repeat(["CSV.jl (gz)", "Arrow.jl", "JSONTables.jl\nobjecttable"],
inner=2),
[csvwrite1, csvwrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2],
group=repeat(["1st", "2nd"], outer=3),
ylab="Second",
title="Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))",
permute = (:x, :y)
)
data_files = ["bigdf1.csv.gz", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file=data_files, size=getfield.(stat.(data_files), :size))
sort!(df, :size)
| Row | file | size |
|---|---|---|
| String | Int64 | |
| 1 | bigdf.arrow | 1742786 |
| 2 | bigdf1.csv.gz | 2470468 |
| 3 | bigdf2.json | 55086823 |
| 4 | bigdf1.json | 124027930 |
@df df plot(:file, :size / 1024^2, seriestype=:bar, title="Format File Size (MB)", label="Size", ylab="MB")
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv.gz", DataFrame)
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv.gz", DataFrame)
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
2.817632 seconds (4.41 M allocations: 223.331 MiB, 1.90% gc time, 2 lock conflicts, 86.24% compilation time)
Arrow.jl
0.468273 seconds (573.48 k allocations: 26.968 MiB, 98.38% compilation time)
0.058121 seconds (14.02 k allocations: 10.297 MiB)
JSONTables.jl arraytable
5.402500 seconds (271.15 k allocations: 1.772 GiB, 10.37% gc time, 0.07% compilation time)
JSONTables.jl objecttable
0.410801 seconds (7.39 k allocations: 566.940 MiB, 4.44% gc time, 0.02% compilation time)
Second run
0.926750 seconds (637.12 k allocations: 43.582 MiB)
Arrow.jl
0.009412 seconds (84.09 k allocations: 3.594 MiB)
0.058019 seconds (14.02 k allocations: 10.297 MiB)
JSONTables.jl arraytable
5.375577 seconds (271.10 k allocations: 1.772 GiB, 10.63% gc time)
JSONTables.jl objecttable
0.387121 seconds (7.08 k allocations: 566.921 MiB, 2.86% gc time)
Exclude JSONTables due to much longer timing
groupedbar(
repeat(["CSV.jl (gz)", "Arrow.jl", "Arrow.jl\ncopy", ##"JSON\narraytable",
"JSON\nobjecttable"], inner=2),
[csvread1, csvread2, arrowread1, arrowread2, arrowread1 + arrowread1copy, arrowread2 + arrowread2copy,
# jsontablesaread1, jsontablesaread2,
jsontablesoread1, jsontablesoread2],
group=repeat(["1st", "2nd"], outer=4),
ylab="Second",
title="Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))",
permute = (:x, :y)
)
Clean generated files
rm("bigdf1.csv.gz")
rm("bigdf1.json")
rm("bigdf2.json")
rm("bigdf.arrow")
Using gzip compression#
A common user requirement is to be able to load and save CSV that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl.
Again make sure that you do not have file named df_compress_test.csv.gz in your working directory.
We first generate a random data frame.
df = DataFrame(rand(1:10, 10, 1000), :auto)
| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | ⋯ | |
| 1 | 5 | 2 | 10 | 1 | 8 | 6 | 7 | 3 | 4 | 5 | 6 | 6 | 8 | 1 | 4 | 7 | 1 | 6 | 2 | 4 | 7 | 2 | 6 | 9 | 9 | 10 | 4 | 1 | 1 | 6 | 5 | 5 | 10 | 3 | 5 | 8 | 2 | 6 | 6 | 8 | 4 | 3 | 6 | 9 | 6 | 1 | 9 | 1 | 9 | 9 | 3 | 10 | 3 | 9 | 4 | 9 | 7 | 1 | 7 | 5 | 8 | 3 | 8 | 3 | 2 | 10 | 10 | 1 | 7 | 5 | 7 | 4 | 7 | 1 | 10 | 3 | 4 | 3 | 4 | 8 | 2 | 1 | 2 | 8 | 10 | 2 | 7 | 2 | 3 | 2 | 4 | 6 | 1 | 6 | 1 | 4 | 2 | 5 | 10 | 3 | ⋯ |
| 2 | 4 | 4 | 9 | 3 | 9 | 3 | 2 | 8 | 5 | 1 | 10 | 4 | 5 | 8 | 7 | 10 | 6 | 10 | 7 | 7 | 6 | 4 | 4 | 2 | 6 | 2 | 7 | 2 | 7 | 4 | 3 | 8 | 2 | 6 | 1 | 2 | 6 | 5 | 9 | 6 | 7 | 10 | 7 | 2 | 10 | 1 | 10 | 4 | 6 | 3 | 3 | 5 | 3 | 7 | 5 | 5 | 6 | 4 | 9 | 1 | 10 | 9 | 9 | 3 | 7 | 8 | 7 | 5 | 7 | 1 | 5 | 6 | 7 | 1 | 6 | 6 | 8 | 1 | 9 | 7 | 2 | 9 | 1 | 5 | 5 | 8 | 9 | 9 | 1 | 4 | 8 | 3 | 7 | 8 | 8 | 3 | 3 | 9 | 3 | 10 | ⋯ |
| 3 | 10 | 6 | 5 | 1 | 7 | 9 | 1 | 5 | 7 | 10 | 8 | 2 | 5 | 4 | 6 | 5 | 7 | 6 | 9 | 5 | 6 | 9 | 1 | 5 | 4 | 9 | 3 | 6 | 2 | 8 | 3 | 1 | 8 | 9 | 1 | 2 | 2 | 8 | 5 | 5 | 5 | 7 | 8 | 10 | 3 | 3 | 3 | 9 | 2 | 10 | 8 | 7 | 1 | 7 | 4 | 6 | 9 | 2 | 3 | 9 | 1 | 7 | 4 | 9 | 2 | 3 | 6 | 7 | 9 | 5 | 2 | 7 | 3 | 4 | 4 | 2 | 5 | 3 | 2 | 7 | 1 | 6 | 2 | 5 | 1 | 9 | 1 | 9 | 3 | 2 | 8 | 10 | 5 | 4 | 4 | 10 | 9 | 6 | 3 | 3 | ⋯ |
| 4 | 10 | 6 | 5 | 6 | 4 | 7 | 6 | 10 | 7 | 5 | 10 | 6 | 9 | 8 | 5 | 9 | 2 | 9 | 1 | 3 | 4 | 9 | 3 | 5 | 7 | 3 | 3 | 5 | 3 | 4 | 1 | 10 | 8 | 7 | 10 | 1 | 1 | 3 | 7 | 10 | 6 | 2 | 6 | 6 | 3 | 2 | 9 | 6 | 3 | 10 | 8 | 7 | 5 | 5 | 5 | 1 | 10 | 10 | 7 | 10 | 10 | 7 | 7 | 4 | 1 | 4 | 9 | 2 | 7 | 7 | 3 | 8 | 2 | 6 | 2 | 3 | 4 | 5 | 6 | 10 | 9 | 1 | 4 | 10 | 5 | 9 | 10 | 5 | 8 | 8 | 7 | 8 | 9 | 5 | 1 | 8 | 9 | 8 | 1 | 10 | ⋯ |
| 5 | 6 | 10 | 6 | 2 | 1 | 7 | 1 | 8 | 8 | 4 | 7 | 3 | 3 | 10 | 5 | 7 | 4 | 9 | 9 | 6 | 6 | 5 | 10 | 3 | 2 | 1 | 1 | 5 | 3 | 3 | 7 | 2 | 6 | 2 | 6 | 10 | 1 | 7 | 4 | 9 | 4 | 5 | 4 | 9 | 2 | 4 | 8 | 5 | 5 | 10 | 10 | 1 | 3 | 4 | 10 | 8 | 10 | 4 | 7 | 9 | 4 | 10 | 9 | 5 | 9 | 10 | 2 | 6 | 9 | 6 | 7 | 8 | 3 | 3 | 1 | 5 | 2 | 3 | 4 | 4 | 4 | 4 | 2 | 1 | 8 | 5 | 9 | 4 | 3 | 10 | 8 | 4 | 10 | 8 | 10 | 10 | 6 | 3 | 3 | 9 | ⋯ |
| 6 | 5 | 2 | 9 | 4 | 2 | 5 | 7 | 2 | 5 | 3 | 2 | 1 | 9 | 3 | 3 | 4 | 9 | 6 | 10 | 1 | 5 | 3 | 6 | 6 | 10 | 5 | 7 | 9 | 8 | 9 | 8 | 9 | 7 | 4 | 4 | 4 | 5 | 9 | 8 | 2 | 4 | 5 | 1 | 8 | 8 | 10 | 1 | 1 | 4 | 1 | 1 | 9 | 5 | 2 | 8 | 8 | 7 | 4 | 3 | 8 | 9 | 4 | 9 | 8 | 4 | 8 | 9 | 3 | 9 | 6 | 3 | 8 | 5 | 6 | 5 | 10 | 3 | 5 | 7 | 4 | 2 | 1 | 10 | 6 | 10 | 7 | 3 | 9 | 7 | 5 | 1 | 9 | 9 | 3 | 1 | 3 | 7 | 2 | 4 | 4 | ⋯ |
| 7 | 5 | 5 | 4 | 6 | 3 | 5 | 4 | 5 | 10 | 3 | 2 | 1 | 2 | 2 | 2 | 2 | 4 | 4 | 2 | 9 | 6 | 6 | 2 | 9 | 10 | 5 | 4 | 1 | 9 | 3 | 4 | 9 | 4 | 10 | 6 | 3 | 3 | 1 | 1 | 8 | 4 | 3 | 4 | 4 | 4 | 3 | 3 | 1 | 6 | 10 | 10 | 4 | 2 | 1 | 8 | 4 | 9 | 4 | 2 | 8 | 9 | 8 | 3 | 5 | 1 | 2 | 10 | 6 | 5 | 2 | 1 | 2 | 8 | 6 | 7 | 1 | 6 | 1 | 6 | 4 | 10 | 3 | 9 | 10 | 9 | 4 | 1 | 5 | 5 | 7 | 7 | 1 | 4 | 3 | 10 | 10 | 6 | 9 | 4 | 4 | ⋯ |
| 8 | 7 | 8 | 9 | 5 | 3 | 3 | 3 | 2 | 8 | 4 | 8 | 9 | 4 | 2 | 1 | 3 | 5 | 6 | 10 | 4 | 2 | 7 | 10 | 8 | 9 | 9 | 2 | 4 | 5 | 10 | 2 | 3 | 7 | 7 | 1 | 5 | 5 | 5 | 2 | 1 | 2 | 6 | 9 | 10 | 10 | 4 | 4 | 4 | 2 | 10 | 4 | 2 | 9 | 2 | 1 | 3 | 6 | 1 | 6 | 3 | 6 | 10 | 8 | 4 | 8 | 10 | 1 | 2 | 5 | 3 | 9 | 7 | 2 | 10 | 5 | 1 | 9 | 7 | 8 | 2 | 1 | 9 | 8 | 4 | 3 | 6 | 1 | 5 | 2 | 6 | 3 | 6 | 3 | 4 | 8 | 6 | 8 | 10 | 8 | 9 | ⋯ |
| 9 | 2 | 3 | 1 | 8 | 4 | 4 | 4 | 9 | 3 | 5 | 6 | 1 | 6 | 10 | 2 | 9 | 10 | 6 | 5 | 8 | 6 | 7 | 3 | 2 | 10 | 2 | 2 | 3 | 4 | 9 | 8 | 9 | 5 | 10 | 7 | 9 | 7 | 2 | 2 | 10 | 1 | 2 | 2 | 7 | 7 | 10 | 1 | 8 | 4 | 7 | 1 | 3 | 5 | 5 | 2 | 2 | 9 | 1 | 2 | 10 | 9 | 6 | 3 | 7 | 4 | 7 | 3 | 9 | 6 | 2 | 1 | 3 | 1 | 3 | 6 | 9 | 9 | 1 | 6 | 7 | 9 | 4 | 5 | 8 | 7 | 2 | 7 | 1 | 6 | 7 | 2 | 8 | 9 | 10 | 3 | 5 | 8 | 10 | 3 | 3 | ⋯ |
| 10 | 5 | 6 | 4 | 9 | 2 | 3 | 6 | 6 | 6 | 8 | 4 | 9 | 5 | 5 | 4 | 6 | 7 | 5 | 8 | 3 | 5 | 2 | 9 | 3 | 9 | 1 | 2 | 3 | 9 | 4 | 10 | 4 | 4 | 2 | 3 | 6 | 2 | 5 | 5 | 4 | 4 | 8 | 1 | 9 | 10 | 8 | 7 | 5 | 2 | 1 | 2 | 8 | 6 | 5 | 7 | 9 | 1 | 5 | 1 | 4 | 8 | 9 | 6 | 1 | 3 | 2 | 8 | 6 | 5 | 6 | 9 | 4 | 1 | 10 | 8 | 2 | 4 | 9 | 9 | 5 | 3 | 8 | 9 | 9 | 3 | 3 | 3 | 8 | 2 | 4 | 7 | 3 | 10 | 1 | 6 | 4 | 9 | 1 | 4 | 4 | ⋯ |
Use CodecZlib to compress the CSV file
CSV.write("df_compress_test.csv.gz", df; compress=true)
"df_compress_test.csv.gz"
df2 = CSV.File("df_compress_test.csv.gz") |> DataFrame
| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | ⋯ | |
| 1 | 5 | 2 | 10 | 1 | 8 | 6 | 7 | 3 | 4 | 5 | 6 | 6 | 8 | 1 | 4 | 7 | 1 | 6 | 2 | 4 | 7 | 2 | 6 | 9 | 9 | 10 | 4 | 1 | 1 | 6 | 5 | 5 | 10 | 3 | 5 | 8 | 2 | 6 | 6 | 8 | 4 | 3 | 6 | 9 | 6 | 1 | 9 | 1 | 9 | 9 | 3 | 10 | 3 | 9 | 4 | 9 | 7 | 1 | 7 | 5 | 8 | 3 | 8 | 3 | 2 | 10 | 10 | 1 | 7 | 5 | 7 | 4 | 7 | 1 | 10 | 3 | 4 | 3 | 4 | 8 | 2 | 1 | 2 | 8 | 10 | 2 | 7 | 2 | 3 | 2 | 4 | 6 | 1 | 6 | 1 | 4 | 2 | 5 | 10 | 3 | ⋯ |
| 2 | 4 | 4 | 9 | 3 | 9 | 3 | 2 | 8 | 5 | 1 | 10 | 4 | 5 | 8 | 7 | 10 | 6 | 10 | 7 | 7 | 6 | 4 | 4 | 2 | 6 | 2 | 7 | 2 | 7 | 4 | 3 | 8 | 2 | 6 | 1 | 2 | 6 | 5 | 9 | 6 | 7 | 10 | 7 | 2 | 10 | 1 | 10 | 4 | 6 | 3 | 3 | 5 | 3 | 7 | 5 | 5 | 6 | 4 | 9 | 1 | 10 | 9 | 9 | 3 | 7 | 8 | 7 | 5 | 7 | 1 | 5 | 6 | 7 | 1 | 6 | 6 | 8 | 1 | 9 | 7 | 2 | 9 | 1 | 5 | 5 | 8 | 9 | 9 | 1 | 4 | 8 | 3 | 7 | 8 | 8 | 3 | 3 | 9 | 3 | 10 | ⋯ |
| 3 | 10 | 6 | 5 | 1 | 7 | 9 | 1 | 5 | 7 | 10 | 8 | 2 | 5 | 4 | 6 | 5 | 7 | 6 | 9 | 5 | 6 | 9 | 1 | 5 | 4 | 9 | 3 | 6 | 2 | 8 | 3 | 1 | 8 | 9 | 1 | 2 | 2 | 8 | 5 | 5 | 5 | 7 | 8 | 10 | 3 | 3 | 3 | 9 | 2 | 10 | 8 | 7 | 1 | 7 | 4 | 6 | 9 | 2 | 3 | 9 | 1 | 7 | 4 | 9 | 2 | 3 | 6 | 7 | 9 | 5 | 2 | 7 | 3 | 4 | 4 | 2 | 5 | 3 | 2 | 7 | 1 | 6 | 2 | 5 | 1 | 9 | 1 | 9 | 3 | 2 | 8 | 10 | 5 | 4 | 4 | 10 | 9 | 6 | 3 | 3 | ⋯ |
| 4 | 10 | 6 | 5 | 6 | 4 | 7 | 6 | 10 | 7 | 5 | 10 | 6 | 9 | 8 | 5 | 9 | 2 | 9 | 1 | 3 | 4 | 9 | 3 | 5 | 7 | 3 | 3 | 5 | 3 | 4 | 1 | 10 | 8 | 7 | 10 | 1 | 1 | 3 | 7 | 10 | 6 | 2 | 6 | 6 | 3 | 2 | 9 | 6 | 3 | 10 | 8 | 7 | 5 | 5 | 5 | 1 | 10 | 10 | 7 | 10 | 10 | 7 | 7 | 4 | 1 | 4 | 9 | 2 | 7 | 7 | 3 | 8 | 2 | 6 | 2 | 3 | 4 | 5 | 6 | 10 | 9 | 1 | 4 | 10 | 5 | 9 | 10 | 5 | 8 | 8 | 7 | 8 | 9 | 5 | 1 | 8 | 9 | 8 | 1 | 10 | ⋯ |
| 5 | 6 | 10 | 6 | 2 | 1 | 7 | 1 | 8 | 8 | 4 | 7 | 3 | 3 | 10 | 5 | 7 | 4 | 9 | 9 | 6 | 6 | 5 | 10 | 3 | 2 | 1 | 1 | 5 | 3 | 3 | 7 | 2 | 6 | 2 | 6 | 10 | 1 | 7 | 4 | 9 | 4 | 5 | 4 | 9 | 2 | 4 | 8 | 5 | 5 | 10 | 10 | 1 | 3 | 4 | 10 | 8 | 10 | 4 | 7 | 9 | 4 | 10 | 9 | 5 | 9 | 10 | 2 | 6 | 9 | 6 | 7 | 8 | 3 | 3 | 1 | 5 | 2 | 3 | 4 | 4 | 4 | 4 | 2 | 1 | 8 | 5 | 9 | 4 | 3 | 10 | 8 | 4 | 10 | 8 | 10 | 10 | 6 | 3 | 3 | 9 | ⋯ |
| 6 | 5 | 2 | 9 | 4 | 2 | 5 | 7 | 2 | 5 | 3 | 2 | 1 | 9 | 3 | 3 | 4 | 9 | 6 | 10 | 1 | 5 | 3 | 6 | 6 | 10 | 5 | 7 | 9 | 8 | 9 | 8 | 9 | 7 | 4 | 4 | 4 | 5 | 9 | 8 | 2 | 4 | 5 | 1 | 8 | 8 | 10 | 1 | 1 | 4 | 1 | 1 | 9 | 5 | 2 | 8 | 8 | 7 | 4 | 3 | 8 | 9 | 4 | 9 | 8 | 4 | 8 | 9 | 3 | 9 | 6 | 3 | 8 | 5 | 6 | 5 | 10 | 3 | 5 | 7 | 4 | 2 | 1 | 10 | 6 | 10 | 7 | 3 | 9 | 7 | 5 | 1 | 9 | 9 | 3 | 1 | 3 | 7 | 2 | 4 | 4 | ⋯ |
| 7 | 5 | 5 | 4 | 6 | 3 | 5 | 4 | 5 | 10 | 3 | 2 | 1 | 2 | 2 | 2 | 2 | 4 | 4 | 2 | 9 | 6 | 6 | 2 | 9 | 10 | 5 | 4 | 1 | 9 | 3 | 4 | 9 | 4 | 10 | 6 | 3 | 3 | 1 | 1 | 8 | 4 | 3 | 4 | 4 | 4 | 3 | 3 | 1 | 6 | 10 | 10 | 4 | 2 | 1 | 8 | 4 | 9 | 4 | 2 | 8 | 9 | 8 | 3 | 5 | 1 | 2 | 10 | 6 | 5 | 2 | 1 | 2 | 8 | 6 | 7 | 1 | 6 | 1 | 6 | 4 | 10 | 3 | 9 | 10 | 9 | 4 | 1 | 5 | 5 | 7 | 7 | 1 | 4 | 3 | 10 | 10 | 6 | 9 | 4 | 4 | ⋯ |
| 8 | 7 | 8 | 9 | 5 | 3 | 3 | 3 | 2 | 8 | 4 | 8 | 9 | 4 | 2 | 1 | 3 | 5 | 6 | 10 | 4 | 2 | 7 | 10 | 8 | 9 | 9 | 2 | 4 | 5 | 10 | 2 | 3 | 7 | 7 | 1 | 5 | 5 | 5 | 2 | 1 | 2 | 6 | 9 | 10 | 10 | 4 | 4 | 4 | 2 | 10 | 4 | 2 | 9 | 2 | 1 | 3 | 6 | 1 | 6 | 3 | 6 | 10 | 8 | 4 | 8 | 10 | 1 | 2 | 5 | 3 | 9 | 7 | 2 | 10 | 5 | 1 | 9 | 7 | 8 | 2 | 1 | 9 | 8 | 4 | 3 | 6 | 1 | 5 | 2 | 6 | 3 | 6 | 3 | 4 | 8 | 6 | 8 | 10 | 8 | 9 | ⋯ |
| 9 | 2 | 3 | 1 | 8 | 4 | 4 | 4 | 9 | 3 | 5 | 6 | 1 | 6 | 10 | 2 | 9 | 10 | 6 | 5 | 8 | 6 | 7 | 3 | 2 | 10 | 2 | 2 | 3 | 4 | 9 | 8 | 9 | 5 | 10 | 7 | 9 | 7 | 2 | 2 | 10 | 1 | 2 | 2 | 7 | 7 | 10 | 1 | 8 | 4 | 7 | 1 | 3 | 5 | 5 | 2 | 2 | 9 | 1 | 2 | 10 | 9 | 6 | 3 | 7 | 4 | 7 | 3 | 9 | 6 | 2 | 1 | 3 | 1 | 3 | 6 | 9 | 9 | 1 | 6 | 7 | 9 | 4 | 5 | 8 | 7 | 2 | 7 | 1 | 6 | 7 | 2 | 8 | 9 | 10 | 3 | 5 | 8 | 10 | 3 | 3 | ⋯ |
| 10 | 5 | 6 | 4 | 9 | 2 | 3 | 6 | 6 | 6 | 8 | 4 | 9 | 5 | 5 | 4 | 6 | 7 | 5 | 8 | 3 | 5 | 2 | 9 | 3 | 9 | 1 | 2 | 3 | 9 | 4 | 10 | 4 | 4 | 2 | 3 | 6 | 2 | 5 | 5 | 4 | 4 | 8 | 1 | 9 | 10 | 8 | 7 | 5 | 2 | 1 | 2 | 8 | 6 | 5 | 7 | 9 | 1 | 5 | 1 | 4 | 8 | 9 | 6 | 1 | 3 | 2 | 8 | 6 | 5 | 6 | 9 | 4 | 1 | 10 | 8 | 2 | 4 | 9 | 9 | 5 | 3 | 8 | 9 | 9 | 3 | 3 | 3 | 8 | 2 | 4 | 7 | 3 | 10 | 1 | 6 | 4 | 9 | 1 | 4 | 4 | ⋯ |
df == df2
true
Remove generated files
rm("df_compress_test.csv.gz")
Using zip files#
Sometimes you may have files compressed inside a zip file.
In such a situation you may use ZipFile.jl in conjunction an an appropriate reader to read the files.
Here we first create a ZIP file and then read back its contents into a DataFrame.
df1 = DataFrame(rand(1:10, 3, 4), :auto)
| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | |
| 1 | 5 | 6 | 1 | 5 |
| 2 | 5 | 2 | 10 | 6 |
| 3 | 2 | 3 | 7 | 9 |
df2 = DataFrame(rand(1:10, 3, 4), :auto)
| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | |
| 1 | 9 | 8 | 7 | 6 |
| 2 | 3 | 2 | 3 | 3 |
| 3 | 10 | 5 | 4 | 8 |
And we show yet another way to write a DataFrame into a CSV file:
Writing a CSV file into the zip file
w = ZipFile.Writer("x.zip")
f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))
# write a second CSV file into zip file
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))
close(w)
Now we read the compressed CSV file we have written:
z = ZipFile.Reader("x.zip");
# find the index index of file called x1.csv
index_xcsv = findfirst(x -> x.name == "x1.csv", z.files)
# to read the x1.csv file in the zip file
df1_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | |
| 1 | 5 | 6 | 1 | 5 |
| 2 | 5 | 2 | 10 | 6 |
| 3 | 2 | 3 | 7 | 9 |
df1_2 == df1
true
# find the index index of file called x2.csv
index_xcsv = findfirst(x -> x.name == "x2.csv", z.files)
# to read the x2.csv file in the zip file
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | |
| 1 | 9 | 8 | 7 | 6 |
| 2 | 3 | 2 | 3 | 3 |
| 3 | 10 | 5 | 4 | 8 |
df2_2 == df2
true
Note that once you read a given file from z object its stream is all used-up (reaching its end). Therefore to read it again you need to close the file object z and open it again.
Also do not forget to close the zip file once you are done.
close(z)
Remove generated zip file
rm("x.zip")
This notebook was generated using Literate.jl.