Load and save DataFrames#
We do not cover all features of the packages. Please refer to their documentation to learn them.
Here we’ll load CSV.jl
to read and write CSV files and Arrow.jl
, JLSO.jl
, and serialization, which allow us to work with a binary format and JSONTables.jl
for JSON interaction. Finally we consider a custom JDF.jl
format.
using DataFrames
using Arrow
using CSV
using Serialization
using JLSO
using JSONTables
using CodecZlib
using ZipFile
using JDF
using StatsPlots ## for charts
using Mmap ## for compression
Let’s create a simple DataFrame
for testing purposes,
x = DataFrame(
A=[true, false, true], B=[1, 2, missing],
C=[missing, "b", "c"], D=['a', missing, 'c']
)
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
and use eltypes
to look at the columnwise types.
eltype.(eachcol(x))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
CSV.jl#
Let’s use CSV
to save x
to disk; make sure x1.csv
does not conflict with some file in your working directory.
CSV.write("x1.csv", x)
"x1.csv"
Now we can see how it was saved by reading x.csv
.
print(read("x1.csv", String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
We can also load it back as a data frame
y = CSV.read("x1.csv", DataFrame)
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String1? | String1? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
Note that when loading in a DataFrame
from a CSV
the column type for columns :C
:D
have changed to use special strings defined in the InlineStrings.jl package.
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, InlineStrings.String1}
Union{Missing, InlineStrings.String1}
Serialization by JDF.jl and JLSO.jl#
Now we use serialization to save x
.
There are two ways to perform serialization. The first way is to use the Serialization.serialize
as below:
Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.
open("x.bin", "w") do io
serialize(io, x)
end
Now we load back the saved file to y
variable. Again y
is identical to x
. However, please beware that if you session does not have DataFrames.jl loaded, then it may not recognize the content as DataFrames.jl
y = open(deserialize, "x.bin")
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
JDF.jl#
JDF.jl
is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the savejdf
function.
For more details about design assumptions and limitations of JDF.jl
please check out xiaodaigh/JDF.jl.
JDF.save("x.jdf", x);
To load the saved JDF file, one can use the loadjdf
function
x_loaded = JDF.load("x.jdf") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
You can see that they are the same
isequal(x_loaded, x)
true
JDF.jl offers the ability to load only certain columns from disk to help with working with large files.
set up a JDFFile which is a on disk representation of x
backed by JDF.jl
x_ondisk = jdf"x.jdf"
JDF.JDFFile{String}("x.jdf")
We can see all the names of x
without loading it into memory
names(x_ondisk)
4-element Vector{Symbol}:
:A
:B
:C
:D
The below is an example of how to load only columns :A
and :D
xd = JDF.load(x_ondisk; cols=["A", "D"]) |> DataFrame
Row | A | D |
---|---|---|
Bool | Char? | |
1 | true | a |
2 | false | missing |
3 | true | c |
JLSO.jl#
Another way to perform serialization is by using the JLSO.jl library:
JLSO.save("x.jlso", :data => x)
Now we can load back the file to y
y = JLSO.load("x.jlso")[:data]
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
JSONTables.jl#
Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.
open(io -> arraytable(io, x), "x1.json", "w")
106
open(io -> objecttable(io, x), "x2.json", "w")
76
print(read("x1.json", String))
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
print(read("x2.json", String))
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
y1 = open(jsontable, "x1.json") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | String? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y1))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, String}
y2 = open(jsontable, "x2.json") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | String? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y2))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, String}
Arrow.jl#
Finally we use Apache Arrow format that allows, in particular, for data interchange with R or Python.
Arrow.write("x.arrow", x)
"x.arrow"
y = Arrow.Table("x.arrow") |> DataFrame
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
eltype.(eachcol(y))
4-element Vector{Type}:
Bool
Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Char}
Note that columns of y
are immutable
try
y.A[1] = false
catch e
show(e)
end
ReadOnlyMemoryError()
This is because Arrow.Table
uses memory mapping and thus uses a custom vector types:
y.A
3-element Arrow.BoolVector{Bool}:
1
0
1
y.B
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
1
2
missing
You can get standard Julia Base vectors by copying a data frame
y2 = copy(y)
Row | A | B | C | D |
---|---|---|---|---|
Bool | Int64? | String? | Char? | |
1 | true | 1 | missing | a |
2 | false | 2 | b | missing |
3 | true | missing | c | c |
y2.A
3-element Vector{Bool}:
1
0
1
y2.B
3-element Vector{Union{Missing, Int64}}:
1
2
missing
Basic benchmarking#
Next, we’ll create some files, so be careful that you don’t already have these files in your working directory!
In particular, we’ll time how long it takes us to write a DataFrame
with 1000 rows and 100000 columns.
bigdf = DataFrame(rand(Bool, 10^4, 1000), :auto)
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
First run
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite1 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite1 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite2 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
CSV.jl
6.625767 seconds (44.69 M allocations: 1.127 GiB, 3.21% gc time, 64.54% compilation time)
Serialization
0.280890 seconds (224.29 k allocations: 10.821 MiB, 29.82% compilation time)
JDF.jl
0.165195 seconds (68.08 k allocations: 147.946 MiB, 23.08% gc time, 46.89% compilation time)
JLSO.jl
1.295517 seconds (284.83 k allocations: 20.372 MiB, 0.76% gc time, 8.74% compilation time)
Arrow.jl
6.013596 seconds (5.45 M allocations: 275.670 MiB, 0.58% gc time, 97.57% compilation time)
JSONTables.jl arraytable
20.560618 seconds (229.62 M allocations: 5.422 GiB, 15.39% gc time, 0.08% compilation time)
JSONTables.jl objecttable
0.617348 seconds (96.64 k allocations: 309.085 MiB, 48.29% gc time, 17.88% compilation time)
Second run
CSV.jl
1.690940 seconds (44.40 M allocations: 1.113 GiB, 9.06% gc time)
Serialization
0.216240 seconds (15.01 k allocations: 400.680 KiB, 7.44% compilation time)
JDF.jl
0.124376 seconds (35.13 k allocations: 146.223 MiB, 17.17% gc time)
JLSO.jl
1.183068 seconds (33.41 k allocations: 7.465 MiB)
Arrow.jl
0.113700 seconds (81.35 k allocations: 5.426 MiB)
JSONTables.jl arraytable
15.777456 seconds (229.62 M allocations: 5.422 GiB, 14.61% gc time, 0.10% compilation time)
JSONTables.jl objecttable
0.532067 seconds (20.71 k allocations: 305.221 MiB, 54.40% gc time, 2.10% compilation time)
0.532235469
groupedbar(
repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
inner=2),
[csvwrite1, csvwrite2, serializewrite1, serializewrite1, jdfwrite1, jdfwrite2,
jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite2, jsontablesowrite2],
group=repeat(["1st", "2nd"], outer=6),
ylab="Second",
title="Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
data_files = ["bigdf1.csv", "bigdf.bin", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file=data_files, size=getfield.(stat.(data_files), :size))
append!(df, DataFrame(file="bigdf.jdf", size=reduce((x, y) -> x + y.size,
stat.(joinpath.("bigdf.jdf", readdir("bigdf.jdf"))),
init=0)))
sort!(df, :size)
Row | file | size |
---|---|---|
String | Int64 | |
1 | bigdf.arrow | 1742906 |
2 | bigdf.bin | 5199352 |
3 | bigdf.jdf | 5223402 |
4 | bigdf1.csv | 55083649 |
5 | bigdf2.json | 55087650 |
6 | bigdf1.json | 124028757 |
@df df plot(:file, :size / 1024^2, seriestype=:bar, title="Format File Size (MB)", label="Size", ylab="MB")
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
2.638622 seconds (3.55 M allocations: 208.195 MiB, 0.98% gc time, 138.18% compilation time)
Serialization
0.395702 seconds (9.50 M allocations: 155.494 MiB, 9.00% gc time, 9.19% compilation time)
JDF.jl
0.217561 seconds (169.93 k allocations: 158.469 MiB, 12.14% gc time, 76.63% compilation time)
JLSO.jl
0.350374 seconds (9.52 M allocations: 158.102 MiB, 6.15% gc time, 8.30% compilation time)
Arrow.jl
0.481468 seconds (550.37 k allocations: 26.328 MiB, 98.46% compilation time)
0.068915 seconds (14.50 k allocations: 10.259 MiB)
JSONTables.jl arraytable
6.826233 seconds (271.07 k allocations: 1.838 GiB, 9.73% gc time)
JSONTables.jl objecttable
0.352800 seconds (7.43 k allocations: 403.793 MiB, 4.50% gc time, 0.02% compilation time)
Second run
0.546310 seconds (152.65 k allocations: 34.427 MiB)
Serialization
0.340662 seconds (9.48 M allocations: 154.588 MiB, 5.76% gc time)
JDF.jl
0.371574 seconds (77.27 k allocations: 153.746 MiB, 86.69% gc time)
JLSO.jl
0.352822 seconds (9.50 M allocations: 157.233 MiB, 4.64% gc time)
Arrow.jl
0.008759 seconds (86.57 k allocations: 3.731 MiB)
0.049918 seconds (14.50 k allocations: 10.259 MiB)
JSONTables.jl arraytable
6.749524 seconds (271.07 k allocations: 1.838 GiB, 9.91% gc time)
JSONTables.jl objecttable
0.354865 seconds (7.08 k allocations: 403.777 MiB, 1.90% gc time)
Exclude JSONTables due to much longer timing
groupedbar(
repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", ##"JSON\narraytable",
"JSON\nobjecttable"], inner=2),
[csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
arrowread1, arrowread2, arrowread1 + arrowread1copy, arrowread2 + arrowread2copy,
# jsontablesaread1, jsontablesaread2,
jsontablesoread1, jsontablesoread2],
group=repeat(["1st", "2nd"], outer=7),
ylab="Second",
title="Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
Using gzip compression#
A common user requirement is to be able to load and save CSV that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl
. The same pattern is applicable to JSONTables.jl
compression/decompression.
Again make sure that you do not have file named df_compress_test.csv.gz
in your working directory.
We first generate a random data frame.
df = DataFrame(rand(1:10, 10, 1000), :auto)
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | ⋯ | |
1 | 5 | 7 | 10 | 10 | 8 | 6 | 9 | 3 | 1 | 9 | 4 | 7 | 4 | 2 | 7 | 6 | 4 | 7 | 8 | 6 | 2 | 4 | 5 | 5 | 6 | 4 | 3 | 4 | 4 | 7 | 2 | 9 | 8 | 9 | 8 | 1 | 2 | 9 | 6 | 9 | 4 | 1 | 5 | 5 | 9 | 9 | 10 | 9 | 1 | 9 | 8 | 10 | 6 | 7 | 4 | 1 | 10 | 5 | 1 | 6 | 7 | 1 | 1 | 3 | 5 | 7 | 4 | 8 | 2 | 5 | 9 | 6 | 5 | 1 | 7 | 1 | 9 | 1 | 2 | 7 | 8 | 8 | 8 | 8 | 1 | 3 | 10 | 1 | 9 | 3 | 8 | 3 | 8 | 1 | 7 | 8 | 3 | 6 | 3 | 8 | ⋯ |
2 | 5 | 8 | 2 | 7 | 6 | 6 | 5 | 4 | 2 | 4 | 4 | 5 | 6 | 1 | 3 | 6 | 7 | 3 | 1 | 10 | 7 | 9 | 7 | 6 | 9 | 2 | 9 | 8 | 8 | 9 | 7 | 7 | 4 | 10 | 1 | 6 | 2 | 1 | 9 | 1 | 9 | 8 | 4 | 6 | 1 | 10 | 10 | 10 | 5 | 1 | 2 | 3 | 9 | 6 | 4 | 1 | 9 | 10 | 4 | 4 | 3 | 3 | 8 | 4 | 3 | 1 | 1 | 2 | 10 | 8 | 7 | 3 | 6 | 5 | 5 | 10 | 9 | 10 | 8 | 3 | 10 | 5 | 3 | 9 | 1 | 7 | 4 | 10 | 4 | 3 | 3 | 1 | 2 | 5 | 10 | 9 | 7 | 4 | 10 | 1 | ⋯ |
3 | 1 | 3 | 2 | 7 | 8 | 9 | 10 | 4 | 2 | 9 | 3 | 7 | 8 | 2 | 1 | 4 | 1 | 4 | 1 | 9 | 3 | 8 | 6 | 3 | 6 | 10 | 6 | 8 | 4 | 3 | 7 | 10 | 10 | 2 | 7 | 1 | 6 | 5 | 6 | 9 | 4 | 1 | 8 | 9 | 10 | 6 | 7 | 8 | 2 | 5 | 9 | 7 | 2 | 10 | 3 | 10 | 8 | 9 | 3 | 3 | 4 | 4 | 2 | 5 | 1 | 3 | 10 | 4 | 8 | 8 | 4 | 4 | 4 | 5 | 1 | 3 | 9 | 9 | 2 | 2 | 3 | 10 | 3 | 5 | 8 | 6 | 9 | 3 | 3 | 4 | 1 | 2 | 7 | 5 | 3 | 2 | 6 | 7 | 9 | 7 | ⋯ |
4 | 7 | 7 | 8 | 2 | 8 | 8 | 3 | 6 | 7 | 6 | 2 | 9 | 4 | 10 | 1 | 10 | 9 | 5 | 5 | 2 | 8 | 7 | 10 | 2 | 4 | 8 | 4 | 9 | 9 | 10 | 8 | 9 | 3 | 3 | 1 | 9 | 3 | 1 | 5 | 2 | 1 | 3 | 2 | 8 | 9 | 2 | 8 | 9 | 5 | 3 | 7 | 3 | 9 | 9 | 7 | 8 | 8 | 4 | 10 | 5 | 7 | 4 | 3 | 6 | 9 | 3 | 3 | 10 | 3 | 9 | 9 | 5 | 5 | 9 | 10 | 4 | 7 | 2 | 1 | 3 | 3 | 6 | 10 | 5 | 5 | 3 | 9 | 4 | 4 | 4 | 3 | 10 | 6 | 1 | 8 | 2 | 10 | 4 | 3 | 1 | ⋯ |
5 | 4 | 4 | 10 | 9 | 10 | 7 | 4 | 7 | 7 | 5 | 9 | 6 | 8 | 4 | 6 | 10 | 1 | 5 | 4 | 10 | 9 | 9 | 8 | 8 | 5 | 5 | 2 | 2 | 10 | 3 | 4 | 6 | 5 | 8 | 9 | 2 | 4 | 9 | 10 | 2 | 7 | 4 | 4 | 1 | 9 | 2 | 1 | 1 | 6 | 4 | 3 | 8 | 4 | 9 | 1 | 10 | 1 | 4 | 7 | 3 | 2 | 5 | 3 | 6 | 9 | 9 | 2 | 9 | 2 | 8 | 3 | 5 | 5 | 2 | 10 | 1 | 8 | 9 | 2 | 9 | 5 | 3 | 10 | 3 | 9 | 4 | 2 | 5 | 10 | 4 | 3 | 7 | 2 | 6 | 6 | 1 | 10 | 4 | 4 | 4 | ⋯ |
6 | 1 | 1 | 7 | 3 | 4 | 8 | 4 | 7 | 4 | 6 | 6 | 6 | 4 | 6 | 8 | 3 | 5 | 8 | 4 | 9 | 5 | 2 | 5 | 10 | 1 | 8 | 8 | 7 | 3 | 1 | 8 | 7 | 3 | 1 | 8 | 6 | 8 | 1 | 6 | 3 | 4 | 9 | 7 | 4 | 3 | 10 | 2 | 10 | 7 | 6 | 3 | 2 | 9 | 7 | 1 | 8 | 8 | 5 | 6 | 2 | 9 | 2 | 10 | 6 | 4 | 6 | 6 | 7 | 6 | 8 | 10 | 7 | 8 | 9 | 8 | 3 | 3 | 1 | 5 | 10 | 3 | 7 | 7 | 2 | 1 | 2 | 4 | 10 | 4 | 7 | 9 | 10 | 8 | 4 | 9 | 4 | 2 | 10 | 7 | 1 | ⋯ |
7 | 2 | 7 | 5 | 10 | 1 | 10 | 5 | 3 | 1 | 4 | 3 | 4 | 10 | 10 | 7 | 7 | 8 | 2 | 10 | 9 | 5 | 7 | 10 | 1 | 2 | 7 | 2 | 7 | 5 | 10 | 7 | 6 | 1 | 5 | 7 | 2 | 2 | 6 | 5 | 8 | 4 | 5 | 7 | 1 | 6 | 2 | 10 | 4 | 5 | 9 | 1 | 2 | 3 | 6 | 9 | 5 | 8 | 3 | 2 | 9 | 6 | 8 | 6 | 2 | 3 | 9 | 4 | 5 | 3 | 2 | 5 | 7 | 5 | 4 | 7 | 6 | 3 | 4 | 1 | 1 | 10 | 1 | 8 | 4 | 1 | 6 | 3 | 5 | 10 | 4 | 7 | 2 | 8 | 7 | 2 | 6 | 1 | 1 | 8 | 2 | ⋯ |
8 | 4 | 1 | 8 | 3 | 9 | 6 | 5 | 10 | 1 | 7 | 7 | 3 | 3 | 9 | 8 | 7 | 9 | 5 | 5 | 9 | 2 | 7 | 10 | 7 | 7 | 1 | 6 | 2 | 5 | 3 | 6 | 2 | 5 | 3 | 8 | 1 | 10 | 9 | 10 | 4 | 10 | 7 | 4 | 8 | 3 | 5 | 1 | 8 | 10 | 3 | 1 | 4 | 9 | 5 | 8 | 2 | 10 | 6 | 8 | 8 | 3 | 4 | 7 | 2 | 3 | 6 | 10 | 7 | 8 | 6 | 8 | 8 | 6 | 8 | 10 | 7 | 10 | 4 | 10 | 8 | 10 | 9 | 2 | 2 | 8 | 7 | 3 | 2 | 10 | 6 | 2 | 8 | 2 | 4 | 9 | 7 | 3 | 10 | 1 | 9 | ⋯ |
9 | 5 | 7 | 6 | 2 | 1 | 8 | 9 | 6 | 5 | 7 | 4 | 5 | 2 | 1 | 9 | 4 | 1 | 1 | 8 | 4 | 1 | 9 | 2 | 8 | 4 | 4 | 1 | 4 | 10 | 10 | 1 | 7 | 1 | 9 | 3 | 5 | 3 | 6 | 6 | 8 | 4 | 1 | 10 | 2 | 4 | 4 | 1 | 7 | 8 | 5 | 1 | 1 | 10 | 5 | 2 | 1 | 2 | 7 | 8 | 5 | 7 | 8 | 5 | 6 | 2 | 5 | 9 | 2 | 7 | 6 | 6 | 3 | 2 | 8 | 9 | 5 | 2 | 10 | 6 | 6 | 3 | 10 | 8 | 8 | 10 | 6 | 5 | 8 | 3 | 2 | 3 | 3 | 6 | 9 | 3 | 9 | 6 | 2 | 3 | 7 | ⋯ |
10 | 10 | 2 | 2 | 3 | 2 | 10 | 8 | 8 | 7 | 7 | 5 | 7 | 2 | 3 | 10 | 10 | 9 | 3 | 7 | 8 | 1 | 4 | 1 | 9 | 7 | 8 | 2 | 9 | 10 | 8 | 10 | 6 | 2 | 10 | 6 | 8 | 5 | 3 | 10 | 9 | 1 | 5 | 4 | 4 | 2 | 4 | 9 | 9 | 6 | 5 | 4 | 3 | 1 | 8 | 10 | 6 | 7 | 9 | 8 | 6 | 6 | 4 | 6 | 8 | 8 | 3 | 3 | 6 | 2 | 8 | 4 | 5 | 10 | 10 | 1 | 8 | 9 | 9 | 1 | 3 | 6 | 8 | 5 | 8 | 10 | 2 | 10 | 4 | 1 | 7 | 5 | 10 | 2 | 9 | 5 | 2 | 3 | 9 | 10 | 9 | ⋯ |
GzipCompressorStream comes from CodecZlib
open("df_compress_test.csv.gz", "w") do io
stream = GzipCompressorStream(io)
CSV.write(stream, df)
close(stream)
end
df2 = CSV.File(transcode(GzipDecompressor, Mmap.mmap("df_compress_test.csv.gz"))) |> DataFrame
Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 | x32 | x33 | x34 | x35 | x36 | x37 | x38 | x39 | x40 | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 | x53 | x54 | x55 | x56 | x57 | x58 | x59 | x60 | x61 | x62 | x63 | x64 | x65 | x66 | x67 | x68 | x69 | x70 | x71 | x72 | x73 | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 | x84 | x85 | x86 | x87 | x88 | x89 | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 | x100 | ⋯ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | ⋯ | |
1 | 5 | 7 | 10 | 10 | 8 | 6 | 9 | 3 | 1 | 9 | 4 | 7 | 4 | 2 | 7 | 6 | 4 | 7 | 8 | 6 | 2 | 4 | 5 | 5 | 6 | 4 | 3 | 4 | 4 | 7 | 2 | 9 | 8 | 9 | 8 | 1 | 2 | 9 | 6 | 9 | 4 | 1 | 5 | 5 | 9 | 9 | 10 | 9 | 1 | 9 | 8 | 10 | 6 | 7 | 4 | 1 | 10 | 5 | 1 | 6 | 7 | 1 | 1 | 3 | 5 | 7 | 4 | 8 | 2 | 5 | 9 | 6 | 5 | 1 | 7 | 1 | 9 | 1 | 2 | 7 | 8 | 8 | 8 | 8 | 1 | 3 | 10 | 1 | 9 | 3 | 8 | 3 | 8 | 1 | 7 | 8 | 3 | 6 | 3 | 8 | ⋯ |
2 | 5 | 8 | 2 | 7 | 6 | 6 | 5 | 4 | 2 | 4 | 4 | 5 | 6 | 1 | 3 | 6 | 7 | 3 | 1 | 10 | 7 | 9 | 7 | 6 | 9 | 2 | 9 | 8 | 8 | 9 | 7 | 7 | 4 | 10 | 1 | 6 | 2 | 1 | 9 | 1 | 9 | 8 | 4 | 6 | 1 | 10 | 10 | 10 | 5 | 1 | 2 | 3 | 9 | 6 | 4 | 1 | 9 | 10 | 4 | 4 | 3 | 3 | 8 | 4 | 3 | 1 | 1 | 2 | 10 | 8 | 7 | 3 | 6 | 5 | 5 | 10 | 9 | 10 | 8 | 3 | 10 | 5 | 3 | 9 | 1 | 7 | 4 | 10 | 4 | 3 | 3 | 1 | 2 | 5 | 10 | 9 | 7 | 4 | 10 | 1 | ⋯ |
3 | 1 | 3 | 2 | 7 | 8 | 9 | 10 | 4 | 2 | 9 | 3 | 7 | 8 | 2 | 1 | 4 | 1 | 4 | 1 | 9 | 3 | 8 | 6 | 3 | 6 | 10 | 6 | 8 | 4 | 3 | 7 | 10 | 10 | 2 | 7 | 1 | 6 | 5 | 6 | 9 | 4 | 1 | 8 | 9 | 10 | 6 | 7 | 8 | 2 | 5 | 9 | 7 | 2 | 10 | 3 | 10 | 8 | 9 | 3 | 3 | 4 | 4 | 2 | 5 | 1 | 3 | 10 | 4 | 8 | 8 | 4 | 4 | 4 | 5 | 1 | 3 | 9 | 9 | 2 | 2 | 3 | 10 | 3 | 5 | 8 | 6 | 9 | 3 | 3 | 4 | 1 | 2 | 7 | 5 | 3 | 2 | 6 | 7 | 9 | 7 | ⋯ |
4 | 7 | 7 | 8 | 2 | 8 | 8 | 3 | 6 | 7 | 6 | 2 | 9 | 4 | 10 | 1 | 10 | 9 | 5 | 5 | 2 | 8 | 7 | 10 | 2 | 4 | 8 | 4 | 9 | 9 | 10 | 8 | 9 | 3 | 3 | 1 | 9 | 3 | 1 | 5 | 2 | 1 | 3 | 2 | 8 | 9 | 2 | 8 | 9 | 5 | 3 | 7 | 3 | 9 | 9 | 7 | 8 | 8 | 4 | 10 | 5 | 7 | 4 | 3 | 6 | 9 | 3 | 3 | 10 | 3 | 9 | 9 | 5 | 5 | 9 | 10 | 4 | 7 | 2 | 1 | 3 | 3 | 6 | 10 | 5 | 5 | 3 | 9 | 4 | 4 | 4 | 3 | 10 | 6 | 1 | 8 | 2 | 10 | 4 | 3 | 1 | ⋯ |
5 | 4 | 4 | 10 | 9 | 10 | 7 | 4 | 7 | 7 | 5 | 9 | 6 | 8 | 4 | 6 | 10 | 1 | 5 | 4 | 10 | 9 | 9 | 8 | 8 | 5 | 5 | 2 | 2 | 10 | 3 | 4 | 6 | 5 | 8 | 9 | 2 | 4 | 9 | 10 | 2 | 7 | 4 | 4 | 1 | 9 | 2 | 1 | 1 | 6 | 4 | 3 | 8 | 4 | 9 | 1 | 10 | 1 | 4 | 7 | 3 | 2 | 5 | 3 | 6 | 9 | 9 | 2 | 9 | 2 | 8 | 3 | 5 | 5 | 2 | 10 | 1 | 8 | 9 | 2 | 9 | 5 | 3 | 10 | 3 | 9 | 4 | 2 | 5 | 10 | 4 | 3 | 7 | 2 | 6 | 6 | 1 | 10 | 4 | 4 | 4 | ⋯ |
6 | 1 | 1 | 7 | 3 | 4 | 8 | 4 | 7 | 4 | 6 | 6 | 6 | 4 | 6 | 8 | 3 | 5 | 8 | 4 | 9 | 5 | 2 | 5 | 10 | 1 | 8 | 8 | 7 | 3 | 1 | 8 | 7 | 3 | 1 | 8 | 6 | 8 | 1 | 6 | 3 | 4 | 9 | 7 | 4 | 3 | 10 | 2 | 10 | 7 | 6 | 3 | 2 | 9 | 7 | 1 | 8 | 8 | 5 | 6 | 2 | 9 | 2 | 10 | 6 | 4 | 6 | 6 | 7 | 6 | 8 | 10 | 7 | 8 | 9 | 8 | 3 | 3 | 1 | 5 | 10 | 3 | 7 | 7 | 2 | 1 | 2 | 4 | 10 | 4 | 7 | 9 | 10 | 8 | 4 | 9 | 4 | 2 | 10 | 7 | 1 | ⋯ |
7 | 2 | 7 | 5 | 10 | 1 | 10 | 5 | 3 | 1 | 4 | 3 | 4 | 10 | 10 | 7 | 7 | 8 | 2 | 10 | 9 | 5 | 7 | 10 | 1 | 2 | 7 | 2 | 7 | 5 | 10 | 7 | 6 | 1 | 5 | 7 | 2 | 2 | 6 | 5 | 8 | 4 | 5 | 7 | 1 | 6 | 2 | 10 | 4 | 5 | 9 | 1 | 2 | 3 | 6 | 9 | 5 | 8 | 3 | 2 | 9 | 6 | 8 | 6 | 2 | 3 | 9 | 4 | 5 | 3 | 2 | 5 | 7 | 5 | 4 | 7 | 6 | 3 | 4 | 1 | 1 | 10 | 1 | 8 | 4 | 1 | 6 | 3 | 5 | 10 | 4 | 7 | 2 | 8 | 7 | 2 | 6 | 1 | 1 | 8 | 2 | ⋯ |
8 | 4 | 1 | 8 | 3 | 9 | 6 | 5 | 10 | 1 | 7 | 7 | 3 | 3 | 9 | 8 | 7 | 9 | 5 | 5 | 9 | 2 | 7 | 10 | 7 | 7 | 1 | 6 | 2 | 5 | 3 | 6 | 2 | 5 | 3 | 8 | 1 | 10 | 9 | 10 | 4 | 10 | 7 | 4 | 8 | 3 | 5 | 1 | 8 | 10 | 3 | 1 | 4 | 9 | 5 | 8 | 2 | 10 | 6 | 8 | 8 | 3 | 4 | 7 | 2 | 3 | 6 | 10 | 7 | 8 | 6 | 8 | 8 | 6 | 8 | 10 | 7 | 10 | 4 | 10 | 8 | 10 | 9 | 2 | 2 | 8 | 7 | 3 | 2 | 10 | 6 | 2 | 8 | 2 | 4 | 9 | 7 | 3 | 10 | 1 | 9 | ⋯ |
9 | 5 | 7 | 6 | 2 | 1 | 8 | 9 | 6 | 5 | 7 | 4 | 5 | 2 | 1 | 9 | 4 | 1 | 1 | 8 | 4 | 1 | 9 | 2 | 8 | 4 | 4 | 1 | 4 | 10 | 10 | 1 | 7 | 1 | 9 | 3 | 5 | 3 | 6 | 6 | 8 | 4 | 1 | 10 | 2 | 4 | 4 | 1 | 7 | 8 | 5 | 1 | 1 | 10 | 5 | 2 | 1 | 2 | 7 | 8 | 5 | 7 | 8 | 5 | 6 | 2 | 5 | 9 | 2 | 7 | 6 | 6 | 3 | 2 | 8 | 9 | 5 | 2 | 10 | 6 | 6 | 3 | 10 | 8 | 8 | 10 | 6 | 5 | 8 | 3 | 2 | 3 | 3 | 6 | 9 | 3 | 9 | 6 | 2 | 3 | 7 | ⋯ |
10 | 10 | 2 | 2 | 3 | 2 | 10 | 8 | 8 | 7 | 7 | 5 | 7 | 2 | 3 | 10 | 10 | 9 | 3 | 7 | 8 | 1 | 4 | 1 | 9 | 7 | 8 | 2 | 9 | 10 | 8 | 10 | 6 | 2 | 10 | 6 | 8 | 5 | 3 | 10 | 9 | 1 | 5 | 4 | 4 | 2 | 4 | 9 | 9 | 6 | 5 | 4 | 3 | 1 | 8 | 10 | 6 | 7 | 9 | 8 | 6 | 6 | 4 | 6 | 8 | 8 | 3 | 3 | 6 | 2 | 8 | 4 | 5 | 10 | 10 | 1 | 8 | 9 | 9 | 1 | 3 | 6 | 8 | 5 | 8 | 10 | 2 | 10 | 4 | 1 | 7 | 5 | 10 | 2 | 9 | 5 | 2 | 3 | 9 | 10 | 9 | ⋯ |
df == df2
true
Using zip files#
Sometimes you may have files compressed inside a zip file.
In such a situation you may use ZipFile.jl in conjunction an an appropriate reader to read the files.
Here we first create a ZIP file and then read back its contents into a DataFrame
.
df1 = DataFrame(rand(1:10, 3, 4), :auto)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 6 | 2 | 7 | 9 |
2 | 4 | 10 | 8 | 9 |
3 | 5 | 8 | 4 | 2 |
df2 = DataFrame(rand(1:10, 3, 4), :auto)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 8 | 9 | 3 | 2 |
2 | 8 | 1 | 6 | 8 |
3 | 4 | 1 | 10 | 10 |
And we show yet another way to write a DataFrame
into a CSV file:
Writing a CSV file into the zip file
w = ZipFile.Writer("x.zip")
f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))
# write a second CSV file into zip file
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))
close(w)
Now we read the compressed CSV file we have written:
z = ZipFile.Reader("x.zip");
# find the index index of file called x1.csv
index_xcsv = findfirst(x -> x.name == "x1.csv", z.files)
# to read the x1.csv file in the zip file
df1_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 6 | 2 | 7 | 9 |
2 | 4 | 10 | 8 | 9 |
3 | 5 | 8 | 4 | 2 |
df1_2 == df1
true
# find the index index of file called x2.csv
index_xcsv = findfirst(x -> x.name == "x2.csv", z.files)
# to read the x2.csv file in the zip file
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
Row | x1 | x2 | x3 | x4 |
---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | |
1 | 8 | 9 | 3 | 2 |
2 | 8 | 1 | 6 | 8 |
3 | 4 | 1 | 10 | 10 |
df2_2 == df2
true
Note that once you read a given file from z
object its stream is all used-up (reaching its end). Therefore to read it again you need to close the file object z
and open it again.
Also do not forget to close the zip file once you are done.
close(z)
Remove generated files
rm("x.arrow")
rm("x.bin")
rm("x.zip")
rm("x.jlso")
rm("x1.csv")
rm("x1.json")
rm("x2.json")
rm("x.jdf", recursive=true)
rm("bigdf.jdf", recursive=true)
rm("df_compress_test.csv.gz")
rm("bigdf1.json")
rm("bigdf1.csv")
rm("bigdf2.json")
rm("bigdf.jlso")
rm("bigdf.bin")
rm("bigdf.arrow")
This notebook was generated using Literate.jl.