jean-philippe-martin
A CountBytes example that reads through the provided file (using NIO), counting the bytes.

It also reports elapsed time, allowing for a simple form of benchmarking.

Amusingly, it currently says that the file sizes are off by one: it sees one fewer byte than Files.size() reports. This is also noticeably slower than gsutil (perhaps because it's single-threaded?).

Sample output:

$ target/appassembler/bin/CountBytes <redacted>/dbsnp_138.b37.1.1-65M.vcf
<redacted>/dbsnp_138.b37.1.1-65M.vcf: 237432747 bytes.
Reading the whole file...
Read all 237432746 bytes in 8s. (6 calls to chan.read)
Wait, this doesn't match! We saw 237432746 bytes, yet the file size is listed at 237432747 bytes.

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 2, 2016
@mziccard
Contributor

mziccard commented May 3, 2016

Let me start with general comments:

Amusingly, it currently says that the file sizes are off by one: it sees one fewer byte than Files.size() reports.

This is not true. This error comes from the way you loop and compute total size:

long total = 0;
int readCalls = 0;
for (int read = 1; read > 0; ) {
  readCalls ++;
  read = chan.read(buf);
  buf.flip();
  total += read; // On the last call read is -1 (end of stream) and still gets added to total, hence the mismatch
}

In general the for loop seems a bit strange to me; I would have used something like this (not tested):

while (chan.read(buf) > 0) {
  readCalls++;
  total += buf.position();
  buf.flip();
}
readCalls++; // We must count the last call
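For reference, a minimal self-contained sketch of the same counting loop. The class and method names (CountBytesSketch, drain, Result) are made up for illustration and are not taken from the PR. It uses the return value of read() directly and skips the end-of-stream -1, and it calls buf.clear() rather than buf.flip() before each read, since flip() would shrink the buffer's limit to the number of bytes just read. It assumes a blocking channel, where read() does not return 0 for a buffer that has space left.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical names throughout; a sketch, not the code that was merged.
class CountBytesSketch {

  /** Bytes seen and number of read() calls made while draining a channel. */
  static final class Result {
    final long total;
    final int readCalls;

    Result(long total, int readCalls) {
      this.total = total;
      this.readCalls = readCalls;
    }
  }

  /** Reads the channel to end of stream, counting bytes and read() calls. */
  static Result drain(ReadableByteChannel chan, ByteBuffer buf) throws IOException {
    long total = 0;
    int readCalls = 0;
    while (true) {
      buf.clear(); // reset position and limit so the whole buffer is writable again
      int read = chan.read(buf);
      readCalls++; // counts the final call that reports end of stream, too
      if (read < 0) {
        break; // -1 means end of stream; it must not be added to the total
      }
      total += read;
    }
    return new Result(total, readCalls);
  }

  public static void main(String[] args) throws IOException {
    // An in-memory stream stands in for the real file here.
    byte[] data = new byte[10_000];
    try (ReadableByteChannel chan = Channels.newChannel(new ByteArrayInputStream(data))) {
      Result r = drain(chan, ByteBuffer.allocate(4096));
      System.out.println("Read all " + r.total + " bytes. (" + r.readCalls + " calls to chan.read)");
    }
  }
}
```

With this shape the total can be compared directly against Files.size() without an off-by-one.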

This is also noticeably slower than gsutil (perhaps because it's single-threaded?).

What does this mean? Is it slower at reading the size metadata or at downloading the file? Can you share some numbers?

@@ -0,0 +1,82 @@
package com.google.cloud.examples.nio;

@jean-philippe-martin
Author

jean-philippe-martin commented May 3, 2016

Amusingly, it currently says that the file sizes are off by one: it sees one fewer byte than Files.size() reports.
This is not true. This error comes from the way you loop and compute total size:

Thank you! Indeed it's obvious now.

This is also noticeably slower than gsutil (perhaps because it's single-threaded?).
What does this mean? Is it slower at reading the size metadata or at downloading the file? Can you share some numbers?

Downloading the file. Here's what I see:

$ time gsutil cp gs://<redacted>/dbsnp_138.b37.256m.vcf /tmp/
Copying gs://<redacted>/dbsnp_138.b37.256m.vcf...
Downloading file:///tmp/dbsnp_138.b37.256m.vcf:                  214.89 MiB/214.89 MiB    
Downloading file:///tmp/dbsnp_138.b37.256m.vcf:                  214.89 MiB/214.89 MiB    
Downloading file:///tmp/dbsnp_138.b37.256m.vcf:                  214.89 MiB/214.89 MiB    
Downloading file:///tmp/dbsnp_138.b37.256m.vcf:                  214.89 MiB/214.89 MiB    

real    0m11.984s
user    0m7.618s
sys     0m4.513s
$ target/appassembler/bin/CountBytes gs://<redacted>dbsnp_138.b37.256m.vcf
gs://<redacted>/dbsnp_138.b37.256m.vcf: 901305932 bytes.
Reading the whole file...
Read all 901305931 bytes in 42s. (19 calls to chan.read)

We can see that gsutil takes roughly a quarter of the time (about 12s versus 42s). The way it prints its progress suggests that it did 4 parallel downloads (each a quarter of the file size).
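To illustrate the idea of splitting a read into parallel ranges, here is a rough sketch using plain NIO. It is an assumption that this resembles what gsutil does internally. The names (ParallelCountSketch, countRange, countParallel) are hypothetical, and the sketch is written against a local path. Whether it speeds things up for a gs:// path depends on the provider's seek support, which is not checked here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical names throughout; a sketch of ranged parallel reads, not gsutil's algorithm.
class ParallelCountSketch {

  /** Reads bytes [start, end) through a dedicated channel and returns how many were seen. */
  static long countRange(Path path, long start, long end) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
    long seen = 0;
    try (SeekableByteChannel chan = Files.newByteChannel(path)) {
      chan.position(start);
      while (seen < end - start) {
        buf.clear();
        // Never ask for more than what is left in this range.
        buf.limit((int) Math.min(buf.capacity(), end - start - seen));
        int read = chan.read(buf);
        if (read < 0) {
          break; // file ended earlier than its reported size
        }
        seen += read;
      }
    }
    return seen;
  }

  /** Splits the file into `workers` contiguous ranges and reads them concurrently. */
  static long countParallel(Path path, int workers) throws Exception {
    long size = Files.size(path);
    long chunk = (size + workers - 1) / workers; // ceiling division
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      List<Future<Long>> parts = new ArrayList<>();
      for (int i = 0; i < workers; i++) {
        final long start = Math.min(size, i * chunk);
        final long end = Math.min(size, start + chunk);
        parts.add(pool.submit(() -> countRange(path, start, end)));
      }
      long total = 0;
      for (Future<Long> part : parts) {
        total += part.get(); // rethrows any read failure from a worker
      }
      return total;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: ParallelCountSketch <local-file>");
      return;
    }
    Path path = Paths.get(args[0]);
    long start = System.nanoTime();
    long total = countParallel(path, 4);
    long seconds = (System.nanoTime() - start) / 1_000_000_000L;
    System.out.println("Read all " + total + " bytes in " + seconds + "s using 4 workers.");
  }
}
```

Each worker opens its own channel so that positions do not interfere with one another. The sum of the per-range counts should equal Files.size().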

Added licence text, plus some esthetic changes.
/**
* CountBytes will read through the whole file given as input.
*
* <p>It's meant for testing that NIO doesn't read too slowly.

@jean-philippe-martin
Author

Are we OK to merge?

@mziccard mziccard merged commit 7ea0fc8 into googleapis:gcs-nio May 4, 2016
@mziccard
Contributor

mziccard commented May 4, 2016

@jean-philippe-martin just merged. As usual: thanks!

@jean-philippe-martin
Author

You are welcome!