Beware of using java.util.Scanner with “\z”

17 12 2011

Various articles and blog posts suggest that using Scanner with a “\z” delimiter is an easy way to read an entire file in one go (“\z” being the regular expression for “end of input”).

Some examples are:

Because a single read with “\z” as the delimiter should read everything up to “end of input”, it’s tempting to do just one read and leave it at that, as the examples listed above all do.

In most cases that’s OK, but I’ve found at least one situation where reading to “end of input” doesn’t read the entire input: when the input is a SequenceInputStream, each of the constituent InputStreams appears to give a separate “end of input” of its own. As a result, a single read with a delimiter of “\z” returns the content of the first of the SequenceInputStream’s constituent streams, but doesn’t read on into the rest of them.
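For anyone who wants to check this on their own JDK, here is a minimal sketch of the kind of test described above. The class name SingleReadDemo, the helper singleRead and the sample strings are all hypothetical names made up for illustration. What it prints may depend on the JDK in use, so no expected output is shown.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class, not taken from any of the articles mentioned above.
final class SingleReadDemo {

    // Performs exactly one read with "\z" as the delimiter.
    // (In a Java string literal the backslash must be doubled.)
    static String singleRead(InputStream in) {
        Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\z");
        try {
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            scanner.close(); // also closes the underlying stream
        }
    }

    public static void main(String[] args) {
        InputStream first =
            new ByteArrayInputStream("first part, ".getBytes(StandardCharsets.UTF_8));
        InputStream second =
            new ByteArrayInputStream("second part".getBytes(StandardCharsets.UTF_8));

        String result = singleRead(new SequenceInputStream(first, second));

        // Compare this with the full content, "first part, second part".
        System.out.println("Single read returned: [" + result + "]");
    }
}
```

If the text printed between the brackets stops after the first stream’s content, the JDK in use shows the behaviour described here.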

At any rate, that’s what I get on Oracle JDK 5, 6 and 7.

This might be a quirk or bug in Scanner, SequenceInputStream, regular expression processing, or how “end of input” is detected, or it might be some subtlety in the meaning of “\z” that I’m not privy to. Equally, there might be other types of InputStream with constituent sub-components that each report a separate “end of input”. But whatever the underlying reasons and scope of this problem, it seems safest never to assume that a single read delimited by “\z” will read the whole of an input stream.

So if you really want to use Scanner to read the whole of something, I’d recommend that even when using “\z” you still repeat the read until the Scanner’s hasNext() returns false (even though that rather reduces the attraction of using Scanner for this, as opposed to some other more direct approach to reading through the whole of the input).
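A minimal sketch of that repeated read, assuming UTF-8 input; the class name ScannerReadAll and the method readAll are hypothetical names chosen for illustration. Because “\z” matches a zero-length position rather than any characters, nothing is dropped between the pieces, and appending them in order should rebuild the whole input.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class illustrating the "keep reading" approach.
final class ScannerReadAll {

    // Reads with "\z" as the delimiter until hasNext() reports false,
    // appending each piece rather than trusting a single next() call.
    static String readAll(InputStream in) {
        StringBuilder content = new StringBuilder();
        Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\z");
        try {
            while (scanner.hasNext()) {
                content.append(scanner.next());
            }
        } finally {
            scanner.close(); // also closes the underlying stream
        }
        return content.toString();
    }

    public static void main(String[] args) {
        InputStream first =
            new ByteArrayInputStream("first part, ".getBytes(StandardCharsets.UTF_8));
        InputStream second =
            new ByteArrayInputStream("second part".getBytes(StandardCharsets.UTF_8));

        System.out.println(readAll(new SequenceInputStream(first, second)));
    }
}
```

This sketch ignores any IOException that the Scanner may have stored along the way; a caller who cares about that could check scanner.ioException() before closing.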



2 responses

10 01 2012
Jacob Kjome

Have you tried using “\\A” as the delimiter? I haven’t verified whether it resolves issues with “SequenceInputStream” or not, but I thought I’d mention it. I came across this technique by reading a blog post [1] by Pat Niemeyer. He states “this effectively tells Scanner to tokenize the entire stream, from beginning to (illogical) next beginning”. So…

String text = new Scanner( source ).useDelimiter("\\A").next();


10 01 2012

Thanks Jacob.

I’d also experimented with \Z (end of input apart from final terminator, if any), just to see if that was any different (it isn’t). But I’d not seen the \A idea before… interesting, if somewhat unintuitive!

So just out of interest I set up a quick experiment to try \A, and it does seem to work correctly even with a SequenceInputStream (tried this on Oracle JDK 5, 6 and 7).

However, a quick run of some other tests seems to show that using \A throws a “NoSuchElementException” (with no message or cause) if an IOException occurs at the start of the input or when trying to close the input. This contrasts with the more usual behavior of storing the IOException for subsequent checking/retrieval and treating the input as terminated (but without then treating that as a separate and immediate “not found” exception).

So a single \A does seem to work, at least on the particular thing I’ve seen a single \z fail on, but it may need somewhat different error handling and may result in less clear exception messages in some situations.
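One possible way to handle that, as a rough sketch only: guard the read with hasNext() so that an empty or failed input doesn’t surface as a bare NoSuchElementException, then ask the Scanner for any IOException it stored. The class name ScannerBeginningDelimiter and the method readWhole are hypothetical names for illustration, and UTF-8 is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class; one way to wrap the "\A" technique, not the only one.
final class ScannerBeginningDelimiter {

    // Reads with "\A" as the delimiter and rethrows any stored IOException.
    static String readWhole(InputStream in) throws IOException {
        Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A");
        try {
            // hasNext() is false for empty input or when the first read failed,
            // so next() is only called when a token is actually available.
            String text = scanner.hasNext() ? scanner.next() : "";

            // Scanner swallows IOExceptions from the underlying stream;
            // ioException() returns the most recent one, or null if none.
            IOException failure = scanner.ioException();
            if (failure != null) {
                throw failure;
            }
            return text;
        } finally {
            scanner.close(); // also closes the underlying stream
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in =
            new ByteArrayInputStream("some text".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWhole(in));
    }
}
```

Note that this checks for a stored exception before the Scanner is closed, so an IOException raised by close() itself would still go unreported here.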

Anyway, after seeing what happens with \z, I still don’t think I’d entirely trust a single \A read either!
