/**************************************************************************
Copyright (c) 2002 John Henckel, johnh@formulus.com

Permission to use, copy, modify, distribute and sell this software
and its documentation for any purpose is hereby granted without fee,
provided that the above copyright notice appear in all copies.

All programs contained herein are provided to you "as is".
The implied warranties of merchantability and fitness for a particular
purpose are expressly disclaimed.

Please send comments or questions to johnh@formulus.com
*/

import java.util.*;   // for collections
import java.io.*;
import java.net.*;    // for URL

/**************************************************************************

URLReader

This is a class to facilitate reading data from web files.

Quick and dirty way to use URLReader

The simplest way to use URLReader is with one of the following static methods. You just call the method and get the data from the resource. If the resource is text (like HTML), use one of the string methods; otherwise use the bytes method.

urlToString(urlstring) - read a web page into a string.
urlToBytes(urlstring) - read data (like an image) into a byte array.
getURL(urlString, start, end) - read a substring of a web page.

That's it! That's all you need to know. If you want to do something fancier, like passing cookies, setting the referer, or using the POST method, then keep reading.

The more complicated way to use a URLReader object

A URLReader object can be in one of these states

zombie - doesn't have a valid URL
ready - has url, but isn't connected yet
open - the data can be read from the resource
done - end of data, or any kind of error
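
As a rough illustration, the four states can be derived from a few facts about the object. The sketch below is not part of URLReader: the names UrlReaderStateSketch, State and stateOf are hypothetical, and the "finished" flag is an assumption made only to model the "done" state.

```java
// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class UrlReaderStateSketch {
    enum State { ZOMBIE, READY, OPEN, DONE }

    // hasUrl:    a valid URL has been set
    // connected: a connection object is currently held
    // finished:  end of data was reached or an error occurred (assumed flag)
    static State stateOf(boolean hasUrl, boolean connected, boolean finished) {
        if (!hasUrl) return State.ZOMBIE;       // nothing to connect to
        if (!connected) return State.READY;     // has url, not connected yet
        return finished ? State.DONE : State.OPEN;
    }

    public static void main(String[] args) {
        System.out.println(stateOf(false, false, false)); // ZOMBIE
        System.out.println(stateOf(true, false, false));  // READY
        System.out.println(stateOf(true, true, false));   // OPEN
        System.out.println(stateOf(true, true, true));    // DONE
    }
}
```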

These are the steps to create and use a URLReader object.

create the object
call as many "set" methods as you like, such as setRequestProperty, setIfModifiedSince, setPost, setURL, or setRequestMethod.
now the URLReader object is "ready" to be connected
call the connect() method and check the status code (this step is optional)
call any of the "get" methods, such as getResponse-, getHeader-, getContent-, getBufferedReader, or getInputStream.
after you are done getting the resource, you can throw it away, or you can call disconnect. After you call disconnect the URLReader object is once again in "ready" state.

After disconnect, you can reconnect and read the data again (and again...), or you can change the URL and read different data. Just remember that everything you set, like postdata or useCaches, stays that way unless you change it! If you try to read data from a zombie, the "get" methods return null (getResponseCode returns -1, and getContentString, getContentSubstring and getContentBytes return an empty string or an empty array).

To open the URLReader, call the connect() method. Any of the "get" methods (getResponseCode, getHeaderField, etc.) will also connect automatically if the URLReader is ready. The ready() method returns true if the URLReader is not connected but is ready to connect. After a connect, the disconnect() method restores the ready state.

You might ask yourself "why do we need this stuff? aren't the URL and the HttpURLConnection classes good enough?" Well, yes they are almost good enough. Sun's URLConnection classes have three major flaws that are fixed by this class.

URLConnection cannot retry. You get one chance and if 400 or 504 is returned, then TOO BAD. If you want to try again you have to throw away the URLConnection and start over with a new one.
URLConnection does not maintain cookies. This is especially a problem if followRedirects is enabled, because any cookies that were set during the redirect will be lost.
URLConnection.getHeaderField(String) method does not return multiple values. If a header appears more than once, only the last one is returned. The URLReader fixes this problem.
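
The third point can be pictured with a small stand-alone sketch. It does not use URLReader or URLConnection; the class name HeaderJoinSketch, the method joinValues and the array-of-pairs input are assumptions made for illustration. It joins every value of a repeated header with "; ", matching names without regard to case.

```java
// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class HeaderJoinSketch {
    // headers: rows of {name, value} in the order they were received.
    // Returns all values for key joined by "; ", or null if key is absent.
    static String joinValues(String[][] headers, String key) {
        StringBuilder b = new StringBuilder();
        boolean found = false;
        for (String[] h : headers) {
            if (key.equalsIgnoreCase(h[0])) {
                if (found) b.append("; ");   // delimiter between values
                b.append(h[1]);
                found = true;
            }
        }
        return found ? b.toString() : null;
    }

    public static void main(String[] args) {
        String[][] headers = {
            {"Set-Cookie", "a=1"},
            {"Content-Type", "text/html"},
            {"set-cookie", "b=2"},
        };
        System.out.println(joinValues(headers, "Set-Cookie")); // a=1; b=2
    }
}
```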

A note about SSL: this class does not have any special code for SSL, but it can handle SSL connections just fine. All you need to do is use a URL that starts with "https:". Make sure you have installed JSSE and configured it. The configuration must add an "https" URLStreamHandler implementation to the pkgs list:

  System.setProperty("java.protocol.handler.pkgs",
                     "com.sun.net.ssl.internal.www.protocol");

This will enable the URLStreamHandlerFactory to know how to handle https protocol. The result of "openConnection" will be an instance of HttpsURLConnection which is a subclass of HttpURLConnection.

*/ public class URLReader { /** These are some counters used to evaluate the success of opening urls. Because increment is not threaded-safe these numbers might not be entirely correct, but for statistical purposes they are still ok. */ private static int open_count = 0; private static int fail_count = 0; private static int retry_count = 0; /** Default values */ private static int default_retry = 2; private static String default_user_agent = "Mozilla/4.0 (compatible; MSIE 5.51)"; private static boolean default_set_referer = true; /** Private instance data */ private URL url; private HttpURLConnection uconn; private boolean followRedirects, useCaches, allowUserInteraction; private boolean captureCookies; private String requestMethod; // either "GET", "POST", or "HEAD" private String postdata; private Properties requestProperties; private long ifmodifiedsince; // datestamp /************************************************************************** Default constructor. All other constructors must call this constructor. */ public URLReader() { url = null; uconn = null; followRedirects = true; captureCookies = true; useCaches = false; allowUserInteraction = true; requestMethod = "GET"; postdata = null; requestProperties = new Properties(); ifmodifiedsince = 0; } /************************************************************************** Construct from URL string */ public URLReader(String urlstring) { this(); // call the default ctor try { url = new URL(urlstring); } catch (MalformedURLException mfu) { Log_warning("caught "+mfu); } } /************************************************************************** Construct from URL */ public URLReader(URL url) { this(); // call the default ctor this.url = url; } /************************************************************************** This method sets the url, the URLReader is disconnected if nec. 
*/ public void setURL(URL url) { disconnect(); this.url = url; } /************************************************************************** This method gets the url */ public URL getURL() { return url; } /************************************************************************** This method sets the followRedirects. Default is true. */ public void setFollowRedirects(boolean followRedirects) { this.followRedirects = followRedirects; } /************************************************************************** This method gets the followRedirects */ public boolean getFollowRedirects() { return followRedirects; } /************************************************************************** This method sets the useCaches. Default is false. Quite honestly I don't know what this does. I think it doesn't do anything. See the corresponding method in URLConnection. */ public void setUseCaches(boolean useCaches) { this.useCaches = useCaches; } /************************************************************************** This method gets the useCaches */ public boolean getUseCaches() { return useCaches; } /************************************************************************** This method sets the captureCookies. Default is true. When true, the URLReader will capture all the set-cookie headers and copy them to the "cookie" request property. In this way each URLReader object maintains a "cookie jar" into which it accumulates all the cookies that have been set using it. NOTE if you set captureCookies to false and redirect to true then any cookies that were set in the redirect response header are lost. However, if you set captureCookies to true, none are lost. The URLReader does not maintain a separate cookie jar for each host. All cookies go to the same list, in the "cookie" request property. The caller should clear the cookie jar, if the URL is changed to a different host. 
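
The single shared cookie jar, and the advice to clear it when the host changes, can be sketched as follows. This is a minimal illustration, not the code of this class: CookieJarSketch, capture, switchHost and cookieHeader are hypothetical names, and clearing the jar automatically on a host change is an assumption (URLReader leaves that to the caller).

```java
// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class CookieJarSketch {
    private String jar;   // value that would be sent as the "Cookie" header
    private String host;  // host the jar currently belongs to

    // Append one "Set-Cookie" response value to the jar.
    void capture(String setCookieValue) {
        if (setCookieValue == null) return;
        jar = (jar == null) ? setCookieValue : jar + "; " + setCookieValue;
    }

    // Assumed policy: empty the jar whenever the host changes.
    void switchHost(String newHost) {
        if (host != null && !host.equalsIgnoreCase(newHost)) jar = null;
        host = newHost;
    }

    String cookieHeader() { return jar; }

    public static void main(String[] args) {
        CookieJarSketch j = new CookieJarSketch();
        j.switchHost("a.example");
        j.capture("sid=1");
        j.capture("lang=en");
        System.out.println(j.cookieHeader()); // sid=1; lang=en
        j.switchHost("b.example");
        System.out.println(j.cookieHeader()); // null
    }
}
```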
*/ public void setCaptureCookies(boolean captureCookies) { this.captureCookies = captureCookies; } /************************************************************************** This method gets the captureCookies */ public boolean getCaptureCookies() { return captureCookies; } /************************************************************************** This method sets the allowUserInteraction. Default is true. */ public void setAllowUserInteraction(boolean allowUserInteraction) { this.allowUserInteraction = allowUserInteraction; } /************************************************************************** This method gets the allowUserInteraction */ public boolean getAllowUserInteraction() { return allowUserInteraction; } /************************************************************************** This method sets the request method "POST" "GET" or "HEAD". The default method is "GET". */ public void setRequestMethod(String m) { requestMethod = m; } /************************************************************************** This method gets the request method. */ public String getRequestMethod() { return requestMethod; } /************************************************************************** This method sets the date to test the modified date of the file. If the file has not been modified more recently than the date, the server may return status code = 304 "not modified". Set to zero to disable checking. */ public void setIfModifiedSince(long ifmodifiedsince) { this.ifmodifiedsince = ifmodifiedsince; } /************************************************************************** This method returns the date used for checking modified date. Zero = no checking. */ public long getIfModifiedSince() { return ifmodifiedsince; } /************************************************************************** This method sets the request method to "POST" and it also sets the data to be posted. Usually the data looks like "name1=value1&name2=value2&..." The data can be empty. 
After calling this method, you may change the request method to something else if you want, and the postdata will not be lost. I don't know why you'd want to do that, though. */ public void setPost(String postdata) { setRequestMethod("POST"); this.postdata = postdata; } /************************************************************************** This method adds a new request header. For example, you could set "Referer" to "http://w3.org/". Setting a header to null will remove it. Note: The key is normalized using the "capitalize()" method. Note: The URLConnection class might add extra headers if they are not specified. In particular, it will usually add "User-Agent", "Accept", and "Host". */ public void setRequestProperty(String key, String data) { if (data == null) requestProperties.remove(capitalize(key)); else requestProperties.setProperty(capitalize(key), data); } /************************************************************************** This method gets all the request header names. See note on @see #setRequestProperty(String,String) */ public Enumeration getRequestPropertyNames() { return requestProperties.propertyNames(); } /************************************************************************** This method gets the value of a request header. See note on @see #setRequestProperty(String,String) */ public String getRequestProperty(String key) { return requestProperties.getProperty(capitalize(key)); } /************************************************************************** This method determines if the URLReader is ready to connect. If it is not ready, they it might be because the URL is missing or not valid, or because the URLReader is already connected. */ public boolean ready() { return (url != null && uconn == null); } /************************************************************************** This method connects to the host. If already connected, this does nothing. To reconnect, you must first disconnect. 
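
The retry behavior of this method can be pictured with a minimal stand-alone sketch. It is not the implementation used here: RetrySketch and withRetry are hypothetical names, and the attempt is modelled as a function returning a status code so the sketch needs no network. A value of retry=0 gives one attempt, retry=2 gives up to three, and the loop stops at the first non-error code.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class RetrySketch {
    // attempt returns an HTTP-like status code, or -1 on an unexpected error.
    static int withRetry(int retry, IntSupplier attempt) {
        int rc = -1;
        for (int i = 0; i <= retry; ++i) {
            rc = attempt.getAsInt();
            if (rc >= 100 && rc < 400) break;  // not an error, stop retrying
        }
        return rc;                             // code of the last attempt
    }

    public static void main(String[] args) {
        int[] replies = {504, 504, 200};       // simulated server answers
        int[] calls = {0};
        int rc = withRetry(2, () -> replies[calls[0]++]);
        System.out.println(rc + " after " + calls[0] + " attempts"); // 200 after 3 attempts
    }
}
```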
@param retry  number of times to retry the connection.  For example,
              retry=0 means one try only, retry=2 means try up to 3 times.
@return the response code, or -1 if the url is not set or an unexpected
        error happened.
*/
  public int connect(int retry)
  {
    if (url == null) return -1;
    if (uconn != null)       // already connected
    {
      try { return uconn.getResponseCode(); }
      catch (IOException io) { Log_warning("getResponse caught "+io); }
      uconn = null;          // if exception caught, try to reconnect
    }

    // This is the "retry" loop to connect to the URL
    ++open_count;
    int redir_count = 0;
    int i, rc = -1;
    URL url2 = url;          // temporary url used for redirect
    for (i = 0; i <= retry; ++i)
    {
      // Reset at the top of each try (not in the loop increment), so that
      // the response code of the final try is what gets returned.
      rc = -1;

      // Create the URLConnection object.  The openConnection method
      // will invoke the URLStreamHandlerFactory which will construct
      // the correct subclass of URLConnection based on the protocol.
      try { uconn = (HttpURLConnection) url2.openConnection(); }
      catch (Throwable th)
      {
        // Exception - unable to create a connection
        Log_warning("open caught "+th);
        uconn = null;
        break;
      }

      // Set various properties of the request object
      try
      {
        // Per-connection setting; the static setFollowRedirects would
        // change the behavior of every HttpURLConnection in the JVM.
        uconn.setInstanceFollowRedirects(followRedirects && !captureCookies);
        uconn.setRequestMethod(requestMethod);
        uconn.setUseCaches(useCaches);
        uconn.setAllowUserInteraction(allowUserInteraction);
        if (ifmodifiedsince > 0)
          uconn.setIfModifiedSince(ifmodifiedsince);
      }
      catch (Exception exc)
      {
        // exception - unable to set properties
        Log_warning("set caught "+exc);
        uconn = null;
        break;
      }

      // Set the request header
      for (Enumeration e = getRequestPropertyNames(); e.hasMoreElements(); )
      {
        String key = (String) e.nextElement();
        uconn.setRequestProperty(key, getRequestProperty(key));
      }

      // Set default request header values
      if (default_set_referer && getRequestProperty("Referer")==null)
        uconn.setRequestProperty("Referer",
                                 url.getProtocol()+"://"+url.getHost()+"/");
      if (getRequestProperty("User-Agent")==null)
        uconn.setRequestProperty("User-Agent", default_user_agent);

      // Get ready to send the "POST" data
      if (postdata != null)
      {
        uconn.setDoOutput(true);
        uconn.setDoInput(true);
      }

      try
      {
        // Connect to the host
        uconn.connect();

        // Send the "POST" data to the host
        if (postdata != null)
        {
          PrintStream ps = new PrintStream(uconn.getOutputStream());
          ps.print(postdata);
          ps.flush();
        }

        // Get the response code from the host
        rc = uconn.getResponseCode();

        // Get the cookies from the host
        if (captureCookies) getSetCookie();

        // Follow redirects.  A redirect is a 301, 302, 303 or 307 response
        // that names the new address in the "Location" header.  (304 means
        // "not modified" and is not a redirect.)
        String loc = uconn.getHeaderField("Location");
        if (followRedirects && loc != null &&
            (rc == 301 || rc == 302 || rc == 303 || rc == 307))
        {
          if (++redir_count > 10)
            Log_warning("redirect count "+redir_count+" too large for "+url);
          else
          {
            url2 = new URL(url2, loc);   // also resolves a relative Location
            i = -1;                      // reset retry counter (the loop adds 1)
            continue;
          }
        }
      }
      catch (IOException ioe)
      {
        Log_warning("connect caught "+ioe);
        rc = -1;
        uconn = null;
        continue;
      }

      // If the response is not an error, then do not retry
      if (rc < 400 && rc >= 100) break;
    }
    if (uconn == null) ++fail_count;
    else if (i > 0) ++retry_count;       // succeeded after a retry
    return rc;
  }

/**************************************************************************
This method is used internally to copy values from the "set-cookie"
response header to the "cookie" request header.
*/
  private void getSetCookie()            // PRIVATE METHOD
  {
    String c2 = getHeaderField("Set-cookie");
    if (c2==null) return;
    String c1 = getRequestProperty("Cookie");
    if (c1 == null)
      setRequestProperty("Cookie", c2);
    else
      setRequestProperty("Cookie", c1 + "; " + c2);
  }

/**************************************************************************
This method converts a list of data like "key1=value1; key2=value2; ..."
into a Properties hash table.  Note: if any key appears more than once
then the last value will replace the first one.
@return a Properties object representing the key=value pairs in a string.
        Return null if cookie_jar is null.  When any parsing error occurs,
        an error message is added to the property list under the key
        "$error".
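
The parsing rules described above can be illustrated with a short stand-alone sketch. It is a simplified variant and not the method that follows: CookieListSketch and parse are hypothetical names, it returns a Map instead of a Properties object, it splits on ';' first, and on a piece without '=' it stores only that piece under "$error" and stops.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class CookieListSketch {
    // Parses "key1=value1; key2=value2; ..." into a map; last value wins.
    static Map<String, String> parse(String list) {
        if (list == null) return null;
        Map<String, String> out = new LinkedHashMap<>();
        for (String part : list.split(";")) {
            String p = part.trim();
            if (p.isEmpty()) continue;          // skip empty pieces
            int eq = p.indexOf('=');
            if (eq <= 0) {                      // missing key or equal sign
                out.put("$error", p);
                break;
            }
            out.put(p.substring(0, eq).trim(), p.substring(eq + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("a=1; b=2; a=3")); // {a=3, b=2}
    }
}
```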
*/ public static Properties parseCookieList(String cookie_jar) { int i,j,k; String key,val; Properties p; if (cookie_jar == null) return null; p = new Properties(); i = 0; while (true) { while (i < cookie_jar.length() && // skip leading whitespace (Character.isWhitespace(cookie_jar.charAt(i)) || cookie_jar.charAt(i) == ';')) ++i; if (i == cookie_jar.length()) // no more key=value pairs break; if ((j = cookie_jar.indexOf('=',i)) <= i) // missing equal sign { p.setProperty("$error", cookie_jar.substring(i)); break; } key = cookie_jar.substring(i, j).trim(); k = cookie_jar.indexOf(';', j); if (k > j) val = cookie_jar.substring(j + 1, k); else val = cookie_jar.substring(j + 1); // if (p.containsKey(key)) p.setProperty(key, p.getProperty(key) + " " + val); else p.setProperty(key, val); if (k <= j) break; // no more key=value pairs i = k + 1; } return p; } /************************************************************************** This method returns the response code from the URL connection. If the URLReader is ready, this method tries to connect. For a list of codes see w3.org @return the response code, or -1 if the url is not set or an unexpected error happened. */ public int getResponseCode() { if (ready()) return connect(default_retry); if (uconn != null) { try { return uconn.getResponseCode(); } catch (IOException ioe) { Log_warning("caught "+ioe); } } return -1; } /************************************************************************** This method returns the response message, like "Ok", or "Not Found". If the URLReader is ready, this method tries to connect. */ public String getResponseMessage() { if (ready()) connect(default_retry); if (uconn != null) { try { return uconn.getResponseMessage(); } catch (IOException ioe) { Log_warning("caught "+ioe); } } return null; } /************************************************************************** This method gets a header field from the http response. If the URLReader is ready, this method tries to connect.

NOTE this returns all the values!!  If any response header is repeated,
all the values of the headers are concatenated with the delimiter "; ".
This is different from the implementation of the URLConnection class,
which returns only the last value.
@return a list of values, delimited by "; ", or null if key not found.
*/
  public String getHeaderField(String key)
  {
    if (ready()) connect(default_retry);
    if (uconn == null || key == null) return null;
    StringBuffer b = new StringBuffer();
    String k;
    for (int i=1; (k = uconn.getHeaderFieldKey(i)) != null; ++i)
      if (key.equalsIgnoreCase(k))
      {
        b.append("; ");
        b.append(uconn.getHeaderField(i));
      }
    if (b.length() > 2) return b.substring(2);
    return null;
  }

/**************************************************************************
This method returns a key from the http response header.
If the URLReader is ready, this method tries to connect.
@param i the index of the key, 1=first, 2=second, etc.
@return the name of the i^th header, or null if there are fewer than i
        headers.
*/
  public String getHeaderFieldKey(int i)
  {
    if (ready()) connect(default_retry);
    if (uconn != null) return uconn.getHeaderFieldKey(i);
    return null;
  }

/**************************************************************************
This method returns the value from the http response header.
If the URLReader is ready, this method tries to connect.
@param i the index of the value, 1=first, 2=second, etc.
@return the value of the i^th header, or null if there are fewer than i
        headers.
*/
  public String getHeaderField(int i)
  {
    if (ready()) connect(default_retry);
    if (uconn != null) return uconn.getHeaderField(i);
    return null;
  }

/**************************************************************************
This method returns the contents as an object.  Read more about it
in the ContentHandlerFactory class.
If the URLReader is ready, this method tries to connect.
*/ public Object getContent() { if (ready()) connect(default_retry); if (uconn != null) { try { return uconn.getContent(); } catch (Exception e) { Log_warning("caught "+e); } } return null; } /************************************************************************** This reads all the data from a stream into a string. The stream data can be from any source, like a URL or a File or a socket.

This algorithm reads the data in chunks, as more data is read the chunks get larger. This makes it fast for either small streams or large ones. @return all the data in the stream is returned as a string @throws nothing ever */ public static String readAll(InputStream in) { if (in==null) return ""; StringBuffer sb = new StringBuffer(); int i, m, n = 0; byte[] chunk; try { m = in.available(); } catch (Exception e) { Log_warning("caught "+e); return ""; } if (m < 256) m = 256; chunk = new byte[m]; // allocate initial chunk buffer while (true) { if (++n == 3 && m < 16000) // increase chunk size every 3 up to 16K { n = 0; chunk = new byte[m *= 8]; } try { if ((i = in.read(chunk)) < 0) break; sb.append(new String(chunk, 0, i)); } catch (Exception e) { Log_warning("caught "+e); } } return sb.toString(); } /************************************************************************** This reads all the data from a stream into an array of bytes. The stream data can be from any source, like a URL or a File or a socket.
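
Reading a whole stream with a chunk buffer that grows as more data arrives can be sketched as below. This is an illustration under assumptions, not the method that follows: ReadAllSketch is a hypothetical name, the result is collected in a ByteArrayOutputStream, the initial chunk size is fixed at 256, and reading stops at the first I/O error.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class ReadAllSketch {
    static byte[] readAllBytes(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (in == null) return out.toByteArray();   // empty array
        int size = 256;                             // initial chunk size
        int reads = 0;
        byte[] chunk = new byte[size];
        try {
            int n;
            while ((n = in.read(chunk)) >= 0) {
                out.write(chunk, 0, n);
                // every 3 reads, grow the chunk 8 times, until it passes 16000
                if (++reads % 3 == 0 && size < 16000)
                    chunk = new byte[size *= 8];
            }
        } catch (IOException e) {
            // assumed policy: keep whatever was read before the error
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[5000];
        byte[] copy = readAllBytes(new ByteArrayInputStream(data));
        System.out.println(copy.length); // 5000
    }
}
```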

This algorithm reads the data in chunks, as more data is read the chunks get larger. This makes it fast for either small streams or large ones. @return all the data in the stream, or empty array on error @throws nothing ever */ public static byte[] readAllBytes(InputStream in) { int i, m, n = 0; byte[] result = new byte[0]; byte[] chunk; byte[] temp; if (in==null) return result; try { m = in.available(); } catch (Exception e) { Log_warning("caught "+e); return result; } if (m < 256) m = 256; chunk = new byte[m]; while (true) { if (++n == 3 && m < 16000) // increase chunk size every 3 up to 16K { n = 0; chunk = new byte[m *= 8]; } try { if ((i = in.read(chunk)) < 0) break; // Compute result = result + chunk[0..i] temp = new byte[result.length + i]; if (result.length > 0) System.arraycopy(result, 0, temp, 0, result.length); System.arraycopy(chunk, 0, temp, result.length, i); result = temp; } catch (Exception e) { Log_warning("caught "+e); } } return result; } /************************************************************************** This method returns the web page contents as one big string. If the URLReader is ready, this method tries to connect. @return the web page contents, or "" if an error happens. */ public String getContentString() { return readAll(getInputStream()); } /************************************************************************** Fetch a segment of a web resource.

Returns URL data from the web. If the URLReader is ready, this method tries to connect. The first substring that starts with start and ends with end is returned. An empty start ("") matches the beginning of the data, and an empty end ("") matches the end of the data. If either start or end is not found, then "" is returned.

If (start,end) are both empty, the entire URL data is returned. NOTE This might not handle binary data! Only fetch text-based URLs.

NOTE The start/end must not contain '\n'.
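
The start/end matching rules can be illustrated on a plain string. The sketch below is a simplified variant, not the method documented here: SegmentSketch and segment are hypothetical names, and because it searches the whole text at once instead of line by line, the restriction on '\n' does not apply to it.

```java
// Hypothetical sketch, not part of URLReader. All names are illustrative.
final class SegmentSketch {
    // Returns the first piece of data that begins with start and ends with
    // end (both markers included), or "" if either marker is not found.
    static String segment(String data, String start, String end) {
        if (data == null) return "";
        if (start == null) start = "";
        if (end == null) end = "";
        int i = data.indexOf(start);            // "" matches at index 0
        if (i < 0) return "";
        if (end.isEmpty()) return data.substring(i);   // up to end of data
        int j = data.indexOf(end, i + start.length());
        if (j < 0) return "";
        return data.substring(i, j + end.length());
    }

    public static void main(String[] args) {
        String page = "<html><title>Hi</title></html>";
        System.out.println(segment(page, "<title>", "</title>")); // <title>Hi</title>
    }
}
```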

@param urlString the full url "http://...." @param start the substring for the start of the segment @param end the substring for the end of the segment @return the segment of the URL data. @throws nothing ever. */ public String getContentSubstring(String start, String end) { String nextline; StringBuffer lineBuffer = null; boolean found = false; int i,j; int len_s, len_e; if (start==null) start = ""; if (end==null) end = ""; len_s = start.length(); len_e = end.length(); try { //open an input stream from the given url BufferedReader in = getBufferedReader(); if (in == null) return ""; //get webpage data, line by line while((nextline = in.readLine())!=null) { if (!found && (i = nextline.indexOf(start)) >= 0) { found = true; if (len_e > 0 && (j = nextline.indexOf(end, i + len_s)) >= 0) return nextline.substring(i,j + len_e); // RETURN: START AND END else lineBuffer = new StringBuffer(nextline.substring(i)); } else if (found) { if (len_e > 0 && (j = nextline.indexOf(end)) >= 0) { lineBuffer.append(nextline.substring(0,j + len_e)); return lineBuffer.toString(); // RETURN: END IS FOUND } else { lineBuffer.append("\n"); lineBuffer.append(nextline); } } } } catch(Exception e) { Log_warning("Unable to get URL '"+url+"' caught "+e); } if (found && len_e == 0) return lineBuffer.toString(); // RETURN: END IS EMPTY return ""; // RETURN: EITHER START OR END IS NOT FOUND } /************************************************************************** This method returns the web object as a byte array. This is useful for binary stuff, like images and class files. If the URLReader is ready, this method tries to connect. @return the web file contents, or an empty array if an error happens. */ public byte[] getContentBytes() { return readAllBytes(getInputStream()); } /************************************************************************** This method returns a stream to get the web data. If the URLReader is ready, this method tries to connect. 
*/ public InputStream getInputStream() { if (ready()) connect(default_retry); if (uconn != null) { try { return uconn.getInputStream(); } catch (Exception e) { Log_warning("caught "+e); } } return null; } /************************************************************************** This method returns a reader to get the web page data one line at a time. If the URLReader is ready, this method tries to connect. */ public BufferedReader getBufferedReader() { InputStream in = getInputStream(); if (in == null) return null; return new BufferedReader(new InputStreamReader(in)); } /************************************************************************** This method disconnects from the host. If the URLReader is not connected, then this does nothing. After disconnect, the URLReader can be modified and/or reconnected. */ public void disconnect() { if (uconn != null) uconn.disconnect(); // I don't know if this does anything uconn = null; } /************************************************************************** This method returns a string dump of the URLReader */ public String toString() { return super.toString(); } /*------------------------------------- | STATIC METHODS | -------------------------------------*/ /************************************************************************** This method reads a web page into a string. @return the web page contents, or "" if an error happens. */ public static String urlToString(String urlstring) { URLReader ur = new URLReader(urlstring); return ur.getContentString(); } /************************************************************************** This method returns the web object as a byte array. This is useful for binary stuff, like images and class files. @return the web file contents, or an empty array if an error happens. 
*/ public static byte[] urlToBytes(String urlstring) { URLReader ur = new URLReader(urlstring); return ur.getContentBytes(); } /************************************************************************** Fetch a segment of a web resource.

Returns URL data from the web. The first substring that starts with start and ends with end is returned. An empty start ("") matches the beginning of the data, and an empty end ("") matches the end of the data. If either start or end is not found, then "" is returned.

If (start,end) are both empty, the entire URL data is returned. NOTE This might not handle binary data! Only fetch text-based URLs.

NOTE The start/end must not contain '\n'.

@param urlString the full url "http://...." @param start the substring for the start of the segment @param end the substring for the end of the segment @return the segment of the URL data. @throws nothing ever. */ public static String getURL(String urlstring, String start, String end) { URLReader ur = new URLReader(urlstring); return ur.getContentSubstring(start, end); } /************************************************************************** for debugging */ public static void Log_warning(String m) { System.err.println(m); } /************************************************************************** Return two numbers in a string like "1.2 3.2" the first is the percent of urls that failed but succeeded on a retry, the second is the percent that did not succeed. */ public static String failRates(boolean reset) { float r,f; int t; if (open_count == 0) // undefined count return "-0.0 -0.0"; t = 1000 * fail_count / open_count; f = t / 10.0f; t = 1000 * retry_count / open_count; r = t / 10.0f; if (reset) { open_count = 0; fail_count = 0; retry_count = 0; } return ""+r+" "+f; } /************************************************************************** Capitalize the first letter in each group of alphabet characters and lowercase all the others. For instance "useR-aGenT" ==> "User-Agent". */ public static String capitalize(String s) { if (s==null) return s; StringBuffer b = new StringBuffer(); for (int i=0; i < s.length(); ++i) if (i==0 || !Character.isLetter(s.charAt(i-1))) b.append(Character.toUpperCase(s.charAt(i))); else b.append(Character.toLowerCase(s.charAt(i))); return b.toString(); } /************************************************************************** This is for testing only */ public static void main1(String[] args) { String s = args.length > 0 ? 
args[0] : null; Properties p = parseCookieList(s); p.list(System.out); } /************************************************************************** This is for testing only */ public static void main(String[] args) { int chop = 200; String s = "http://www.formulus.com/cgi-bin/trivial.cgi"; String ck; if (args.length > 0) s = args[0]; // If you want to test SSL, install and configure JSSE, and // uncomment the following lines. // // if (s.startsWith("https")) // java.security.Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider()); if (args.length > 1) try { chop = Integer.parseInt(args[1]); } catch (Exception e) {} URLReader ur = new URLReader(s); System.out.println("chop = "+chop); System.out.println("ready = "+ur.ready()); ur.setFollowRedirects(false); System.out.println("connect = "+ur.connect(2)); System.out.println("ready = "+ur.ready()); s = ur.getContentString(); if (s.length() > chop) s = s.substring(0,chop); System.out.println("data(partial) = '"+s+"'"); for (int i=1; ur.getHeaderFieldKey(i) != null; ++i) System.out.println(ur.getHeaderFieldKey(i)+": "+ur.getHeaderField(i)); System.out.println("ready = "+ur.ready()); ur.disconnect(); } }