Do-It-Yourself Caching: Squid 2.3
Why Caching is Essential

Caching is an important function for a wide variety of Internet-related concerns: ISPs, educational institutions, and corporations all find that it measurably enhances system performance. There are a host of commercial caching products on the market, but perhaps the most popular cache is Squid, the open-source cache originally produced by the ARPA-funded Harvest project and now maintained by the National Laboratory for Applied Network Research (NLANR). If you're doing any sort of Web serving, Squid is worth a close look. In this article, we'll explore Squid configuration and test its capabilities under real-world circumstances.

Chances are pretty good that you already have Squid, since virtually every Linux distribution includes it, in both source code and precompiled binary form, as contributed software. You can also download source code by following links from the Squid home page.

Squid can compile and run on minimal hardware, but experience shows that a stable Squid cache requires at least 128 MB of RAM and several GB of disk storage. Performance of course varies widely, affected by many factors: CPU, memory, disk, Squid configuration, and kernel tuning. We first ran Squid on a Pentium 133 with 64 MB of RAM, but could not achieve long-term stability. We had no problem running Squid on a Pentium III 500 with 128 MB of RAM and a 20 GB disk, with only 4 GB allocated to cache storage. Of course, we were only shooting for a stable evaluation platform, not production-quality performance. The first IRCACHE Bakeoff found Squid maxing out at about 100 requests per second on a PII 333 with 256 MB of RAM and 30 GB of storage. We evaluated a newer version of Squid (2.3.STABLE1) but made no attempt to measure its performance. Off-the-shelf Squid lacks the custom file system, optimized kernel, tuned protocol stack, and redundant hardware support provided by commercial cache products.
If you plan to deploy Squid, you'll want to start with fast, robust hardware and tweak your config to get the most out of this open-source solution. Plenty of physical memory and Fast or Ultra Wide SCSI disks are highly recommended.
Software Installation

Unless you bump into compile problems, installing Squid isn't all that difficult. We recommend starting with the Squid User's Guide, which provides an excellent intro-level explanation. Once you begin to ask tougher questions, consult the Squid FAQ for additional detail.

We began with a pair of PCs already networked and running Apache. As suggested by the Guide, we created a new squid user account with home directory /usr/local/squid. (For security reasons, Squid should not be run as root.) After logging in as squid, we downloaded, unzipped, and extracted the source into /usr/local/squid/squid-2.3.STABLE1. We then invoked the following commands to compile and install Squid in the default location with an embedded SNMP agent:

./configure --enable-snmp
make all
make install

These commands install Squid in /usr/local/squid with subdirectories /bin, /etc, /logs, and /cache. You can start with the default config (etc/squid.conf), but it's worth taking a few minutes to consider where your cache storage will be located. If you change nothing, data will be stored in the /cache subdirectory created during installation. We replaced /cache with a separate 4 GB disk partition we'd created exclusively for cache storage. Though not strictly required, we suggest using dedicated partitions or disks for cache storage to facilitate tuning and compartmentalize "out of space" problems.

The default config allocates 100 MB of disk space and 8 MB of RAM for cache storage. On our PIII 500, we modified squid.conf to use 3 GB of disk space and 32 MB of RAM. On our minimal P133, we tried several settings, but could not overcome stability problems that we attribute to insufficient memory. After making these changes to the cache_dir statement in squid.conf, execute the command:

/usr/local/squid/bin/squid -z

to create the swap directories that index stored data.
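The storage settings just described boil down to a couple of lines. Here's a sketch of the relevant squid.conf fragment using our sizes and mount point; yours will differ, and depending on your Squid build, cache_dir may also accept a storage-type argument (such as ufs):

```
# squid.conf -- cache storage on a dedicated partition (our sizes)
cache_dir /cache 3000 16 256    # 3 GB, 16 first-level / 256 second-level dirs
cache_mem 32 MB                 # target RAM for hot and in-transit objects
```

Remember to rerun squid -z after any cache_dir change so the swap directories match the new configuration.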
If you change your mind later, you can always remove the cached files and recreate (empty) swap directories in a different location, or add another cache_dir. By default, Squid operates in proxy mode, listening for client requests sent to port 3128. If that's not what you want, you'll need to modify squid.conf (see "Deploying Squid" later in this tutorial). Otherwise, just execute squid again without the -z option to launch a background child process. View the file /usr/local/squid/logs/cache.log to check for errors. Once you have Squid running as intended, you'll probably want to update your inittab, rc.local, or init.d file to launch Squid automatically at boot. It took us about 30 minutes to get Squid running in proxy mode during our first install.
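Launching Squid at boot can be as simple as one line in rc.local; a minimal sketch, assuming the default install prefix (a proper init.d script for your distribution would be more polished):

```
# /etc/rc.d/rc.local -- start Squid at boot; it forks a background child
/usr/local/squid/bin/squid
```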
Managing Squid

Unlike commercial products, Squid isn't configured through a command line or graphical user interface. To make a configuration change, edit the etc/squid.conf or etc/mime.conf file, then invoke:

/usr/local/squid/bin/squid -k reconfigure
Changes can be checked for proper syntax with the -k parse command-line option; add the -X option to enable debug output while parsing the config file. A few changes (e.g., enabling or disabling the client access log) require Squid to be stopped and restarted; stopping can be accomplished with the -k shutdown option.

Squid can be monitored through log files, created by default in /usr/local/squid/logs. The cache.log contains system-level messages used to monitor status when starting, reconfiguring, or stopping Squid. The access.log records client request activity (see "Monitoring Squid"). The store.log tracks objects being added to the cache; you'll probably want to disable this very large log by adding cache_store_log none to squid.conf. In fact, if you don't disable the store.log, you'll eventually run out of disk space.

Accessing shell commands and files requires the administrator to log into the Squid server. Remote login session traffic can be secured with additional software (for example, Secure Shell). Squid can also be remotely monitored through a browser-based Cache Manager GUI.

Cache Manager is a CGI script. To run it, config changes may be required on the localhost web server. We added a link to /usr/local/squid/bin/cachemgr.cgi from our existing Apache cgi-bin directory, but we could have added a ScriptAlias to Apache's srm.conf file instead. If you want to limit access to Cache Manager, add a Location to Apache's access.conf file permitting execution by a specified host, domain, or authenticated user. Config details for each web server vary; do whatever it takes to execute this CGI script and impose appropriate security restrictions. With our version of Squid, manager access by the localhost was permitted by default. Older Squid versions may require squid.conf updates to add the rule http_access allow manager localhost. You can restrict the actions available through Cache Manager by customizing cachemgr_passwd statements in squid.conf.
The only active command provided by Cache Manager--shutdown--is password-protected by default. All other Cache Manager commands are passive; this GUI does not support remote configuration.
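The Cache Manager hookup described above can also be done entirely in Apache 1.x config files instead of symlinking; a sketch, where the URL path and subnet restriction are illustrative assumptions:

```
# srm.conf -- expose cachemgr.cgi without copying it into cgi-bin
ScriptAlias /Squid/cgi-bin/cachemgr.cgi /usr/local/squid/bin/cachemgr.cgi

# access.conf -- only admin stations on our subnet may run it
<Location /Squid/cgi-bin/cachemgr.cgi>
    order deny,allow
    deny from all
    allow from 192.168.1.
</Location>
```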
Deploying Squid

Squid 2.3 caches HTTP, FTP, Gopher, and DNS lookup results, with support for SSL connections. All protocols are handled in forward proxy mode; only HTTP is available in transparent proxy mode. Squid can also operate as an HTTP accelerator (reverse web proxy) for a single server, or for several servers with different content. We tested Squid as a simultaneous forward and reverse web proxy; to use both at the same time, you must enable the httpd_accel_with_proxy option. As a reverse proxy, Squid can listen for requests directed to several IPs and ports, or accelerate virtual hosts at the same address using the HTTP/1.1 Host: header. Squid does not distribute requests across servers; for this, you'd need to use DNS round-robin or a server load balancer.

We did not test Squid in transparent mode. To do so, you'd need to route or redirect traffic to Squid using a switch or (new in Squid 2.3) a Cisco router running WCCP 1.0. Several Squid config changes are required, including:

http_port 8080

Depending upon your OS, you may also need to recompile Squid. The most difficult part may be getting your OS to accept redirected packets and deliver them to Squid. Some OSes cannot do this; others require the IP filtering or forwarding tweaks described in the Squid FAQ. Squid must initially run as root to listen on port 80 (or any other privileged port).

Squid supports hierarchies using ICP queries between child, parent, and neighbor caches over both unicast and multicast IP. As we had with commercial caches, we successfully configured Squid to query an ICP parent for all ISP-Planet.com requests:

cache_peer huahine.netreach.net parent 8080 3130

and monitored hierarchy behavior using Cache Manager. But we found that Squid must be able to reach its cache_peer(s) at start-up. NLANR maintains a registry of caches to help build hierarchies. You can query NLANR's Tracker database to locate potential parent or sibling caches. To enroll your own cache, contact NLANR.
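Putting the hierarchy pieces together, the parent relationship we used might be sketched in squid.conf as follows; restricting the peer to one domain via cache_peer_access is an illustrative assumption on top of the single cache_peer line we actually configured:

```
# squid.conf -- query an ICP parent (HTTP on port 8080, ICP on 3130)
cache_peer huahine.netreach.net parent 8080 3130

# illustrative: only forward isp-planet.com requests to this parent
acl ispplanet dstdomain .isp-planet.com
cache_peer_access huahine.netreach.net allow ispplanet
cache_peer_access huahine.netreach.net deny all
```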
You can restrict the hierarchical queries satisfied by your cache using the icp_access option, or have proxied requests bypass the hierarchy if they match a specified hierarchy_stoplist. Finally, Squid can generate and exchange "cache digests." A digest is a (relatively) compact object index that peers can use to locate objects stored in neighboring caches. Squid does not offer the high-availability features found in some commercial caching products. Using transparent mode with a switch may insulate clients from Squid server failure. Perhaps disk mirroring could be used to enable a hot standby, but we did not find this possibility discussed in the Squid FAQ.
ACLs and Authentication

Many squid.conf options require the use of Access Control Lists (ACLs). Each ACL consists of a name, a type, and a value (a string or filename). ACL types include source or destination IP address, domain or domain regular expression, hours, days, URL, port, protocol, method, username, and type of browser. ACLs can also require user authentication, specify an SNMP read community string, or set a TCP connection limit.

HTTP processing depends upon http_access config statements that allow or deny requests, filtered by ACL. The default config permits manager access by the localhost, blocks HTTP or SSL access to "unsafe" ports, and denies all other access. This default config must be relaxed to permit client access. We defined ACLs representing local subnets, and allowed access by any request originating from our subnets. We also blocked access from one local host, and permitted authenticated user access from anywhere else.

User authentication can be performed in two ways: authorized users can be enumerated in squid.conf, or a proxy authentication "helper" can be identified. Several helper programs are supplied with the Squid 2.3 source, including NCSA, LDAP, Windows NT, and SMB modules. A RADIUS helper can also be found on the Squid web site. When authentication is enabled, Squid prompts the client for a login/password, relays the client's input to the helper program, and uses the result (ERR or OK) to accept or reject the client request. Helpers like this allow simple integration with existing authentication servers. We tried NCSA authentication: simply make and install the module, then add the following statement to squid.conf (together with a proxy_auth ACL referenced from an http_access rule):

authenticate_program /usr/local/squid/bin/ncsa_auth /usr/local/etc/httpd/users
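The access policy we described above might be sketched like this; the subnet and blocked-host addresses are illustrative assumptions (yours will differ):

```
# squid.conf -- allow local subnets, block one troublesome host,
# require proxy authentication from everyone else (illustrative addresses)
acl localnet src 192.168.1.0/255.255.255.0
acl badhost  src 192.168.1.66/255.255.255.255
acl password proxy_auth REQUIRED
http_access deny  badhost
http_access allow localnet
http_access allow password
http_access deny  all
```

Order matters: http_access rules are evaluated top to bottom, and the first match wins.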
NCSA authentication requires an encrypted password file. Before reconfiguring Squid to use NCSA, we created the file /usr/local/etc/httpd/users with Apache's htpasswd utility. We also tried MSNT authentication, but didn't have any luck getting the helper to talk to our NT domain controller (probably a config error; we didn't debug the problem).

Squid ACLs are referenced in many config statements, including icp_access, miss_access, and cache_peer_access (all provide hierarchy control), deny_info (customizes error messages), and always_direct (configures cache bypasses). For example, a hypothetical always_direct rule that sends requests for a local server straight to the origin, bypassing the hierarchy:

acl local_server dstdomain www.netreach.net
always_direct allow local_server
These ACLs are very granular and provide lots of flexibility, but Squid lacks the group-level control you'll find in other caching tools, such as the Network Appliance NetCache. Squid also supports Redirectors, which can be used to block access to "undesirable" sites: just identify a redirect_program in your squid.conf, and Squid will forward incoming requests to that program so that content filters can be applied. A few open-source Redirectors, such as Squirm, SquidGuard, and Ad Zapper, can be downloaded by following links from the Squid home page.
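A redirector is just a line-oriented filter: Squid writes one request per line ("URL ip-address/fqdn ident method") to the program's stdin and reads back a replacement URL (or the original) on stdout. A minimal sketch in shell, with a hypothetical ad host and replacement image:

```shell
# Minimal redirector sketch: Squid feeds "URL ip/fqdn ident method" lines
# on stdin; we answer one URL per line. The ad host and blank image are
# hypothetical -- real filters like SquidGuard are far more capable.
redirect() {
  while read url rest; do
    case "$url" in
      http://ads.example.com/*) echo "http://localhost/blank.gif" ;;
      *)                        echo "$url" ;;
    esac
  done
}

# demo: filter two request lines through the redirector
printf '%s\n' \
  "http://ads.example.com/banner.gif 10.0.0.1/- - GET" \
  "http://www.isp-planet.com/ 10.0.0.1/- - GET" | redirect
```

Hook a script like this in with a redirect_program statement pointing at its installed path (the path, like the rest of this sketch, is an assumption).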
Squid Resource Usage

Many admins will do what we did: make a few obvious squid.conf changes after the initial install, then let Squid run for a while and observe what happens. Our initial tuning changes included moving cache storage to a dedicated partition, raising the disk and memory allocations, and disabling the store.log.
Before disabling logs, we started a Polygraph workload and monitored the access.log. With an active workload, these logs grew large and quickly consumed all available space in their default location, /usr/local/squid/logs. We then set cache_access_log none, but this only created a file called none! In fact, you can't stop Squid from recording client access, but you can avoid writing the log to disk by sending output to /dev/null. You can also better manage disk space by placing your log files in another partition, invoking squid -k rotate periodically in a cron job, and transferring old files elsewhere.

Once we brought disk usage under control by tweaking cache_dir and log settings, our next challenge was memory. The config option cache_mem affects how much RAM is used by in-transit, hot, and negatively cached objects, but if Squid needs more memory for in-transit objects, it will use it despite any limit you set here. And there is no config option to limit Squid process size: cache_mem is only one way Squid consumes RAM; other uses include disk and network I/O buffers, the IP and FQDN caches, per-request state information (including request and reply headers), and statistics. You can monitor Squid memory usage with Cache Manager; our general stats showed 55 MB in use when we'd configured cache_mem 32 MB. Squid developers have been working to eliminate memory leaks; check the FAQ for hints on how to reduce usage and avoid leaks.
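The rotation scheme described above fits in a single crontab line for the squid user; the logfile_rotate option in squid.conf controls how many old copies are kept:

```
# crontab (squid user) -- rotate Squid logs nightly at 23:59
59 23 * * * /usr/local/squid/bin/squid -k rotate
```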
Tuning Squid Efficiency

Once you've stabilized Squid, you can proceed to fine-tune its operating efficiency. The configuration file contains over 100 tuning knobs. Among other things, you can choose a cache replacement policy, set limits on the size of cached objects, and control how long objects remain fresh.
Earlier versions of Squid had a single, self-explanatory replacement policy: Least Recently Used (LRU). In version 2.3, you can also choose Greedy-Dual Size Frequency (GDSF) or Least Frequently Used with Dynamic Aging (LFUDA). GDSF gives priority to smaller objects, which have a better chance of being hit. LFUDA is more like LRU, but employs a dynamic aging mechanism said to be more efficient than plain recency of use. Generally, GDSF produces a higher hit rate, while LFUDA reduces WAN bandwidth use. For further details, follow the URLs supplied in squid.conf.

Squid can't provide the patented adaptive or predictive refresh methods used by some commercial caching products, and doesn't include the ability to preload specified high-utilization sites. However, it is quite simple to write a script that uses wget to download sites, invoked on a scheduled basis by cron. You'll find tuning tips to increase hit rate, improve response time, and optimize cache performance in the Squid FAQ.

The Cache Manager offers instantaneous, tabular stats. General Runtime Info summarizes HTTP, ICP, and DNS request rates, hit ratios, median service times, mean object size, and CPU, memory, swap, and file descriptor usage.
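The cron-plus-wget preload idea mentioned above might be sketched as a single crontab entry; the site and proxy address are illustrative, and wget's --delete-after flag discards each file once it has passed through (and populated) the cache:

```
# crontab entry -- prefetch a high-use site through the local cache at 6 a.m.
0 6 * * * http_proxy=http://localhost:3128/ wget -q -r -l1 --delete-after http://www.isp-planet.com/
```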
Monitoring Squid

We used three tools to monitor Squid: Cache Manager, an SNMP NMS, and logfile post-processors. We've already discussed Cache Manager (see "Managing Squid" and "Tuning Squid Efficiency"), and we've shown how to disable and rotate the access.log.

Squid's access.log uses the native log format adopted by most commercial products; an emulate_httpd_log option is available to select the CERN format instead. Squid's native log format is described in the Squid FAQ, including an explanation of the HTTP and ICP status codes that accompany each log entry. Many logfile post-processors are available that understand Squid's native log format. Some basic Perl scripts are provided by NLANR: for example, run access_extract on a log file to generate a raw summary, then pipe the result to access_summary to produce the report shown in part here. Another popular tool that produces text or HTML reports from Squid logs is Calamaris. For more info on these and other post-processors, visit the Squid home page and follow the links to software.

We used CastleRock's SNMPc MIB Browser to query Squid's SNMP agent. Squid source includes an enterprise MIB. To use SNMP, compile Squid with SNMP enabled, add etc/mib.txt to your NMS MIB database (we had to fix one syntax error), then add snmp_community and snmp_access statements to squid.conf. By default, the Squid agent listens on port 3401 but denies all requests. With an appropriate read community string and access ACLs, the agent will accept read (but not write) requests. The Squid MIB allows an NMS to query software version, memory and disk allocations, as well as a healthy set of protocol, network, ICP peer, and client counters. In fact, this MIB provides the platform necessary to monitor traffic using MRTG. The Squid agent is independent of any other SNMP agent you might run on your Squid server (for example, one providing access to MIB-II objects).
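The SNMP hookup takes an ACL pair in squid.conf; the community string and NMS address here are illustrative assumptions:

```
# squid.conf -- permit read-only SNMP queries (agent listens on UDP 3401)
acl snmppublic snmp_community public
acl nms src 192.168.1.5/255.255.255.255
snmp_access allow snmppublic nms
snmp_access deny all
```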
Squid doesn't support e-mail or pager event notification, administrative features found in commercial products. In fact, Squid doesn't generate SNMP traps. If you want to monitor Squid events, you can do so by writing a monitoring process that runs on the Squid server, or by listening to standard coldStart and interface traps generated by the server's SNMP agent.
Maintenance and Support / Final Words

One of the chief arguments against production use of any open-source solution is limited support. In truth, there are many companies willing to provide Squid support for a price: you'll find several enumerated in the Squid FAQ. There is also a wealth of free troubleshooting info in the Squid Mail Archive. Squid source is maintained by NLANR, funded by the NSF. Most of you are familiar with the tradeoffs between open source and commercial support, so we won't belabor this point.