O'Reilly Network: Deploying Squid, Part 1 of 2

Published on The O'Reilly Network (http://www.oreillynet.com/)
http://www.oreillynet.com/pub/a/linux/2000/02/07/tutorial.html

Deploying Squid, Part 1 of 2

by Jeff Dean
02/07/2000

In this two-part technical tutorial we'll explore the deployment of a web proxy cache, sometimes referred to as a Web cache or a proxy server, for a small to medium sized corporate enterprise. A web proxy cache is surprisingly easy to implement and maintain, and when built using open-source software it can be quite economical as well.

Why cache?

Bandwidth on a corporate Internet connection is a valuable and often critical business resource, and in most cases is also fairly expensive. Unfortunately, even for small and medium sized companies that precious lifeline can become consumed by Web traffic from the company's own internal systems. This leads to a slow and unresponsive connection during peak work hours. An analysis of the Web surfing habits of a company's user population will often show a number of "hot" Web sites, such as competitors, stock tracking sites, and items of personal interest to employees. Visits by multiple individuals to hot sites lead to inefficiencies, because each client browser must use the relatively slow corporate Internet connection to fetch the same data.

Popular browsers help to reduce inefficiencies by locally caching Web objects. This reduces demand and increases performance for each user, but browser caches aren't shared across an enterprise. The implementation of a web proxy cache can save additional bandwidth.

Just what is a Web cache? Simply put, it's an intermediary (or proxy) computer system between Web browsers and Internet Web servers. Instead of sending requests for Web pages directly to origin servers on the Internet, browsers contact a web proxy cache server on the local high-speed network, which in turn contacts the origin server on behalf of the browser. The proxy fetches the object from the Internet and forwards it back to the browser, but also keeps a copy for itself. Subsequent requests for the same object from any browser in the enterprise won't require a visit to the origin server. They can be fulfilled locally from the Web cache. This has the effect of speeding response time for everyone and reducing bandwidth demands on the Internet connection. It is not unusual for a cache system to reduce demand by 20-30%.

A variety of web proxy cache products are available. Some are free while others are very expensive, particularly for large corporate environments. When per-user licensing fees are evaluated for some of the products, the costs can become prohibitive. Fortunately, at least one mature, reliable, and popular Open Source alternative exists: the Squid web proxy cache. Squid is funded by the US National Science Foundation and is developed through the unpaid contributions of many volunteers. Squid is free, licensed under the GNU General Public License. Squid runs on nearly all flavors of Unix, including Linux and FreeBSD.

System Requirements

A web proxy cache requires a generous amount of memory and a fast disk I/O subsystem. Memory is needed to maintain lists of cached objects, and disks must be capable of keeping up with a steady flood of random reads and writes. Typically processor speed is not a limiting factor, and a modest processor can make a satisfactory proxy server given the appropriate I/O and memory configuration.

In this tutorial, we'll be configuring Squid for a pair of Intel systems running Linux and intended to serve up to 2000 client browsers. Since Internet demand and usage patterns are site-specific, your site may need more or less hardware as your needs dictate. For the purposes of this example, the following specifications are adequate:

Single-processor Intel PentiumPro-200 or better
256MB RAM
Ultra-Wide SCSI Interface
Three 4GB Ultra-Wide SCSI disks (no RAID)
Redhat Linux 6.0
Squid web proxy cache
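
As a rough sanity check on the memory figure, the sketch below adds up a RAM budget for the configuration in Listing 1 (two 3500 MB cache directories and cache_mem set to 80 MB). The 10 MB of index memory per GB of disk cache and the 20 MB of fixed overhead are rule-of-thumb assumptions, not measurements, and the function name is only illustrative.

```shell
#!/bin/sh
# Rough RAM budget for a Squid host. All constants are rules of thumb.
estimate_ram_mb() {
    cache_gb=$1         # total disk cache across all cache_dir lines, in GB
    cache_mem_mb=$2     # cache_mem setting, in MB
    index_mb_per_gb=10  # assumed index overhead per GB of disk cache
    overhead_mb=20      # assumed fixed overhead for the Squid process
    echo $(( cache_gb * index_mb_per_gb + cache_mem_mb + overhead_mb ))
}

# About 7 GB of disk cache and cache_mem 80 MB, as in Listing 1.
echo "estimated Squid memory: $(estimate_ram_mb 7 80) MB"
```

Under these assumptions the estimate comes to 170 MB, which leaves some room for the operating system on a 256MB machine.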

In our example configuration we'll begin with a working Redhat Linux 6.0 system (including the gcc C compiler) on ultra-wide SCSI disk /dev/sda. This disk will also hold Squid and its log files. Two more disks, /dev/sdb and /dev/sdc, will contain the cached Web objects. To start, the cache disks are assumed to contain unused ext2 (Linux native) partitions /dev/sdb1 and /dev/sdc1. Placing the cache on multiple disks increases cache performance because it distributes I/O and takes advantage of Squid's ability to manage multiple cache disks simultaneously. (If you are configuring Squid for a small installation, you may choose to cache to your system disk instead.) For even better performance, we could place the disks on separate SCSI channels. Note that IDE disk interfaces are not recommended for heavily loaded proxy servers because of the inherent random nature of the cache I/O.

We'll be installing Squid into its default location, /usr/local/squid. We recommend making the /usr partition large enough to handle Squid's log files, which can grow very big on a production server. We will also run Squid under a special user created for the purpose, appropriately called "squid" with a special group also called "squid."

A web proxy cache will write a large number of small files in its cache directories. Therefore, you should create the filesystems for the two cache disks with a relatively large number of inodes. If the inode configuration is new to you, don't worry about it at this point - it's easy to reconfigure the cache disks later if necessary.
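
To get a feel for how many inodes a cache disk might need, here is a minimal sketch that divides the cache size by an average object size. The 13 KB average is an assumption inferred from the startup log in Listing 2 (1024000 KB of swap, estimated 78769 objects), and the function name is hypothetical.

```shell
#!/bin/sh
# Estimate how many objects (and therefore inodes) a cache disk may hold.
estimate_objects() {
    size_mb=$1   # cache size in MB, e.g. the 3500 used in Listing 1
    avg_kb=$2    # assumed average object size in KB
    echo $(( size_mb * 1024 / avg_kb ))
}

objects=$(estimate_objects 3500 13)
echo "estimated objects per cache disk: $objects"

# mke2fs -i sets the number of bytes per inode; lower values give more
# inodes. One inode per 4096 bytes on a 4GB disk yields about a million
# inodes, well above the estimate. Shown as a comment because it
# destroys any data on the partition:
#   mke2fs -i 4096 /dev/sdb1
```

For a 3500 MB cache this prints an estimate of 275692 objects, so the filesystem should offer at least that many inodes with some headroom.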

Getting and Compiling Squid

While you may find a current precompiled binary package for your system, we'll compile Squid from source code for this tutorial. Squid compiles easily and offers complete control over where it is installed.

First, create the squid user (under Redhat Linux, this also creates the squid group):

# useradd -d /usr/local/squid squid

Next, create directories for Squid:

# mkdir -p /usr/local/squid/src

Then set ownership and the SGID permission on the top level and source directories. This ensures that all new files have the squid group owner, allowing multiple sysadmins to manage Squid without using root privilege:

# chown -R squid:squid /usr/local/squid
# chmod g+s /usr/local/squid /usr/local/squid/src

Use your browser or FTP client to transfer the Squid source distribution from the Squid web proxy cache download page. As of February 1, 2000, the latest version of Squid is known as "2.3.STABLE1," the version we'll use in this example (you should be able to implement any recent stable release without difficulty).

The squid source is stored in a compressed tar file, which should be placed in the new src directory. Unpack the compressed tar file:

# cd /usr/local/squid/src
# tar zxvf squid-2.3.STABLE1-src.tar.gz

This will leave you with the entire source directory tree under squid-2.3.STABLE1. There are helpful documents in the doc directory, including a quick-start guide and installation instructions. It's worth poking around at this point to familiarize yourself with the version of Squid you're using. Next, build the software:

# cd /usr/local/squid/src/squid-2.3.STABLE1
# ./configure

The automatic configuration process will profile your system to determine exactly what capabilities exist. You shouldn't have difficulty with this process, but if issues do arise the error messages from configure should help you find quick resolutions.

Next, we compile Squid using the supplied Makefile:

# make

The compilation should take between a few minutes and an hour depending on your system's performance. When the compilation has completed without reporting errors, install it:

# make install

The last command will create a directory hierarchy under /usr/local/squid, including bin (executables like squid itself and its utilities), etc (configuration), and logs (Squid log files). Note that there are no cache directories set up at this point. To create them, we'll need to mount the two disks we set aside for the task:

# mkdir /usr/local/squid/cache0 /usr/local/squid/cache1
# mount -t ext2 /dev/sdb1 /usr/local/squid/cache0
# mount -t ext2 /dev/sdc1 /usr/local/squid/cache1

We now need to create a configuration file for Squid, stored in /usr/local/squid/etc/squid.conf. Listing 1 contains a basic file that you can use to get started. Later, you'll want to customize your configuration.

After creating squid.conf, you're ready to build your cache directories. The cached objects are stored in a large hierarchy. Its framework must be created before launching Squid for the first time. To initiate the cache build use the -z option to squid:

# /usr/local/squid/bin/squid -z

This will exercise your disks for a while as the hierarchy is created. When it completes, you're ready to start Squid for the first time:

# /usr/local/squid/bin/squid -Ns &

To verify that squid is running, take a look at /usr/local/squid/logs/cache.log, the file named by the cache_log setting in Listing 1. You should see something like Listing 2, ending in "Ready to serve requests." Squid should now be ready to accept requests from browsers.
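
The same check can be scripted. The sketch below searches a log file for the "Ready to serve requests" message; the function name is hypothetical, and the log path is assumed from the cache_log setting in Listing 1.

```shell
#!/bin/sh
# Succeed if the given Squid log reports that startup finished.
squid_ready() {
    grep -q "Ready to serve requests" "$1"
}

# Path assumed from the cache_log setting in Listing 1.
log=/usr/local/squid/logs/cache.log
if [ -r "$log" ] && squid_ready "$log"; then
    echo "Squid is up"
else
    echo "Squid has not reported ready (or $log is unreadable)"
fi
```

Because the log accumulates across restarts, a match only shows that Squid reached the ready state at some point. Compare the timestamp on the matching line with the time you started the daemon.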

Access Control

Before moving on to the browser side of things, let's stop to consider some basic security issues involved with using a cache (my thanks to Michael Alan Dorman for raising this important issue). Your intended purpose for deploying a cache will imply an intended user base. In the case of a small to medium sized enterprise, for whom this tutorial is intended, the users are usually the employees of the company, who access the Internet from their internal private LAN. A web cache becomes part of the larger security infrastructure, including firewalls, mail servers, and other technologies. In many such cases the web cache can be deployed behind the firewall because it is intended for access only by users on the LAN. In this configuration, security for the cache server isn't a significant concern because only trusted users have access to it.

However, your situation may dictate that you deploy your Squid system outside your firewall so that it is publicly available on the Internet. In this scenario, security rises to the top of the priority list. As Mike Dorman points out, an unsecured web proxy can be unexpectedly abused by unauthorized outsiders.

To prevent such abuse, you can create an access control methodology to selectively offer caching services only to users you trust. Squid offers this capability through administrator-defined Access Control Lists (ACLs), which can be used to create finely detailed access control schemes. Limitations can be placed on client addresses, destination domains, time of day, port numbers, access methods, browsers, and even users. While a complete treatment of Squid ACLs is far beyond the scope of this tutorial, a simple client-address ACL scheme has been included in the Squid configuration shown in Listing 1. The first part of the ACL setup involves the definition of access groups:

acl all src 0.0.0.0/0.0.0.0
acl mynet src 192.168.1.0/255.255.255.0

The first line defines the group all that includes all possible IP addresses. The second defines a small subgroup of addresses called mynet on the private network 192.168.1.0 (this is just an example - your address configuration will be different). It is only users from mynet that we wish to allow access to the cache, which leads us to the second part of the ACL setup:

http_access allow mynet
http_access deny all

Squid checks http_access rules in the order listed and applies the first one that matches, so the order matters. Here, we first grant http access to addresses in mynet, then deny access to every other address as defined in group all. The effect is that systems coming from addresses outside of mynet will not be able to access Squid while those inside have full access.
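
To see which client addresses fall inside mynet, the following sketch applies the netmask by hand. It is only an illustration of address/netmask matching, not Squid's own code, and the function names and the two non-matching sample addresses are hypothetical.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside network $2 with netmask $3.
in_network() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

for client in 192.168.1.30 192.168.2.30 10.0.0.5; do
    if in_network "$client" 192.168.1.0 255.255.255.0; then
        echo "$client matches mynet"
    else
        echo "$client does not match mynet"
    fi
done
```

Only 192.168.1.30 matches, because the netmask 255.255.255.0 compares the first three octets and ignores the last.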

While effective, this ACL configuration only scratches the surface of Squid's capability. A thorough review of ACL usage is essential prior to deployment of a publicly available cache.

Browser Configuration

To test Squid, we'll manually configure a browser to use Squid instead of origin servers. In Netscape Communicator, this is done using the Edit -> Preferences -> Advanced -> Proxies dialog. Select "Manual Proxy Configuration" and click on "View". For each protocol, enter the IP address of your Squid machine and port number 3128, the default port on which Squid listens for inbound requests. Save your changes and try browsing a site you're familiar with. If everything is working correctly, you should be able to browse as before. The difference is that Squid is now acting as an intermediary, keeping copies of the pages you view in its cache. To see Squid's activity, watch its access log:

# tail -f /usr/local/squid/logs/access.log

You should see a line in that file for each request from browsers. An example is given in Listing 3, showing the time (since the Unix epoch), requesting IP address, URL, etc. Each line also will indicate a status of the request with respect to the cache, such as TCP_HIT, TCP_MISS, or TCP_MEM_HIT, among others. Those status messages including the word HIT indicate that the request was served from the cache.
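
Counting the HIT entries gives a quick measure of how well the cache is working. The sketch below is one way to do it with awk, assuming the native log format shown in Listing 3, where the fourth field holds the cache status. The function name is hypothetical and the sample data is copied from Listing 3 with each request on one line.

```shell
#!/bin/sh
# Print "hits total percent" for a Squid access log in its native format.
# Field 4 holds the cache status and HTTP code, e.g. TCP_HIT/200.
hit_ratio() {
    awk '{ total++ }
         $4 ~ /HIT/ { hits++ }
         END { if (total > 0) printf "%d %d %.1f\n", hits, total, 100 * hits / total }' "$1"
}

# Sample data taken from Listing 3.
sample=$(mktemp)
cat > "$sample" <<'EOF'
949393249.739    393 192.168.1.30 TCP_MISS/000 526 GET http://oreilly.linux.com/ - DIRECT/oreilly.linux.com -
949393253.010     19 192.168.1.30 TCP_HIT/200 1699 GET http://www.oreillynet.com/onstyle.css - NONE/- text/css
949393253.837    572 192.168.1.30 TCP_MISS/200 529 GET http://adforce.imgis.com/? - DIRECT/adforce.imgis.com application/x-javascript
EOF
hit_ratio "$sample"
rm -f "$sample"
```

On a live server you would run hit_ratio /usr/local/squid/logs/access.log instead. This counts requests, not bytes, so it says how often the cache answered rather than how much bandwidth it saved.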

If everything has gone well up to this point, you should have a functional Squid configuration that serves requests from multiple browsers.
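
A command-line client can also be pointed at the proxy for a quick test without touching browser settings. The sketch below builds a proxy URL and exports it in the http_proxy environment variable, which tools such as curl and wget honor. The address 192.168.1.10 is a hypothetical stand-in for your Squid machine, and the function name is illustrative.

```shell
#!/bin/sh
# Build a proxy URL from a host and an optional port (default 3128).
proxy_url() {
    echo "http://$1:${2:-3128}"
}

# 192.168.1.10 is a hypothetical address for the Squid machine.
http_proxy=$(proxy_url 192.168.1.10)
export http_proxy
echo "http_proxy=$http_proxy"

# With the variable exported, a client such as curl sends its requests
# through Squid. Left commented out because it needs a running proxy:
#   curl -I http://www.oreillynet.com/
```

Each request made this way should appear in access.log just like a browser request.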

Next Month

In the second part of this article, we'll complete our enterprise installation of Squid, including:

Configuration of automatic startup and shutdown for the Squid daemon.
Configuration of Squid's Web-based management utility.
Automation of proxy configuration for browsers.
Setup of two Squid systems as peers, including the ability to share their caches.



=========
Listing 1
=========

# squid.conf
#
# a basic configuration file for the Squid Proxy Web Cache

# set logging to the lowest level
debug_options ALL,1

# define group "all" that encompasses all possible IP addresses
# and group "mynet" that represents my class-C network:
acl all src 0.0.0.0/0.0.0.0
acl mynet src 192.168.1.0/255.255.255.0

# define an access control for group "mynet" to allow http access,
# and another for group "all" to deny http access.
#
# Squid applies the first matching rule, so the allow line must
# come before the deny line. The effect of using both is to
# prohibit access to the cache by any address that doesn't satisfy
# the criteria established in group "mynet".
http_access allow mynet
http_access deny all

# set Squid's user and group
cache_effective_user squid squid

# set log directories
cache_access_log /usr/local/squid/logs/access.log
cache_log /usr/local/squid/logs/cache.log

# set cache directories of 3.5GB each
cache_dir ufs /usr/local/squid/cache0 3500 16 256
cache_dir ufs /usr/local/squid/cache1 3500 16 256

# set the cache memory target for the Squid process
cache_mem 80 MB

# the mailbox of the sysadmin
cache_mgr root@localhost


=========
Listing 2
=========

2000/02/01 03:12:10| Starting Squid Cache version 2.3.STABLE1 for i686-pc-linux-gnu...
2000/02/01 03:12:10| Process ID 1188
2000/02/01 03:12:10| With 1024 file descriptors available
2000/02/01 03:12:10| Performing DNS Tests...
2000/02/01 03:12:10| Successful DNS name lookup tests...
2000/02/01 03:12:10| DNS Socket created on FD 5
2000/02/01 03:12:10| idnsParseResolvConf: nameserver 209.195.201.3
2000/02/01 03:12:10| idnsAddNameserver: Added nameserver #0: 209.195.201.3
2000/02/01 03:12:10| idnsParseResolvConf: nameserver 209.195.192.3
2000/02/01 03:12:10| idnsAddNameserver: Added nameserver #1: 209.195.192.3
2000/02/01 03:12:10| Unlinkd pipe opened on FD 10
2000/02/01 03:12:10| Swap maxSize 1024000 KB, estimated 78769 objects
2000/02/01 03:12:10| Target number of buckets: 1575
2000/02/01 03:12:10| Using 8192 Store buckets
2000/02/01 03:12:10| Max Mem  size: 40960 KB
2000/02/01 03:12:10| Max Swap size: 1024000 KB
2000/02/01 03:12:10| Rebuilding storage in /usr/local/squid/cache0 (CLEAN)
2000/02/01 03:12:10| Rebuilding storage in /usr/local/squid/cache1 (CLEAN)
2000/02/01 03:12:10| Set Current Directory to /usr/local/squid/cache0
2000/02/01 03:12:10| Loaded Icons.
2000/02/01 03:12:10| Accepting HTTP connections at 0.0.0.0, port 3128, FD 14.
2000/02/01 03:12:10| Accepting ICP messages at 0.0.0.0, port 3130, FD 15.
2000/02/01 03:12:10| WCCP Disabled.
2000/02/01 03:12:10| Ready to serve requests.



=========
Listing 3
=========

949393249.739    393 192.168.1.30 TCP_MISS/000 526 GET http://oreilly.linux.com/ - DIRECT/oreilly.linux.com -
949393253.010     19 192.168.1.30 TCP_HIT/200 1699 GET http://www.oreillynet.com/onstyle.css - NONE/- text/css
949393253.837    572 192.168.1.30 TCP_MISS/200 529 GET http://adforce.imgis.com/? - DIRECT/adforce.imgis.com application/x-javascript