The performance of the NetApp was measured in several ways: log analysis with a hacked Calamaris, Squidclients and Squidtimes; MRTG for the bandwidth statistics; tcpbanger; the Wisconsin Proxy Benchmark; and a few ping tests and nuke tests.
For all statistics only the month of May was taken into account.
Calamaris is a Perl script designed to gather statistics about your cache. It is written by Cord Beermann and is meant to be used with Squid. We hacked it a little so that it could cope with the somewhat different logfiles of the NetApp. This new script can be found at http://www.student.utwente.nl/~mark/netapp/netcache.pl. A diff from calamaris.pl to netcache.pl is also available, at http://www.student.utwente.nl/~mark/netapp/diff
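To give an idea of what such a log analysis boils down to, here is a minimal sketch in Python (the original is a Perl script). The log filename is a placeholder, and the field layout assumes the common Squid access-log format, which the NetApp log resembles; handling the differences is exactly what the hack was for.

    # Minimal sketch of the kind of counting netcache.pl does; not the
    # actual netcache.pl logic. Assumes a Squid-style access log with the
    # result code (e.g. "TCP_HIT/200") in the fourth field.

    from collections import Counter

    def hit_ratio(logfile):
        codes = Counter()
        with open(logfile) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 4:
                    continue                         # skip malformed lines
                codes[fields[3].split('/')[0]] += 1  # "TCP_HIT/200" -> "TCP_HIT"
        total = sum(codes.values())
        hits = sum(n for code, n in codes.items() if 'HIT' in code)
        return hits, total, (100.0 * hits / total if total else 0.0)

    hits, total, ratio = hit_ratio('access.log')     # placeholder filename
    print('%d of %d requests were HITs (%.2f%%)' % (hits, total, ratio))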
The results for the month of May are in appendix B.
You can see that the NetApp was able to serve 38.10% of all requests from the cache. This value is somewhat distorted, because the cache was still filling at the beginning of the month and we ran a throughput test at the end of it. A maximum of 47342 urls/hour was reached on May 28, the day of the throughput test.
Squidclients and Squidtimes are written by Nico Tranquilli and tell you the number of clients using your cache and some timing statistics, i.e. the average time a HIT and a MISS take.
The HIT percentage reported by netcache.pl (38.10%) differs from that of Squidtimes (11%), even though the total number of TCP_HITs matches (519905). This is because the Squidtimes software only takes TCP_HITs into consideration. If we count only TCP_HITs in netcache.pl as well, we see the expected result: 11.58%. To explain the difference in more detail we must look into the software: Squidtimes counts 5476705 requests, netcache.pl counts 4491318 TCP + 985352 ICP = 5476670. Apparently Squidtimes saw 35 requests that netcache.pl could not handle. This is confirmed when we compare the errors seen by the netcache script with the errors seen by Squidtimes.
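Spelled out, the arithmetic behind this comparison is simply:

    # Reproducing the counter arithmetic from the paragraph above.
    tcp_hits   = 519905
    tcp_total  = 4491318                 # TCP requests seen by netcache.pl
    icp_total  = 985352                  # ICP requests seen by netcache.pl
    squidtimes = 5476705                 # requests seen by Squidtimes

    print('HIT ratio over TCP only: %.2f%%' % (100.0 * tcp_hits / tcp_total))  # 11.58%
    print('unaccounted requests: %d' % (squidtimes - tcp_total - icp_total))   # 35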
You can see that we had 767 unique clients accessing the NetApp during the month of May, good for a total of 40 Gb of traffic through the cache.
On average a HIT was delivered in 623 msec, and a MISS in 7765 msec. When a cache is settled (filled, and given some time to stabilize) the HIT time should be very good; in our opinion it is a bit too high here. However, because the NetApp was not here all that long, this value might well have dropped if the NetApp had been used longer.
Multi Router Traffic Grapher (MRTG) was used to see how much bandwidth the NetApp took over time. During the throughput test on May 28 it reached its maximum, at 700 kB/s in and 400 kB/s out. When the cache was most used (9.00-17.00) the average flow was about 45 kB/s in and out. The graphs can be found here.
tcpbanger is a tool that fires a lot of urls at a proxy simultaneously. With it you can get an idea of the maximum number of users your proxy can handle. Here we used it to get a lot of objects into our cache, on May 28. That day we took the index file of the SURFnet toplevel cache, split it into 10 pieces and fed those to 10 tcpbangers; the principle is sketched below.
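As an illustration of the idea (not the actual tcpbanger code), a small Python sketch that splits a url list over a number of workers and fetches everything through the proxy concurrently; the proxy address and the url-list filename are placeholders:

    # Sketch of the tcpbanger idea: deal a url list out over N workers
    # and fetch everything through the proxy at the same time.

    import threading
    import urllib.request

    PROXY = {'http': 'http://proxy.example.net:3128'}   # placeholder proxy
    WORKERS = 10

    def banger(urls):
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
        for url in urls:
            try:
                opener.open(url, timeout=30).read()
            except Exception:
                pass                                    # a banger just keeps going

    with open('urls.txt') as f:                         # placeholder url list
        urls = [line.strip() for line in f if line.strip()]

    # Split the list into WORKERS pieces, as the index file was split in 10.
    threads = [threading.Thread(target=banger, args=(urls[i::WORKERS],))
               for i in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()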
The graphs that were made of that run can be found in appendix C. It can be seen that an average of 30 requests/second was achieved.
Because some bangers were not functioning as they were supposed to, this test is not entirely reliable as a measure of the maximum number of connections the C230 can take. An educated guess would be that the C230 can handle about 70 requests/sec if it has to fetch all pages from remote servers, and about 100 requests/sec if it can serve the requests from cache.
The Wisconsin Proxy Benchmark (WPB) is a benchmark developed by Jussara Almeida and Pei Cao at the University of Wisconsin-Madison. Most proxy benchmarks measure only the speed at the client: how fast a client can retrieve urls from a proxy. The WPB also collects data from the server side, reducing the number of uncontrolled variables.
The benchmark consists of a number of clients, which request html-pages from a number of servers. Both clients and servers communicate with a master server, which controls the benchmark and collects the data afterwards. A benchmark run has two rounds: the first without caching and the second with caching enabled.
The benchmark collects information about latencies and hit rates. The benchmark model does not take into account DNS lookups, persistent connections and non-cacheable objects. Because the benchmark uses a limited number of servers and urls, the number of DNS lookups is limited. Moreover, the NetApp C230 has a DNS cache with a hit rate greater than 85%; together with a fast (100 Mbit) link to the nameservers, this should only introduce a small error.
The NetApp C230 uses persistent connections as much as possible. Persistent connections can give a dramatic performance increase, so it is unfortunate that the WPB doesn't model this behaviour. For a client performing a large number of requests for small objects (e.g. a buttonbar), the lack of persistent connections increases latency, because each request pays the connection setup time again.
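The difference is easy to demonstrate. A small sketch (host and path are placeholders, and the persistent case assumes the server keeps the connection open) fetching the same small object N times over one reused connection versus a fresh connection per request:

    # Illustrates the cost the WPB model ignores: without persistent
    # connections every request pays the TCP setup time again.

    import http.client
    import time

    HOST, PATH, N = 'www.example.com', '/button.gif', 20   # placeholders

    t0 = time.time()
    conn = http.client.HTTPConnection(HOST)     # one persistent connection
    for _ in range(N):
        conn.request('GET', PATH)
        conn.getresponse().read()
    conn.close()
    print('persistent:     %.2f s' % (time.time() - t0))

    t0 = time.time()
    for _ in range(N):
        c = http.client.HTTPConnection(HOST)    # new connection per request
        c.request('GET', PATH)
        c.getresponse().read()
        c.close()
    print('non-persistent: %.2f s' % (time.time() - t0))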
The WPB measures the performance for static html documents. Typically cgi-scripts, documents with cookies and other 'active' content are not cached. This reduces the efficiency of the cache, so the benchmarked hit rate will be too optimistic.
The urls that the benchmark uses are not real-world urls, which may make it difficult to translate the results to the real world.
Benchmark procedure
The benchmark can be downloaded from the WPB homepage at http://www.cs.wisc.edu/~cao/wpb1.0.html. This page also contains information about how the benchmark works and some recommendations for the benchmark setup.
Initial testing indicated a number of issues.
In order to generate the load on the proxy we used another program, called CacheFlow, with which we simulated a number of clients. At the same time the WPB was run with 5 server processes and 10 client processes. For the non-caching run, the server urls had been marked non-cacheable; after the first run the urls were marked cacheable again and the benchmark was run once more. The client load was then changed using CacheFlow, and the WPB was configured to listen on other ports to make sure no objects were in the cache. The client load was varied between 16 and 128 concurrent clients, all fetching urls from a list as fast as possible. The machines running the WPB were positioned in such a way that they didn't interfere with the load-generating clients.
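The mechanism for marking objects non-cacheable is ordinary HTTP. As a sketch (which headers WPB really sets is an assumption here), a server response carrying no-cache headers will be passed through the proxy instead of being stored:

    # Sketch of a benchmark server marking its objects non-cacheable.
    # Any of these standard headers tells a proxy not to store the reply.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b'<html>benchmark page</html>'
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.send_header('Pragma', 'no-cache')          # HTTP/1.0 caches
            self.send_header('Cache-Control', 'no-cache')   # HTTP/1.1 caches
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(('', 8080), Handler).serve_forever()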
Results
    Clients | Caching disabled (latency, s) | Caching enabled (latency, s) | Hit ratio (%)
            | 1st run | 2nd run             | 1st run | 2nd run            | Hits | Bytes
         16 |    3.47 |    3.44             |    3.50 |    1.91            | 24.2 |  22.3
         32 |    3.45 |    3.43             |    3.27 |    1.79            | 27.5 |  21.7
         64 |    3.44 |    3.49             |    3.55 |    3.01            |  7.5 |   3.8
        128 |    3.42 |    3.40             |    3.40 |    3.12            |  6.3 |   0.4
From the table above it is clear that the NetApp C230 is saturated with more than 64 load-generating clients. This can be seen from the increase in latency and the drop in hit rate. With this number of clients the NetApp constantly ran at 100% CPU usage, doing 220,000 urls per hour. The network load during the test was between 700 and 800 kB/s. The lights indicated quite some disk activity, but the disk throughput was a bit lower than the network throughput.
The results indicate that the NetApp C230 is not able to saturate a 10 Mbit link. Possible bottlenecks are the CPU, the drive array, or a combination of the two. The most probable bottleneck seems to be the CPU of the NetApp C230, a mere Pentium 90; as the constant 100% CPU usage shows, this CPU runs out of breath under heavy load. In contrast, its bigger brother, the C630, is equipped with a 533 MHz DEC Alpha, moving the bottleneck from the CPU to the disks.