CMPSC 311, Fall 2019
Proxy Lab: Writing a Sequential Caching Web Proxy
Assigned: Monday, Dec 2nd, 2019
Due: Sunday, Dec 15th, 11:00 PM
A Web proxy is a program that acts as a middleman between a Web browser and an end server. Instead of
contacting the end server directly to get a Web page, the browser contacts the proxy, which forwards the
request on to the end server. When the end server replies to the proxy, the proxy sends the reply on to the
browser.
Proxies are useful for many purposes. Sometimes proxies are used in firewalls, so that browsers behind a
firewall can only contact a server beyond the firewall via the proxy. Proxies can also act as anonymizers:
by stripping requests of all identifying information, a proxy can make the browser anonymous to Web
servers. Proxies can even be used to cache web objects by storing local copies of objects from servers and then
responding to future requests by reading them out of the cache rather than by communicating again with
the servers.
In this lab, you will write a simple HTTP proxy that caches web objects. For the first part of the lab, you will
set up the proxy to accept incoming connections, read and parse requests, forward requests to web servers,
read the servers’ responses, and forward those responses to the corresponding clients. This first part will
involve learning about basic HTTP operation and how to use sockets to write programs that communicate
over network connections. In the second part, you will add caching to your proxy using a simple main
memory cache of recently accessed web content.
This is an individual project. Please refrain from looking up solutions for similar projects online.
3 Handout instructions
Download the proxylab-handout.tar file from Canvas. Copy the handout file to a protected directory
on the Linux machine where you plan to do your work, and then issue the following command:
linux> tar xvf proxylab-handout.tar
This will generate a handout directory called proxylab-handout. The README file describes the
various files.
4 Part I: Implementing a sequential web proxy
The first step is implementing a basic sequential proxy that handles HTTP/1.1 GET requests. Other request
types, such as POST, are strictly optional.
When started, your proxy should listen for incoming connections on a port whose number will be specified
on the command line. Once a connection is established, your proxy should read the entirety of the request
from the client and parse the request. It should determine whether the client has sent a valid HTTP request;
if so, it can then establish its own connection to the appropriate web server and request the object the client
specified. Finally, your proxy should read the server’s response and forward it to the client.
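As a rough illustration of the first step, the sketch below opens a listening socket on a given port. It is a minimal IPv4-only example, not a required design; the function name open_listen_socket and the backlog of 16 are assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a TCP socket bound to the given port on all
 * local IPv4 addresses and put it in the listening state.
 * Returns the listening descriptor, or -1 on failure.
 */
int open_listen_socket(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Allow the port to be reused right after the proxy is restarted. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A sequential proxy would then call accept() on the returned descriptor in a loop, serve one client at a time, and close each connection descriptor before accepting the next.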
4.1 HTTP/1.1 GET requests
When an end user enters a URL such as http://web.mit.edu/index.html into the address bar of a web
browser, the browser will send an HTTP request to the proxy that begins with a line that might
resemble the following:
GET http://web.mit.edu/index.html HTTP/1.1
In that case, the proxy should parse the request into at least the following fields: the hostname, web.mit.edu;
and the path or query and everything following it, /index.html. Use the parse url function from
hw9. That way, the proxy can determine that it should open a connection to web.mit.edu and send an
HTTP request of its own starting with a line of the following form:
GET /index.html HTTP/1.0
Note that all lines in an HTTP request end with a carriage return, ‘\r’, followed by a newline, ‘\n’. Also
important is that every HTTP request is terminated by an empty line: "\r\n".
You should notice in the above example that the web browser’s request line ends with HTTP/1.1, while
the proxy’s request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 requests, but
your proxy should handle them and forward them as HTTP/1.0 requests.
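The rewrite from the browser's request line to the proxy's request line can be sketched as follows. This is only an illustration: rewrite_request_line is a hypothetical helper, it does a simplified URL split in place of the URL-parsing function from hw9, and it leaves any :port suffix attached to the host. It returns -1 on malformed input rather than aborting.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>  /* strncasecmp */

/*
 * Hypothetical helper: split a browser request line of the form
 * "GET http://host/path HTTP/1.1" and write the line the proxy should send
 * to the end server into out. The host part (possibly with ":port") is
 * copied into host. Returns 0 on success, -1 on any malformed input.
 */
int rewrite_request_line(const char *line,
                         char *host, size_t hostlen,
                         char *out, size_t outlen)
{
    char method[16], url[2048], version[16];
    const char *authority, *slash;
    size_t hlen;
    int n;

    if (sscanf(line, "%15s %2047s %15s", method, url, version) != 3)
        return -1;                      /* not "METHOD URL VERSION" */
    if (strcmp(method, "GET") != 0)
        return -1;                      /* only GET is handled here */
    if (strncasecmp(url, "http://", 7) != 0)
        return -1;                      /* expect an absolute http URL */

    authority = url + 7;
    slash = strchr(authority, '/');
    hlen = slash ? (size_t)(slash - authority) : strlen(authority);
    if (hlen == 0 || hlen >= hostlen)
        return -1;
    memcpy(host, authority, hlen);
    host[hlen] = '\0';

    /* The version sent by the browser is ignored: always forward HTTP/1.0. */
    n = snprintf(out, outlen, "GET %s HTTP/1.0\r\n", slash ? slash : "/");
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

For the request line shown earlier, this yields the host web.mit.edu and the outgoing line GET /index.html HTTP/1.0 followed by \r\n.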
It is important to consider that HTTP requests, even just the subset of HTTP/1.0 GET requests, can be
incredibly complicated. The textbook describes certain details of HTTP transactions, but you should refer
to RFC 1945 for the complete HTTP/1.0 specification. Ideally your HTTP request parser will be fully
robust according to the relevant sections of RFC 1945, except for one detail: while the specification allows
for multiline request fields, your proxy is not required to properly handle them. Of course, your proxy
should never prematurely abort due to a malformed request.
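Because every request ends with an empty line, the proxy can tell when it has read a complete request. A minimal sketch, assuming the bytes read so far are kept in a NUL-terminated buffer (request_head_length is a hypothetical name):

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns the length in bytes of the request line plus headers, including
 * the terminating empty line, or 0 if the empty line has not arrived yet.
 * buf must be NUL-terminated.
 */
size_t request_head_length(const char *buf)
{
    const char *end = strstr(buf, "\r\n\r\n");

    return end ? (size_t)(end - buf) + 4 : 0;
}
```

A return value of 0 means the proxy should keep reading from the client (or give up if the buffer is full) instead of parsing a partial request.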
4.2 Request headers
The important request headers for this lab are the Host, User-Agent, Connection, and Proxy-Connection
headers:
• Always send a Host header. While this behavior is technically not sanctioned by the HTTP/1.0
specification, it is necessary to coax sensible responses out of certain Web servers, especially those
that use virtual hosting.
The Host header describes the hostname of the end server. For example, to access
http://mit.edu/index.html, your proxy would send the following header:
Host: mit.edu
It is possible that web browsers will attach their own Host headers to their HTTP requests. If that is
the case, your proxy should use the same Host header as the browser.
• You may choose to always send the following User-Agent header:
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.3)
The header is provided on two separate lines because it does not fit as a single line in the writeup, but
your proxy should send the header as a single line.
The User-Agent header identifies the client (in terms of parameters such as the operating system
and browser), and web servers often use the identifying information to manipulate the content they
serve. Sending this particular User-Agent: string may improve, in content and diversity, the material
that you get back during simple telnet-style testing.
• Always send the following Connection header:
Connection: close
• Always send the following Proxy-Connection header:
Proxy-Connection: close
The Connection and Proxy-Connection headers are used to specify whether a connection
will be kept alive after the first request/response exchange is completed. It is perfectly acceptable
(and suggested) to have your proxy open a new connection for each request. Specifying close as
the value of these headers alerts web servers that your proxy intends to close connections after the
first request/response exchange.
• There are two other headers, If-Modified-Since and If-None-Match, that you should skip if
the browser generated them. These headers support caching done by the
browser; when they are included, the server may return a 304 Not Modified status instead of a
200 OK status, which will cause the cache in our proxy server to not function properly. Simply
eliminating these headers solves the problem.
Also keep in mind that, when analyzing request headers from the browser, header names are case-insensitive:
different browsers might capitalize the same field name differently. So your parsing should
do comparisons that are also case-insensitive.
To make your headers work, you will have to skip the browser-supplied headers for Connection, User-Agent,
and Proxy-Connection. You should also check whether the request headers coming from the browser
contain Host; if they do, use it; if they don’t, make sure you add the Host header.
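One way to apply these rules is a case-insensitive test on each header line's field name. The sketch below is illustrative only; header_is and should_skip_header are hypothetical names, and each line is assumed to start at the field name.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strncasecmp */

/*
 * Returns 1 if the header line's field name equals name, ignoring case.
 * The name must be followed immediately by ':' so that, for example,
 * "Connection-Foo:" does not match "Connection".
 */
int header_is(const char *line, const char *name)
{
    size_t n = strlen(name);

    return strncasecmp(line, name, n) == 0 && line[n] == ':';
}

/* Returns 1 if the proxy should drop this browser-supplied header line. */
int should_skip_header(const char *line)
{
    static const char *const skipped[] = {
        "Connection", "Proxy-Connection", "User-Agent",
        "If-Modified-Since", "If-None-Match"
    };
    size_t i;

    for (i = 0; i < sizeof(skipped) / sizeof(skipped[0]); i++)
        if (header_is(line, skipped[i]))
            return 1;
    return 0;
}
```

The same header_is test can be used to detect whether the browser supplied its own Host header, for example header_is(line, "Host").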
For your convenience, the value of the described User-Agent header is provided to you as a string
constant in proxy.c.
Finally, if a browser sends any additional request headers as part of an HTTP request, your proxy should
forward them unchanged.
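Putting the header rules of this section together, the outgoing header block might be assembled as in the sketch below. The function name build_request_headers is hypothetical, and user_agent_hdr here is a placeholder standing in for the constant supplied in proxy.c. The extra argument is assumed to hold the browser's remaining headers, already filtered, each ending in \r\n.

```c
#include <stdio.h>
#include <string.h>

/* Placeholder only: the real string is the constant provided in proxy.c. */
static const char *user_agent_hdr = "User-Agent: placeholder\r\n";

/*
 * Writes the headers the proxy sends to the end server, followed by the
 * empty line that terminates the request.
 *   host  - value for the Host header (the browser's own value if it
 *           sent one)
 *   extra - other browser headers to forward unchanged, each ending in
 *           "\r\n"; may be the empty string
 * Returns the number of bytes written, or -1 if the arguments are
 * unusable or the result does not fit in out.
 */
int build_request_headers(const char *host, const char *extra,
                          char *out, size_t outlen)
{
    int n;

    if (host == NULL || extra == NULL || strlen(host) == 0)
        return -1;

    n = snprintf(out, outlen,
                 "Host: %s\r\n"
                 "%s"
                 "Connection: close\r\n"
                 "Proxy-Connection: close\r\n"
                 "%s"
                 "\r\n",
                 host, user_agent_hdr, extra);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return n;
}
```

The request line is sent before this block, so the complete request is the request line, these headers, and the final empty line.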
4.3 Port numbers
There are two significant classes of port numbers for this lab: HTTP request ports and your proxy’s listening
ports.
The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be of
the form http://cse-cmpsc311.cse.psu.edu:8080, in which case your proxy should connect
to the host cse-cmpsc311.cse.psu.edu on port 8080 instead of the default HTTP port, which is port
80. Your proxy must properly function whether or not the port number is included in the URL.
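A minimal sketch of handling the optional port, assuming the host part of the URL has already been isolated as a string such as cse-cmpsc311.cse.psu.edu:8080. The name split_host_port is hypothetical, and IPv6 literal addresses are not handled.

```c
#include <stddef.h>
#include <string.h>

/*
 * Splits "host" or "host:port" into separate host and port strings.
 * When no port is present, the default HTTP port "80" is used.
 * Returns 0 on success, -1 if the host is empty or a buffer is too small.
 */
int split_host_port(const char *hostport,
                    char *host, size_t hostlen,
                    char *port, size_t portlen)
{
    const char *colon = strchr(hostport, ':');
    size_t hlen = colon ? (size_t)(colon - hostport) : strlen(hostport);
    const char *p = (colon && colon[1] != '\0') ? colon + 1 : "80";

    if (hlen == 0 || hlen >= hostlen || strlen(p) >= portlen)
        return -1;
    memcpy(host, hostport, hlen);
    host[hlen] = '\0';
    strcpy(port, p);  /* fits: length checked above */
    return 0;
}
```

Keeping the port as a string is convenient if the connection is later opened with getaddrinfo(), which takes the service as a string.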
The listening port is the port on which your proxy should listen for incoming connections. Your proxy
should accept a command line argument specifying the listening port number for your proxy. For example,
with the following command, your proxy should listen for connections on port 8081:
linux> ./proxy 8081
You may select any non-privileged listening port (greater than 1,024 and less than 65,536) as long as it
is not used by other processes. Since each proxy must use a unique listening port and many people will
simultaneously be working on each machine, the script port-for-user.pl is provided to help you
pick your own personal port number. Use it to generate a port number based on your user ID:
$ ./port-for-user.pl yuw17
The port, p, returned by port-for-user.pl is always an even number. So if you need an additional
port number, say for the Tiny server, you can safely use ports p and p + 1.
Please don’t pick your own random port. If you do, you run the risk of interfering with another user.
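The listening port given on the command line can be validated before it is used. The sketch below checks the non-privileged range described above; parse_listen_port is a hypothetical name.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Converts a command line argument to a listening port.
 * Returns the port, or -1 if arg is not a decimal integer that is
 * greater than 1,024 and less than 65,536.
 */
int parse_listen_port(const char *arg)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;                      /* empty, non-numeric, or overflow */
    if (value <= 1024 || value >= 65536)
        return -1;                      /* privileged or out of range */
    return (int)value;
}
```

In main(), a result of -1 would typically lead to a usage message and a non-zero exit status.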
5 Part II: Caching your Requests
For the second part of the lab, you will add a cache to your proxy that stores recently-used Web objects in
memory. HTTP actually defines a fairly complex model by which web servers can give instructions as to
how the objects they serve should be cached and clients can specify how caches should be used on their
behalf. However, your proxy will adopt a simplified approach.
When your proxy receives a web object from a server, it should cache it in memory as it transmits the object
to the client. If another client requests the same object from the same server, your proxy need not reconnect
to the server; it can simply resend the cached object.
Obviously, if your proxy were to cache every object that is ever requested, it would require an unlimited
amount of memory. Moreover, because some web objects are larger than others, it might be the case that
one giant object will consume the entire cache, preventing other objects from being cached at all. To avoid
those problems, your proxy should have both a maximum cache size and a maximum cache object size.
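One way to cache an object while it is being transmitted is to copy each forwarded chunk into a growing buffer and give up on the copy as soon as the object proves too large. The sketch below is illustrative: struct object_buf and its functions are hypothetical, and the limit is passed in as a parameter rather than taken from the lab's macros.

```c
#include <stdlib.h>
#include <string.h>

/* Accumulates a copy of one web object while it is being forwarded. */
struct object_buf {
    char *data;     /* bytes collected so far; NULL if none or abandoned */
    size_t len;     /* number of bytes in data */
    size_t limit;   /* largest object size worth caching */
    int abandoned;  /* set once the object is known to be too large */
};

void object_buf_init(struct object_buf *b, size_t limit)
{
    b->data = NULL;
    b->len = 0;
    b->limit = limit;
    b->abandoned = 0;
}

/* Discards the copy; forwarding to the client is unaffected. */
static void object_buf_abandon(struct object_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->abandoned = 1;
}

/* Records one chunk that was just forwarded to the client. */
void object_buf_append(struct object_buf *b, const char *chunk, size_t n)
{
    char *grown;

    if (b->abandoned || n == 0)
        return;
    if (n > b->limit - b->len) {        /* object would exceed the limit */
        object_buf_abandon(b);
        return;
    }
    grown = realloc(b->data, b->len + n);
    if (grown == NULL) {                /* out of memory: skip caching */
        object_buf_abandon(b);
        return;
    }
    memcpy(grown + b->len, chunk, n);
    b->data = grown;
    b->len += n;
}
```

When the server closes the connection, an object whose buffer was not abandoned is a candidate for insertion into the cache.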
5.1 Maximum cache size
The entirety of your proxy’s cache should have the following maximum size:
MAX_CACHE_SIZE = XXXXXXXXXX
When calculating the size of its cache, your proxy must only count bytes used to store the actual web objects;
any extraneous bytes, including metadata, should be ignored.
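The accounting rule can be sketched with a simple linked-list cache that tracks only object bytes. All names below are hypothetical, and both limits are passed in as parameters. As a simplifying assumption, the sketch refuses an object that does not fit instead of evicting older entries, and it does not check for duplicate keys.

```c
#include <stdlib.h>
#include <string.h>

struct cache_entry {
    char *key;                 /* identifies the object, e.g. its URL */
    char *data;                /* the object's bytes */
    size_t size;               /* number of bytes in data */
    struct cache_entry *next;
};

struct cache {
    struct cache_entry *head;
    size_t used;               /* sum of entry sizes; metadata not counted */
    size_t max_total;          /* limit on used */
    size_t max_object;         /* limit on a single entry's size */
};

void cache_init(struct cache *c, size_t max_total, size_t max_object)
{
    c->head = NULL;
    c->used = 0;
    c->max_total = max_total;
    c->max_object = max_object;
}

/* Returns the entry stored under key, or NULL if there is none. */
const struct cache_entry *cache_find(const struct cache *c, const char *key)
{
    const struct cache_entry *e;

    for (e = c->head; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

/* Stores a copy of data under key. Returns 0 on success, -1 otherwise. */
int cache_insert(struct cache *c, const char *key,
                 const char *data, size_t size)
{
    struct cache_entry *e;

    if (size > c->max_object || size > c->max_total - c->used)
        return -1;             /* too large, or no room (no eviction here) */

    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->key = malloc(strlen(key) + 1);
    e->data = malloc(size > 0 ? size : 1);
    if (e->key == NULL || e->data == NULL) {
        free(e->key);
        free(e->data);
        free(e);
        return -1;
    }
    strcpy(e->key, key);
    memcpy(e->data, data, size);
    e->size = size;
    e->next = c->head;
    c->head = e;
    c->used += size;           /* only the object's bytes are counted */
    return 0;
}
```

Note that the key string and the entry structure itself are never added to used, which matches the rule that metadata is ignored.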
5.2 Maximum object size
Your proxy should only cache web objects that do not exceed the following maximum size:
MAX_OBJECT_SIZE = XXXXXXXXXX
For your convenience, both size limits are provided as macros in cache.h. When