Site Monitoring

Glossary L to R

Log File Archive

Archiving Access Activity

Many web servers keep their raw log files only for a limited period, yet this information may be crucial in determining what happened at a particular time. If the web server crashes or the log files get lost, you need the security of an off-site log backup. Hosting companies may well back up the web site files but not the logs: the logs can grow very large, so backing them up is an expensive option for them to provide. Some servers keep only the previous month or week of logs on the site before they are replaced by the next period's.

All web servers store information about the accesses made to the site in a log file archive. This enables a web master to investigate problems. Standard web hosting companies generate their 'free' access statistics by processing these log files to work out what is being accessed on the site, and when.
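
As a simple illustration of off-site backup, the hypothetical sketch below pulls the current access log from a hosting account over FTP and stores a dated copy locally. The host name, credentials and remote path are assumptions and will differ for every hosting company.

    # Hypothetical sketch: copy the server's access log to a local, dated backup over FTP.
    # The host, credentials and remote file name are placeholders for illustration only.
    from ftplib import FTP
    from datetime import date

    REMOTE_LOG = "logs/access.log"          # assumed location of the raw log on the host
    LOCAL_COPY = f"access-{date.today().isoformat()}.log"

    with FTP("ftp.example.com") as ftp:     # placeholder host name
        ftp.login("username", "password")   # placeholder credentials
        with open(LOCAL_COPY, "wb") as out:
            # RETR downloads the remote file in binary mode, block by block
            ftp.retrbinary(f"RETR {REMOTE_LOG}", out.write)

    print(f"Saved off-site copy as {LOCAL_COPY}")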

Log Files

Web Server Log Files

A web site host runs a special service that manages the HTTP protocol. HTTP is the standard way that HTML pages are transferred to browsers over the Internet. Each HTTP request received by the server is typically a request for the contents of a particular web page or graphics file. The server logs all these requests, with each new request or 'hit' recorded as a separate line. The server log file is the source of the information used to generate the web site statistics offered by most web hosting companies. It records the date and time, the source IP address, the data requested, the referring web page and the browser used. The referral data gives vital information about the links which people are using to reach a web site; it is the full URL of the page that linked to the site, and quite often this will include the keywords typed into a search engine to find the web site.


Here is a sample line from a log file :
69.125.123.175 - - [29/Jan/2006:01:32:03 -0600] "GET /africa.htm HTTP/1.1" 200 5456 "http://search.yahoo.com/search?_adv_prop=web&x=op&ei=UTF-8&prev_vm=p&va=scams&va_vt=any&vp=south+africa&vp_vt=any&vo=sigcau&vo_vt=any" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

The meaning of each of these fields is as follows :

69.125.123.175 : The IP address of the person or robot requesting the data
- : The identity of the remote user as reported by the identd service (rarely available, so none was given)
- : The authenticated user name used to access the resource (in this case none was given)
[29/Jan/2006:01:32:03 -0600] : The date and time that the request occurred, including the time zone offset (6 hours behind UTC in this case)
GET : The HTTP access method for the resource; GET means read the whole resource
/africa.htm : Identifies the resource to be fetched, in this case an HTML page called africa.htm located in the root folder
HTTP/1.1 : The protocol and version used to fetch the resource, in this case version 1.1 of HTTP
200 : The HTTP status code returned, see errors for details of these codes
5456 : The size of the data returned in bytes
http://search.yahoo.com/... : The page which linked to the resource. This is the referrer field, in this case from the Yahoo! search engine.
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) : Identifies the browser or robot that accessed the page. In this case it is Internet Explorer® running on Windows® XP.
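
As an illustration, the sketch below uses a regular expression to split a line in the format shown above into named fields. It is a minimal sketch assuming the 'combined' log layout; the group names are chosen purely to mirror the table.

    import re

    # Pattern for one line of the 'combined' log format illustrated above (an
    # assumption; adjust it if your server writes a different layout).
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
        r'\[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
        r'(?P<status>\d{3}) (?P<size>\d+|-) '
        r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
    )

    sample = ('69.125.123.175 - - [29/Jan/2006:01:32:03 -0600] '
              '"GET /africa.htm HTTP/1.1" 200 5456 '
              '"http://search.yahoo.com/search?va=scams&vp=south+africa" '
              '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

    match = LOG_PATTERN.match(sample)
    if match:
        for field, value in match.groupdict().items():
            print(f"{field:10}: {value}")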

See also : RFC2616 : Hypertext Transfer Protocol HTTP/1.1
How HTTP works

Log File Formats

Logging accesses to Web sites

All web servers store information about the accesses made to the site in a log file. This enables a web master to investigate problems and allows statistics to be generated showing what is being accessed on the site, and when.

There are two main formats in use: different servers (Microsoft® IIS, Apache®, ...) use different log formats. However, they are all text based, and each access to a resource (HTML page or graphics file) is recorded as a single line of information (a hit). A web site administrator can usually control how much information goes into the log file, as the full set of information is rather large. Each log file record may contain the following fields of information :

Date and Time : Date and time of the record, possibly including time zone information
Client IP : The IP address of the requesting client agent (may not be the same as used by previous accesses)
User : The authenticated name used to access the site (typical access is anonymous)
Server Site : The web site name of the web server
Host Site : The domain name of the site being hosted
Computer : The name of the computer running the web server (for large sites multiple computers may service the same web address)
Server IP : The IP address of the web server
Method : The type of request - typically a GET to fetch information for display
Url : The URL address of the information being requested from the web site
Query : Any query string associated with the request (typically follows a ? in the URL)
Status : The server's response code sent back to the client (200 for success, 404 for not found etc.)
Req Size : Size of the request issued by the client, in bytes
Resp Size : Size of the response (typically a file) sent to the client in answer to the request
Resp Time : How long the server took to process the request
Port : The machine port number used to access the server, typically 80 for HTTP
Protocol : The protocol used by the client, typically HTTP/1.0 or HTTP/1.1
Browser : A long string describing the type of browser the client used to issue the request, often including platform information
Cookie : Any associated cookie value submitted by the client
Referrer : Where the client's request came from, either an external site or an internal page reference.
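
The W3C extended format used by IIS lists the chosen fields in a #Fields directive at the top of the file. The sketch below pairs that directive with each data line; the file name is a placeholder and the exact field names depend on how the administrator configured the log.

    # Sketch of reading a W3C extended format log, where a '#Fields:' directive
    # names the columns present in each data line. The file name is a placeholder.
    def read_w3c_log(path):
        fields = []
        with open(path, encoding="utf-8") as log:
            for line in log:
                line = line.rstrip("\n")
                if line.startswith("#Fields:"):
                    # e.g. "#Fields: date time c-ip cs-method cs-uri-stem sc-status"
                    fields = line.split()[1:]
                elif line and not line.startswith("#"):
                    # Pair each value with the field name declared above
                    yield dict(zip(fields, line.split()))

    for record in read_w3c_log("ex060129.log"):   # placeholder file name
        print(record.get("c-ip"), record.get("cs-uri-stem"), record.get("sc-status"))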

Some servers will archive log files in compressed format (e.g. ZIP, GZ or CAB).
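
If the archived logs arrive gzip-compressed, they can be read directly without unpacking them first; a minimal sketch, assuming a .gz file name :

    import gzip

    # Read a gzip-compressed log without unpacking it to disk first.
    # The file name is an assumption for illustration.
    with gzip.open("access_log-2006-01.gz", "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            print(line.rstrip())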

Log file format specification
A log file fragment explained

MIME

Specifying information format

The HTTP protocol transfers information around in binary format; it is up to the client and server to negotiate so that the client (typically a browser) is only sent information that it can understand. This negotiation is carried out using MIME (Multipurpose Internet Mail Extensions) types.

As the acronym suggests, this was originally developed to describe the content of email messages but is now much more widely used within HTTP. It uses a simple two-part text description of the content format, consisting of a type and a subtype. So text/html indicates that the content is basically text but in HTML format, text/plain is raw untagged text (as in a .txt file) and image/jpeg indicates a graphics file in JPEG image format.

When a browser requests data it states, in its Accept header, the MIME types it is willing to receive as a response; the server then chooses an available format for the response from these types.
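
On the server side the MIME type is usually derived from the file's extension. Python's standard mimetypes module illustrates the mapping; the file names below are just examples.

    import mimetypes

    # Map file extensions to the MIME type a server would send in the
    # Content-Type response header. File names here are only examples.
    for name in ("index.html", "notes.txt", "photo.jpg", "logo.png"):
        mime_type, encoding = mimetypes.guess_type(name)
        print(f"{name:12} -> {mime_type}")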

The MIME Information Page
Media Types
RFC2046 : Multipurpose Internet Mail Extensions 2 : Media Types

Ping

Using Ping to check a Server is working

One important facet of site monitoring is knowing as soon as possible that servers have failed or are not accessible. A web server provides multiple services, not just HTTP; just because a server is not responding to HTTP requests does not imply it is not functioning at all.

The simplest means of establishing whether a site or server is alive is to use the Internet Control Message Protocol (ICMP) to Ping a server. This is a much simpler request than fetching an HTML page in terms of communication overheads. It runs over the IP protocol and so checks that the IP part of TCP/IP is functioning correctly. The same protocol is also used by the tracert command line utility to find the route that communication is taking to a server.

The Ping connectivity check works well in an office Intranet situation too; it can regularly monitor whether the key servers and workstations are responding properly to IP traffic within an office local area network.
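
A minimal monitoring sketch is shown below. It shells out to the operating system's own ping command rather than crafting raw ICMP packets (which normally needs administrator rights); the host list is an assumption, and the -c flag shown is the Unix form (Windows uses -n).

    import subprocess

    # Minimal ping monitor: ask the OS ping command for one echo per host.
    # Host names are placeholders; '-c' is the Unix count flag (use '-n' on Windows).
    HOSTS = ["www.example.com", "192.168.1.10"]

    for host in HOSTS:
        try:
            result = subprocess.run(["ping", "-c", "1", host],
                                    capture_output=True, text=True, timeout=10)
            alive = result.returncode == 0
        except subprocess.TimeoutExpired:
            alive = False
        print(f"{host:20} {'alive' if alive else 'NOT RESPONDING'}")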

The protocol supports a number of commands, but the ECHO command is the one of interest for Ping monitoring. It instructs routers to pass the message over IP to a particular destination IP address, requesting an ECHO REPLY to be sent back. Measuring the time between issuing the ECHO and receiving the ECHO REPLY determines the responsiveness of the remote server. The ICMP echo reply includes a Time to Live (TTL) value, which indicates the number of router hops that the message has gone through from the source. Normally the packet starts off with a TTL value of 255 and each router it passes through decrements the value by one, so a reply arriving with a TTL of 247 suggests eight hops. If the number of hops is erratic or suddenly becomes large this indicates a router problem.

The same ECHO command can be used to trace a route over the Internet (as used in the tracert program). In this case the protocol's TTL field is used to limit how many hops between routers the request can make before it fails. If the limit is reached, a failure response is returned carrying the IP address of the most distant router reached on the path. By iterating over increasing TTL values until the destination server is reached, all the routers along the way can be identified. By inspecting the time delay between reaching successive routers, bottlenecks along the communication path can be easily identified.

See also RFC792 : Internet Control Message Protocol
Guide to Ping and Tracert
Ping Monitoring Explained

Port

Connecting to the correct Service Port

Each IP address can be accessed on a range of numeric port numbers. The port number requested is part of a client connect request and can be specified as part of a URL. When a URL omits the port number, the default port number for that service is assumed (80 for the HTTP web service). Ports map onto inter-communicating sockets: when a server socket is set up it chooses a port number on which to listen for requests (as part of the bind socket API call), and the client issues a connect to a server giving an IP address and a port.

In most cases the port number is assigned to a particular service, so the number is really acting as a name for the service that is required. For most web users the only ports of everyday interest are the ones used for HTTP and FTP; a small sketch of checking whether a port is accepting connections follows the table below.

A more comprehensive list of standard ports is as follows :

Port : Name : Description
7 : echo : simple echo service
13 : daytime : find out the server's clock setting
21 : ftp : file transfer protocol
22 : ssh/sftp : secure shell, over which secure file transfer (SFTP) runs
23 : telnet : terminal access service
25 : smtp : email service
42 : nameserver : legacy host name lookup service
43 : whois : who is service
53 : DNS : domain name lookup service
70 : gopher : predecessor to HTTP
80 : WWW : world wide web (HTTP)
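
As mentioned above, here is a minimal sketch of a port check using Python's standard socket module; the host name and port list are assumptions for illustration.

    import socket

    # Try to open a TCP connection to each port and report whether the
    # service is accepting connections. Host and ports are placeholders.
    HOST = "www.example.com"

    for port in (21, 25, 80):
        try:
            # create_connection resolves the name, connects and times out after 5 s
            with socket.create_connection((HOST, port), timeout=5):
                print(f"port {port:5} open")
        except OSError:
            print(f"port {port:5} closed or filtered")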

Proxy

Indirect proxy access to the Internet

Originally each computer wishing to use the Internet had to connect to it directly. This is fine for servers or a home user dialing up for a connection, but not convenient for an office environment where hundreds of PCs may want to use the Internet at the same time. To solve this problem proxy servers are used. These servers have a dedicated Internet connection and make requests on behalf of (as a proxy for) all the computers wishing to access the Internet through them.

Most browsers have connection settings that allow you to configure the IP address and port used to communicate with a proxy server. HTTP requests are then sent to the proxy server using TCP/IP, and it in turn sends them out onto the Internet. A proxy server may run on a separate machine (often in conjunction with a firewall) or as an ordinary program running on a PC. It needs to keep track of all client requests so that it can route the responses sent back to it from the remote server to the browser that originally requested the information.
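
Programs other than browsers can be pointed at a proxy in the same way; a minimal sketch using Python's standard urllib, where the proxy address is an assumption :

    import urllib.request

    # Route HTTP requests through a proxy; the proxy address is a placeholder.
    proxy = urllib.request.ProxyHandler({"http": "http://proxy.example.com:8080"})
    opener = urllib.request.build_opener(proxy)

    with opener.open("http://www.example.com/") as response:
        print(response.status, response.getheader("Content-Type"))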

Query

Getting information from a user

Queries are an important part of HTTP. They are used to pass additional information to a server about the data requested. The most widespread usage is when an HTML form is submitted: with a GET request the values entered on the form are sent as a query string tagged onto the end of the URL (a POST carries them in the request body instead). Search engines such as Google use this mechanism to send the search phrase or keywords that the user typed in when the Search button is clicked. Each web server is free to use whichever keywords it likes in the query string; there are few constraints it has to follow.
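
A small sketch of building and picking apart a query string with Python's standard urllib.parse; the parameter names are chosen purely for illustration.

    from urllib.parse import urlencode, urlparse, parse_qs

    # Build a query string from name/value pairs (illustrative parameter names)
    query = urlencode({"va": "scams", "vp": "south africa"})
    url = "http://search.example.com/search?" + query
    print(url)                      # .../search?va=scams&vp=south+africa

    # Pick the query string apart again at the server end
    parsed = urlparse(url)
    print(parse_qs(parsed.query))   # {'va': ['scams'], 'vp': ['south africa']}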

Rank

Web Site Ranking

There are a number of Internet services that attempt to rank web sites in some sort of popularity order. As there are about 500 million web sites this is not an easy task and relative ranking scores are not to be totally trusted.

For example, on one particular day the top web sites according to Alexa were Yahoo!, MSN, Google, Passport, EBay and Microsoft.

It is not possible to look directly at individual web site traffic statistics in order to gauge popularity, because the server logs are not publicly accessible. Instead, services such as Google and Alexa build up statistics from their browser toolbars: each time the toolbar is used to reach a web site, the click is recorded and added to their database. This makes all ranking measures rather inaccurate, and it is best to treat them as a very rough indication of relative popularity only.

Google also includes a PageRank figure as a rough estimate (a score out of ten) of the importance of a page. Most web sites manage to score between 3 and 6 out of ten. Only very large and very popular web sites score 8 or more (currently CNN scores 8/10 and BBC News scores 9/10). View with suspicion any page with a rank below 3.

Referrals

Looking at how visitors find your site

Do you know how people are reaching your web site?

When other web sites contain links to your site, you get a referral each time such a link is followed. This is usually a person clicking on a link in a browser, but it can also be an automated scanning engine or robot.

It is important to monitor the number of referrals coming to a site so that the site content, and therefore the keywords, can be adjusted to attract more visitors. It may be that people are coming to the site for the wrong reason and are not going to stay or come back again.

When a user follows an HTML link to get from one web page to another, the web server typically stores how the page was referred to (whether from a local or a remote page). Tracking how people get referred to a web site is crucial to measuring its effectiveness. Are people getting to the information quickly and easily? Are they only looking at one page and then leaving the web site? Which keywords and phrases are people using to reach your site?

Some firewalls give the option to remove the referral information in order to give the user more anonymity; if a web site needs to be certain of where it has been referenced from, it needs to include this as part of the query portion of the URL.

Referral monitoring allows the sources of external web traffic to be identified. Typically Google will be the largest source of referrals, as most people find web sites through a search engine, and Google is by far the most popular one at present.

By using referrer information (a short sketch of extracting it from a log follows the list) you can work out :

  • When another site has added a link to one of your pages.
  • When a site gets included in search engine databases.
  • Which Keywords are being used to find your site on search engines.
  • When your site is mentioned in an online forum or blog.
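
A minimal sketch, assuming the combined log format shown earlier, that pulls out the referrer field and, for search engine referrals, the keywords carried in its query string; the keyword parameter names are assumptions for illustration.

    import re
    from urllib.parse import urlparse, parse_qs

    # The referrer is the second-to-last quoted string in a combined format line.
    REFERRER_RE = re.compile(r'"([^"]*)" "[^"]*"$')

    def referrer_keywords(log_line):
        """Return (referring host, search keywords) for one log line, if any."""
        match = REFERRER_RE.search(log_line)
        if not match or match.group(1) in ("", "-"):
            return None
        referrer = urlparse(match.group(1))
        query = parse_qs(referrer.query)
        # 'q' is the keyword parameter used by several search engines; 'p' and 'va'
        # are examples used by Yahoo! - treat these names as assumptions.
        words = query.get("q") or query.get("p") or query.get("va") or []
        return referrer.hostname, " ".join(words)

    line = ('69.125.123.175 - - [29/Jan/2006:01:32:03 -0600] "GET /africa.htm HTTP/1.1" '
            '200 5456 "http://search.yahoo.com/search?va=scams&vp=south+africa" '
            '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
    print(referrer_keywords(line))   # ('search.yahoo.com', 'scams')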

RFC

Internet Standards and Proposals

The technical standards that govern how the various parts of the Internet function are documented as Request For Comment (RFC) documents. Although the name suggests that these are just early proposals, these documents include actual working standards for much of Internet technology. Some RFCs are experimental and some have been entirely superseded so it is important to refer to the appropriate RFC.

They are co-ordinated by committees of Internet professionals. There are over three thousand RFCs in existence, covering all aspects of the Internet. A good starting point is the Internet Engineering Task Force (IETF) web site, www.ietf.org. Copies of the RFCs are held on several sites, including the IETF and the World Wide Web Consortium (W3C).

Here is a list of commonly referenced RFCs, but be warned that their technical nature can make them tough reading :

RFC777 : Internet Control Message Protocol
RFC791 : Internet Protocol Specification
RFC792 : Internet Control Message Protocol
RFC959 : File Transfer Protocol [FTP]
RFC1034 : Domain Names - Concepts and Facilities
RFC1035 : Domain Names - Implementation and Specification
RFC1180 : A TCP/IP Tutorial
RFC2046 : Multipurpose Internet Mail Extensions 2 : Media Types
RFC2133 : Basic Socket Interface Extensions for IPv6
RFC2616 : Hypertext Transfer Protocol HTTP/1.1
RFC2660 : The Secure HyperText Transfer Protocol

Robot

Automated Internet Scan

On the Internet a robot is not some mechanical human-like servant but just a special type of computer program. Ever wondered how search engines build up their indices of web sites? Search engines, amongst other programs, use robots to continually trawl web sites, analyzing the contents as if they were human visitors using a browser. They use HTTP just like browsers in order to access information. A server can state which pages should (and should not) be inspected by robots; the instructions are stored in the robots.txt file. Over time, search engines have grown much more sophisticated and the way that they scan sites is complex. Many will first scan the site's index page, coming back after weeks or months to drill down and scan the rest of the web site.

The server log usually includes a browser field in each record indicating the name of the robot. No well-behaved robot will flood a server with requests, as this would affect the server's performance; they spread their site scan over hours or days. Robots should include a contact URL or email address in the browser information in the HTTP request header so that a web master can analyze the activity of robots.
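
The robots.txt rules can also be checked programmatically. Python's standard urllib.robotparser shows the idea; the site URL and user agent name below are placeholders.

    from urllib import robotparser

    # Read a site's robots.txt and ask whether a given robot may fetch a page.
    # The URL and user agent name below are placeholders for illustration.
    parser = robotparser.RobotFileParser()
    parser.set_url("http://www.example.com/robots.txt")
    parser.read()

    print(parser.can_fetch("ExampleBot/1.0", "http://www.example.com/africa.htm"))
    print(parser.can_fetch("ExampleBot/1.0", "http://www.example.com/private/"))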