
Glossary A to E

Browsers

Surfing the Internet

A browser is a special program that displays web page information. The web pages can be located on a local computer or, more usually, on the Internet. The browser uses standard Internet protocols like HTTP to access information and display it. HTML pages are still the most common way of presenting information to a user.

Internally a browser will set up sockets to connect to and transfer information with the remote server using TCP/IP.

To access information, a URL is typed in to uniquely identify what is requested. The browser analyzes the text entered and then uses DNS to look up the IP address where the information can be accessed.
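As a rough illustration of the steps just described, the sketch below splits a URL into the host and port a browser would connect to and then asks the system resolver for the matching IP addresses. The function names host_and_port and resolve are made up for this example, and the default port table is an assumption covering only HTTP and HTTPS.

```python
import socket
from urllib.parse import urlsplit

# Assumed default ports for the two schemes handled in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return the (host, port) pair named by the URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    return parts.hostname, port


def resolve(url):
    """Look up the IP addresses for the URL's host (needs network access)."""
    host, port = host_and_port(url)
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry ends with a socket address whose first item is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("http://www.example.com/") on a connected machine would return whatever addresses the local name server currently reports for that host.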

For information on the most widely used browsers see statistics such as those accumulated at TheCounter.com.

The most widely used browsers are Microsoft® Internet Explorer, Firefox, Safari and Opera.

Web site monitoring

Cache

Getting information more quickly

Computer systems have used the caching technique since the very earliest days to speed up access to information. This is just a matter of keeping a local copy of data that may be needed in the near future rather than requiring slower access back to the original source. When browsing the Internet there are at least two levels of caching carried out. Firstly, the browser will keep copies of pages and graphics in a local cache on the PC, on the basis that many pages within a web site will share the same set of graphics. Secondly, all the computers on the route between the client and the server have an opportunity to cache information in the hope that it will be requested again by a wide range of different users.

Caching is good news for improving perceived access time but bad news if you want to be sure you are seeing up-to-date information from the server. If a page is dynamically updated you may want to see the server's current copy, not a cached out-of-date copy. The HTTP protocol defines a number of controls over how information should be cached (if at all). Rather than asking for the same information again, a browser can indicate that it is only interested in the information if it has changed since a particular time. If it has not changed, the server just returns a 304 Not Modified status code.
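The conditional request described above can be sketched from the server's side. This is a simplified illustration, not a full implementation of the HTTP caching rules: the function name conditional_status is invented for the example, and it only considers the If-Modified-Since date sent by the browser.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, otherwise 200.

    last_modified is a timezone-aware datetime for the resource;
    if_modified_since is the raw header value sent by the browser, if any.
    """
    if if_modified_since is None:
        return 200  # plain request: send the full page
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unreadable date: ignore the header
    if client_time.tzinfo is None:
        # Treat a date without a zone as GMT, the zone HTTP dates use.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200
```

A browser holding a copy fetched at noon would send that time back, and the server would answer 304 for as long as the page has not been modified since.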

In order to find a reasonable measure of access time, the Site Vigil program specifies that the information it requests should not be cached. The access time will then more accurately reflect how long it takes for the information to travel from the original web server to the PC running Site Vigil.
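One way a monitoring program could ask for an uncached copy is sketched below. This is not taken from Site Vigil itself; it simply assumes the standard Cache-Control and Pragma request headers are used, and the names build_uncached_request and timed_fetch are hypothetical.

```python
import time
import urllib.request

# Request headers asking caches along the route not to answer from a stored copy.
NO_CACHE_HEADERS = {"Cache-Control": "no-cache", "Pragma": "no-cache"}


def build_uncached_request(url):
    """Build a GET request that asks intermediaries not to use cached data."""
    return urllib.request.Request(url, headers=NO_CACHE_HEADERS)


def timed_fetch(url, timeout=10.0):
    """Fetch the URL and return (status, bytes received, seconds taken).

    Needs network access; the timing includes DNS lookup and connection setup.
    """
    request = build_uncached_request(url)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        status = response.status
    return status, len(body), time.perf_counter() - start
```

Caches are not obliged to honour these headers in every configuration, so the measured time is best treated as an approximation.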

Cookies

Tracking Internet Usage

The HTTP protocol is designed to work in a single request-response fashion, so a web server treats each request that it receives for a web site page as if it came from a different user. It does not maintain a 'session'. This is not ideal in situations where a user has entered some information and should not need to enter the same information again and again. This typically occurs when ordering goods: if you go back or refresh a page you wouldn't expect to have to enter the same data twice. Similarly, if a user has expressed a preference for particular options or an area of a web site, they might expect these settings to be restored whenever they return to the site. Amazon is a company that pioneered this approach early on to remember previous orders and searches. It now uses this browsing history to suggest tempting offers of similar goods that may be of interest.

To provide this sort of functionality cookies were invented. They are small pieces of information that are stored locally on the user's PC and presented back to the server when required. If they are acting as authentication tokens they need to have a short lifetime so that they apply only to the current user session, so a cookie can specify an expiry time. There is some concern that cookies represent a breach of privacy as they store information about a user without the user's knowledge.

Most browsers allow some control over cookies. You can usually view and delete them as well as set the browser to refuse to store cookies. However, refusing to use cookies will mean that many web sites will refuse to allow access to some pages.

The cookie is issued by the server and sent as a field (Set-Cookie) in the HTTP response header. The browser will then store the cookie locally, and whenever an HTTP request is made for a page on that web server the cookie will be sent back in a Cookie request field. The server can then work out that a set of requests was sent by the same user.
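The exchange can be sketched with Python's standard http.cookies module. The cookie name, value and lifetime used here are purely illustrative, and the two helper functions are invented for this example.

```python
from http.cookies import SimpleCookie


def make_set_cookie_header(name, value, max_age_seconds):
    """Build the value of a Set-Cookie response field issued by a server."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    cookie[name]["max-age"] = max_age_seconds  # lifetime in seconds
    return cookie[name].OutputString()


def read_cookie_header(header_value):
    """Parse the Cookie request field a browser sends back to the server."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {key: morsel.value for key, morsel in cookie.items()}
```

A server would call make_set_cookie_header("session", "abc123", 1800) when the user first arrives, then use read_cookie_header on later requests to recognise the same session value.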

See also : General information about cookies
Netscape introduction to cookies

CSS

Consistent style for web pages

Making the look and feel of a web site consistent is a key web site design aim. In the past this required fairly tedious editing of the HTML for each individual web page, putting detailed FONT attribute selection around all the displayed text. Style sheets provide a way of specifying how the text in every page should look (font, colour, size, margin etc.) in a single external file. Each individual page can then reference the same style sheet file. This not only provides a common look but can often reduce the size of pages (because formatting information is specified in only one place), decreasing the download time for a page and enhancing the apparent speed of a web site.

It's also true that CSS (Cascading Style Sheets) offers far more formatting options than the original HTML 4.0 specification allowed. One frequent query about CSS is why the term Cascading is used. It's at the heart of the way styles are defined and allows common elements to be inherited. So if you have a fancy looking table format and then decide to create a new style that's only slightly different, you can define it by just stating the differences between the new style and the original. The rest of the definition cascades from one style to another, which is a good example of code reuse.

For further details on CSS please visit : HTML Code Tutorial, or the formal specification : Cascading Style Sheets Specification.

DNS

Looking up a site's name

The Domain Name System (DNS) gets you to the Internet information you want. It is the largest dynamic data store in the world and is central to the functioning of the Internet. It maps domain names typed in as text strings into numeric IP addresses. The browser then uses the numeric IP address to fetch information from the web site. This not only allows web sites to be given meaningful names but also lets the web site containing the information move to a new web server without the user needing to be aware of this. This makes it operate very much like a huge online telephone directory: we use it to get a phone number even when a business or person has moved house. As with the analogous printed directory, the entries have to be constantly updated to keep track of changes, and once a year is not often enough.

To keep up-to-date, all the servers on the Internet need to get updates for their local copies of DNS information. This is why it can take up to two days for a new domain to become accessible. Not only must the server hosting the new site know its address, but the name servers used along the access path from the server to requesting computers need the address too. When a web site is in the process of relocating, some people will get to the old address and others to the new one. This overlap period is unavoidable when relocating a domain, so you should keep both old and new versions running for a while.

In order to keep traffic to a tolerable level, each server caches recently requested domain names and their IP addresses locally to save requesting the same information again. The traffic relating to name lookup has been estimated at around 5% of total Internet traffic.

The traditional domain name is just a two part entity. DNS is organized hierarchically into zones so that only DNS servers under a particular path, e.g. .org, need to concern themselves about names within that domain (in this case anything ending in .org). The root node of the whole Internet '.' only needs to know the IP addresses of the servers for the top level names (.net, .org, .com, .us, .biz, .name etc.). To add reliability and scalability, several servers can be used to maintain the information for a single zone. The overall authority for domain names is ICANN (the Internet Corporation for Assigned Names and Numbers).

Domain names are not restricted to a fixed number of parts. An organization can choose to divide its domain into sub-zones. For example http://mech.eng.cam.ac.uk might represent a mechanical engineering zone within Cambridge University's main site (cam.ac.uk). This also shows the use of top level country codes, which have their own internal domain name structure. In this case ac indicates academic institutions within the UK zone.
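The hierarchy of zones can be made concrete with a small sketch that lists every level of a name, starting from the root node '.'. The function name zone_chain is made up for this illustration, and whether each level is really run as a separate zone depends on how the organization has delegated it.

```python
def zone_chain(name):
    """List each level of a domain name from the root node downwards."""
    # Drop any trailing root dot and ignore empty labels.
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]  # the root node of the whole tree
    for start in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[start:]) + ".")
    return chain
```

For mech.eng.cam.ac.uk this gives the root, then uk., ac.uk., cam.ac.uk., eng.cam.ac.uk. and finally the full name, which is the order in which a lookup can be referred down from one set of name servers to the next.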

In order for a browser running on a PC to look up domain names it needs a default name server. This is usually provided by the ISP.

On Windows the nslookup command line utility lets you investigate in detail any name server lookup problems. If the DNS records are incorrect you can use this tool to trace which name servers have been affected. If you have submitted a DNS record change you can use it to check how far the change of IP address has propagated.

See also : RFC1034 : Domain Names - Concepts and Facilities
RFC1035 : Domain Names - Implementation and Specification
Domain Name Registrars

NOTE : Strictly speaking all domain names should end in '.' (e.g. www.sacu.org.) as this final '.' indicates the root node of the whole domain. All browsers automatically assume the extra trailing '.', but in theory the domain name could be interpreted as a relative name. This is similar to UNIX file paths where a leading '/' is used to indicate that the path is an absolute path from the root node rather than a relative path.

Site Vigil has an option to report the IP address for any page or address that you monitor. You can also request Site Vigil to inform you if the IP address for a domain changes.
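Detecting a change of IP address amounts to comparing the addresses returned now with those seen on the previous check. The sketch below shows the idea only; it is not how Site Vigil is implemented, and the names resolve_ipv4 and ip_change are hypothetical. The resolver is passed in as a parameter so the comparison can be tried without network access.

```python
import socket


def resolve_ipv4(host):
    """Return the set of IPv4 addresses for a host (needs network access)."""
    _name, _aliases, addresses = socket.gethostbyname_ex(host)
    return set(addresses)


def ip_change(host, previous, resolver=resolve_ipv4):
    """Compare current addresses with the previous set.

    Returns (current, added, removed); added and removed are sorted lists
    and are both empty when nothing has changed.
    """
    current = set(resolver(host))
    added = sorted(current - set(previous))
    removed = sorted(set(previous) - current)
    return current, added, removed
```

Sites served from several addresses may legitimately return different answers from one lookup to the next, so a single difference is a prompt to investigate rather than proof of a move.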

Domains

Giving a Name to a Site

The Internet would function quite happily without domain names. They are simply a way of giving meaningful names to computer addresses. The mapping of a name to an IP address is performed by the largest distributed database in the world : the Domain Name System (DNS).

A domain is a fairly technical, nebulous term that is used about the Internet. It usually refers to the name used in DNS to refer to a web site but it has a more general and vague usage.

The domain name part of a URL is typically in three parts as a hierarchy : <host name>.<domain>.<top level domain>.

So, for example, www.silurian.org identifies www (for World Wide Web) as the host name or sub-domain name, silurian as the domain and org as the top level domain. Normally the 'domain name' refers to silurian.org, but care has to be taken as to whether the sub-domain name is required too.
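The three part split can be written out as a short sketch. The function name split_domain_name is invented for this example, and it assumes the simple layout described above; names under country codes with their own internal structure (such as cam.ac.uk) would need extra rules.

```python
def split_domain_name(name):
    """Split a name into (host or sub-domain, domain, top level domain)."""
    labels = name.rstrip(".").lower().split(".")
    if len(labels) < 2 or "" in labels:
        raise ValueError("expected at least <domain>.<top level domain>")
    top_level = labels[-1]
    domain = labels[-2]
    # Anything left of the domain is treated as the host or sub-domain part.
    host = ".".join(labels[:-2])
    return host, domain, top_level
```

Passing www.silurian.org gives ('www', 'silurian', 'org'), while silurian.org gives an empty host part.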

Most web sites use www as the subdomain to make it clear that this is the area providing Internet content (usually HTML).

There is typically also a mail subdomain to deal with Internet mail (so that an email address like sales@silurian.org gets to the intended recipient). The DNS lookup mechanism allows different sub-domains to go to different servers. If you do not specify a sub-domain name then most servers will interpret this as the www sub-domain.

Sub-domains are a frequent cause of confusion. They are now often misused to provide different people with hosting under the same domain name. A web host can divert different subdomains, say 'clothes' and 'paper', to completely different web pages under a generic domain name like 'mallshopping.com'. The domain owner can then allow their clients access to 'clothes.mallshopping.com' or 'paper.mallshopping.com' with no additional domain registration costs. Similarly, subdomains are often used to distinguish different affiliates or spammers who promote a web site. The web logs will show which subdomain was used to reach the site and their referral fee can then be calculated.

When you purchase a domain from a domain registrar you buy just the name. You must then find a web hosting company to be the home for the web site. The web host will provide you with the IP addresses of their DNS servers when you purchase web hosting space. You can then update the domain name record to refer to these servers (both primary and secondary). The secondary name server is a backup system for use when the primary server is not working or not accessible. Throughout the day, servers keep exchanging updated DNS lookup information, so eventually all connected DNS servers will discover the server that knows where the domain name is hosted. Whenever anyone then asks for a web page on that site they can respond with the appropriate IP address of the server. However, if you know the IP address for a web site you can always type this in directly.

For a list of all the Top Level Domains (TLDs) and links to the main registrars you can visit Domain Name Registrars.

When using Site Vigil to monitor a web site's log files you need to specify the domain name for the site that is being monitored. You can use it to monitor individual subdomains if the log files are kept separate.

Domain Registrars

All too often people are mystified by the distinction between a web hosting company that hosts the domain and a registrar that looks after the allocation of domain names. It is now quite common for the same company to do both jobs.

Originally there was only one organization responsible for allocating domains (InterNIC : Internet Network Information Center) and, believe it or not, domains were originally free. In 1995 registration became a commercial service run by Network Solutions.

A domain registrar just deals with the ownership of the domain name. It holds personal contact information about the owner and the all important Name Server entries that define the master DNS servers that map the name to the IP address of the web server hosting the domain.

The owning of a domain is a separate issue to running a web site. Several domain names can be configured to refer to the same web site. It is possible for an individual to directly manage the domain zone file that determines where a site is hosted. Most web hosting companies do not want their customers to manage this information themselves as it could mean they could move the domain to another host with payments still outstanding. If they hold the domain record (as the technical or administrative contact) they can control any updates and its renewal.

Errors

Decoding Server Error Reports

The Internet HTTP/1.1 protocol specifies a whole set of status codes that are recorded by most web servers. The codes are grouped into decimal ranges.

The status codes are returned to the client making the request (typically an Internet Browser) and also recorded in the server's log file.

100 to 199 : Informational status codes, rarely used
200 to 299 : Successful, only 200 frequently used
300 to 399 : Redirection, the request will normally have been completed OK
400 to 499 : Client Error, the request was invalid in some way
500 to 599 : Server Error, the server could not fulfil the (valid) request
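Because the class of a status code is given by its first digit, grouping the codes takes only a few lines. The sketch below uses the class names from RFC 2616, which calls the 300 range Redirection; the function name status_class is made up for this example.

```python
# Class names for each leading digit, following RFC 2616.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Return the class name for an HTTP status code in the range 100 to 599."""
    if not 100 <= code <= 599:
        raise ValueError("not an HTTP status code: %r" % (code,))
    return STATUS_CLASSES[code // 100]
```

Grouping by leading digit also copes with any code a server returns that is not in the list, since an unknown code is treated like the general code for its range.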

Of all these possible status codes you are only ever likely to see a handful when analyzing a server log file.

100 Continue : The request is progressing normally
200 OK : The request was successful. Hopefully the most frequent status code
201 Created : A new resource has been created successfully on the server
202 Accepted : Request accepted but not completed yet, it will continue asynchronously
203 Non-Authoritative Information : The request completed successfully but the information returned comes from a local or third-party copy rather than the original server
204 No Content : The request completed successfully but there is no content to return (the response body is empty)
205 Reset Content : The request completed successfully and the client should reset the document view (for example clear a form) that caused the request to be sent
206 Partial Content : The server has returned only the part of the resource asked for in a range request. Some download accelerator programs produce this code as they submit multiple requests to download different parts of a file at the same time
300 Multiple Choices : The request is ambiguous and needs clarification as to which resource was requested
301 Moved Permanently : The resource has permanently moved elsewhere, the response indicates where it has gone to
302 Found : The resource has temporarily moved elsewhere, the response indicates where it is at present
303 See Other : The response to the request can be found at a different URL, which the client should fetch instead
304 Not Modified : The server has identified from the request information that the client's copy of the information is up-to-date and so the requested information does not need to be sent again. This can save considerable server resources. A browser can then use the cached copy it has kept locally
305 Use Proxy : The request must be sent through the indicated proxy server
307 Temporary Redirect : The resource has temporarily moved elsewhere, the response indicates where it is at present. The client should continue to use the original URL for future requests
400 Bad Request : The request was malformed and could not be understood by the server
401 Unauthorized : The client does not have access to this resource, authorization (user and password) is needed
402 Payment Required : Reserved for future use
403 Forbidden : Access to a resource is not allowed. The most frequent case of this occurs when a user requests a directory name rather than a web page and directory listing access is not allowed
404 Not Found : The requested resource was not found. This is the code returned for missing pages or graphics. Viruses will often attempt to access resources that do not exist, so the error does not necessarily represent a problem
405 Method Not Allowed : The HTTP access method (GET, POST, HEAD) is not allowed on this resource
406 Not Acceptable : None of the acceptable file types (as requested by client) are available for this resource
407 Proxy Authentication Required : The client does not have access to this resource, proxy authorization is needed
408 Request Timeout : The client did not send a request within the required time period
409 Conflict : The request could not be completed because it conflicts with the current state of the resource
410 Gone : The requested resource used to be on the server but is no longer available. Any robot seeing this response should delete the reference from its information store
411 Length Required : The request requires the Content-Length HTTP request field to be specified
412 Precondition Failed : The request's HTTP header specified conditions that can not be met
413 Request Entity Too Large : The body of the request is larger than the server is willing to process
414 Request-URI Too Long : The URL is too long (possibly too many query keyword/value pairs)
415 Unsupported Media Type : The body of the request is in a format that the server does not support for this resource
416 Requested Range Not Satisfiable : The portion of the resource requested is not available or out of range. This can occur when a request for a file has been split into multiple parts
417 Expectation Failed : The Expect specifier in the HTTP request header can not be met
500 Internal Server Error : The server had some sort of internal error trying to fulfil the request. The client may see a partial page or error message. It's a fault in the server and happens all too frequently
501 Not Implemented : The request needs functionality not available on the server
502 Bad Gateway : The server, acting as a gateway or proxy, received an invalid response from the server it passed the request on to
503 Service Unavailable : Service temporarily unavailable, typically because it is currently overloaded
504 Gateway Timeout : The server did not respond back to the gateway within an acceptable time period
505 HTTP Version Not Supported : The request uses a version of HTTP that is not supported

For the full set of status codes please refer to RFC2616 for the authoritative explanation.

For Site Vigil to be able to analyse errors it needs access to the web server's log files. Site Vigil can use FTP, SecureFTP or direct access to achieve this. It must also understand the log file format used by the server. Various W3C pages are available describing standard server log formats. Site Vigil then produces a categorised error report of these events.
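To give a feel for what a categorised error report involves, the sketch below counts error responses in log lines written in the widely used Common Log Format. It is an illustration only and does not describe Site Vigil's own parser; the pattern LOG_LINE and the function error_report are names invented for this example, and other log formats would need a different pattern.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def error_report(lines):
    """Count (status, path) pairs for every 4xx and 5xx response."""
    report = {"Client Error": Counter(), "Server Error": Counter()}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        status = int(match.group("status"))
        if status < 400:
            continue  # not an error
        parts = match.group("request").split()
        path = parts[1] if len(parts) >= 2 else "-"
        category = "Client Error" if status < 500 else "Server Error"
        report[category][(status, path)] += 1
    return report
```

Feeding it the lines of a log file gives, for each category, a count of how often each missing page or failing script was requested, which is usually enough to spot broken links or a misbehaving server.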