How it Works
Glossary A to E
Surfing the Internet
A browser is a special program that displays web page information. The web pages can be located on a local computer or, more usually, on the Internet. The browser uses standard Internet protocols like HTTP to access information and display it. HTML pages are still the most common way of presenting information to a user.
To access information, a URL is typed in to uniquely identify the information requested. The text entered is analyzed to extract the host name, and DNS is then used to look up the IP address where the information can be accessed.
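As a rough illustration of those two steps, the sketch below splits a URL into its parts and then asks the system resolver for the host's addresses. The function names, the default-port table and the example.org address are assumptions made for this example; a real browser does considerably more.

```python
import socket
from urllib.parse import urlsplit

def split_url(url):
    """Break a URL into the parts needed before anything can be fetched."""
    parts = urlsplit(url)
    # Default ports are assumed for the two common schemes only.
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or default_port,
        "path": parts.path or "/",
    }

def lookup_addresses(host, port):
    """Ask the system resolver (which uses DNS) for the host's IP addresses."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})

target = split_url("http://www.example.org/index.html")
# lookup_addresses(target["host"], target["port"]) needs a working network connection.
```

Only the parsing step runs offline; the lookup is left as a function so that it can be called when a network connection is available.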
For information on the most widely used browsers see statistics such as those accumulated at TheCounter.com ➚
Getting information more quickly
Computer systems have used the caching technique since the very earliest days to speed up access to information. This is just a matter of keeping a copy of data that may be needed in the near future locally rather than requiring slower access back to the original source. When browsing the Internet there are at least two levels of caching carried out. Firstly, the browser will keep copies of pages and graphics in a local cache on the PC on the basis that many pages within a web site will share the same set of graphics. Secondly all the computers on the route between the client and the server have an opportunity to cache information in the hope that it will be requested again, from a wide range of different users.
Caching is good news for improving perceived access time but bad news if you want to be sure you are seeing up-to-date information from the server. If a page is dynamically updating you may want to see the server's current copy, not a cached out-of-date copy. The HTTP protocol defines a number of controls over how information should be cached (if at all). Rather than asking for the same information again, a browser can indicate that it is only interested in the information if it has changed since a particular time. If it has not changed, the server just returns a 304 Not Modified status code with no page content; strictly speaking this is a status code rather than an error.
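The "only if changed" exchange can be sketched from the server's side as follows. This is a simplified model, not a real web server: the respond function and the dates are invented for the example, and only the If-Modified-Since style of check is shown.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified`."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        cached_at = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so ignore microseconds.
        if last_modified.replace(microsecond=0) <= cached_at:
            return 304, headers   # Not Modified: no page content is sent
    return 200, headers           # OK: the full page follows

changed = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
status_first, hdrs = respond(changed)                         # first visit
status_repeat, _ = respond(changed, hdrs["Last-Modified"])    # cached copy is current
status_stale, _ = respond(changed, "Mon, 01 Jan 2024 00:00:00 GMT")  # cached copy is old
```

The first and third requests get the full page (200), while the second gets 304 because the browser's copy is as new as the server's.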
In order to find a reasonable measure of access time, the Site Vigil program specifies that the information it requests should not be cached. The access time will then more accurately reflect how long it takes for the information to travel from the original web server to the PC running Site Vigil.
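One common way to ask for an uncached copy is to send the standard Cache-Control and Pragma request headers. The sketch below shows that general technique with Python's standard library; it is not a description of how Site Vigil itself is implemented, and the function names and URL are assumptions for the example.

```python
import time
import urllib.request

def build_uncached_request(url):
    """Build a GET request asking caches on the route not to serve a stored copy."""
    return urllib.request.Request(url, headers={
        "Cache-Control": "no-cache",  # HTTP/1.1 directive
        "Pragma": "no-cache",         # older HTTP/1.0 equivalent
    })

def timed_fetch(url, timeout=10.0):
    """Return (status, seconds) for one uncached fetch; needs network access."""
    request = build_uncached_request(url)
    started = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()  # include the time taken to download the content
        status = response.status
    return status, time.perf_counter() - started

request = build_uncached_request("http://www.example.org/")
sent_headers = {name.lower(): value for name, value in request.header_items()}
```

Calling timed_fetch on a real address gives a single timing sample; several samples would be needed for a reasonable average.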
Tracking Internet Usage
The HTTP protocol is designed to work in a single request-response fashion, so a web server treats each request that it receives for a web site page as if it came from a different user. It does not maintain a 'session'. This is not ideal in situations where a user has entered some information and should not need to enter the same information again and again. This typically occurs when ordering goods: if you go back or refresh a page you wouldn't expect to have to enter the same data twice. Similarly, if a user has expressed a preference for particular options or an area of a web site, they might expect these settings to be restored whenever they return to the web site. Amazon ➚ is a company that pioneered this approach early on to remember previous orders and searches. It now uses this previous browsing history to suggest tempting offers of similar goods that may be of interest.
To provide this sort of functionality cookies were invented. They are small pieces of information that are stored locally on the user's PC and presented back to the server when required. If they are acting as authentication tokens they need to have a short lifetime so that they apply for only the current user session, so a cookie can specify an expiry time; one without an expiry time lasts only until the browser is closed. There is some concern that cookies represent a breach of privacy as they store information about a user without the user's knowledge.
The cookie is issued by the server and sent back as a field in the HTTP response header. The browser will then store the cookie locally and whenever an HTTP request is made for a page on that web server the cookie will be sent back. The server can then work out that a set of requests were sent by the same user.
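The round trip can be sketched with Python's standard http.cookies module. The cookie name "session", the 30 minute lifetime and the helper function names are assumptions chosen for the example, not part of any particular web site.

```python
from http.cookies import SimpleCookie

def issue_cookie(session_id, max_age_seconds=1800):
    """Server side: build the value of a Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["path"] = "/"
    cookie["session"]["max-age"] = max_age_seconds  # short lifetime for a login token
    return cookie["session"].OutputString()

def read_cookie(cookie_header):
    """Server side: recover the session id from a Cookie request header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie["session"].value if "session" in cookie else None

set_cookie_value = issue_cookie("abc123")
returning_user = read_cookie("session=abc123; theme=dark")  # browser sent the cookie back
new_user = read_cookie("theme=dark")                        # no session cookie yet
```

A request that carries the same session value as an earlier one can be treated as coming from the same user; a request without it is treated as a new visitor.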
Consistent style for web pages
Making the look and feel of a web site consistent is a key web site design aim. In the past this required fairly tedious editing of the HTML for each individual web page, putting detailed FONT attribute selection around all the displayed text. Style sheets provide a way of specifying how the text in every page should look (font, colour, size, margin etc.) in a single external file. Each individual page can then reference the same style sheet file. This not only provides a common look but can often reduce the size of pages (because formatting information is specified in only one place), decreasing the download time for a page and enhancing the apparent speed of a web site.
It's also true that CSS (Cascading Style Sheets) offer far more formatting options than the original HTML 4.0 specification allowed. One frequent query about CSS is why the term Cascading is used. Well, it's at the heart of the way styles are defined and allows common elements to be inherited. So if you have a fancy looking table format and then decide to create a new style that's only slightly different you can define it by just stating the differences of the new from the original - the rest of the definition cascades from one style to another - that's a good example of code reuse.
Looking up a site's name
The Domain Name System (DNS) gets you to the Internet information you want. It is the largest dynamic data store in the world and is central to the functioning of the Internet. It maps domain names typed in as text strings into numeric IP addresses; in effect it is a huge online telephone directory. The browser then uses the numeric IP address to fetch information from the web site. This not only allows web sites to be given meaningful names but also lets the web site containing the information move to a new web server without the user needing to be aware of this. In the same way, we use a telephone directory to get a phone number even when a business or person has moved. As with the analogous printed directory, the entries have to be constantly updated to keep track of changes, and once a year is not often enough.
To keep up-to-date, the DNS servers on the Internet need to get updates for their local copies of DNS information. This is why it can take up to two days for a new domain to become accessible. Not only must the server hosting the new site know its address, but so must all the servers on the access path between that server and the requesting computers. When a web site is in the process of relocating, some people will get to the old address and others to the new one. Because this overlap period is unavoidable, you should keep both old and new versions running for a while when relocating a domain.
In order to keep traffic to a tolerable level, each name server along the way caches recently requested domain names and their IP addresses locally, which saves requesting the same information again. The traffic relating to name lookup has been running at 5% of total Internet traffic.
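The caching idea can be sketched as a small class that remembers answers for a limited time. This is an illustration only: the class name, the fixed 300 second lifetime and the stand-in resolver are assumptions, whereas in real DNS the lifetime is set per record by whoever publishes it.

```python
import time

class NameCache:
    """Remember name-to-address answers until their time-to-live runs out."""

    def __init__(self, resolver, ttl_seconds=300, clock=time.monotonic):
        self._resolver = resolver      # function: name -> IP address string
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # name -> (address, expiry time)
        self.upstream_lookups = 0

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached and cached[1] > now:
            return cached[0]           # still fresh: no network traffic
        address = self._resolver(name)
        self.upstream_lookups += 1
        self._entries[name] = (address, now + self._ttl)
        return address

# Stand-in resolver and clock so the example runs without a network.
fake_time = [0.0]
cache = NameCache(lambda name: "192.0.2.10", ttl_seconds=300,
                  clock=lambda: fake_time[0])
cache.lookup("www.example.org")   # not cached yet, so the resolver is asked
cache.lookup("www.example.org")   # served from the cache
fake_time[0] = 301.0
cache.lookup("www.example.org")   # entry expired, so the resolver is asked again
```

Three lookups cause only two upstream requests here. The same expiry behaviour is why a changed address takes time to be seen everywhere: cached copies stay in use until they time out.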
The traditional domain name is just a two part entity. DNS is organized hierarchically into zones so that only DNS servers under a particular path, e.g. .org, need to concern themselves with names within that domain - in this case anything ending in .org. The root node of the whole Internet '.' only needs to know the IP addresses of the servers for the top level names (.net, .org, .com, .us, .biz, .name etc.). To add reliability and scalability, several servers can be used to maintain the information for a single zone. The overall authority for domain names is ICANN ➚ (the Internet Corporation for Assigned Names and Numbers).
Domain names are not restricted to a fixed number of parts: an organization can choose to divide its domain into sub-zones. For example http://mech.eng.cam.ac.uk might represent a mechanical engineering zone within Cambridge University's main site (cam.ac.uk). This also shows the use of top level country codes, which have their own internal domain name structure; in this case ac indicates academic institutions within the UK zone.
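The hierarchy can be made visible by listing the zones that a name passes through, starting from the top level. The zone_chain function below is a hypothetical helper written for this illustration; it works purely on the text of the name and does not contact any DNS server.

```python
def zone_chain(name):
    """List the zones a name belongs to, from the top level down."""
    # A trailing '.' (the root) is dropped before splitting into labels.
    labels = name.rstrip(".").lower().split(".")
    chain = []
    for start in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[start:]))
    return chain

chain = zone_chain("mech.eng.cam.ac.uk")
# chain is ['uk', 'ac.uk', 'cam.ac.uk', 'eng.cam.ac.uk', 'mech.eng.cam.ac.uk']
```

Each entry in the list is a point at which responsibility for names can be handed to a different set of name servers.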
On Windows the nslookup command line utility lets you investigate in detail any name server lookup problems. If the DNS records are incorrect you can use this tool to trace which name servers have been affected and if you have submitted a DNS record change you can use it to check the progress of propagation of the change of IP address.
NOTE : Strictly speaking all domain names should end in '.' (e.g. www.sacu.org.) as this final '.' indicates the root node of the whole domain. All browsers automatically assume the extra trailing '.', but in theory the domain name could be interpreted as a relative name. This is similar to UNIX file paths where a leading '/' is used to indicate that the path is an absolute path from the root node rather than a relative path.
Site Vigil has an option to report the IP address for any page or address that you monitor. You can also ask Site Vigil to inform you if the IP address for a domain changes.
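Detecting a change of address amounts to comparing the current lookup result with the last known one. The sketch below shows the idea in general terms and is not taken from Site Vigil; the function names, the report layout and the example addresses are all assumptions.

```python
import socket

def resolve(name):
    """Return the set of IPv4 addresses the system resolver gives for a name."""
    _, _, addresses = socket.gethostbyname_ex(name)
    return set(addresses)

def check_for_change(name, known, resolver=resolve):
    """Compare current addresses with the last known set and report differences."""
    current = resolver(name)
    return {
        "changed": current != known,
        "added": sorted(current - known),
        "removed": sorted(known - current),
        "current": current,
    }

# Stand-in resolver so the example runs offline; 192.0.2.x and 198.51.100.x
# are address ranges reserved for documentation.
report = check_for_change("www.example.org", {"192.0.2.10"},
                          resolver=lambda name: {"198.51.100.7"})
```

Leaving out the resolver argument makes the check use a real lookup, which can also be repeated after a DNS record change to follow its propagation.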
Giving a Name to a Site
The Internet would function quite happily without domain names. They are simply a way of giving meaningful names to computer addresses. The mapping of a name to an IP address is performed by the largest distributed database in the world : the Domain Name System (DNS).
A domain is a fairly technical, nebulous term that is used about the Internet. It usually refers to the name used in DNS to refer to a web site but it has a more general and vague usage.
The domain name part of a URL is typically in three parts as a hierarchy : <host name>.<domain>.<top level domain>
So, for example, www.silurian.org identifies www (for World Wide Web) as the host name or sub-domain name, silurian as the domain and org as the top level domain. Normally silurian.org refers to the 'domain name' but care has to be taken as to whether the sub-domain name is required too.
Most web sites use www as the subdomain to make it clear that this is the area providing Internet content (usually HTML).
There is typically also a mail subdomain to deal with Internet mail (so that an email address like firstname.lastname@example.org gets to the intended recipient). The DNS lookup mechanism allows different sub-domains to go to different servers. If you do not specify a sub-domain name then most servers will interpret this as the www sub-domain.
Sub-domains are a frequent cause of confusion. They are now often mis-used to provide different people with hosting under the same domain name. A web host can divert different subdomains say 'clothes' and 'paper' to completely different web pages under a generic domain name like 'mallshopping.com' so the domain owner can allow their clients access to 'clothes.mallshopping.com' or 'paper.mallshopping.com' with no additional domain registration costs. Similarly subdomains are often used to distinguish different affiliates or spammers who promote a web site. The web logs will show which subdomain was used to reach the site and their referral fee can then be calculated.
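Working out which sub-domain was used for each logged request can be sketched as follows. The helper names and the sample host list are invented for the example, and it assumes the log records the requested host name, which depends on how the server is configured.

```python
from collections import Counter

def subdomain_of(host, base_domain):
    """Return the part of `host` in front of `base_domain`, or None if unrelated."""
    host = host.lower().rstrip(".")
    base = base_domain.lower()
    if host == base:
        return ""                      # bare domain, no sub-domain given
    if host.endswith("." + base):
        return host[: -len(base) - 1]  # strip '.<base_domain>' from the end
    return None

def hits_per_subdomain(hosts, base_domain):
    """Count requests per sub-domain from the host names recorded in a log."""
    counts = Counter()
    for host in hosts:
        sub = subdomain_of(host, base_domain)
        if sub is not None:
            counts[sub] += 1
    return counts

logged_hosts = ["clothes.mallshopping.com", "paper.mallshopping.com",
                "clothes.mallshopping.com", "www.mallshopping.com",
                "othersite.example"]
counts = hits_per_subdomain(logged_hosts, "mallshopping.com")
```

Here 'clothes' is counted twice and 'paper' and 'www' once each, while the unrelated host is ignored. Totals like these are the basis on which a referral fee could be calculated.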
When you purchase a domain from a domain registrar you buy just the name; you must then find a web hosting company to be the home for the web site. The web host will provide you with the IP addresses of their DNS servers when you purchase web hosting space. You can then update the domain name record to refer to these servers (both primary and secondary). [The secondary name server is a backup system for use when the primary server is not working or not accessible]. Throughout the day, servers keep exchanging the updated DNS lookup information and so eventually all connected DNS servers will discover the server that knows where the domain name is hosted. Whenever anyone then asks for a web page on that site, the DNS servers can respond with the appropriate IP address of the hosting server. However, if you know the IP address for a web site you can always type this in directly.
For a list of all the Top Level Domains (TLDs) and links to the main registrar you can visit Domain Name Registrars ➚
When using Site Vigil to monitor a web site's log files you need to specify the domain name for the site that is being monitored. You can use it to monitor individual subdomains if the log files are kept separate.
All too often people are mystified by the distinction between a web hosting company, which hosts the web site, and a registrar, which looks after the allocation of domain names. It is now quite common for the same company to do both jobs.
Originally there was only one organization responsible for allocating domains (InterNIC : Internet Network Information Center ➚) and, believe it or not, domain names were originally free. In 1995 registration became a commercial, paid-for service operated by Network Solutions ➚.
A domain registrar just deals with the ownership of the domain name. It holds contact information about the owner and the all-important name server entries. These identify the master DNS servers, which in turn map the name to the IP address of the web server hosting the site.
The owning of a domain is a separate issue to running a web site. Several domain names can be configured to refer to the same web site. It is possible for an individual to directly manage the domain zone file that determines where a site is hosted. Most web hosting companies do not want their customers to manage this information themselves as it could mean they could move the domain to another host with payments still outstanding. If they hold the domain record (as the technical or administrative contact) they can control any updates and its renewal.
Decoding Server Error Reports
Of all these possible errors you are only ever likely to see a handful when analyzing a server log file; these are shown in bold.
For the full set of error codes please refer to RFC2616 ➚ for the authoritative explanation.
For Site Vigil to be able to analyse errors it needs access to the web server's log files. Site Vigil can use FTP, SecureFTP or direct access to achieve this.
It also must understand the log file format used by the server.
Various W3ORG ➚ pages are available describing standard server log formats.
Site Vigil then produces a categorised error report of these events.
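To give a feel for what such an analysis involves, the sketch below reads lines in the widely used Common Log Format and counts the error responses by category. It is an illustration only, not Site Vigil's implementation; the regular expression, the category names and the sample log lines are assumptions made for the example.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def categorise(status):
    """Group a status code by its first digit: 4xx client error, 5xx server error."""
    return {"4": "client error", "5": "server error"}.get(str(status)[0], "no error")

def error_report(lines):
    """Count client and server errors by status code; skip unparseable lines."""
    report = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        status = int(match.group("status"))
        category = categorise(status)
        if category != "no error":
            report[(category, status)] += 1
    return report

sample = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '192.0.2.2 - - [10/Oct/2000:13:56:09 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '192.0.2.3 - - [10/Oct/2000:13:57:12 -0700] "GET /cgi-bin/form HTTP/1.0" 500 -',
]
report = error_report(sample)
```

For the sample lines the report holds two 404 client errors and one 500 server error; the successful request is left out. Other log formats would need a different pattern.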