Glossary S to Z
Scripts Sockets Search Engine SecureFTP Site Availability Speed Spider TCP/IP Traffic Analysis URL Virus Watch Web hosting Who is
HTML Script Files
Socket based Communication
Making the Internet Connection
Socket based communication is the principal mechanism underlying the whole Internet. Sockets can also be used to communicate between processes on the same computer; the socket interface was first developed on Berkeley (BSD) UNIX systems in the early 1980s. Within TCP/IP, the two communicating programs (server and client) each allocate a socket, and the connection is then initiated by a connect request from the client program. The server continually listens for connect requests and chooses whether to accept a connection from each client program.
This client-server model suits the Internet, as there are many clients making connection requests for information from one place (the server). A web server does not normally need to initiate communication with a client. However, HTTP was designed as a one-off request-response protocol, so a connection is only held open momentarily; this is not an efficient use of resources if a client is going to request a whole set of information from the same place. HTTP therefore has a Keep-Alive setting (the Connection: Keep-Alive request header) that asks for the socket connection to be kept open because further requests are expected.
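The listen, connect and accept sequence described above can be sketched in a few lines of Python. This is only an illustration running over the loopback interface; the function names serve_once and request_reply, and the upper-casing "service", are invented for the example and are not part of any product.

```python
import socket
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Accept a single client connection and send its data back upper-cased."""
    conn, _addr = server_sock.accept()  # blocks until a client connects
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def request_reply(message: bytes) -> bytes:
    """Run one connect/send/receive exchange between a client and a server."""
    # Server side: allocate a socket, bind it and start listening.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]

        worker = threading.Thread(target=serve_once, args=(server_sock,))
        worker.start()

        # Client side: allocate a socket and initiate the connection.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client_sock:
            client_sock.sendall(message)
            reply = client_sock.recv(1024)

        worker.join()
        return reply


if __name__ == "__main__":
    print(request_reply(b"hello"))
```

A real server would loop around accept() and handle many clients; the single exchange here is just enough to show which side listens and which side connects.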
Position on Search Engines
A search engine continuously monitors the Internet to build up massive databases that categorise the content of web sites according to keywords. Because a good position in the search engine results is such an important way of bringing visitors to a web site, a whole industry, Search Engine Optimization, has been created to help web sites achieve a higher placement in the result list. Originally a search engine used the META keywords tag in the HTML page header to determine the keywords for a page. This was soon abused by people who put in common search phrases and words just to fool the search engine into thinking that pages had relevant content. Now search engines tend to disregard individual elements in a page and look for consistent usage of keywords in the text, headers, title and tags in a page. They discriminate against sites that look as though they are trying to subvert the process by over-using keywords.
From the end user's point of view they want to see the sites that they want to visit at the top of the list. If a search engine shows sites of little relevance then the end user will go to another search engine that does give them more appropriate results. It is not in a web site owner's interest to increase traffic that is not appropriate to the web site as users will just feel frustrated and go elsewhere.
For much more on search engines please refer to our companion search engine reference pages.
Secure File Transfer
Transferring files securely over the Internet
FTP was designed way back in the early 1970s before malevolent software was created. FTP has a number of security weaknesses, especially the fact that the FTP connection command sends the user/account name and password in plain text over the Internet, and so it can be easily intercepted.
SecureFTP uses state of the art encryption to make sure that neither the commands nor the data can be so easily eavesdropped. It is based around the open source Secure Shell (ssh).
For more information see Secure FTP transfers via Secure Shell Tunneling
You can tell Site Vigil to access a web server's log files using normal FTP or SecureFTP.
The Internet is a lot more stable than it used to be. Five years ago it was quite common to find web sites unavailable for one reason or another. Sometimes the web site server itself was offline; more frequently it was a failure in the network communication infrastructure. Better reliability has been achieved by connecting web hosts to several network service providers, so if one route to the site goes down then an alternative route is always available. Nowadays a web site should be available to all users over 99% of the time. The most frequent reason for a web server to appear to go down nowadays is a denial of service attack. In this case a web site (usually co-located on the same server) is flooded by requests, often from multiple places at once (a distributed denial of service or DDoS attack). These may be launched by someone aiming to maliciously harm a web site and bring it down.
If a web host is running a flaky server or has an unreliable connection to the Internet then this will show up in measures of the availability of the web site. The worst scenario is that your site is down just when a search engine is scanning it for inclusion in its search database; your site may then not be listed for several more months.
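As a rough illustration of what an availability figure such as "over 99%" means in practice, the sketch below computes availability from a series of up/down checks and the downtime a given target allows. The function names and the sample of 288 checks (one every five minutes for a day) are assumptions made for the example.

```python
def availability_percent(check_results):
    """Return the percentage of checks in which the site responded."""
    if not check_results:
        raise ValueError("no checks recorded")
    up = sum(1 for ok in check_results if ok)
    return 100.0 * up / len(check_results)


def allowed_downtime_hours(target_percent, days):
    """Hours of downtime permitted over a period while still meeting the target."""
    return days * 24 * (100.0 - target_percent) / 100.0


# Hypothetical sample: 288 checks in a day, of which 2 failed.
checks = [True] * 286 + [False] * 2

if __name__ == "__main__":
    print(round(availability_percent(checks), 2), "% available")
    print(allowed_downtime_hours(99.0, 30), "hours allowed in 30 days at 99%")
```

On these assumptions a 99% target still allows a little over seven hours of downtime in a 30 day month, which is why the measure is worth tracking over time.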
Speed of Access to Web Sites
Now that web hosting companies have become more competitive at seeking out new business, there is less difference in speed between them than there used to be.
It is often the case that the design of a web page will have a more dramatic effect on perceived speed than a slightly faster web host. Many sites use graphics that can be substantially reduced in file size, or redesigned so that they have smaller, simpler graphics. Speed is also a measure of the communication system. It will often be the case (especially without a broadband connection) that the slowest step is getting the data from the Internet to the client computer over a 56K modem line.
Site Vigil lets you measure the speed of access over time and build up a profile of the basic site speed.
Spidering a web site
Checking all pages on a web site
If a web site has more than a handful of pages it is very difficult to keep track of which page links to other pages and which pages use a particular graphics image. Some web designer tools will let you check pages before they are uploaded but this may not reflect the live content of the web site. As well as checking all the pages on a web site a spider monitoring scan can establish that all the links to other web sites are working correctly too.
To perform this spider scan, a robot is used that reads each page on a web site in turn. It then analyses the HTML making up each page and adds any new links to pages or graphics that it finds to the list of pages to scan next. The spider monitor robot continues to scan the site until it has scanned every page it has found a reference to.
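The scan loop described above can be sketched as follows. This is a minimal illustration, not Site Vigil's implementation: the fetch function is supplied by the caller, the three-page site is an invented in-memory stand-in so the sketch runs without a network, and graphics are simply treated as pages with no links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the targets of href and src attributes found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def spider(start_url, fetch):
    """Scan every resource reachable from start_url on the same host.

    fetch(url) returns the HTML text, or None when the resource is missing.
    Returns (scanned, missing) as two sets of URLs.
    """
    host = urlsplit(start_url).netloc
    to_scan = deque([start_url])
    seen = {start_url}
    scanned, missing = set(), set()

    while to_scan:
        url = to_scan.popleft()
        html = fetch(url)
        if html is None:
            missing.add(url)
            continue
        scanned.add(url)
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute = urljoin(url, link)
            # Stay on the same site and never queue a URL twice.
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                to_scan.append(absolute)
    return scanned, missing


# Hypothetical site held in memory; gone.htm is linked to but does not exist.
SITE = {
    "http://example.test/": '<a href="about.htm">About</a> <img src="logo.gif">',
    "http://example.test/about.htm": '<a href="/">Home</a> <a href="gone.htm">Old</a>',
    "http://example.test/logo.gif": "",
}

scanned, missing = spider("http://example.test/", SITE.get)
```

The seen set is what guarantees the scan terminates even when pages link back to each other, as the home and about pages do here.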
A well behaved spider must take account of the special directives put on the web site or individual pages to control what can be scanned. This is explained in our Controlling Robot Visit tip.
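The most common site-wide directive is a robots.txt file, which Python's standard urllib.robotparser module can interpret. The sketch below assumes a made-up robots.txt and a made-up robot name, ExampleSpider; a real spider would download the file from the site it is about to scan.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every robot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_scan(url, agent="ExampleSpider"):
    """Return True when the site's rules allow this robot to read the URL."""
    return rules.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.test/index.htm",
                "http://example.test/private/plans.htm"):
        print(url, "allowed" if may_scan(url) else "blocked")
```

A spider would call a check like may_scan before adding each newly discovered link to its list of pages to scan.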
Site Vigil supports manual or automatic spider scans of web sites enabling missing pages to be spotted easily.
It also checks the time it takes for each page access and so discovers which pages are taking the longest time to load.
You can also select the comparison option to make Site Vigil work out what has changed between two scans of a web site.
Communicating over the Web
All the commonly used communications protocols (HTTP, HTTPS, FTP, Secure FTP, Gopher, DHCP, USENET) ultimately need to send digital signals over a physical wire connection or by radio waves. Rather than each program using its own implementation of communication services, all of these protocols make use of a common underlying set of communication services called TCP/IP.
In the late 1960s the U.S. military research network pioneered a network of computers that remained remarkably stable considering the unreliability of the equipment in those days. It achieved this by using a set of communication protocols that are resilient to failure and loss. Each network node computer works independently; there is no 'master' node controlling the whole network. Each node dynamically maintains its own routing data as a 'map' of how to get information to a particular destination node. It exchanges routing information with only its immediate neighbours. This mechanism allows the network to 'self heal' when a network link or node becomes unavailable, and to re-adjust automatically when it becomes available again.
Network architecture is traditionally split into layers starting at the top application layer and going progressively down towards the hardware. The Transmission Control Protocol (TCP) forms the Transport layer and beneath it the Internet Protocol (IP) forms the Network layer. The Internet may also use UDP (User Datagram Protocol) as an alternative to TCP in some circumstances. In rough terms the Transport layer looks after assembling whole messages from individual small packets of data whatever route they may take. The Network layer looks after getting individual packets across the network. If data packets are lost then TCP automatically attempts to retry the operation. It uses a simple acknowledgement interchange to ensure this. Access to the communication stack is usually made by sockets.
Monitoring visits to a Web Site
It's very important to keep an accurate measure of the level of interest in a web site. The number of hits is a very crude estimate of activity, as accesses to graphics images on pages are often counted as hits as well as the HTML page itself. Page impressions are a better measure as they ignore references to graphics. Similarly, scans by automated programs (robots) rather than real users are often counted as hits. More sophisticated analysis requires a group of requests from the same client to be treated as a single session, and the number of sessions or user visits is a more useful measure. The profile of activity during a week is important, as some sites are busy during the working day while other sites are busiest at weekends when users browse from home. It will also indicate where geographically the main source of visitors is located. If the traffic peak coincides with the daytime peak of the Pacific timezone then that can quickly identify the main audience. A brief, high peak may indicate a scan by a robot, perhaps a search engine building an index for the web site. The traffic profile will indicate when to do site maintenance or when to increase available bandwidth.
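The difference between hits, page impressions and sessions can be shown with a small sketch. The request tuples, the list of graphics suffixes and the 30 minute idle gap that ends a session are all assumptions chosen for the example; real log analysers use their own rules.

```python
GRAPHIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".ico")
SESSION_GAP = 30 * 60  # assumed: 30 idle minutes end a session (in seconds)


def summarise(requests):
    """Summarise (client, unix_time, path) tuples given in any order."""
    hits = 0
    page_impressions = 0
    sessions = 0
    last_seen = {}  # client -> time of that client's previous request
    for client, when, path in sorted(requests, key=lambda r: r[1]):
        hits += 1  # every request counts as a hit
        if not path.lower().endswith(GRAPHIC_SUFFIXES):
            page_impressions += 1  # graphics are ignored here
        previous = last_seen.get(client)
        if previous is None or when - previous > SESSION_GAP:
            sessions += 1  # first request, or a long gap, starts a session
        last_seen[client] = when
    return {"hits": hits, "page_impressions": page_impressions, "sessions": sessions}


# Hypothetical sample: two clients, one of which returns two hours later.
SAMPLE = [
    ("10.0.0.1", 0, "/index.htm"),
    ("10.0.0.1", 1, "/logo.gif"),
    ("10.0.0.1", 60, "/products.htm"),
    ("10.0.0.2", 100, "/index.htm"),
    ("10.0.0.2", 101, "/logo.gif"),
    ("10.0.0.1", 7200, "/index.htm"),
]

summary = summarise(SAMPLE)
```

For this sample the six hits reduce to four page impressions and three sessions, which shows how much the cruder measures can overstate activity.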
Web site traffic analysis is an increasingly important tool for the efficient management of web sites.
The most significant reason why traffic might suddenly rise is when some other web site or an advertisement promotes your site. These days that will normally be a mention in an online forum or blog. Even if it is not possible to use the Referral tracking information to work out where the new users are coming from, the date and time that it happens can indicate, for example, when a magazine advertisement campaign hits the streets. Similarly if the traffic drops off unexpectedly this may indicate a major network routing problem that is blocking out a large number of users. Just because you can access a site does not guarantee that everyone else is equally fortunate.
See the Internet Traffic Report for a real-time display of global traffic levels on the Internet.
Site Vigil can automatically monitor web site traffic and raise an alert if there is a lot more (or a lot less) traffic than normal.
It does this by periodically checking the web site's logs.
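One simple way to decide that traffic is "a lot more (or a lot less)" than normal is to compare today's count with the average of recent days. The sketch below is an assumption about how such a check could work, not a description of Site Vigil's own rule; the factor of 2 and the sample history are invented.

```python
from statistics import mean


def traffic_alert(history, today, factor=2.0):
    """Return 'high', 'low' or None by comparing today with the recent average."""
    normal = mean(history)
    if today > normal * factor:
        return "high"
    if today < normal / factor:
        return "low"
    return None


# Hypothetical daily visit counts taken from the web server logs.
RECENT_DAYS = [100, 120, 110, 90, 80]

if __name__ == "__main__":
    for count in (250, 40, 130):
        print(count, traffic_alert(RECENT_DAYS, count))
```

A production check would probably compare like with like (weekday against weekday, for example), since the weekly traffic profile can vary widely.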
Specifying Which pages you Want with URLs
A Uniform Resource Locator (URL) is the way to specify what you want and how it should be provided. It has the following format:
<protocol>://<domain name>/<object name>?<query string>
The protocol specifies how to access the information; the same information may be available through several protocols (typically HTTP, HTTPS and FTP). HTTP is just one example of a protocol. Because web browsers usually use http: they normally accept www.mine.com as shorthand for the full http://www.mine.com
The domain name is the multi-level DNS name of the server e.g. www.silurian.com. More strictly www is a sub-domain name and silurian.com is the actual domain name. It is used in DNS to locate the web server that can provide the information.
The object name is the resource on the web site to access e.g. /win32/chart.htm. This need not be a file; it can be a program or any other named entity that the server wishes to make accessible.
The query string is an optional part that provides additional information for the server to give a tailored response. It is frequently used by HTML Forms (as in a query to a search engine) to specify extra information. A search engine query string usually carries a whole set of keywords. For example http://www.google.com/search?sourceid=navclient&query=develop+XP+style is a query with two parameters: sourceid set to navclient, and query set to 'develop XP style'.
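The parts of a URL described above can be separated with Python's standard urllib.parse module. The sketch uses the search URL quoted in the text; only the variable names are invented.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://www.google.com/search?sourceid=navclient&query=develop+XP+style"

parts = urlsplit(url)
protocol = parts.scheme        # how to access the information
domain_name = parts.netloc     # the DNS name of the server
object_name = parts.path       # the resource on the web site
params = parse_qs(parts.query) # query string split into named values

if __name__ == "__main__":
    print(protocol, domain_name, object_name)
    print(params)
```

Note that parse_qs turns each + in the query string back into a space and returns a list of values for every name, because a name may legitimately appear more than once.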
You can gain access to authenticated areas of a web site by specifying a user name and password. If no username is provided then the default public access permission control is used.
Site Vigil allows a whole group of resources to be monitored by just specifying their URL in the page monitoring options.
You'll want to know any problem that a client may experience accessing the site with a browser.
Site Vigil offers a variety of options for web page monitoring :
Server Virus Attacks
Although it's ordinary PCs which normally succumb to computer viruses, it's all too easy for web servers to become infected too. All that is needed is a security weakness that can be exploited so that arbitrary code can be executed on the server. This is often because a port has not been blocked to external access. Some pernicious viruses will not only infect executable files held on the server but also insert extra script code into any HTML, PHP or ASP files they find there. Nimda is one such virus that is very hard to remove from a server once it is infected. More recent viruses may try to get the server to send out masses of emails to spread themselves over the Internet.
Site Vigil helps detect virus attacks by maintaining a checksum for web pages and alerts you when the page has been unexpectedly changed.
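Checksum-based change detection can be sketched as below. The text does not say which checksum Site Vigil uses; SHA-256 from Python's hashlib is assumed here purely for illustration, and the page contents and URLs are invented.

```python
import hashlib


def page_checksum(content: bytes) -> str:
    """Return a hex digest that changes whenever the page content changes."""
    return hashlib.sha256(content).hexdigest()


def changed_pages(baseline, current):
    """List URLs whose current content no longer matches the stored checksum.

    baseline maps URL -> checksum recorded earlier;
    current maps URL -> page content as just downloaded.
    """
    return sorted(url for url, body in current.items()
                  if baseline.get(url) != page_checksum(body))


# Hypothetical baseline recorded when the pages were known to be clean.
BASELINE = {
    "/index.htm": page_checksum(b"<html>Welcome</html>"),
    "/contact.htm": page_checksum(b"<html>Contact us</html>"),
}

# Hypothetical later download: extra script code has appeared in one page.
LATEST = {
    "/index.htm": b"<html>Welcome<script src='http://bad.test/x.js'></script></html>",
    "/contact.htm": b"<html>Contact us</html>",
}

suspect = changed_pages(BASELINE, LATEST)
```

A check like this cannot tell a malicious change from a legitimate edit, so the baseline has to be refreshed whenever the site is deliberately updated.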
Watch WebSite Pages
When you want to know the pattern of accesses to individual pages, graphics or other web site resources then standard site statistics fail to deliver the level of detail you need.
This facility is useful if you want to count the build up of accesses to a new area of a web site or perhaps accesses to a download file or a contact page indicating that users are having problems. Of particular use is the ability to keep a watch on accesses to a 404 error handler page, so that you are quickly alerted when a page is missing anywhere on a web site.
Site Vigil lets you monitor the level of accesses to resources on a web site. It will then alert you when the number of accesses rises above or falls below a programmable threshold of hits per day. A detailed history of accesses gives you the information needed to check the historical pattern of access to a resource. To do this Site Vigil needs access to the web server log files so that it can analyse the data.
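Counting daily accesses to one watched resource and testing them against thresholds might look like the sketch below. It assumes the server writes logs in the widely used Common Log Format; the sample lines, the watched path and the thresholds are invented for the example.

```python
import re
from collections import Counter

# Matches a Common Log Format line and captures the day and requested path.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]*\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+'
)


def daily_hits(log_lines, watched_path):
    """Count accesses to watched_path for each day that appears in the log."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == watched_path:
            counts[match.group("day")] += 1
    return counts


def out_of_range(counts, low, high):
    """Return the days whose hit count is below low or above high."""
    return {day: n for day, n in counts.items() if n < low or n > high}


# Hypothetical log extract.
SAMPLE_LOG = [
    '10.0.0.1 - - [29/Dec/2004:09:00:01 -0500] "GET /download/setup.exe HTTP/1.0" 200 4096',
    '10.0.0.2 - - [29/Dec/2004:10:15:30 -0500] "GET /download/setup.exe HTTP/1.0" 200 4096',
    '10.0.0.2 - - [29/Dec/2004:10:15:31 -0500] "GET /index.htm HTTP/1.0" 200 1024',
    '10.0.0.3 - - [29/Dec/2004:11:47:28 -0500] "GET /download/setup.exe HTTP/1.0" 200 4096',
    '10.0.0.4 - - [30/Dec/2004:08:00:00 -0500] "GET /download/setup.exe HTTP/1.0" 200 4096',
]

counts = daily_hits(SAMPLE_LOG, "/download/setup.exe")
alerts = out_of_range(counts, low=2, high=10)
```

Days on which the resource was never requested do not appear in the counts at all, so a fuller version would need to fill in zero for them before applying the lower threshold.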
Choosing a good Web Host
A web hosting company looks after the web pages for a domain. The company has a high capacity connection to the Internet to allow anyone to access the web pages on their computer. A good web host will provide several independent connections to the Internet in case one of them fails.
The web hosting package will normally dictate how much web space is available in total and how much bandwidth is allocated. Small sites may need only 10MB of web space and the actual bandwidth used may be as low as 10MB a month, but packages will often allow a much greater level of web traffic. For a large or very busy web site it will be necessary to have a dedicated server to host the site. The very largest web sites will have multiple server computers with complex failover and load balancing techniques to ensure fast access times.
Most small domains are hosted together in groups on a single server computer. Each server is accessed by a single unique IP address. If a web site experiences sporadic slow access it may be that some of the web sites with which it is co-located are overloaded.
You can get Site Vigil to inform you when the IP address for any domain name has changed. This is particularly useful when you want to track the transfer of a domain from one web hosting company to another.
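Detecting a change of IP address amounts to resolving the domain name and comparing the result with the address seen last time. In the sketch below the function name is invented and the resolver is passed in as a parameter, so the example can be tried with a stand-in instead of a live DNS lookup; by default it uses the standard socket.gethostbyname call.

```python
import socket


def check_ip_change(domain, last_known_ip, resolve=socket.gethostbyname):
    """Resolve domain and report whether its address differs from last time.

    Returns (current_ip, changed).
    """
    current_ip = resolve(domain)
    return current_ip, current_ip != last_known_ip


def fake_resolver(domain):
    """Stand-in for DNS: pretends the domain has moved to a new host."""
    return "192.0.2.20"  # address from a range reserved for documentation


if __name__ == "__main__":
    ip, changed = check_ip_change("www.example.test", "192.0.2.10", fake_resolver)
    print(ip, "changed" if changed else "unchanged")
```

Run periodically with the real resolver, a check like this shows the moment at which a domain transfer to a new hosting company takes effect.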
Finding out who owns what
To find out information about a domain you can use a Who is service like Whois Source to get the ownership and contact information for the domain name. This may not indicate who is hosting the domain, however, as the owner or administrator of a domain can be entirely separate from the hosting company.
The second way is to trace the owner of a domain's IP address. The Internet is split up into geographical areas, each with their own controlling authority for IP addresses, for the Americas this is ARIN (American Registry for Internet Numbers) whilst for Europe this is RIPE (Réseaux IP Européens). They all offer a Who is service that allows you to type in an IP address and trace the company responsible for managing it. For a server this is normally the hosting company, while for a user this is normally the ISP (Internet Service Provider). In both cases there will be contact e-mail addresses in the returned information.
There follow two examples (they do not reflect real contact information) copied from Site Vigil screens.
Example Whois for Detron.com
Lookup for information about domain name 'www.detron.com'
Using information read from 'whois.internic.net,whois.networksolutions.com'
Registrant: Retro Aerospace (RETRO2-DOM)
7600 Belfast Ave
Oakland, CA 94719
Domain Name: RETRO.COM
Rupert Bushell (38456655P) firstname.lastname@example.org
24672 Santa Clara St
Hayward, CA 94544
Hamish McCall (3823255P) email@example.com
24672 Santa Clara St
Hayward, CA 94544
Record expires on 21-Mar-2006.
Record created on 20-Mar-1994.
Database last updated on 29-Dec-2004 11:47:28 EST.
Domain servers in listed order:
Example Whois for 188.8.131.52
Lookup for information about IP address '184.108.40.206'
Using information read from 'whois.arin.net'
OrgName: DSN.net, Inc.
Address: 541 Long Wharf Drive
City: New Haven
NetRange: 220.127.116.11 - 18.104.22.168
NetType: Direct Allocation
Comment: ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE
Comment: rwhois.scruz.net 4321
OrgNOCName: Network Operations Center
OrgTechName: IP Administration
Site Vigil lets you easily access domain and IP address Whois information. Whenever an IP address or domain name is displayed a Who Is button lets you get the ownership data from the Internet.
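A Who is lookup is a very simple exchange: the client connects to the registry's server on TCP port 43, sends the domain name or IP address followed by a line ending, and reads the text reply until the server closes the connection. The sketch below shows that exchange together with a small parser for replies made of "Name: value" lines like the ARIN example above. The function names are invented, and the sample reply reuses a few lines from that example rather than real data.

```python
import socket


def whois_query(query, server, port=43, timeout=10):
    """Send one Who is query and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when finished
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_fields(reply):
    """Collect 'Name: value' lines; a name may occur more than once."""
    fields = {}
    for line in reply.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip() and value.strip():
            fields.setdefault(name.strip(), []).append(value.strip())
    return fields


# Lines taken from the example reply shown earlier (not real contact data).
SAMPLE_REPLY = """\
OrgName: DSN.net, Inc.
City: New Haven
NetType: Direct Allocation
Comment: ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE
Comment: rwhois.scruz.net 4321
"""

fields = parse_fields(SAMPLE_REPLY)
```

Calling whois_query("example.com", "whois.internic.net") would perform a live lookup; domain registries often reply in a looser layout than ARIN, as the first example shows, so the parser here only suits replies in the "Name: value" style.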