![]() |
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Glossary L to RLog File ArchiveArchiving Access ActivityMany web servers keep their raw log files only for a limited time period. This information may be crucial in determining what has happened at a particular time. If the web server crashes or the log files get lost then you need the security of an offsite log backup. Hosting companies may well back up the web site files but not the logs, the logs can grow to a very large size and therefore backup is an expensive option for them to provide. Some servers will keep only the previous month or week on the site before it is replaced by the next one. All web servers will store information about the accesses made to the site in a log file archive. This enables a web master to investigate problems. Standard web hosting companies generate their 'free' access statistics by processing these log files to work what and when is being accessed on the site.
Log FilesWeb Server Log FilesA web site host runs a special service that manages the HTTP protocol. The HTTP service is the standard way that HTML pages are transferred to browsers over the Internet. Each HTTP request received by the server is typically a request for the contents of a particular web page or graphics file. The server logs all these requests with each new request or 'hit' as a separate record as a line. The server log file is the source of information used to generate web site statistics offered by most web hosting companies. It gives information about the date and time, source IP address, data requested, the referring web page and browser. The referral data give vital information about the links which people are using to reach a web site. The data requested is the full URL used to reach the site, quite often this will include the keywords used by the search engine to list the web site. Here is a sample line from a log file :
The meaning of each of there fields is as follows :
See also : RFC2616 : Hypertext Transfer Protocol HTTP/1.1 Site Vigil can automatically fetch and analyse log files in a variety of different formats Log File FormatsLogging accesses to Web sitesAll web servers will store information about the accesses made to the site in a log file. This enables a web master to investigate problems and for statistics to be generated showing what and when is being accessed on the site. There are two main formats in use. Different servers (Microsoft® IIS, Apache®, ...) use different log formats. However, they are all text based and contain information about each access to a resource (HTML page or graphics file) is recorded as a single line of information (a hit). A web site administrator can usually control how much information goes into the log file, as the full set of information is rather large. Each log file record may contain the following fields of information :
Some servers will archive log files in compressed format (e.g. ZIP, GZ or CAB). Log file format specification Site Vigil supports a wide variety of log file formats and will automatically select an appropriate one by scanning a sample log file. It can then analyse the records and pick out error and referral records. MIMESpecifying information formatThe HTTP protocol transfers information around in binary format, it is up to the client and server to negotiate so that the client (typically a browser) is only sent information that it can understand. This negotiation is carried out using MIME (Multipurpose Internet Mail Extensions) types. As the acronym suggests this was originally developed to describe the content of email messages but is now much widely used within HTTP. It uses a simple two-part text description to describe the content format consisting of a type and a subtype. So text/html indicates that it is basically text but the text is in HTML format, text/plain is for raw untagged text (as is a .txt file) and image/jpg indicates a graphics file in JPEG image format. When a browser requests data it states the MIME types it is willing to receive as a response, the server will then choose an available format for the response from these types.
The MIME Information Page PingUsing Ping to check a Server is workingOne important facet of site monitoring is knowing as soon as possible that servers have failed or are not accessible. A web server provides multiple services not just HTTP. Just because a server is not responding to HTTP requests does not imply it is not functioning at all. The simplest means of establishing whether a site or server is alive is using the Interface Control Message Protocol (ICMP) protocol to Ping a server. This is a much simpler request than fetching an HTML page in terms of the communication overheads. It runs over the IP protocol and so checks that the IP part of TCP/IP is functioning OK. This protocol is also used by the tracert command line utility to find the route that communication is taking to a server. The Ping connectivity check works well in an Office Intranet situation too, it can regularly monitor whether the key servers and workstations are responding properly to IP traffic within an office local area network. The protocol supports a number of commands but the ECHO command is the one of interest for Ping monitoring. It instructs routers to pass the message over IP to a particular destination IP address requesting an ECHO REPLY to be sent back. Measuring the time between from issuing the ECHO and receiving the ECHO REPLY determines the responsiveness of the remote server. The ICMP echo reply includes a Time to Live (TTL) value. This indicates the number of router hops that the message has gone through from the source. Normally the packet starts off with a 255 TTL value and then each router it passes through decrements the value by one. If the number of hops is erratic or suddenly becomes large this indicates a router problem. The same ECHO command can be used to trace a route over the Internet (as used in tracert program See also RFC792 : Internet Control Message Protocol
PortConnecting to the correct Service PortEach IP address can be accessed on a range of numeric port numbers. The port number requested is part of a client connect request and can be specified as part of a URL. When a URL omits the port number the default port number for that service is assumed (80 for HTTP web service). These map onto inter-communicating sockets, when a server socket is set up it chooses a unique port number on which to listen for requests (as part of the bind socket API call), the client issues a connect to a server giving an IP address and a port. In most cases the port number is assigned to a particular service, so the number is really acting as a name for the service that is required. The only ports of interest on the Internet to users are the ones used for HTTP and FTP. A more comprehensive list of standard ports is as follows :
Site Vigil can be used to check that you can connect to any port on a particular IP address. This is an option included in the Ping functionality. ProxyIndirect proxy access to the InternetOriginally each computer wishing to use the Internet had to connect to it directly, this is OK for servers or a home user dialling up for a connection but not convenient for an office environment where hundreds of PCs may want to use the Internet all at the same time. To solve this problem Proxy servers are used. These servers have a dedicated Internet connection but make requests on behalf (as a proxy) for all the computers wishing to access the Internet through it. Most browsers have connection settings that allow you to configure the IP address and port that is then used to communicate with a Proxy Server. HTTP requests are then sent to the Proxy Server using TCP/IP which then in turn sends them out onto the Internet. A proxy server may run on a separate machine (often in conjunction with a firewall) or as an ordinary program running on a PC. It needs to keep track of all client requests so it can route all the responses sent to it from the server back to the browser that requested the information. You can configure Site Vigil to access the Internet via a Proxy Server for HTTP and HTTPS requests. QueryGetting information from a userQueries are an important part of HTML.
They are used to pass additional information to a server about the data requested.
The most widespread usage is when an HTML Form is submitted (as an HTTP POST) and the various values
entered on the form are sent as query strings tagged onto the end of the URL.
Search engines such as Google
RankWeb Site RankingThere are a number of Internet services that attempt to rank web sites in some sort of popularity order. As there are about 500 million web sites this is not an easy task and relative ranking scores are not to be totally trusted. For example on this particular day the top five web sites according to Alexa It is not possible to look at individual web site traffic statistics in order to gauge popularity. The server logs are not publicly accessible. If you use the Google or Alexa Toolbars to reach web sites this is one way that these sites can build up statistics in order to rank sites. Each time the toolbar is used, the click is recorded and added to their database. This makes all ranking measures rather inaccurate, it is best to treat them as a very rough indication of relative popularity only. Google ReferralsLooking at how visitors find your siteDo you know how people are reaching your web site ? When a web site has a set of links to it, you get a referral when the user follows the link. This is usually by a person clicking on a link in a browser or else by an automated scanning engine or robot. It is important to monitor the number of referrals coming to a site to be able to adjust the site content and therefore the keywords to attract more visitors. It may be that you are getting people to come to the site for the wrong reason and are not going to stay or come back again. When a user follows an HTML link to get from one web page to another, the web server typically store how a page (locally or remotely was referred to). Tracking how people get referred to a web site is crucial to measuring the effectiveness of a web site. Are people getting to the information quickly and easily ? Are they only looking at one page and then leaving the web site ? Which keywords and phrases are people using to reach your site ? Some firewalls give the option to remove the referral information in order to give the user more anonymity, if a web site needs to be sure it knows where it has been referenced from it needs to include this as part of the query part of a URL. Referral monitoring allows the sources of external web traffic to be identified. Typically Google will be the largest source of referrals as most people find web sites from a search engine, and Google is by the far the most popular one at present. By using referrer information you can work out :
Site Vigil can scan the server log files and check all the referrals to a site. It will then alert you when a new source of referrals starts and even warn you when the expected level of referrals declines. Referral monitoring is a crucial part of web site management. A detailed history of referrals gives you the information needed to check out any new references to your site. RFCInternet Standards and ProposalsThe technical standards that govern how the various parts of the Internet function are documented as Request For Comment (RFC) documents. Although the name suggests that these are just early proposals, these documents include actual working standards for much of Internet technology. Some RFCs are experimental and some have been entirely superseded so it is important to refer to the appropriate RFC. They are co-ordinated by committees of Internet professionals. There are over three thousand RFCs in existence covering all aspects of the Internet.
A good starting point is Internet Engineering Task Force (IETF web site www.ietf.org Here is a list of commonly referenced RFCs, but be warned that their technical nature can make them tough reading : RFC777 : Internet Control Message Protocol RobotAutomated Internet ScanOn the Internet a robot is not some mechanical human-like servant but just a special type of computer program. Ever wondered how search engines build up their indices of web sites ? Well, search engines amongst other programs use robots to continually trawl web sites analyzing the contents as if they were human visitors using a browser. They use HTTP just like browsers in order to access information. A server can state which pages should be inspected by robots, the instructions are stored in the robots.txt file. Over time, search engines have grown much more sophisticated and the way that they scan sites is complex. Many will first scan the site's index page, coming back after weeks or months to drill down to scan the rest of the web site. The server log usually includes a browser field in each log record indicating the name of the robot. No well-behaved robot will flood a server with requests as this would affect the server's performance. They will spread their site scan over hours or days. Robots should include a contact URL or email address in the browser information included in the HTTP request header so that a web master can analyze the activity by robots.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||