Webmasters should always be wary of hackers who are looking to steal data or computer resources. This is true simply because the nature of digital communications permits such activities to go easily undetected. Hackers just can't be spotted, in a physical sense, as easily as, say, carjackers.
This book has brought up the topic of security whenever it has been pertinent to the other techniques or software being discussed. Chapter 16, "Maintaining Your Web Site," discussed several settings in the Windows Registry relevant to
security. This chapter takes a look at the bigger picture of security from an Internet perspective.
We'll also be talking about Internet robots, or bots. Robots are an evolving and controversial topic on the Internet. Many people are still unaware of their existence, and some people are even upset by it. We are going to tell
you what these robots are doing on the Internet, how you can benefit from them, why you should be concerned, and what you can do about it.
The second part of this chapter is about using a firewall to protect your server or your LAN from outside intrusion and how to use proxy servers to keep your network diagram from being reverse-engineered.
Finally, we will mention the topic of software viruses and what you should do to minimize your risk of attack.
The security information in this chapter is more concerned with the structure of the network than it is with secure commercial transactions. If you are looking for information about secure commerce on the Web, please see Chapter 15, "Commerce on
the Web."
World Wide Web robots, sometimes called wanderers or spiders, are programs that traverse the Web automatically. The job of a robot is to retrieve information about the documents that are available on the Web and then store that information
in some kind of master index of the Web. Usually, the robot is limited by its author to hunt for a particular topic or segment of the Web.
At the very least, most robots are programmed to look at the <TITLE> and <H1> tags in the HTML documents they discover. Then they scan the contents of the file looking for <A HREF> tags to other documents. A typical robot might store
the URLs of those documents in a data structure called a tree, which the robot then uses to continue the search whenever it reaches a dead end (more technically, a leaf node). We are oversimplifying this a bit; the larger robots
probably use much more sophisticated algorithms. But the basic principles are the same.
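To make that concrete, here is a minimal sketch in Python of the parsing step just described. The <TITLE>, <H1>, and <A HREF> tags come straight from the discussion above; the class name, the use of Python's standard HTMLParser, and the sample page are our own illustration, not taken from any particular robot.

from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collect the <TITLE>, <H1>, and <A HREF> values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._capture = None            # which text element we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._capture = tag
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)    # candidate URL for the robot's tree

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture == "h1":
            self.headings.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("title", "h1"):
            self._capture = None

# Usage: feed the raw HTML of a page and read off what the robot would index.
scanner = PageScanner()
scanner.feed("<html><title>Example</title><h1>Hello</h1>"
             "<a href='/page2.html'>next</a></html>")
print(scanner.title, scanner.headings, scanner.links)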
The idea behind this is that the index built by the robot will make life easier for us humans who would like a quick hop to information sources on the Internet.
The good news is that most robots are successful at this and do help make subsequent search and retrieval of those documents more efficient. This is important in terms of Internet traffic. If a robot spends several hours looking for documents, but
thousands (or even millions) of users take advantage of the index that is generated, it will save all those users from tapping their own means of discovering the links, potentially saving great amounts of network bandwidth.
The bad news is that some robots inefficiently revisit the same site more than once, or they submit rapid-fire requests to the same site in such a frenzy that the server can't keep up. This is obviously a cause for concern for Webmasters. Robot authors
are as upset as the rest of the Internet community when they find out that a poorly behaved robot has been unleashed. But usually such problems are found only in a few poorly written robots.
Figure 20.1 shows a hypothetical case of a tree-traversal algorithm that a robot might use. In tree diagrams such as this, computer scientists refer to the circles as nodes (generically). In this case, the nodes represent HTML pages. Node #1 is where
the journey begins; it is considered the root of the tree. Upon inspecting the HTML code at node #1, the robot discovers a link to node #2. When it reaches a static document at node #3 (a leaf node), it backtracks first to node #2 and then to node #1, where it
continues to node #4, and so on.
Figure 20.1. A robot traversing the Web.
A problem occurs when node #7 contains an additional link back to node #3. Certainly, the dynamic nature of the Web does not preclude this. If the robot isn't smart, it will revisit node #3, placing an unnecessary burden upon that server.
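Here is a minimal sketch, again in Python, of the traversal just described, with a set of already-visited URLs so the robot skips node #3 instead of hitting it a second time. The fetch_links function is a stand-in for whatever the robot uses to download a page and pull out its <A HREF> targets (the PageScanner shown earlier would do); none of this is taken from a real robot.

def crawl(root_url, fetch_links, max_pages=100):
    """Traverse the link tree depth-first, skipping URLs already seen."""
    visited = set()              # nodes the robot has already indexed
    pending = [root_url]         # nodes discovered but not yet visited
    index = []                   # the master index being built

    while pending and len(index) < max_pages:
        url = pending.pop()
        if url in visited:       # e.g., node #7 linking back to node #3
            continue             # skip it rather than burden that server again
        visited.add(url)
        index.append(url)
        for link in fetch_links(url):
            if link not in visited:
                pending.append(link)
    return index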
Fortunately, guidelines have been developed for robot authors, and most robots are compliant. For an excellent online resource about robots, on which much of this chapter is based, see "World Wide Web Robots, Wanderers, and Spiders" by Martijn Koster, at http://info.webcrawler.com/mak/projects/robots/robots.html. It contains links to documents describing robot guidelines, the
standard for robot exclusion, and an in-depth collection of information about known robots.
There are many active robots on the Web today. For a list of known robots at the time of printing, see Appendix F, "36 Internet Robots."
A good understanding of Web robots and how to use or exclude them will aid you in your Web ventures; in fact, it could help to keep your server alive.
There are lots of reasons to want to exclude robots from visiting your site. One reason is that rapid-fire requests from buggy robots could drag your server down. Or your site might contain data that you do not want to be indexed by outside sources.
Whatever the reason, there is an obvious need for a method for robot exclusion. Be aware that it wouldn't be helpful to the Internet if all robots were excluded.
Often, on Web-related newsgroups and list servers, you will see a new Web site administrator ask the question "What is ROBOTS.TXT and why are people looking for it?" This question often comes up after the administrator looks at his
or her Web access logs and notices a line similar to this:
Tue Jun 06 17:36:36 1995 204.252.2.5 192.100.81.115 GET /robots.txt HTTP/1.0
Knowing that they don't have a file named ROBOTS.TXT in the root directory, most administrators are puzzled.
The answer is that ROBOTS.TXT is part of the Standard for Robot Exclusion. The standard was
agreed to in June 1994 on the robots mailing list (robots-request@webcrawler.com) by the majority of robot authors and other people with an interest in robots.
Some of the things to take into account concerning the Standard for Robot Exclusion are:
In addition to using the exclusion described below, there are a few other simple steps you can follow if you discover an unwanted robot visiting your site:
The method used to exclude robots from a server is to create a file on the server that specifies an access policy for robots; this file is named /robots.txt.
The file must be accessible via HTTP on the local URL, with the contents as specified here. The format and semantics of the file are as follows:
Any empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record. The presence of an empty ROBOTS.TXT file has no explicit associated semantics; it will be treated as if it were not present. In other
words, all robots will consider themselves welcome to pillage, um, we mean, examine your site. Jokes aside, just remember that robots cannot damage your files. When a robot interacts with your Web server, the robot has no more capability than a typical Web
browser. The only potential harm from robots is that poorly written ones can keep your Web server busy. If you want to protect your site from damage, see the section below titled "Firewalls and Proxy Servers."
Here is a sample ROBOTS.TXT for http://www.yourco.com/ that specifies no robots should visit any URL starting with /yourco/cgi-bin/ or /tmp/:
User-agent: *
Disallow: /yourco/cgi-bin/
Disallow: /tmp/
Here is an example that indicates no robots should visit the current site:
User-agent: *
Disallow: /
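From the robot author's side, honoring these records is straightforward. The sketch below uses Python's standard urllib.robotparser module against the hypothetical www.yourco.com rules shown above; the module and its can_fetch method are real, but the host and agent names are only placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.yourco.com/robots.txt")   # hypothetical host from the example
rp.read()                                        # fetch and parse the exclusion file

# A well-behaved robot asks before retrieving each URL.
print(rp.can_fetch("MyRobot/1.0", "http://www.yourco.com/index.html"))        # allowed
print(rp.can_fetch("MyRobot/1.0", "http://www.yourco.com/yourco/cgi-bin/x"))  # disallowed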
Let's face it, the very thing that made the Internet grow so large and fast is what makes it such a dangerous place. There are entire books written about Internet security, and rightly so: Internet security is a complex topic. Clearly, this section
won't tell you everything you need to know about Internet security, but it will help you to understand many of the security implications of your decisions. Our motto is that it's always good to know what it is that you don't know.
Recently, there was an interesting thread (a sequence of several messages on the same topic) on a list server about Web site security. The story went like this: An Internet Service Provider (ISP) was updating its Web page that contained service prices.
When the ISP employee opened the document for editing, he/she noticed that all the service prices had been bumped up to outrageous levels. This is just one example of how your site can be compromised.
If you intend to maintain an Internet connection and you truly want a secure site, you will have to consider getting firewall protection. An Internet firewall gets its name from the fact that it helps you control the TCP/IP packets that travel between
your network (or server) and the rest of the Internet. Running a firewall enables you to regulate network traffic and discard packets which originate from undesirable Internet locations (based on TCP/IP packet type and/or previous logfile analysis). The
first thing to do if you want to add a firewall to your site is to change from a Dial-Up Networking connection to an Ethernet/router-based type of connection.
A firewall can be software, hardware, or a combination of the two. Commercial firewall packages cost a lot more than loose change, with prices ranging anywhere from $1,000 to $100,000.
We haven't heard of a software-only firewall for Windows 95, but when such products become available, they will likely be less expensive than the hardware versions. In the meantime, you might consider running a freeware version of UNIX for the purpose of
including a firewall in your network. For an excellent reference, you might consult Linux Unleashed, published by Sams.net.
A firewall usually includes several software tools. For example, it might include separate proxy servers for e-mail, FTP, Gopher, Telnet, Web, and WAIS. The firewall can also filter certain outbound ICMP (Internet Control Message Protocol) packets so
your server won't be capable of divulging network information.
Figure 20.2 shows a network diagram of a typical LAN connection to the Internet including a Web server and a firewall. Note that the Web server, LAN server, and firewall server could all be rolled into one machine if the budget is tight, but separating
them as we show here makes for a safer environment.
Figure 20.2. Using a firewall/proxy server on a LAN.
The proxy server is used to mask all of your LAN IP addresses on outbound packets so they look like they all originated at the proxy server itself. Each of the client machines on your LAN must use the proxy server whenever it connects to the Internet
for FTP, Telnet, Gopher, or the Web. The reason for doing this is to prevent outside detection of the structure of your network. Otherwise, hackers monitoring your outbound traffic would eventually be able to determine your individual IP addresses and then
use IP spoofing to feed those back to your server when they want to appear as a known client.
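As a small illustration of the client side of this arrangement, the Python sketch below routes an ordinary Web request through a proxy, so the remote server only ever sees the proxy's address. The proxy address 10.0.0.1:8080 is purely hypothetical.

import urllib.request

# Hypothetical internal proxy; only its address is visible to the outside world.
proxy = urllib.request.ProxyHandler({"http": "http://10.0.0.1:8080"})
opener = urllib.request.build_opener(proxy)

# Every request made through this opener leaves the LAN via the proxy server.
with opener.open("http://www.yourco.com/") as response:
    print(response.status)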
Another purpose of a firewall is to perform IP filtering of incoming packets. Let's say that you have been monitoring the log files on your Web server and you keep noticing some unusual or unwanted activity originating from 193.3.5.9. After
checking with a whois program (such as the GUI version included with this book), you determine that the domain name is bad.com, and you don't have any business with them. You can configure the IP filter to block any connection attempts originating from
bad.com while still allowing packets from the friendly good.com to proceed.
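Conceptually, this kind of filter is just a list of deny rules checked against each incoming packet's source address. The Python sketch below is our own illustration of that logic, using the 193.3.5.9 address from the example; a real firewall or router does the same thing at the packet level, in its own configuration language.

import ipaddress

# Deny rules are checked first; anything not denied is allowed by default.
DENY = [ipaddress.ip_network("193.3.5.9/32")]    # the troublesome host at bad.com

def allow_packet(source_ip):
    """Return True if a packet from source_ip should be let through."""
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in net for net in DENY)

print(allow_packet("193.3.5.9"))    # False -- blocked
print(allow_packet("193.3.5.10"))   # True  -- allowed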
Now let's return to the previous point about the need to switch your Internet connection method to an Ethernet and router combination, as opposed to a Dial-Up Networking modem connection. Following are a few possible firewall configurations.
Figure 20.3 depicts one of the simplest firewall configurations. The router that resides between your server and the Internet is the key element. Of course, adding a router also means adding a CSU/DSU and takes you out of the realm of a simple modem
connection.
Figure 20.3. Server and router.
This economical configuration uses a router that offers IP packet filtering. This is hardware-only firewall protection.
This scenario is only slightly more involved. Figure 20.4 shows a software firewall (perhaps just a proxy Web server) being added to the Web site.
Figure 20.4. The server with a built-in firewall.
In this case, if the router offered no IP packet filtering, you would have software-only firewall protection. At the time of this writing, we had found no software-only solutions that ran natively under Win95.
The configuration shown in Figure 20.5 is rather advanced. The addition of a separate firewall server to your network is one of the best ways to guard your resources from intruders.
Figure 20.5. Using a Firewall Server to protect the rest of the LAN.
In this case, we added a separate computer running firewall/IP filtering software. This is the hardware/software solution described earlier.
These diagrams do not show the connections beyond the server. If your situation involves a network, the server would be a multihomed host with two NICs acting as a bridge or Internet gateway.
As you can see, there are many benefits to firewalls. The choice of whether or not to implement a firewall is a complex decision that will involve factors such as these:
If you decide you need to implement a firewall but don't have the cash to fork out for a $30K or $40K hardware solution, you might want to take a look at the FWTK (Firewall Toolkit). FWTK is a freeware collection of firewall-related utilities from
TIS (Trusted Information Systems). This toolkit is distributed in source code format. It's written for UNIX, so you will need to run it on a UNIX box or spend the time or money to convert the code to Windows. There are several freeware versions of UNIX that
run on the PC, so this solution can be a very inexpensive one. The drawback to this method is that it requires a great deal of time, effort, and UNIX knowledge to implement. To obtain the FWTK and read lots of other information on commercial firewalls,
see http://www.tis.com.
As if you didn't already have enough trouble, the risk of a virus deleting or scrambling your files is very serious.
One way to curtail the risk is to avoid downloading programs from the Internet. You must realize that you can be bitten the instant you run any kind of executable image. This includes programs, DLLs, command files, or even autostart macros in commercial
applications.
Of course, life on the Net isn't too practical without access to all the cool stuff that keeps being invented every day. So the next level of protection is to test new software on a cheap stand-alone machine before giving it access to your precious hard
drive.
Virus detection programs (or virus scanners) should be used to verify that a new program appears legitimate. Virus scanners analyze your software looking for several types of red flags that would indicate danger. They are able to detect and warn you of
hundreds of known viruses. Yet another problem is that new viruses are being written by scum programmers every day. Fortunately, most virus scanners have an answer for that, too. By keeping a checksum of each file on your disk from the last scan and then monitoring those checksums for changes, they can detect the effects of unknown viruses.
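The checksum idea is easy to see in miniature. The Python sketch below hashes every file under a directory and compares the result with a snapshot saved during the previous scan; a file whose checksum has changed unexpectedly is exactly the kind of red flag a scanner reports. The directory and snapshot file names are placeholders.

import hashlib
import json
import os

def snapshot(root):
    """Map each file under root to its SHA-256 checksum."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                sums[path] = hashlib.sha256(f.read()).hexdigest()
    return sums

BASELINE = "baseline.json"      # placeholder: where the last scan's checksums live
WATCHED = "C:/webroot"          # placeholder: directory to monitor

current = snapshot(WATCHED)
if os.path.exists(BASELINE):
    with open(BASELINE) as f:
        baseline = json.load(f)
    # Files seen before whose contents no longer match their old checksum.
    changed = [path for path, digest in current.items()
               if path in baseline and baseline[path] != digest]
    print("Changed files:", changed)
else:
    with open(BASELINE, "w") as f:
        json.dump(current, f)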
At the time of this writing, we know of four virus scanners for Windows 95. We have tested three of these: McAfee Scan for Windows 95, ThunderBYTE Anti-Virus 95, and Doctor Anti-Virus 95. All of these applications worked well, and each had its own unique features. Of the three, Doctor Anti-Virus 95 seemed to have the best integration with Windows 95. If money is not an issue, we recommend running several anti-virus scanners on your system. You can never be too careful.
When people think about security on the Internet, they automatically think about firewalls. But there is a lot more to security than just firewalls. You will keep away most general pranksters by setting up your site with security in mind. Here are some
miscellaneous pointers for running a secure Web site:
This concludes the section on Expanding Your Internet Server. Now we are ready to tackle the topic of Web programming. In the next part of the book, we get into several programming languages, each with its own place in the Webmaster's toolbox. Whether your preference is Perl, C++, Visual Basic, or Java, we have a little something for everyone.