Webmasters should always be wary of hackers who are looking to steal data or computer resources. This is true simply because the nature of digital communications permits such activities to go easily undetected. Hackers just can't be spotted, in a physical sense, as easily as, say, carjackers.
This book has brought up the topic of security whenever it has been pertinent to the other techniques or software being discussed. Chapter 16, "Maintaining Your Web Site," discussed several settings in the Windows Registry relevant to
security. This chapter takes a look at the bigger picture of security from an Internet perspective.
We'll also be talking about Internet robots, or bots. Robots are an evolving and controversial topic on the Internet. Many people are still unaware of their existence, and some people are even upset by it. We are going to tell
you what these robots are doing on the Internet, how you can benefit from them, why you should be concerned, and what you can do about it.
The second part of this chapter is about using a firewall to protect your server or your LAN from outside intrusion and how to use proxy servers to keep your network diagram from being reverse-engineered.
Finally, we will mention the topic of software viruses and what you should do to minimize your risk of attack.
The security information in this chapter is more concerned with the structure of the network than it is with secure commercial transactions. If you are looking for information about secure commerce on the Web, please see Chapter 15, "Commerce on
the Web."
World Wide Web robots, sometimes called wanderers or spiders, are programs that traverse the Web automatically. The job of a robot is to retrieve information about the documents that are available on the Web and then store that information
in some kind of master index of the Web. Usually, the robot is limited by its author to hunt for a particular topic or segment of the Web.
At the very least, most robots are programmed to look at the <TITLE> and <H1> tags in the HTML documents they discover. Then they scan the contents of the file looking for <A HREF> tags to other documents. A typical robot might store
the URLs of those documents in a data structure called a tree, which the robot then uses to continue the search whenever it reaches a dead end (more technically, a leaf node). We are oversimplifying this a bit; the larger robots
probably use much more sophisticated algorithms. But the basic principles are the same.
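To make that concrete, here is a minimal sketch in Python of the parsing step just described. The <TITLE>, <H1>, and <A HREF> tags come straight from the discussion above; the class name, the use of Python's standard HTMLParser, and the sample page are our own illustration, not taken from any particular robot.

from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collect the <TITLE>, <H1>, and <A HREF> values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._capture = None            # which text element we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._capture = tag
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)    # candidate URL for the robot's tree

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture == "h1":
            self.headings.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("title", "h1"):
            self._capture = None

# Usage: feed the raw HTML of a page and read off what the robot would index.
scanner = PageScanner()
scanner.feed("<html><title>Example</title><h1>Hello</h1>"
             "<a href='/page2.html'>next</a></html>")
print(scanner.title, scanner.headings, scanner.links)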
The idea behind this is that the index built by the robot will make life easier for us humans who would like a quick hop to information sources on the Internet.
The good news is that most robots are successful at this and do help make subsequent search and retrieval of those documents more efficient. This is important in terms of Internet traffic. If a robot spends several hours looking for documents, but
thousands (or even millions) of users take advantage of the index that is generated, it will save all those users from tapping their own means of discovering the links, potentially saving great amounts of network bandwidth.
The bad news is that some robots inefficiently revisit the same site more than once, or they submit rapid-fire requests to the same site in such a frenzy that the server can't keep up. This is obviously a cause for concern for Webmasters. Robot authors
are as upset as the rest of the Internet community when they find out that a poorly behaved robot has been unleashed. But usually such problems are found only in a few poorly written robots.
Figure 20.1 shows a hypothetical case of a tree-traversal algorithm that a robot might use. In tree diagrams such as this, computer scientists refer to the circles as nodes (generically). In this case, the nodes represent HTML pages. Node #1 is where
the journey begins; it is considered the root of the tree. Upon inspecting the HTML code at node #1, the robot discovers a link to node #2. When it reaches a static document at node #3 (a leaf node), it backtracks first to node #2 and then to node #1, where it
continues to node #4, and so on.
Figure 20.1. A robot traversing the Web.
A problem occurs when node #7 contains an additional link back to node #3. Certainly, the dynamic nature of the Web does not preclude this. If the robot isn't smart, it will revisit node #3, placing an unnecessary burden upon that server.
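Here is a minimal sketch, again in Python, of the traversal just described, with a set of already-visited URLs so the robot skips node #3 instead of hitting it a second time. The fetch_links function is a stand-in for whatever the robot uses to download a page and pull out its <A HREF> targets (the PageScanner shown earlier would do); none of this is taken from a real robot.

def crawl(root_url, fetch_links, max_pages=100):
    """Traverse the link tree depth-first, skipping URLs already seen."""
    visited = set()              # nodes the robot has already indexed
    pending = [root_url]         # nodes discovered but not yet visited
    index = []                   # the master index being built

    while pending and len(index) < max_pages:
        url = pending.pop()
        if url in visited:       # e.g., node #7 linking back to node #3
            continue             # skip it rather than burden that server again
        visited.add(url)
        index.append(url)
        for link in fetch_links(url):
            if link not in visited:
                pending.append(link)
    return index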
Fortunately, guidelines have been developed for robot authors, and most robots are compliant. For an excellent online resource about robots, on which much of this chapter is based, see "World Wide Web Robots, Wanderers, and Spiders" by Martijn Koster, at http://info.webcrawler.com/mak/projects/robots/robots.html. It contains links to documents describing robot guidelines, the
standard for robot exclusion, and an in-depth collection of information about known robots.
There are many active robots on the Web today. For a list of known robots at the time of printing, see Appendix F, "36 Internet Robots."
A good understanding of Web robots and how to use or exclude them will aid you in your Web ventures; in fact, it could help to keep your server alive.
There are lots of reasons to want to exclude robots from visiting your site. One reason is that rapid-fire requests from buggy robots could drag your server down. Or your site might contain data that you do not want to be indexed by outside sources.
Whatever the reason, there is an obvious need for a method for robot exclusion. Be aware that it wouldn't be helpful to the Internet if all robots were excluded.
Often, on Web-related newsgroups and list servers, you will see a new Web site administrator ask the question "What is ROBOTS.TXT and why are people looking for it?" This question often comes up after the administrator looks at his
or her Web access logs and notices a line similar to this:
Tue Jun 06 17:36:36 1995 204.252.2.5 192.100.81.115 GET /robots.txt HTTP/1.0
Knowing that they don't have a file named ROBOTS.TXT in the root directory, most administrators are puzzled.
The answer is that ROBOTS.TXT is part of the Standard for Robot Exclusion. The standard was
agreed to in June 1994 on the robots mailing list (robots-request@webcrawler.com) by the majority of robot authors and other people with an interest in robots.
Some of the things to take into account concerning the Standard for Robot Exclusion are:
In addition to using the exclusion described below, there are a few other simple steps you can follow if you discover an unwanted robot visiting your site:
The method used to exclude robots from a server is to create a file on the server that specifies an access policy for robots; this file is named /robots.txt.
The file must be accessible via HTTP on the local URL, with the contents as specified here. The format and semantics of the file are as follows:
Any empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record. The presence of an empty ROBOTS.TXT file has no explicit associated semantics; it will be treated as if it were not present. In other
words, all robots will consider themselves welcome to pillage, um, we mean, examine your site. Jokes aside, just remember that robots cannot damage your files. When a robot interacts with your Web server, the robot has no more capability than a typical Web
browser. The only potential harm from robots is that poorly written ones can keep your Web server busy. If you want to protect your site from damage, see the section below titled "Firewalls and Proxy Servers."
Here is a sample ROBOTS.TXT for http://www.yourco.com/ that specifies no robots should visit any URL starting with /yourco/cgi-bin/ or /tmp/:
User-agent: *
Disallow: /yourco/cgi-bin/
Disallow: /tmp/
Here is an example that indicates no robots should visit the current site:
User-agent: *
Disallow: /
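From the robot author's side, honoring these records is straightforward. The sketch below uses Python's standard urllib.robotparser module against the hypothetical www.yourco.com rules shown above; the module and its can_fetch method are real, but the host and agent names are only placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.yourco.com/robots.txt")   # hypothetical host from the example
rp.read()                                        # fetch and parse the exclusion file

# A well-behaved robot asks before retrieving each URL.
print(rp.can_fetch("MyRobot/1.0", "http://www.yourco.com/index.html"))        # allowed
print(rp.can_fetch("MyRobot/1.0", "http://www.yourco.com/yourco/cgi-bin/x"))  # disallowed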
Let's face it, the very thing that made the Internet grow so large and fast is what makes it such a dangerous place. There are entire books written about Internet security, and rightly so: Internet security is a complex topic. Clearly, this section
won't tell you everything you need to know about Internet security, but it will help you to understand many of the security implications of your decisions. Our motto is that it's always good to know what it is that you don't know.
Recently, there was an interesting thread (a sequence of several messages on the same topic) on a list server about Web site security. The story went like this: An Internet Service Provider (ISP) was updating its Web page that contained service prices.
When the ISP employee opened the document for editing, he/she noticed that all the service prices had been bumped up to outrageous levels. This is just one example of how your site can be compromised.
If you intend to maintain an Internet connection and you truly want a secure site, you will have to consider getting firewall protection. An Internet firewall gets its name from the fact that it helps you control the TCP/IP packets that travel between
your network (or server) and the rest of the Internet. Running a firewall enables you to regulate network traffic and discard packets which originate from undesirable Internet locations (based on TCP/IP packet type and/or previous logfile analysis). The
first thing to do if you want to add a firewall to your site is to change from a Dial-Up Networking connection to an Ethernet/router-based type of connection.
A firewall can be software, hardware, or a combination of the two. Commercial firewall packages cost a lot more than loose change, with prices ranging anywhere from $1,000 to $100,000.
We haven't heard of a software-only firewall for Windows 95, but when such products become available, they will likely be less expensive than the hardware versions. In the meantime, you might consider running a freeware version of UNIX for the purpose of
including a firewall in your network. For an excellent reference, you might consult Linux Unleashed, published by Sams.net.
A firewall usually includes several software tools. For example, it might include separate proxy servers for e-mail, FTP, Gopher, Telnet, Web, and WAIS. The firewall can also filter certain outbound ICMP (Internet Control Message Protocol) packets so
your server won't be capable of divulging network information.
Figure 20.2 shows a network diagram of a typical LAN connection to the Internet including a Web server and a firewall. Note that the Web server, LAN server, and firewall server could all be rolled into one machine if the budget is tight, but separating
them as we show here makes for a safer environment.
Figure 20.2. Using a firewall/proxy server on a LAN.
The proxy server is used to mask all of your LAN IP addresses on outbound packets so they look like they all originated at the proxy server itself. Each of the client machines on your LAN must use the proxy server whenever it connects to the Internet
for FTP, Telnet, Gopher, or the Web. The reason for doing this is to prevent outside detection of the structure of your network. Otherwise, hackers monitoring your outbound traffic would eventually be able to determine your individual IP addresses and then
use IP spoofing to feed those back to your server when they want to appear as a known client.
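As a small illustration of the client side of this arrangement, the Python sketch below routes an ordinary Web request through a proxy, so the remote server only ever sees the proxy's address. The proxy address 10.0.0.1:8080 is purely hypothetical.

import urllib.request

# Hypothetical internal proxy; only its address is visible to the outside world.
proxy = urllib.request.ProxyHandler({"http": "http://10.0.0.1:8080"})
opener = urllib.request.build_opener(proxy)

# Every request made through this opener leaves the LAN via the proxy server.
with opener.open("http://www.yourco.com/") as response:
    print(response.status)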
Another purpose of a firewall is to perform IP filtering of incoming packets. Let's say that you have been monitoring the log files on your Web server and you keep noticing some unusual or unwanted activity originating from 193.3.5.9. After
checking with a whois program (such as the GUI version included with this book), you determine that the domain name is bad.com, and you don't have any business with them. You can configure the IP filter to block any connection attempts originating from
bad.com while still allowing packets from the friendly good.com to proceed.
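Conceptually, this kind of filter is just a list of deny rules checked against each incoming packet's source address. The Python sketch below is our own illustration of that logic, using the 193.3.5.9 address from the example; a real firewall or router does the same thing at the packet level, in its own configuration language.

import ipaddress

# Deny rules are checked first; anything not denied is allowed by default.
DENY = [ipaddress.ip_network("193.3.5.9/32")]    # the troublesome host at bad.com

def allow_packet(source_ip):
    """Return True if a packet from source_ip should be let through."""
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in net for net in DENY)

print(allow_packet("193.3.5.9"))    # False -- blocked
print(allow_packet("193.3.5.10"))   # True  -- allowed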
Now let's return to the previous point about the need to switch your Internet connection method to an Ethernet and router combination, as opposed to a Dial-Up Networking modem connection. Following are a few possible firewall configurations.
Figure 20.3 depicts one of the simplest firewall configurations. The router that resides between your server and the Internet is the key element. Of course, adding a router also means adding a CSU/DSU and takes you out of the realm of a simple modem
connection.
Figure 20.3. Server and router.
This economical configuration uses a router that offers IP packet filtering. This is hardware-only firewall protection.
This scenario is only slightly more involved. Figure 20.4 shows a software firewall (perhaps just a proxy Web server) being added to the Web site.
Figure 20.4. The server with a built-in firewall.
In this case, if the router offered no IP packet filtering, you would have software-only firewall protection. At the time of this writing, we had found no software-only solutions that ran natively under Win95.
The configuration shown in Figure 20.5 is rather advanced. The addition of a separate firewall server to your network is one of the best ways to guard your resources from intruders.
Figure 20.5. Using a Firewall Server to protect the rest of the LAN.
In this case, we added a separate computer running firewall/IP filtering software. This is the hardware/software solution described earlier.
These diagrams do not show the connections beyond the server. If your situation involves a network, the server would be a multihomed host with two NICs acting as a bridge or Internet gateway.
As you can see, there are many benefits to firewalls. The choice of whether or not to implement a firewall is a complex decision that will involve factors such as these:
If you decide you need to implement a firewall but don't have the cash to fork out for a $30K or $40K hardware solution, you might want to take a look at the FWTK (Firewall Toolkit). FWTK is a freeware collection of firewall-related utilities from
TIS (Trusted Information Systems). This toolkit is distributed in source code format. It's written for UNIX, so you will need to run it on a UNIX box or spend the time or money to convert the code to Windows. There are several freeware versions of UNIX that
run on the PC, so this solution can be a very inexpensive one. The drawback to this method is that it requires a great deal of time, effort, and UNIX knowledge to implement. To obtain the FWTK and read lots of other information on commercial firewalls,
see http://www.tis.com.
As if you didn't already have enough trouble, the risk of a virus deleting or scrambling your files is very serious.
One way to curtail the risk is to avoid downloading programs from the Internet. You must realize that you can be bitten the instant you run any kind of executable image. This includes programs, DLLs, command files, or even autostart macros in commercial
applications.
Of course, life on the Net isn't too practical without access to all the cool stuff that keeps being invented every day. So the next level of protection is to test new software on a cheap stand-alone machine before giving it access to your precious hard
drive.
Virus detection programs (or virus scanners) should be used to verify that a new program appears legitimate. Virus scanners analyze your software looking for several types of red flags that would indicate danger. They are able to detect and warn you of
hundreds of known viruses. Yet another problem is that new viruses are being written by scum programmers every day. Fortunately, most virus scanners have an answer for that, too. By keeping a checksum of each file on your disk from the last scan and then monitoring those checksums for changes, they can detect the effects of unknown viruses.
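The checksum idea is easy to see in miniature. The Python sketch below hashes every file under a directory and compares the result with a snapshot saved during the previous scan; a file whose checksum has changed unexpectedly is exactly the kind of red flag a scanner reports. The directory and snapshot file names are placeholders.

import hashlib
import json
import os

def snapshot(root):
    """Map each file under root to its SHA-256 checksum."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                sums[path] = hashlib.sha256(f.read()).hexdigest()
    return sums

BASELINE = "baseline.json"      # placeholder: where the last scan's checksums live
WATCHED = "C:/webroot"          # placeholder: directory to monitor

current = snapshot(WATCHED)
if os.path.exists(BASELINE):
    with open(BASELINE) as f:
        baseline = json.load(f)
    # Files seen before whose contents no longer match their old checksum.
    changed = [path for path, digest in current.items()
               if path in baseline and baseline[path] != digest]
    print("Changed files:", changed)
else:
    with open(BASELINE, "w") as f:
        json.dump(current, f)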
At the time of this writing, we know of four virus scanners for Windows 95. We have tested three of these: McAfee Scan for Windows 95, ThunderBYTE Anti-Virus 95, and Doctor Anti-Virus 95. All of these applications worked well, and each had its own unique features. Of the three, Doctor Anti-Virus 95 seemed to have the best integration with Windows 95. If money is not an issue, we recommend running several anti-virus scanners on your system. You can never be too careful.
When people think about security on the Internet, they automatically think about firewalls. But there is a lot more to security than just firewalls. You will keep away most general pranksters by setting up your site with security in mind. Here are some
miscellaneous pointers for running a secure Web site:
This concludes the section on Expanding Your Internet Server. Now we are ready to tackle the topic of Web programming. In the next part of the book, we get into several programming languages, each with its own place in the Webmaster's toolbox. Whether your preference is Perl, C++, Visual Basic, or Java, we have a little something for everyone.