6
- Database and Data Repository Access Methods

-6-

Database and Data Repository Access Methods

Up to this point, this book has covered information about the technologies that laid the foundation for Web database development. This chapter and the following chapters build on that knowledge by introducing you to the components that provide the mechanism for integrating the Web with existing and new database platforms.

This chapter looks at the components that make up the database and database-access methods used in Web database development. Specifically, the focus is on databases, database-access methods, APIs, and publicly available search engines.

Before you get started, take a step back and look at the broader picture. As World Wide Web usage continues to grow at an astronomical rate, so does the desire to access existing data repositories. Many organizations have yet to realize the potential for using the WWW as an advertising vehicle, a means for providing information, and a communications medium. The drive to be competitive in the market will force businesses to look seriously at the Web and its offerings as solutions to existing needs, though.

The hype surrounding the Web is partly due to the fact that users can access data and information from anywhere in the world. Until recently, this data was almost all static in nature. Now, the capability to generate customized, dynamic data is becoming more commonplace and expected by users. Today, there is a greater need to provide Web access to the vast amounts of information in databases and data repositories. This need drives the ever-expanding realm of Web database access products. Not long ago, no built-in Web capabilities existed among the major database vendors. At that time, Web database developers had to use gateway programs with CGI scripts to integrate their data into Web documents. Today, each of the major database vendors has a new suite of Internet-access products available to bridge the Web database gap.

By looking at the whole picture, you can begin to see where the Web and database environments are moving--toward providing a common environment that guarantees ease of use, interoperability, secure transactions, and information access.

Large Database Systems

The demand for online transaction processing (OLTP) applications has grown incrementally over the past few years. From order-entry to real-time decision support, more and more organizations are using large database systems as a solution to the need to maintain and manage large volumes of data. Large database systems can be defined in many ways. This book refers to large database systems generally as being relational databases that are several gigabytes in size and operate in a multiuser environment.

The first thing that comes to mind when somebody mentions the term large database systems is a UNIX operating system with a relational database such as Oracle, Sybase, Informix, Ingres, and so on. This is often the case. Many relational databases also exist that operate in non-UNIX environments, however. With the introduction of Microsoft's Windows NT operating system, organizations now can bring large database systems into PC-based server environments. Additionally, many large database systems run under other operating systems such as VMS, AS400, and so on. The remainder of this book focuses primarily on large databases that operate in UNIX and desktop database systems operating under the Windows-NT environment: Oracle7 Universal Server, Sybase SQL Server 11, Microsoft SQL Server, Microsoft Access, and Borland Paradox 7.

Oracle7 Universal Server

Oracle Corporation's Oracle7 Universal Server provides a relational database environment that is both scaleable and reliable. Oracle7's scalability is demonstrated by its support for incrementally expanding networked database resources. Oracle7 Server has a built-in automatic and dynamic self-tuning capability, which can balance the processing workload evenly across your hardware and operating system resources.

Oracle7 Universal Server can be configured with up to six installed options: Context, Messaging, OLTP, Spatial Data, Video, and WebServer. This section briefly addresses some of the main features and options included in Oracle7 Universal Server. Although this is not an all-encompassing list of features, it does cover the features most large Oracle database installations use. The Context Option The Context option is a text indexing-and-retrieval system that enables normal text information to be leveraged as easily as normal structured database data. Context is very similar to more common text-indexing engines used on the Internet, and it supports the following features:

Automatic text reduction
Automatic classification
Soundex/Fuzzy matches
Proximity searches
Boolean logic
Multilingual stemming
Relevance ranking
Term weighting
Thesaurus support
Stop lists

The Context option ideally is suited to be used with the WebServer option for indexing large volumes of data, such as data managed on a Website. The Messaging Option The Messaging option is Oracle's groupware competing product built with the intent of leveraging the database technologies in most of the more common organizational tasks. These tasks include e-mail, document management, calendar planning, scheduling, and directory services. Parallel Architecture Oracle7's parallel architecture takes advantage of OLTP applications by distributing tasks across multiple processors or machines. For multinode configurations, the Oracle7 Parallel Server option provides a reliable, high-availability environment in which failure of any one node does not result in a loss of data.

The Oracle7 Parallel Query option provides a mechanism for scaling complex, ad hoc queries that run against large amounts of data. Parallel Query takes advantage of scalable architec- tures and parallel hardware configurations by dividing the query process into manageable subqueries and farming those subqueries out for processing. This allows multinode and multiprocessor environments to share the work load, which increases the speed at which the query can be executed.

Another use of Oracle's parallel architecture is for data and database recovery. Oracle7 can use multiple processes to recover the database or datafiles in parallel, speeding the online recovery. Rollback operations related to any uncommitted transactions are performed in parallel after startup, providing earlier system availability and improved performance. Query Optimizer Oracle7 Server has a built-in query optimizer for locating transactional data quickly and efficiently. This optimizer has two configurations: cost-based and rule-based. In the cost-based configuration, the optimizer dynamically selects the fastest available access path and satisfies query requests directly from indexes whenever possible. This process is accomplished by using fast, full-table scans, B-tree single-column and concatenated-index scans, clustered (prescanned) tables, and unique row identifiers. In the rule-based configuration, the optimizer uses specific rules defined by the developer to access the data. These rules can be in the form of hints to the optimizer. Hints enable the developer to help the optimizer determine the fastest access method to use. A hint can tell the optimizer to use a specific index or to perform a full table scan of the data, for example. Indexing Schemes and Join Methods Because large data repositories require high-performance access methods to reduce the time required to access vast amounts of data, Oracle7 includes integrated indexing schemes and join methods to deliver query responses quickly. Through the use of bitmapped indexes, Oracle7 Server can deliver dramatic performance benefits in large data repository access applications. For ad hoc queries, Oracle7 Server can take advantage of hash joins to eliminate the need to perform sorts by using an in-memory hash table constructed at runtime. High-Availability Features Oracle7 Parallel Server uses a clustered computer environment to provide a mechanism for ensuring that your data is accessible in the event of a node failure. If a node in the cluster fails, users simply can resume processing by logging onto another operating node.

Oracle7 extends high availability by including an online backup facility, which allows administrators to perform backup activities while the database is running. This task can be accomplished without interrupting transaction processing. If a device fails, resulting in a loss of datafiles, the administrator can recover those lost datafiles to another device while continuing to process requests for data that has not been lost.

Another high-availability feature of Oracle7 is its symmetric replication. Symmetric replication can be used to replicate data from a primary system to one or more alternative systems. Each alternative site is fully accessible and can be used for query access (by using the Parallel Query option). Read-Only Tablespaces Oracle7 also supports read-only tablespaces, which save time by eliminating backup and recovery of static data. Organizations with static CD-ROM data archives can access that data as read-only tablespaces, thus reducing the size of their databases and associated database administration.

Sybase SQL Server 11

Sybase SQL Server 11 is much like Oracle7 Universal Server, because it is a client/server RDBMS built to meet the reliability and scalability demands of the most challenging business applications. SQL Server's support for disk mirroring and high-speed backup/restore minimizes the system impact of hardware failure. The Backup Server component of SQL Server assumes the task of handling all backup and restore functions, enabling the server to run ongoing applications.

SQL Server takes advantage of stored procedures and triggers to maintain integrity. When there is a chance for violation of data integrity, the Sybase trigger rolls back the transaction to preserve the integrity of your data. SQL Server meets C2 specifications for level of trust as defined by the National Computer Security Center.

This section briefly addresses some of the main features included in Sybase SQL Server 11. As noted in the "Oracle7 Universal Server" section, this is not an all-encompassing list of features; it is a list of the features that most large database installations use. Web Access Sybase's new SQL Server Professional for Windows NT is now Web-enabled with NetImpact Dynamo, which supports a JavaScript superset scripting language. NetImpact not only allows for Web-based database access, but it also provides the tools to create, manage, and maintain dynamic Web documents. NetImpact uses templates to access the database information. These templates are merely HTML documents with embedded SQL statements and/or embedded scripts. NetImpact Dynamo acts as a gateway between the Web server and the database engine by retrieving the requested HTML page; processing the embedded SQL statements against the specified database; and building the resulting HTML file, which the server then sends back to the Web browser. Scalability Sybase SQL Server scales easily from a small workgroup (usually fewer than 20 users) to a multiuser (thousands of users) solution. Because a SQL Server implementation is linear (the system load is determined by the number of users and the applications in use), organizations can better plan for changing hardware needs.

SQL Server is a multithreaded environment that implements a highly efficient thread manager to manage server sessions. This thread manager runs on top of the native operating system to deliver the highest performance possible. Because each SQL Server user connection requires only 48KB of system memory, system resources such as disk caching and application memory requirements can be extended to achieve maximum throughput. Adaptable Workload Architecture Sybase SQL Server efficiently manages both OLTP and decision-support applications through the use of multiple buffer caches, tunable block I/O sizes, and multiple levels of configuration tuning. Multiple buffer caches allow for the fine-tuning and optimization of multiple applications at the same time. Tunable block I/O sizes support bulk loading. Additionally, three levels of configuration tuning permit a high degree of system and application optimization. Backup and Recovery Support The Sybase SQL Server Backup Server component provides a mechanism for archiving large databases, while having little impact on overall system performance. Backup Server is optimized to run in environments as large as hundreds of gigabytes and can maintain backup speeds in excess of 50GB per hour.

Sybase SQL Server removes any potential single point of media failure by supporting database mirroring. Database mirroring supports the mirroring of database logs and the database itself. Through the use of Backup Server and database mirroring, Sybase databases of 100GB and larger can remain available and easy to manage. Performance Features SQL Server takes advantage of many performance-enhancing features to provide optimum throughput. Backup Server can off-load the dump from the server while supporting online backup, for example. This results in negligible performance impact on running applications. High-speed dump-and-load features enable you to rapidly back up very large databases. Through data partitioning, parallel loads and multiple concurrent inserts can occur simultaneously. Clustered indexes help speed up the update and retrieval of data rows almost as much in a large database as in a small database. Large databases (in excess of 2GB) are addressed in memory, resulting in extremely high performance returns. Distributed Database Control Through the programmatic use of two-phase commit (2PC), SQL Server ensures the data integrity of distributed applications. Programmatic 2PC provides a more detailed level of control over the transaction by allowing developers to manage errors individually for each transaction instead of leaving error-handling control to the system defaults. Distributed Database Features Because stored procedures insulate developers from implementation details, database administrators are given complete autonomy in configuring database objects. Through integrity control, SQL Server can execute transactions even when data is distributed among several servers. Sybase Replication Server enables organizations to use replication features to maintain consistency across a distributed database. Database Management Database administrators (DBAs) can define thresholds in the transaction log to initiate automatic log dumps. This feature ensures that the database does not lock up when the transaction log is full. Additionally, DBAs can configure audit trails to inform them of authorized activities. Central control of remote sites reduces manpower needs and gives DBAs greater control over their network environment for tasks such as backing up centrally managed remote databases. A feature of Sybase SQL Server Manager is the capability to manage multiple remote servers from a single location.

Desktop Database Systems

Desktop database systems are broadly defined as database platforms that operate in a user desktop environment such as Microsoft Windows 3.1, Windows 95, or Windows NT. These database systems generally are developed to support platforms for single users and workgroups and include databases such as Microsoft Access, Microsoft SQL Server, Microsoft FoxPro, Borland Paradox, and so on. This section covers Microsoft Access and Borland Paradox 7 desktop database systems.

Microsoft Access

Microsoft Access is a desktop database solution that integrates well with the entire suite of Microsoft Office products. Microsoft Access has many features geared toward the database developer. Some of these features include the use of wizards. Wizards are intelligent graphical tools that aid in the design of an application or process. This section examines a few of those wizards and discuss how they round out the Microsoft Access platform. Database Wizard The Database Wizard walks the developer through the process of building a new database graphically. To accomplish this, the Database Wizard uses a set of more than 20 standard template databases from which the developer can choose. The template set includes a database for asset tracking, a wine-list database, an order-entry database, and a video-collection database, to name just a few. Each of the standard databases has a default set of tables, which include standard indexing schemes to make the whole process as effortless as possible. Form and Report Wizards The Form and Report Wizards enable you to create your own forms and reports by selecting fields from any table in your database. Access even autoformats your form or report for you. You also can specify background pictures for your forms and reports. Import/Export Wizards You can use the Import/Export Wizards to help determine the types of data you are about to import or export. Additionally, these wizards help you select the correct field delimiters and data types in the source data. Access runs the Import Spreadsheet Wizard when you import or link text or spreadsheet data. Access runs the Text Export Wizard when you export data to an external file. You can import data from any of the following data sources:

Microsoft Access
Microsoft Excel spreadsheet
Text files
dBASE III, IV, V
Microsoft FoxPro, FoxPro 3.0
Other ODBC databases

Simple Query Wizard You can use the Simple Query Wizard to create single-table or multiple-table queries based on user input. You also have the option of specifying query criteria such as the order in which the data is returned. One nice feature of Access query definitions is that they can be used as a data source for an application interface widget (similar to the way in which views are used in large database systems). Table Analyzer Wizard You can use the Table Analyzer Wizard to normalize your new database (or an existing database). By default, whenever you import text or spreadsheet data using one of the Import Wizards, Access asks whether you want to run the Table Analyzer Wizard against the newly imported data. User-Level Security Wizard You can use the User-Level Security Wizard to generate a completely new, secure database. This wizard exports copies of all the objects from the original database, secures the object types by revoking all permissions of the Users group for those objects in the new database, and then encrypts the new database. After running the wizard, you can grant permissions to users and/or groups at your discretion.

Borland Paradox 7

The Borland Paradox 7 database is a desktop database system that integrates with client/server development environments to produce a scalable database product. With Paradox, you can connect to native high-performance SQL-linked databases such as Oracle, Sybase, Informix, and Microsoft SQL Server. Additionally, Paradox now has support for Microsoft Exchange/MAPI, which provides an e-mail interface for the distribution of your information.

The Paradox Project Viewer organizes tables, reports, forms, scripts, and queries graphically. You quickly can generate forms and reports with Paradox's Form and Report style sheets.

Through enhanced OLE support, you easily can share information with Microsoft applications that use OLE Automation. Additionally, database developers now can create applications that use OLE custom controls (OCXs), such as the spell checker, spreadsheet controls, and graphic controls included in Paradox 7.

Shareware Database Engines

This section looks at two shareware databases (Postgres95 and Mini SQL) available for downloading on the Internet. Both these products have built-in capabilities for Web server access. In the case of Postgres95, Perl library routines can be used to access the database engine from within a CGI script. Mini SQL includes a w3-mSQL gateway product that handles the Web interface for you (w3-mSQL also is implemented as a CGI). Because functionality and capability vary between the products, this section does not go into too many specifics on either product.

Postgres95

Postgres95 is a relational database that implements an extended subset of ANSI SQL and maintains all the features of its original core product, Postgres 4.2. Postgres95 builds on the original research goals of the Postgres project; it can handle complex objects and rules, and it is highly extensible. Some of Postgres95's standard relational DBMS features follow:

Concurrency control
Declarative SQL queries
Multiuser support
Object-oriented inheritance
Query optimization
Transaction support
User-defined operators, types, functions, and access methods

Postgres95 uses three separate processes to manage its client/server architecture. The postmaster is a daemon process that manages the communications between front-end and back-end processes. It is responsible for managing the allocation of the shared buffer pool and performing additional start-up initialization procedures. The Postgres back-end database server process executes client requests (queries, inserts, updates, deletes, and so on). Finally, the front-end application process requests a connection to a Postgres back-end through the postmaster. The postmaster starts a new Postgres back-end process for each new front-end client connection.

NOTE: The Postgres back-end process must run on the server machine. The front-end process can run on another machine or the same machine as the Postgres back-end process, however.

Postgres95 includes API library routines for C, Tcl, Perl, and Python. The C-library API set (libpq) includes routines that permit queries to be passed to the Postgres back-end process for execution. The Tcl API set (libpgtcl) includes routines used to create a new Tcl interpreter with commands for building Tcl-based clients. Use of the Perl API set (pg95perl) requires Perl 4.036. The Python API set (PyGres95) encapsulates its APIs through a Python object-oriented interface.

Mini SQL

Mini SQL (or mSQL) is a lightweight database engine designed with fast access and low memory requirements in mind. Mini SQL supports only a subset of SQL; it does not include features such as views, subqueries, and so on.

The complete Mini SQL package includes a database engine, a monitor program, a database administration program, a schema viewer, and a C-language API. Both the API and the database engine are designed to work in a client/server environment over a TCP/IP network.

The Mini SQL engine (msqld) is implemented as a UNIX daemon process that listens on a well-known port for client connections. Although the msqld process accepts multiple client connections, it is still a single-threaded process that manages client requests in a serialized manner.

Debugging capabilities are built into the Mini SQL engine to facilitate the monitoring of the client requests. Debug support is available through eight separate modules and can be enabled at runtime by an environment variable (MINERVA_DEBUG). You can set multiple levels of debug by separating them with a colon:

C-shell syntax setenv MINERVA_DEBUG cache:key:mmap:query

Korn-shell syntax
MINERVA_DEBUG="cache:key:mmap:query";export MINERVA_DEBUG

Table 6.1 lists the eight available debug modules.

Table 6.1. Mini SQL Debug Modules.

Module Displays

cache Table-cache messages

error Error messages

general Messages not included in other modules

key Key-based, data-lookup messages

malloc Memory-allocation messages

mmap Memory-mapped region messages

query All queries before execution

trace A call trace of all functions

To access Mini SQL databases from the Web, you can use the mSQL gateway w3-mSQL. See the "Database Gateways" section later in this chapter for a more detailed look at w3-mSQL.

Commercially Available Database/Web Products

Commercial built-in Web database interface products only recently have been announced. Developers can use these products to enter SQL commands directly into HTML files for the Web server to process. Usually, the Web server must be a by-product of the database, because no HTML standard allows for the insertion of SQL commands in structured HTML code. Because of the increasing focus on Web database products, though, it is my opinion that developers soon will see HTML standards for supporting ANSI SQL access through HTML tags.

This section discusses a few of the database Web products available today and the ways in which they integrate with the Web applications.

Oracle WebServer 2.1

Oracle WebServer 2.1 is a scalable Web server platform that provides the capability to build secure and dynamic Websites. A high-performance, advanced security and native database connectivity platform is built into WebServer 2.1, which enables Web database developers to create powerful Web database applications using industry-standard development tools. WebServer 2.1 takes standard HTTP servers one step further by integrating information server and management capabilities into a single environment. With WebServer 2.1, there is no need to bridge the Web and database environments with database gateway products.

Additional features of Oracle WebServer 2.1 include a high-performance and extensible architecture, end-to-end security, secure commerce, authentication and encryption capabilities, and Java capabilities on the server. Oracle has scheduled a new release of Oracle WebServer 3.0 for sometime during the 1st quarter of 1997.

This section covers some of the key components of Oracle WebServer 2.1 and how you can best use them in a Web database developer environment. Web Server and Database Connectivity A high-performance system called the Web Request Broker is built into Oracle WebServer 2.1. Web Request Broker links Oracle's WebServer to applications, databases, and even other HTTP listeners. Web Request Broker replaces the need for CGI scripts to access the database server as well as the need for CGI scripts, which tend to be resource-intensive processes. Through Web Request Broker, a high-performance native Oracle7 Server connection is established, which enables WebServer 2.1 to deliver relational, text, audio, video, and spatial data through standard Web browsers. Security for Intranets and Internets WebServer 2.1 supports the Secure Socket Layer 2.0 security standard for encrypting data between the browser and WebServer process. WebServer 2.1 also supports basic and digest authentication to prevent unauthorized access to Web server documents. By using SSL, WebServer 2.1 can securely pass encrypted data through industry-leading firewall products.

Standard database security features are common to all Oracle7 products and are built into the WebServer 2.1 Server through Oracle's Secure Network Services. These features provide a comprehensive security environment for the corporate network. Secure Network Services incorporates leading encryption capabilities and advanced systems for authenticating users and ensuring data integrity. Open Web Development Architecture The Web Request Broker ensures the interoperability of applications written in different languages and residing on different Web servers by providing an open API for building server objects. Web Request Broker supports development in Java, PL/SQL, C, and C++. An integrated Java runtime environment is built into Oracle WebServer 2.1; this environment includes extensions for Oracle7 that effectively enable developers to implement dynamic Web applications using native Java classes on the server. PL/SQL Agent In the previous release of Oracle WebServer, PL/SQL requests were handled by a process called the Oracle Web Agent. This enabled data-driven Web applications to be implemented by using PL/SQL--Oracle's procedural development environment. In WebServer 2.1, a new process called the PL/SQL Agent is used for translating and dispatching client PL/SQL requests directly to the Oracle7 Server. Listing 6.1 shows an example of a PL/SQL agent request.

Listing 6.1. A PL/SQL agent request.

create procedure emp_query as
cursor emp_cursor is select * from emp order by empno;
begin
   htp.htitle(`Employee Update Form');
   htp.formOpen(`/cgi-bin/ora/get_emp_info');
   htp.print(`Select Employee');
   htp.formSelectOpen(`emp_name');
   for emp in emp_cursor loop
      htp.formSelectOption(to_char(emp.fullname));
   end loop;
   htp.formSelectClose;
   htp.formSubmit(`Query Database');
   htp.formClose;
end;

If it looks pretty intuitive, that's because it is. LiveHTML Interpreter WebServer 2.1 uses a LiveHTML Interpreter to enable developers to use Server-Side Includes, which are mechanisms that extend the capability of HTML pages by making callouts to applications and back-end servers. WebServer System Manager WebServer System Manager is a fully configurable tool for managing components of the Oracle WebServer 2.1 system. Web System Manager uses a collection of HTML forms and online, context-sensitive help facilities to quickly configure the Oracle WebServer.

Sybase Web.SQL

Web.SQL is a Sybase database interface product that creates transparent server-side extensions to HTML, allows both static and dynamic HTML to be mixed within a single HTML page, and allows for the creation of static templates that can be updated continuously with new content. Web.SQL supports inline SQL and Perl scripting calls and improves integration between Web servers and databases by decreasing the number of Web documents the Web server must manage.

Web.SQL includes support for native Netscape Server API (NSAPI) calls and offers scalability to support increasing Web traffic by maintaining open database connections to improve response time and lower system overhead.

RESOURCE: You can find additional information on Sybase's Web.SQL and other Web-based products at

http://www.sybase.com/

Data-Access Methods

You can integrate database access with the Web in numerous ways. The method you choose depends on several factors. One factor to consider is the database platform to which you want to provide access. Another factor is the type of access (database/Web server product, CGI scripting language, client APIs, database gateway, and so on) you want to use. Finally, you have to consider the cost of the access method.

Suppose that you want to integrate a Web server with your Oracle database. You might want to choose Oracle's WebServer solution. As mentioned earlier, this feature provides an all-in-one Web server database solution that takes advantage of the specific database-engine technologies to provide the fastest access to structured data. However, products such as these do not come free (unless you're talking about a Microsoft product such as IIS... just kidding, Bill!).

Major database vendors recognize the market for Web database integration and will continue to build on their core Internet technologies in the months to come. For users of large databases who have spent thousands of dollars on their database engine and supporting products, it might be more economical to choose such a solution. For organizations not willing to spend the big dollars on Web database-integration products, however, many gateway tools are available today. This section focuses on those types of access methods.

SQL and ODBC Access Products

Many products provide a Web server interface to open database connectivity (ODBC) compliant databases. This feature can be very advantageous, because ODBC drivers exist for all the major database vendors (Oracle, Sybase, Informix, Access, SQL Server, and so on).

Consider a situation in which a Web administrator has many database sources of information (for example, a Microsoft Access Products database, an Oracle Technical Support database, a Sybase Order-Entry database, and so on), but he wants to provide a WWW single-source database for access by a Web server. This easily could be implemented as a Microsoft Access WWW database that takes advantage of features such as external table imports (to import external data into a Microsoft Access table) and table linking (to access external database tables through ODBC) to provide a single-source view of the Web data.

This section looks at a few Web database interface products that provide SQL and ODBC database interfaces. Later chapters will give specific examples of using an ODBC-compliant database as a WWW, single-source database.

WebDBC 3.0

WebDBC 3.0 is a Web database interface application that enables Web servers to connect to existing SQL and ODBC-compliant databases. WebDBC has built-in support for Microsoft Internet Server API (ISAPI), Netscape Server API (NSAPI), and other Web server interface APIs.

You can invoke WebDBC by specifying a query URL in an HTML document. You can specify query URLs in one of three ways:

Using a hypertext URL link
Using a forms-based submitted URL
Using a Query tag in a result file

In the case of a forms-based submitted URL, the contents of the form input widgets (textboxes, checkboxes, radio buttons, pop-up menus, and so on) are sent by the Web server to WebDBC for processing. A result file describes the layout of the data returned to the user. Additionally, a result file can generate another form containing a query URL constructed from information extracted from a previous database query. This capability provides a mechanism for constructing more refined database query results.

After the user selects a hypertext link or fills out an associated HTML form, WebDBC queries the specified database and incorporates the result file layout with the database query results to generate the HTML that is sent to the user's browser for rendering. HTML error files can be configured to return a predefined set of error messages that describe an error condition.

RESOURCE: You can find more information and an evaluation copy of WebDBC for Windows NT/95 at

http://www.ndev.com/

WebBase 4.0

WebBase 4.0 is a Web database server that provides access to existing databases and can operate in a standalone mode or with any standard Web server. One feature of WebBase is its capability to allow users to search a database as easily as hypertext is used in a document. WebBase is compatible with many database formats, including these:

Microsoft SQL Server 4.2 and 6.0
Microsoft Access 1.x and 2.0
FoxPro database management system 2.0, 2.5, and 2.6
Microsoft Excel 3.0, 4.0, and 5.0
Btrieve
dBASE III, IV
Paradox 3.x, 4.0
Oracle6, Oracle7
Standard formatted text files

WebBase supports all features included in HTML Levels 1 and 2, as well as features included in the proposed HTML Level 3.2 specifications. WebBase permits SQL statements to be embedded anywhere in HTML documents and includes a full-featured macro language with standard control flow constructs such as if, case, forRow, forIndex, and so on. Additionally, WebBase includes many built-in functions for handling math, logic, comparisons, string manipulation, and dates. WebBase includes support for user-defined variables, session variables that hold state between pages, and system variables, such as %date%, %browserAddress%, %Netscape%, and so on.

RESOURCE: WebBase is available for Windows 3.1, Word for Windows 3.11 (with Win32s), Windows NT (Intel only), and Windows 95. You can download WebBase 4.0 and get additional information on the product from the product home page at

http://www.webbase.com/

Database Gateways

Database gateways are available for virtually all databases and have been ported to many operating system platforms. Many of these gateways use the interpreted language Perl. Perl is used for scanning arbitrary text files and extracting information from those text files.

RESOURCE: You can find additional information on Perl at

http://www.perl.com/perl/

This section describes a few of the database gateway interfaces used for accessing the majority of the larger database platforms. Specifically, it covers Sybperl, Oraperl, Mini SQL, WDV, and GSQL. Sybperl and Oraperl are both Perl interfaces to the respective database products. WDV and GSQL are implemented through definition files that identify form layouts for queries. mSQL is implemented as a CGI that pre-parses HTML input files to convert msql directives into Mini SQL queries. The Mini SQL directive syntax is covered in detail, because it's simple when compared to that of the other gateway products. In a following section, "Other Database Gateway Products," you'll find a listing of additional database gateway interfaces and the platforms for which they have been ported.

Sybperl

Sybperl is a version of Perl that uses the Perl usersubs feature to allow access to Sybase databases. A subset of the Sybase DB-Library APIs has been mapped to Sybperl. In most cases, functions that have been mapped to Sybperl routines are the same as Sybase DB-Library routines unless otherwise noted. Additionally, Sybperl implements a subset of the newer Sybase Client Library API. Client Library is included with Sybase's OpenClient libraries as of the release of System 10.

RESOURCE: Sybperl support is available through a list-server mailing list. To subscribe to that list, send an e-mail message to this address:

SYBPERL-L@trln.lib.unc.edu

Include in the body of the message the text string subscribe SYBPERL-L.

Oraperl

Oraperl is a Perl-language interface to Oracle databases. Oraperl is implemented through patches to the Perl 4.x release and includes the use of Oracle Call Interface (OCI) calls. Although these calls are written specifically for Oracle V6, notes are available on building the Oracle7 version of Oraperl.

RESOURCE: You can find these notes on building the Oracle7 version of Oraperl, as well as the Oraperl distribution, at

ftp://ftp.demon.co.uk/pub/perl/db/perl4/oraperl/

NOTE: The Oraperl community is not nearly as active as the Sybperl community. This is in part due to the fact that both the Sybperl and Oraperl user groups appear to be no longer supported by the parent database companies. This, as you can guess, is because both Oracle and Sybase now have their own Web database products available.

w3-mSQL

w3-mSQL is a World Wide Web database gateway used to access Mini SQL databases. w3-mSQL is implemented as a CGI script and provides full access to mSQL databases by HTML. w3-mSQL provides the capability to embed SQL queries within HTML documents so that data can be retrieved on-the-fly.

The w3-mSQL program is used as a CGI script to which w3-mSQL enhanced documents are passed as arguments. w3-mSQL interprets mSQL document directives in the following form to complete the processing of the document and return the HTML-formatted database data:

<! msql command args >

The msql directives include support for connecting to and disconnecting from Mini SQL databases, specifying query criteria, fetching query results, and formatting those results for display. Additional directives allow for the use of if-then-else constructs that can be used to determine how the resulting document will be formatted and displayed. Valid mSQL commands follow. <! msql connect [host]> The connect directive connects to an mSQL database engine that might reside anywhere on the network. A sample connect directive follows:

<! msql connect www.caffrey.com >

w3-mSQL supports only a single concurrent database connection from within a w3-mSQL HTML document. You can open a connection to another w3-mSQL database server from within the same HTML document, however, after you close out the previous connection. Additionally, you can access multiple databases from one database connection by using the database directive. <! msql close > The close directive closes the currently open mSQL database server connection.

TIP: A good practice is to close each database server connection within each HTML document immediately after completing all database transactions for that database server. This helps free up database resources for other processes.

<! msql database DBName > The database directive enables you to select the mSQL database used to access your HTML document data. A sample database directive follows:

<! msql database mydb >

TIP: To access data from multiple mSQL databases that are managed by the same mSQL server, you can issue the database directive as many times as needed to access the specific database in which your data resides.

<! msql query "sql query text" QueryHandle > The query directive submits a query to the database. The sql query text is submitted to the database, and returned data is stored in the QueryHandle for later processing. A sample query directive follows:

<! msql query "select last_name from employee" qh >

NOTE: The use of the term query in the query directive can be misleading. You actually can issue any SQL command, such as INSERT, UPDATE, DELETE, and so on. Note that, in cases other than SELECT, no data is returned in the QueryHandle.

After a query directive is processed by the w3-mSQL server, only the first row of data is returned to the query handle. As with a database cursor, the programmer must fetch the next and subsequent rows appropriately. w3-mSQL uses the fetch and seek directives to specify the current row data placed in the query handle.

When accessing multiple-column query-handle values, you use an @ for database variables, followed by the name of the query handle, followed by a period (.), followed by the numerical index of the desired field in the query handle row results. Suppose that you submitted this query directive:

<! msql query "select last_name,first_name, mi from employee" qh>

Your results would be stored as this:

@qh.0 would hold the last-name contents.
@qh.1 would hold the first-name contents.
@qh.2 would hold the middle-initial contents.

<! msql free QueryHandle> The free directive releases the QueryHandle and frees any data associated with the query. A sample free directive follows:

<! msql free qh >

<! msql print "format" > The print directive generates printed output of variable contents. Variables can be from the current row of a query handle, an environment variable, or from data passed to the document by a GET or POST request method.

The "format" string specifies the format (similar to a C printf()) of the string that is to be printed. Using the query directive as input for the variables, a sample print directive follows:

< ! msql print "Good morning @qh.1 @qh.2 @qh0, your path is $PATH" >

NOTE: When searching for variable-content holders, the w3-mSQL server first looks in internal w3-mSQL symbol tables and then looks for environment variables. This means that if both w3-mSQL symbols and environment variables have the same name, the w3-mSQL symbol value is used instead of the environment variable.

<! msql print_rows QueryHandle "format" > The print_rows directive prints the entire contents of the query handle from the current cursor location to the end of the query handle. Using this directive speeds up HTML generation when used with tables, lists, and selection menus.

Take a look at a few examples. Don't worry about the HTML syntax just yet; Chapter 7 will cover that. For now, just focus on the use of w3-mSQL query results and how they are referenced.

You could create an HTML table with the following w3-mSQL syntax:

<! msql query "select title, author, isbn from publications" qh>
<TABLE>
<TR><TH>Title</TH><TH>Author</TH><TH>ISBN Number</TH></TR>
<! msql print_rows qh "<TR><TD>@qh.0</TD><TD>@qh.1</TD><TD>@qh.2</TD></TR>"
</TABLE>

<! msql if ( condition ) > <! msql else > <! msql fi > w3-mSQL incorporates logic for an if-then-else construct. This is very useful for cases in which you do not want to parse the entire document. Take a look at a sample if-then-else w3-mSQL construct:

< ! msql if($author == "Swank") >
<P>Books by Mark Swank</P>
< ! msql else >
<P>Other Books</P>
<! Msql fi >

w3-mSQL uses standard C constructs such as ==, ||, !=, >, <, and so on. <! msql fetch QueryHandle> The fetch directive fetches the next row of data from the query handle. A sample fetch directive follows:

<! msql fetch q1 >

<! msql seek QueryHandle Position> The seek directive moves the position of the current data cursor in the specified query handle to the specified position.

NOTE: When a non-positive position number is specified, w3-mSQL replaces the value with a 0 and sets the cursor position to the first record in the query handle. When a position beyond the last record is specified, the cursor position is set to the last record in the query handle.

The following sample seek directive sets the current cursor record to record #33:

<! Msql seek qh 33 >

WDB

WDB is a WWW database gateway written in Perl and used as a CGI script for accessing SQL databases by a Web server. The current release of WDB supports interfaces to Sybase, Informix, and mSQL SQL databases. WDB has been ported to work with NCSA's HTTP server.

WDB uses a set of high-level, forms-definition files that describe a view of the database. Working template forms-definition files can be generated automatically based on database table structure information.

WDB is different from other Web database gateways, because it automatically creates HTML forms on-the-fly, based on query constraints defined by the end user. Another nice feature of WDB is its capability to convert data from the database into hypertext links, making it possible to access any database element directly through a WWW URL.

GSQL

GSQL is a WWW database gateway interface that provides an NCSA Mosaic forms interface to SQL databases. GSQL consists of two programs: gsql and sqlmain. gsql is a forms-creation program. sqlmain is the back-end process, which handles the processing of database queries. GSQL is a C program that is executed with a shell script by the HTTP server.

GSQL generates forms based on user input and commands found in SQL-specification procedure (proc) files. The proc files are used to map resulting fields of the SQL string execution to widgets (fields, buttons, pull-down menus, and so on) that are used for user input or selection. When the form is submitted, GSQL assembles the user input into a SQL query and forwards that query to the back-end database process (sqlmain) for processing.

Other Database Gateway Products

Many other database gateway products are available. Which one you choose to use certainly depends on the database you want to access. Table 6.2 lists a few database gateway products you might want to download and evaluate.

Table 6.2. Database Gateway Products.

Product Support URL

WebGenera Sybase ftp://ftp.gdb.org/pub/letovsky/genera/genera.tar.Z

Oraplex Oracle http://moulon.inra.fr:80/oracle/wwworaperl_eng.html

dbWeb ODBC http://www.microsoft.com/intdev/dbweb/

SoftwareEngine Sybase http://www.engine.com/dbdoc/inst-get.html

Sapphire Sybase http://www.bluestone.com/

Oracle

Informix

WebDatablade ODBC http://www.informix.com/

A-Xorion ODBC http://www.clark.net/infouser/endidc.htm

Client-Side APIs

Developers can design and implement their own API to provide Web access to their databases. In most cases, this is not necessary, because many products already are available to handle this function. As many developers know, however, something "right out of the box" seldom provides all the functionality needed in a particular situation, so you might need to develop your own interface to a database.

Almost all database platforms provide an API set of routines to enable developers to compile client programs using any one of several programming languages and compilers. In most cases, client applications are written in languages such as C and compiled using the appropriate database interface precompiler. Oracle database developers, for example, can write C programs (.pc extension) that include ProC SQL commands. The program then is precompiled using a ProC compiler, which translates ProC commands into Oracle APIs for accessing the database. The ProC compiler then generates C-compilable (.c extension) programs. Finally, the ProC-compiled (.c extensions) are compiled into a functional application or library routine by using a standard C compiler.

Publicly Available Search Engines

Have you ever used a search engine? Your answer probably is yes. Even if you didn't realize it, you probably have. Most Web servers today provide an input mechanism for users to search indexed documents.

Information is useless if it is not available for distribution. By using search engines, which provide an index into your document repositories, you greatly increase the availability of your information. Today, the problem is not finding information, because properly indexed information is easy to find. Instead, the task for data managers is managing the data and managing access to the data.

Before graphical Web browsers took the Web by storm, search engines provided a mechanism for searching text documents locally (and, in a few cases, globally). With the use of HTTP servers and graphical Web browsers came the need to integrate the indexing of HTML documents as well. In this context, Web search engines work quite well.

Search engines are available to serve many diverse applications. The most common use of search engines is to provide a manageable indexing capability for WWW documents. Another, more complex, use of search engines is to provide an Internet searching engine and indexing capability. The Harvest system, which is covered later in the section "Harvest 1.4," is an example of an information-retrieval system.

This section discusses a few of the available search engines used on intranets and the Internet.

Wide-Area Information Services (freeWAIS-sf)

First, WAIS is a protocol and not an application or server. Many people confuse the two by thinking that when they execute a WAIS search, their request is being serviced by a search engine called WAIS. This is not necessarily the case; instead, they are using a WAIS-compliant server (such as freeWAIS or freeWAIS-sf) to search a WAIS-specific indexed database of information. This section focuses specifically on the freeWAIS-sf distribution.

The WAIS protocol is based on the ANSI standard Z39.50 version 1 and uses only a subset of features of the Z39.50 specification. Specifically, requests that require the server to preserve state are not used. The WAIS protocol extends the basic features of the Z39.50 specification, however, by adding elements for relevance feedback.

Relevancy is used to identify documents that are more applicable to the query criteria. If a user specifies a search based on three different text strings (for example, "WWW", "Database", and "Developer"), the WAIS server returns relevancy rankings (usually between 1-1000) for documents that contain the text strings. These rankings are based on relevancy criteria such as the number of times a text string occurs in a document, the occurrence of multiple specified text strings (that is, if you specify three text strings and all three exist in the document), and so on.

FreeWAIS-sf now supports proximity search operators. Proximity searches enable the user to query based on the nearness of specific text strings in relation to some given criteria. A user can further refine a request for documents, for example, by specifying a proximity search of documents that contain the text string "Bill" within one word of the text string "Gates". This type of search helps you avoid the "false hits" scenario you would experience in a non-proximity search when a document contains the two text strings "bill" and "gates" completely separated and with no relationship to each other.

NOTE: To activate proximity searches in your freeWAIS-sf distribution, you must specify them at compile time. Additionally, the standard string-search capability is not available when proximity searches are included.

The freeWAIS-sf documentation on proximity searches is very limited. A C-source file (ir/query_l.l) is included in the freeWAIS-sf source code, however, that is used to handle proximity searches. Feel free to look at the source code to identify the specification terms you can use with proximity searches. For those of you who do not want to hack into the code, you might consider not using proximity searches until further clarification of their usage is available.

RESOURCE: You can find information on downloading and configuring the freeWAIS-sf distribution at

http://ls6-www.informatik.uni-dortmund.de/freeWAIS-sf/README-sf

TIP: Support for freeWAIS-sf is available from the Usenet newsgroup comp.infosystems.wais. When posting articles to this group, make sure that you include the term freeWAIS-sf somewhere in the subject line and the specific version you are referring to somewhere in the body.

RESOURCE: For beta testers of freeWAIS-sf, a mailing list can be used to post and read art. To subscribe to this mailing list, send an e-mail message to

majordomo@wsct.wsc.com

Include the text string subscribe wais in the body. You can post messages to all subscribers of the mailing list by using this e-mail address:
wais@wsct.wsc.com

TIP: As is the case with most mailing lists, you can remove your e-mail address from the list by sending an e-mail to the same address to which you subscribed. Just use the body text unsubscribe <mailing-list name> instead of subscribe <mailing-list>. Some list servers use enter and leave rather than subscribe and unsubscribe.

RESOURCE: You also can link freeWAIS-sf with WWW clients, WWW servers, and Gopher servers. For more information on linking freeWAIS-sf to other servers, consult this URL:

http://ls6-www.informatik.uni-dortmund.de/freeWAIS-sf/fwsf_6.html

Glimpse 3.0

Glimpse (global implicit search) is an indexing and query system that enables you to search through files quickly and efficiently. You can use it to search small collections of documents as well as large data collections. Glimpse is the default search engine in Harvest.

Glimpse is a composition of several programs (glimpse, glimpseindex, agrep, glimpseserver, and so on) used to index documents as well as to search and manage glimpse indexes. To index an entire directory tree structure, for example, you would execute this command:

glimpseindex <directory>

The glimpseindex command then creates index files in the current working directory that will be used by the glimpse command for text searches.

Table 6.3 lists the command-line options that glimpseindex supports.

Table 6.3. Glimpseindex Command-line Options.

Option Function

-a Adds the given file(s) and/or directories to an existing index

-b Builds a medium-size index (20-30 percent of the size of all files), enabling a faster search

-B Uses a hash table that is four times bigger (256KB entries instead of 64KB) to speed up indexing

-d filenames Deletes the given file(s) from the index

-D filenames Deletes the given file(s) from the list of filenames but not from the index

-E Does not run a check on file types

-f Incremental indexing

-F glimpseindex receives the list of files to index from standard input

-H directory Puts or updates the index and all other .glimpse files in directory

-i Forces .glimpse_include to override .glimpse_exclude

-I Provides a list of files to be indexed, but does not actually do the indexing

-M x Uses xMB of memory for temporary tables

-n Indexes numbers as well as text

-o Builds a small index rather than a tiny one (7-9 percent of the sizes of all files), allowing a faster search

-s Supports structured queries

-S k k determines the size of the stop list

-w k k determines the number of unique new words that will cause a file to be added to the indexing list

-z Allows customizable filtering, using the file .glimpse_filters to perform the programs listed there for each match

CAUTION: When you specify the -M option, a large value of x may cause your system to run out of swap space.

To find all occurrences of the string texarkana in all files with cities somewhere in their name, for example, you would execute this glimpse command:

glimpse -F cities texarkana

To specify an AND search clause (one in which both text strings must occur in the search to return a value of true), use a semicolon to separate search strings, as in this example:

glimpse  `Military Intelligence;imagery'

Here, glimpse finds all lines that contain both Military Intelligence and imagery.

Glimpse supports three types of indexes: tiny, small, and medium. Tiny indexes usually are 2-3 percent of the size of all documents indexed. Small indexes are about 7-9 percent of the size of all documents indexed. Finally, medium-size indexes range from 20-30 percent of the size of all documents indexed. The larger the index, the faster the search. For most applications, though, the small index (glimpseindex -o) is sufficient.

Glimpse also can be executed by GlimpseHTTP (a Glimpse search engine to HTTP gateway). GlimpseHTTP provides tools that enable files to be searched through an HTTP interface.

Glimpse has been compiled and is known to work under the following operating systems:

AIX

FreeBSD

HP-mc68k

HPPA

Linux

OSF/1 DEC Alpha

SGI

Sparc Solaris

Sparc Sun OS

RESOURCE: Information and downloading information for Glimpse and GlimpseHTTP is available at

http://glimpse.cs.arizona.edu/ghttp/

Htgrep

Htgrep is a Perl script used as a CGI interface to query any documents accessible by a Web server. Htgrep accepts a user-supplied query (normally, by HTML forms), converts it to a Perl regular expression, and applies it to a list of files or records. Records that match the query then are used as input to generate a valid HTML file.

Htgrep queries can be Boolean keyword searches or regular expressions. Htgrep supports the use of bulleted and numbered HTML lists. Additionally, plain text can be searched, with automatic recognition of URLs.

RESOURCE: You can access a FAQ that provides information on Htgrep at this URL:

http://iamwww.unibe.ch/~scg/Src/Doc/htgrep.html
A tarred version of Htgrep and other Perl utilities is available at this URL:
ftp://iamftp.unibe.ch/pub/scg/www/PerlLib.tar.gz

SWISH

SWISH (simple Web indexing system for humans)--much like WAIS, Glimpse, and Htgrep--is used to index directories of files and to search the generated indexes. When indexing HTML files, SWISH can ignore data in tags and give higher relevance to information in header and title tags. Titles are extracted from HTML files and appear in the search results. SWISH can search an entire Website in a single pass, assuming that it's all within the same directory structure. When searching through HTML documents, SWISH can search for words that exist in HTML titles, comments, and emphasized tags.

Because SWISH indexes consist of only one file, they are easy to transport and maintain. Also, because SWISH is provided in full-source form, you can modify the code to support your own needs.

SWISH has been compiled and is known to work on these operating systems:

BSDI 1.1

IRIX 5.3/4

OSF/1 2.0

Solaris 2.4

SunOS 4.1.3

Harvest 1.4

Harvest is a compilation of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. Harvest can interoperate with WWW client/server applications such as HTTP, FTP, Gopher, and NetNews information resources. Harvest is built on a scalable architecture that, compared to other search engines, reduces network usage, server load, and disk space usage. Through this architecture, Harvest can reduce FTP, HTTP, and Gopher server load as well.

RESOURCE: You can find information on the Harvest Project at

http://harvest.cs.colorado.edu/

Harvest consists of several subsystems. The Gatherer subsystem retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files) and then summarizes those resources in various type-specific ways to generate structured indexing information. A gatherer can retrieve a technical report from an FTP archive, for example, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest brokers or other search services then can retrieve the indexing information from the gatherer to use in a searchable index available through a WWW interface.

The Broker subsystem retrieves indexing information from one or more gatherers, suppresses duplicate information, incrementally indexes the collected information, and provides a WWW query interface to it.

The Replicator subsystem efficiently replicates brokers around the Internet. The Harvest replicator distributes copies of a broker's database to replicas running throughout the Internet. Replication distributes the server load on a broker, improving query performance and availability, and the replicator attempts to minimize network traffic and server workload when propagating updates.

The Harvest Server Registry (HSR) is a distinguished broker that holds information about each Harvest gatherer, broker, cache, and replicator on the Internet.

Summary

This chapter covered a great deal of information on databases and database repository access methods. It began with a discussion of large and desktop database systems, such as Oracle Universal Server, Sybase System 11, and Microsoft Access. It also reviewed shareware databases such as mSQL and Postgres95.

The chapter then identified Web database gateway products available both commercially and as shareware. You learned about using Microsoft ODBC and database gateways to provide access to Web database documents.

The chapter continued with a look at publicly available search engines. You learned what search engines are and the different types of search engines. Finally, you learned about the ways in which search engines can provide access to your documents as well as other indexed documents on the Internet.

C-shell syntax	setenv MINERVA_DEBUG cache:key:mmap:query
Korn-shell syntax	MINERVA_DEBUG="cache:key:mmap:query";export MINERVA_DEBUG

*Module*	*Displays*
`cache`	Table-cache messages
`error`	Error messages
`general`	Messages not included in other modules
`key`	Key-based, data-lookup messages
`malloc`	Memory-allocation messages
`mmap`	Memory-mapped region messages
`query`	All queries before execution
`trace`	A call trace of all functions

*Product*	*Support*	*URL*
WebGenera	Sybase	`ftp://ftp.gdb.org/pub/letovsky/genera/genera.tar.Z`
Oraplex	Oracle	`http://moulon.inra.fr:80/oracle/wwworaperl_eng.html`
dbWeb	ODBC	`http://www.microsoft.com/intdev/dbweb/`
SoftwareEngine	Sybase	`http://www.engine.com/dbdoc/inst-get.html`
Sapphire	Sybase	`http://www.bluestone.com/`
	Oracle
	Informix
WebDatablade	ODBC	`http://www.informix.com/`
A-Xorion	ODBC	`http://www.clark.net/infouser/endidc.htm`

*Option*	*Function*
`-a`	Adds the given file(s) and/or directories to an existing index
`-b`	Builds a medium-size index (20-30 percent of the size of all files), enabling a faster search
`-B`	Uses a hash table that is four times bigger (256KB entries instead of 64KB) to speed up indexing
`-d` filenames	Deletes the given file(s) from the index
`-D` filenames	Deletes the given file(s) from the list of filenames but not from the index
`-E`	Does not run a check on file types
`-f`	Incremental indexing
`-F`	`glimpseindex` receives the list of files to index from standard input
`-H` directory	Puts or updates the index and all other `.glimpse` files in directory
`-i`	Forces `.glimpse_include` to override `.glimpse_exclude`
`-I`	Provides a list of files to be indexed, but does not actually do the indexing
`-M` x	Uses xMB of memory for temporary tables
`-n`	Indexes numbers as well as text
`-o`	Builds a small index rather than a tiny one (7-9 percent of the sizes of all files), allowing a faster search
`-s`	Supports structured queries
`-S` k	k determines the size of the stop list
`-w` k	k determines the number of unique new words that will cause a file to be added to the indexing list
`-z`	Allows customizable filtering, using the file .glimpse_filters to perform the programs listed there for each match