Open Source Application Primer   
Writer and workshop leader Eric Lease Morgan provides a tour of some of the most popular open source applications. A thorough, rather technical introduction to the nuts and bolts of open source computing.
© 2003 Eric Lease Morgan

Rights notice: This work may be copied and redistributed under the following conditions: you must give the original author (Eric Lease Morgan) credit, and you may not use it for commercial purposes.

Introduction

Below is a list of open source software, some of it especially useful in libraries and some of it useful in general. This list is intended to be selective, not comprehensive. It is representative of the types of open source software available and the most used tools.

A more comprehensive list of open source software especially designed for libraries can be found at OSS4Lib (http://www.oss4lib.org/). There you will also find the archives of the OSS4Lib mailing list, a low-traffic but ongoing discussion surrounding the issues of open source software in libraries. For an even more comprehensive list of software, check out SourceForge (http://sourceforge.net/). There you will find just about any type of open source software you desire.

Apache - http://httpd.apache.org/

Apache is the most popular Web (HTTP) server on the Internet and a standard piece of open source software. Its name doesn't really have anything to do with American Indians. Instead, its name comes from the way it is built. It is "a patchy" server, meaning that it is made up of many modular parts to create a coherent whole. This design philosophy has made the application very extensible. For example, there are the core modules that make up the server's ability to listen for connections, retrieve files, and return them to the requesting client (the "user agent" in HTTP parlance). There are other modules dealing with logging transactions and CGI (common gateway interface) scripting. Other modules allow you to rewrite incoming requests, manage email, implement the little-used HTTP PUT method, write other modules in Perl, or transform XML files using XSLT. Apache is currently at version 2.0, but for some reason many people are still using the 1.3 series. I don't really know why. I have not upgraded my Apache servers to version 2.0 because I do not want to lose the functionality of AxKit, an XML transformation engine. Apache is a part of LAMP (Linux Apache MySQL Perl/PHP), a term used to denote the core open source applications dealing with stuff Web.

CVS - http://ximbiot.com/cvs/wiki/index.php?title=Main_Page

CVS is an acronym for Concurrent Versions System. It is the way open source software is shared by developers. It consists of a client and a server application. The server is set up and points to a directory where one or more projects are saved. Usernames and passwords are created, and the server sits and waits for connections. For the most part, the CVS client is command-line driven. On the command line you specify the location of a CVS server, the protocol you are going to use to connect to the server, and your username/password. Once logged in, you give CVS various commands used to download remote projects. You then spend your time hacking away at the source code. When you think you have created the latest and greatest hack, you issue the CVS diff command to create a diff file. This file lists the changes you made to the original source. By sending this diff file to the project's maintainer, your hack can be incorporated into the next release. Alternatively, you might be granted write access to the remote project, in which case you issue a CVS commit command, and your hacks are automatically incorporated. If you are going to do any open source software development, then you must get acquainted with CVS. Luckily, it comes pre-installed with many Unix variants, but it is just as easily compiled.
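To make the idea of a diff file concrete, here is a minimal sketch in Python. It does not talk to a CVS server at all. It only uses the standard difflib module to produce the same kind of change listing that a diff command creates, in the "unified" format (one of several diff formats). The file name hello.pl and its contents are made up for illustration.

```python
import difflib

# Hypothetical "before" and "after" versions of a source file.
original = [
    "#!/usr/bin/perl\n",
    "print \"Hello, world\\n\";\n",
    "exit;\n",
]
modified = [
    "#!/usr/bin/perl\n",
    "print \"Hello, library world\\n\";\n",
    "exit;\n",
]


def make_diff(old_lines, new_lines, name="hello.pl"):
    """Return a unified diff (a list of lines) turning old_lines into new_lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=name + " (original)",
        tofile=name + " (hacked)",
    ))


diff = make_diff(original, modified)

# Lines starting with "-" were removed and lines starting with "+" were added.
print("".join(diff), end="")
```

A maintainer who receives such a listing can apply it to the original source with a tool like patch, which is why a diff file is enough to communicate a hack.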

DocBook stylesheets - http://sourceforge.net/projects/docbook/

Given a set of XML/DocBook files, the DocBook stylesheets, and an XSL processor such as xsltproc or FOP, you can transform your DocBook files into PDF documents, HTML documents, XHTML documents, or a few other file types. When you download the stylesheets, be sure to download the XSL sheets and not other types. You would need other processors to use the other types. The stylesheets are configurable by setting a number of parameters. Through this means you can specify a cascading stylesheet to be incorporated into your XHTML/HTML files. The stylesheets are thorough but do not allow you to change very much of the resulting output. If you don't like the way the stylesheets format your XML, you can always write your own stylesheets, but I'm willing to bet you have better things to do with your time. As a person who is interested in open source software, learning how to write DocBook files is a skill that will come in handy in the future.

FOP - http://xml.apache.org/fop/

FOP is an implementation of the Formatting Objects standards for transforming XML documents into documents intended for printing. It is mentioned here, not because it is a primary open source software application, but because it is a Java application and represents a nice way to create PDF documents. For example, given a Java virtual machine, a DocBook file, the DocBook stylesheets, and FOP, you can create PDF versions of your DocBook documents. I have only had success with version 0.20.3, but it has proven indispensable a number of times. Writing FO stylesheets is not easy, and that is why I have relied on the DocBook FO stylesheets. Learning how to use FOP will give you good experience with Java as well as XML files.

GNU tools - http://www.gnu.org/directory/

The GNU family of tools is wide and varied. Probably the most important one is gcc, a C compiler. Ironically, you cannot compile the compiler unless you have a compiler. Crazy. Consequently, beginning the process of software development is a sort of chicken-and-egg problem. For example, while you might be able to download the gcc distribution, you will need gunzip and tar to uncompress the distribution, and you can't build gunzip or tar without the compiler. Not to worry: many operating systems now come with an "unzipper" and a "de-tarrer". Frequently, flavors of Unix (including Linux) come with a version of gcc pre-installed, allowing you to upgrade accordingly. Besides gcc, gunzip, and tar, the GNU directory lists a number of other very useful tools, not all of them GNU projects themselves, including Berkeley DB (database library), binutils (miscellaneous binary utilities, especially a linker and assembler), bison (alternative to yacc), curl (Internet user agent), emacs (text editor), fileutils (miscellaneous file utilities such as cp, mv, and rm), less (alternative to more), make (a sort of scripting language used to build source files), OpenSSH and OpenSSL (implementations of secure socket transactions), patch (applies diff files to source files), procmail (mail filter), sendmail (mail transfer agent), and wget (Internet user agent). By the way, an interesting discussion can be had by comparing the philosophy of "open source software" and GNU software.

Hypermail - http://www.hypermail.org/

Hypermail converts email messages into sets of HTML files browsable by author, subject, date, thread, and attachment for the purpose of creating a mailing list archive. As I alluded to earlier, open source software is about communities. Email mailing lists are one of the primary, if not the primary, communication channels in the open source software world. As you develop open source software and manage a mailing list to keep everybody up-to-date, don't let those valuable pieces of information go to Big Byte Heaven. Capture those "Perls" of wisdom by maintaining a mailing list archive with Hypermail. Hypermail is a C program driven by a number of configuration files and/or command line switches. Pass Hypermail raw email messages (Unix mbox files) and it will create sets of browsable HTML files. The look, feel, and some functionality of the archives can be changed through templates and the configuration files. The only thing Hypermail does not support is searching the resulting archive. For that functionality you need an indexer, preferably an indexer that can index mbox files, but you usually end up using an indexer that can index HTML files.
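The mbox-in, HTML-out idea can be sketched in a few lines of Python. This is not Hypermail and does not use its configuration files or templates. It is only an illustration, using the standard mailbox module, of reading a Unix mbox file and producing a page browsable by author. The two messages and their example.org addresses are invented.

```python
import mailbox
import os
import tempfile
from collections import defaultdict
from html import escape

# A tiny, made-up mbox file holding two messages.
SAMPLE_MBOX = (
    "From alice@example.org Mon Jan  6 10:00:00 2003\n"
    "From: alice@example.org\n"
    "Subject: Indexing question\n"
    "\n"
    "How do I index mbox files?\n"
    "\n"
    "From bob@example.org Mon Jan  6 11:00:00 2003\n"
    "From: bob@example.org\n"
    "Subject: Re: Indexing question\n"
    "\n"
    "Try an HTML indexer on the archive.\n"
)


def write_sample(text):
    """Write the sample mbox to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".mbox", delete=False)
    with handle:
        handle.write(text)
    return handle.name


def index_by_author(mbox_path):
    """Group message subjects by the value of the From header."""
    by_author = defaultdict(list)
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            by_author[message["From"]].append(message["Subject"])
    finally:
        box.close()
    return dict(by_author)


def render_html(by_author):
    """Render a very plain author index as one HTML page."""
    sections = []
    for author in sorted(by_author):
        items = "".join("<li>%s</li>" % escape(s) for s in by_author[author])
        sections.append("<h2>%s</h2><ul>%s</ul>" % (escape(author), items))
    return "<html><body>%s</body></html>" % "".join(sections)


path = write_sample(SAMPLE_MBOX)
try:
    authors = index_by_author(path)
finally:
    os.remove(path)

page = render_html(authors)
print(page)
```

Grouping on the Subject or Date header instead of From would give the other browsable views in the same way.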

Koha - http://www.koha.org/

Koha is an integrated library system with a growing user community. Written in Perl and using MySQL as the underlying database, Koha makes it simple to create and manage a small integrated library system. Equipped with acquisitions, cataloging, circulation, and searching modules, Koha provides much of the functionality of traditional online catalogs. With the recent implementation of its Z39.50 interface, it is easy to enter ISBN numbers into the system, locate MARC records, and have those records added. The user and system interfaces are simple and unencumbered, but alas, not very customizable. For many libraries, the catalog is the centerpiece of the operation. Koha represents a major step in providing a catalog that is functional and usable for small libraries. As long as support continues, I expect Koha to become a more viable option for medium and possibly large library collections. The obstacle is not technology: the obstacle is time and effort.

MARC::Record - http://marcpm.sourceforge.net/

This Perl module is the Perl module to use when reading and writing MARC records. The module is very well supported on the Perl4Lib mailing list, and a testament to the module's abilities is its incorporation into things like Koha and Net::Z3950. If you are not familiar with object-oriented programming techniques in Perl, then MARC::Record might take a bit of getting used to. On the other hand, learning to use MARC::Record will not only improve your programming abilities, but it will also educate you on the intricacies of the MARC record data structure, a structure that was designed in an era of scarce disk space, non-relational databases, and little or no network connectivity.
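For a feel of those intricacies, here is a rough sketch in Python of the MARC transmission format: a 24-character leader, a directory of 12-character entries (tag, field length, starting offset), and then the fields themselves, separated by special control characters. This is not MARC::Record and shares none of its API. The function names and the sample record (including the control number demo0001) are invented for illustration, and the sketch assumes ASCII-only data so that character counts equal byte counts.

```python
FIELD_END = "\x1e"   # field terminator
RECORD_END = "\x1d"  # record terminator
SUBFIELD = "\x1f"    # subfield delimiter


def build_record(fields):
    """Assemble a MARC-style record from (tag, data) pairs (ASCII only)."""
    directory, body, offset = "", "", 0
    for tag, data in fields:
        data += FIELD_END
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += "%s%04d%05d" % (tag, len(data), offset)
        body += data
        offset += len(data)
    directory += FIELD_END
    base = 24 + len(directory)        # where the field data begins
    total = base + len(body) + 1      # +1 for the record terminator
    leader = "%05dnam  22%05d   4500" % (total, base)
    return leader + directory + body + RECORD_END


def parse_record(record):
    """Return {tag: [field data, ...]} by walking the directory."""
    base = int(record[12:17])
    directory = record[24:base - 1]   # drop the directory's terminator
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = record[base + start:base + start + length - 1]
        fields.setdefault(tag, []).append(data)
    return fields


def subfields(data):
    """Split a data field into (indicators, [(code, value), ...])."""
    parts = data.split(SUBFIELD)
    return parts[0], [(p[0], p[1:]) for p in parts[1:]]


record = build_record([
    ("001", "demo0001"),
    ("245", "10" + SUBFIELD + "aOpen source primer /"
            + SUBFIELD + "cEric Lease Morgan."),
])
parsed = parse_record(record)
print(parsed["001"])
print(subfields(parsed["245"][0]))
```

Every length and offset is stored as fixed-width digits inside the record itself, which is the kind of economy that made sense when disk space was scarce.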

MyLibrary - http://dewey.library.nd.edu/mylibrary/

MyLibrary is a user-driven, customizable interface to sets of library resources -- a portal. Technically, MyLibrary is a database-driven website application written in Perl. It requires a relational database application as a foundation, and it currently supports MySQL and PostgreSQL. MyLibrary grew out of a number of focus group interviews where people said they were suffering from information overload. To address this problem, MyLibrary takes three essential components of librarianship (resources, patrons, and librarians) and tries to create relationships between them through the use of common controlled vocabularies such as a list of subject terms. Like a library catalog, MyLibrary provides the means to create collections of resources and classify these resources with a controlled vocabulary. Unlike a library catalog, the system also allows librarians as well as patrons to be classified in this manner. By sharing a common set of controlled vocabulary terms, relationships between resources, patrons, and librarians can be made, thus addressing things like, "If you are like this, then these resources may be of interest", or "If you have this interest, then your librarian is...", or "These people have expressed an interest in it, therefore your patrons are...", or potentially even doing Amazon-like things such as "People like you also used...".
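The relationship-building idea can be sketched with nothing more than sets of subject terms. The following Python is not MyLibrary code and does not reflect its database schema. The names, the subject terms, and the related function are all hypothetical. It only shows how classifying resources, patrons, and librarians with the same vocabulary makes the connections fall out.

```python
# Hypothetical classifications, all drawn from one shared list of subject terms.
resources = {
    "Agricola": {"agriculture", "biology"},
    "MathSciNet": {"mathematics"},
    "PubMed": {"biology", "medicine"},
}
patrons = {
    "pat": {"biology"},
    "lee": {"mathematics", "medicine"},
}
librarians = {
    "Kim": {"agriculture", "biology"},
    "Sam": {"mathematics"},
}


def related(terms, classified):
    """Return the sorted names in `classified` sharing any term with `terms`."""
    return sorted(name for name, own in classified.items() if own & terms)


# "If you are like this, then these resources may be of interest"
print(related(patrons["pat"], resources))

# "If you have this interest, then your librarian is..."
print(related(patrons["pat"], librarians))

# "These people have expressed an interest in it, therefore your patrons are..."
print(related(resources["PubMed"], patrons))
```

In a real system the sets would be rows in relational tables, but the join is the same: two things are related when they share a term.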

MySQL - http://www.mysql.com/

MySQL is a relational database application, pure and simple. Billed as "The World's Most Popular Open Source Database", MySQL certainly has wide support in the Internet community. Many people (especially Oracle database administrators) think MySQL can't be very good because it is free. True, it does not have all the features of Oracle, nor does it require a specially trained person to keep it up and running. A part of the LAMP suite, MySQL compiles easily on a multitude of platforms. It comes as a pre-compiled binary for Windows. It has been used to manage millions of records and gigabytes of data. Fast and robust, it supports the majority of people's relational database needs. On the downside, it does not currently support triggers, transactions, or roll-backs. Nor does it have a graphical user interface. At the same time, a program called phpMyAdmin, a set of PHP scripts, can be used to manage, manipulate, and query MySQL databases through a Web browser window. If there were one technical skill I could teach the library profession, it would be the creation and maintenance of relational databases--and I would teach them how to use MySQL.
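As a taste of what creating and maintaining a relational database involves, here is a small sketch in Python. Note the assumption: it uses SQLite, through Python's built-in sqlite3 module, as a stand-in so the example runs without a server. It is not MySQL, and the SQL dialects differ in places. The table and column names, and the second author and title, are made up.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")

# Two related tables: each title points at one author by id.
conn.executescript("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE titles (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title     TEXT NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO authors (id, name) VALUES (?, ?)",
    [(1, "Morgan, Eric Lease"), (2, "Example, Author")],
)
conn.executemany(
    "INSERT INTO titles (author_id, title) VALUES (?, ?)",
    [(1, "Open Source Application Primer"), (2, "A Hypothetical Title")],
)

# A join brings the two tables back together for display.
rows = conn.execute("""
    SELECT a.name, t.title
    FROM titles AS t
    JOIN authors AS a ON a.id = t.author_id
    ORDER BY t.title
""").fetchall()

for name, title in rows:
    print("%s\t%s" % (name, title))

conn.close()
```

Storing each author once and pointing to it by id is the heart of the relational approach: a name gets corrected in one place and every title picks up the change.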

Perl - http://www.perl.com/

Perl is a programming language. Originally written to handle various systems administration tasks, Perl's strength lies in its ability to manipulate strings (text). Perl matured through the era of Gopher but really started becoming popular with the advent of CGI scripting. Perl has been ported to just about any computer operating system, has one of the largest numbers of support forums, and has been written about in more books than you can count. Perl can be compiled into Apache, making it possible to run Perl scripts as fast as C programs. It easily connects to database applications through a module called DBI. It can be run from the command line. It can listen and respond to networking connections. It can call many aspects of your computer's operating system. In short, Perl is mature and very robust. Other very good programming languages exist and can do much of what Perl can do. Examples include other "P" languages such as PHP and Python. These languages are becoming increasingly popular, especially PHP, but at the risk of starting a religious war, I advocate Perl because of its very large support base and its cross-platform functionality.

swish-e - http://www.swish-e.org/

Swish-e is an uncomplicated indexer/search engine. Once built, you feed the swish-e binary a configuration file and/or a set of command line switches to index content. This content can be individual files on a file system, files retrieved by crawling a website, or a stream of content from another application such as a database. The indexing half of swish-e is able to denote specifically marked-up text in XML and HTML as fields for searching later. The indexes created by swish-e are portable from file system to file system. The same binary that creates the indexes can be used to search the indexes. Swish-e supports relevance ranking, Boolean operations, right-hand truncation, field searching, and nested queries. Later versions of swish-e come with a C and Perl API allowing developers to create CGI interfaces to these indexes. Swish-e is an unsung hero. Its inherently open nature allows for the creation of some very smart search engines supporting things like spelling correction, thesaurus intervention, and "best bets" implementations. Of all the different types of information services librarians provide, access to indexes is one of the biggest ones. With swish-e librarians could create their own indexes and rely on commercial bibliographic indexers less and less.
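To show what an indexer does under the hood, here is a toy inverted index in Python supporting two of the features named above: Boolean operations and right-hand truncation. It is an illustration only. It is not how swish-e is implemented, it has no relevance ranking or field searching, and the file names, sample text, and function names are invented.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def lookup(index, term):
    """Find an exact term, or every word with a given prefix if it ends in '*'."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for word, names in index.items():
            if word.startswith(prefix):
                hits |= names
        return hits
    return set(index.get(term, set()))


def search_and(index, *terms):
    """Documents matching every term (Boolean AND)."""
    results = [lookup(index, t) for t in terms]
    return set.intersection(*results) if results else set()


def search_or(index, *terms):
    """Documents matching at least one term (Boolean OR)."""
    return set().union(*(lookup(index, t) for t in terms))


# Hypothetical documents to index.
documents = {
    "a.html": "Cataloging with MARC records",
    "b.html": "Indexing MARC and XML",
    "c.html": "Catalogs and indexes for librarians",
}
index = build_index(documents)

print(sorted(search_and(index, "marc", "catalog*")))
print(sorted(search_or(index, "xml", "librarians")))
```

The index is built once and searched many times, which is why the indexing and searching halves can live in the same binary yet run at different moments.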

xsltproc - http://xmlsoft.org/XSLT/

Xsltproc and its companion program, xmllint, are very useful applications for processing XML files with XSL. Both applications are built from a C library that is becoming increasingly popular for parsing and processing XML documents. By feeding xsltproc an XSL stylesheet and an XML data file, you can transform the XML data file into any one of a number of text files, whether they be SQL, (X)HTML, tab-delimited files, or even plain text files intended for printing. Xmllint is a syntax checker. Given an XML file, xmllint will check its validity against a DTD. By first installing the C library and mod_perl, you will be able to incorporate AxKit into your Apache HTTP server, allowing you to transform XML data on the fly and serve it accordingly. Swish-e makes use of the C library. It is easy to use the DocBook stylesheets with xsltproc to create XHTML versions of your DocBook files. With xsltproc and a plain ol' text editor, you can learn a whole lot about XML.
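The XML-in, text-out idea can be illustrated without XSL at all. The sketch below is a stand-in, not xsltproc or xmllint: Python's standard library has no XSLT processor and does not validate against a DTD, so the code only checks that a document is well-formed and then walks it procedurally to produce a tab-delimited file. The catalog markup and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML data file.
XML_DATA = """<catalog>
  <book><title>Title One</title><author>Author A</author></book>
  <book><title>Title Two</title><author>Author B</author></book>
</catalog>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML (well-formedness only, no DTD)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def to_tab_delimited(xml_text):
    """Turn each <book> into one line of title<TAB>author."""
    root = ET.fromstring(xml_text)
    lines = []
    for book in root.findall("book"):
        title = book.findtext("title", default="")
        author = book.findtext("author", default="")
        lines.append(title + "\t" + author)
    return "\n".join(lines)


print(is_well_formed(XML_DATA))
print(to_tab_delimited(XML_DATA))
```

An XSL stylesheet expresses the same mapping declaratively, as templates matched against elements, which is what lets one data file feed many different outputs.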

YAZ and Zebra - http://www.indexdata.dk/yaz/ and http://www.indexdata.dk/zebra/

YAZ is a C library and resulting binary application implementing a Z39.50/SRW client. Zebra is an indexer and Z39.50 server. The yaz-client is a straightforward terminal application. Zebraidx is the indexer, and it requires a bunch o' configuration files. It is not as straightforward as other indexers, but its data can be served by zebrasrv. Since the client is built on a library, it can be (and has been) compiled into other tools such as PHP and Apache. The YAZ API also has a Perl interface. Well documented and supported, YAZ/Zebra are definitely worth your time exploring if you want to make your collections available through Z39.50. Yes, you will spend time learning the ins and outs of Z39.50 in the process, but that experience can be taken forward and applied in other venues where Z39.50 is needed.

