You don’t have to know the answer to everything – just how to find it

Since I work at IBM, I get to use the company’s own email system, which is based on what used to be called Lotus Notes. It has recently had some extra “social media awareness” added to it, been rebranded “IBM Notes”, and been repositioned as a desktop client for social business. Which is all very modern and hip, especially for a product that has its roots back in the early 1990s. However, most organisations (including IBM) tend to use it solely for email – for which it is the proverbial sledgehammer.

But having used it for some 18 years now, I’m fairly comfortable with it. The only issue I have is that, because I’ve been using it for so long, my mail archives contain a huge amount of useful information from old projects that I’ve worked on. I also have other information related to those projects stored elsewhere on my laptop hard drive, and pulling all that information together and searching it coherently isn’t a trivial problem. However, in recent years desktop search engines have begun to provide a really nice solution to this.

The problem here is that Lotus Notes is based on a series of binary databases which form the backbone of its ability to efficiently replicate documents between clients and servers. Desktop search engines generally don’t understand those databases, and hence do not work with Lotus Notes. So searching my laptop becomes a somewhat tedious process, involving the Lotus Notes client search feature, and manually correlating with a desktop search engine of some type. It works, but it’s just not as good as it could be.

What I really want, what I really really want (as the Spice Girls would sing) is a desktop search engine that can understand and integrate my Lotus Notes content. And that’s what this post is all about.

Since I run Linux, I have a choice of open source desktop search engines such as Tracker or Beagle (now deceased). But my current preference is for Recoll, which I find to be very usable. And then, last year, I discovered that a colleague had written and published a filter to enable Recoll to index documents inside Lotus Notes databases. So I had to get it working on my system!

Unfortunately, my early attempts to get the filter working on my Ubuntu (now Mint) system completely failed. He was a Red Hat user, and there are quite a lot of packaging differences between a Debianesque Lotus Notes install and a Red Hat one, especially inside IBM where we use our own internal versions of Java too. So the rest of this post is essentially a description of how I hacked his elegant code to pieces to make it work on my system. It’s particularly relevant to members of the IBM community who use the IBM “OCDC” extensions to Linux as their production workstation. I’m going to structure it as follows: first a description of how Recoll and the Notes filter work, then a description of how I chose to implement the indexing (to minimise wasteful re-indexing) and hence what files go where, and finally some links to allow people to download the various files that I know to work on my system.

At a very simplistic level, Recoll works by scanning your computer’s filesystem. For each file it encounters, it works out what it is (plain text, HTML, Microsoft Word, etc.) and then either indexes it using the Xapian framework (if it’s a format that it natively understands), or passes it to a helper application or filter which returns a version of the file in a format that Recoll does understand, and so can index. In the case of container formats like zip files, Recoll extracts all the contents, and processes each of those extracted files in turn. This means Recoll can process documents to an arbitrary level of “nesting”, comfortably indexing a Word file inside a zip file inside a RAR archive, for example. Once all your files are indexed, you can search the index with arbitrary queries. If you get any hits, Recoll will help to invoke an appropriate application to allow you to view the original file. The helper applications are existing external applications like unrtf or pdftotext that carry out conversions from formats that Recoll will commonly encounter, while filters are Python applications that enable Recoll to cope with specialist formats, such as Lotus Notes databases.
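To make the “nesting” idea concrete, here is a minimal sketch of recursive container handling. It is not Recoll’s code: the function names are made up for illustration, only zip containers and plain text are modelled, and a simple dictionary stands in for the Xapian index.

```python
import io
import zipfile


def index_document(name, data, index, path=()):
    """Recursively index one document, descending into zip containers."""
    full_path = path + (name,)
    if name.endswith(".zip"):
        # Container format: extract each member and process it in turn.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                index_document(member, archive.read(member), index, full_path)
    elif name.endswith(".txt"):
        # Stand-in for a natively understood format: store the text directly.
        index["/".join(full_path)] = data.decode("utf-8")
    else:
        # Anything else would be handed to a helper application or filter,
        # which this sketch does not model.
        index["/".join(full_path)] = None


def make_zip(members):
    """Build an in-memory zip archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member_name, payload in members.items():
            archive.writestr(member_name, payload)
    return buffer.getvalue()


# A text file inside a zip inside another zip.
inner = make_zip({"notes.txt": b"hello from the inner archive"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"top level"})

index = {}
index_document("outer.zip", outer, index)
for location in sorted(index):
    print(location)
```

Each indexed entry is keyed by its full path through the containers, which is roughly the information needed to get back to the same inner document later.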

So, the way the Lotus Notes filter works, is that:

  1. Recoll encounters a Lotus Notes database, something.nsf
  2. To work out what to do with it, Recoll looks up the file type in its mimemap configuration file, and determines what “mimetype” to associate with that file
  3. It then looks up what action to take for that mimetype in the mimeconf configuration file, which tells it to invoke the rcllnotes filter
  4. It then invokes rcllnotes, passing it the URI to something.nsf
  5. rcllnotes then extracts all the documents (and their attachments) from the Notes database, passing them back to Recoll for indexing
  6. It does this by invoking a Java application, rcllnotes.jar, that must be run under the same JVM as Lotus Notes
  7. This Java application uses Lotus Notes’ Java APIs to access each document in the database in turn
  8. These are then either flattened into HTML output (using an XSLT stylesheet) which Recoll can consume directly, or in the case of attachments, output as a document needing further processing; Recoll can tell which is which from the mimetype of the output. Included in the flattened HTML are a couple of metadata tags, one marking this HTML document as descended from a Lotus Notes database, and the other containing the complete Lotus Notes URI for the original document. This latter information can be used by the Lotus Notes client to directly access the document – which is crucial later in the search process
  9. Recoll then indexes the documents it receives, saving enough information to allow Recoll to use rcllnotes again to retrieve just the relevant document from within the Notes database.
  10. So, when a search results in a Notes document, Recoll can use the saved information (the URI of the database and the Notes UNID of the document?) and the rcllnotes filter to obtain either the flattened HTML version of the document, or a copy of an attachment. Recoll then uses the document’s mimetype to determine how to display it. In the case of an attachment, Recoll simply opens it with the appropriate application. In the case of the HTML, Recoll combines the expected “text/html” with the information in the metadata tag that describes this HTML as being derived from a Lotus Notes document. This produces a mimetype of “text/html|notesdoc”, which it then looks up in the mimeview configuration file, which causes it to use the rclOpenNotesClient script. That reads the Notes URI from the other HTML metadata field in the flattened HTML file, and then invokes the Lotus Notes client with it, causing the actual document of interest to be opened in Lotus Notes.
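The lookup chain in steps 2, 3 and 10 can be sketched as a few dictionary lookups plus some HTML metadata parsing. This is only an illustration under assumptions: the dictionaries stand in for the mimemap, mimeconf and mimeview files, and the mimetype string, the meta tag names (origin-tag, notes-uri) and the sample URI are all hypothetical rather than the values the real filter uses.

```python
import os
from html.parser import HTMLParser

# Hypothetical stand-ins for the mimemap, mimeconf and mimeview files.
MIMEMAP = {".nsf": "application/x-lotus-notes"}           # suffix -> mimetype
MIMECONF = {"application/x-lotus-notes": "rcllnotes"}     # mimetype -> filter
MIMEVIEW = {"text/html|notesdoc": "rclOpenNotesClient"}   # mimetype|tag -> viewer


class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]


def filter_for(filename):
    """Steps 2 and 3: file suffix -> mimetype -> filter name."""
    suffix = os.path.splitext(filename)[1]
    return MIMECONF[MIMEMAP[suffix]]


def viewer_for(html):
    """Step 10: combine text/html with the origin tag, then pick a viewer."""
    parser = MetaTagParser()
    parser.feed(html)
    key = "text/html|" + parser.meta["origin-tag"]
    return MIMEVIEW[key], parser.meta["notes-uri"]


# A made-up flattened document carrying the two metadata tags.
SAMPLE = """<html><head>
<meta name="origin-tag" content="notesdoc">
<meta name="notes-uri" content="notes://example/placeholder">
</head><body>Flattened Notes document</body></html>"""

print(filter_for("something.nsf"))
print(viewer_for(SAMPLE))
```

The point of the second tag is that the viewer script never needs to touch the index: everything required to open the original document travels inside the flattened HTML itself.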

One of the problems with using Recoll with Lotus Notes databases is that it’s not possible to index just the few changed documents in a Notes database; you have to reindex an entire database worth of documents. Unfortunately there are usually a lot of documents in a Notes database, and the process of indexing a single document actually seems relatively slow, so it’s important to minimise how often you need to reindex a Notes database.

To achieve this, I make use of a feature of Recoll where it is possible to search multiple indexes in parallel. This allows me to partition my system into different types of data, creating separate indexes for each, but then searching against them all. To help with this, I made the decision to index only Notes databases associated with my email (either my current email database, or its archives) and a well-known (to me) subset of my filesystem data. Since my email archives are partitioned into separate databases, each holding about two years of content, I can easily partition the data I need to index into three categories: static Lotus Notes databases that never change (the old archives), dynamic Lotus Notes databases that change more frequently (my email database and its current archive), and other selected filesystem data.

I then create three separate indexes, one for each of those categories:

  1. The static Notes databases amount to about 8GB and take a little under 4 hours to index on my X201 laptop; however, since this is truly static, I only need to index it once.
  2. The dynamic Notes databases amount to about 1.5GB and take about 40 minutes to index; I reindex this once a week. This is a bigger job than it should be because I’ve been remiss and need to carve a big chunk of my current archive off into another “old” static one.
  3. Finally, the filesystem data runs to about another 20GB or so, and I expect this to change most frequently, but be the least expensive to reindex. Consequently I use “real time indexing” on this index; that means the whole 20GB is indexed once, and then inotify is used to detect whenever a file has been changed and trigger a reindex of just that file, immediately. That process runs in the background and is generally unnoticeable.

So, how to duplicate this setup on your system?

First you will need to install Recoll. Use sudo apt-get install recoll to achieve that. Then you need to add the Lotus Notes filter to Recoll. Normally you’d download the filter from here, and follow the instructions in the README. However, as I noted at the beginning, it won’t work “out of the box” under IBM’s OCDC environment. So instead, you can download the version that I have modified.

Unpack that into a temporary directory. Now copy the files in RecollNotes/Filter (rcllnotes, rcllnotes.jar and rclOpenNotesClient) to the Recoll filter directory (normally /usr/share/recoll/filters), and ensure that they are executable (sudo chmod +x rcllnotes etc). You should also copy a Lotus Notes graphic into the Recoll images directory where it can be used in the search results; sudo cp /opt/ibm/lotus/notes/notes_48.png /usr/share/recoll/images/lotus-notes.png.

Now copy the main configuration file for the Notes filter to your home directory. It’s called RecollNotes/Configurations/.rcllnotes and once you have copied it to your home directory, you need to edit it, and add your Lotus Notes password in the appropriate line. Note that this is by default a “hidden” file, so won’t show up in Nautilus or normal “ls” commands. Use “ls -a” if necessary!

Next you need to set up and configure the three actual indexes. The installation of Recoll should have created a ~/.recoll/ configuration directory. Now create two more, such as ~/.recoll-static/ and ~/.recoll-dynamic/. Copy the configuration files from each subfolder of RecollNotes/Configurations/ into the corresponding one of your three Recoll configuration folders. Now edit the recoll.conf files in ~/.recoll-static/ and ~/.recoll-dynamic/, updating the names of the Notes databases that you wish to index. Then manually index these Notes databases by running the commands recollindex -c ~/.recoll-static -z and recollindex -c ~/.recoll-dynamic -z.

At this point it should be possible to start recoll against either of those indexes (recoll -c ~/.recoll-static for example) and run searches within databases in that index. I leave it as an exercise for the interested reader to work out how to automate the reindexing with cron jobs.
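For readers who would rather not do the exercise from scratch, here is one possible starting point: a small wrapper that cron could call to rebuild the dynamic index. It is a sketch, not part of the downloadable archive; the script name and location in the crontab comment are hypothetical, and the only recollindex options used are the -c and -z ones from the commands above.

```python
import os
import subprocess
import sys

# A crontab entry along these lines would run the script at 03:00 every
# Sunday (the script path is a hypothetical example):
#   0 3 * * 0 python3 $HOME/.scripts/reindex_notes.py --run


def reindex_command(confdir):
    """Build the full-reindex command from the post for one config directory."""
    return ["recollindex", "-c", os.path.expanduser(confdir), "-z"]


def reindex(confdir, dry_run=False):
    """Run (or, in dry-run mode, just print) the reindex command."""
    command = reindex_command(confdir)
    if dry_run:
        print(" ".join(command))
        return 0
    return subprocess.call(command)


if __name__ == "__main__":
    # Without --run the script only prints what it would do, which makes it
    # safe to try out before handing it to cron.
    reindex("~/.recoll-dynamic", dry_run="--run" not in sys.argv[1:])
```

Only the dynamic configuration is reindexed here, since the static archives need indexing just once.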

Next we wish to set up the indexing for the ~/.recoll/ configuration. This is the filesystem data that will run with a real-time indexer. So start by opening up the Recoll GUI. You will be asked if you want to start indexing immediately. I suggest that you select real-time indexing at startup, and let it start the indexing. Then immediately STOP the indexing process from the File menu. Now copy the file RecollNotes/recoll_index_on_ac to your personal scripts directory (mine is ~/.scripts), ensure it is executable, and then edit the file ~/.config/autostart/recollindex.desktop, changing the line that says Exec=recollindex -w 60 -m to Exec=~/.scripts/recoll_index_on_ac (or as appropriate). This script will in future be started instead of the normal indexer, and will ensure that indexing only runs when your laptop is on AC power, hopefully increasing your battery life. You can now start it manually with the command nohup ~/.scripts/recoll_index_on_ac &, but in future it will be started automatically whenever you log in.
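The recoll_index_on_ac script itself is in the download, so the following is only an illustration of the kind of check such a script needs, not a copy of it. It assumes the standard Linux sysfs layout, where each entry under /sys/class/power_supply has a type file (Mains for an AC adapter) and, for adapters, an online file containing 1 or 0. The function names are my own.

```python
import os


def on_ac_power(base="/sys/class/power_supply"):
    """Return True if any mains adapter under `base` reports being online."""
    if not os.path.isdir(base):
        return False
    for name in os.listdir(base):
        supply = os.path.join(base, name)
        try:
            with open(os.path.join(supply, "type")) as handle:
                kind = handle.read().strip()
            with open(os.path.join(supply, "online")) as handle:
                online = handle.read().strip()
        except OSError:
            # Batteries, for example, have no "online" file; skip them.
            continue
        if kind == "Mains" and online == "1":
            return True
    return False


def indexer_command():
    """The stock real-time indexer command from the autostart entry."""
    return ["recollindex", "-w", "60", "-m"]


if __name__ == "__main__":
    print("on AC power:", on_ac_power())
    print("would run:", " ".join(indexer_command()))
```

A complete wrapper would also have to poll this check and stop or restart the indexer as the power source changes; this sketch deliberately leaves that out.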

While your filesystem index is building, you can configure Recoll to use all three indexes at once. Start the Recoll GUI, and navigate to Preferences -> External Index dialog. Select “Add Index”, and navigate into the ~/.recoll-static/ and ~/.recoll-dynamic/ directories, selecting the xapiandb directory in each. Make sure each is selected. Searches done from the GUI will now use the default index (from the filesystem data) and the additional indexes from the two Lotus Notes configurations.

There is one final configuration worth carrying out, and that is to customise the presentation of the search results. If you look in the file in RecollNotes/reoll-result-list-customisation you will find some instructions to make the search results easier to read and use. Feel free to adopt them or not, as you wish.

Update: To answer the first question (by text message no less!), my indexes use up about 2.5GB of space, so no, it’s not insignificant, but I figure disk really is cheap these days.

Update: Corrected command to copy Notes icon to Recoll image directory to match configuration files, and a couple of the pathnames where I had introduced some typos.

Update: Added the main .rcllnotes configuration file to my archive of files, and updated the installation instructions to discuss installing it, and the need to add your Lotus Notes password to the file.


4 thoughts on “You don’t have to know the answer to everything – just how to find it

    • Yes, I’ve looked at Zeitgeist (GNOME’s take on Nepomuk, and compatible with its ontology) in the past, and although I can see how the whole semantic desktop concept could be really useful, it’s just total overkill for what I need right now 🙂

    • Thanks Dave – that sounds really useful, as opening some of the documents can be quite slow. I guess I’ll need to adjust it for the Ubuntu-esque version of your code that I’m using & discussing here, but hopefully that should be fairly easy. Let me take a look at your patch over the next few days.
