Database

This page summarizes the database and data storage technology employed by PageWork.

Summary [Edit]

PageWork is designed as a NoSQL document store that handles any number of pages. It is optimized for low latency and fast page delivery. The fast and flawless delivery of individual pages is considered the primary task of a document store, and the underlying database technology was primarily chosen with this purpose in mind.

Technology [Edit]

Key-Value Store [Edit]

There are many other tasks involved in the operation of a document management system, but all these duties are considered secondary compared to high-performance page delivery. PageWork employs a technique that perfectly suits the task at hand: it treats its contents, such as pages, users, and notes, as structured documents, and addresses these contents via unique keys, effectively yielding a key-value document store. The process of accessing any particular property of a document does not take more than two atomic operations: first, the document itself is retrieved using its unique ID, its node. Having access to the document, the property is then extracted from its data. While documents are stored as key-value pairs in the database, their properties are encoded as schema-free, serialized octet streams.

The low-level functionality of key-value stores is handled by a family of software libraries best known as DataBase Manager whose first derivatives date back to the early decades of BSD Unix. The technology on which dbm is based comes with two major advantages:

Speed
Given a key, its value can be retrieved in almost no time. Moreover, the time it takes to retrieve and deliver a page requested by a user is independent of the number of pages stored in the database – be it 20, 2 000, or 200 000 pages in total. Save for data caching and file locking, there's very little protocol overhead involved. All data is stored in local files that reside in the file system of the web server, and no kind of client-server communication takes place. All these factors yield a very low operational latency.
Simplicity
The underlying technology, the actual database, is comparatively simple in nature, at least compared to a full-blown, schema based RDBMS. Not only does simplicity imply speed, it also reduces complexity, and thus error proneness.

Downsides [Edit]

It should be notes that the use of dbm-based technology involves various drawbacks compared to a schema based RDBMS. The first is related to the sequential processing, that is, scanning, of the records of the database. If, for example, one issues a query for all pages that have a particular combination of tags assigned, the schema free encoding of the document store proves to be a suboptimal solution: scans are much slower and less efficient than an optimized query processed by a schema based RDBMS.

The second notable disadvantage stems from the very simplicity of the database, that is, its ability to manage the access to its resources. In an environment with a high load of concurrent write accesses, a dedicated database server is much better suited to optimize competing requests that attempt to modify database contents, and to avoid deadlocks that might occur due to conflicting access. However, a dedicated server will inevitably increase the overall complexity of the storage infrastructure, and thus the susceptibility and error proneness of the system as a whole.

Due to the prospect of the advantages listed above, these two trade-offs were knowingly taken into account when the storage technology for PageWork was selected. Since the database core is composed of modular, object orientated components, the current data store can easily be exchanged with an alternative database setup, should the need ever arise. This scenario is being dealt with in the section Alternatives below.

Database Architecture [Edit]

The stock software release of PageWork employs PHP's DBA architecture to store most of its dynamic contents, using a schema free, document orientated encoding of the properties.

DBA Handlers [Edit]

In theory, DBA supports several low level database handlers, provided the underlying operating system provides the necessary libraries. This includes, but is not restricted to, the commonly employed gdbm, and even various original Berkeley DB architectures, such as dbm and ndbm. However, for practical reasons server administrators should avoid these venerable forefathers of today's key-value stores. Strictly speaking, PageWork's true potential will unfold only with more advanced database handlers, notably the current version of Berkeley DB (formerly developed by Sleepcat, and now owned and maintained by Oracle), or qdbm, an independent, highly recommended derivative created and maintained by Mikio Hirabayashi.

PageWork makes explicit use of some of the basic functionality of the B-tree/B+ tree storage technology on which Berkeley DB and qdbm are based, the most important being that the database keys are being stored in an orderly fashion. Hash-based databases, such as gdbm, scan the stored keys, though bijective regarding the data set as a whole, in an unsorted order that appears random to the user's eye. For example, if the pages of a site were stored in a gdbm-based database, a generated list of page names would not be listed in alphabetical order.

The following list sums up the most important features of Berkeley DB and qdbm, and constitutes the choice of employing them:

Speed
Even in decent every-day scenarios, these databases outperform their ancient ancestors by far.
Size
Data storage is highly optimized. Data stored in the database takes but a fraction of the space it would consume inside a gdbm, dbm, or ndbm counterpart.
Locking
dbm and ndbm do not support file locking at all, which makes their deployment in multi-user environments an absolute no go.
Key order
B-tree/B+ tree technology stores keys in a “natural” order that can be comfortably built upon when returning lists of stored contents.

Optionally, PageWork employs an additional database handler, CDB, for storing read-only information, like localized strings. CDB is supposed to be the perfect choice for storing static key-value pairs. Its use is optional, and any data intended to be stored in a CDB can also be stored in one of the dynamic database formats listed above. If available, the PageWork installer selects CDB automatically, and uses it to store read-only data of the software library.

Server Specific Settings [Edit]

Depending on the platform your web server runs on, DBA-specific settings can be fine-tuned. Most importantly, this concerns database locking. The locking mode of DBA database files can be specified as one of the following options:

  • d: Locking is performed on the database file itself. This is the default setting.
  • l: Locking is handled via an additional lock file. This setting should be used on Windows platforms.
  • -: Locking is disabled.

Database locking is configured in the first lines of file $(LIB_PATH)/core/pool_base.php.

Alternatives [Edit]

The need for alternatives to DBA might arise because of one of the two major drawbacks:

Sequential Processing [Edit]

As soon as the data set becomes very large, the performance of sequential processing, that is, the scanning of the data store for specific properties, will deteriorate proportionally to the number of stored data records. Depending on the configuration of the hardware on the one hand, and the complexity of the query on the other, data scans may begin to delay when the database comprehends about 2000 – 6000 pages. This performance drop can be mitigated by employing a schema based RDBMS instead of a schema free key-value store.

A former version of PageWork implemented Sqlite as its storage back-end, the result being a radical speed gain regarding sequential scans. However, although Sqlite is a slim RDBMS that does not require a dedicated server, and that stores all data in a single local file the overall complexity of the storage management grew considerably. Especially simple tasks, such as the retrieval of a single page by means of a SQL-92 query, were far more complex to implement than their counterparts based on a key-value store. It therefore violates the dogma of simplicity.

Concurrent Write Access [Edit]

Lots of concurrent write requests are best streamlined by a dedicated database server. An elegant option is Tokyo Tyrant, a network based variant of qdbm. Tokyo Tyrant is a fast, lightweight database server that takes the key-value store to the level of a network based server. The first of Tokyo Tyrant's two relevant drawbacks is that each database requires a server instance of its own – a requirement that impairs PageWork's ideal of wiki farming. The other trade-off is that PHP's support for Tokyo Tyrant has been made available only recently, and that it is unlikely that many web servers will support the necessary extension.

Another option is a fully-featured, server based RDBMS, such as MySQL or PostgreSQL. The advantage of these systems comes threefold: First, all concurrent write attempts are smoothly resolved by the ACID-proof request scheduler of the server. Second, being RDBMSs, such servers nicely optimize complex queries that operate on large data sets, and are thus perfectly suited to handle sequential processing. Third and last, many of these software packages offer powerful features, like content indexing or date calculations – tasks that are currently implemented by PageWork itself. The major disadvantage of server based RDBMSs is a steep increase regarding the complexity of the web server.

Convergence [Edit]

A reasonable scenario would be to provide an alternative storage back-end to DBA in form of PDO. That way, any number of supported RDBMSs can be used as drop-in replacements for one another. The approach would be completely different from a document orientated key-value store, entailing strengths and weaknesses as it goes.