Youssef Boubli

Personal blog

Choosing the Right Storage for Application Data

What types of data are you dealing with? We will roughly classify them into the following five categories. Naturally, this is not a comprehensive classification, but it will help us understand the options and approaches to keep in mind.

  1. Homogeneous data arrays containing elements of the same type
  2. Multimedia - audio, video and graphics files
  3. Interim data for internal use (logs of various types, caches)
  4. Streams of calculated data of various types (e.g. recorded video stream or massive computation results)
  5. Documents (simple or compound).

The ways of storing such data are as follows.

  1. Files in file system
  2. Databases
  3. Structured storages
  4. Archives (as a specific form of structured storage)
  5. Remote (distributed, cloud) storages.

Let us now discuss which storage mechanism is best suited to each of the data types mentioned above.

Homogeneous data arrays

Homogeneous data arrays contain elements of the same type. Examples include a simple table, temperature data over time or last year's stock values.

  1. For homogeneous data arrays, regular files do not offer convenient and fast search. You have to create, maintain and constantly update special index files. Modifying the data structure is almost impossible. Meta-information is limited. There is no built-in run-time compression or encryption of data.

  2. Relational databases are well suited for homogeneous data. They comprise a set of predefined records with a rigid internal format. The main advantages of relational databases are the ability to locate data quickly according to specified criteria and transactional support for data integrity. Their significant shortcoming is that they do not work well for large data of variable length (BLOB fields are usually stored separately from the rest of the record). Moreover, keeping data in a relational database requires: a) use of a specific DBMS, which severely limits portability of the data and of the application itself; b) pre-planning of the database structure, including relationships between tables and indexing policy; c) research into peak loads for efficient database development, which can also be a serious overhead.

  3. Structured storages are somewhat analogous to a file system: a storage is a set of named streams (files) wrapped in a single container. Such a storage can be kept at any location, for example in a single file on a disk, in a database record, or even in RAM. The main advantage of this approach is that it allows efficient adding or deleting of data in an existing storage and handles data of various sizes (from small to huge) effectively. Storages are separate units (files) and can therefore be easily relocated, copied, duplicated and backed up. There is no need to track all files generated by an application. Moreover, journaling makes it possible to restore content completely or partially after accidents or failures. The disadvantage may be relatively slow search inside huge data arrays.

  4. ZIP archives, as a specific form of structured storage, can be used for storing homogeneous data arrays, but only when most access is read-only. The standardized nature of the ZIP format makes it easy to use, especially in cross-platform applications, but the format is not suitable for data that will be modified after packing: adding and deleting data are time-consuming operations.

  5. Remote and distributed storages are the next level of storage, in which data location and data access are handled by a dedicated layer that encapsulates the access mechanics. In such storages data may actually be stored in databases or be distributed among different file systems, but the actual storage organization does not matter to the end user. The user sees only a set of objects accessed through an API or, alternatively, through file system calls. Cloud storage is a good example. These types of storage are intended for large software systems. Their advantages include unified data access without the need to think about how the data are actually stored. Their disadvantages are that they cannot be efficiently managed and controlled, and that backup or migration of data is complicated.
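
To make the relational option more concrete, here is a minimal sketch of storing temperature readings in a database and locating them by a criterion. It assumes Python's built-in sqlite3 module with an in-memory database; the table name, column names and helper functions are made up for illustration, and a real project would choose its own DBMS and schema.

```python
import sqlite3


def build_temperature_db(readings):
    """Create an in-memory table of (timestamp, celsius) rows with an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE temperature (ts TEXT NOT NULL, celsius REAL NOT NULL)"
    )
    # The index is what makes lookups by timestamp fast on large tables.
    conn.execute("CREATE INDEX idx_temperature_ts ON temperature (ts)")
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO temperature (ts, celsius) VALUES (?, ?)", readings
        )
    return conn


def readings_between(conn, start, end):
    """Return rows whose timestamp lies in [start, end], ordered by time."""
    cur = conn.execute(
        "SELECT ts, celsius FROM temperature "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


if __name__ == "__main__":
    sample = [
        ("2024-01-01T00:00", -2.5),
        ("2024-01-01T06:00", -1.0),
        ("2024-01-01T12:00", 3.5),
        ("2024-01-01T18:00", 0.5),
    ]
    db = build_temperature_db(sample)
    print(readings_between(db, "2024-01-01T06:00", "2024-01-01T12:00"))
```

Note that the schema and the index have to be decided up front, which is exactly the pre-planning cost mentioned in the list above.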

Audio, video and graphic files

Storing one or a few multimedia files is simple. Complications appear when you need to maintain a large number of files and search across the multimedia collection.

  1. Only very simple and small multimedia collections can be stored as regular files. Even for an average home collection, simple file-based multimedia storage becomes unmanageable very quickly. This is mostly due to the size of these files, the inability to attach annotations, tags or metadata, and the low speed of copying or relocation.

  2. Relational databases are a dubious way of storing audio, video or similar types of data. An RDBMS is not well suited for keeping large BLOBs, especially large video files. Also, each type of data requires its own table (because a different set of metadata needs to be stored for each). On the other hand, an RDBMS can be handy because it offers powerful search capabilities, which suits read-only collections well.

  3. Structured storages work very well for storing multimedia files when the storage supports metadata and fast search through it. If such search is not supported, structured storage becomes a variant of the file system.

  4. Remote and distributed storages are among the best solutions for storing video, music or similar data. The storage represents a single unit where all elements of a multimedia collection or video game can be safely stored. There is no risk of losing a single but important file. Searches are fast and efficient if the storage supports tags and metadata.
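
Several of the options above depend on whether the storage supports tags and metadata. The following sketch shows, in a very simplified form, what such support amounts to: an index from tag to item name that can answer "which items carry all of these tags". The TagIndex class and the file names are hypothetical, and a real storage would persist the index rather than keep it in memory.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index from tag to media item names."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, name, tags):
        """Register an item under each of its tags (case-insensitive)."""
        for tag in tags:
            self._by_tag[tag.lower()].add(name)

    def find(self, *tags):
        """Return the sorted names of items that carry every given tag."""
        if not tags:
            return []
        sets = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*sets))


if __name__ == "__main__":
    index = TagIndex()
    index.add("concert.mp4", ["live", "jazz"])
    index.add("studio.flac", ["jazz"])
    print(index.find("jazz", "live"))
```

Without an index of this kind, every search has to open and inspect each file, which is why plain file storage stops scaling for larger collections.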

Temporary data

Temporary data are generated by software on the fly and usually have a limited validity period. Updates are typically very frequent. In addition, such intermediate information should stay easily accessible and intact and, in many cases, encrypted and secured. It is still possible to use regular files for these purposes, but this approach results in high resource consumption, offers no reliable way to control and enforce data integrity, and leaves encryption to be implemented by your software.

  1. Files have long been used for interim data storage. They are quite suitable for storing low-priority, unsecured temporary data of insignificant size. However, modern legislation in several countries dictates more careful and responsible treatment of interim data. As a result, a regular file system becomes less suitable when data security, vulnerability and protection from tampering become paramount.

  2. Relational databases are not usually used for interim data storage because such data, as a rule, lack a clearly defined structure and interrelated elements. Slow updates and issues with compression and security add to this unsuitability. At the same time, a relational database can contain interim data related to the database itself and its operation. A database can also be used as a kind of data cache or for storing activity logs (journal files). An RDBMS is not well suited if the data must be stored long term (for years) and be signed or encrypted.

  3. Structured storages may be considered an optimal solution when a large volume of interim data needs to be stored, accessed, indexed and searched, and compressed and encrypted on the fly. Structured storages may be built with anti-tampering functions or, should the requirements call for it, provide an easy way to remove or replace data. As always, such storages can be easily copied or moved without special care to preserve data integrity.

  4. ZIP archives are rarely used for interim data storage. The (as a rule) fast turnover of interim data makes them impractical in most situations. An encrypted archive may be suitable for this type of data only when snapshots are to be stored for a long time and need to be protected from loss or tampering.

  5. Remote and distributed storages are used for interim data mainly because of space considerations. They do not provide the speed or the easy management and backup often required for interim data.
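
As noted above, with regular files the integrity checks are left to your own software. The sketch below shows one way this could look: a temporary file is written with an HMAC-SHA256 tag in front of the payload, and the tag is verified on reading. The function names and the file layout are assumptions made for this example. It only detects tampering; it does not encrypt anything, and key management is left out entirely.

```python
import hashlib
import hmac
import os
import tempfile

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def write_signed_temp(payload: bytes, key: bytes) -> str:
    """Write tag + payload to a new temporary file and return its path."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    fd, path = tempfile.mkstemp(suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(tag + payload)
    return path


def read_signed_temp(path: str, key: bytes) -> bytes:
    """Return the payload, or raise ValueError if the file was modified."""
    with open(path, "rb") as f:
        blob = f.read()
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("temporary file failed integrity check")
    return payload


if __name__ == "__main__":
    secret = os.urandom(32)
    tmp_path = write_signed_temp(b"cached result", secret)
    print(read_signed_temp(tmp_path, secret))
    os.remove(tmp_path)
```

Every application that stores interim data in plain files has to carry code like this itself, which is the overhead a storage with built-in integrity protection is meant to remove.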

Data streams

Large volumes of quickly generated data, such as output data feeds, need to be stored efficiently. Regular file systems significantly limit file sizes, necessitating specific handlers for data overflow at the expense of integrity and reliability. Since data of this type often contain privileged or sensitive material, fast on-the-fly encryption is a must. The same applies to efficient data compression, since these data feeds are usually very large.

  1. Regular files are not well suited for this type of data. Quickly growing file sizes require creating many intermediate cache files that need to be copied back. Even with careful design, the amount of memory or media consumed tends to grow geometrically. Handling, indexing, searching and encrypting data streams stored in regular files becomes a nightmare.

  2. Relational databases pose almost exactly the same problems as regular files. Add to that inefficient database updates and rigid structure, and it becomes clear that relational databases are among the least suitable storage solutions for streams of data.

  3. Archives (repositories) may be used for data stream storage when security and low vulnerability are required, at the expense of easy searches and fast retrieval. Data can be compressed, but fast and efficient searches become almost impossible.

  4. Structured storages have the advantages of security, integrity and efficient searches. The storages are autonomous single-file units, which can be easily transferred or copied. Access is easy and efficient. Data streams kept in them can be encrypted and protected from tampering. Thin partitioning offers another convenience for storage users: the storage grows automatically as the data size increases.

  5. Remote and distributed storages are well suited for streaming data and are commonly used in projects generating vast amounts of data. Since such data are frequently analyzed by distributed systems or clusters, remote storage is the best fit. This type of storage provides easy but well-controlled data access and guards against unauthorized tampering or removal.
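
The need for on-the-fly compression of data feeds can be illustrated with a short sketch. It assumes Python's standard gzip module and writes into any file-like sink; the function names are made up for this example. The point is that chunks are compressed as they arrive, so the whole feed never has to sit in memory.

```python
import gzip
import io


def compress_stream(chunks, sink):
    """Compress an iterable of byte chunks into sink; return the raw byte count."""
    total = 0
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        for chunk in chunks:
            gz.write(chunk)
            total += len(chunk)
    return total


def read_stream(source, chunk_size=64 * 1024):
    """Yield decompressed blocks from a gzip-compressed file-like source."""
    with gzip.GzipFile(fileobj=source, mode="rb") as gz:
        while True:
            block = gz.read(chunk_size)
            if not block:
                break
            yield block


if __name__ == "__main__":
    feed = (b"sample,%d\n" % i for i in range(1000))
    buffer = io.BytesIO()
    raw_size = compress_stream(feed, buffer)
    print(raw_size, len(buffer.getvalue()))
```

The trade-off described in the list shows up here as well: a plain gzip stream has to be read from the beginning, so fast search inside the compressed data is not possible without an additional index.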

Documents

Documents are a rigidly structured data type specifically designed to store human-readable textual or graphical information. Documents are one of the most common forms of information produced and used in business and personal activities.

  1. Files are the most common way of storing documents. But when concurrent access to documents is required, the use of regular files is complicated. Since the whole compound document structure is stored sequentially in a flat file, any modification requires creating a set of temporary files containing the subset of the document’s elements to be edited. In addition, deleting elements from the document does not reduce the file size automatically. To optimize the size, an additional copy of the document must be created and saved into yet another file. After the edit operation is completed, the original file must be deleted. If this is to be done automatically by the editing software, its developer has one more task to remember.

  2. Relational databases work well for some types of documents and can provide fast and efficient indexing, search and retrieval, provided that on-the-fly conversion to plain text is available. Databases suffer from the same shortcomings as for the storage of homogeneous data arrays. Keeping data in a relational database requires: a) use of a specific DBMS; b) pre-planning of the database structure, including relationships between tables and indexing policy; c) research into peak loads for efficient database development, which can also be a serious overhead.

  3. Customizable structured storages are among the best choices for corporate use of documents. The main advantage of structured storages is that they allow efficient adding or deleting of documents or their parts in an existing storage, provide effective document access restrictions, and so on. Complex documents that contain embedded images or other multimedia can be handled more easily by keeping the text apart from the multimedia (this reduces load/save time, makes text search easier, etc.). Moreover, journaling makes it possible to restore content (completely or partially) after accidents or failures. One more benefit is the possibility of storing multiple editions or multiple alternative views of the data within one document. The disadvantage may be slower search, which has to be implemented using on-the-fly conversion to plain text.

  4. ZIP files are used in some document formats, such as OpenDocument Format, to store document data. Most of the advantages described above for structured storage apply to ZIP file storage, but again, adding, modifying and deleting information are time-consuming operations and sometimes require a complete rewrite of the file. Also, the ZIP format offers only limited means of attaching metadata to the entries inside (a comment and extra fields per entry), and ZIP encryption capabilities are limited (strong AES encryption was a later addition to the format and is not supported by all ZIP tools and libraries).

  5. Remote and distributed storages are becoming widespread and popular. They allow easy collaboration during document creation and use, with remote but tightly controlled and secured access. Unlike homogeneous data arrays, a document usually constitutes one object that is accessed and modified in its entirety, which makes document retrieval and management quite simple. The drawbacks are the same as those described for remote storages in the previous sections.
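
The remark that deleting from a ZIP-based document may require a complete rewrite can be shown in a few lines. The sketch below uses Python's standard zipfile module and removes an entry by copying every other entry into a new archive; the remove_entry helper and the entry names are invented for this example, and other ZIP libraries may handle deletion differently.

```python
import io
import zipfile


def remove_entry(archive_bytes: bytes, name: str) -> bytes:
    """Return a new ZIP archive without entry `name`, copying all the others."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename != name:
                # Passing the ZipInfo keeps each entry's name, timestamp
                # and compression method in the new archive.
                dst.writestr(info, src.read(info.filename))
    return out.getvalue()


if __name__ == "__main__":
    original = io.BytesIO()
    with zipfile.ZipFile(original, "w") as z:
        z.writestr("content.xml", "<doc/>")
        z.writestr("draft.txt", "obsolete notes")
    trimmed = remove_entry(original.getvalue(), "draft.txt")
    with zipfile.ZipFile(io.BytesIO(trimmed)) as z:
        print(z.namelist())
```

The cost of this operation grows with the size of the whole archive, not with the size of the removed entry, which is why ZIP suits documents that are mostly read rather than edited in place.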

Suggested solutions

The simple rule "use the right tool for the right job" is even more important in software design. Incorrect or poorly thought-out data storage planning can lead to disastrous results.

  1. If you use files, you face a choice of file system.

  2. There is a wide choice of commercial database systems (Oracle, DB2, etc.) as well as open source solutions.

  3. Repositories can be created with commercial and public archiving solutions, such as ZIP.

  4. Examples of structured storages include OLE Structured Storage by Microsoft (offers basic storage capabilities only, i.e. no encryption, compression or search) and Solid File System by EldoS Corporation.

  5. Remote storages can be designed with Solid File System OS Edition and Callback File System by EldoS, FUSE for Unix-based systems, etc.

In any case, only the project developer knows the exact requirements, understands all the technologies with their features and restrictions, and can therefore make an adequate choice of tools for the successful implementation of the software project.