UniVerse and UniData Hashed Files: Part 1
The MultiValue Database and its descendants are known for their flexibility in handling data. Files contain data records that are composed of an arbitrary number of fields, values and sub-values, all of which can be of completely variable length. This is a radical departure from traditional, fixed-length data records found in other SQL databases, and provides a very attractive environment for business applications. The flexibility of the structure places demands on the underlying data storage mechanisms. To meet these demands, MultiValue databases use a hashed file structure. This series of articles will examine in depth the specifics of hashed files as implemented in the UniVerse and UniData environments.
Fixed length records offer a number of advantages from the point of view of data storage. Perhaps the biggest is that the address of any record can be calculated very easily. Similarly, since each data field within a record is fixed length the location of any field in a record can be calculated. A variety of direct access methods can be devised to provide quick data retrieval by depending on these calculations. Think of these traditional database constructs as rows and columns.
Variable length data isn't as easily handled. Since the location of data fields can't be calculated, MultiValue databases use special characters as delimiters — the familiar field or attribute mark, value mark, sub-value mark and segment mark. Locating data is done by scanning the records — for example, the 5th field is between the 4th and the 5th field marks. Since the location of a data record in the file can't be calculated, scanning through a list of records is required to locate a specific record.
One way of limiting the number of records that must be scanned in the search for a specific record is to subdivide the data. If only part of the file must be scanned, it will be quicker to access a record than if the whole file must be scanned. Of course, there must be a mechanism for determining which part of the file holds the desired record. The file storage schemes that we will be examining use a method called "hashing" and thus the files in these environments are called "hashed files". This first article will explore the basic structures and mechanisms involved in UniVerse and UniData hashed files. Note that both UniVerse and UniData dynamic files are "hashed". Later articles will be more specific and look at various aspects of these files in more depth.
Hashed File Basics
The parts into which a file is subdivided are called "groups", because each part contains a group of data records. The number of groups is specified at the time the file is created and is referred to as the "modulo". The word modulo is a reference to "modular arithmetic" which was discovered by the German mathematician Carl Friedrich Gauss in the year 1801.
Modular arithmetic provides a mechanism to sort a file's data records into groups. The data records that share the same remainder when divided by the modulo of the file will all "hash" into the same group. This means that by looking at the remainder we can determine the group the record belongs to and access it by only scanning that group. Less data to scan means faster access!
The Hash Key
Wait a minute! How do you get a remainder for a data record? Well, first each data record has to be identified by a unique record key. The key is chosen by the user or designer of the file and can be nearly anything from a sequential number to a name or a social security number. Since record keys can be non-numeric we have to have a means of converting them to numbers — "remainder" only has meaning when referring to numbers. The number that the record key is converted to for the hashing process is called the "hash key"
Fortunately, computers already represent characters numerically. There is a code called the "ASCII code" that specifies numeric representations for characters. "ASCII", by the way, stands for American Standard Code for Information Interchange and was developed a long time ago for use with teletypes. As an example, suppose we have a record with the key of "PEGGY". Here are the ASCII codes for the characters in the key:
Character ASCII Code
P 80
E 69
G 71
G 71
Y 89
Even in environments that use UniCode, an alternative to ASCII which is more friendly to multinational environments, each character is still represented by a number.
Once we've converted the characters to numbers we could invent many different schemes to create a numeric hash key from the record key. In fact, UniVerse offers seventeen different techniques and UniData offers two — these are referred to as "file types" or "hashing algorithms" and we will talk much more about the choices later on. In this article we will use a hashing algorithm that we made up — but it is very similar to the standard hashing algorithm developed for the original PICK MulitValue Database, the UniVerse type 18 algorithm, and the UniData type 0 algorithm.
Here are the rules for our algorithm: Start at the beginning of the record key. Initialize an accumulator to 0. Multiply the accumulator by 10 and add the ASCII value of the next character. Repeat for all the characters in the key. See Here is an example BASIC program that implements this algorithm:
ACCUM = 0 INPUT KEY: * FOR I = 1 TO LEN(KEY) ACCUM = ACCUM * 10 ACCUM += SEQ(KEY[I,1]) NEXT I * CRT CRT ACCUM STOP END
So using our hashing algorithm the record key of PEGGY has a hash key of "876899" (Process illustrated below):
Step Character ASCII Value Accumulator
1 0
2 P 80 80
3 E 69 869
4 G 71 8761
5 G 71 87681
6 Y 89 876899
Although the record key must be unique in the file, the hash key will not necessarily be unique. We only need to to calculate the modulo. The remainder will allow us to determine which group to place the record with the key of PEGGY.
Hashing the Record to the Group
Remember that each hashed file has a modulo which specifies the number of groups in the file. The next step in hashing is to divide the hash key by the modulo. The division yields a quotient and a remainder — remember from our discussion of modular arithmetic that the remainder is the important part. The remainder is always going to be a number in the range 0 to the modulo minus 1. If we now add 1 to the remainder it will be between 1 and the modulo — if we number the groups of the file beginning with 1 there is a direct, one-to-one correspondence between the remainder plus 1 and the set of groups in the file. Thus, the remainder plus 1 addresses the group where the data record will be placed.
As an example, suppose a file has a modulo of 7. Using our record with the key of PEGGY, we know that the hash key is 876899. Let's divide the hash key by the modulo:
876899 / 7 = 125271 + (2 / 7)
Quotient = 125271
Remainder = 2
We add 1 to the remainder to get 3. Viola! The record with the key PEGGY "hashes" to group 3 of our file. Whenever our system reads or writes the record, it will know that it resides in group 3.
It is interesting to compare the results from our made-up hashing algorithm with those used by UniVerse and UniData. Create a test file using a modulo of 7 — on UniVerse use type 18; on UniData use type 0 (the default). Now use the RECORD verb to test various record keys and see which group they hash to. Examine the example below of how it should look in Unidata:
:CREATE.FILE HASHTEST 7 Create file D_HASHTEST, modulo/1,blocksize/1024 Hash type = 0 Create file HASHTEST, modulo/7,blocksize/1024 Hash type = 0 Added "@ID", the default record for UniData to DICT HASHTEST. :RECORD HASHTEST PEGGY PEGGY hashed to group 2 and was not found
And an example of how it should look in UniVerse:
>CREATE.FILE HASHTEST 18 7 1 Creating file "HASHTEST" as Type 18, Modulo 7, Separation 1. Creating file "D_HASHTEST" as Type 3, Modulo 1, Separation 2. Added "@ID", the default record for RetrieVe, to "D_HASHTEST". >RECORD HASHTEST PEGGY Record "PEGGY" hashes to group 3 but was not found.
Notice that the group number in the UniData file is 1 less than in the UniVerse file? UniData numbers their groups beginning with 0 instead of 1. So for UniData files, our made-up algorithm could dispense with the final addition of 1 to the remainder. You will find that our algorithm will produce the correct group number for most record keys in UniVerse type 18 and UniData type 0 files. Please don't call us with exceptions that you find! We said MOST because our made-up hashing algorithm doesn't exactly duplicate the real ones.
In our next article, we will take about how the records are stored in "groups", and how this affects read/write and scanning records.