Business Tech: Building MultiValue Programmers - Part I
Not everyone is a natural-born teacher. This article series is not a substitute for real training. If you have the knack, this series is here to help you build your own programmers. If you have no choice — can't engage a trainer — at least you'll have this.
When I teach, I like to start with the basics. Even if the person I'm teaching is experienced in other systems, laying groundwork helps them connect with my perspective. Sharing a common view makes teaching easier.
What is Data
When computers were first entering small to mid-sized businesses, data was an obscure concept. Today, while many people lack a solid, practical definition, the idea itself is already firmly established. Training a new MultiValue programmer or analyst is both easier and harder because of this.
Let's start by mixing how data is represented with what data is. While explicitly separating those ideas makes academic sense, merging them makes it graspable. You'll see very little academic sense in this series. We aren't building a theology, we are building for practical use.
Fixed
Fixed length data is a bedrock concept. Look at this example [Figure 1].
12345678901234567890123456789012345678901234 ROBERT MAY POLITICIAN 58BLKBLK099887765 First NameROBERT Last NameMAY ProfessionPOLITICIAN Age58 Eye ColorBLK Hair ColorBLK ID #099887765
Figure 1
The problem with fixed length becomes obvious when Alexandra Rogers has to be added to the database. We've only allotted seven characters to the First Name field. She needs nine. To fix this we have to re-factor the entire database to allow nine characters in the First Name and change every reference in every program.
When she divorces and goes back to her maiden name of Wychowski, we have to do the same thing for our too-small Last Name field. And, if we get an eleven character long job designation, again we re-factor.
To make matters worse, Robert's full name is now taking up eighteen spaces because every name must take up the same amount of space. Yes, disk is cheap, but everything you do to make data less efficient should come with a meaningful benefit. This does not.
Delimited
Delimited fields solve the length problem [Figure 2].
ROBERT,MAY,POLITICIAN,58,BLK,BLK,099887765 ALEXANDRA,WYCHOWSKI,MATHEMATICIAN,BLU,RED,986753099 First NameROBERTALEXANDRA Last NameMAYWYCHOWSKI ProfessionPOLITICIANMATHEMATICIAN Age58BLU Eye ColorBLKRED Hair ColorBLK986753099 ID #099887765
Figure 2
Oops. We lost a field (Age) on Alexandra. In a fixed format, that would have stood out visually. In delimited, we don't have neat columns. We need to insert her age, 26, to fix the data. The good news is that our data only takes the space it needs, plus one for each delimiter. We did have to trade a some human-readability.
SQL
SQL tries to solve the missing field issue by adding validation rules directly into the database schema. In this case, Age would have been an Integer field which would have stopped us from adding BLU as the value.
Unfortunately, SQL also reintroduces maximum field lengths, which is a major step backwards. This might be a good trade, except that SQL's rules are wildly insufficient. Yes, knowing that Age is numeric would have caught the error on our previous example, but generally, knowing if something is a number, boolean, or character, is an extremely limited validation. You still have to do real validation in the programming layer, so SQL's half-thought-out feature is just a way to split the logic arbitrarily into two places.
Our Story so Far
All three of these models are predicated on a poor premise: One question equals one answer. What if Robert works a second job? What if Alexandra has a second ID #? In the real world, data is rarely one to one. These are flat data models. Real data is lumpy.
If I'm making dolls, I might have a retail price, a wholesale price, a samples price (free), and a donation price for tax purposes when I send my excess inventory to a charity. In fixed, delimited, or SQL models, I have a choice of making each of those a discreet location in my data, or I have to resort to child tables [Figure 3].
SQL Product Table ProductID # Betsy9988772 JoeJoe9988773 SQL Price Table ID #TypePrice 9988772W10.00 9988772R15.00 9988772S 0.00 9988772D12.00 9988773W 9.50 9988773R13.25 9988773S 0.00 9988773D11.75
Figure 3
With this extended structure, we now requires five reads to get a product and price. Yes, CPU is cheap, but everything you do to make data less efficient should come with a meaningful benefit. This does not.
Of course, if you have another multiple-answer situation, you spawn another child table and add some more reads. Perhaps the doll comes with several possible outfits. This is worse than the price issue because you not only have additional reads but you also have a variable number of reads, creating an "are we done yet?" inefficiency where you have to scan until you don't find any. In a child table with a large number of rows, this matters.
XML/JSON
XML or JSON might be possible solutions [Figure 4].
<?xml version="1.0" encoding="UTF-8"?> <dolls> <product> <id>9988772</id> <price> <type>W</type> <amount>10.00</amount> </price> <price> <type>R</type> <amount>15.00</amount> </price> </product> <product> <id>9988773</id> <price> <type>W</type> <amount>9.50</amount> </price> <price> <type>R</type> <amount>13.25</amount> </price> </product> </dolls> JSON { "dolls": { "product": { "id": "9988772", "price": { "w": "10.00", "r": "15.00" }}, "product":{ "id": "9988773", "price": { "w": "9.50", "r": "13.25" } } } }
Figure 4
With XML and JSON, we've lost the easy scanning of fixed fields. We've also lost the relative compactness of delimited rows because we are now required to add the field tags into every single record. This is an example of a terrible trade-off. It is less human readable, it is less computer readable, it is space expensive, and slower than every format previously discussed.
Columnar
Columnar databases (like Hadoop) have a smart premise: We search more than we write, so let's optimize for searching [Figure 5].
First Name Table 1ROBERT 2ALEXANDRA Last Name Table 1MAY 2WYCHOWSKI Profession Table: 1POLITICIAN 2MATHEMATICIAN
Figure 5
By splitting the data by column, we are back to multiple reads but unlike SQL, we get something in exchange. When I search by profession, I only deal with one table, and that table has the smallest amount of data needed to resolve that part of the query.
Additionally, in a ten field database, while I do have to do ten reads, I get the ID from the first read and that changes the reads from searches (fixed child tables, delimited child tables, SQL child tables) to targeted reads (columnar tables). And yes, SQL has indexes, but so do Columnar databases. So instead of the index being a speed-up for the inefficiency in SQL, in Columnar our indexes are a speed-up to an already efficient system.
Document Databases
Document databases (like MultiValue and MongoDB) are based on a different premise than the ones above. They support embedded table logic to allow the flexibility of child tables without the extra reads.
Now, I need to say this before we go any further. Document databases can do what every data format we've listed above can do. We can implement columnar logic, fixed logic, delimited logic, or SQL logic in a document database. We can do this easily. Further, we can mix approaches.
Negative: This is chaotic! Positive: This is flexible and models the real world.
For this example, I'm going to use MultiValue because MongoDB represents as JSON, which we have already covered. Additionally, I will use MultiValue's preferred method because all of the other methods it can do are covered above [Figure 6].
ROBERT^MAY^POLITICIAN]BRICKLAYER^58^BLK^BLK^099887765 ALEXANDRA^WYCHOWSKI]ROGERS^26^BLU^RED^986753099]123557722Figure 6
Robert's additional job (Bricklayer) is accommodated by adding a multi-value to attribute three. That's the terminology we use. Alexandra's two legal last names share an attribute. Her two ID #s share their attribute.
So, scorecard: we have the advantages and disadvantages of delimited text, but by embedding delimiters within delimited text, we have acquired single reads. Mike Ruane calls this compressed XML. The attribute numbers stand in for the XML or JSON tags, creating compactness while also keeping space efficiency.
What do I use?
I use everything. I prefer MultiValue in most cases, but the goal of a database is to store, manage, and return data. That can be done with all of these. If a shop is already using Hadoop, use it. If they have SQL, use that. If they have MultiValue, use that.