Refactoring Entity-Attribute-Value Design
Entity-Attribute-Value (often referenced as EAV) table design is one of the “wonderful” things that newcomers to the SQL land discover very quickly! Traditional procedural programmers have very hard time understanding why tables have fixed number of columns, and adding a new column requires table and possibly code change. Their great problem solving skills from the procedural world help them find a “natural” solution – the EAV design – where a pair of key (describing the attribute) and value allow to store data with flexibility.
Example is storing all employee information in 3 columns: key column to identify employee (entity), attribute key column (attribute) to identify the attribute stored (name, birth date, SSN, date of hire, etc.), and value column (value) to store the actual value.
Normally the value column has to be of some large length character data type, like VARCHAR(200), and data has to be casted to text when saved and then casted back to the original data type when retrieved.
Once all coding is done then comes the realization of how bad this approach is… And sometimes you just walk into a project where EAV has been already deployed and now it is your job to fix it. The typical signs of EAV implementation is the existence of very complex (and slow) queries to satisfy relatively trivial user requests for data and the lack of data integrity.
How do you fix EAV design?
Here is one example of EAV design and the refactoring process to correct it. In our case the information for bank loans is stored in a table representing EAV design. Listing 1 below shows the scripts to create the table and insert sample data.
Here the pair of loan number and customer number identifies the entity. The code column stores the attribute (amount, date, and type of loan) and the value column represents that actual value.
A quick demonstration of the problems caused by this design. Let’s write a query to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008. Here is a first attempt to satisfy the request:
One of the first issues to notice is the multiple joins to retrieve each attribute. This is because each attribute is stored in separate row and we need to put back together the customer loan info. Besides that seems the code should be OK… except one little problem! The nature of the SQL is such that nothing guarantees that the predicates in the WHERE clause will be executed in the order listed. They will be expanded in the query plan and the cast to DATETIME or DECIMAL can be applied to the loan type value which will result in conversion error. I described this in more detail in my post Predicates in SQL.
To fix this you may be tempted to try a different approach: using table expressions to retrieve each attribute. Listing 3 shows a version of such query.
However, this query has the same problem because the table expressions are expanded in the query plan and nothing guarantees the order of executing the predicates. One approach to retrieve the requested data successfully is to use a pivoting technique. The query below illustrates that.
Here the CASE expressions guarantee that only the correct values will be casted to the specific data type. But then how efficient is this query for such a trivial request?
The refactoring process
Usually the first step of correcting an EAV design is to create a normalized table for storing the data. In our case the table structure can look like Listing 5 (note that here for clarify the loan type is spelled out as entire word; in real business scenario it may be sufficient to indicate only P or B; alternatively create table to store loan types and reference via a foreign key constraint).
Now each attribute is in separate column with appropriate data type. This guarantees the integrity of the data as well as now we can define constraints to restrict data values (like the CHECK constraint for loan type).
The next step is to convert and transfer data from the old format to the new table. We already know that a pivoting technique works and can be used here. Here is the query to transfer from EAV to normalized format.
The last step in the refactoring process is to replace the old EAV table with view for backward compatibility (if there is code referencing that original table that cannot be changed at this time). There are different methods to accomplish this. For simplicity here we can use unpivoting using UNION and query for each attribute (note that you have to drop the original EAV table first because the view has the same name).
If there is any old code that references the original EAV table for data modifications you will have to add instead of triggers to the view to handle it.
Now, let’s answer the same data request to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008, this time using the normalized table.
Simple and elegant, just as it should be!
It is very good article. But some of people (myself included) known the EAV as Open Schema.
Yes pivot refacturing has worked well, however this does not change the problem of "adding a new column requires table and possibly code changes" when using EAV.
"adding a new column requires table and possibly code changes"
Code change: Depends on how you write the code. You know you can access metadata…
Table change: Runtime DDL or EAV, each is to be avoided, so you can choose.
Good article — thanks. I work with EAV a lot in healthcare, where it can be highly useful because of the "sparse matrix" nature of the data. Question for you — do you know of an algorithm or technique for transforming an EAV table to a flat table "on the fly" when the number of attributes that must become columns is potentially large and/or little is known about them a priori?
Your method involves knowing all of the attributes in advance, hard-coding the new table that will represent them as columns, and then constructing the query that transforms. I would like to be able to write a query that takes any EAV table about which I have little prior knowledge and transforms and presents the data in a flattened format.
Thanks in advance for any advice!
PS — if this cannot be accomplished in SQL, I would also be interested in a programmatic algorithm using arrays.
I've written about the many problems of EAV, AKA MUCK.
Based on frustrating experience.
If anyone brings up the idea of using MUCK/EAV, feel free to refer them to the article.