ODS Model Nulls Versus Default Values
What is the "industry standard" for an ODS model for attributes that are nullable versus using default values? This ODS model will be used to host customer and order data from many different sources. The data will be summarized into an ADM model for analysis and reporting.
One philosophy for attributes would be the use of nulls. Attributes that do not have a value would have a null value. Here are a couple of examples. The cancel date of an order would not be populated with an actual date unless the order is actually canceled. The cancel amount for the same order would be set to null.
The other philosophy would be to put default values in all attributes. Dates would have "12/31/9999", numerics would be “-1”, characters would be “N/A”, etc… For the above example, cancel date would be populated with "12/31/9999" and cancel amount is set to "-1".
This being said I would not have any nullable attributes in an ADM model.
Do others have issues with putting basically invalid values in an attribute? Additional coding is needed to get around these default values.
Continuing with the examples from above... If you are trying to find out which orders have been canceled since the beginning of 2020, the you can just code cancel date greater than 12/31/2019 if the cancel date attribute is nullable. If the cancel date attribute is not nullable, then you must use a "BETWEEN" and code as cancel date between 1/1/2020 and 12/30/9999 (or something similar to eliminate the default value).
If you need to perform a SUM of the cancel amount, then you can just sum the cancel amount if the attribute is nullable. If it is not nullable, then you have to add a WHERE clause to exclude any values less than zero.
I am interested in hearing what others think so we can make a "best practices" decision on this ODS model.
Thanks in advance!
This is a tough one. I think it really depends on how the ODS is structured. If the ODS is structured relationally (some level of normalization), then the model needs to reflect the business rules and if there is a rule that an order does not need to have a cancel date, than it should remain null. However, if the ODS has been structured for analytics, having a date way out in the future instead of null makes it possible to run certain types of queries such as date range queries. For example, give me all of the orders cancelled between April 1 and April 15 2021.....
Consider usage. If users want to list the number of cancellations by month, and you default cancel date to 12/31/9999, the report will show a large number of cancellations in December. If users want to see the average number of days between an order being placed and an order being cancelled, and you default the date to 12/31/9999, the report will show a lot of very old orders being cancelled. To handle these, users could filter the results to eliminate orders with a cancel date of 12/31/9999, but that's extra work that they might not want to do.
You ask for the "industry standard" for nullable attributes. The answer is definitively SQL's NULL. Using "Defaults" is a bad idea. As stated in your question and the other replys, it leads to a new sort of problem you will have to resolve with more and more code (where conditions, order by, sum/average, UI etc.) This additional code makes your applications more complex and therefore more susceptible to errors.
A non existent value has to be marked and treated as non existent. Period. Is should never be or look like a valid value of the attribute. It can be solved with NULL, or with an attribute status-flag or with subtyping etc.
On top of this, you have to decide what "non existent" means for this attribute: not relevant, not knwon, not defined yet, etc.
PS: I am fully aware of the dicussion about NULL's in the relational model community. I nevertheless consider NULL as a pragmatic solution, if applied consistently and consciously.