Evolving Database Schemas

Master your Domain

It amazes me that while database schema evolution is one of the most critical factors in software development, it’s also one of the most ignored and least understood. Pretty much every interesting software development project requires an evolving database schema. From renaming tables to modifying relationships, it’s as critical as any piece of source code, yet it is often the least cared for part of any project. Your schema is what maintains order in your data, and organizing your data is arguably the most important function of any enterprise!

There are mountains of excellent (and practical) resources on how to manage software development projects. I can get certified in Scrum or nearly any Agile development methodology, but I can barely find any useful websites on managing my database in an Agile environment. When your development process encourages change, that applies to your data modeling and schema design, not just your application development.

There are some excellent articles from Fowler and Ambler that provide a good starting point. However, I found a definite lack of detail and practical advice in those articles. It goes without saying that I want to run regression tests and version my work, but a database is distinctly different: I have to deal with all that existing data, and I can’t exactly redeploy a schema while preserving the old data (not easily, anyhow).

Existing persistence strategies often fail to accommodate Agile development, leading to poor design choices in data modeling. In this article I’d like to explore practical solutions to Agile database development. Like anything Agile, there’s no perfect solution, but this is a pervasive problem across all interesting software development projects, and I hope the discussion alone will yield better solutions.

To start, there is one important observation that I’ve found to be true across various software development projects: if the code smells, it’s likely that the database smells worse! Let’s examine some common database smells:

  • Inconsistent relationship strategies; when your ER diagram starts looking like spaghetti and every piece of business logic introduces a different convention, you’ve got a problem. You have object tables and three possible types of relationships (1-1, 1-N, N-M); your data model is only as complicated as you make it. Pick a strategy for each of the three relationship types and stick to it. I liken this to using GOTO statements in software: it’s unacceptable.
  • Inconsistent object model strategies; this is often what forces a change in relationship strategies: inconsistent strategies for object tables lead to very confusing relationships. You’ll see one table with a varchar(20) NAME and another with a varchar(16) NAME. Is this the name of the object, or does the object contain a “name”? Use a consistent strategy for tables as well as for concepts like status, timestamps, IDs, and alternate keys (such as name).
  • Inconsistent naming conventions; STAT_DATE, STATUS_DT, or STATDATE? Pick a convention and stick with it!
  • Inconsistent usage of the same column; a common example is a varchar column named TYPE that means different things to different applications. I’ve noticed that even good data modelers make this mistake.
  • Overloading fields; things like a comma-separated list of values where only the application knows what each value means. Don’t use a relational database if this is how you model – text files may work better!
  • F normal form; Johnny just took a class on database design, learned about normalizing a database, and now you have 175 tables in what he claims is 6NF! Denormalizing is appropriate just as often as normalizing.
  • Know when to OLAP; why are there summary tables attached to each of my transaction tables?
  • The Cauldron of Data; this is the crux of the problem: everyone is so scared of the data that they lose control of the schema and treat it like a bubbling cauldron, too paranoid to make any significant changes for fear of breaking a legacy application. This is the end result of any project that didn’t manage its evolving database schema.
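To make the first few smells concrete, here is a sketch of what one consistent set of conventions might look like. The table names, column lengths, and the FK-on-the-child pattern for 1-N relationships are all hypothetical choices; the point is only that you pick one set and apply it everywhere.

```sql
-- Hypothetical conventions: every object table has a numeric ID primary key,
-- a NAME of one standard length, a STATUS, and a creation timestamp.
CREATE TABLE customer (
  customer_id   NUMBER        PRIMARY KEY,
  name          VARCHAR2(64)  NOT NULL UNIQUE,   -- one NAME type everywhere
  status        VARCHAR2(16)  DEFAULT 'ACTIVE' NOT NULL,
  created_date  DATE          DEFAULT SYSDATE NOT NULL
);

-- One fixed pattern per relationship type. 1-N: foreign key on the child.
CREATE TABLE customer_order (
  order_id      NUMBER        PRIMARY KEY,
  customer_id   NUMBER        NOT NULL REFERENCES customer (customer_id),
  status        VARCHAR2(16)  DEFAULT 'OPEN' NOT NULL,
  created_date  DATE          DEFAULT SYSDATE NOT NULL
);

-- N-M: a join table named PARENT_CHILD with a composite key, nothing else.
CREATE TABLE customer_address (
  customer_id   NUMBER        NOT NULL REFERENCES customer (customer_id),
  address_id    NUMBER        NOT NULL,
  PRIMARY KEY (customer_id, address_id)
);
```

Whatever the specifics, a schema built this way is boring to read, and boring is exactly what you want: every new table answers the same questions the same way.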

Let’s talk about some solutions!

First of all, this is a developer problem! Don’t expect your DBA or Hibernate to fix this for you; if you’re a developer, this is your problem. This leads to the central theme of how I propose database evolution be solved: your development methodology must cover both application and database development.

If your application depends on a database, then your development methodology had better cover both application and database development! I know, you like building the code and leaving the responsibility for the database to someone else. But that brings you back to the Cauldron of Data scenario, where you can’t make any significant changes to a schema because it got out of your control. And if you can’t control the schema, you can hardly control the application that depends on it!

We tend to ignore database development as a way to simplify our application development; I suggest you let application development take on the extra work of adhering to a methodology that includes database development! Think of it like this: you’re going to be the database’s bitch if you don’t take on this responsibility.

I know, this seems like more work on the application side, but like anything done right, it’s hard to imagine doing it differently once your development methodology covers applications and databases. That said, how do you integrate database development into a unified development methodology?

Let’s look at the differences between application development and database development (and what needs to change in traditional Agile development methodologies):

  • Databases contain data that cannot be lost; this means you have to migrate production rather than reinstall
  • Data is easy to migrate when your data is not controlling you (see the Cauldron above)
  • Rebuilding your database is like compiling and deploying your code (this sounds like a maven target)
  • Databases should have unit tests, and not just for the stored procedures (more on this later)
  • Map your database development to your project lifecycle goals exactly as you would with application development (say, in Maven 2), but introduce a migrate step in the deploy target.

If I compile and build my application, why not build the database schemas at the same time, just like any other dependent artifact? So, let’s get practical and talk about things you can actually do to accomplish Agile evolutionary database design:

Create a Database Schema Change Policy

Keep it simple, and make sure you answer how you plan to address both planned and unplanned schema migrations. Your process should lend itself to an emergent property of better schema design, which requires you not only to support planned and unplanned schema changes, but to encourage them. Either do a big design up front (not Agile) or encourage change in all aspects of your development, including your data model. I recommend you clearly define a process for planned migrations (moving from one version of the schema to another) and for unplanned patches (critical fixes, the kind you get in the middle of the night).
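One minimal way to make such a policy enforceable is a version-tracking table that every migration and patch script writes to. This is a sketch under assumed conventions (the table name, columns, and the `MIGRATION`/`PATCH` distinction are all hypothetical), not a prescribed design:

```sql
-- One row per applied change; a deploy step can refuse to run if the
-- database is not at the version it expects.
CREATE TABLE schema_version (
  version       VARCHAR2(16)  NOT NULL,          -- e.g. '1.4.0'
  script_name   VARCHAR2(128) NOT NULL,
  change_type   VARCHAR2(16)  NOT NULL
                CHECK (change_type IN ('MIGRATION', 'PATCH')),
  applied_date  DATE          DEFAULT SYSDATE NOT NULL,
  PRIMARY KEY (version, script_name)
);

-- A planned migration script (say, migrate_1.3_to_1.4.sql) would end by
-- recording itself, so the current state of the schema is always queryable:
INSERT INTO schema_version (version, script_name, change_type)
VALUES ('1.4.0', 'migrate_1.3_to_1.4.sql', 'MIGRATION');
```

The payoff is that “what version is production on?” becomes a query instead of an archaeology project, and your 2 a.m. patches leave a paper trail.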

Bring DBAs in Early

You’ll need their help, and you know it, so it’s best to get friendly with them early on; give them a chance to understand what you’re trying to do on their database. I’d argue you’ll find more resistance to Agile development from software developers than from DBAs. Most DBAs have been on the front line fixing smelly database code and are likely your strongest allies. Not only can they help with the development process, they can (and should) assist with design.

Use Stored Procedures

Read up (separately) on “End-to-End Architecture”. If your schema is going to change, then you had better clearly define your endpoints and provide an API-like package that abstracts the schema completely. What’s great about stored procedures is that they can (and should) be treated like application code. I recommend two types of packages; consider using the suffixes _PKG and _API. All of your object tables will likely have GET, PUT, and DELETE procedures. These should be autogenerated; if not, write yourself a script or invest in some software to autogenerate CRUDL stored procedures. Each schema should have a _PKG with CRUDL procedures for all object tables. You also have business logic, from complicated transactions to simple procedures like authenticate(user, pass). Procedures that encapsulate business logic should go in packages with an _API suffix and follow the same rigorous design you would employ for any application API.

The _API and _PKG suffixes themselves are unimportant (any convention would be fine); what matters is distinguishing between your CRUDL procedures and the APIs that encapsulate your business logic. Once you have a convention that cleanly separates these concepts, you have a mechanism that can completely abstract your schema from your application. Best of all, you’ve likely imposed some constraints and standardization on your object tables that lend themselves to easy autogeneration of the CRUDL procedures.
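A package-spec sketch of the split might look like the following. The names (`customer_pkg`, `security_api`) and parameter shapes are hypothetical, and the bodies are omitted; the point is the division of labor between the two suffixes:

```sql
-- _PKG: mechanical CRUDL procedures for one object table.
-- These are the ones you want to autogenerate from the table definition.
CREATE OR REPLACE PACKAGE customer_pkg AS
  PROCEDURE put_customer    (p_id IN OUT NUMBER, p_name IN VARCHAR2);
  PROCEDURE get_customer    (p_id IN NUMBER, p_row OUT customer%ROWTYPE);
  PROCEDURE delete_customer (p_id IN NUMBER);
  PROCEDURE list_customers  (p_rows OUT SYS_REFCURSOR);
END customer_pkg;
/

-- _API: hand-designed business logic, reviewed and versioned like any
-- application API. Callers never need to know what tables sit behind it.
CREATE OR REPLACE PACKAGE security_api AS
  FUNCTION authenticate (p_user IN VARCHAR2, p_pass IN VARCHAR2)
    RETURN BOOLEAN;
END security_api;
/
```

When the schema changes, you regenerate the _PKG packages and fix up the _API bodies; the application, which only ever calls the packages, may not need to change at all.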

Version your Schema just like you would an Application

Versioning is a given for application code; why should database code be any different? Schemas, default data, packages, grants: everything should be versioned along with ALL other application code. Applications depend on a versioned database just like any other versioned artifact; I would expect my build to fail if the database my application depends on doesn’t exist.

Each of your schemas is like an application, and all of the DDL should be checked into source control and managed accordingly! Check in your test data (SQL inserts) and you’ll easily be able to define a database-specific unit test environment!

Use the Right Tools

You’ll need more than a modeling tool. Modeling tools are great for helping you visualize your schema, but don’t get carried away: you need to track the schema AND the default data! Use tools that fit your process, not the other way around; write your own scripts if you need to, they’re not that hard once you have a working process. Between Maven and some sqlplus scripts, we’ve gotten plenty of mileage at my current job with the following:

  • drop_objects.sql; loops through all of the schemas and drops everything. There’s also a delete-user approach, but with the drop script you don’t have to redefine your tablespace. This script is never run in production.
  • create_objects.sql; loops through all of the schemas and creates all of the tables and default data; this script is never run in production
  • create_pkg_spec.sql, create_pkg_body.sql; loops through all schemas and compiles the package specs and separately the package bodies
  • run_tests.sql; loops through all schemas and runs database unit tests; it uses stored functions with setUp and tearDown procedures, similar to JUnit. This script is never run in production.
  • migrate_objects.sql; loops through all schemas and runs a per-schema migrate script, which is created from the delta between two versions of the same schema.
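The kind of test package run_tests.sql would pick up can be sketched like this. This is a hypothetical example (the `customer` table, the PASS/FAIL return convention, and the procedure names are assumptions), mirroring the setUp/tearDown shape mentioned above:

```sql
-- A test package per object table: set_up seeds known data, tear_down
-- removes it, and each test function returns 'PASS' or 'FAIL' so the
-- driver script can tally results across all schemas.
CREATE OR REPLACE PACKAGE customer_test AS
  PROCEDURE set_up;
  PROCEDURE tear_down;
  FUNCTION  test_put_and_get RETURN VARCHAR2;
END customer_test;
/

CREATE OR REPLACE PACKAGE BODY customer_test AS
  PROCEDURE set_up IS
  BEGIN
    INSERT INTO customer (customer_id, name) VALUES (-1, 'TEST_CUSTOMER');
  END set_up;

  PROCEDURE tear_down IS
  BEGIN
    DELETE FROM customer WHERE customer_id = -1;
  END tear_down;

  FUNCTION test_put_and_get RETURN VARCHAR2 IS
    l_name customer.name%TYPE;
  BEGIN
    SELECT name INTO l_name FROM customer WHERE customer_id = -1;
    RETURN CASE WHEN l_name = 'TEST_CUSTOMER' THEN 'PASS' ELSE 'FAIL' END;
  END test_put_and_get;
END customer_test;
/
```

Because the test data lives in set_up rather than in production tables, the same tests run identically on every developer’s localhost instance.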

Localhost Development

Why else do we have fancy development workstations? Stop assuming Eclipse is allowed to eat up all of your resources; let Oracle do it! The only way to empower your developers to be Agile is to give them an environment where they can easily change the database schema!

We’ve gone so far at my current job as to support localhost Oracle instances, checking Oracle into our software version control (along with Tomcat, Java, etc.). We tried the Express Edition, but it didn’t support all of the PL/SQL code we were developing, so we’ve got the full, bloated 10g running on all of the developer workstations (it takes about 30 minutes to install on a new workstation). So don’t tell me you can’t run MySQL locally!!