Data Quality: Enter the 4th Dimension

Data quality is a uniform cause of deep pain in BI projects. The more systems that are involved, the harder it gets to clear up, and that is before you even start accounting for how old they are, how up to speed the SMEs are, how poor front end validation was – there’s a host of potential problems. However, something tells me that the number of BI projects where the customer has said that it’s OK if the numbers on the reports are wrong is going to remain pretty small.

Scope, Cost, Time – Choose one. But not that one.

Project Management Triangle

I’m sure most of you are familiar with the Project Management Triangle, which dictates that you vary two of Scope, Cost or Time to fix the other. The end result is that Quality, sitting in the middle, gets affected. In practice, for BI projects Cost and Time tend to be least negotiable, so Scope gets restricted. Yet somehow Time and Cost get blown out anyway. Whilst BI is hardly unique in terms of cost and schedule overruns, there is one key driver which is neglected by traditional methods. Once again I’ll lean on Larissa Moss’s Extreme Scoping approach, which calls out the reason: in a BI project Quality – specifically Data Quality – is also fixed. The data must be complete and the data must be accurate for it to be usable – and there is no room for negotiation on this. Given that the data effort consumes around 80% of a BI/DW project’s budget, this becomes a significant concern.
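To make "complete" a little more concrete, here is a minimal Python sketch of measuring how often a required field is actually populated. The field names, the sample rows and the `completeness` function are all hypothetical, made up for illustration. It only looks at completeness; accuracy needs business rules that a generic check like this cannot know about.

```python
def completeness(rows, required_fields):
    """Return, per required field, the fraction of rows with a non-empty value."""
    total = len(rows)
    result = {}
    for field in required_fields:
        # Treat both missing/None and empty strings as "not populated".
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result


# Hypothetical extract from a customer system.
customers = [
    {"customer_id": "1", "postcode": "2000"},
    {"customer_id": "2", "postcode": ""},
    {"customer_id": "3", "postcode": None},
    {"customer_id": "4", "postcode": "3000"},
]

scores = completeness(customers, ["customer_id", "postcode"])
for field, score in scores.items():
    print(f"{field}: {score:.0%} populated")
```

A number like this is a useful conversation starter with the business, because it puts a cost on "the data must be complete" before anyone commits to a date.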

How do we centralise Quality as a constraint?

So now we have to get the business to accept that the traditional levers can’t be pulled in the way they are used to, and that requires a little end user education. The business needs to be made aware that Quality is a fixed constraint – one that they are imposing, albeit perhaps only implicitly. The business has to accept that if Quality is not a variable, then the traditional “pick two to play with” becomes “prepare to vary all of them”. Larissa Moss refers to this as an “Information Age Mental Model”, which centralises quality of output above all else. Here is where strong leadership comes into play. Ultimately, if one part of the business demands a certain piece of information, the BI team will have to be clear with them that to obtain that data at the quality which is mandated, they must be prepared to bear the costs of doing so. That includes the cost of bringing it up to a standard that is enterprise grade and reusable, so that it integrates with the whole solution for both past and future components of the system. This of course does not mean that an infinite budget is opened up to deal with each data item. Some data may not be worth the cost of acquisition. What it does mean is that the discussion about the costs can be more honest, and the consumer can be more aware of the drivers for the issues that will arise from trying to obtain their data.

What makes BI different from an Agile perspective?

Let’s start with a bold proposition: Waterfall doesn’t work for BI. Agile doesn’t work for BI.

Why doesn’t Waterfall work?

On the face of it, the reasons for Waterfall not working for BI are easy to tease out for those familiar with the field. Waterfall requires fixed inputs and outputs with a high degree of certainty over requirements. This works (to a limited degree) for some IT projects, as it is possible to define these. Infrastructure is a good example. Even in application development there is often a reasonable degree of certainty in terms of requirements and desired functionality. In a BI project these are often easy to define to start with, but rapidly unwind as requirements change or are clarified (often being worked out as part of the project’s progression) and the user realises that what they asked for is either unattainable or was addressing the wrong question in the first place.

…but surely Agile fixes this?

Well, no. It should, as in theory it adapts better to rapid change, but when Agile is attempted it rapidly descends into chaos due to a lack of deliverables coming out of the sprints. Project owners and managers get nervous, and Waterfall rears its ugly head again in an attempt to provide an illusion of certainty and control. Agile methods should help deal with the level of change that BI projects seem to inherently contain, but somehow they struggle. Much of this is down to a problem common to both Agile and Waterfall.

Is BI development the same as software development?

Here’s the rub. In “Extreme Scoping: An Agile Approach to Enterprise Data Warehousing and Business Intelligence”, Larissa Moss calls out that one of the biggest issues with trying to get Agile with BI projects is the data integration component. While in a BI project the output artifacts are not unlike traditional coding – reports, ETL jobs and the like – the input artifact – data – is nothing like an easily definable requirement. If your CRM data and your Web Analytics data need to be inter-meshed, there’s a whole bunch of groundwork to do to uncover how: common keys to be found, business rules to be teased out, access to be obtained. That may be as simple as looking up two common identifiers – in which case, good luck – but usually it’s far more complex.
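As a rough illustration of that groundwork, the Python sketch below checks how many CRM records can be matched to web analytics records on a candidate common key. Everything in it is an assumption for the sake of the example: the choice of email address as the key, the sample values, and the `normalise_key` and `key_overlap` helpers. Real integration usually needs far messier matching rules than trimming and lower-casing.

```python
def normalise_key(raw):
    """Trim whitespace and lower-case a key so trivially different spellings match."""
    return raw.strip().lower()


def key_overlap(crm_keys, web_keys):
    """Return the share of distinct CRM keys that also appear in the web analytics keys."""
    crm = {normalise_key(k) for k in crm_keys}
    web = {normalise_key(k) for k in web_keys}
    if not crm:
        return 0.0
    return len(crm & web) / len(crm)


# Hypothetical sample data: email address as the candidate common key.
crm_emails = ["Ann@Example.com", "bob@example.com ", "cy@example.com", "dee@example.com"]
web_emails = ["ann@example.com", "bob@example.com", "zed@example.com"]

overlap = key_overlap(crm_emails, web_emails)
print(f"{overlap:.0%} of CRM customers can be matched to web analytics")
```

If a check like this comes back with a low match rate, that tells you the "simple" join is really a data exercise, and it tells you so before anyone has put a number of days against the ETL job.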

This is what kills both Waterfall and Agile delivery. In estimation it’s assumed that the coding is the work, when in reality the coding is really the expression of a data exercise which could be a matter of both interpretation and integration. In software development the coding is the expression of an idea – e.g. “I want to be able to use a shopping cart plug-in to sell my product”. In BI development it is an expression of the data, which is bounded by the vagaries of the quality of the data, how it should be interpreted and how it interacts with all your other data. Resolving this extends far beyond calling an ETL job complex and assigning it 5 days in the project plan.

So what do we do next?

Whatever method we adopt, we have to listen to Larissa’s message – that we cannot run a BI project the same way we run a coding project. Time needs to be allowed to resolve the data issues – up front – before any estimate of difficulty is assigned. After all, once the problem is understood, defining a report as simple or complex is easy enough – but until the problem is understood you may as well be estimating with a dartboard.

Check out Larissa’s book here:

The fine art of starting to adopt Agile with a Zero sprint

Agile methodologies have a patchy track record in BI/DW projects. A lot of this is to do with adopting the methodologies themselves – as I’ve alluded to in prior posts, there are a heap of obstacles in the way that are cultural, process and ability based.

I was discussing agile adoption with a client who readily admitted that their last attempt had failed completely. The conversation turned to the concept of the Zero sprint and he admitted that part of the reason for the failure was that they had allowed Zero time for their Zero sprint.

What is this Zero sprint anyway?

The reality of any technical project is that there are always certain fundamental decisions and planning processes that need to be gone through before any meaningful work can be done. Data Warehouses are particularly vulnerable to this – you need servers, an agreed design approach, a set of ETL standards – before any valuable work can be done – or at least without incurring so much technical debt that your project gets sunk after the first iteration cleaning up after itself.

So the Zero Sprint is all that groundwork that needs to be done before you get started. It feels counter agile, as you can easily spend a couple of months producing nothing of any direct value to the business/customer. The business will of course wonder where the productivity nirvana is – and what is particularly galling is that you need your brightest and best on it to make sure you get a solid foundation put in place, so it’s not a particularly cheap phase either.

How to structure and sell the Zero sprint

The structure part is actually pretty easy. There’s a set of things you need to establish which will form a fairly stable product backlog. Working out how long they will take isn’t that hard either as experienced team members will be able to tell you how long it takes to do pieces like the conceptual architecture. It just needs to be run like a long sprint.

Selling it as part of an Agile project is a bit harder. Because you end up not delivering any business consumable value, you need to be very clear about what you will deliver, when you will deliver it and what value it adds to the project. It starts smelling a lot like Waterfall at this point, so if the business is skeptical that anything has changed, you have to manage their expectations well. Be clear that once the initial hump is passed, the value will flow. If you skip it, the value will flow earlier, in line with their expectations, but soon afterwards the pipes will clog with technical debt (though you may want to use different terminology!).

This post reproduced courtesy of BI Monkey

Productivity issues for Agile in BI/DW – Part 2: Technology

Agile in a BI/DW environment faces a unique set of challenges that make becoming productive more difficult. These issues fall into a couple of categories. First are the difficulties in getting the team to the productivity nirvana promised, which I covered in this post. Second are the difficulties posed by technology and process, which I’ll talk about today.

Some obstructions cannot be moved by thought alone.

Solving problems by thought alone

Agility in traditional coding environments runs at a very high level like this: User states requirements, Coder develops an application that meets those requirements, test, showcase, done.

In BI/DW environments the process is less contained and has a lot of external dependencies. A user requesting a metric on a report is not a matter of coding to meet that requirement – we need to find the data source, find the data owner, get access to the data, process it, clean it, conform it and then finally put it on the report. Depending on the size and complexity of the organisation this can take anywhere between days and months to resolve.

Agile development as it is traditionally understood, with short sprints and close user engagement, works well for reporting and BI when the data has already been loaded into the Warehouse. If you are starting from scratch, your user will often have become bored and wandered off long before you give them any reporting.

(Yes, once again, nobody cares about the back end because it’s boring and complicated)

Rather than move the mountain to Mohammed…

There are some steps you can take to mitigate this. The product backlog is your friend here. Often with some relatively light work on the backlog you can identify which systems you are going to hit and broadly what data you will need from those systems.

On a large scale project you may find that you have multiple systems to target, all of which will vary in terms of time from discovery to availability in the DW. Here I generally advocate switching to a Kanban type approach (i.e. task by task rather than sprint based) where you try to move your tasks forward as best you can. Once you are blocked getting at one system, move on to another while you wait for the first to unblock.
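The "move on to whatever isn’t blocked" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool recommendation: the `SourceTask` class, the `next_workable` function and the systems on the board are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SourceTask:
    system: str            # source system the task belongs to
    step: str              # e.g. "get access", "profile data", "build ETL"
    blocked: bool = False  # True while waiting on someone outside the team


def next_workable(board):
    """Return the first task that is not blocked, or None if everything is waiting."""
    for task in board:
        if not task.blocked:
            return task
    return None


# Hypothetical board, in priority order.
board = [
    SourceTask("CRM", "get access", blocked=True),
    SourceTask("Web Analytics", "profile data"),
    SourceTask("Finance", "find data owner"),
]

task = next_workable(board)
print(f"Work on: {task.system} ({task.step})")
```

The point of keeping the board in priority order is that the team always takes the most valuable piece of work it is actually able to progress, rather than idling on the top item until a sprint boundary.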

As systems get delivered into the EDW you can start moving to delivering BI in a more interactive, sprint based fashion. I generally advocate decoupling the BI team from the DW team for this reason. The DW team work on a different dynamic and timescale to the BI team (though note I count building Data Marts as a BI function, not a DW function). You do run the risk of building Data Warehouse components that are not needed, but knowing you will be discarding some effort is part of Agile thinking, so it shouldn’t be a big concern.

Once again it’s about people

You may notice that none of the issues I’ve raised here are set-in-stone technical issues. It’s still about people – the ability of external people to react to or accommodate your needs, the capacity of users to be engaged in protracted development processes, the flexibility of project sponsors not to have a rigid scope.

Good people who can be flexible and accommodate change are the keystone to agile success. No tool or process will ever trump these factors.

This post reproduced with permission from BI Monkey

Productivity issues for Agile in BI/DW – Part 1: People

Agile in a BI/DW environment faces a unique set of challenges that make becoming productive more difficult. These issues fall into a couple of categories. First are the difficulties in getting the team to the productivity nirvana promised. Second are the difficulties in simply being productive. Today I’ll focus on the first case.

Productivity nirvana is hard to find.

A core principle of Agile is the cross functionality of teams – so if there is slack in demand for one type of resource in a sprint, that resource can help out where there is stress on another. So a coder may pick up some test work, a web developer may help with some database design, a tester may help with some documentation and so on. The end result is that team members can pretty much jump into each other’s shoes for basic tasks and only lean on the specialists for the tricky bits.

In BI/DW this cross-skilling is harder to pull off. The technical specialisation is more extreme – people tend to sit in the ETL, Cube or Report developer buckets and it’s taken them quite a while to get there. There is occasional crossover between a couple of technologies (usually at the BI end between Cube and Report) but true polymaths are very rare. Plus, being good at each of these technologies tends to need a very different mindset – ETL developers tend to need to be methodical, logical thinkers with a strong eye for detail and a love of databases – whereas report developers are often more creative and engage more with people (the business). This makes hopping into other team members’ shoes quite hard.

Meditations on the path

These things can be overcome to an extent by limiting the domains where cross-skilling is expected. This can be done in smaller teams by focusing the areas where the team can support each other away from the technical – for example testing or documentation can be pretty process driven and an ETL developer can easily test a report. Expectations around cross-skilling need to be reined in and the sprint planned with that in mind. This isn’t to say that cross-skilling can’t arise – but the time to get there is going to be a lot longer.

In larger teams you can look at dividing up the teams into areas where cross-skilling is more practical. Typically I like to partition the DW and BI teams, though I take the perspective that your data mart ETL developer is part of the BI team, which means you do need a bit of a flexible player in that BI ETL role.

Once again it’s about people

A topic I like to hammer home is that most of your project concerns are not technical or process driven – it’s all about people, specifically people’s ability and willingness to adapt and learn. Picking team members who can adapt, are willing to adapt and can see the value to themselves in doing so is going to get you to the productivity nirvana that much faster.

This post reproduced with permission from BI Monkey