Data Lakes
A Primer for Consumer Goods Success
Data lakes are a valuable asset for CPG manufacturers because they pull together information from internal systems and external partners. When a visualization tool like Tableau, Power BI, or Domo is layered on top of this combined retailer and internal data, analysts and executives gain never-before-seen views of the business across all areas, including sales, supply chain, demand planning, and logistics.
Through our leadership in the CPG analytics world, TR3 has been called in to assist with dozens of these projects. This paper shares the most important lessons we’ve learned: the practices that make the difference between data lake success and struggle.
Any data lake must be able to translate all sales, supply chain, and logistics metrics into one view of item performance. Yes, you’ve worked hard for decades to make sure your ERP’s item master is clean, standardized, and shared. But on its own it won’t make the retailer data meaningful, because each retailer identifies your items with its own numbers and pack configurations.
Step one is to create a cross-retailer item master; step two is to relate your internal material master to that new cross-retailer item master. We find this work to be the most critical and the most frequently skipped, and skipping it renders most data comparisons useless. In fact, it is so important that we have developed a methodology to handle the complexities (prime items, mixed packs, and exploded items are just a few of the lurking challenges).
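To make the idea concrete, here is a minimal sketch of such a mapping in Python. It is illustrative only: the item numbers, retailer names, and the explosion of a mixed pack into component quantities are all hypothetical, and a production item master would live in governed tables rather than in code.

```python
from dataclasses import dataclass, field

@dataclass
class CrossRetailerItem:
    """One canonical item in the cross-retailer item master."""
    item_id: str                  # canonical cross-retailer key
    erp_material: str             # link back to the internal ERP item master
    retailer_ids: dict[str, str] = field(default_factory=dict)  # retailer -> their item number
    # Mixed packs "explode" into component items with quantities, so
    # pack-level sales can roll up to each component's performance.
    components: dict[str, int] = field(default_factory=dict)    # item_id -> units per pack

# Hypothetical examples; all identifiers are illustrative, not real data.
master = {
    "CR-1001": CrossRetailerItem("CR-1001", "ERP-774410",
                                 retailer_ids={"retailer_a": "551234987",
                                               "retailer_b": "204-00-1187"}),
    "CR-2001": CrossRetailerItem("CR-2001", "ERP-774490",
                                 retailer_ids={"retailer_a": "551234999"},
                                 components={"CR-1001": 6}),  # a 6-count mixed pack
}

def resolve(retailer: str, retailer_item: str) -> CrossRetailerItem | None:
    """Translate a retailer's item number into the canonical item."""
    return next((i for i in master.values()
                 if i.retailer_ids.get(retailer) == retailer_item), None)
```

The key design point is that every retailer identifier resolves to one canonical item, which in turn carries the link back to the ERP material, so any retailer feed can be joined to internal data.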
Half of the projects we’ve been brought into were already underway, with teams of analysts manually pulling numbers together from various retailer portals. The desire for quick progress leads many organizations to start with manual data pulls, because it’s the quickest way to deal with the wide variety of platforms. This soon becomes overwhelming: manual efforts are error prone and time consuming. Even if you cobble together semi-automated loads, they will regularly break when retailers change their portals without notice. Go with vendor-maintained, automated connectors.
One customer freed up a team of 12 analysts by moving to automated connectors, allowing the team to focus its time on using the data to improve business results.
The quickest way to stall one of these projects is to lose your audience’s interest because they don’t believe the numbers. Retailer data feeds vary in quality and regularly have issues, so you need a reconciliation process that ties back to the retailer’s published numbers before making your initial data available. You will also need an ongoing filtration process to clean retailer feeds before publication. And you must find a way to publicize when updated data is, or isn’t, available. Why? Because the community will come to rely on your data, and if it is “wrong” the community will lose faith. Our approach is to provide visibility (think of a mini dashboard) into when data is received, cleansed, and then published.
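As one way to picture that visibility, the sketch below tracks each feed through received, cleansed, and published stages. The stage names, fields, and the `dashboard` helper are our own illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum

class FeedStage(Enum):
    RECEIVED = "received"
    CLEANSED = "cleansed"
    PUBLISHED = "published"

@dataclass
class FeedStatus:
    retailer: str
    business_date: date      # the business day the feed covers
    stage: FeedStage
    updated_at: datetime     # when the feed reached this stage
    note: str = ""           # e.g. "restated by retailer", "failed reconciliation"

def dashboard(statuses: list[FeedStatus]) -> None:
    """Print a mini status board so users know which data is current and trusted."""
    for s in sorted(statuses, key=lambda s: (s.retailer, s.business_date)):
        print(f"{s.retailer:10} {s.business_date}  {s.stage.value:9}  "
              f"as of {s.updated_at:%Y-%m-%d %H:%M}  {s.note}")
```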
Until recently, dotcom was a rounding error for most companies, and these numbers were lumped into overall channel metrics. No longer: increased focus and the surge in dotcom sales make it a critical channel to manage and grow. For your data lake, that means separating dotcom from brick-and-mortar sales, and that demands handling the nuances each retailer brings with it. For example, some retailers pull dotcom orders from store inventory and others from dedicated inventory; you need to accommodate this in your data lake, or the lack of transparency in your channel data will disappoint your stakeholders.
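A hedged sketch of how that channel split might be modeled follows. The retailer names and their fulfillment models are placeholders; each retailer’s actual behavior must be confirmed before relying on it.

```python
from dataclasses import dataclass

# Hypothetical per-retailer configuration: where dotcom orders are fulfilled from.
# These values are placeholders; confirm each retailer's real model.
DOTCOM_INVENTORY_SOURCE = {
    "retailer_a": "store",      # ship-from-store: dotcom sales deplete store inventory
    "retailer_b": "dedicated",  # fulfillment-center inventory, separate from stores
}

@dataclass
class SalesRecord:
    retailer: str
    item_id: str
    units: int
    order_type: str  # "dotcom" or "store", as reported in the retailer feed

def channel_and_inventory(rec: SalesRecord) -> tuple[str, str]:
    """Return (reporting channel, inventory pool the sale depletes)."""
    if rec.order_type == "dotcom":
        source = DOTCOM_INVENTORY_SOURCE.get(rec.retailer, "unknown")
        return "dotcom", source
    return "brick_and_mortar", "store"
```

Tagging every record this way keeps dotcom visible as its own channel while still attributing inventory depletion to the correct pool.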
When pulling data from retailer portals, it will be tempting to poll frequently so that you get the latest data as soon as it’s available. However, retailer portals are often heavily loaded, and over-polling can pummel them, effectively mounting a denial of service that crashes their systems and gets your account locked. Be thoughtful: factor in the timing of scheduled data releases and how frequently to re-poll.
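The sketch below shows one polite polling pattern: wait for the scheduled release window, then re-poll with a capped exponential backoff. The release time, delay values, and the `fetch_feed` stub are assumptions to be replaced with each retailer’s actual schedule and connector.

```python
import time
from datetime import datetime, time as dtime

RELEASE_TIME = dtime(hour=6)        # assumed portal publish time; confirm per retailer
BASE_DELAY, MAX_DELAY = 900, 7200   # re-poll every 15 min, backing off to 2 hours

def fetch_feed() -> bytes | None:
    """Placeholder: download the feed, returning None if it isn't posted yet."""
    raise NotImplementedError  # swap in the real portal/connector call

def poll_once_per_day() -> bytes:
    # Wait until the scheduled release window before the first attempt.
    while datetime.now().time() < RELEASE_TIME:
        time.sleep(60)
    delay = BASE_DELAY
    while True:
        data = fetch_feed()
        if data is not None:
            return data
        time.sleep(delay)                  # back off between attempts so we
        delay = min(delay * 2, MAX_DELAY)  # never hammer a loaded portal
```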
This is going to be a lot of data: dozens of customers, tens of thousands of stores, all of your SKUs, and many time frames. If you hand your data visualization tool the raw data and depend on it to handle summarization, you will disappoint your users with slow response times. You need two or three “levels” of data. Take the time to design the right data structures based on usage, including pre-aggregated cubes that help get numbers into the right hands as fast as possible.
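Here is one simple way to build those levels, sketched with pandas. The column names and the choice of item-week and retailer-week grains are illustrative assumptions based on typical usage, not a fixed prescription.

```python
import pandas as pd

def build_aggregates(facts: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Pre-compute rollups so the BI tool never scans raw store/item/day rows.

    `facts` is assumed to have columns: retailer, store_id, item_id,
    week, units, sales_dollars.
    """
    return {
        # Level 1: raw detail stays in the lake for drill-down only.
        "detail": facts,
        # Level 2: retailer x item x week -- the workhorse for most reports.
        "item_week": (facts.groupby(["retailer", "item_id", "week"], as_index=False)
                           [["units", "sales_dollars"]].sum()),
        # Level 3: retailer x week -- small enough for instant executive views.
        "retailer_week": (facts.groupby(["retailer", "week"], as_index=False)
                               [["units", "sales_dollars"]].sum()),
    }
```

Pointing each dashboard at the smallest table that answers its question is what keeps response times fast as the data grows.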
Here’s our recommendation for your initial CPG data lake project:
Allow adequate time for this first stage; even with our proven methodology, this is typically an intensive 30-day sprint for our customers. Without those tools to jump-start things, plan on a 4-to-6-month timeline.