Data Modeling

Collection: Collect everything, no exceptions.

Over the last few years, I have written about the four pillars of analytics (Collection, Validation, Enhancement and Accessibility). In this blog post, I will focus on the first pillar, Collection. We will get to the others in future posts.

Collect Everything

Yes. One thousand percent yes. By far, the most damaging mistake you can make in analytics is not collecting all potential data points. Even if you don’t know how, or if, you will need them, you should work with engineering to collect all events. Storage is cheap, so do not let it deter your collection efforts.

You can always go back and reprocess data, but you cannot go back and collect it.


Determine what your goals are for users who visit your site/app. It is very important to state the specific goals so you can be sure you are collecting the right information to hit your goals. Some simple examples are:

  • A user who visits will look at a minimum of 5 screens/3 pages
  • A user who visits will request information from the contact screen
  • A user who visits will sign up for 1 service / buy 1 product
  • All users who visit will average 2 minutes in the app / 5 minutes in the site

Once these types of goals are set, you can define your collection processes to be able to answer these as well as other questions.
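As a rough illustration, goals like the ones above can be written down as explicit checks against per-visit data. The sketch below is hypothetical: the `Session` fields, the function names, and the thresholds (taken from the example goals) are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One visit by one user (hypothetical fields)."""
    user_id: str
    screens_viewed: int
    seconds_in_app: float
    requested_info: bool
    purchases: int


def goals_met(s: Session) -> dict[str, bool]:
    """Evaluate the per-visit example goals for a single session."""
    return {
        "viewed_5_screens": s.screens_viewed >= 5,
        "requested_info": s.requested_info,
        "bought_or_signed_up": s.purchases >= 1,
    }


def average_seconds(sessions: list[Session]) -> float:
    """Average time per visit, for the 'all users average 2 minutes' goal."""
    if not sessions:
        return 0.0
    return sum(s.seconds_in_app for s in sessions) / len(sessions)


sessions = [
    Session("u1", screens_viewed=6, seconds_in_app=150, requested_info=True, purchases=0),
    Session("u2", screens_viewed=2, seconds_in_app=90, requested_info=False, purchases=1),
]
first_visit_goals = goals_met(sessions[0])
avg_seconds = average_seconds(sessions)
```

Writing the goals this way makes it obvious which fields have to be collected before any of them can be measured.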

Events & Actions

Once you have an application designed, take time to understand all the events that are possible within it. Think of things like logins, clicks, page/screen paths, hovers, direct actions (buttons, navigation, etc.), indirect actions (scrolling) and data entry (forms). Time-based items like user sessions are also extremely important. You can also note and collect what I call ‘nudges’. Items like notifications to take action and visual cues (bolded items, small motions, etc.) are used to drive traffic, so you should collect data surrounding those items.
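One way to keep all of these event types in a single, consistent shape is a generic event record. This is a minimal sketch, assuming a hypothetical `Event` structure; the field names and the example values are mine, and a real system would agree on them with engineering.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """A single collected event (hypothetical schema)."""
    event_type: str   # e.g. "login", "click", "scroll", "nudge_shown"
    user_id: str
    session_id: str   # ties the event to a user session
    target: str       # the screen, button, or form field involved
    direct: bool      # True for direct actions, False for indirect ones
    properties: dict = field(default_factory=dict)  # anything else known right now
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json(event: Event) -> str:
    """Serialize an event so it can be shipped to whatever store is in use."""
    return json.dumps(asdict(event), sort_keys=True)


click = Event("click", "u1", "s1", "nav.pricing", direct=True, properties={"position": 2})
payload = to_json(click)
```

The free-form `properties` field is the design choice that matters here: it lets you attach details you do not yet know you will need without changing the schema.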

Go through and circle all the direct interactions and make notes where indirect actions occur. Throughout this process, write down every question you can think of. Some simple examples of these are:

  • What do I know about this user? (demographics etc)
  • Has the user been here before? (customer loyalty)
  • Did the user enter data directly?
  • What did the user click in the navigation?
  • When the user left, what was the last page/screen they visited?
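To show how a question like the last one maps onto collected events, here is a small sketch. It assumes each event is a dictionary with hypothetical `session_id`, `screen`, and `occurred_at` keys, and that timestamps are ISO 8601 strings in the same format so they sort correctly as text.

```python
def last_screen_per_session(events: list[dict]) -> dict[str, str]:
    """Return the final screen seen in each session (the exit screen)."""
    last: dict[str, str] = {}
    # Walk events in time order; later events overwrite earlier ones.
    for event in sorted(events, key=lambda e: e["occurred_at"]):
        last[event["session_id"]] = event["screen"]
    return last


events = [
    {"session_id": "s1", "screen": "pricing", "occurred_at": "2024-01-01T10:02:00Z"},
    {"session_id": "s1", "screen": "home", "occurred_at": "2024-01-01T10:00:00Z"},
    {"session_id": "s2", "screen": "contact", "occurred_at": "2024-01-01T11:00:00Z"},
]
exit_screens = last_screen_per_session(events)
```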

I challenge you to get very creative, but stay within the business context and make sure that most of your questions have value. Exploratory questions are fine, but keep the context at the forefront.

External Data

Don’t think of just using your own data. Find and collect external data sets that enhance your own. Maybe you are building an application for runners that collects all kinds of data around running. If you only have 1000 runners using it, your data set is small, but you can go out and find or purchase similar data for other runners (demographics, activity, heart rate, etc.) and bring that into your platform. You can then model the two sets together, creating an enhanced data set with potentially 10x the value.
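Bringing an external data set alongside your own usually starts with mapping both onto a common set of fields. The sketch below assumes made-up field names for both the app data and the purchased data; only the idea of a shared schema plus a `source` tag is the point.

```python
def normalize(record: dict, source: str, field_map: dict[str, str]) -> dict:
    """Rename a record's raw fields to the common names and tag its origin."""
    row = {common: record.get(raw) for common, raw in field_map.items()}
    row["source"] = source
    return row


# Hypothetical mappings: common field name -> raw field name in each data set.
OWN_MAP = {"runner_id": "runner", "age": "age", "avg_heart_rate": "avg_hr"}
EXT_MAP = {"runner_id": "id", "age": "age_years", "avg_heart_rate": "heart_rate_avg"}

own = [{"runner": "r1", "age": 34, "avg_hr": 152}]
external = [{"id": "x9", "age_years": 41, "heart_rate_avg": 147}]

combined = (
    [normalize(r, "app", OWN_MAP) for r in own]
    + [normalize(r, "external", EXT_MAP) for r in external]
)
```

Keeping the `source` tag means you can always separate your own runners from the external ones again if a question calls for it.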


So, you are collecting all this data. Where do you put it? What do you do with it? Well, I’ve said it before: “Storage is cheap”. Although all businesses will have their own preferences, I’ll give you an example of one way to handle storage.

First, store all the raw, un-enhanced data. This is crucial. You should refrain from manipulating data as you pull it from an operational/transactional system. I’m not going to get into validation in this article (that is for the next one). If you are building on a streaming architecture, you have many more options, because data streams can be retained for periods of time (see Kafka). A streaming architecture gives you some flexibility for storing items in a queue. If you are just using a transactional platform, you could pull over all the records for the previous defined period (24 hours, for example).
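As one possible shape for the raw store, the sketch below appends each event, untouched, to a JSON Lines file partitioned by day. The directory layout (`dt=YYYY-MM-DD/events.jsonl`), the function names, and the `occurred_at` field are assumptions for illustration; a real setup might write to object storage or read from a stream instead.

```python
import json
import tempfile
from pathlib import Path


def append_raw(event: dict, root: Path) -> Path:
    """Append one event, exactly as received, to that day's raw file."""
    day = event["occurred_at"][:10]  # assumes an ISO 8601 timestamp
    path = root / f"dt={day}" / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return path


def read_raw(root: Path, day: str) -> list[dict]:
    """Read back every raw event stored for a given day."""
    path = root / f"dt={day}" / "events.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    append_raw({"occurred_at": "2024-01-01T10:00:00+00:00", "event_type": "login"}, root)
    append_raw({"occurred_at": "2024-01-01T10:05:00+00:00", "event_type": "click"}, root)
    stored = read_raw(root, "2024-01-01")
```

Because the files are append-only and never edited, any later processing can be rerun from them at any time.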

Second, decide if you are going to build a data model (hint: you need to). My model of choice for most big data systems is a fact and dimensions model, or star schema. This helps you store data in a form that makes it easy to perform calculations on like data. Think of it like a recycling center. All the data comes in: cans, bottles, paper. Within each of those you can have different varieties (large beer cans, smaller soda cans), but the recycling center melts them all down to their base components while remembering the original details of each. All of them are now just aluminum: a single ‘aluminum’ table with the component details attached, like ‘large/small’ and ‘soda/beer’. Any new ‘aluminum’ items that come in can be reduced to the same common denominator and searched.
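To make the fact and dimensions idea concrete, here is a small sketch using Python’s built-in `sqlite3` module with an in-memory database. The table and column names (`fact_event`, `dim_user`, `dim_screen`) and the sample rows are hypothetical; the point is only the shape: one fact table of events pointing at small descriptive dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (
    user_key INTEGER PRIMARY KEY,
    user_id TEXT UNIQUE,
    loyalty_tier TEXT
);
CREATE TABLE dim_screen (
    screen_key INTEGER PRIMARY KEY,
    screen_name TEXT UNIQUE
);
CREATE TABLE fact_event (
    event_key INTEGER PRIMARY KEY,
    user_key INTEGER NOT NULL REFERENCES dim_user(user_key),
    screen_key INTEGER NOT NULL REFERENCES dim_screen(screen_key),
    event_type TEXT NOT NULL,
    occurred_at TEXT NOT NULL,
    duration_seconds REAL
);
""")

# Dimension rows get keys 1, 2, ... in insertion order.
conn.executemany(
    "INSERT INTO dim_user (user_id, loyalty_tier) VALUES (?, ?)",
    [("u1", "returning"), ("u2", "new")],
)
conn.executemany(
    "INSERT INTO dim_screen (screen_name) VALUES (?)",
    [("home",), ("contact",)],
)
conn.executemany(
    "INSERT INTO fact_event (user_key, screen_key, event_type, occurred_at, duration_seconds) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "view", "2024-01-01T10:00:00Z", 30.0),
        (1, 2, "view", "2024-01-01T10:01:00Z", 45.0),
        (2, 1, "view", "2024-01-01T11:00:00Z", 20.0),
    ],
)

# Calculations on like data: views and total time per screen.
rows = conn.execute("""
    SELECT s.screen_name, COUNT(*) AS views, SUM(f.duration_seconds) AS total_seconds
    FROM fact_event AS f
    JOIN dim_screen AS s ON s.screen_key = f.screen_key
    GROUP BY s.screen_name
    ORDER BY s.screen_name
""").fetchall()
```

In the recycling analogy, `fact_event` plays the role of the ‘aluminum’ table and the dimension tables hold the remembered details.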


The most important takeaway is that you should collect everything. Collect what you can directly measure, collect what you can infer (for the most part), and collect what you know at the moment about the user and attach it to the collected event. Define your goals so you can know if you are successful. Collect EVERYTHING, no matter how insignificant you think it might be. It just may be the difference in understanding your users’ behavior.