Let me tell you something…
How many times have you tried to create that data analysis visualization, spending hours pulling, cleaning, and aggregating your data and only a few minutes actually performing the analysis? This process is very frustrating, especially when you have to redo your analysis on a regular basis and you don’t or cannot automate your data transformations. And what’s even more frustrating is realizing, after spending so much time transforming the data, that your analysis doesn’t answer the question you initially set out to answer.
We data scientists and analysts face very similar problems every day and, despite our enormous interest in drawing insights from sparse data, we often find ourselves in a hole, trying to pull, clean, and transform these bits of information into a workable dataset before we can even start on the best part.
When I decided to start blogging, I wasn’t sure what I was really passionate about. Long story short, I was on a project where I had to determine the underlying drivers of a co-morbid condition in patients with diabetes. Despite having a multi-gigabyte dataset containing millions of records and hundreds of fields, I realized that I was still missing fields that could help answer the main question. By that time I had already put in hours of transformation work, and now I not only had to go somewhere to find more data but also had to redo the whole exploratory analysis from scratch.
Recently I was browsing the web when I stumbled upon a post by someone who was solving a different problem but had exactly the same issues I had. And he too wasn’t happy about the hours spent working inefficiently.
And so I had my blog topic. If all data scientists shared the issues they experience while exploring and analyzing data, we could save a significant amount of time compared with figuring everything out on our own. And it doesn’t matter whether the shared piece of knowledge is common sense, a basic principle, a known cool trick, a rule that works 50% of the time, or a complicated piece of code that automates certain data transformations.
In my upcoming posts I plan to introduce new analytical problems and go over the main issues and pitfalls I faced while solving them. I will then ask you to suggest a more elegant solution or point out other issues I have not thought about. As I gather your feedback, I intend to create a list of the common issues we face on a day-to-day basis and to add material to this list over time. My goal is to build an analytics “knowledge bank” that anyone can benefit from or add to.
Let me know if you have any thoughts or suggestions. I really look forward to our first collaboration.