On this blog, we seek to get past the spin & platitudes of much advertising on data science. So, I’m delighted to bring you a new guest blogger, to share his reality.
Chris Bose is a Data Scientist, running a small technical PR company, In Press PR. As you’ll read, his focus is on smaller datasets & textual data. However, I believe his experience helps illuminate the reality for many data scientists.
So, prepare to look beyond the infographics & theoretical articles: here is what a real data scientist spends their time doing. Over to Chris, to share his answer to my question: “What do you spend your time doing? What is the average week of a Data Scientist?”
Average week of a Data Scientist: things I wish I’d known
I have been applying data science to marketing problems for the last seven years. Here are several things I wish I’d known at the beginning:
- where to find data
- data wrangling
- data leakage
- precision and recall
- bias and variance
- how to use Upwork effectively
- how to use Amazon AWS effectively
- how to choose algorithms
- how to apply domain knowledge
- how to deal with the curse of high dimensional data
- why feature engineering is important.
I come across some, or all, of these issues every week.
I work with text documents only, and use two-state classification (the document either belongs to the class, or it doesn’t). I use techniques like Support Vector Machines, Random Forests and XGBoost to get my results. I am not using Big Data: my datasets are around 3,400 GB in total.
In my experience, most “data science” is about the quality of your training data and your test sets. If you put rubbish into your classification methods, you only get rubbish out.
Given that, here are some of the issues I come across in an average week. I hope they help shed light on what Data Scientists do.
Average week of a Data Scientist: where to find “good” data
There are many datasets that are publicly available, usually for free. You can find them on Amazon S3. I have to keep telling myself that most of the data on the internet is hidden. Search engines only crawl the observable web, but there is a wealth of databases that you cannot find from ordinary searches. Teasing out their locations, and finding out how to get the information, is an ongoing issue. Then, you will find that most public data has many mistakes. These range from missing entries, to entries with the wrong data for the database table, data traps and so on.
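As a rough illustration of checking a public dataset for missing entries and entries of the wrong type, here is a minimal Python sketch. The column names, the sample rows and the `audit_rows` helper are all hypothetical, made up for this example; a real dataset will need its own rules.

```python
import csv
import io

def audit_rows(rows, required, numeric):
    """Return (row_number, column, problem) tuples for suspicious entries."""
    problems = []
    for i, row in enumerate(rows, start=1):
        # Flag required columns that are empty or whitespace only.
        for col in required:
            if not (row.get(col) or "").strip():
                problems.append((i, col, "missing"))
        # Flag non-empty values that should be numbers but are not.
        for col in numeric:
            value = (row.get(col) or "").strip()
            if value:
                try:
                    float(value)
                except ValueError:
                    problems.append((i, col, "not numeric"))
    return problems

# Hypothetical sample standing in for a downloaded public dataset.
SAMPLE = """title,year,publisher
Widget review,2014,Acme Weekly
,2015,Acme Weekly
Gadget launch,unknown,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = audit_rows(rows, required=["title", "publisher"], numeric=["year"])
for problem in problems:
    print(problem)
```

A report like this does not fix anything by itself, but it tells you which rows to look at before they reach a classifier.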
Average week of a Data Scientist: data wrangling
As well as Paul’s previous post, there is a useful Wikipedia page about data wrangling. I urge you to read it:
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.
Data wrangling is transforming the datasets you find. They were probably constructed a long time ago, for a different reason. So, they need to be transformed into a form that your machine learning program can understand. I cannot overstate the importance of this. I have seen lots of reported machine learning results that are nonsense. Often, this is because the data that was actually used by the program was not the data the user thought was being used.
In my work, this translates into a lot of text manipulation. For text and numbers, I use this as a reference:
Regular expressions are a central element of UNIX utilities like egrep and programming languages such as Perl. But whether you’re a UNIX …
You also need to be familiar with the syntax for sed and awk. I have automated a lot of these text manipulation processes, but you must always eyeball a subset of the data, to make sure it still makes sense. I have learnt that it is very easy to trust automated programs that then just spew out nonsense, usually because of operator error. So, quality control is a very important part of my working week.
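To make the “automate, then eyeball” habit concrete, here is a small Python sketch. It is not my production pipeline: the two regular expressions, the sample documents and the helper names `clean_text` and `sample_for_review` are assumptions chosen for illustration.

```python
import random
import re

TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")       # runs of spaces, tabs and newlines

def clean_text(raw):
    """Strip HTML-like tags and collapse runs of whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    return WS_RE.sub(" ", no_tags).strip()

def sample_for_review(docs, k=3, seed=0):
    """Pick a reproducible random subset of documents to read by eye."""
    rng = random.Random(seed)
    return rng.sample(docs, min(k, len(docs)))

# Hypothetical raw documents.
raw_docs = [
    "<p>New   product\tlaunch</p>",
    "Quarterly results\n\n<b>up</b> 5%",
    "  plain text, already clean  ",
]

cleaned = [clean_text(doc) for doc in raw_docs]

# Print a few cleaned documents so a human can check they still make sense.
for doc in sample_for_review(cleaned, k=2):
    print(repr(doc))
```

Fixing the random seed means the same subset comes back on every run, which is handy when you want a colleague to look at exactly the documents you looked at.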
Average week of a Data Scientist: Dealing with data leakage
Dealing with data leakage, in my work, means making sure that the training data has not “leaked” into the test datasets. If this happens, the models can make unrealistically good predictions. You may think that everything is looking really good, and the sun is shining, when in reality your model is useless in the real world. You can read more about the issue of data leakage here:
Climb the world’s most elite machine learning leaderboards
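One simple check for leakage between training and test sets is to look for documents that appear in both. The Python sketch below is only an illustration under stated assumptions: the example documents and the helpers `fingerprint` and `find_leaks` are hypothetical, and it only catches copies that are identical after lower-casing and whitespace normalisation. Near-duplicates, such as lightly edited press releases, would need a fuzzier comparison.

```python
import hashlib
import re

def fingerprint(text):
    """Hash a case- and whitespace-normalised copy of the document."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def find_leaks(train_docs, test_docs):
    """Return indices of test documents that also appear in the training set."""
    train_prints = {fingerprint(doc) for doc in train_docs}
    return [i for i, doc in enumerate(test_docs) if fingerprint(doc) in train_prints]

# Hypothetical documents: the first test document is a re-cased copy of a training one.
train = ["Acme launches new widget", "Quarterly results are up"]
test = ["acme  launches new WIDGET", "Gadget recall announced"]

leaks = find_leaks(train, test)
print(leaks)
```

Any index this returns points at a test document that should be removed, or moved wholesale into the training set, before you trust your scores.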
Average week of a Data Scientist: False Positives, False Negatives
This is about balancing precision and recall, bias and variance. Finding the right balance between True Positives and False Positives is what my work is all about.
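For readers new to the terms: precision is the share of documents the model put in the class that really belong there, and recall is the share of documents that belong in the class that the model actually found. A minimal Python sketch for the two-state case follows; the labels are invented for illustration, with 1 meaning “belongs to the class”.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = in the class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical true labels and model predictions for eight documents.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the model makes two false positive calls and misses one document, so precision comes out lower than recall. Which of the two matters more depends on what a false positive costs your client compared with a false negative.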
Average week of a Data Scientist: Amazon AWS
When I first started in text classification, seven years ago, I used every available desktop computer in my company. I needed them to run my models, usually out of working hours and at weekends. As my datasets began to get bigger, I was always looking for extra computing resources. I had two choices: buy new computers in-house that I could control, or purchase outside resources.
Amazon AWS is heaven-sent for small companies like mine. We could not afford to buy the server resources of big companies, yet we can be on a par with them by renting resources from Amazon. I spend some time, every week, managing computing resources on Amazon AWS. All of my text pre-processing is now done on Amazon AWS. It is an amazing resource for small companies. I urge you to find a programmer on Upwork who will write you the scripts you need, often for a few hundred pounds.
No such thing as an Average week of a Data Scientist
I don’t really have a routine week. Every week has different priorities. Much depends on whether I am integrating new datasets, analysing results, or communicating with clients. Data science is about solving real-world problems, which in practice means failing every day.
Candour from Data Scientists
Many thanks to Chris for sharing his day-to-day experience. I hope you’ve all found his candour helpful.
If you are working in business today, as a Data Scientist, please join in this conversation. Whether you use the comment links below or social media, we’d love to hear your experience. Are there other challenges, priorities or tools you’d highlight?
Meanwhile, have a wonderful Easter!