Many folks just getting started with data science have an illusory idea of the field as a breeding ground where state-of-the-art machine learning algorithms are produced day after day, hour after hour, second after second. While it is true that getting to push out cool machine learning models is part of the work, it’s far from the only thing you’ll be doing as a data scientist.
In reality, data science involves quite a bit of not-so-shiny grunt work to even make the available data corpus suitable for analysis. According to a Twitter poll conducted in 2019 by data scientist Vicki Boykis, fewer than 5% of respondents claimed to spend the majority of their time on ML models [1]. The largest percentage of data scientists said that most of their time was spent cleaning up the data to make it usable.
And once it’s clean, there’s a plethora of data processing and analysis tasks to do that have little to do with machine learning. In this article, I’ll discuss this topic from two angles: 1) the often overlooked aspects of data science and 2) the potential issues that can arise from a blind focus on machine learning.
What is data science all about?
At its core, data science is all about gleaning meaningful insights from data. There are two important things to understand about this definition: 1) data is almost never available in a convenient state that lends itself to analysis and 2) inferential modeling (i.e. machine learning) is far from the only way to extract insights from data.
When I was a senior, my university’s computer science department began offering a new special topics course, CS 194: Data Engineering. The course was designed to teach students how to manage data at scale through projects focused on real-world applications (both professors are founders of successful data-focused startups). The course markets itself as teaching students the whole data science life cycle and setting them up for success as data engineers and scientists.
If you look at the course syllabus [2], you’ll notice that machine learning is only a small part of the course description. In fact, a quick glance at the homepage [3] reveals a detailed list of topics, very few of which seem related to machine learning.
So, what does the class teach? The first few weeks are heavily focused on data structuring, querying, and cleaning. This is significant — most data science courses give students nice, pretty data sets to work with and focus on the analysis parts. This makes partial sense from a pedagogical perspective, but it’s also misleading. Real-world data will almost never be in the form that you want it, and a large part of your work will involve structuring and processing the data to make it usable at all.
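To make the cleaning step concrete, here is a minimal sketch in Python using only the standard library. The data, column names, and the `clean_rows` helper are all made up for illustration; they are not taken from the course. The sketch shows the kind of unglamorous work that has to happen before any analysis: trimming whitespace, normalizing casing, handling a missing value, and dropping a duplicate row.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace,
# a missing age, and a duplicated row.
RAW = """name,city,age
 Alice ,seattle,34
BOB,Seattle ,
 Alice ,seattle,34
carol,PORTLAND,29
"""

def clean_rows(raw_text):
    """Return de-duplicated rows with normalized text and integer ages."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        age = int(age_text) if age_text else None  # keep missing ages as None
        key = (name, city, age)
        if key in seen:  # skip rows that are duplicates after normalization
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned

rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Even in this toy case, the duplicate only becomes detectable after the text is normalized, which is one reason cleaning tends to come before everything else.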
The second half of the course discusses various ways to process data, including summarization, visualization (my own area of research), approximation, parallelization, and more. Each of these is a detailed topic that deserves its own article to do it justice, but my point here is simple: there are a lot of ways to gain insight into data, and machine learning is merely one of them.
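As a rough illustration of two of these techniques, summarization and approximation, the sketch below describes a data set with a few summary statistics and then estimates the same statistics from a small random sample. The data is synthetic and the `summarize` helper is a hypothetical name chosen for this example; no machine learning is involved.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 100,000 synthetic response times in milliseconds.
values = [random.gauss(200, 50) for _ in range(100_000)]

def summarize(data):
    """Describe a list of numbers with a few summary statistics."""
    return {
        "count": len(data),
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
    }

# Summarization: describe the full data set.
full = summarize(values)

# Approximation: estimate the same statistics from a 1% random sample.
sample = random.sample(values, 1_000)
approx = summarize(sample)

print("full:  ", full)
print("sample:", approx)
```

The sample-based estimates will usually land close to the full-data values at a fraction of the cost, which is the basic trade-off behind approximate processing of large data sets.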
What can go wrong if you blindly apply machine learning?
Good data science is about more than just the numbers.
I’m a student in the Human-Centered Data Science Lab at the University of Washington, and while I love programming and quantitative analysis, that is not true of all the lab members. In fact, I have a frequent collaborator who never writes a line of code, and yet he writes in-depth research papers on machine classification, recommender systems, and a bunch of other topics that seem directly related to machine learning.
How? Well, talk to him and he’ll tell you that his goal is to advance the human side of data science so that he can influence the field to become more diverse, equitable, and inclusive. Many of the issues we see in advanced technologies today, such as racially biased facial recognition algorithms, were born out of a restrictive practice of data science that failed to account for bias in the data itself, something that purely quantitative methods tend to overlook. Through the use of qualitative research methods and ethical design principles, researchers like my lab-mate hope to change that in the years to come.
The point is simple: parts of data science today remain ethically questionable because of a relentless and ill-founded obsession with pure machine learning. We must shift our perspective if we are to improve the field moving forward.
Final Thoughts
Data science is a burgeoning field, and reducing it to one concept is a misrepresentation that is at best false and at worst dangerous. To excel in the field as a whole, it’s necessary to remove the pop-culture tunnel vision that seems to notice only machine learning.
It’s perhaps easier to see with an analogy to other disciplines. Reducing data science to machine learning is akin to reducing computer science to just building iOS apps. Or, for a less technical example, it’s like thinking that the culinary arts involve only pretty, shiny desserts. But just as there’s an entire meal you must cook properly first in order to make the dessert meaningful, there’s quite a bit of important work you need to do before you push out that model.
But if you’re willing to do all the other stuff, it’ll make your eventual model much, much better. If nothing else, remember that.
Article courtesy – Medium