Towards Data Science

By The TDS team

Category: Technology

Subscribers: 42
Reviews: 0

Description

Sharing concepts, ideas, and codes

Episode Date
18. Edouard Harris - Mastering the data science job hunt
3138

Getting hired as a data scientist, machine learning engineer or data analyst is hard. And if there’s one person who’s spent a *lot* of time thinking about why that is, and what you can do about it if you’re trying to break into the field, it’s Edouard Harris.

Ed is the co-founder of SharpestMinds, a data science mentorship program that’s free until you get a job. He also happens to be my brother, which makes this our most nepotistic episode yet.

Jan 15, 2020
17. Nate Nichols - Product instinct and data storytelling
2848

If there’s one trend that not nearly enough data scientists seem to be paying attention to heading into 2020, it’s this: data scientists are becoming product people.

Five years ago, that wasn’t the case at all: data science and machine learning were all the rage, and managers were impressed by fancy analytics and over-engineered predictive models. Today, a healthy dose of reality has set in, and most companies see data science as a means to an end: it’s a way of improving the experience of real users and real, paying customers, and not a magical tool whose coolness is self-justifying.

At the same time, as more and more tools continue to make it easier and easier for people who aren’t data scientists to build and use predictive models, data scientists are going to have to get good at new things. And that means two things: product instinct, and data storytelling.

That’s why we wanted to chat with Nate Nichols, a data scientist turned VP of Product Architecture at Narrative Science — a company that’s focused on addressing data communication. Nate is also the co-author of Let Your People Be People, a (free) book on data storytelling.

Jan 07, 2020
16. Helen Ngo - Real Talk with Machine Learning Engineer
1743

In this podcast episode, Helen Ngo and YK (aka CS Dojo) discuss deepfakes, NLP, and women in data science.

Dec 16, 2019
15. Ian Xiao - Why Machine Learning Is More Boring Than You May Think
660

In this podcast interview, YK (aka CS Dojo) asks Ian Xiao about why he thinks machine learning is more boring than you may think. 


Original article: https://towardsdatascience.com/data-science-is-boring-1d43473e353e 

Dec 09, 2019
14. Jeremie Harris - Building a Data Science Startup & Getting Into Data Science
1219

The other day, I interviewed Jeremie Harris, a SharpestMinds cofounder, for the Towards Data Science podcast and YouTube channel. SharpestMinds is a startup that helps people who are looking for data science jobs by finding mentors for them.

In my opinion, their system is interesting in that a mentor only gets paid when their mentee lands a data science job. I wanted to interview Jeremie because I had previously spoken to him on a different occasion, and I wanted to personally learn more about his story, as well as his thoughts on today’s data science job market.

Dec 02, 2019
13. Jessica Li - Predicting Snowmelt Patterns with Deep Learning and Satellite Imagery
2987

Hi! It's YK here from CS Dojo. In this episode, I interviewed Jessica Li from Kaggle about how she worked with NASA to predict snowmelt patterns using deep learning. Hope you enjoy!

Nov 25, 2019
12. Rachael Tatman - Data science at Kaggle
2899

One question I’ve been getting a lot lately is whether graduate degrees — especially PhDs — are necessary in order to land a job in data science. Of course, education requirements vary widely from company to company, which is why I think the most informative answers to this question tend to come not from recruiters or hiring managers, but from data scientists with those fancy degrees, who can speak to whether they were actually useful.

That’s far from the only reason I wanted to sit down with Rachael Tatman for this episode of the podcast though. In addition to holding a PhD in computational sociolinguistics, Rachael is a data scientist at Kaggle, and a popular livestreaming coder (check out her Twitch stream here). She has a lot of great insights about breaking into data science, how to get the most out of Kaggle, the future of NLP, and yes, the value of graduate degrees for data science roles.

Nov 06, 2019
11. Sanjeev Sharma - DataOps and data science at enterprise scale
2496

One thing that you might not realize if you haven’t worked as a data scientist in very large companies is that the problems that arise at enterprise scale (as well as the skills that are needed to solve them) are completely different from those you’re likely to run into at a startup.

Scale is a great thing for many reasons: it means access to more data sources, and usually more resources for compute and storage. But big companies can take advantage of these things only by fostering successful collaboration between and among large teams (which is really, really hard), and have to contend with unique data sanitation challenges that can’t be addressed without reinventing practically the entire data science life cycle.

So I’d say it’s a good thing we booked Sanjeev Sharma, Vice President of Data Modernization and Strategy at Delphix, for today’s episode. Sanjeev’s specialty is helping huge companies with significant technical debt modernize and upgrade their data pipelines, and he’s seen the ins and outs of data science at enterprise scale for longer than almost anyone.

Oct 31, 2019
10. Sanyam Bhutani - Data science beyond the classroom
2415

A few years ago, there really wasn’t much of a difference between data science in theory and in practice: a Jupyter notebook and a couple of imports were all you really needed to do meaningful data science work. Today, as the classroom overlaps less and less with the realities of industry, it’s becoming more and more important for data scientists to develop the ability to learn independently and go off the beaten path.

Few people have done so as effectively as Sanyam Bhutani, who among other things is an incoming ML engineer at H2O.ai, a top-1% Kaggler, popular blogger and host of the Chai Time Data Science Podcast. Sanyam has a unique perspective on the mismatch between what’s taught in the classroom and what’s required in industry: he started doing ML contract work while still in undergrad, and has interviewed some of the world’s top-ranked Kagglers to better understand where the rubber meets the data science road. 


Oct 22, 2019
9. Ben Lorica - Trends in data science with O'Reilly Media's Chief Data Scientist
3209

The trend towards model deployment, engineering and just generally building “stuff that works” is just the latest step in the evolution of the now-maturing world of data science. It’s almost guaranteed not to be the last one though, and staying ahead of the data science curve means keeping an eye on what trends might be just around the corner. That’s why we asked Ben Lorica, O’Reilly Media’s Chief Data Scientist, to join us on the podcast.

Not only does Ben have a mile-high view of the data science world (he advises about a dozen startups and organizes multiple world-class conferences), but he also has a perspective that spans two decades of data science evolution.

Oct 15, 2019
8. George Hayward: comedian, lawyer and data scientist
3064

Each week, I have dozens of conversations with people who are trying to break into data science. The main topic of the conversations varies, but it’s rare that I walk away without getting a question like, “Do you think I have a shot in data science given my unusual background in [finance/physics/stats/economics/etc]?”.

From now on, my answer to that question will be to point them to today’s guest, George John Jordan Thomas Aquinas Hayward.

George [names omitted] Hayward’s data science career is a testament to the power of branding and storytelling. After completing a JD/MBA at Stanford and reaching top-ranked status in Hackerrank’s SQL challenges, he went on to work on contract for a startup at Google, and subsequently for a number of other companies. Now, you might be tempted to ask how comedy and law could possibly lead to a data science career.

Oct 08, 2019
7. Serkan Piantino - From Facebook to startups: data science is becoming an engineering problem
2356

For today’s podcast, we spoke with someone who is laser-focused on considering this second possibility: the idea that data science is becoming an engineer’s game. Serkan Piantino served as the Director of Engineering for Facebook AI Research, and now runs machine learning infrastructure startup Spell. Their goal is to make dev tools for data scientists that make it as easy to train models on the cloud as it is to train them locally. That experience, combined with his time at Facebook, has given him a unique perspective on the engineering best practices that data scientists should use, and the future of the field as a whole.

Oct 01, 2019
6. Jay Feng - Data science in the startup world
2701

I’ve said it before and I’ll say it again: “data science” is an ambiguous job title. People use the term to refer to data science, data engineering, machine learning engineering and analytics roles, and that’s bad enough. But worse still, being a “data scientist” means completely different things depending on the scale and stage of the company you’re working at. A data scientist at a small startup might have almost nothing in common with a data scientist at a massive enterprise company, for example.

So today, we decided to talk to someone who’s seen data science at both scales. Jay Feng started his career working in analytics and data science at Jobr, which was acquired by Monster.com (which was itself acquired by an even bigger company). Among many other things, his story sheds light on a question that you might not have thought about before: what happens to data scientists when their company gets acquired?

Sep 25, 2019
5. Rocio Ng - Data science and product management at LinkedIn
2719

Most software development roles are pretty straightforward: someone tells you what to build (usually a product manager), and you build it. What’s interesting about data science is that although it’s a software role, it doesn’t quite follow this rule.

That’s because data scientists are often the only people who can understand the practical business consequences of their work. There’s only one person on the team who can answer questions like, “What does the variance in our cluster analysis tell us about user preferences?” and “What are the business consequences of our model’s ROC score?”, and that person is the data scientist. In that sense, data scientists have a very important responsibility not to leave any insights on the table, and to bring business instincts to bear even when they’re dealing with deeply technical problems.

For today’s episode, we spoke with Rocio Ng, a data scientist at LinkedIn, about the need for strong partnerships between data scientists and product managers, and the day-to-day dynamic between those roles at LinkedIn. Along the way, we also talked about one of the most common mistakes that early career data scientists make: focusing too much on that first role.

Sep 19, 2019
4. Akshay Singh - The thin line between data science and data engineering
2946

If you’ve been following developments in data science over the last few years, you’ll know that the field has evolved a lot since its Wild West phase in the early/mid 2010s. Back then, a couple of Jupyter notebooks with half-baked modeling projects could land you a job at a respectable company, but things have since changed in a big way.

Today, as companies have finally come to understand the value that data science can bring, more and more emphasis is being placed on the implementation of data science in production systems. And as these implementations have required models that can perform on larger and larger datasets in real-time, an awful lot of data science problems have become engineering problems.

That’s why we sat down with Akshay Singh, who among other things has worked in and managed data science teams at Amazon, League and the Chan-Zuckerberg Initiative (formerly Meta.com).

Sep 10, 2019
Susan Holcomb - Nontechnical career skills for data scientists
3041

It’s easy to think of data science as a technical discipline, but in practice, things don’t really work out that way. If you’re going to be a successful data scientist, people will need to believe that you can add value in order to hire you, people will need to believe in your pet project in order to endorse it within your company, and people will need to make decisions based on the insights you pull out of your data.

Although it’s easy to forget about the human element, managing it is one of the most useful skills you can develop if you want to climb the data science ladder, and land that first job, or that promotion you’re after. And that’s exactly why we sat down with Susan Holcomb, the former Head of Data at Pebble, the world’s first smartwatch company.

When Pebble first hired her, Susan was fresh out of grad school in physics, and had never led a team, or interacted with startup executives. As the company grew, she had to figure out how to get Pebble’s leadership to support her effort to push the company in a more data-driven direction, at the same time as she managed a team of data scientists for the first time. 

Aug 14, 2019
Tan Vachiramon - Choosing the right algorithm for your real-world problem
2601

You import your data. You clean your data. You make your baseline model. 

Then, you tune your hyperparameters. You go back and forth from random forests to XGBoost, add feature selection, and tune some more. Your model’s performance goes up, and up, and up.

And eventually, the thought occurs to you: when do I stop?

Most data scientists struggle with this question on a regular basis, and from what I’ve seen working with SharpestMinds, the vast majority of aspiring data scientists get the answer wrong. That’s why we sat down with Tan Vachiramon, a member of the Spatial AI team at Oculus, and former data scientist at Airbnb.

Tan has seen data science applied in two very different industry settings: once, as part of a team whose job it was to figure out how to understand their customer base in the middle of a whirlwind of out-of-control user growth (at Airbnb); and again in a context where he’s had the luxury of conducting far more rigorous data science experiments under controlled circumstances (at Oculus).

My biggest take-home from our conversation was this: if you’re interested in working at a company, it’s worth taking some time to think about their business context, because that’s the single most important factor driving the kind of data science you’ll be doing there. Specifically:

  • Data science at rapidly growing companies comes with a special kind of challenge that’s not immediately obvious: because they’re growing so fast, no matter where you look, everything looks like it’s correlated with growth! New referral campaign? “That definitely made the numbers go up!” New user onboarding strategy? “Wow, that worked so well!”. Because the product is taking off, you need special strategies to ensure that you don’t confuse the effectiveness of a company initiative you’re interested in with the inherent viral growth that the product was already experiencing. 
  • The amount of time you spend tuning or selecting your model, or doing feature selection, entirely depends on the business context. In some companies (like Airbnb in the early days), super-accurate algorithms aren’t as valuable as algorithms that allow you to understand what the heck is going on in your dataset. As long as business decisions don’t depend on getting second-digit-after-the-decimal levels of accuracy, it’s okay (and even critical) to build a quick model and move on. In these cases, even logistic regression often does the trick!
  • In other contexts, where tens of millions of dollars depend on every decimal point of accuracy you can squeeze out of your model (investment banking, ad optimization), expect to spend more time on tuning/modeling. At the end of the day, it’s a question of opportunity costs: keep asking yourself if you could be creating more value for the business if you wrapped up your model tuning now, to work on something else. If you think the answer could be yes, then consider calling model.save() and walking away.
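To make the "when do I stop?" question a bit more concrete, here is a minimal sketch of one possible stopping rule. It is not something Tan prescribed in the episode: the function name should_stop_tuning and the min_gain and patience thresholds are hypothetical, and the right values depend entirely on what an extra decimal point of accuracy is worth to the business.

```python
from typing import Sequence


def should_stop_tuning(
    scores: Sequence[float],
    min_gain: float = 0.005,  # assumed: smallest improvement still worth another round
    patience: int = 2,        # assumed: number of weak rounds tolerated before stopping
) -> bool:
    """Return True once each of the last `patience` tuning rounds improved
    the best validation score so far by less than `min_gain`."""
    if len(scores) <= patience:
        return False  # not enough rounds yet to judge

    best = scores[0]
    gains = []
    for score in scores[1:]:
        # Gain is measured against the best score seen before this round.
        gains.append(max(0.0, score - best))
        best = max(best, score)

    return all(gain < min_gain for gain in gains[-patience:])


# Validation scores after each tuning round (made-up numbers for illustration).
print(should_stop_tuning([0.70, 0.78, 0.81]))                # False
print(should_stop_tuning([0.70, 0.78, 0.81, 0.812, 0.813]))  # True
```

At a company in the "understand what is going on" phase, you might set min_gain high and stop almost immediately; where every decimal point matters, you would set it much lower.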


Jul 16, 2019
Joel Grus - The case against the jupyter notebook
2852

To most data scientists, the Jupyter notebook is a staple tool: it’s where they learned the ropes, and it’s where they go to prototype models or explore their data. Basically, it’s the default arena for all their data science work.

But Joel Grus isn’t like most data scientists: he’s a former hedge fund manager and former Googler, and author of Data Science From Scratch. He currently works as a research engineer at the Allen Institute for Artificial Intelligence, and maintains a very active Twitter account.

Oh, and he thinks you should stop using Jupyter notebooks. Now.

When you ask him why, he’ll provide many reasons, but a handful really stand out:

  • Hidden state: let’s say you define a variable like a = 1 in the first cell of your notebook. In a later cell, you assign it a new value, say a = 3. This results in fairly predictable behavior as long as you run your notebook in order, from top to bottom. But if you don’t (or worse still, if you run the a = 3 cell and delete it later), it can be hard or impossible to know from a simple inspection of the notebook what the true state of your variables is.
  • Replicability: one of the most important things to do to ensure that you’re running repeatable data science experiments is to write robust, modular code. Jupyter notebooks implicitly discourage this, because they’re not designed to be modularized (awkward hacks do allow you to import one notebook into another, but they’re, well, awkward). What’s more, to reproduce another person’s results, you need to first reproduce the environment in which their code was run. Vanilla notebooks don’t give you a good way to do that. 
  • Bad for teaching: Jupyter notebooks make it very easy to write terrible tutorials — you know, the kind where you mindlessly hit “shift-enter” a whole bunch of times, and make your computer do a bunch of stuff that you don’t actually understand? It leads to a lot of frustrated learners, or even worse, a lot of beginners who think they understand how to code, but actually don’t.
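The hidden state problem is easy to reproduce outside a notebook. The sketch below is only a simulation: each string stands in for a notebook cell, and a plain dictionary plays the role of the kernel's shared namespace. The names cells, run, and cell_3 are made up for illustration, and the third cell is an addition so there is something to observe.

```python
# Each entry stands in for one notebook cell (illustrative names only).
cells = {
    "cell_1": "a = 1",
    "cell_2": "a = 3",
    "cell_3": "result = a * 10",
}


def run(order, cells):
    """Execute cells in the given order against one shared namespace,
    the way a notebook kernel keeps state between cell runs."""
    kernel = {}
    for name in order:
        exec(cells[name], kernel)
    return kernel["result"]


# Running top to bottom behaves as the notebook text suggests.
top_to_bottom = run(["cell_1", "cell_2", "cell_3"], cells)
print(top_to_bottom)  # 30

# Same cells, different execution order, different answer.
out_of_order = run(["cell_2", "cell_1", "cell_3"], cells)
print(out_of_order)  # 10

# Delete cell_2 after it has run: the live kernel still holds a = 3 (result 30),
# but anyone re-running the cells that remain visible gets a different result.
visible = {name: src for name, src in cells.items() if name != "cell_2"}
fresh_restart = run(["cell_1", "cell_3"], visible)
print(fresh_restart)  # 10
```

Nothing in the visible notebook text distinguishes these cases, which is exactly the complaint.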

Overall, Joel’s objections to Jupyter notebooks seem to come in large part from his somewhat philosophical view that data scientists should follow the same set of best practices that any good software engineers would. For instance, Joel stresses the importance of writing unit tests (even for data science code), and is a strong proponent of using type annotation (if you aren’t familiar with that, you should definitely learn about it here). 
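For anyone who hasn't seen these two practices side by side, here is a small sketch of what they can look like on data science code. The normalize function and its tests are hypothetical examples written for this note, not code from Joel; the tests are plain functions of the kind a runner such as pytest would collect.

```python
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Scale values linearly into the range [0, 1].

    The annotations state that the function takes and returns a list of
    floats, which a type checker can verify without running the code.
    Expects a non-empty list.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid dividing by zero and map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_range() -> None:
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_normalize_constant_input() -> None:
    assert normalize([3.0, 3.0]) == [0.0, 0.0]
```

The constant-input test is the kind of edge case that tends to go unnoticed in a notebook until it breaks something downstream.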

But even Joel thinks Jupyter notebooks have a place in data science: if you’re poking around at a pandas dataframe to do some basic exploratory data analysis, it’s hard to think of a better way to produce helpful plots on the fly than the trusty ol’ Jupyter notebook. 

Whatever side of the Jupyter debate you’re on, it’s hard to deny that Joel makes some compelling points. I’m not personally shutting down my Jupyter kernel just yet, but I’m guessing I’ll be firing up my favorite IDE a bit more often in the future.

Jul 16, 2019