Data Engineering Podcast

By Tobias Macey



Category: Software How-To


Subscribers: 137
Reviews: 0

Description

Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry

Episode Date
Build Your Data Analytics Like An Engineer - Episode 81
00:56:46
In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.
May 20, 2019
Using FoundationDB As The Bedrock For Your Distributed Systems - Episode 80
01:06:02
The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.
May 07, 2019
Running Your Database On Kubernetes With KubeDB - Episode 79
00:50:54
Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage system in the same platform as your application. In this episode Tamal Saha explains how the KubeDB project got started, why you might want to run your database with Kubernetes, and how to get started. He also covers some of the challenges of managing stateful services in Kubernetes and how the fast pace of the community has contributed to the evolution of KubeDB. If you are at any stage of a Kubernetes implementation, or just thinking about it, this is definitely worth a listen to get some perspective on how to leverage it for your entire application stack.
Apr 29, 2019
Unpacking Fauna: A Global Scale Cloud Native Database - Episode 78
00:53:50
One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter's infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.
Apr 22, 2019
Index Your Big Data With Pilosa For Faster Analytics - Episode 77
00:43:41
Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.
Apr 15, 2019
Serverless Data Pipelines On DataCoral - Episode 76
00:53:41
How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.
Apr 08, 2019
Why Analytics Projects Fail And What To Do About It - Episode 75
00:36:30
Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure and not all of them are under our control. However, many of them are, and as data engineers we can help to keep our projects on the path to success. Eugene Khazin is the CEO of PrimeTSR where he is tasked with rescuing floundering analytics efforts and ensuring that they provide value to the business. In this episode he reflects on the ways that data projects can be structured to provide a higher probability of success and utility, how data engineers can help throughout the project lifecycle, and how to salvage a failed project so that some value can be gained from the effort.
Apr 01, 2019
Building An Enterprise Data Fabric At CluedIn - Episode 74
00:57:49
Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to build a business aimed at building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.
Mar 25, 2019
A DataOps vs DevOps Cookoff In The Data Kitchen - Episode 73
00:54:31
Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.
Mar 18, 2019
Customer Analytics At Scale With Segment - Episode 72
00:47:46
Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.
Mar 04, 2019
Deep Learning For Data Engineers - Episode 71
00:42:46
Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.
Feb 25, 2019
The Alluxio Distributed Storage System - Episode 70
00:59:44
Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.
Feb 19, 2019
Building Machine Learning Projects In The Enterprise - Episode 69
00:48:18
Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.
Feb 11, 2019
Cleaning And Curating Open Data For Archaeology - Episode 68
01:00:55
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.
Feb 04, 2019
Managing Database Access Control For Teams With strongDM - Episode 67
00:42:17
Controlling access to a database is a solved problem... right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads.
Jan 29, 2019
Building Enterprise Big Data Systems At LEGO - Episode 66
00:48:03
Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.
Jan 21, 2019
TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65
00:41:25
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time oriented events.
Jan 14, 2019
Performing Fast Data Analytics Using Apache Kudu - Episode 64
00:50:46
The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.
Jan 07, 2019
Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63
00:44:42
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host's mind is blown by the possibilities of treating everything, including schema information, as a stream.
Dec 31, 2018
Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62
01:03:51
Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
Dec 24, 2018
Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61
00:39:22
Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
Dec 17, 2018
Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60
00:50:31
Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
Dec 10, 2018
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59
00:54:25
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and etcd, and what is in store for the future.
Dec 03, 2018
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
00:39:18
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data the folks at Dremio built a self service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
Nov 26, 2018
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
00:48:01
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world's largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
Nov 19, 2018
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
00:51:50
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Nov 11, 2018
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
00:58:04
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
Nov 05, 2018
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
00:40:54
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn't necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
Oct 29, 2018
Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53
00:45:32
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
Oct 22, 2018
Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52
00:53:45
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
Oct 15, 2018
Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51
00:56:54
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn't have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented workloads and analytical, high volume workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Oct 09, 2018
Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50
00:52:52
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing to scale the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
Oct 01, 2018
A Primer On Enterprise Data Curation with Todd Walter - Episode 49
00:49:35
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
Sep 24, 2018
Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48
00:47:48
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
Sep 17, 2018
Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47
00:48:08
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.
Sep 10, 2018
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46
00:47:16
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Building a master data set helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
Sep 03, 2018
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45
00:24:41
There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
Aug 27, 2018
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44
00:42:39
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph. However, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.
Aug 20, 2018
Putting Airflow Into Production With James Meickle - Episode 43
00:48:05
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
Aug 13, 2018
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42
00:56:21
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements has changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and describes the ongoing efforts to add the complex features needed by the demanding workloads of today's data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
Aug 06, 2018
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41
00:29:14
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
Jul 30, 2018
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40
00:48:30
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
Jul 16, 2018
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39
01:04:15
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.
Jul 08, 2018
Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38
00:46:13
Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.
Jul 02, 2018
Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37
00:41:43
Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
Jun 25, 2018
User Analytics In Depth At Heap with Dan Robinson - Episode 36
00:45:27
Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven't been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
Jun 17, 2018
CockroachDB In Depth with Peter Mattis - Episode 35
00:43:41
With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
Jun 11, 2018
ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34
00:40:05
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
Jun 04, 2018
The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33
00:47:50
Building an ETL pipeline is a common need across businesses and industries. It's easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.
May 28, 2018
PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32
00:42:07
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.
May 21, 2018
Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31
00:26:05
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week's episode consists of a pair of brief interviews conducted on-site at the conference. First up you'll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.
May 14, 2018
Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30
00:32:38
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week's episode consists of a pair of brief interviews conducted on-site at the conference. First up you'll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
May 07, 2018
Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29
00:44:46
Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization's data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.
Apr 30, 2018
Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28
00:39:52
The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.
Apr 23, 2018
Data Engineering Weekly with Joe Crobak - Episode 27
00:43:31
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.
Apr 15, 2018
Defining DataOps with Chris Bergh - Episode 26
00:54:30
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.
Apr 08, 2018
ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25
00:51:52
Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack's director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.
Apr 01, 2018
MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24
00:33:27
The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.
Mar 25, 2018
Stretching The Elastic Stack with Philipp Krenn - Episode 23
00:51:02
Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.
Mar 19, 2018
Database Refactoring Patterns with Pramod Sadalage - Episode 22
00:49:05
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
Mar 12, 2018
The Future Data Economy with Roger Chen - Episode 21
00:42:47
Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O'Reilly AI conference and an investor for data driven businesses Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
Mar 05, 2018
Honeycomb Data Infrastructure with Sam Stokes - Episode 20
00:41:33
One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in your production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.
Feb 26, 2018
Data Teams with Will McGinnis - Episode 19
00:28:38
The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.
Feb 19, 2018
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18
01:02:40
As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.
Feb 11, 2018
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17
00:53:46
One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
Feb 04, 2018
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
01:02:58
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.
Jan 29, 2018
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15
00:37:12
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called "Dark Data" sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.
Jan 22, 2018
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14
00:45:43
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.
Jan 15, 2018
Citus Data: Distributed PostgreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13
00:46:44
PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostgreSQL, and how you can start using it in your environment.
Jan 08, 2018
Wallaroo with Sean T. Allen - Episode 12
00:59:13
Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.
Dec 25, 2017
SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11
00:00:00
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.
Dec 18, 2017
Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10
00:49:21
To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
Dec 10, 2017
data.world with Bryon Jacob - Episode 9
00:46:24
We have tools and platforms for collaborating on software projects and linking them together. Wouldn't it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.
Dec 03, 2017
Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8
00:51:43
With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
Nov 22, 2017
Buzzfeed Data Infrastructure with Walter Menendez - Episode 7
00:43:40
Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.
Nov 14, 2017
Astronomer with Ry Walker - Episode 6
00:42:50
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.
Aug 06, 2017
Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5
00:42:27
Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn't, and what he would do differently if he was starting over today.
Jun 18, 2017
ScyllaDB with Eyal Gutkind - Episode 4
00:35:06
If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.
Mar 18, 2017
Defining Data Engineering with Maxime Beauchemin - Episode 3
00:45:20
What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.
Mar 05, 2017
Dask with Matthew Rocklin - Episode 2
00:46:00
There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings the power of Python to bear on the problem of big data.
Jan 22, 2017
Pachyderm with Daniel Whitenack - Episode 1
00:44:42
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!
Jan 14, 2017
Introducing The Show - Episode 0
00:04:23
Are you looking for a podcast that discusses the tools, techniques, and culture of data engineering? Then you've come to the right spot!
Jan 08, 2017