Data Science at Home

By Francesco Gadaleta



Category: Podcasting


Description

Technology, machine learning and algorithms. Come join the discussion on Discord! https://discord.gg/4UNKGf3

Episodes
Machine learning in production: best practices [LIVE from twitch.tv]
37:31

Hey there! Having the best time of my life ;)

This is the first episode I have recorded while live on my new Twitch channel :) So much fun!

Feel free to follow me for the next live streams. You can also watch me code machine learning stuff in Rust :))

Don't forget to jump on the usual Discord and have a chat

I'll see you there!

Sep 16, 2020
Testing in machine learning: checking deeplearning models (Ep. 118)
18:17

In this episode I speak with Adam Leon Smith, CTO at DragonFly and an expert in testing strategies for software and machine learning.
We cover testing with deep learning (neuron coverage, threshold coverage, sign change coverage, layer coverage, etc.), combinatorial testing, and their practical aspects.
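
To give a concrete feel for one of these metrics, here is a minimal sketch (my illustration, not code from the episode) of neuron coverage: the fraction of units that at least one test input activates above a threshold. The tiny two-layer network and the threshold value are arbitrary assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def neuron_coverage(weights, biases, test_inputs, threshold=0.25):
    """Fraction of hidden units activated above `threshold` by at least one test input."""
    activated = []
    a = test_inputs
    for W, b in zip(weights, biases):
        a = relu(a @ W + b)
        # a neuron counts as covered if any test input pushes it over the threshold
        activated.append((a > threshold).any(axis=0))
    return np.concatenate(activated).mean()

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]
biases = [np.zeros(8), np.zeros(3)]
test_inputs = rng.normal(size=(100, 4))
print(f"neuron coverage: {neuron_coverage(weights, biases, test_inputs):.2%}")
```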

On September 15th there will be a live@Manning Rust conference. In one Rust-full day you will attend many talks about what's special about Rust, building high-performance web services or video games, WebAssembly, and much more.
If you want to meet the tribe, tune in on September 15th to the live@Manning Rust conference.

Sep 04, 2020
Testing in machine learning: generating tests and data (Ep. 117)
20:18

In this episode I speak with Adam Leon Smith, CTO at DragonFly and an expert in testing strategies for software and machine learning.

 

On September 15th there will be a live@Manning Rust conference. In one Rust-full day you will attend many talks about what's special about Rust, building high-performance web services or video games, WebAssembly, and much more.
If you want to meet the tribe, tune in on September 15th to the live@Manning Rust conference.

Aug 29, 2020
Why you care about homomorphic encryption (Ep. 116)
18:50

After deep learning, a new entry is just about ready to take the stage. The usual journalists are warming up their keyboards for blogs, news feeds, tweets, in one word: hype.
This time it's all about privacy and data confidentiality. The new words: homomorphic encryption.
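
To give a taste of what computing on encrypted data means, here is a minimal sketch using the python-paillier library (phe), which implements additively homomorphic encryption (a partially, not fully, homomorphic scheme); the numbers are arbitrary.

```python
# pip install phe
from phe import paillier

# generate a keypair; a server would only ever see the public key and ciphertexts
public_key, private_key = paillier.generate_paillier_keypair()

a = public_key.encrypt(3.5)
b = public_key.encrypt(2.0)

# additions and scalar multiplications happen directly on ciphertexts,
# without ever decrypting the operands
encrypted_sum = a + b
encrypted_scaled = a * 4

print(private_key.decrypt(encrypted_sum))     # 5.5
print(private_key.decrypt(encrypted_scaled))  # 14.0
```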

 

Join and chat with us on the official Discord channel.

 

Sponsors

This episode is supported by Amethix Technologies.

Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. They are a consulting firm focused on data science, machine learning, and artificial intelligence.

 

References


Towards a Homomorphic Machine Learning Big Data Pipeline for the Financial Services Sector

IBM Fully Homomorphic Encryption Toolkit for Linux

Aug 12, 2020
Test-First machine learning (Ep. 115)
19:44

In this episode I speak about a testing methodology for machine learning models that are meant to be integrated into production environments.
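
As a flavour of what test-first can look like for a model, here is a hedged pytest sketch (my own illustration, not the episode's exact methodology): acceptance tests written against behavioural thresholds fixed before the model ships. The dataset, model and thresholds are stand-ins.

```python
# test_model.py -- run with: pytest test_model.py
import numpy as np
import pytest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

@pytest.fixture(scope="module")
def model_and_data():
    X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, X_test, y_test

def test_accuracy_above_threshold(model_and_data):
    model, X_test, y_test = model_and_data
    # acceptance criterion agreed on *before* training the model
    assert model.score(X_test, y_test) >= 0.85

def test_invariance_to_small_noise(model_and_data):
    model, X_test, _ = model_and_data
    noisy = X_test + np.random.default_rng(0).normal(0, 1e-3, X_test.shape)
    # tiny perturbations should almost never flip a prediction
    assert (model.predict(X_test) == model.predict(noisy)).mean() >= 0.99
```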

Don't forget to come chat with us in our Discord channel

 

Enjoy the show!

 

--

This episode is supported by Amethix Technologies.

 

Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. They are a consulting firm focused on data science, machine learning, and artificial intelligence.

Aug 03, 2020
GPT-3 cannot code (and never will) (Ep. 114)
19:06

The hype around GPT-3 is alarming and paints an awful picture of how people misunderstand artificial intelligence. In response to some comments claiming that GPT-3 will take developers' jobs, in this episode I express some personal opinions about the state of AI in generating source code (and about GPT-3 in particular).

 

If you have comments about this episode or just want to chat, come join us on the official Discord channel.

This episode is supported by Amethix Technologies.

Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. They are a consulting firm focused on data science, machine learning, and artificial intelligence.

Jul 26, 2020
Make Stochastic Gradient Descent Fast Again (Ep. 113)
20:35

There is definitely room for improvement in the family of stochastic gradient descent algorithms. In this episode I explain a relatively simple method that has been shown to improve on the Adam optimizer. But watch out! This approach does not generalize well.
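
The method itself is not named in these notes, so as background here is a minimal numpy sketch of the vanilla Adam update that such improvements build on; the hyperparameters are the common defaults.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# minimise f(x) = x^2 starting from x = 5
x, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t)
print(x)  # close to 0
```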

Join our Discord channel and chat with us.


Jul 22, 2020
What data transformation library should I use? Pandas vs Dask vs Ray vs Modin vs Rapids (Ep. 112)
21:10

In this episode I speak about data transformation frameworks available for the data scientist who writes Python code.
The usual suspect is clearly Pandas, the most widely used library and de-facto standard. However, when data volumes increase and distributed algorithms come into play (following a map-reduce paradigm of computation), Pandas no longer performs as expected. Other frameworks play a role in such contexts.

In this episode I explain the frameworks that are the best equivalents to Pandas in big-data contexts.
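
To make the comparison concrete, here is a small sketch of the same groupby written with Pandas and with Dask: Dask mirrors the Pandas API while partitioning the data and deferring computation, which is what makes it a natural drop-in at larger volumes. The file names are placeholders.

```python
import pandas as pd
import dask.dataframe as dd

# Pandas: the whole CSV must fit in memory
df = pd.read_csv("events.csv")
result = df.groupby("user_id")["amount"].sum()

# Dask: same API, but the files are split into partitions and
# the computation stays lazy until .compute() is called
ddf = dd.read_csv("events-*.csv")
result_distributed = ddf.groupby("user_id")["amount"].sum().compute()
```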

Don't forget to join our Discord channel and comment previous episodes or propose new ones.

 

This episode is supported by Amethix Technologies.

Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. Amethix is a consulting firm focused on data science, machine learning, and artificial intelligence.


Jul 19, 2020
[RB] It’s cold outside. Let’s speak about AI winter (Ep. 111)
36:54

In this episode I speak with Filip Piekniewski about some of the most noteworthy findings in AI and machine learning in 2019. As a matter of fact, the entire field of AI has been inflated by hype and claims that are hard to believe. A lot of the promises made a few years ago have proven quite hard to achieve, if not impossible. Let's stay grounded and realistic about the potential of this amazing field of research, so as not to bring disillusion in the near future.

Join us on our Discord channel to discuss your favorite episodes and propose new ones.

 

This episode is brought to you by Protonmail

Click on the link in the description or go to protonmail.com/datascience and get 20% off their annual subscription.

Jul 03, 2020
Rust and machine learning #4: practical tools (Ep. 110)
24:18

In this episode I make a non-exhaustive list of machine learning tools and frameworks written in Rust. Not all of them are mature enough for production environments, but I believe that community effort can change this very quickly.

To draw a comparison with the Python ecosystem, I cover frameworks for linear algebra (numpy), dataframes (pandas), off-the-shelf machine learning (scikit-learn), deep learning (tensorflow) and reinforcement learning (openAI).

Rust is the language of the future.
Happy coding!
 

References

  1. BLAS linear algebra https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
  2. Rust dataframe https://github.com/nevi-me/rust-dataframe
  3. Rustlearn https://github.com/maciejkula/rustlearn
  4. Rusty machine https://github.com/AtheMathmo/rusty-machine
  5. Tensorflow bindings https://lib.rs/crates/tensorflow
  6. Juice (machine learning for hackers) https://lib.rs/crates/juice
  7. Rust reinforcement learning https://lib.rs/crates/rsrl
Jun 29, 2020
Rust and machine learning #3 with Alec Mocatta (Ep. 109)
23:58

In the 3rd episode of Rust and machine learning I speak with Alec Mocatta.
Alec is a professional programmer with 20+ years of experience who has been spending time at the intersection of distributed systems and data analytics. He's the founder of two startups in the distributed systems space and the author of Amadeus, an open-source framework that encourages you to write clean and reusable code that works, regardless of data scale, locally or distributed across a cluster.

Only for June 24th: LDN *Virtual* Talks June 2020 with Bippit (Alec speaking about Amadeus).

 

Jun 22, 2020
Rust and machine learning #2 with Luca Palmieri (Ep. 108)
27:02

In the second episode of Rust and Machine Learning I speak with Luca Palmieri, who has spent a large part of his career at the intersection of machine learning and data engineering.
In addition, Luca has contributed to several projects close to the machine learning community using the Rust programming language. Linfa is an ambitious project that definitely deserves the attention of the data science community (and it's written in Rust, with Python bindings! How cool??!).


Jun 19, 2020
Rust and machine learning #1 (Ep. 107)
22:27

This is the first episode of a series about the Rust programming language and the role it can play in the machine learning field.

Rust is one of the most beautiful languages I have studied so far. I personally come from the C programming language, though for professional activities in machine learning I had to switch to the loved and hated Python.

This episode clearly does not provide an exhaustive list of the benefits of Rust, nor of its capabilities. For that, you can check the references and start getting familiar with what I think is going to be the language of the next 20 years.

 

Sponsored

This episode is supported by Pryml Technologies. Pryml offers secure and cost-effective data privacy solutions for your organisation. It generates a synthetic alternative without disclosing your confidential data.


Jun 17, 2020
Protecting workers with artificial intelligence (with Sandeep Pandya CEO Everguard.ai)(Ep. 106)
16:20

In this episode I have a chat with Sandeep Pandya, CEO at Everguard.ai, a company that uses sensor fusion, computer vision and more to provide safer working environments for workers in heavy industry.
Sandeep is a senior executive who can hide the complexity of the topic with great talent.

 

This episode is supported by Pryml.io
Pryml is an enterprise-scale platform to synthesise data and deploy applications built on that data back to a production environment.
Test ideas. Launch new products. Fast. Secure.

Jun 15, 2020
Compressing deep learning models: rewinding (Ep.105)
15:31

As a continuation of the previous episode, in this one I cover the topic of compressing deep learning models and explain another simple yet fantastic approach that can lead to much smaller models that still perform as well as the original one.
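
Here is a minimal numpy sketch of the idea (magnitude pruning with weight rewinding, as in the paper linked below); the training loop is a stub, since it depends entirely on your model.

```python
import numpy as np

def prune_mask(weights, sparsity):
    """Mask that zeroes out the smallest-magnitude fraction `sparsity` of weights."""
    k = int(weights.size * sparsity)
    threshold = np.sort(np.abs(weights), axis=None)[k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

def train(weights, mask, steps):
    # stub: your actual training loop goes here, keeping pruned weights at zero
    return weights * mask

rng = np.random.default_rng(0)
w_init = rng.normal(size=(256, 256))

w_trained = train(w_init.copy(), np.ones_like(w_init), steps=10_000)
mask = prune_mask(w_trained, sparsity=0.8)

# rewinding: restart from the *early* weights, not the trained ones,
# and retrain with the pruned structure fixed
w_rewound = w_init * mask
w_final = train(w_rewound, mask, steps=10_000)
```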

Don't forget to join our Slack channel and discuss previous episodes or propose new ones.

This episode is supported by Pryml.io
Pryml is an enterprise-scale platform to synthesise data and deploy applications built on that data back to a production environment.

 

References

Comparing Rewinding and Fine-tuning in Neural Network Pruning
https://arxiv.org/abs/2003.02389

 

Jun 01, 2020
Compressing deep learning models: distillation (Ep.104)
22:19

Using large deep learning models on limited hardware or edge devices is definitely prohibitive. There are methods to compress large models by orders of magnitude and maintain similar accuracy during inference.

In this episode I explain one of the first methods: knowledge distillation.
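
For reference, a minimal numpy sketch of the distillation loss in the style of the classic formulation (softened teacher targets at temperature T blended with the hard-label loss); the logits below are random stand-ins for real model outputs.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend of soft cross-entropy against the teacher and hard cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    soft = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean() * T**2
    hard_probs = softmax(student_logits)
    hard = -np.log(hard_probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
teacher = rng.normal(size=(32, 10)) * 3   # a confident teacher
student = rng.normal(size=(32, 10))
labels = rng.integers(0, 10, size=32)
print(distillation_loss(student, teacher, labels))
```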

Come join us on Slack!


May 20, 2020
Pandemics and the risks of collecting data (Ep. 103)
20:09

Covid-19 is an emergency. True. Let's just not set up another emergency, this time about privacy violations, for when this one is over.

 

Join our new Slack channel

 

This episode is supported by Proton. You can check them out at protonmail.com or protonvpn.com

May 08, 2020
Why average can get your predictions very wrong (ep. 102)
14:40

Whenever people reason about the probability of events, they tend to consider average values between two extremes.
In this episode I explain why such a way of approximating is wrong and dangerous, with a numerical example.
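
A tiny numerical illustration of the trap (my example, not necessarily the one in the episode): with a nonlinear payoff, the payoff at the average input can be wildly different from the average payoff, by Jensen's inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

# two extremes, equally likely: a loss of 100 or a gain of 100
x = rng.choice([-100.0, 100.0], size=100_000)

payoff = lambda v: v ** 2 / 100  # a convex (nonlinear) payoff

print(payoff(x.mean()))   # payoff at the average input: ~0
print(payoff(x).mean())   # average payoff: ~100 -- wildly different
```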

We are moving our community to Slack. See you there!


Apr 19, 2020
Activate deep learning neurons faster with Dynamic RELU (ep. 101)
22:18

In this episode I briefly explain the concept behind activation functions in deep learning. One of the most widely used activation functions is the rectified linear unit (ReLU).
While there are several flavors of ReLU in the literature, in this episode I speak about a very interesting approach that keeps computational complexity low while improving performance quite consistently.
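
As a rough, heavily simplified sketch of the idea in the paper linked below (not the paper's actual architecture): Dynamic ReLU replaces the fixed max(x, 0) with a maximum over K linear pieces whose slopes and intercepts are predicted from the input itself.

```python
import numpy as np

def dynamic_relu(x, hyper_w, K=2):
    """Simplified sketch: a piecewise-linear activation whose K slopes and
    intercepts are predicted from a global summary of the input."""
    context = x.mean(axis=-1, keepdims=True)   # global pooling of the input
    theta = context @ hyper_w                  # tiny hyperfunction: (batch, 2K)
    a = 1.0 + 0.5 * np.tanh(theta[:, :K])      # slopes, centred around 1
    b = 0.5 * np.tanh(theta[:, K:])            # intercepts, centred around 0
    # elementwise max over the K input-dependent linear functions
    pieces = a[:, None, :] * x[..., None] + b[:, None, :]
    return pieces.max(axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))        # batch of 8 activations of width 16
hyper_w = rng.normal(size=(1, 4))   # hyperfunction weights for K = 2
print(dynamic_relu(x, hyper_w).shape)  # (8, 16)
```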

This episode is supported by pryml.io. At pryml we let companies share confidential data. Visit our website.

Don't forget to join us on the Discord channel to propose new episodes or discuss previous ones.

References

Dynamic ReLU https://arxiv.org/abs/2003.10027

Apr 01, 2020
WARNING!! Neural networks can memorize secrets (ep. 100)
24:16

One of the best features of neural networks and machine learning models is their ability to learn patterns from training data and apply them to unseen observations. That's where the magic is.
However, there are scenarios in which the same machine learning models learn patterns so well that they can disclose some of the data they have been trained on. This phenomenon goes under the name of unintended memorization, and it is extremely dangerous.

Think about a language generator that discloses the passwords, credit card numbers and social security numbers of the records it has been trained on. Or, more generally, think about a synthetic data generator that can disclose the training data it is trying to protect.

In this episode I explain why unintended memorization is a real problem in machine learning. Except for differentially private training, there is no other way to mitigate such a problem in realistic conditions.
At Pryml we are very aware of this, which is why we have been developing a synthetic data generation technology that is not affected by such an issue.
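
To make unintended memorization measurable, the Secret Sharer paper (referenced below) proposes an exposure metric: insert a random canary into the training data, then rank its likelihood under the trained model against every alternative canary. A toy sketch, with sequence_log_prob standing in for a real model's scoring function:

```python
import numpy as np

def exposure(model_log_prob, canary, candidates):
    """Exposure = log2(#candidates) - log2(rank of the canary).
    High exposure means the model assigns the inserted secret an
    unusually high likelihood: it has memorized it."""
    scores = np.array([model_log_prob(c) for c in candidates])
    rank = 1 + (scores > model_log_prob(canary)).sum()
    return np.log2(len(candidates)) - np.log2(rank)

# toy stand-in for a trained language model's log-probability:
# it "memorized" the canary 1234 and scores it far above everything else
def sequence_log_prob(secret):
    return 0.0 if secret == 1234 else -np.random.default_rng(secret).uniform(5, 50)

candidates = list(range(10_000))  # all possible 4-digit canaries
print(exposure(sequence_log_prob, 1234, candidates))  # ~13.3 bits: memorized
```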

 

This episode is supported by Harmonizely.
Harmonizely lets you build your own unique scheduling page based on your availability, so you can start scheduling meetings in just a couple of minutes.
Get started by connecting your online calendar and configuring your meeting preferences.
Then, start sharing your scheduling page with your invitees!

 

References

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
https://www.usenix.org/conference/usenixsecurity19/presentation/carlini

Mar 23, 2020
Attacks to machine learning model: inferring ownership of training data (Ep. 99)
19:39

In this episode I explain a very effective technique that allows one to infer whether any record at hand was part of the (private) training dataset used to train the target model. The effectiveness of this technique is due to the fact that it works on black-box models, with no access to the training data, nor to the model parameters and hyperparameters. Such a scenario is very realistic and typical of machine-learning-as-a-service APIs.
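
For intuition, here is a compact sketch of the shadow-model variant of the attack (from the paper referenced below), under simplifying assumptions: the attacker can sample data from the same distribution as the target's training set, and the black-box target returns confidence vectors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)

# the target model, trained on data the attacker never sees
X_tr, X_out, y_tr, y_out = train_test_split(X, y, test_size=0.5, random_state=1)
target = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# shadow model: trained on the attacker's own data from the same distribution,
# so the attacker knows exactly which records are "in" and which are "out"
Xs_in, Xs_out, ys_in, ys_out = train_test_split(X_out, y_out, test_size=0.5, random_state=2)
shadow = RandomForestClassifier(random_state=1).fit(Xs_in, ys_in)

def confidences(model, X):
    return np.sort(model.predict_proba(X), axis=1)  # sorted confidence vector

# attack model: learns to tell members from non-members by confidence shape
A = np.vstack([confidences(shadow, Xs_in), confidences(shadow, Xs_out)])
m = np.array([1] * len(Xs_in) + [0] * len(Xs_out))
attack = LogisticRegression(max_iter=1000).fit(A, m)

# query the black-box target: members tend to get higher attack scores
print("members:    ", attack.predict_proba(confidences(target, X_tr[:5]))[:, 1])
print("non-members:", attack.predict_proba(confidences(target, Xs_out[:5]))[:, 1])
```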

This episode is supported by pryml.io, a platform I am personally working on that enables data sharing without giving up confidentiality. 

 

As promised, below is the schema of the attack explained in the episode.

[Figure: shadow-model-attack.png - schema of the shadow model attack]

 

References

Membership Inference Attacks Against Machine Learning Models


Mar 14, 2020
Don't be naive with data anonymization (Ep. 98)
13:41

Masking, obfuscating, stripping, shuffling.
All the above techniques try to do one simple thing: keep the data private while sharing it with third parties. Unfortunately, they are not a silver bullet for confidentiality.

All the players in the synthetic data space rely on simplistic techniques that are not secure, might not be compliant, and are risky for production.
At pryml we do things differently.

Mar 08, 2020
Why sharing real data is dangerous (Ep. 97)
10:35

There are very good reasons why a financial institution should never share their data. Actually, they should never even move their data. Ever.
In this episode I explain why.


Mar 01, 2020
Building reproducible machine learning in production (Ep. 96)
14:20

Building reproducible models is essential for all those scenarios in which the lead developer collaborates with other team members. Reproducibility in machine learning should not be an art; rather, it should be achieved via a methodical approach.
In this episode I give a few suggestions about how to make your ML models reproducible and keep your workflow smooth.
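
One of the simplest methodical steps (a minimal sketch of my own, not the episode's full list of suggestions) is to pin every source of randomness and fingerprint the exact configuration of each run:

```python
import hashlib
import json
import random

import numpy as np

def set_seeds(seed: int):
    """Pin all sources of randomness we use; extend with torch/tf if needed."""
    random.seed(seed)
    np.random.seed(seed)

config = {"seed": 42, "model": "logreg", "C": 1.0, "dataset": "v2020-01"}
set_seeds(config["seed"])

# fingerprint the exact configuration and store it next to the model artifact
config_hash = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
print(f"run id: {config_hash[:12]}")
```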

Enjoy the show!

Come visit us on our Discord channel and have a chat!

Feb 22, 2020
Bridging the gap between data science and data engineering: metrics (Ep. 95)
13:25

Data science and data engineering are usually two different departments in organisations. Bridging the gap between the two is essential to success. Many times the brilliant applications created by data scientists don't make it to production, just because they are not production-ready.

In this episode I have a talk with Daan Gerits, co-founder and CTO at Pryml.io

 

Feb 14, 2020
A big welcome to Pryml: faster machine learning applications to production (Ep. 94)
09:26

Why so much silence? Building a company! That's why :)
I am building pryml, a platform that allows data scientists to build their applications on data they cannot get access to.
This is the first of a series of episodes in which I will speak about the technology and the challenges we are facing while we build it.

Happy listening and stay tuned!

Feb 07, 2020
It's cold outside. Let's speak about AI winter (Ep. 93)
36:48

In the last episode of 2019 I speak with Filip Piekniewski about some of the most noteworthy findings in AI and machine learning in 2019. As a matter of fact, the entire field of AI has been inflated by hype and claims that are hard to believe. A lot of the promises made a few years ago have proven quite hard to achieve, if not impossible. Let's stay grounded and realistic about the potential of this amazing field of research, so as not to bring disillusion in the near future.

Join us on our Discord channel to discuss your favorite episodes and propose new ones.
I would like to thank all of you for supporting and inspiring us. I wish you a wonderful 2020!

Francesco and the team of Data Science at Home

Dec 31, 2019
The dark side of AI: bias in the machine (Ep. 92)
20:26

This is the fourth and last episode of the mini series "The dark side of AI".
I am your host Francesco and I'm with Chiara Tonini from London. The title of today's episode is "Bias in the machine".

C: Francesco, today we are starting with an infuriating discussion. Are you ready to be angry? 

F: Yeah, sure. Is this about Brexit?

C: No, I won't talk about that. In 1986, Rockefeller University in New York City conducted a study on breast and uterine cancers and their link to obesity. Like in all clinical trials up to that point, the subjects of the study were all men. So Francesco, do you see a problem with this approach?

F: No problem at all, as long as those men had a perfectly healthy uterus.

C: In medicine, up to the end of the 20th century, medical studies and clinical trials were conducted on men, and medicine dosages and therapies were calculated on men (white men). The female body has historically been considered an exception to, or variation of, the male body.

F: Like Eve coming from Adam's rib. I thought we were past that...

C: When the female body has been under analysis, the focus was on the difference between it and the male body, the so-called "bikini approach": the reproductive organs are different, therefore we study those, and those only. For a long time medicine assumed this was the only difference.

F: Oh good...

C: This has led to a hugely harmful fallout across society. Because women had reproductive organs, they should reproduce, and all else about them was deemed uninteresting. Still today, a woman without children is somehow considered to have betrayed her biological destiny. This somehow does not apply to a man without children, who also has reproductive organs.

F: So this is an example of a very specific type of bias in medicine, regarding clinical trials and medical studies, that is not only harmful for the purposes of these studies, but has ripple effects in all of society.

C: Only in the 2010s did a serious conversation start about the damage caused by not including women in clinical trials. There are many, many examples (which we list in the references for this episode).

F: Give me one.

C: Researchers consider cardiovascular disease a male disease - they even call it "the widower". They conduct studies on male samples. But it turns out the symptoms of a heart attack, especially the ones leading up to one, are different in women. This has led to doctors not recognising, or dismissing, the early symptoms in women.

F: I was reading that women are also subject to chronic pain much more than men: for example migraines, and pain related to endometriosis. But there is extensive evidence now of doctors dismissing women's pain as either imaginary or "inevitable", as if it were a normal state of being that does not need a cure at all.

C: The failure of the medical community as a whole to recognise this obvious bias well into the 21st century is an example of how insidious the problem of bias is.

There are 3 fundamental types of bias:

  1. One: Stochastic drift: you train your model on a dataset and validate it on a split of the training set. When you apply your model out in the world, you systematically add bias to the predictions, because the training data was too specific.
  2. Two: The bias in the model, introduced by your choice of the parameters of your model.
  3. Three: The bias in your training sample: people put training samples together, and people have culture, experience, and prejudice. As we will see today, this is the most dangerous and subtle bias, and the one we'll talk about (a small synthetic sketch follows this list).
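
A minimal synthetic sketch of this third kind of bias, added for illustration: a model trained on a skewed sample scores well on its own validation split, yet degrades on the true population, and especially on the under-represented group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# true population: two groups; the label depends on feature x1 with
# an opposite sign in each group
n = 20_000
group = rng.integers(0, 2, n)
x = rng.normal(size=(n, 3))
y = ((x[:, 0] + np.where(group == 1, x[:, 1], -x[:, 1])) > 0).astype(int)
X = np.column_stack([group, x])

# biased training sample: ~95% of the records come from group 0
weights = np.where(group == 0, 0.95, 0.05)
idx = rng.choice(n, size=5_000, replace=False, p=weights / weights.sum())

X_tr, X_val, y_tr, y_val = train_test_split(X[idx], y[idx], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("validation split (same biased sample):", model.score(X_val, y_val))
print("true population:", model.score(X, y))
print("under-represented group only:", model.score(X[group == 1], y[group == 1]))
```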

 

C: Bias is a warping of our understanding of reality. We see reality through the lens of our experience and our culture. The origins of bias can date back to traditions going back centuries, and it is so ingrained in our way of thinking that we don't even see it anymore.

F: And let me add, when it comes to machine learning, we see reality through the lens of data. Bias is everywhere, and we could spend hours and hours talking about it. It's complicated.

C: It's about to become more complicated.

F: Of course, if I know you…

C: Let's throw artificial intelligence into the mix.

F: You know, there was a happier time when this sentence didn't fill me with a sense of dread...

C: ImageNet is an online database of over 14 million photos, compiled more than a decade ago at Stanford University. It has been used to train machine learning algorithms for image recognition and computer vision, and it played an important role in the rise of deep learning. We've all played with it, right? The cats and dogs classifier when learning TensorFlow? (I am a dog, by the way.)

F: ImageNet has been a critical asset for computer-vision research. There was an annual international competition to create algorithms that could most accurately label subsets of images.
In 2012, a team from the University of Toronto used a convolutional neural network to handily win the top prize. That moment is widely considered a turning point in the development of contemporary AI. The final year of the ImageNet competition was 2017, and accuracy in classifying objects in the limited subset had risen from 71% to 97%. But that subset did not include the "Person" category, where the accuracy was much lower...

C: ImageNet contained photos of thousands of people, with labels. This included straightforward tags like "teacher", "dancer" and "plumber", as well as highly charged labels like "failure, loser" and "slut, slovenly woman, trollop".

F: Uh oh.

C: Then "ImageNet Roulette" was created by an artist called Trevor Paglen and a Microsoft researcher named Kate Crawford. It was a digital art project where you could upload your photo and let the classifier identify you, based on the labels of the database. Imagine how well that went.

F: I bet it didn't work.

C: Of course it didn't work. Random people were classified as "orphans" or "non-smoker" or "alcoholic". Somebody with glasses was a "nerd". Tabong Kima, a 24-year-old African American, was classified as "offender" and "wrongdoer".

F: And there it is.

C: Quote from Trevor Paglen: "We want to show how layers of bias and racism and misogyny move from one system to the next. The point is to let people see the work that is being done behind the scenes, to see how we are being processed and categorized all the time."

F: The ImageNet labels were applied by thousands of unknown people, most likely in the United States, hired by the team from Stanford and working through the crowdsourcing service Amazon Mechanical Turk. They earned pennies for each photo they labeled, churning through hundreds of labels an hour. The labels were not verified in any way: if a labeler thought someone looked "shady", that label was just a result of their prejudice, with no basis in reality.

C: As they did, biases were baked into the database. Paglen again: "The way we classify images is a product of our worldview," he said. "Any kind of classification system is always going to reflect the values of the person doing the classifying." They defined what a "loser" looked like. And a "slut." And a "wrongdoer."

 

F: The labels originally came from another sprawling collection of data called WordNet, a kind of conceptual dictionary for machines built by researchers at Princeton University in the 1980s. But with these inflammatory labels included, the Stanford researchers may not have realized what they were doing.

C: What is happening here is the transfer of bias from one system to the next.

Tech jobs, in past decades but still today, predominantly go to white males from a narrow social class. Inevitably, they imprint the technology with their worldview. So their algorithms learn that a person of color is a criminal, and a woman with a certain look is a slut.

I'm not saying they do it on purpose, but the lack of diversity in the tech industry translates into a narrower worldview, which has real consequences for the quality of AI systems.

F: Diversity in tech teams is often framed as an equality issue (which of course it is), but there are enormous advantages to it: it allows the creation of the cognitive diversity that will be reflected in superior products or services.

C: I believe this is an ongoing problem. In recent months, researchers have shown that face-recognition services from companies like Amazon, Microsoft and IBM can be biased against women and people of color.

Crawford and Paglen argue this:
"In many narratives around AI it is assumed that ongoing technical improvements will resolve all problems and limitations. But what if the opposite is true? What if the challenge of getting computers to 'describe what they see' will always be a problem? The automated interpretation of images is an inherently social and political project, rather than a purely technical one. Understanding the politics within AI systems matters more than ever, as they are quickly moving into the architecture of social institutions: deciding whom to interview for a job, which students are paying attention in class, which suspects to arrest, and much else."

F: You are using the words "interpretation of images" here, as opposed to "description" or "classification". Certain images depict something concrete, with an objective reality. Like an apple. But other images… not so much?

C: ImageNet contains images corresponding only to nouns (not verbs, for example). Noun categories such as "apple" are well defined.
But not all nouns are created equal. Linguist George Lakoff points out that the concept of an "apple" is more nouny than the concept of "light", which in turn is more nouny than a concept such as "health".
Nouns occupy various places on an axis from concrete to abstract, and from descriptive to judgmental. The images corresponding to these nouns become more and more ambiguous.
These gradients have been erased in the logic of ImageNet. Everything is flattened out and pinned to a label.
The results can be problematic, illogical, and cruel, especially when it comes to labels applied to people.

F: So when an image is interpreted as Drug Addict, Crazy, Hypocrite, Spinster, Schizophrenic, Mulatto, Red Neck… this is not an objective description of reality, it's somebody's worldview coming to the surface.

C: The selection of images for these categories skews the meaning in ways that are gendered, racialized, ableist, and ageist. ImageNet is an object lesson in what happens when people are categorized like objects.
And this practice has only become more common in recent years, often inside the big AI companies, where there is no way for outsiders to see how images are being ordered and classified.

The bizarre thing about these systems is that they are reminiscent of early 20th-century criminologists like Lombroso, of phrenologists (including Nazi scientists), and of physiognomy in general. That discipline was founded on the assumption that there is a relationship between an image of a person and the character of that person. If you are a murderer, or a Jew, the shape of your head, for instance, will tell.

F: In reaction to these ideas, René Magritte produced that famous painting of the pipe with the tag "This is not a pipe".

C: You know that famous photograph of the soldier kissing the nurse at the end of the Second World War? The nurse went public about it when she was around 90 years old, and told how this total stranger in the street had grabbed her and kissed her. This is a picture of sexual harassment. And knowing that, it does not seem romantic anymore.

F: Not romantic at all, indeed.

C: Images do not describe themselves. This is a feature that artists have explored for centuries. We see those images differently when we see how they're labeled. The correspondence between image, label, and referent is fluid. What's more, those relations can change over time as the cultural context of an image shifts, and they can mean different things depending on who looks, and where they are located. Images are open to interpretation and reinterpretation. Entire subfields of philosophy, art history, and media theory are dedicated to teasing out all the nuances of the unstable relationship between images and meanings.
The common mythos of AI, and of the data it draws on, is that they are objectively and scientifically classifying the world. But it's not true: everywhere there is politics, ideology, prejudice, and all of the subjective stuff of history.

 

F: When we survey the most widely used training sets, we find that this is the rule rather than the exception.
Training sets are the foundation on which contemporary machine-learning systems are built. They are central to how AI systems recognize and interpret the world.
By looking at the construction of these training sets and their underlying structures, we discover many unquestioned assumptions that are shaky and skewed. These assumptions inform the way AI systems work - and fail - to this day.
And the impenetrability of the algorithms, the impossibility of reconstructing the decision-making of a neural network, hides the bias further away from scrutiny. When an algorithm is a black box and you can't look inside, you have no way of analysing its bias.

C: And the skewness and bias of these algorithms have real effects in society: the more you use AI in the judicial system, in medicine, in the job market, in security systems based on facial recognition, the list goes on and on.

Last year Google unveiled BERT (Bidirectional Encoder Representations from Transformers). It's an AI system that learns to talk: it's a natural language processing engine that generates written (or spoken) language.

F: We have an episode in which we explain all of that.

C: They trained it on lots and lots of digitized information, as varied as old books, Wikipedia entries and news articles. Decades and even centuries of biases - along with a few new ones - were baked into all that material. So for instance BERT is extremely sexist: it associates almost all professions and positive attributes with men (except for "mom").

BERT is widely used in industry and academia. For example, it can interpret news headlines automatically. Even Google's search engine uses it.

Try googling "CEO", and you get a gallery of images of old white men.

F: Such a pervasive and flawed AI system can propagate inequality at scale. And it's super dangerous because it's subtle. Especially in industry, query results will not be tested and examined for bias. AI is a black box and researchers take results at face value.

C: There are many cases of algorithm-based discrimination in the job market. Targeting candidates for tech jobs, for instance, may be done by algorithms that will not recognise women as potential candidates. Therefore, women will not be exposed to as many job ads as men. Or automated HR systems will rank them lower (for the same CV) and screen them out.

In the US, algorithms are used to calculate bail. The majority of the prison population in the US is composed of people of colour, as a result of a systemic bias that goes back centuries. An algorithm learns that a person of colour is more likely to commit a crime, is more likely to not be able to afford bail, and is more likely to violate parole. Therefore, people of colour will receive harsher punishments for the same crime. This amplifies inequality at scale.

Conclusion

Question everything, and never take the predictions of your models at face value. Always question how your training samples have been put together, who put them together, when, and in what context. Always remember that your model produces an interpretation of reality, not a faithful depiction.
Treat reality responsibly.

 

Dec 28, 2019
The dark side of AI: metadata and the death of privacy (Ep. 91)
23:00
Get in touch with us
Join the discussion about data science, machine learning and artificial intelligence on our Discord server

 

Episode transcript

C: We always hear the word "metadata", usually in a sentence that goes like this:

"Your Honor, I swear, we were not collecting users' data, just metadata."

Usually the guy saying this sentence is Zuckerberg, but it could be anybody from Amazon or Google. "Just" metadata, so no problem. This is one of the biggest lies about the reality of data collection.

 

F: Ok, the first question is: what the hell is metadata?

C: Metadata is data about data.

F: Ok… still not clear.

C: Imagine you make a phone call to your mum. How often do you call your mum, Francesco?

F: Every day of course! (coughing)

C: Good boy! Ok, so let's talk about today's phone call. Let's call "data" the stuff that you and your mum actually said. What did you talk about?

F: She was giving me the recipe for her famous lasagna.

C: So your mum's lasagna is the DATA. What is the metadata of this phone call? The call has data of its own attached to it: the date and time when the conversation happened, the duration of the call, the unique hardware identifiers of your phone and your mum's phone, the identifiers of the two SIM cards, the location of the cell towers that pinged the call, the GPS coordinates of the phones themselves.

 


F: Yeah well, this lasagna comes with a lot of data :)

C: And this is assuming that this data is not linked to any other data, like your Facebook account or your web browsing history. More on that later.

F: Whoa whoa whoa, ok. Let's put a pin in that. Going back to the "basic" metadata that you describe: I think we understand the concept of data about data. I am sure you did your research, and you would love to paint me a dystopian nightmare, as always. Tell us: why is this a big deal?

C: Metadata is a very big deal. In fact, metadata is far more "useful" than the actual data, where by "useful" I mean that it allows a third party to learn about you and your whole life. What I am saying is, the fact that you talk with your mum every day for 15 minutes tells me more about you than the content of the actual conversations. In a way, the content does not matter. Only the metadata matters.

F: Ok, can you explain this point a bit more?

C: Imagine this scenario: you work in an office in Brussels, and you go by car. Every day, you use your time in the car on the way home to call your mum. So every day around 6pm, a cell tower along the path from your office to your home pings a call from your phone to your mum's phone. Someone who is looking at your metadata knows exactly where you are while you call your mum. Every day you will talk about something different, and it doesn't really matter. Your location will come through loud and clear. A lot of additional information can be deduced from this too: for example, you are moving along a motorway, therefore you have a car. The metadata of a call to mum now becomes information on where you are at 6pm, and on the way you travel.

 


F: I see. So metadata about the phone call is, in fact, real data about me.

C: Exactly. YOU are what is interesting, not your mum's lasagna.

F: You say so because you haven't tried my mum's lasagna. But I totally get your point.

C: Now, imagine that one day, instead of going straight home, you decide to go somewhere else. Maybe you are secretly looking for another job. Your metadata is recording the fact that after work you visit the offices of a rival company. Maybe you are a journalist and you visit your anonymous source. Your metadata records wherever you go, and one of these places is your secret meeting with your source. Anyone's metadata can be combined with yours. There will be someone who was with you at the time and place of your secret meeting. Anyone who comes in contact with you can be tagged and monitored. Now their anonymity has been reduced.

F: I get it. So, compared to the content of my conversation, its metadata contains more actionable information. And this is the most useful, and most precious, kind of information about me. What I do, what I like, who I am, beyond the particular conversation.

C: Precisely. If companies like Facebook or the phone companies had explicit permission to collect all the users' data, including all content of conversations, it's still the metadata that would generate the most actionable information. They would probably throw the content of conversations away. In the vast majority of instances, the content does not matter. Unless you are an actual spy talking about state secrets, nobody cares.

 


F: Let’s stay on the spy point for a minute. One could say, So what? As I have heard this many times. So what if my metadata contains actionable information, and there are entities that collect it. If I am an honest person, I have nothing to hide. 

 


There are two aspects to the problem of privacy. Government surveillance, and corporate - in other words private - surveillance. 
Government surveillance is a topic that has been covered flawlessly by Edward Snowden in his book “Permanent Record”, and in the documentary about his activity, “Citizenfour”. Which I both recommend, and in fact I think every data scientist should read and watch.
Let’s just briefly mention the obvious: just because something comes from a government, it does not mean it’s legal or legitimate, or even ethical or moral. What if your government is corrupt, or authoritarian. What if you are a dissident and you are fighting for human rights. What if you are a journalist, trying to uncover government corruption. 

 


F: In other words,  it is a false equivalence to say that protecting your privacy has anything to do with having something to hide.

 


Mass surveillance of private citizens without cause is a danger to individual freedom as well as civil liberties. Government exists to serve its citizens, not the other way around. To freely paraphrase Snowden, as individuals have no power compared to the government, the only way the system works is if the government is completely transparent to the citizens, so that they can collectively change it, and at the same time the single citizens are opaque to the government, so that it cannot abuse its power. But today the opposite happens: we citizens are completely naked and exposed in front of a completely opaque government machine, with secret surveillance programs on us, that we don’t even know exist. We are not free to self-determine, or do anything about government power, really.

 

 
F: We could really talk for days and days about government mass surveillance. But let’s go back to metadata, and let’s talk about the commercial use of it. Metadata for sale. You mentioned this term, “corporate surveillance”. It sounds…. Ominous. 

 


We live in privacy hell, Francesco. 

 


F: I get that. According to your research, where can we find metadata?  

 


First of all, metadata is everywhere. We are swimming in it. In each and every interaction between two people, that make use of digital technology, metadata is generated automatically, without the user’s consent. When two people interact, two machines also interact, recording the “context” of this interaction. Who we are, when, where, why, what we want.

F: And that doesn’t seem avoidable. In fact metadata must be generated by devices and software to just work properly. I look at it as an intrinsic component that cannot be removed from the communication system, whatever it is. The problem is who owns it. So tell me, who has such data? 

 


C: It does not matter, because it's all for sale. Which means, we are for sale.

F: Ok, holy s**t, this keeps getting darker. Let's have a practical example, shall we?

C: Have you booked a flight recently?

F: Yep. I'm going to Berlin, and in fact so are you. For a hackathon, no less.

C: Have you ever heard of a company called Adara?

F: No… Can't say that I have.

C: Adara is a "Predictive Traveler Intelligence" company.

F: Sounds pretty pretentious. Kinda douchy.

C: This came up on the terrifying Twitter account of Wolfie Christl, author, among other things, of a great report about corporate surveillance for Cracked Labs. Go check him out on Twitter, he's great.

F: Sure, I will add what I find to the show notes of this episode. Oh, and by the way, you can find all this stuff on datascienceathome.com. Sorry, go ahead.

C: Adara collects data - metadata - about travel-related online searches, purchases, devices, passenger records and loyalty program records. Data from clients that include major airlines, major airports, hotel chains and car rental chains. It creates a profile, a "traveler graph", in real time, for 750 million people around the world. A profile based on personal identifiers.

 

F: Uhh uhh. Then what?

C: Then Adara sells these profiles.

F: Ok… I have to say, the box that I tick giving consent to the third-party use of my personal data when I use an airline website does not quite convey how far my data actually goes.

C: Consent. LOL. Adara calculates a "traveler value score" based on customer behaviour and needs across the global travel ecosystem, over time.

The score is in the Salesforce Service Cloud, for sale to anyone.
This score, and your profile, determine the personalisation of travel offers and treatment: before purchase, during booking, post purchase, at check-in, in the airport, at the destination.
On their own website, Adara explains how customer service agents for their myriad of clients - for example a front desk agent at a hotel - can instantly see the traveler value score. Therefore they will treat you differently based on this score.

F: Oh, so if you have money to spend they will treat you differently.

C: The score is used to assess your potential value, to inform service and customer service strategies for you, as well as personalised messaging and relevant offers. And of course, the pricing you see when you look for flights. Low score? Prepare yourself to wait to have your call rerouted to a customer service agent. Would you ever tick a box to give consent to this?

F: F**k no. How is this even legal? What about the GDPR?

C: It is, in fact, illegal. Adara is based in the US, but they collect data through data warehouses in the Netherlands. They claim they are GDPR-compliant. However, they collect all the data first, and then decide on the specific business use, which is definitely not GDPR-compliant.

F: Exactly! According to the GDPR, the user has to know in advance the business use of the data they are giving consent for!!
With the GDPR and future regulations, there is a way to control how data is used and for what purpose. Regulations are still blurred or undefined when it comes to metadata. For example, there's no regulation about the number of records in a database, or about the timestamp at which a record was created. As a matter of fact, data is useless without metadata.

 


C: One cannot even collect data without metadata.

WhatsApp, Telegram, Facebook Messenger... they all create metadata. So one might say: "I've got end-to-end encryption, buddy." Sure thing. How about the metadata attached to that encrypted gibberish nobody is really interested in? To show you how unavoidable the concept of metadata is: even Signal, developed by the Signal Foundation and considered the truly end-to-end encrypted, open-source protocol for confidential information exchange, can see metadata. At Signal they claim they just don't keep it, as they also state in Signal's privacy policy:

"Certain information (e.g. a recipient's identifier, an encrypted message body, etc.) is transmitted to us solely for the purpose of placing calls or transmitting messages. Unless otherwise stated below, this information is only kept as long as necessary to place each call or transmit each message, and is not used for any other purpose."

This is one of those issues that should be solved with legislation.

But, like money laundering, your data is caught in a storm of transactions so intricate that at a certain point, how do you even check...
All participating companies share customer data with each other (a process called value exchange). They let marketers utilize the data, for example to target people after they have searched for flights or hotels. Adara creates audience segments and sells them, for example to Google, for advertisement targeting. The consumer data broker LiveRamp, for example, lists Adara as a data provider.

F: Consumer data broker. I am starting to get what you mean when you say that we are for sale.

C: Let's talk about LiveRamp, part of Acxiom.

F: There they go... Acxiom... I've heard of them.

C: They self-describe as an "Identity Resolution Platform".

F: I mean, George Orwell would be proud.

C: Their mission? "To connect offline data and online data back to a single identifier". In other words, clients can "resolve all" of their "offline and online identifiers back to the individual consumer".
Various digital profiles, like the ones generated on social media or when you visit a website, are matched to databases which contain names, postal addresses, email addresses, phone numbers, geolocations and IP addresses, and online and mobile identifiers, such as cookie and device IDs.

F: Well, all this stuff is possible if and only if someone gets in possession of all these profiles, or well... purchases them. Still, what the f**k.

C: A cute example? Imagine you register on some random website but you don't want to give them your home address. They just buy it from LiveRamp, which gets it from your phone's geolocation data - which is for sale. Where does your phone sit still for 12 hours every night? That's your home address. Easy.

F: And they definitely know how much time I spend at the gym, without even checking my Instagram! Ok, this is another level of creepy.

C: Clients of LiveRamp can upload their own consumer data to the platform, combine it with data from hundreds of third-party data providers, and then utilize it on more than 500 marketing technology platforms. They can use this data to find and target people with specific characteristics, to recognize and track consumers across devices and platforms, to profile and categorize them, to personalize content for them, and to measure how they behave. For example, clients could "recognize a website visitor" and "provide a customized offer" based on extensive profile data, without requiring said user to log in to the website. Furthermore, LiveRamp has a data store where other companies can "buy and sell valuable customer data".

F: What is even the point of giving me the choice to consent to anything online?

C: In short, there is no point.

F: It seems we are so behind with regulations on data sharing. The GDPR is not cutting it, not really. With programmatic advertising we have created a monster that has grown out of control.
So: our lives are completely transparent to private corporations that constantly surveil us en masse and exploit all of our data to sell us stuff. How does this affect our freedom? How about we just don't buy it? Can it be that simple? And I will not take no for an answer here.

 

C: Unfortunately, no.

F: Oh crap!

C: I'm going to read you a passage from Permanent Record:

"Who among us can predict the future? Who would dare to?
The answer to the first question is no one, really, and the answer to the second is everyone, especially every government and business on the planet. This is what that data of ours is used for. Algorithms analyze it for patterns of established behaviour in order to extrapolate behaviours to come, a type of digital prophecy that's only slightly more accurate than analog methods like palm reading. Once you go digging into the actual technical mechanisms by which predictability is calculated, you come to understand that its science is, in fact, anti-scientific, and fatally misnamed: predictability is actually manipulation."

"A website that tells you that because you liked book 1 then you might also like book 2, isn't offering an educated guess as much as a mechanism of subtle coercion. We can't allow ourselves to be used in this way, to be used against the future. We can't permit our data to be used to sell us the very things that must not be sold, such as journalism. [...]
We can't let the god-like surveillance we're under be used to "calculate" our citizenship scores, or to "predict" our criminal activity; to tell us what kind of education we can have, or what kind of job we can have [...], to discriminate against us based on our financial, legal, and medical histories, not to mention our ethnicity or race, which are constructs that data often assumes or imposes.
[...] if we allow [our data] to be used to identify us, then it will be used to victimize us, even to modify us - to remake the very essence of our humanity in the image of the technology that seeks its control. Of course, all of the above has already happened."

F: In other words, we are surveilled and our data collected, and used to affect every aspect of our lives - what we read, what movies we watch, where we travel, what we buy, who we date, what we study, where we work… This is a self-fulfilling prophecy for all of humanity, and the prophet is a stupid, imperfect algorithm optimised just to make money.
So I guess my message of today for all data scientists out there is this: just… don't.

Dec 23, 2019
The dark side of AI: recommend and manipulate (Ep. 90)
20:33

C: In 2017 a research group at the University of Washington did a study on the Black Lives Matter movement on Twitter. They constructed what they call a "shared audience graph" to analyse the different groups of audiences participating in the debate, and found an alignment of the groups with the political left and political right, as well as clear alignments with groups participating in other debates, like environmental issues, abortion issues and so on. In simple terms, someone who is pro-environment, pro-abortion and left-leaning is also supportive of the Black Lives Matter movement, and vice versa.

F: Ok, this seems to make sense, right? But… I suspect there is more to this story?

C: So far, yes… What they did not expect to find, though, was a pervasive network of Russian accounts participating in the debate, which turned out to be orchestrated by the Internet Research Agency, the not-so-secret Russian agency for internet black ops - the same one allegedly connected with the US election and the Brexit referendum.

F: Are we talking about actual spies? Where are you going with this?

C: Basically, the Russian accounts (some of them human and some of them bots) were infiltrating all aspects of the debate, both on the left and on the right side, always taking the most extreme stances on any particular aspect of the debate. The aim was to radicalise the conversation, to make it more and more extreme, in a tactic of divide and conquer: turn the population against itself in an online civil war, push for policies that would normally be considered too extreme (for instance, give tanks to the police to control riots, force a curfew, try to ban Muslims from your country). Chaos and unrest have repercussions on international trade and relations, and can align with foreign interests.

F: It seems like a pretty indirect and convoluted way of influencing a foreign power…

C: You might think so, but you are forgetting social media. This sort of operation directly exploits a core feature of internet social media platforms. And that feature, I am afraid, is recommender systems.

F: Whoa. Let’s take a step back. Let’s recap the general features of recommender systems, so we are on the same page. 

C: The main purpose of recommender systems is to recommend to people the items that similar people have shown an interest in.
Let's think about books and readers. The general idea is to find a way to predict the best book for the best reader. Amazon is doing it, Netflix is doing it, and probably the bookstore down the road does it too, just on a smaller scale.
Some of the most common methods to implement recommender systems use concepts such as cosine/correlation similarity, matrix factorization, neural autoencoders and sequence predictors.

The major issue of recommender systems is their validation. Even though validation occurs in a way that is similar to many machine learning methods, one should recommend a set of items first (in production) and then measure the efficacy of that recommendation. But recommending already alters the entire scenario, a bit in the flavour of the Heisenberg uncertainty principle.
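
As an illustration of the first of those ingredients, here is a minimal item-item cosine-similarity recommender on a toy user-item matrix (a sketch for intuition, not any platform's actual system):

```python
import numpy as np

# rows = users, columns = items; entries are interaction strengths (e.g. watch time)
R = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [0, 0, 4, 5, 1],
], dtype=float)

# cosine similarity between item columns
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

def recommend(user_row, sim, k=2):
    """Score unseen items by a similarity-weighted sum of the user's interactions."""
    scores = sim @ user_row
    scores[user_row > 0] = -np.inf   # don't re-recommend what they've already seen
    return np.argsort(scores)[::-1][:k]

print(recommend(R[0], sim))  # items recommended to user 0
```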

F: In the attention economy, the business model is to monetise the time the user spends on a platform, by showing them ads. Recommender systems are crucial for this purpose.
Chiara, you are saying that these algorithms have effects that are problematic?

C: As you say, recommender systems exist because the business model of social media platforms is to monetise attention. The most effective way to keep users' attention is to show them stuff they could show an interest in.
In order to do that, one must segment the audience to find the best content for each user. But then, for each user, how do you keep them engaged, and make them consume more content?

F: You’re going to say the word “filter bubble” very soon.

C: Spot on. To keep the user on the platform, you start by showing them content that they are interested in, and that agrees with their opinion.

But that is not all. How many videos of the same stuff can you watch, how many articles can you read? You must also escalate the content that the user sees, increasing the wow factor. The content goes from mild to extreme (conspiracy theories, hate speech etc).

The recommended content pushes the user opinion towards more extreme stances. It is hard to see from inside the bubble, but a simple experiment will show it. If you continue to click the first recommended video on YouTube, and you follow the chain of first recommended videos, soon you will find yourself watching stuff you’d never have actively looked for, like conspiracy theories, or alt-right propaganda (or pranks that get progressively more cruel, videos by people committing suicide, and so on).

F: So you are saying that this is not an accident: is this the basis of the optimisation of the recommender system? 

Yes, and it’s very effective. But obviously there are consequences. 

F: And I’m guessing they are not good. 

The collective result of single users being pushed towards more radical stances is a radicalisation of the whole conversation, the disappearance of nuance in the argument, the trivialisation of complex issues. For example, the Brexit debate in 2016 was about trade deals and customs unions, and now it is about remain vs no deal, with almost nothing in between.

F: Yes, the conversation is getting stupider. Is this just a giant accident? Just a sensible system that got out of control?

Yes and no. Recommender systems originated as a tool for boosting commercial revenue by selling more products. But applied to social media, they have caused an aberration: the recommendation of information, which leads to the so-called filter bubbles, the rise of fake news and disinformation, and the manipulation of the masses.

There is an intense debate in the scientific community about the polarising effects of the internet and social media on the population. An example of such a study is a paper by Johnson et al., which predicts that whether and how a population becomes polarised is dictated by the nature of the underlying competition, rather than by the validity of the information that individuals receive or by their online bubbles.

F: I would like to stress this finding. This is really f*cked up. Polarisation is caused neither by the particular subject nor by the way a debate is conducted, but by how legitimate the information seems to each individual. Which means that if I find a way to convince single individuals of something, I will in fact be manipulating the debate at a community scale or, in some cases, globally!
Oh my god, we seem to be so f*cked.

Take for instance the people who believe that the Earth is flat. Or how long it took people to recognise global warming as scientific fact, despite the threshold for scientific confirmation having been reached decades ago.

F: So, recommender systems let loose on social media platforms amplify controversy and conflict, and fringe opinions. I know I’m not going to like the answer, but I’m going to ask the question anyway.
This is all just an innocent mistake, right? 

Last year, the European Data Protection Supervisor published a report on online manipulation at scale.

F: That does not sound good.

The online digital ecosystem has connected people across the world, with over 50% of the population on the Internet, albeit very unevenly in terms of geography, wealth and gender. The initial optimism about the potential of internet tools and social media for civic engagement has given way to concern that people are being manipulated. This happens through the combination of the constant harvesting of often intimate information about them, and the control over the information they see online according to the category they are put into (the so-called segmentation of the audience). Arguably since 2016, but probably before, mass manipulation at scale has occurred during democratic elections, by using algorithms to game recommender systems, among other things, and to spread misinformation. Remember Cambridge Analytica?

F: I remember. I wish I didn’t. But why does it work? Are we so easy to manipulate? 

An interesting point is this. When one receives information collectively, as for example from the television news, it is far less likely that one develops extreme views (like, the Earth is flat), because the discourse is based on a common understanding of reality. And people call out each other's bulls*it.

F: Fair enough.


But when one receives information singularly, as happens via a recommender system through micro-targeting, then reality has a different manifestation for each audience member, with no common ground. One is far more likely to adopt extreme views, because there is no way to fact-check, and because the news feels personal. In fact, such news is tailored to the individual user precisely to push their buttons.
Francesco, if you show me George Clooney shirtless and holding a puppy, and George tells me that the Earth is flat, I might have doubts for a minute. Too personal? 

F: That’s good to know about you. I’m more of a cat person. But, experts keep saying that we are moving towards personalisation of everything. While this makes sense for things like personalised medicine, it probably is not that beneficial with many other kinds of recommendations. Especially not the news.
But social media feeds are extremely personalised. What can we do? 

Solutions so far have focused on transparency measures, exposing the source of information while neglecting the accountability of the players in the ecosystem who profit from harmful behaviour. But these are band-aids on bullet wounds.
The problem is the social media platforms themselves. In October 2019 Zuckerberg was in front of Congress again, because Facebook still refuses to fact-check political advertisements, after everything that has happened. At the same time, market concentration and the rise of platform dominance threaten media pluralism. This, in turn, leads to a handful of news pieces being repeated and amplified, and to independent journalism being silenced.

F: When I think of a recommender system, I think of Netflix.

  • You liked this kind of show in the past, so here are more shows of the same genre
  • People like you have liked this other type of show. Hence, here it is for your consideration

This seems relatively benign. Although, if you think about it some more, you realise that this mechanism will prevent you from actually discovering anything new: it just gives you more of what you are likely to like. Still, one would not think that this could have world-changing consequences.
With the news, this mechanism becomes lethal: in the mildest form – which is already bad – you will only hear opinions that already align with those of your own peer group. In the worst scenario, you will not hear some news at all, or you will hear a misleading or false version of the news, without even knowing that a different version exists.

In the Brexit referendum, misleading or false content (like the famous claim about the NHS money that was supposedly going to the EU instead) was amplified inside filter bubbles. Each bubble of people was essentially seeing a different version of the same issue. Brexit was a million different things, depending on your social media feeds.
And of course, there are malicious players in the game, like the Russian Internet Research Agency and Cambridge Analytica, who actively exploited these features in order to swing the vote.

F: Even the traditional media is starting to adopt recommender systems for the news content. This seems like a very bad idea, after all. Is there any other scenario in which recommender systems are not great? 

Recommender systems are used in a variety of applications.
For instance, in the job market: a recommender system that limits exposure to information about certain jobs on the basis of a person's gender or inferred health status perpetuates discriminatory attitudes and practices. In the US, similar algorithms are used to suggest bail amounts for people who have been arrested, disproportionately penalising people of colour. This has to do with how the algorithm is trained: in an already unequal system (where, for instance, there are few women in top managerial positions, and more African-Americans in jail than white Americans), a recommender system will by design amplify such inequality.

F: Recommender systems are part of the problem, and they make everything worse. But the origin of the problem lies somewhere else, I suspect. 

Yep. The problem with recommender systems goes even deeper. I would rather connect it to the problem of privacy. A recommender system only works if it knows its audience, and these systems are so powerful precisely because they know everything about us.
We don't have any privacy anymore. Online players know exactly who we are; our lives are transparent to both corporations and governments. For an excellent analysis of this, read Snowden's book “Permanent Record”. I highly recommend it.

F: The pun was intended wasn’t it?

With all this information about us, we are put into “categories” for specific purposes: selling us products, influencing our vote. They target us with ads aimed at our specific category, and this generates more discussion and more content on our social media. Recommender systems amplify the targeting by design. They would be much less effective, and much less dangerous, in a world where our lives are private. 

F: Social media platforms base their whole business model on “knowing us”. The business model itself is problematic.

As we said in the previous episode, the internet has become centralised, with a handful of platforms controlling most of the traffic. In some countries like Myanmar, internet access itself is provided and controlled by Facebook. 

F: Chiara, where’s Myanmar?

In South-East Asia, between India and Thailand.
In effect, the forum for public discourse and the available space for freedom of speech are now bounded by the profit motives of powerful private companies. Due to technical complexity, or on the grounds of commercial secrecy, such companies decline to explain how decisions are made. Mostly, they make decisions via recommender algorithms, which amplify bias and segregation. And at the same time, the few major platforms, with their extraordinary reach, offer an easy target for people seeking to use the system for malicious ends.

Conclusion

This is our call to all data scientists out there. Be aware of personalisation when building recommender systems. Personalising is not always beneficial. There are a few cases where it is, e.g. medicine, genetics, drug discovery, and many other cases where it is detrimental, e.g. news, consumer products/services, opinions.
Personalisation by algorithm, and in particular of the news, leads to a fragmentation of reality that undermines democracy. Collectively we need to push for reining in targeted advertising, and the path to this leads through stricter rules on privacy. As long as we are completely transparent to commercial and governmental players, as we are today, we are vulnerable to lies, misdirection and manipulation.
As Christopher Wylie (the Cambridge Analytica whistleblower) eloquently said, it’s like going on a date, where you know nothing about the other person, but they know absolutely everything about you.
We are left without agency, and without real choice.
In other words, we are f*cked.

References

Black Lives Matter / Internet Research Agency (IRA) articles:

http://faculty.washington.edu/kstarbi/Stewart_Starbird_Drawing_the_Lines_of_Contention-final.pdf

https://medium.com/s/story/the-trolls-within-how-russian-information-operations-infiltrated-online-communities-691fb969b9e4


https://faculty.washington.edu/kstarbi/BLM-IRA-Camera-Ready.pdf

IRA tactics:
https://int.nyt.com/data/documenthelper/533-read-report-internet-research-agency/7871ea6d5b7bedafbf19/optimized/full.pdf#page=1

https://int.nyt.com/data/documenthelper/534-oxford-russia-internet-research-agency/c6588b4a7b940c551c38/optimized/full.pdf#page=1

EDPS report
https://edps.europa.eu/sites/edp/files/publication/18-03-19_online_manipulation_en.pdf

Johnson et al.  “Population polarization dynamics and next-generation social media algorithms” https://arxiv.org/abs/1712.06009

Dec 11, 2019
The dark side of AI: social media and the optimization of addiction (Ep. 89)
22:45

Chamath Palihapitiya, former Vice President of User Growth at Facebook, was giving a talk at Stanford University, when he said this:
“I feel tremendous guilt. The short-term, dopamine-driven feedback loops that we have created are destroying how society works.”

He was referring to how social media platforms leverage our neurological build-up in the same way slot machines and cocaine do, to keep us using their products as much as possible. They turn us into addicts.

 

F: how many times do you check your Facebook in a day?

I am not a fan of Facebook. I do not have it on my phone.  Still, I check it in the morning on my laptop, and maybe twice more per day. I have a trick though: I do not scroll down. I only check the top bar to see if someone has invited me to an event, or contacted me directly. But from time to time, this resolution of mine slips, and I catch myself scrolling down, without even realising it!

 

F: is it the first thing you check when you wake up?

No because usually I have a message from you!! :) But yes, while I have my coffee I do a sweep on Facebook and twitter and maybe Instagram, plus the news.

 

F: Check how much time you spend on Facebook

And then add to that your email, twitter, reddit, youtube, instagram, etc. (all viable channels for ads to reach you)

We have an answer. More on that later. 
Clearly in this episode there is some form of addiction we would like to talk about. So let’s start from the beginning: how does addiction work?

Dopamine is a hormone produced by our body, and in the brain it works as a neurotransmitter, a chemical that neurons use to transmit signals to each other. One of the main functions of dopamine is to shape the “reward-motivated behaviour”: this is the way our brain learns through association, positive reinforcement, incentives, and positively-valenced emotions, in particular, pleasure. In other words, it makes our brain desire more of the things that make us feel good. These things can be for example good food, sex, and crucially, good social interactions, like hugging your friends or your baby, or having a laugh together. Because we are evolved to be social animals with complex social structures, successful social interactions are an evolutionary advantage, and therefore they trigger dopamine release in our brain, which makes us feel good, and reinforces the association between the action and the reward. This feeling motivates us to repeat the behaviour.

 

F: now that you mention reinforcement, I recall that this mechanism is so powerful and effective that we have actually been inspired by nature and replicated it in silico with reinforcement learning. The idea is to motivate an agent (eventually creating an addictive pattern) to follow what is called the optimal policy, by giving it positive rewards when things go as planned and punishing it when they don't.

In our brain, every time an action produces a reward, the connection between action and reward becomes stronger. Through reinforcement, a baby learns to distinguish a cat from a dog, or that fire hurts (that was me).

 

F: and so this means that all the social interactions people get from social media platforms are in fact doing the same, right? 

Yes, but with a difference: the smartphones in our pockets keep us connected to an unlimited reserve of constant social interactions. This constant flux of notifications - the rewards - floods our brain with dopamine. The mechanism of reinforcement can spin out of control. The reward pathways in our brain can malfunction, and this leads to addiction.

 

F: you are saying that social media has LITERALLY the effect of a drug? 

Yes. In fact, social media platforms are DESIGNED to exploit the rewards systems in our brain. They are designed to work like a drug.
Have you been to a casino and played roulette or the slot machines? 

 

F: ...maybe?

Why is it fun to play roulette? The fun comes from the WAIT before the reward. You put a chip on a number, you don’t know how it’s going to go. You wait for the ball to spin, you get excited. And from time to time, BAM! Your number comes out. Now, compare this with posting something on facebook. You write a message into the void, wait…. And then the LIKES start coming in. 

 

F:  yeah i find that familiar... 

Contrary to the casino, social media platforms do not want our money; in fact, they are free. What they want, and what we are paying with, is our time. The longer we stay on, the longer they can show us ads, and the more money advertisers pay them. This is no accident, this is the business model. But asking for our time out loud would not work; we would probably not consciously give it to them. So, like a casino, they make it hard for us to get off once we are on: they make us crave the likes, the right-swipes, the retweets, the subscriptions. So we check in, we stay on, we keep scrolling, because we hope to get those rewards. The short-term satisfaction of getting a “like” is a little boost of dopamine in our brain. We get used to it, and we want more.

 

F: a lot of machine learning is also being deployed to amplify this form of addiction and make it... well, more addictive :) But the question is: how much of this effectiveness comes from the algorithms, and how much simply from humans being wired to respond to such dynamics? In other words: are we essentially flawed, or are these algorithms truly powerful?

It is not a flaw, it’s a feature. The way our brain has evolved has been in response to very specific needs. In particular for this conversation, our brain is wired to favour social interactions, because it is an evolutionary advantage. These algorithms exploit these features of the brain on purpose, they are designed to exploit them. 

 

F: I believe so, but I also believe that the human brain is a powerful machine, so it should be able to predict what satisfaction it can get from social media. So how does it happen that we become addicted?

An example of an optimisation strategy that social media platforms use is based on the principle of “reward prediction error coding”. Our brain learns to find patterns in data - a basic survival skill - and therefore learns when to expect a reward for a given set of actions. I eat cake, therefore I am happy. Every time.
Imagine we have learnt through experience that, when we play the slot machines in a casino, we win some money once every 100 pulls of the lever. The difference between predicted and received rewards is then a known, fixed quantity, and just after winning once we have almost zero incentive to play again. So the casino fixes the slot machines to introduce a random element in the timing of the reward. Suddenly our prediction error increases substantially. In this margin of error, in the time between the action (pulling the lever) and the reward (maybe), our brain has time to anticipate the result and get excited at the possibility, and this releases dopamine. Playing in itself becomes a reward.
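A toy simulation (my own illustration, not a model of any real platform) shows the difference: a learner that has figured out a fixed payout schedule stops being surprised, while a randomised schedule with the same average payout keeps the prediction error, and hence the anticipation, alive.

import numpy as np

rng = np.random.default_rng(0)
n_pulls, period = 10_000, 100

# Fixed schedule: a win exactly every 100 pulls.
fixed = np.zeros(n_pulls)
fixed[period - 1::period] = 1.0
# Variable schedule: same average payout (1%), but random timing.
variable = (rng.random(n_pulls) < 1.0 / period).astype(float)

def mean_prediction_error(rewards, period=100):
    # A learner that has figured out the average schedule: it predicts
    # a win exactly `period` pulls after the last one.
    errors, since_last_win = [], 0
    for r in rewards:
        since_last_win += 1
        predicted = 1.0 if since_last_win == period else 0.0
        errors.append(abs(r - predicted))
        if r > 0:
            since_last_win = 0
    return np.mean(errors)

print("fixed schedule   :", mean_prediction_error(fixed))     # 0.0: no surprise left
print("variable schedule:", mean_prediction_error(variable))  # stays well above zero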

F: There is an equivalent in reinforcement learning called grid world, which consists of a mouse trying to reach the cheese in a maze. In reinforcement learning, everything works smoothly as long as the cheese stays in the same place.

Exactly! Now social media apps implement an equivalent trick, called “variable reward schedules”.

In our brain, after an action we get a reward or punishment, and we generate positive or negative feedback to that action.
Social media apps optimise their algorithms for the ideal balance of negative and positive feedback in our brains, caused by the difference between predicted and received rewards.

If we perceive a reward to be delivered at random, and - crucially - if checking for the reward comes at little cost, like opening the Facebook app, we end up checking for rewards all the time. Every time we are just a little bit bored, without even thinking, we check the app. The Facebook reward system (the schedule and triggers of notification and likes) has been optimised to maximise this behaviour. 

 

F: are you saying that buffering some likes, and then finding the right moment to show them to the user, can make the user crave the reward?

Oh yes. Instagram will withhold likes for a period of time, causing a dip in reward compared to the expected level. It will then deliver them later in larger bundles, boosting the reward above the expected value, which triggers extra dopamine release and sends us on a high akin to a cocaine hit.

 

F: Dear audience, do you remember my question? How much time do each of you spend on social media (or similar) in a day? And why do we still do it?

The fundamental feature here is how low the perceived cost of checking for the reward is: I just need to open the app. We perceive this cost as minimal, so we don't even think about it. YouTube, for instance, has the autoplay feature, so you need to do absolutely nothing to remain on the app. But the cost is cumulative over time; it becomes hours in our day, days in a month, years in our lives!! Two hours of social media per day amounts to one month per year.

 

F: But it’s so EASY, it has become so natural to use social media for everything. To use Google for everything.

The convenience that the platforms give us is one of the most dangerous things about them, and not only for our individual lives. The convenience of reaching so many users, together with the business model of monetising attention, is one of the causes of the centralisation of the internet, i.e. the fact that a few giant platforms control most of the internet traffic. Revenue from ads is concentrated on the big platforms, and content creators have no choice but to use them if they want to be competitive. The internet went from looking like a distributed network to a centralised network. And this in turn causes data to be centralised, in a self-reinforcing loop. Most human conversations and interactions pass through the servers of a handful of private corporations.



Conclusion

As Data scientists we should be aware of this (and we think mostly we are). We should also be ethically responsible. I think that being a data scientist no longer has a neutral connotation. Algorithms have this huge power of manipulating human behaviour, and let’s be honest, we are the only ones who really understand how they work. So we have a responsibility here. 

There are some organisations, like Data For Democracy for example, who are advocating for something equivalent to the Hippocratic Oath for data scientists. Do no harm.  

 

References

Dopamine reward prediction error coding https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4826767/

Dopamine, Smartphones & You: A battle for your time http://sitn.hms.harvard.edu/flash/2018/dopamine-smartphones-battle-time/

Reward system https://en.wikipedia.org/wiki/Reward_system

Data for Democracy https://datafordemocracy.org

Dec 03, 2019
More powerful deep learning with transformers (Ep. 84) (Rebroadcast)
37:44

Some of the most powerful NLP models like BERT and GPT-2 have one thing in common: they all use the transformer architecture.
Such architecture is built on top of another important concept already known to the community: self-attention.
In this episode I explain what these mechanisms are, how they work and why they are so powerful.
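As a teaser, here is a minimal numpy sketch of the scaled dot-product self-attention at the heart of the transformer (toy dimensions and random weights, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))            # token representations
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
# Every token attends to every other token, weighted by similarity.
weights = softmax(Q @ K.T / np.sqrt(d_k))          # (seq_len, seq_len)
output = weights @ V                               # contextualised tokens

print(weights.round(2))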

Don't forget to subscribe to our Newsletter or join the discussion on our Discord server

 

References

Nov 27, 2019
How to improve the stability of training a GAN (Ep. 88)
28:20

Generative Adversarial Networks, or GANs, are very powerful tools to generate data. However, training a GAN is not easy. More specifically, GANs suffer from three major issues: instability of the training procedure, mode collapse and vanishing gradients.

 

In this episode I explain not only the most challenging issues one encounters while designing and training Generative Adversarial Networks, but also some methods and architectures to mitigate them. In addition, I elucidate three specific strategies that researchers are considering to improve the accuracy and reliability of GANs.

 

The most tedious issues of GANs

 

Convergence to equilibrium

 

A typical GAN is formed by at least two networks: a generator G and a discriminator D. The generator's task is to generate samples from random noise, while the discriminator has to learn to distinguish fake samples from real ones. While it is theoretically possible for generator and discriminator to converge to a Nash equilibrium (at which both networks are in their optimal state), reaching such an equilibrium is not easy.
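To see what this two-player game looks like in practice, here is a heavily simplified PyTorch sketch of the adversarial training loop, with toy 1-D data and made-up layer sizes, purely for illustration:

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))    # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(64, 1) * 2 + 3            # toy "real" data: N(3, 2)
for step in range(1000):
    # 1) Train D to tell real samples from fake ones.
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train G to fool D.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))  # G wants D to answer "real"
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift towards ~3

The Nash equilibrium would be the point where neither loss can improve unilaterally; in practice the two optimisers keep chasing each other and may oscillate instead of converging.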

 

Vanishing gradients

 

Moreover, a very accurate discriminator pushes the loss towards lower and lower values. This, in turn, might cause the gradients that reach the generator to vanish and the entire network to stop learning completely.

 

Mode collapse

 

Another phenomenon that is easy to observe when dealing with GANs is mode collapse: the inability of the model to generate diverse samples. The generated data become more and more similar to one another, until the entire generated dataset is concentrated around a particular statistical mode.

 

The solution

 

Researchers have taken into consideration several approaches to overcome such issues. They have been playing with architectural changes, different loss functions and game theory.

 

Listen to the full episode to know more about the most effective strategies to build GANs that are reliable and robust.
Don't forget to join the conversation on our new Discord channel. See you there!

 

Nov 18, 2019
What if I train a neural network with random data? (with Stanisław Jastrzębski) (Ep. 87)
19:37

What happens to a neural network trained with random data?

Are massive neural networks just lookup tables or do they truly learn something? 

Today’s episode will be about memorisation and generalisation in deep learning, with Stanislaw Jastrzębski from New York University.

Stan spent two summers as a visiting student with Prof. Yoshua Bengio and has been working on 

  • Understanding and improving how deep networks generalise
  • Representation Learning
  • Natural Language Processing
  • Computer Aided Drug Design

 

What makes deep learning unique?

I asked him a few questions whose answers I had been looking for for a long time. For instance, what does deep learning bring to the table that other methods don't, or can't?
Stan believes that the one thing that makes deep learning special is representation learning. All the other competing methods, be they kernel machines or random forests, do not have this capability. Moreover, optimisation (SGD) lies at the heart of representation learning, in the sense that it is what allows finding good representations.

 

What really improves the training quality of a neural network?

We discussed how the accuracy of neural networks depends pretty much on how good the Stochastic Gradient Descent method is at finding minima of the loss function. What influences such minima?
Stan's answer revealed that training set accuracy, or loss value, is actually not that interesting. It is relatively easy to overfit data (i.e. achieve the lowest loss possible), provided a large enough network and a large enough computational budget. However, the shape of the minima and the performance on validation sets are, in a quite fascinating way, influenced by optimisation.
Optimisation at the beginning of the trajectory steers the trajectory towards minima with properties that go much further than just training accuracy.

As always we spoke about the future of AI and the role deep learning will play.

I hope you enjoy the show!

Don't forget to join the conversation on our new Discord channel. See you there!

 

References

 

Homepage of Stanisław Jastrzębski https://kudkudak.github.io/

A Closer Look at Memorization in Deep Networks https://arxiv.org/abs/1706.05394

Three Factors Influencing Minima in SGD https://arxiv.org/abs/1711.04623

Don't Decay the Learning Rate, Increase the Batch Size https://arxiv.org/abs/1711.00489

Stiffness: A New Perspective on Generalization in Neural Networks https://arxiv.org/abs/1901.09491

Nov 12, 2019
Deeplearning is easier when it is illustrated (with Jon Krohn) (Ep. 86)
44:53
In this episode I speak with Jon Krohn, author of Deep Learning Illustrated, a book that makes deep learning easier to grasp.
We also talk about some important guidelines to take into account whenever you implement a deep learning model, how to deal with bias in machine learning used to match jobs to candidates and the future of AI. 
 
 
You can purchase the book from informit.com/dsathome with code DSATHOME and get 40% off books/eBooks and 60% off video training
Nov 05, 2019
[RB] How to generate very large images with GANs (Ep. 85)
14:41

Join the discussion on our Discord server

In this episode I explain how a research group from the University of Lübeck overcame the curse of dimensionality for the generation of large medical images with GANs.
The problem is not as trivial as it seems. Many researchers have failed in generating large images with GANs before. One interesting application of such an approach is in medicine, for the generation of CT and X-ray images.
Enjoy the show!

 

References

Multi-scale GANs for Memory-efficient Generation of High Resolution Medical Images https://arxiv.org/abs/1907.01376

Nov 04, 2019
More powerful deep learning with transformers (Ep. 84)
37:44

Some of the most powerful NLP models like BERT and GPT-2 have one thing in common: they all use the transformer architecture.
Such architecture is built on top of another important concept already known to the community: self-attention.
In this episode I explain what these mechanisms are, how they work and why they are so powerful.

Don't forget to subscribe to our Newsletter or join the discussion on our Discord server

 

References

Oct 27, 2019
[RB] Replicating GPT-2, the most dangerous NLP model (with Aaron Gokaslan) (Ep. 83)
37:47

In this episode, I am with Aaron Gokaslan, computer vision researcher and AI Resident at Facebook AI Research. Aaron is the author of OpenGPT-2, a parallel NLP model to the much-discussed version that OpenAI decided not to release because it was considered too accurate to be published.

We discuss image-to-image translation, the dangers of the GPT-2 model and the future of AI.
Moreover, Aaron provides some very interesting links and demos that will blow your mind!

Enjoy the show! 

References

Multimodal image to image translation (not all mentioned in the podcast but recommended by Aaron)

Pix2Pix: 
 
CycleGAN:
 
GANimorph
 
Oct 18, 2019
What is wrong with reinforcement learning? (Ep. 82)
21:48

After reinforcement learning agents did so well at playing Atari video games and Go (AlphaGo), at financial trading and at language modeling, let me tell you the real story here.
In this episode I want to shine some light on reinforcement learning (RL) and the limitations that every practitioner should consider before taking certain directions. RL seems to work so well! What is wrong with it?

 

Are you a listener of Data Science at Home podcast?
A reader of the Amethix Blog? 
Or did you subscribe to the Artificial Intelligence at your fingertips newsletter?
In any case let’s stay in touch! 
https://amethix.com/survey/

 

 

References

Oct 15, 2019
Have you met Shannon? Conversation with Jimmy Soni and Rob Goodman about one of the greatest minds in history (Ep. 81)
32:21

In this episode I have an amazing conversation with Jimmy Soni and Rob Goodman, authors of “A Mind at Play”, a book entirely dedicated to the life and achievements of Claude Shannon. Claude Shannon needs no introduction. But for those who need a refresher: Shannon is the inventor of the information age.

Have you heard of binary code, entropy in information theory, data compression theory (the stuff behind mp3, mpg, zip, etc.), error correcting codes (the stuff that makes your RAM work well), n-grams, block ciphers, the beta distribution, the uncertainty coefficient?

All that stuff has been invented by Claude Shannon :) 

 
Articles: 
 
Claude's papers:
 
A mind at play (book links): 
Oct 10, 2019
Attacking machine learning for fun and profit (with the authors of SecML Ep. 80)
34:04

As ML plays a more and more relevant role in many domains of everyday life, it's quite obvious to see more and more attacks against ML systems. In this episode we talk about the most popular attacks against machine learning systems and some mitigations designed by researchers Ambra Demontis and Marco Melis from the University of Cagliari (Italy). The guests are also the authors of SecML, an open-source Python library for the security evaluation of Machine Learning (ML) algorithms. Both Ambra and Marco are members of the PRAlab research group, under the supervision of Prof. Fabio Roli.
 

SecML Contributors

Marco Melis (Ph.D Student, Project Maintainer, https://www.linkedin.com/in/melismarco/)
Ambra Demontis (Postdoc, https://pralab.diee.unica.it/it/AmbraDemontis) 
Maura Pintor (Ph.D Student, https://it.linkedin.com/in/maura-pintor)
Battista Biggio (Assistant Professor, https://pralab.diee.unica.it/it/BattistaBiggio)

References

SecML: an open-source Python library for the security evaluation of Machine Learning (ML) algorithms https://secml.gitlab.io/.

Demontis et al., “Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks,” presented at the 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 321–338. https://www.usenix.org/conference/usenixsecurity19/presentation/demontis

W. Koh and P. Liang, “Understanding Black-box Predictions via Influence Functions,” in International Conference on Machine Learning (ICML), 2017. https://arxiv.org/abs/1703.04730

Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli, “Is Deep Learning Safe for Robot Vision? Adversarial Examples Against the iCub Humanoid,” in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 2017, pp. 751–759. https://arxiv.org/abs/1708.06939

Biggio and F. Roli, “Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning,” Pattern Recognition, vol. 84, pp. 317–331, 2018. https://arxiv.org/abs/1712.03141

Biggio et al., “Evasion attacks against machine learning at test time,” in Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Part III, 2013, vol. 8190, pp. 387–402. https://arxiv.org/abs/1708.06131

Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” in 29th Int’l Conf. on Machine Learning, 2012, pp. 1807–1814. https://arxiv.org/abs/1206.6389

Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma, “Adversarial classification,” in Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Seattle, 2004, pp. 99–108. https://dl.acm.org/citation.cfm?id=1014066

Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. "Axiomatic attribution for deep networks." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017. https://arxiv.org/abs/1703.01365 

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-agnostic interpretability of machine learning." arXiv preprint arXiv:1606.05386 (2016). https://arxiv.org/abs/1606.05386

Guo, Wenbo, et al. "Lemna: Explaining deep learning based security applications." Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018. https://dl.acm.org/citation.cfm?id=3243792

Bach, Sebastian, et al. "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation." PloS one 10.7 (2015): E0130140. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140 

Oct 01, 2019
[RB] How to scale AI in your organisation (Ep. 79)
13:21

Scaling technology and scaling business processes are not the same thing. Since the beginning of enterprise technology, scaling software has been a difficult task to get right inside large organisations. When it comes to Artificial Intelligence and Machine Learning, it becomes vastly more complicated.

In this episode I propose a framework - in five pillars - for the business side of artificial intelligence.

 

Sep 26, 2019
Replicating GPT-2, the most dangerous NLP model (with Aaron Gokaslan) (Ep. 78)
37:47

In this episode, I am with Aaron Gokaslan, computer vision researcher and AI Resident at Facebook AI Research. Aaron is the author of OpenGPT-2, a parallel NLP model to the much-discussed version that OpenAI decided not to release because it was considered too accurate to be published.

We discuss image-to-image translation, the dangers of the GPT-2 model and the future of AI.
Moreover, Aaron provides some very interesting links and demos that will blow your mind!

Enjoy the show! 

References

Multimodal image to image translation (not all mentioned in the podcast but recommended by Aaron)

Pix2Pix: 
 
CycleGAN:
 
GANimorph
 
Sep 23, 2019
Training neural networks faster without GPU [RB] (Ep. 77)
22:21

Training neural networks faster usually involves the usage of powerful GPUs. In this episode I explain an interesting method from a group of researchers at Google Brain, who can train neural networks faster by reusing data that has already been prepared, keeping the accelerator busy and the training pipeline dense.
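The core trick, called data echoing in the paper, can be sketched in a few lines (a toy illustration of the idea, not the paper's actual pipeline): instead of discarding a batch after a single step, reuse it while the slow input pipeline prepares the next one.

import time

def slow_data_pipeline(n_batches=3):
    # Stand-in for an input pipeline that is slower than the model.
    for i in range(n_batches):
        time.sleep(0.1)              # pretend preprocessing is expensive
        yield f"batch-{i}"

def echoed_batches(batch_source, echo_factor=2):
    # Data echoing: yield every batch `echo_factor` times, so the
    # accelerator keeps working while the next batch is being prepared.
    for batch in batch_source:
        for _ in range(echo_factor):
            yield batch

for batch in echoed_batches(slow_data_pipeline(), echo_factor=2):
    print("training step on", batch)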

Enjoy the show!

 

References

Faster Neural Network Training with Data Echoing
https://arxiv.org/abs/1907.05550

Sep 17, 2019
How to generate very large images with GANs (Ep. 76)
14:41

In this episode I explain how a research group from the University of Lübeck overcame the curse of dimensionality for the generation of large medical images with GANs.
The problem is not as trivial as it seems. Many researchers have failed in generating large images with GANs before. One interesting application of such an approach is in medicine, for the generation of CT and X-ray images.
Enjoy the show!

 

References

Multi-scale GANs for Memory-efficient Generation of High Resolution Medical Images https://arxiv.org/abs/1907.01376

Sep 06, 2019
[RB] Complex video analysis made easy with Videoflow (Ep. 75)
30:42

In this episode I am with Jadiel de Armas, senior software engineer at Disney and author of Videoflow, a Python framework that facilitates the quick development of complex video analysis applications and other series-processing based applications in a multiprocessing environment.

I have inspected the videoflow repo on Github and some of the capabilities of this framework, and I must say that it's really interesting. Jadiel is going to tell us a lot more than what you can read on Github.

 

References

Videoflow Github official repository
https://github.com/videoflow/videoflow

 

Aug 29, 2019
[RB] Validate neural networks without data with Dr. Charles Martin (Ep. 74)
44:46

In this episode, I am with Dr. Charles Martin from Calculation Consulting, a machine learning and data science consulting company based in San Francisco. We speak about the nuts and bolts of deep neural networks and some impressive findings about the way they work.

The questions that Charles answers in the show are essentially two:

  1. Why is regularisation in deep learning seemingly quite different from regularisation in other areas of ML?

  2. How can we dominate DNNs in a theoretically principled way?

 

References 

Aug 27, 2019
How to cluster tabular data with Markov Clustering (Ep. 73)
20:43

In this episode I explain how a community detection algorithm known as Markov clustering can be constructed by combining simple concepts like random walks, graphs and similarity matrices. Moreover, I highlight how one can build a similarity graph and then run a community detection algorithm on that graph to find clusters in tabular data.
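As a rough sketch of the idea (a toy illustration, separate from the snippet on the Amethix Blog): Markov clustering alternates an expansion step (longer random walks, i.e. matrix powers) and an inflation step (element-wise powers that sharpen the walk probabilities) until clusters emerge.

import numpy as np

# Adjacency matrix of a toy similarity graph with two obvious communities.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

def mcl(A, expansion=2, inflation=2.0, iters=50):
    M = A + np.eye(len(A))                 # add self-loops
    M = M / M.sum(axis=0)                  # column-normalise: random-walk matrix
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)   # expansion: longer walks
        M = M ** inflation                         # inflation: sharpen
        M = M / M.sum(axis=0)
    return M

M = mcl(A)
# Nodes attracted to the same row end up in the same cluster.
clusters = {}
for node, attractor in enumerate(M.argmax(axis=0)):
    clusters.setdefault(attractor, []).append(node)
print(list(clusters.values()))   # e.g. [[0, 1, 2], [3, 4, 5]]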

You can find a simple hands-on code snippet to play with on the Amethix Blog 

Enjoy the show! 

 

References

[1] S. Fortunato, “Community detection in graphs”, Physics Reports, volume 486, issues 3-5, pages 75-174, February 2010.

[2] Z. Yang, et al., “A Comparative Analysis of Community Detection Algorithms on Artificial Networks”, Scientific Reports volume 6, Article number: 30750 (2016)

[3] S. Dongen, “A cluster algorithm for graphs”, Technical Report, CWI (Centre for Mathematics and Computer Science) Amsterdam, The Netherlands, 2000.

[4] A. J. Enright, et al., “An efficient algorithm for large-scale detection of protein families”, Nucleic Acids Research, volume 30, issue 7, pages 1575-1584, 2002.

Aug 20, 2019
Waterfall or Agile? The best methodology for AI and machine learning (Ep. 72)
14:26

The two most widely considered software development models in modern project management are, without any doubt, the Waterfall Methodology and the Agile Methodology. In this episode I make a comparison between the two and explain what I believe is the best choice for your machine learning project.

An interesting post to read (mentioned in the episode) is How businesses can scale Artificial Intelligence & Machine Learning https://amethix.com/how-businesses-can-scale-artificial-intelligence-machine-learning/

Aug 14, 2019
Training neural networks faster without GPU (Ep. 71)
22:21

Training neural networks faster usually involves the usage of powerful GPUs. In this episode I explain an interesting method from a group of researchers at Google Brain, who can train neural networks faster by reusing data that has already been prepared, keeping the accelerator busy and the training pipeline dense.

Enjoy the show!

 

References

Faster Neural Network Training with Data Echoing
https://arxiv.org/abs/1907.05550

Aug 06, 2019
Validate neural networks without data with Dr. Charles Martin (Ep. 70)
44:46

In this episode, I am with Dr. Charles Martin from Calculation Consulting, a machine learning and data science consulting company based in San Francisco. We speak about the nuts and bolts of deep neural networks and some impressive findings about the way they work.

The questions that Charles answers in the show are essentially two:

  1. Why is regularisation in deep learning seemingly quite different from regularisation in other areas of ML?

  2. How can we dominate DNNs in a theoretically principled way?

 

References 

 

 
Jul 23, 2019
Complex video analysis made easy with Videoflow (Ep. 69)
30:42

In this episode I am with Jadiel de Armas, senior software engineer at Disney and author of Videoflow, a Python framework that facilitates the quick development of complex video analysis applications and other series-processing based applications in a multiprocessing environment.

I have inspected the videoflow repo on Github and some of the capabilities of this framework, and I must say that it's really interesting. Jadiel is going to tell us a lot more than what you can read on Github.

 

References

Videoflow Github official repository
https://github.com/videoflow/videoflow

 

Jul 16, 2019
Episode 68: AI and the future of banking with Chris Skinner [RB]
41:42

In this episode I have a wonderful conversation with Chris Skinner.

Chris and I recently got in touch at The Banking Scene 2019, a fintech conference held in Brussels. During that conference he talked as a real troublemaker - that's how he defines himself - saying that “People are not educated with loans, credit, money” and that “Banks are failing at digital”.

After I got my hands on his last book Digital Human, I invited him to the show to ask him a few questions about innovation, regulation and technology in finance.

Jul 09, 2019
Episode 67: Classic Computer Science Problems in Python
28:35

Today I am with David Kopec, author of Classic Computer Science Problems in Python, published by Manning Publications.

His book deepens your knowledge of problem solving techniques from the realm of computer science by challenging you with interesting and realistic scenarios, exercises, and of course algorithms.
There are examples in the major topics any data scientist should be familiar with, for example search, clustering, graphs, and much more.

Get the book from https://www.manning.com/books/classic-computer-science-problems-in-python and use coupon code poddatascienceathome19 to get 40% discount.

 

References

Twitter https://twitter.com/davekopec

GitHub https://github.com/davecom

classicproblems.com

Jul 02, 2019
Episode 66: More intelligent machines with self-supervised learning
18:56

In this episode I talk about a new paradigm of learning, which might seem a bit blurry and not really different from the methods we already know, such as supervised and unsupervised learning. The method I introduce here is called self-supervised learning.

Enjoy the show!

 

Don't forget to subscribe to our Newsletter at amethix.com and get the latest updates in AI and machine learning. We do not spam. Promise!

 

References

Deep Clustering for Unsupervised Learning of Visual Features

Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

 

Jun 25, 2019
Episode 65: AI knows biology. Or does it?
12:14

The successes of deep learning for text analytics, also introduced in a recent post about sentiment analysis published here, are undeniable. Many other tasks in NLP have also benefitted from the superiority of deep learning methods over more traditional approaches. Such extraordinary results have also been made possible by the neural network approach to learning meaningful character and word embeddings: representation spaces in which semantically similar objects are mapped to nearby vectors.
All this is strictly related to a field one might initially find disconnected or off-topic: biology.

 


Don't forget to subscribe to our Newsletter at amethix.com and get the latest updates in AI and machine learning. We do not spam. Promise!


 

References

[1] Rives A., et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”, biorxiv, doi: https://doi.org/10.1101/622803

[2] Vaswani A., et al., “Attention is all you need”, Advances in neural information processing systems, pp. 5998–6008, 2017.

[3] Bahdanau D., et al., “Neural machine translation by jointly learning to align and translate”, arXiv, http://arxiv.org/abs/1409.0473.

Jun 23, 2019
Episode 64: Get the best shot at NLP sentiment analysis
12:58

The rapid diffusion of social media like Facebook and Twitter, and the massive use of different types of forums like Reddit, Quora, etc., is producing an impressive amount of text data every day. 

There is one specific activity that many business owners have been contemplating over the last five years: identifying the social sentiment of their brand by analysing the conversations of their users.

In this episode I explain how one can get the best shot at classifying sentences with deep learning and word embedding.

 

 

Additional material

Schematic representation of how to learn a word embedding matrix E by training a neural network that, given the previous M words, predicts the next word in a sentence. 
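A minimal Keras sketch of that schematic (toy data and made-up sizes, my own illustration): the weights of the Embedding layer are precisely the matrix E we are after.

import numpy as np
import tensorflow as tf

vocab_size, embed_dim, M = 50, 16, 3     # M = number of previous words

# Toy training data: windows of M word ids -> id of the next word.
rng = np.random.default_rng(0)
contexts = rng.integers(0, vocab_size, size=(1000, M))
next_word = rng.integers(0, vocab_size, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, input_length=M),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(contexts, next_word, epochs=2, verbose=0)

E = model.layers[0].get_weights()[0]     # the learned embedding matrix E
print(E.shape)                           # (vocab_size, embed_dim)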

 


 

 

Word2Vec example source code

https://gist.github.com/rlangone/ded90673f65e932fd14ae53a26e89eee#file-word2vec_example-py

 

 

References

[1] Mikolov, T. et al., "Distributed Representations of Words and Phrases and their Compositionality", Advances in Neural Information Processing Systems 26, pages 3111-3119, 2013.

[2] The Best Embedding Method for Sentiment Classification, https://medium.com/@bramblexu/blog-md-34c5d082a8c5

[3] The state of sentiment analysis: word, sub-word and character embedding 
https://amethix.com/state-of-sentiment-analysis-embedding/

 

Jun 14, 2019
Episode 63: Financial time series and machine learning
21:08

In this episode I speak to Alexandr Honchar, data scientist and owner of the blog https://medium.com/@alexrachnog
Alexandr has written very interesting posts about time series analysis for financial data. His blog is on my personal list of the best tutorial blogs.

We discuss financial time series and machine learning, what makes predicting the price of stocks a very challenging task, and why machine learning might not be enough.
As usual, I ask Alexandr how he sees machine learning in the next 10 years. His answer - in my opinion quite futuristic - makes perfect sense.

You can contact Alexandr on

 

Enjoy the show!

 

Jun 04, 2019
Episode 62: AI and the future of banking with Chris Skinner
42:03

In this episode I have a wonderful conversation with Chris Skinner.

Chris and I recently got in touch at The Banking Scene 2019, a fintech conference held in Brussels. During that conference he talked as a real troublemaker - that's how he defines himself - saying that “People are not educated with loans, credit, money” and that “Banks are failing at digital”.

After I got my hands on his last book Digital Human, I invited him to the show to ask him a few questions about innovation, regulation and technology in finance.

May 28, 2019
Episode 61: The 4 best use cases of entropy in machine learning
21:35

It all starts from physics. The entropy of an isolated system never decreases… Everyone, at some point in their life, learned this in a physics class. What does this have to do with machine learning?
To find out, listen to the show.
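As a small teaser (my own example): Shannon entropy measures the uncertainty of a label distribution, and it is the quantity behind the information gain used to grow decision trees.

import numpy as np

def entropy(labels):
    # Shannon entropy H = -sum(p * log2(p)) of the label distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

print(entropy(["spam"] * 5 + ["ham"] * 5))   # 1.0 bit: maximum uncertainty
print(entropy(["spam"] * 9 + ["ham"] * 1))   # ~0.47 bits: much purer
print(entropy(["spam"] * 10))                # 0 bits: no uncertainty at all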

 

References

Entropy in machine learning 
https://amethix.com/entropy-in-machine-learning/

May 21, 2019
Episode 60: Predicting your mouse click (and a crash course in deeplearning)
39:50

Deep learning is the future. Get a crash course on deep learning. Now!
In this episode I speak to Oliver Zeigermann, author of Deep Learning Crash Course published by Manning Publications at https://www.manning.com/livevideo/deep-learning-crash-course

Oliver (Twitter: @DJCordhose) is a veteran of neural networks and machine learning. In addition to the course - that teaches you concepts from prototype to production - he's working on a really cool project that predicts something people do every day... clicking their mouse. 

If you use promo code poddatascathome19 you get a 40% discount for all products on the Manning platform

Enjoy the show!

 

References:

 

Deep Learning Crash Course (Manning Publications)

https://www.manning.com/livevideo/deep-learning-crash-course?a_aid=djcordhose&a_bid=e8e77cbf

 

Companion notebooks for the code samples of the video course "Deep Learning Crash Course"

https://github.com/DJCordhose/deep-learning-crash-course-notebooks/blob/master/README.md

 

Next-button-to-click predictor source code

https://github.com/DJCordhose/ux-by-tfjs

 

May 16, 2019
Episode 59: How to fool a smart camera with deep learning
24:11

In this episode I met three crazy researchers from KULeuven (Belgium) who found a method to fool surveillance cameras and stay hidden just by holding a special t-shirt. 
We discussed the technique they used and some of the consequences of their findings.

They published their paper on Arxiv and made their source code available at https://gitlab.com/EAVISE/adversarial-yolo

Enjoy the show!

 

References

Fooling automated surveillance cameras: adversarial patches to attack person detection 
Simen Thys, Wiebe Van Ranst, Toon Goedemé

 

Eavise Research Group KULeuven (Belgium)
https://iiw.kuleuven.be/onderzoek/eavise

May 07, 2019
Episode 58: There is physics in deep learning!
19:55

There is a connection between gradient descent based optimizers and the dynamics of damped harmonic oscillators. What does that mean? We now have a better theory for optimization algorithms.
In this episode I explain how all this works.

All the formulas I mention in the episode can be found in the post The physics of optimization algorithms
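In a nutshell, and as a toy sketch of the analogy (not the post's derivation): gradient descent with momentum on a quadratic loss behaves like a discretised damped harmonic oscillator, with the momentum coefficient controlling the damping.

import numpy as np

k = 1.0                # curvature of the loss f(x) = 0.5 * k * x**2 (the "spring")
lr, beta = 0.1, 0.9    # learning rate and momentum (the damping knob)

x, v = 5.0, 0.0
trajectory = []
for _ in range(100):
    grad = k * x               # gradient of the quadratic loss ("spring force")
    v = beta * v - lr * grad   # velocity update: friction plus force
    x = x + v                  # position update
    trajectory.append(x)

# With beta close to 1 the iterate overshoots and oscillates around the
# minimum like an underdamped mass on a spring; a smaller beta damps it faster.
print(np.round(trajectory[:5], 3), "...", round(trajectory[-1], 6))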

Enjoy the show.

 

Apr 30, 2019
Episode 57: Neural networks with infinite layers
16:19

How are differential equations related to neural networks? What are the benefits of re-thinking neural network as a differential equation engine? In this episode we explain all this and we provide some material that is worth learning. Enjoy the show!

 

[Figure: residual block]
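One way to see the connection (a sketch of the standard argument, with a toy dynamics function): a residual block computes x_{t+1} = x_t + f(x_t), which is exactly one Euler step of the differential equation dx/dt = f(x). A network with “infinite layers” is what you obtain as the step size goes to zero.

import numpy as np

def f(x):
    # The learned residual function; here a fixed toy dynamics for illustration.
    return -0.5 * x

x = np.array([4.0])
h, n_layers = 0.1, 50

# A stack of residual blocks = Euler integration with step size h.
for _ in range(n_layers):
    x = x + h * f(x)           # one "residual block" / one Euler step

# Compare with the exact ODE solution x(t) = x0 * exp(-0.5 t) at t = h * n_layers.
print(x, 4.0 * np.exp(-0.5 * h * n_layers))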

 

 

References

[1] K. He, et al., “Deep Residual Learning for Image Recognition”, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016

[2] S. Hochreiter, et al., “Long short-term memory”, Neural Computation 9(8), pages 1735-1780, 1997.

[3] Q. Liao, et al.,”Bridging the gaps between residual learning, recurrent neural networks and visual cortex”, arXiv preprint, arXiv:1604.03640, 2016.

[4] Y. Lu, et al., “Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equation”, Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.

[5] T. Q. Chen, et al., “Neural Ordinary Differential Equations”, Advances in Neural Information Processing Systems 31, pages 6571-6583, 2018.

Apr 23, 2019
Episode 56: The graph network
16:34

Since the beginning of AI in the 1950s and until the 1980s, symbolic AI approaches dominated the field. These approaches, also known as expert systems, used mathematical symbols to represent objects and the relationships between them, in order to depict the extensive knowledge bases built by humans.
The opposite of the symbolic AI paradigm is called connectionism, which is behind the machine learning approaches of today.

Apr 16, 2019
Episode 55: Beyond deep learning
17:23

The successes that deep learning systems have achieved in the last decade in all kinds of domains are unquestionable. Self-driving cars, skin cancer diagnostics, movie and song recommendations, language translation, automatic video surveillance, digital assistants represent just a few examples of the ongoing revolution that affects or is going to disrupt soon our everyday life.
But all that glitters is not gold…

Read the full post on the Amethix Technologies blog

Apr 09, 2019
Episode 54: Reproducible machine learning
11:50

In this episode I speak about how important reproducible machine learning pipelines are.
When you are collaborating with diverse teams, several tasks will be distributed among different individuals. Everyone will have good reasons to change parts of your pipeline, leading to confusion and to a number of variants that soon explodes.
In all those cases, tracking data and code is extremely helpful to build models that are reproducible anytime, anywhere.
Listen to the podcast and learn how.

 

Mar 09, 2019
Episode 53: Estimating uncertainty with neural networks
15:08

Have you ever wanted to get an estimate of the uncertainty of your neural network? Clearly Bayesian modelling provides a solid framework to estimate uncertainty by design. However, there are many realistic cases in which Bayesian sampling is not really an option and ensemble models can play a role.

In this episode I describe a simple yet effective way to estimate uncertainty, without changing your neural network’s architecture nor your machine learning pipeline at all.
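As a minimal sketch of the ensemble idea (my own toy example, separate from the sample code in the post): train several identical models from different random initialisations and read the spread of their predictions as uncertainty.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# Same architecture, different random seeds -> an ensemble of models.
ensemble = [
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=s).fit(X, y)
    for s in range(5)
]

X_test = np.array([[0.0], [6.0]])        # in-distribution vs far from the data
preds = np.stack([m.predict(X_test) for m in ensemble])
mean, std = preds.mean(axis=0), preds.std(axis=0)
print("mean:", mean)
print("std :", std)   # the spread is typically larger where data is scarce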

The post with mathematical background and sample source code is published here.

Jan 23, 2019
Episode 52: why do machine learning models fail? [RB]
15:58

The success of a machine learning model depends on several factors and events. True generalization to data that the model has never seen before is more a chimera than a reality. But under specific conditions a well trained machine learning model can generalize well and perform with testing accuracy that is similar to the one performed during training.

In this episode I explain when and why machine learning models fail from training to testing datasets.

Jan 17, 2019
Episode 51: Decentralized machine learning in the data marketplace (part 2)
23:08

In this episode I complete the explanation of the fitchain-Ocean Protocol integration, which allows secure on-premise compute to operate in the decentralized data marketplace designed by Ocean Protocol.

As mentioned in the show, this is a picture that provides a 10,000-feet view of the integration.

[Figure: SEA-ocean-fitchain.png]

 

I hope you enjoy the show!

Jan 08, 2019
Episode 50: Decentralized machine learning in the data marketplace
24:17

In this episode I briefly explain how two massive technologies have been merged in 2018 (work in progress :) - one providing secure machine learning on isolated data, the other implementing a decentralized data marketplace.

In this episode I explain:

  • How do we make machine learning decentralized and secure?
  • How can data owners keep their data private?
  • How can we benefit from blockchain technology for AI and machine learning?

 

I hope you enjoy the show!

 

References

fitchain.io decentralized machine learning

Ocean protocol decentralized data marketplace

Dec 26, 2018
Episode 49: The promises of Artificial Intelligence
21:00

It's always good to put all the findings in AI in perspective, in order to clear up some of the most common misunderstandings and overblown promises.
In this episode I make a list of some of the most misleading statements about what artificial intelligence can achieve in the near future.

Dec 19, 2018
Episode 48: Coffee, Machine Learning and Blockchain
28:48

In this episode - which I advise you to consume at night, in a quiet place - I speak about private machine learning and blockchain, while sipping a cup of coffee in my home office.
There are several reasons why I believe we should start thinking about private machine learning...
It doesn't really matter what approach becomes successful and gets adopted, as long as it makes private machine learning possible. If people own their data, they should also own the by-product of such data.

Decentralized machine learning makes this scenario possible.

Oct 21, 2018
Episode 47: Are you ready for AI winter? [Rebroadcast]
56:55

Today I am having a conversation with Filip Piękniewski, a researcher working on computer vision and AI at Koh Young Research America.
His adventure with AI started in the 90s, and since then a long list of experiences at the intersection of computer science and physics has led him to the conclusion that deep learning might not be sufficient nor appropriate to solve the problem of intelligence, specifically artificial intelligence.
I read some of his publications and got familiar with some of his ideas. Honestly, I have been attracted by the fact that Filip does not buy the hype around AI and deep learning in particular.
He doesn't seem to share the vision of folks like Elon Musk, who claimed that we are going to see an exponential improvement in self-driving cars, among other things (he actually said that before a Tesla drove over a pedestrian).

Sep 11, 2018
Episode 46: why do machine learning models fail? (Part 2)
17:12

In this episode I continue the conversation from the previous one, about failing machine learning models.

When data scientists have access to the distributions of training and testing datasets it becomes relatively easy to assess if a model will perform equally on both datasets. What happens with private datasets, where no access to the data can be granted?

At fitchain we might have an answer to this fundamental problem.

 

Sep 04, 2018
Episode 45: why do machine learning models fail?
16:21

The success of a machine learning model depends on several factors and events. True generalization to data that the model has never seen before is more a chimera than a reality. But under specific conditions a well trained machine learning model can generalize well and perform on the testing set with accuracy similar to the one achieved during training.

In this episode I explain when and why machine learning models fail from training to testing datasets.

Aug 28, 2018
Episode 44: The predictive power of metadata
21:08

In this episode I don't talk about data. In fact, I talk about metadata.

While many machine learning models rely on certain amounts of data, e.g. text, images, audio and video, it has been shown how powerful the signal carried by metadata is, that is, all the data that is invisible to the end user.
Behind a tweet of 140 characters there are more than 140 fields of data that draw a much more detailed profile of the sender and the content she is producing... without ever considering the tweet itself.

 

References
You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information https://www.ucl.ac.uk/~ucfamus/papers/icwsm18.pdf

Aug 21, 2018
Episode 43: Applied Text Analysis with Python (interview with Rebecca Bilbro)
36:32

Today’s episode is about text analysis with python.
Python is the de facto standard in machine learning: a large community and a generous choice of libraries, sometimes at the price of performance. But overall a decent language for typical data science tasks.

I am with Rebecca Bilbro, co-author of Applied Text Analysis with Python, with Benjamin Bengfort and Tony Ojeda.

We speak about the evolution of applied text analysis, tools and pipelines, chatbots.

 

Aug 14, 2018
Episode 42: Attacking deep learning models (rebroadcast)
29:04

Attacking deep learning models

Compromising AI for fun and profit

 

Deep learning models have shown very promising results in computer vision and sound recognition. As more and more deep learning based systems get integrated in disparate domains, they will keep affecting the life of people. Autonomous vehicles, medical imaging and banking applications, surveillance cameras and drones, digital assistants, are only a few real applications where deep learning plays a fundamental role. A malfunction in any of these applications will affect the quality of such integrated systems and compromise the security of the individuals who directly or indirectly use them.

In this episode, we explain how machine learning models can be attacked and what we can do to protect intelligent systems from being compromised.

Aug 07, 2018
Episode 41: How can deep neural networks reason
18:03

Today’s episode will be about deep learning and reasoning. There has been a lot of discussion about the effectiveness of deep learning models and their capability to generalize, not only across domains but also on data that such models have never seen.

But there is a research group from the Department of Computer Science, Duke University, that seems to be onto something with deep learning and interpretability in computer vision.

 

References

Prediction Analysis Lab Duke University https://users.cs.duke.edu/~cynthia/lab.html

This looks like that: deep learning for interpretable image recognition https://arxiv.org/abs/1806.10574

Jul 31, 2018
Episode 40: Deep learning and image compression
17:20

Today’s episode will be about deep learning and compression of data, in particular compressing images. We all know how important compressing data is: reducing the size of digital objects without affecting their quality.
As a very general rule, the more one compresses an image, the lower the quality, due to a number of factors like bitrate, quantization error, etcetera. I am glad to be here with Tong Chen, a researcher at the School of Electronic Science and Engineering of Nanjing University, China.

Tong developed a deep learning based compression algorithm for images that seems to improve over state-of-the-art approaches like BPG, JPEG2000 and JPEG.

 

Reference

Deep Image Compression via End-to-End Learning - Haojie Liu, Tong Chen, Qiu Shen, Tao Yue, and Zhan Ma School of Electronic Science and Engineering, Nanjing University, Jiangsu, China

 

Jul 24, 2018
Episode 39: What is L1-norm and L2-norm?
21:55

In this episode I explain the differences between L1 and L2 regularization, which you can find in function minimization in basically any machine learning model.
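
For reference, these are the standard penalized objectives, with L(w) the original loss and \lambda the regularization strength:

    J_{L1}(w) = L(w) + \lambda \sum_i |w_i|
    J_{L2}(w) = L(w) + \lambda \sum_i w_i^2

The L1 gradient has constant magnitude, so it can push individual weights exactly to zero (sparse solutions); the L2 gradient shrinks each weight proportionally to its size, yielding small but dense weights.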

 

Jul 19, 2018
Episode 38: Collective intelligence (Part 2)
46:36

In the second part of this episode I am interviewing Johannes Castner from CollectiWise, a platform for collective intelligence.
I am moving the conversation towards the more practical aspects of the project, asking about the centralised AGI and blockchain components that are an essential part of the platform.

 

References

  1. Opencog.org
  2. Thaler, Richard H., Sunstein, Cass R. and Balz, John P. (April 2, 2010). "Choice Architecture". doi:10.2139/ssrn.1583509. SSRN 1583509.
  3. Teschner, F., Rothschild, D. & Gimpel, H. Group Decis Negot (2017) 26: 953. https://doi.org/10.1007/s10726-017-9531-0
  4. Firas Khatib, Frank DiMaio, Foldit Contenders Group, Foldit Void Crushers Group, Seth Cooper, Maciej Kazmierczyk, Miroslaw Gilski, Szymon Krzywda, Helena Zabranska, Iva Pichova, James Thompson, Zoran Popović, Mariusz Jaskolski & David Baker. "Crystal structure of a monomeric retroviral protease solved by protein folding game players". Nature Structural & Molecular Biology, volume 18, pages 1175–1177 (2011).
  5. Rosenthal, Franz; Dawood, Nessim Yosef David (1969). The Muqaddimah: an introduction to history; in three volumes. 1. Princeton University Press. ISBN 0-691-01754-9.
  6. Kevin J. Boudreau and Karim R. Lakhani. "Using the Crowd as an Innovation Partner". April 2013.
  7. Sam Bowles. The Moral Economy: Why Good Incentives are No Substitute for Good Citizens.
  8. Amartya K. Sen. "Rational Fools: A Critique of the Behavioral Foundations of Economic Theory". Philosophy & Public Affairs, Vol. 6, No. 4 (Summer 1977), pp. 317-344. Wiley. http://www.jstor.org/stable/2264946

Jul 17, 2018
Episode 38: Collective intelligence (Part 1)
30:58

This is the first part of the amazing episode with Johannes Castner, CEO and founder of CollectiWise. Johannes is finishing his PhD in Sustainable Development at Columbia University in New York City, and he is building a platform for collective intelligence. Today we talk about artificial general intelligence and wisdom.

All references and shownotes will be published after the next episode.
Enjoy and stay tuned!

Jul 12, 2018
Episode 37: Predicting the weather with deep learning
26:25

Predicting the weather is one of the most challenging tasks in machine learning, due to the fact that physical phenomena are dynamic and rich in events. Moreover, most traditional approaches to climate forecasting are computationally prohibitive.
It seems that joint research between Earth System Science at the University of California, Irvine and the Faculty of Physics at LMU Munich has brought an interesting improvement in the scalability and accuracy of climate predictive modeling. The solution is... superparameterization and deep learning.

 

References

Could Machine Learning Break the Convection Parameterization Deadlock?

P. Gentine, M. Pritchard, S. Rasp, G. Reinaudi, and G. Yacalis. Earth and Environmental Engineering, Columbia University, New York, NY, USA; Earth System Science, University of California, Irvine, CA, USA; Faculty of Physics, LMU Munich, Munich, Germany.

Jul 09, 2018
Episode 36: The dangers of machine learning and medicine
22:07

Humans seem to have reached a crossroads, where they are asked to choose between functionality and privacy. But not both. Not both at all. No data, no service. That’s what companies building personal finance services say. The same applies to marketing companies, social media companies, search engine companies, and healthcare institutions.

In this episode I speak about the reasons to aggregate data for precision medicine, the consequences of such strategies, and how researchers and organizations can provide services to individuals while respecting their privacy.

 

Jul 03, 2018
Episode 35: Attacking deep learning models
29:13

Attacking deep learning models

Compromising AI for fun and profit

 

Deep learning models have shown very promising results in computer vision and sound recognition. As more and more deep learning based systems get integrated in disparate domains, they will keep affecting the life of people. Autonomous vehicles, medical imaging and banking applications, surveillance cameras and drones, digital assistants, are only a few real applications where deep learning plays a fundamental role. A malfunction in any of these applications will affect the quality of such integrated systems and compromise the security of the individuals who directly or indirectly use them.

In this episode, we explain how machine learning models can be attacked and what we can do to protect intelligent systems from being compromised.

Jun 29, 2018
Episode 34: Get ready for AI winter
59:04

Today I am having a conversation with Filip Piękniewski, a researcher working on computer vision and AI at Koh Young Research America.
His adventure with AI started in the 90s, and since then a long list of experiences at the intersection of computer science and physics has led him to the conclusion that deep learning might not be sufficient nor appropriate to solve the problem of intelligence, specifically artificial intelligence.
I read some of his publications and got familiar with some of his ideas. Honestly, I have been attracted by the fact that Filip does not buy the hype around AI and deep learning in particular.
He doesn't seem to share the vision of folks like Elon Musk, who claimed that we are going to see an exponential improvement in self-driving cars, among other things (he actually said that before a Tesla drove over a pedestrian).

Jun 22, 2018
Episode 33: Decentralized Machine Learning and the proof-of-train
17:40

In the attempt to democratize machine learning, data scientists should have the possibility to train their models on data they do not necessarily own, nor even see. A model that is privately trained should be verified and uniquely identified across its entire life cycle, from its random initialization to the setting of the optimal values of its parameters.
How does blockchain allow all this? Fitchain is the decentralized machine learning platform that provides models with an identity and a certification of their training procedure, the proof-of-train.

Jun 11, 2018
Episode 32: I am back. I have been building fitchain
23:14

I know, I have been away too long without publishing much in the last 3 months.
But there's a reason for that: I have been building a platform that combines machine learning with blockchain technology.
Let me introduce you to fitchain and tell you more in this episode.

If you want to collaborate on the project or just think it's interesting, drop me a line on the contact page at fitchain.io

Jun 04, 2018
Founder Interview – Francesco Gadaleta of Fitchain
31:04

Cross-posting from Cryptoradio.io

Overview

Francesco Gadaleta introduces Fitchain, a decentralized machine learning platform that combines blockchain technology and AI to solve the data manipulation problem in restrictive environments such as healthcare or financial institutions.

Francesco Gadaleta is the founder of Fitchain.io and senior advisor to Abe AI. Fitchain is a platform that officially started in October 2017, which allows data scientists to write machine learning models on data they cannot see and access due to restrictions imposed in healthcare or financial environments. In the Fitchain platform there are two actors, the data owner and the data scientist. They both run the Fitchain POD, which orchestrates the relationship between these two sides. The idea behind Fitchain is summarized in the thesis “do not move the data, move the model – bring the model where the data is stored.”

The Fitchain team has also coined a new term called “proof of train” – a way to guarantee that the model is truly trained at the organization, and that it becomes traceable on the blockchain. To develop the complex technological aspects of the platform, Fitchain has partnered up with BigChainDB, the project we have recently featured on Crypto Radio.

Roadmap

The Fitchain team is currently validating the assumptions and increasing the security of the platform. In the next few months they will extend the portfolio of machine learning libraries and are planning to move from a B2B product towards a Fitchain for consumers.

By June 2018 they plan to start the Internet of PODs. They will also design the Fitchain token – FitCoin, which will be a utility token to enable operating on the Fitchain platform.

 

May 24, 2018
Episode 31: The End of Privacy
39:03

Data is a complex topic, related not only to machine learning algorithms but also, and especially, to the privacy and security of individuals - the same individuals who create such data just by using the many mobile apps and services that characterize their digital life.

In this episode I am together with B.J. Mendelson, author of “Social Media is Bullshit” from St. Martin’s Press and a world-renowned speaker on issues involving the myths and realities of today’s Internet platforms. B.J. has a new book about privacy and sent me a free copy of "Privacy, and how to get it back" that I read in just one day. That was enough to realise how much we have in common when it comes to data and data collection.

 

Apr 02, 2018
Episode 30: Neural networks and genetic evolution: an unfeasible approach
22:19

Despite what researchers claim about genetic evolution, in this episode we give a realistic view of the field.

Nov 21, 2017
Episode 29: Fail your AI company in 9 steps
14:27

In order to succeed with artificial intelligence, it is better to know how to fail first. It is easier than you think.
Here are 9 easy steps to fail your AI startup.

Nov 11, 2017
Episode 28: Towards Artificial General Intelligence: preliminary talk
20:34

The enthusiasm for artificial intelligence is raising some concerns, especially with respect to some ventured conclusions about what AI can really do and what its direct descendant, artificial general intelligence, would be capable of doing in the immediate future. From stealing jobs to exterminating the entire human race, the creativity (of some) seems to have no limits.
In this episode I make sure that everyone comes back to reality - which might sound less exciting than Hollywood, but definitely more... real.

Nov 04, 2017
Episode 27: Techstars accelerator and the culture of fireflies
17:42

In the aftermath of my experience at the Barclays Accelerator, powered by Techstars - one of the most innovative and influential startup accelerators in the world - I’d like to give back to the community the lessons learned, including the need for confidence, soft skills, and efficiency, as applied to startups that deal with artificial intelligence and data science.
In this episode I also share some thoughts about the culture of fireflies in modern and dynamic organisations.

Oct 30, 2017
Episode 26: Deep Learning and Alzheimer
54:02

In this episode I speak about deep learning technology applied to Alzheimer's disease prediction. I had a great chat with Saman Sarraf, machine learning engineer at Konica Minolta, former lab manager at the Rotman Research Institute at Baycrest, University of Toronto, and author of DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI.

I hope you enjoy the show.

Oct 23, 2017
Episode 25: How to become data scientist [RB]
16:16

In this episode, I speak about the requirements and the skills needed to become a data scientist and join an amazing community that is changing the world with data analytics.

Oct 16, 2017
Episode 24: How to handle imbalanced datasets
21:21

In machine learning, and data science in general, it is very common to deal at some point with imbalanced datasets and class distributions. This is the typical case where the number of observations that belong to one class is significantly lower than the number belonging to the other classes. Actually this happens all the time, in several domains, from finance to healthcare to social media, just to name a few I have personally worked with.
Think about a bank detecting fraudulent transactions among millions or billions of daily operations, or equivalently the identification of rare disorders in healthcare.
In genetics, but also with clinical lab tests, this is a normal scenario: fortunately there are very few patients affected by a disorder, and therefore very few cases with respect to the large pool of healthy (or unaffected) patients.
No algorithm can take the class distribution or the number of observations in each class into account if it is not explicitly designed to handle such situations.

In this episode I speak about some effective techniques to handle such common and challenging scenarios, advising the most appropriate method for each dataset or problem.
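
One of the simplest techniques, as a sketch (a generic example of mine, not necessarily the method recommended for your problem): reweight the loss so that errors on the rare class count more.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic dataset with a 99:1 class ratio, mimicking e.g. fraud detection.
    X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight='balanced' penalizes errors inversely to class frequency,
    # so the minority class is not ignored during training.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))

Resampling the data - undersampling the majority class or oversampling the minority - is the other common family of techniques.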

Oct 09, 2017
Episode 23: Why do ensemble methods work?
18:59

Ensemble methods have been designed to improve the performance of a single model, when the single model is not very accurate. According to the general definition, ensembling consists of building a number of individual classifiers and then combining or aggregating their predictions into one classifier that is usually stronger than any single one.

The key idea behind ensembling is that some models will do well at modelling certain aspects of the data, while others will do well at modelling other aspects.
In this episode I show with a numeric example why and when ensemble methods work.
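
In the same spirit, here is a back-of-the-envelope version of the numeric argument, under the (idealised) assumption that the classifiers are independent:

    from math import comb

    def majority_vote_accuracy(p, n):
        """Probability that a majority of n independent classifiers,
        each correct with probability p, are correct on a given example."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    # Three independent classifiers at 70% accuracy already beat any single
    # one, and the effect grows with the size of the ensemble:
    print(majority_vote_accuracy(0.7, 3))    # 0.784
    print(majority_vote_accuracy(0.7, 11))   # ~0.92

Real ensemble members are correlated, which is why the gain in practice is smaller, but often still substantial.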

Oct 03, 2017
Episode 22: Parallelising and distributing Deep Learning
19:42

Continuing the discussion of the last two episodes, there is one more aspect of deep learning that I would love to consider, and have therefore left for a full episode: parallelising and distributing deep learning on relatively large clusters.

As a matter of fact, computing architectures are changing in a way that encourages parallelism more than ever before. Deep learning is no exception, and despite the great improvements achieved with commodity GPUs (graphical processing units), when it comes to speed there is still room for improvement.

Together with the last two episodes, this one completes the picture of deep learning at scale. Indeed, as I mentioned in the previous episode, How to master optimisation in deep learning, the function optimizer is the horsepower of deep learning and neural networks in general. A slow and inaccurate optimisation method leads to networks that slowly converge to unreliable results.

In another episode, titled “Additional strategies for optimizing deep learning”, I explained some ways to improve function minimisation and model tuning in order to get better parameters in less time. So feel free to listen to those episodes again, share them with your friends, or even re-broadcast or download them for your commute.

While the methods that I have explained so far represent a good starting point for prototyping a network, when you need to switch to production environments or take advantage of the most recent and advanced hardware capabilities of your GPU... in all those cases, you would like to do something more.
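
As a mental model of synchronous data parallelism, one of the simplest distribution schemes, here is a toy sketch (my own, with linear regression standing in for the network): each worker computes the gradient on its shard of the batch, and the gradients are averaged before the update, as an all-reduce would do.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(512, 10))
    true_w = rng.normal(size=10)
    y = X @ true_w + rng.normal(0, 0.01, size=512)

    w = np.zeros(10)
    lr, n_workers = 0.1, 4

    for _ in range(200):
        shards_X = np.array_split(X, n_workers)
        shards_y = np.array_split(y, n_workers)
        # Each worker computes the mean-squared-error gradient on its shard
        # (sequential here; on a cluster each runs on its own GPU/machine).
        grads = [2 * sx.T @ (sx @ w - sy) / len(sy)
                 for sx, sy in zip(shards_X, shards_y)]
        w -= lr * np.mean(grads, axis=0)   # all-reduce: average, then update

    print(np.allclose(w, true_w, atol=1e-2))   # True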

Sep 25, 2017
Episode 21: Additional optimisation strategies for deep learning
15:08

In the last episode, How to master optimisation in deep learning, I explained some of the most challenging tasks of deep learning and some methodologies and algorithms to improve the speed of convergence of a minimisation method.
I explored the family of gradient descent methods - even though not exhaustively - giving a list of approaches that deep learning researchers are considering for different scenarios. Every method has its own benefits and drawbacks, depending pretty much on the type of data and its sparsity. But there is one method that seems to be, at least empirically, the best approach so far.

Feel free to listen to the previous episode, share it, re-broadcast or just download for your commute.

In this episode I continue that conversation about additional strategies for optimising gradient descent in deep learning, and introduce you to some tricks that might come in useful when your neural network stops learning from data, or when the learning process becomes so slow that it really seems to have reached a plateau even when you feed in fresh data.
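
To make one such trick concrete, here is a hypothetical step schedule that halves the learning rate at fixed intervals, a common response to a plateau (a generic illustration of mine, not the episode's full list):

    def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every `epochs_per_drop` epochs.
        return initial_lr * drop ** (epoch // epochs_per_drop)

    for epoch in (0, 9, 10, 25, 40):
        print(epoch, step_decay(0.1, epoch))
    # 0 -> 0.1, 9 -> 0.1, 10 -> 0.05, 25 -> 0.025, 40 -> 0.00625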

Sep 18, 2017
Episode 20: How to master optimisation in deep learning
19:29

The secret behind deep learning is not really a secret: it is function optimisation. What a neural network essentially does is optimise a function. In this episode I illustrate a number of optimisation methods and explain which one is the best and why.
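
As a toy example of what "optimising a function" means here, consider gradient descent with momentum on a stretched bowl, f(x, y) = x^2 + 10y^2, where plain gradient descent zig-zags (my illustration, not tied to the method ranked best in the episode):

    import numpy as np

    def grad(p):
        # Gradient of f(x, y) = x**2 + 10 * y**2.
        return np.array([2 * p[0], 20 * p[1]])

    p, v = np.array([5.0, 2.0]), np.zeros(2)
    lr, momentum = 0.05, 0.9

    for _ in range(300):
        v = momentum * v - lr * grad(p)   # accumulate a velocity
        p = p + v
    print(p)   # close to the minimum at (0, 0)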

Aug 28, 2017
Episode 19: How to completely change your data analytics strategy with deep learning
15:56

Over the past few years, neural networks have re-emerged as powerful machine-learning models, reaching state-of-the-art results in several fields like image recognition and speech processing. More recently, neural network models have started to be applied also to textual data in order to deal with natural language, there too with promising results. In this episode I explain why deep learning performs the way it does, and what some of the most tedious causes of failure are.

Aug 09, 2017
Episode 18: Machines that learn like humans
42:06

Artificial intelligence allows machines to learn patterns from data. The way humans learn, however, is different and more efficient. With Lifelong Machine Learning, machines can learn the way human beings do: faster and more efficiently.

Mar 28, 2017
Episode 17: Protecting privacy and confidentiality in data and communications
17:31

Talking about security of communication and privacy is never enough, especially when political instabilities are driving leaders towards decisions that will affect people on a global scale.

Feb 15, 2017
Episode 16: 2017 Predictions in Data Science
20:31

We strongly believe 2017 will be a very interesting year for data science and artificial intelligence. Let me tell you what I expect and why.

Dec 23, 2016
Episode 15: Statistical analysis of phenomena that smell like chaos
10:14

Is the market really predictable? How do stock prices increase? What are their dynamics? Here is what I think about the magic and the reality of predictions applied to markets and the stock exchange.

Dec 05, 2016
Episode 14: The minimum required by a data scientist
16:46

Why the job of the data scientist could soon disappear, and what a data scientist needs in order to survive the inflation of the role.

Sep 27, 2016
Episode 13: Data Science and Fraud Detection at iZettle
16:32

Data science is making the difference in fraud detection too. In this episode I have a conversation with an expert in the field, engineer Eyad Sibai, who works on fraud detection at iZettle.

Sep 06, 2016
Episode 12: EU Regulations and the rise of Data Hijackers
16:17

Extracting knowledge from large datasets with a large number of variables is always tricky. Dimensionality reduction helps in analyzing high dimensional data while still maintaining most of the information hidden behind complexity. Here are some methods that you must try before further analysis (Part 1).

Jul 26, 2016
Episode 11: Representative Subsets For Big Data Learning
21:25

How would you perform accurate classification on a very large dataset by just looking at a sample of it?
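
Stratified subsampling is the simplest instance of the idea (my sketch, not necessarily the episode's method): keep a small fraction of the data while preserving class proportions, and check how little accuracy you lose.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0)

    # Keep only 1% of the training data, preserving class proportions.
    X_sub, _, y_sub, _ = train_test_split(X_tr, y_tr, train_size=0.01,
                                          stratify=y_tr, random_state=0)

    full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    sub = LogisticRegression(max_iter=1000).fit(X_sub, y_sub).score(X_te, y_te)
    print(full, sub)   # often surprisingly close for a simple model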

May 03, 2016
Episode 10: History and applications of Deep Learning
22:39

What is deep learning? If you have no patience: deep learning is the result of training many layers of non-linear processing units for feature extraction and data transformation, e.g. from pixels, to edges, to shapes, to object classification, to scene description, to captioning, etc.

Mar 14, 2016
Episode 9: Markov Chain Monte Carlo with full conditionals
17:57

At some point, statistical problems need sampling. Sampling consists of generating observations from a specific distribution.
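
The classic example of sampling with full conditionals is the Gibbs sampler. A minimal sketch for a standard bivariate normal with correlation rho, where each full conditional is itself a univariate normal (my example, for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8
    x, y = 0.0, 0.0
    samples = []

    for _ in range(5000):
        # Full conditionals of a standard bivariate normal:
        # x | y ~ N(rho * y, 1 - rho**2), and symmetrically for y | x.
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))
        samples.append((x, y))

    chain = np.array(samples[500:])      # discard burn-in
    print(np.corrcoef(chain.T)[0, 1])    # close to 0.8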

Mar 02, 2016
Episode 7: 30 min with data scientist Sebastian Raschka
34:23

In this show I interview Sebastian Raschka, data scientist and author of Python Machine Learning. In addition to the fun we had offline, there are great elements about machine learning, data science, and current and future trends to keep an ear out for. Moreover, it is the conversation of two data scientists who contribute and operate in the field on a daily basis.

Feb 15, 2016
Episode 8: Frequentists and Bayesians
06:52

There are statisticians and data scientists... Among statisticians, there are some who just count. Some others who… think differently. In this show we explore the old-time dilemma between frequentists and Bayesians. Given a statistical problem, who’s going to be right?
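
The coin flip makes the dilemma concrete. With 7 heads in 10 flips, the frequentist and the Bayesian (here with a uniform prior, my choice for the example) already disagree:

    heads, flips = 7, 10

    # Frequentist point estimate: the maximum likelihood estimate.
    mle = heads / flips                       # 0.7

    # Bayesian estimate with a uniform Beta(1, 1) prior: the posterior is
    # Beta(1 + heads, 1 + tails), whose mean pulls the MLE toward 0.5.
    a, b = 1 + heads, 1 + (flips - heads)
    posterior_mean = a / (a + b)              # 8/12 ≈ 0.667

    print(mle, posterior_mean)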

Feb 15, 2016
Episode 6: How to be data scientist
15:11

In this episode, we tell you how to become a data scientist and join an amazing community that is changing the world with data analytics.

Jan 19, 2016
Episode 5: Development and Testing Practices in Data Science
07:38

Should data scientists follow the good old practices of software engineering? Data scientists make software, after all.

Jan 13, 2016
Episode 1: Predictions in Data Science for 2016
19:30

It’s time to experiment with Data Science at Home. Since we are still dealing with our hosting service, consider the first episode purely experimental, even though the content might be of interest to you, no matter what.

Dec 21, 2015
Episode 2: Networks and Graph Databases
22:49


Jul 23, 2015
Episode 4: BigData on your desk
16:49

Have you ever thought of getting a Big Data infrastructure on your desk? That’s right! On your desk.

Jul 23, 2015
Episode 3: Data Science and Bio-Inspired Algorithms
14:49

In this episode I meet Dr Eliseo Ferrante, formerly at the University of Leuven and currently a researcher at the Université de Technologie de Compiègne, who studies self-organization and evolution.

Jul 08, 2015