diff --git a/transcripts/497-outlier-detection-with-python.txt b/transcripts/497-outlier-detection-with-python.txt index 3811469f..64da5db4 100644 --- a/transcripts/497-outlier-detection-with-python.txt +++ b/transcripts/497-outlier-detection-with-python.txt @@ -20,7 +20,7 @@ 00:01:09 And keep up with the show and listen to over nine years of episodes at talkpython.fm. If you want to be part of our live episodes, you can find the live streams over on YouTube. -00:01:19 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows. This by Posit Connect from the makers of Shiny. +00:01:19 Subscribe to our YouTube channel over at talkpython.fm/youtube and get notified about upcoming shows. This episode is sponsored by Posit Connect from the makers of Shiny. 00:01:30 Publish, share and deploy all of your data projects that you're creating using Python. @@ -104,13 +104,13 @@ 00:04:19 of what is a little bit not as mainstream as it is today. I mean, there's certainly a lot of other people doing it, but it wasn't the gigantic industry that it is now. -00:04:29 >> Yeah, there was definitely some inflection points or places where it's really taken some steps, significant steps up in the Python data science and ML space. Probably one is the Pandas NumPy era around 2006, something around 2012. I don't know really what caused And certainly in the machine learning, AI is a whole nother spike of growth there. +00:04:29 >> Yeah, there was definitely some inflection points or places where it's really taken some steps, significant steps up in the Python data science and ML space. Probably one is the Pandas NumPy era around 2006, something around 2012. I don't know really what caused And certainly in the machine learning, AI is a whole other spike of growth there. 00:04:50 So you were doing, it was kind of before mainstream data science, right?
-00:04:54 It is starting to become mainstream, but like when working with text, for example, there's certain, there were spacey and word2vec and glove and things like that. +00:04:54 It is starting to become mainstream, but like when working with text, for example, there's certain, there were spaCy and Word2Vec and GloVe and things like that. -00:05:05 But this is long before GPT, but psychic learn existed and yeah. +00:05:05 But this is long before GPT, but scikit-learn existed and yeah. 00:05:11 So we're actually using a lot of the same tools that you would, you would use now, but they're a little bit earlier versions of them. @@ -194,7 +194,7 @@ 00:07:08 But it's really a lot more subtle than that. -00:07:10 And I want to maybe go back to a guest I had, was that six months ago or something like that, Stephanie Molin, and she has this really cool project called Datamorph, which I'll link to on GitHub, and we talked about it then. +00:07:10 And I want to maybe go back to a guest I had, was that six months ago or something like that, Stephanie Molin, and she has this really cool project called Data-morph, which I'll link to on GitHub, and we talked about it then. 00:07:21 But if you pull up her GitHub repo, it's an animated GIF, I believe, of a whole bunch of different, very clear concrete shapes, like a star, literally a panda, not pandas, but the animal, and continuous animation, a bunch of data points that go from one to the other. @@ -312,8 +312,6 @@ 00:13:13 That's early 90s when the internet stopped having a sound, mid 90s, it stopped having a sound. -00:13:18 >> - 00:13:18 That's about when I first started getting a little, just as it's being opened up. 00:13:24 Remember the early days, it's basically university students that were on there. @@ -476,7 +474,7 @@ 00:20:50 Are there techniques that people use for, like-- -00:20:52 use, for example, Dask or QPi, like the CUDA stuff for GPUs? +00:20:52 use, for example, Dask or CuPy, like the CUDA stuff for GPUs?
00:20:59 Are there sort of like really high-end compute stuff that people are using for these problems? @@ -526,9 +524,9 @@ 00:24:10 There are exceptions to that where you just want to check – you just kind of want to spot check your data and just want to make sure it's largely free of outliers or that there's not – I mean, depending on how you define them. -00:24:19 example, a sensor readings, like you always expect a certain number of point anomalies, like a certain point in time where something spikes up and it's not it is an anomaly, but it's not a big deal. So in some some cases you might just want to make sure that your number of point anomalies is normal. So you don't have an unusual number of unusual events in that case or cases like that or if you're just doing data quality checks, for example, you might just want to make sure your data is large. +00:24:19 example, a sensor readings, like you always expect a certain number of point anomalies, like a certain point in time where something spikes up and it's not, it is an anomaly, but it's not a big deal. So in some cases you might just want to make sure that your number of point anomalies is normal. So you don't have an unusual number of unusual events in that case or cases like that or if you're just doing data quality checks, for example, you might just want to make sure your data is large. -00:24:46 How do you know that you don't get outliers in your If you just feed it like a stream of data, you're like, well, this is what's normal. +00:24:46 How do you know that you don't get outliers in your training? If you just feed it like a stream of data, you're like, well, this is what's normal. 00:24:53 Actually, there were outliers, but you weren't ready to detect them yet, and now it thinks they're normal. @@ -586,7 +584,7 @@ 00:26:53 So it's actually doing, you can do that in the same data.
-00:26:56 Now having said that, what you do do in outlier detection is you sometimes do that a little bit iteratively where you say, start with the original data and find the most anomalous records in there. +00:26:56 Now having said that, what you do in outlier detection is you sometimes do that a little bit iteratively where you say, start with the original data and find the most anomalous records in there. 00:27:06 And then you might inspect them manually and say, "Okay, these are things, yeah, that we wouldn't normally expect to happen. @@ -646,7 +644,7 @@ 00:30:33 Like if you're flagging something as maybe being potentially fraudulent, or if you're doing security and you're flagging some web activity as being problematic, or a person in some sort of security context is possibly being some sort of security risk. -00:30:49 You gotta know why in order to investigate effectively and quickly. +00:30:49 You got to know why in order to investigate effectively and quickly. 00:30:55 So we spent a lot of time looking at how to make Outlier Detection 1 interpretable and also just kind of justify the results that we're finding if we assess the tables of data and come back with a summary of them saying, these are the most unusual records in this data. @@ -710,7 +708,7 @@ 00:33:24 Let's talk about some of the tools that you highlighted in your book. -00:33:28 Like the PI OD, Python outlier detection. +00:33:28 Like the PyOD, Python outlier detection. 00:33:31 Yeah. @@ -740,7 +738,7 @@ 00:34:03 we'll see kernel density estimation, which makes it fairly easy to find outliers to just points in space that have low density as well clustering method called Gaussian mixture models and GM, which makes it fairly easy to do outlier detection using that as well. -00:34:19 But PIIOD, yeah, so PIIOD has those, has basically everything in scikit-learn, wraps it in its own interface. 
+00:34:19 But PyOD, yeah, so PyOD has those, has basically everything in scikit-learn, wraps it in its own interface. 00:34:26 And I think a couple dozen more, and it also has a number of deep learning based ones. @@ -748,11 +746,11 @@ 00:34:37 So yeah, in terms of deep learning, it has well autoencoders, variational autoencoders, GANs. -00:34:42 There are some limitations with PYAD, and I do get into that in the book. +00:34:42 There are some limitations with PyOD, and I do get into that in the book. 00:34:46 It assumes you're dealing with strictly numeric data, for example, which if you're dealing with tabular data, realistically, you have mixed data, categorical columns, date columns. -00:34:57 So PYAD does not handle that. +00:34:57 So PyOD does not handle that. 00:34:59 And it's strictly for tabular data. @@ -804,19 +802,19 @@ 00:36:56 maybe I'm headed up on the screen for a little bit, but give people a sense of kind of problems you can ask with this one or problems you solve questions you can ask. -00:37:03 It's a little bit like PIOD, it's less so, but it doesn't have like 30 or so detectors, it has four, but includes two of them. +00:37:03 It's a little bit like PyOD, it's less so, but it doesn't have like 30 or so detectors, it has four, but includes two of them. 00:37:12 So again, it's just for tabular data. 00:37:14 And again, it's just numeric tabular data, but it has two of the go-to algorithms. -00:37:19 They're called one's called isolation forest. +00:37:19 They're called one's called Isolation Forest. -00:37:21 And the other is called local outlier factor. +00:37:21 And the other is called Local Outlier Factor. 00:37:24 I think most of the time with outlier detection, if you're doing tabular data, if you're only gonna use two algorithms, those are the two. 
-00:37:31 Outlier detection is a little bit different than say prediction, because if you're building a predictive model, say on tabular data, most of the time you would use, say an XGBoost model or a CatBoost model or a random forest or something like that. +00:37:31 Outlier detection is a little bit different than say prediction, because if you're building a predictive model, say on tabular data, most of the time you would use, say an XGBoost model or a CatBoost model or a Random Forest or something like that. 00:37:45 You would tend to use one model. @@ -834,7 +832,7 @@ 00:38:33 And you can do it in just really simple ways. -00:38:35 If you say, Scikit-Learn has four outlier detectors, it has isolation forest, it has local outlier factor, has one called one class SVM and has one called elliptic envelope. +00:38:35 If you say, scikit-learn has four outlier detectors, it has Isolation Forest, it has Local Outlier Factor, has one called One-Class SVM and has one called Elliptic Envelope. 00:38:47 So if I run the four of those on my data, and let's say they're all four of them are appropriate and work well. Any records that are scored highly by all four of those detectors, you can say, well, these records are really @@ -854,23 +852,23 @@ 00:39:34 Just using that is probably gonna be, yeah, sufficient for a lot of cases. -00:39:38 If you want to do a little bit more thorough job using more detectors again, and just using detectors that can be a little bit more, PIOD has ones that are a little bit more performative, has some that are a little more interpretable. +00:39:38 If you want to do a little bit more thorough job using more detectors again, and just using detectors that can be a little bit more, PyOD has ones that are a little bit more performant, has some that are a little more interpretable. -00:39:53 It could be a good decision just to use PIOD and use a bunch of the detectors there.
+00:39:53 It could be a good decision just to use PyOD and use a bunch of the detectors there. -00:39:58 One of the nice things about PIOD is it has the same API signature for all of its detectors. +00:39:58 One of the nice things about PyOD is it has the same API signature for all of its detectors. 00:40:02 So if you write code, and it's usually only like four or five lines, it's really kind of boilerplate easy, other than the hyperparameters for them. 00:40:11 It's really quite easy to just kind of swap one out and swap in another and just try a whole bunch of them. -00:40:17 One thing I do get into the book is that there's a lot of algorithms beyond even what's in PIOD that can be very useful to use. +00:40:17 One thing I do get into the book is that there's a lot of algorithms beyond even what's in PyOD that can be very useful to use. -00:40:25 The ones in PIOD, they're good and they're sufficient quite often, but they're limited into how interpretable they are. +00:40:25 The ones in PyOD, they're good and they're sufficient quite often, but they're limited into how interpretable they are. 00:40:32 So again, if you're in a situation where you're saying, well, this looks like a security threat or this looks like failing equipment this looks like a novel specimen. Say from astronomy, this looks like a novel transit. -00:40:45 This is a transit that's not normal. You want to know why. So a lot of the detectors from PIOD don't provide that necessarily to the degree you would wish. So I do suggest a bunch of others that could be useful for that purpose and that support categorical data +00:40:45 This is a transit that's not normal. You want to know why. So a lot of the detectors from PyOD don't provide that necessarily to the degree you would wish. So I do suggest a bunch of others that could be useful for that purpose and that support categorical data 00:41:01 as well. 
@@ -892,9 +890,9 @@ 00:41:33 - Yeah, right, 'cause it's basically, you take three or four or a thousand decision trees them all what their answers are and then you kind of like decide what the, it's like a voting process or something almost, right? -00:41:43 And that would be exactly how you would do it in a predictive model. So with an outlier detection model, well there's probably the closest thing to what you just described with outlier detection is there's an algorithm called isolation forest, which is a forest of isolation trees. So it's kind of similar ideas. It's probably close to, a little bit close to random forests, probably a little closer to extra trees for anyone who's familiar with predictive models. That's isolation forest. So it's kind of the outlier detection equivalent of extra trees model. And like that is that it's composed of a whole bunch of trees and depending on the implementation of it, not psychic learns, but depending on the implementation of it, you can run all those trees in parallel, which can give you your answers quite fast if you have sufficient hardware. +00:41:43 And that would be exactly how you would do it in a predictive model. So with an outlier detection model, well there's probably the closest thing to what you just described with outlier detection is there's an algorithm called Isolation Forest, which is a forest of isolation trees. So it's kind of similar ideas. It's probably close to, a little bit close to Random Forests, probably a little closer to Extra Trees for anyone who's familiar with predictive models. That's Isolation Forest. So it's kind of the outlier detection equivalent of an Extra Trees model. And like that is that it's composed of a whole bunch of trees and depending on the implementation of it, not scikit-learn's, but depending on the implementation of it, you can run all those trees in parallel, which can give you your answers quite fast if you have sufficient hardware.
-00:42:30 What can also be done with outlier detection too is, which can actually, depending on your environment, work a little faster, can actually be faster to run your detectors in sequence, believe it or not, because what you can do then is, depending on how you define your outliers, but say you say, I'm only concerned with things that are flagged by the isolation forest and the local outlier fracture model and the KNN model and the autoencoder model. +00:42:30 What can also be done with outlier detection too is, which can actually, depending on your environment, work a little faster, can actually be faster to run your detectors in sequence, believe it or not, because what you can do then is, depending on how you define your outliers, but say you say, I'm only concerned with things that are flagged by the Isolation Forest and the Local Outlier Factor model and the KNN model and the autoencoder model. 00:42:58 You can run them in, so anything that's filtered out by the first model doesn't need to be sent to the second model and then, or the third model or the fourth model, so you can run it, you can kind of put the fastest one out first and then the. @@ -938,9 +936,9 @@ 00:44:32 want to get a good balance of that. -00:44:33 - One of the other projects that you called out is Profit, time series anomaly detection with Profit, and it's based on the Facebook's open source library, Profit. +00:44:33 - One of the other projects that you called out is Prophet, time series anomaly detection with Prophet, and it's based on the Facebook's open source library, Prophet. -00:44:44 This is Profit anomaly detection versus just straight Profit, which is a tool for producing high quality forecasts for time series data that has multiple seasonality with linear and nonlinear growth. 
+00:44:44 This is Prophet anomaly detection versus just straight Prophet, which is a tool for producing high quality forecasts for time series data that has multiple seasonality with linear and nonlinear growth. 00:44:56 I imagine that Facebook probably has a lot of data in its time series. @@ -980,7 +978,7 @@ 00:46:18 and stuff. -00:46:20 - They do, the way profit works is that there's only, you work with tables data and they only have two columns. +00:46:20 - They do, the way Prophet works is that there's only, you work with tables data and they only have two columns. 00:46:25 There's the time column and there's the value column. @@ -1022,7 +1020,7 @@ 00:48:14 It's pretty sparse data. -00:48:16 - You might be able to see why you made the prediction you did because it's just based on how you, there's a lot of ways to do, profit works a certain way and there's a lot of ways to do time series forecasting, but generally you're just looking at the regular patterns, the general trend and maybe just like lag features and things like that. +00:48:16 - You might be able to see why you made the prediction you did because it's just based on how you, there's a lot of ways to do, Prophet works a certain way and there's a lot of ways to do time series forecasting, but generally you're just looking at the regular patterns, the general trend and maybe just like lag features and things like that. 00:48:33 So you can figure out why you made the prediction you did, but yeah, why it actually had the actual value that it did. @@ -1040,7 +1038,7 @@ 00:48:57 One, can I just throw a ton of data at ChatGPT and I'm not talking the, the ChatGPT four model. -00:49:04 I'm, I'm talking like, Oh, one reason the higher end model that can take could hold all the data potentially. +00:49:04 I'm, I'm talking like, o1 reasoning, the higher end model that can take, could hold all the data potentially. 00:49:09 And it's a little bit more thorough.
@@ -1058,7 +1056,7 @@ 00:49:30 I mean, it's going to be like a lot of things. -00:49:32 If you have an, if you have an expertise in these, sort of areas, you're you're gonna be better off just to take advantage of that. +00:49:32 If you have an, if you have an expertise in these, sort of areas, you're gonna be better off just to take advantage of that. 00:49:38 But like a lot of areas of data science, if you're working into an area that you're, you don't happen to have spent years thinking about, these, yeah, these tools can be a good place to get you started. @@ -1142,7 +1140,7 @@ 00:52:22 One thing I was really happy with is some of the people that are architects of the most important algorithms in outlier detection. -00:52:30 So things like isolation forest, local outlier factor, extended isolation forest. +00:52:30 So things like Isolation Forest, Local Outlier Factor, Extended Isolation Forest. 00:52:35 They gave some really nice feedback on the book. @@ -1151,86 +1149,3 @@ 00:52:41 And I think it's probably as comprehensive a book as you would need. 00:52:45 - I'll say it's quite comprehensive. - -00:52:47 I was going through it. - -00:52:48 - It covers what you would, yeah. - -00:52:50 If you're not doing time series analysis, for example, you can probably skip the chapter on time series analysis, but, or deep learning. - -00:52:56 But most of it, I think, is actually-- there's a lot of gotchas with outlier detection. - -00:53:01 I mean, they're not too hard to get your head around, but you do have to think about them. - -00:53:04 And so it does cover pretty much anything you would need to know. - -00:53:08 I think for almost any case, you wouldn't need to look for too many other resources after - -00:53:13 reading this. - -00:53:13 Excellent. - -00:53:14 And PyOD is also another good resource maybe to get started with, you think? - -00:53:17 Depending on your situation, either just scikit-learn or PyOD.
- -00:53:21 And then if you really need to go into deep learning, I would go in deep OD or even just pyod. - -00:53:26 And for time series, there's quite a number of libraries you can use as well. - -00:53:30 - Okay, excellent. - -00:53:32 Well, thank you so much for being on the show and sharing your work. - -00:53:35 - Well, thank you very much for having me. - -00:53:37 Yeah, that was very good. - -00:53:37 - Yeah, you bet. - -00:53:38 Bye now. - -00:53:39 This has been another episode of "Talk Python to Me." Thank you to our sponsors. - -00:53:44 Be sure to check out what they're offering. - -00:53:46 It really helps support the show. - -00:53:48 This episode is sponsored by Posit Connect from the makers of Shiny. - -00:53:52 publish, share, and deploy all of your data projects that you're creating using Python. - -00:53:56 Streamlit, Dash, Shiny, Bokeh, FastAPI, Flask, Quarto, Reports, Dashboards, and APIs. - -00:54:03 Posit Connect supports all of them. - -00:54:05 Try Posit Connect for free by going to talkpython.fm/posit, P-O-S-I-T. - -00:54:11 Want to level up your Python? - -00:54:13 We have one of the largest catalogs of Python video courses over at Talk Python. - -00:54:17 Our content ranges from true beginners to deeply advanced topics like memory and async. - -00:54:22 And best of all, there's not a subscription in sight. - -00:54:24 Check it out for yourself at training.talkpython.fm. - -00:54:28 Be sure to subscribe to the show, open your favorite podcast app, and search for Python. - -00:54:32 We should be right at the top. - -00:54:34 You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the Direct RSS feed at /rss on talkpython.fm. - -00:54:43 We're live streaming most of our recordings these days. - -00:54:46 If you wanna be part of the show and have your comments featured on the air, be sure to subscribe to our YouTube channel at talkpython.fm/youtube. - -00:54:54 This is your host, Michael Kennedy. 
- -00:54:56 Thanks so much for listening. - -00:54:57 I really appreciate it. - -00:54:58 Now get out there and write some Python code. -