diff --git a/transcripts/477-spacy-nlp.txt b/transcripts/477-spacy-nlp.txt
index 2b17c31e..eaecff3a 100644
--- a/transcripts/477-spacy-nlp.txt
+++ b/transcripts/477-spacy-nlp.txt
@@ -96,7 +96,7 @@
 00:03:55 I did operations research, which is this sort of applied subfield of math.
-00:03:59 That's very much a optimization problem, kind of Solvee kind of thing. So,
+00:03:59 That's very much an optimization problem, kind of solver kind of thing. So,
 00:04:04 traveling salesman problem, that kind of stuff.
@@ -128,7 +128,7 @@
 00:04:57 SymPy is cool. Google has OR tools, which is also a pretty easy starting point. And there's
-00:05:01 also another package called CVXpy, which is all about convex optimization problems. And that's
+00:05:01 also another package called CVXPY, which is all about convex optimization problems. And that's
 00:05:07 very scikit-learn friendly as well, by the way, if you're into that. If you're an operations
@@ -144,15 +144,15 @@
 00:05:35 we talked about CalmCode, which is a cool project that you've got going on. We'll talk about just a
-00:05:40 moment through the Python Bytes stuff. And then through ExplosionAI and Spacey and all that,
+00:05:40 moment through the Python Bytes stuff. And then through Explosion AI and spaCy and all that,
-00:05:47 we actually teamed up to do a course that you wrote called Getting Started with NLP and Spacey,
+00:05:47 we actually teamed up to do a course that you wrote called Getting Started with NLP and spaCy,
-00:05:52 which is over at TalkByThon, which is awesome. A lot of projects you got going on. Some of the
+00:05:52 which is over at Talk Python, which is awesome. A lot of projects you got going on. Some of the
 00:05:56 ideas that we're going to talk about here, and we'll dive into them as we get into the topics,
-00:06:01 come from your course on TalkByThon. I'll put the link in the show notes. People will definitely
+00:06:01 come from your course on Talk Python. I'll put the link in the show notes. People will definitely
 00:06:04 want to check that out. But yeah, tell us a little bit more about the stuff you got going on. Like
@@ -242,7 +242,7 @@
 00:09:15 YouTube channel. So definitely feel free to have a look at that. Yeah, I'll link that in the show
-00:09:19 notes. Okay. As you said, that was in the CommCode YouTube account. The CommCode is more courses than
+00:09:19 notes. Okay. As you said, that was in the CalmCode YouTube account. The CalmCode is more courses than
 00:09:28 it is keyboards, right? Yes, definitely. So it kind of started as a COVID project. I kind of
@@ -256,11 +256,11 @@
 00:09:53 position where, hey, let's just celebrate this project. So there's a collaborator helping me
-00:09:56 out now. We are also writing a book that's on behalf of the CommCode brand. Like if you click,
+00:09:56 out now. We are also writing a book that's on behalf of the CalmCode brand. Like if you click,
 00:10:01 people can't see, I suppose, but... It's linked right on the homepage though. Yeah.
-00:10:05 Yeah. So when you click it, like commcode.io/book, the book is titled Data Science Fiction.
+00:10:05 Yeah. So when you click it, like calmcode.io/book, the book is titled Data Science Fiction.
 00:10:10 The whole point of the book is just, these are anecdotes that people have told me while
@@ -278,7 +278,7 @@
 00:10:38 it. I do have fun writing it is what I will say, but that's also like courses and stuff like this.
-00:10:43 That's what I'm trying to do with the CommCode project. Just have something that's very fun to
+00:10:43 That's what I'm trying to do with the CalmCode project. Just have something that's very fun to
 00:10:47 maintain, but also something that people can actually have a good look at.
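For context on the CVXPY mention above, here is a minimal sketch of the kind of convex optimization problem it is built for; the numbers and variable names are illustrative, not from the episode.

import cvxpy as cp
import numpy as np

target = np.array([1.0, 2.0])
x = cp.Variable(2)                              # decision variables
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(x - target)),    # convex objective
    [x >= 0, cp.sum(x) <= 2],                   # linear constraints
)
problem.solve()                                 # hands off to a convex solver
print(problem.status, x.value)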
@@ -366,41 +366,41 @@
 00:14:08 Wow.
-00:14:09 So Psyched Lego, which is a somewhat popular project that I maintain, there's another
+00:14:09 So scikit-lego, which is a somewhat popular project that I maintain, there's another
 00:14:12 collaborator on that now, Francesco. Basically everyone who has made a serious contribution is
 00:14:17 also just invited to add a line to the poem. So it's just little things like that. That's what
-00:14:23 today I learned. It's very easy to sort of share. Psyched Lego, by the way, I'm going to brag about
+00:14:23 today I learned. It's very easy to sort of share. scikit-lego, by the way, I'm going to brag about
 00:14:28 that. It got a million downloads, got a million downloads now. So that happened two weeks ago.
 00:14:32 So super proud of that.
-00:14:34 What is Psyched Lego?
+00:14:34 What is scikit-lego?
-00:14:35 Psyched Learn has all sorts of components and you've got regression models, classification
+00:14:35 scikit-learn has all sorts of components and you've got regression models, classification
 00:14:40 models, pre-processing utilities, and you name it. And I, at some point, just noticed that there's a
 00:14:44 couple of these Lego bricks that I really like to use and I didn't feel like rewriting them for
-00:14:48 every single client I had. Psyched Lego just started out as a place for me and another maintainer
+00:14:48 every single client I had. scikit-lego just started out as a place for me and another maintainer
 00:14:54 just put stuff that we like to use. We didn't take the project that serious until other people did.
-00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Psyched Learn,
+00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give an example. But it's really just, there's a bunch of stuff that scikit-learn,
 00:15:07 because it's such a mature project. There's a couple of these experimental things that can't
-00:15:11 really go into Psyched Learn, but if people can convince us that it's a fun thing to maintain,
+00:15:11 really go into scikit-learn, but if people can convince us that it's a fun thing to maintain,
 00:15:15 we will gladly put it in here. That's kind of the goal of the library.
-00:15:18 Awesome. So kind of thinking of the building blocks of Psyched Learn as Lego blocks.
+00:15:18 Awesome. So kind of thinking of the building blocks of scikit-learn as Lego blocks.
-00:15:24 Psyched Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
+00:15:24 scikit-learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
 00:15:27 library contributes a couple of more experimental ones. It's such a place right now that they can't
@@ -520,11 +520,11 @@
 00:20:10 June for the audio channel. So it depends how you consumed it. So two to three months ago. Anyway,
-00:20:15 we talked more about LLMs, not so much Spacey, even though she's behind it. So give people a
+00:20:15 we talked more about LLMs, not so much spaCy, even though she's behind it. So give people a
-00:20:21 sense of what is Spacey. We just talked about Scikit-Learn and the types of problems it solves.
+00:20:21 sense of what is spaCy. We just talked about scikit-learn and the types of problems it solves.
-00:20:26 What about Spacey? - There's a couple of stories that could be told about it, but
+00:20:26 What about spaCy? - There's a couple of stories that could be told about it, but
 00:20:29 one way to maybe think about it is that in Python, we've always had tools that could do NLP. We also
@@ -538,11 +538,11 @@
 00:20:55 Lego bricks and it was definitely kind of useful, but it wasn't necessarily a coherent pipeline.
-00:20:59 And one way to, I think, historically describe Spacey, it was like a very honest, good attempt to
+00:20:59 And one way to, I think, historically describe spaCy, it was like a very honest, good attempt to
 00:21:06 make a pipeline for all these different NLP components that kind of clicked together.
-00:21:10 And the first component inside of Spacey that made it popular was basically a tokenizer,
+00:21:10 And the first component inside of spaCy that made it popular was basically a tokenizer,
 00:21:15 something that can take text and split it up into separate words. And basically that's a
@@ -586,7 +586,7 @@
 00:22:56 to do. - Oh, waving hand emoji, step one. - Yeah, exactly. You're giving me ideas. This is gonna
-00:23:02 happen. - But anyway, but back to SpaceGuy, I suppose. This is sort of the origin story. The
+00:23:02 happen. - But anyway, but back to spaCy, I suppose. This is sort of the origin story. The
 00:23:07 tokenization was the first sort of problem that they tackled. And then very quickly, they also
@@ -604,7 +604,7 @@
 00:23:41 also just happens to be the most popular verb in the English language. So if you're just going to
-00:23:45 match the string Go, you're simply not going to get there. Spacey was also one of the, I would
+00:23:45 match the string Go, you're simply not going to get there. spaCy was also one of the, I would
 00:23:50 say first projects that offered pretty good pre-trained free models that people could just
@@ -632,7 +632,7 @@
 00:24:43 where you can see all sorts of plugins that people made. But the core and like the main way I still
-00:24:47 like to think about Spacey, it is a relatively lightweight because a lot of it is implemented
+00:24:47 like to think about spaCy, it is a relatively lightweight because a lot of it is implemented
 00:24:51 in Cython. Pipeline for NLP projects. And again, like the main thing that people like to use it
@@ -650,7 +650,7 @@
 00:25:23 I would argue, interesting hobby projects as well that are just more for fun, I guess.
-00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacey's been around so much, some of those plugins are a
+00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because spaCy's been around so long, some of those plugins are a
 00:25:33 bit dated now. Like you can definitely imagine a project that got started five years ago.
@@ -660,7 +660,7 @@
 00:25:45 simple example here, just to give people a sense of, you know, maybe some, what does it look like
-00:25:50 to write code with Spacey? I mean, got to be a little careful talking code on audio formats,
+00:25:50 to write code with spaCy? I mean, got to be a little careful talking code on audio formats,
 00:25:54 but what's the program? We can do it. I think we can manage. I mean, the first thing you typically
@@ -668,7 +668,7 @@
 00:26:04 a model and there's kind of two ways of doing it. Like one thing you could do is you could say
-00:26:09 Spacey dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
+00:26:09 spaCy dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
 00:26:13 or you can have a blank English model. And that's the model that will only carry the tokenizer and
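A minimal sketch of the two entry points described in this stretch: spacy.blank for a tokenizer-only pipeline versus spacy.load for a trained one. The model name en_core_web_md is an assumption (the episode later mentions the medium pipeline) and must be downloaded first.

import spacy

nlp_blank = spacy.blank("nl")        # blank Dutch pipeline: tokenizer plus language rules only
nlp = spacy.load("en_core_web_md")   # trained English pipeline; first run:
                                     #   python -m spacy download en_core_web_md

doc = nlp("Talk Python To Me is a podcast about Python.")
for token in doc:                    # the doc is the structured document mentioned below
    print(token.text, token.lemma_, token.pos_)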
@@ -684,7 +684,7 @@
 00:26:41 But that's going to do all the heavy lifting. And then you get an object that can take text
-00:26:45 and then turn that into a structured document. That's the entry point into Spacey.
+00:26:45 and then turn that into a structured document. That's the entry point into spaCy.
 00:26:50 I see. So what you might do with a web scraping with beautiful soup or something,
@@ -804,7 +804,7 @@
 00:31:18 transcripts, and then lets you search them and do other things like that. And as part of that,
-00:31:22 I used spacey or was that weird? You spacey because building a little lightweight custom
+00:31:22 I used spaCy. Or was that weird? I used spaCy because, building a little lightweight custom
 00:31:30 search engine, I said, all right, well, if somebody searches for a plural thing or the
@@ -914,7 +914,7 @@
 00:35:46 that I do think is probably the most useful. You can just go that extra step further than just
-00:35:51 basic string matching and spacey out of the box has a lot of sensible defaults that you don't
+00:35:51 basic string matching and spaCy out of the box has a lot of sensible defaults that you don't
 00:35:56 have to think about. And there's for sure also like pretty good models on hugging face that you
@@ -922,7 +922,7 @@
 00:36:04 ponies. That's not always the case, but they are usually trained for like one task in mind.
-00:36:09 And the cool feeling that spacey just gives you is that even though it might not be the best,
+00:36:09 And the cool feeling that spaCy just gives you is that even though it might not be the best,
 00:36:13 most performant model, it will be fast enough usually. And it will also just be in just enough
@@ -930,7 +930,7 @@
 00:36:24 megabytes instead of gigabytes. If you, if you play your cards, right. Yes. So I see the word
-00:36:29 token in here on spacey and I know number of tokens in LLMs is like sort of how much memory or
+00:36:29 token in here on spaCy and I know number of tokens in LLMs is like sort of how much memory or
 00:36:37 context can they keep in mind? Are those the same things or they just happen to have the same word?
@@ -984,7 +984,7 @@
-00:38:34 about is I want to go back to this, getting started with spacey and NLP course that you created
+00:38:34 about is I want to go back to this, getting started with spaCy and NLP course that you created
-00:38:39 and talk through one of the, the pri let's say the primary demo dataset technique that you talked
+00:38:39 and talk through, let's say, the primary demo dataset technique that you talked
 00:38:46 about in the course. And that would be to go and take nine years of transcripts for the podcast.
@@ -1012,7 +1012,7 @@
 00:39:49 replacements that say that phrase always with that capitalization always leads to the correct
-00:39:55 version. And then a sink and a wait, oh no, it's a space sink where like you wash your hands.
+00:39:55 version. And then a sink and a wait, oh no, it's a spaCy sink where like you wash your hands.
 00:40:01 You're like, no, no, no, no, no. So there's a whole bunch of that that gets blasted on top
@@ -1024,7 +1024,7 @@
 00:40:21 scikit-learn, you know, the well-known machine learning package, it always gets translated into
 00:40:26 psychic learn. But that's an interesting aspect of like, you know, that the text that goes in is not
 00:40:33 necessarily perfect, but I was impressed. It is actually pretty darn good. There are some weird
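Earlier in this stretch Michael describes lemma-based matching for his transcript search engine, so a plural query still hits the singular form. A hedged sketch of that idea; the helper name and sample texts are hypothetical.

import spacy

nlp = spacy.load("en_core_web_md")   # any pipeline with a lemmatizer would do

def lemmas(text):
    # Reduce text to lowercase lemmas so "databases" and "database" compare equal.
    return {tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha}

query = lemmas("databases")
line = lemmas("We talked about the database layer for an hour.")
print(query & line)                  # {'database'} -> a hit despite the plural query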
@@ -1042,13 +1042,13 @@
 00:41:04 incredibly cool. Right. You have one function you call that will read nine years of text and return
-00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacey
+00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spaCy
 00:41:13 is made, if you're from scikit-learn, this sounds a bit surprising because in scikit-learn land,
 00:41:18 you are typically used to the fact that you do batching and stuff that's factorized and
-00:41:22 NumPy. And that's sort of the way you would do it. But spacey actually has a small preference
+00:41:22 NumPy. And that's sort of the way you would do it. But spaCy actually has a small preference
 00:41:25 to using generators. And the whole thinking is that in natural language problems, you are
@@ -1096,7 +1096,7 @@
-00:43:14 So that's the first thing that I usually end up doing when I'm doing something with spacey,
+00:43:14 So that's the first thing that I usually end up doing when I'm doing something with spaCy,
-00:43:17 just get it into a generator. Spacey can batch the stuff for you, such as it's still nice and quick,
+00:43:17 just get it into a generator. spaCy can batch the stuff for you, such that it's still nice and quick,
 00:43:22 and you can do things in parallel even, but you think in generators a bit more than you do in
@@ -1112,11 +1112,11 @@
 00:43:53 It is definitely different. When you're a data scientist, you're usually used to,
-00:43:57 "Oh, it's a Pana's data frame. Everything's a Pana's data frame. I wake up and I brush my teeth
+00:43:57 "Oh, it's a pandas data frame. Everything's a pandas data frame. I wake up and I brush my teeth
-00:44:01 with a Pana's data frame." But in spacey land, that's the first thing you do notice. It's not
+00:44:01 with a pandas data frame." But in spaCy land, that's the first thing you do notice. It's not
-00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacey,
+00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spaCy,
-00:44:11 there's a little library called Seriously, that's for serialization. And one of the things that it
+00:44:11 there's a little library called srsly, that's for serialization. And one of the things that it
@@ -1146,7 +1146,7 @@
 00:45:18 And it turned out that you can actually catch a whole bunch of these Python projects by just
-00:45:22 taking the spacey product model, like the standard NER model, I think in the medium pipeline. And
+00:45:22 taking the spaCy product model, like the standard NER model, I think in the medium pipeline. And
 00:45:27 you would just tell it like, "Hey, find me all the products." And of course it's not a perfect
@@ -1188,7 +1188,7 @@
 00:46:54 I talk about how to do a, how to structure an NLP project. But at the end, I also talk about
 00:46:58 these large language models and things you can do with that. And I use OpenAI. That's the thing I
-00:47:03 use. But there's also this new tool called Glee NER. You can find it on the Hugging Face. It's
+00:47:03 use. But there's also this new tool called GLiNER. You can find it on Hugging Face. It's
@@ -1330,7 +1330,7 @@
 00:52:20 but I can definitely imagine if you were really interested in doing something with like Python
-00:52:24 tools, I would probably start with the Python bites one looking, thinking out loud. Maybe.
+00:52:24 tools, I would probably start with the Python Bytes one, looking, thinking out loud. Maybe.
 00:52:29 Yeah, that's a good idea. It's a good idea. The first step is that this is like publicly
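A small sketch of the generator-first style described above, assuming plain text transcript files on disk; the file name is illustrative. nlp.pipe consumes the generator lazily, batches internally, and can parallelize via its n_process option.

import spacy

nlp = spacy.load("en_core_web_md")

def transcript_lines(paths):
    # Yield one line at a time rather than holding nine years of text in memory.
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield line.strip()

for doc in nlp.pipe(transcript_lines(["477-spacy-nlp.txt"]), batch_size=64):
    entities = [(ent.text, ent.label_) for ent in doc.ents]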
@@ -1464,7 +1464,7 @@
 00:57:30 that to what I get out of an LLM. When those two models disagree, something interesting is usually
-00:57:35 happening because the LLM model is pretty good and the spacey model is pretty good. But when they
+00:57:35 happening because the LLM model is pretty good and the spaCy model is pretty good. But when they
 00:57:39 disagree, then I'm probably dealing with either a model that can be improved or data point that's
@@ -1540,7 +1540,7 @@
 01:00:30 Anyway, let's go ahead and wrap this thing up. Like people are interested in NLP,
-01:00:34 spacey, maybe beyond like what in that space and what else do you want to leave people with?
+01:00:34 spaCy, maybe beyond like what in that space and what else do you want to leave people with?
 01:00:39 - I guess the main thing is just approach everything with curiosity. And if you're
@@ -1607,4 +1607,3 @@
 01:03:14 at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening.
 01:03:20 I really appreciate it. Now get out there and write some Python code.
-
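A rough sketch of the disagreement trick described around 00:57 — comparing spaCy's pretrained NER against LLM output to surface data points worth reviewing. The LLM call is elided here; a hardcoded set stands in, and the sample text is invented.

import spacy

nlp = spacy.load("en_core_web_md")

def spacy_products(text):
    # PRODUCT entities according to spaCy's pretrained NER.
    return {ent.text for ent in nlp(text).ents if ent.label_ == "PRODUCT"}

text = "We used FastAPI and Pydantic on that project."
llm_products = {"FastAPI", "Pydantic"}   # stand-in for a real LLM response

# Symmetric difference: anything one model found and the other missed
# is a candidate for human review or for improving a model.
print(spacy_products(text) ^ llm_products)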