Commit ce3cdaf

Merge pull request #161 from manikanta-hitunik-com/patch-497175
Update 477-spacy-nlp.txt
2 parents d04b15b + 5ca20dc commit ce3cdaf

File tree

1 file changed: +48 −49 lines


transcripts/477-spacy-nlp.txt

Lines changed: 48 additions & 49 deletions
@@ -96,7 +96,7 @@
 00:03:55 I did operations research, which is this sort of applied subfield of math.
-00:03:59 That's very much a optimization problem, kind of Solvee kind of thing. So,
+00:03:59 That's very much a optimization problem, kind of Solved kind of thing. So,
 00:04:04 traveling salesman problem, that kind of stuff.

@@ -128,7 +128,7 @@
 00:04:57 SymPy is cool. Google has OR tools, which is also a pretty easy starting point. And there's
-00:05:01 also another package called CVXpy, which is all about convex optimization problems. And that's
+00:05:01 also another package called CVXPY, which is all about convex optimization problems. And that's
 00:05:07 very scikit-learn friendly as well, by the way, if you're into that. If you're an operations
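For readers who want a concrete feel for the CVXPY package mentioned in this hunk, a minimal sketch of a convex problem (the data is invented for illustration): a non-negative least-squares fit.

    import cvxpy as cp
    import numpy as np

    # Invented data: find non-negative weights w so that X @ w approximates y.
    X = np.random.rand(20, 3)
    y = np.random.rand(20)

    w = cp.Variable(3)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(X @ w - y)), [w >= 0])
    problem.solve()
    print(w.value)  # the optimal non-negative weights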

@@ -144,15 +144,15 @@
 00:05:35 we talked about CalmCode, which is a cool project that you've got going on. We'll talk about just a
-00:05:40 moment through the Python Bytes stuff. And then through ExplosionAI and Spacey and all that,
+00:05:40 moment through the Python Bytes stuff. And then through Explosion AI and Spacy and all that,
 00:05:47 we actually teamed up to do a course that you wrote called Getting Started with NLP and Spacey,
-00:05:52 which is over at TalkByThon, which is awesome. A lot of projects you got going on. Some of the
+00:05:52 which is over at Talk Python, which is awesome. A lot of projects you got going on. Some of the
 00:05:56 ideas that we're going to talk about here, and we'll dive into them as we get into the topics,
-00:06:01 come from your course on TalkByThon. I'll put the link in the show notes. People will definitely
+00:06:01 come from your course on Talk Python. I'll put the link in the show notes. People will definitely
 00:06:04 want to check that out. But yeah, tell us a little bit more about the stuff you got going on. Like

@@ -242,7 +242,7 @@
 00:09:15 YouTube channel. So definitely feel free to have a look at that. Yeah, I'll link that in the show
-00:09:19 notes. Okay. As you said, that was in the CommCode YouTube account. The CommCode is more courses than
+00:09:19 notes. Okay. As you said, that was in the CalmCode YouTube account. The CalmCode is more courses than
 00:09:28 it is keyboards, right? Yes, definitely. So it kind of started as a COVID project. I kind of

@@ -256,11 +256,11 @@
 00:09:53 position where, hey, let's just celebrate this project. So there's a collaborator helping me
-00:09:56 out now. We are also writing a book that's on behalf of the CommCode brand. Like if you click,
+00:09:56 out now. We are also writing a book that's on behalf of the CalmCode brand. Like if you click,
 00:10:01 people can't see, I suppose, but... It's linked right on the homepage though. Yeah.
-00:10:05 Yeah. So when you click it, like commcode.io/book, the book is titled Data Science Fiction.
+00:10:05 Yeah. So when you click it, like calmcode.io/book, the book is titled Data Science Fiction.
 00:10:10 The whole point of the book is just, these are anecdotes that people have told me while

@@ -278,7 +278,7 @@
 00:10:38 it. I do have fun writing it is what I will say, but that's also like courses and stuff like this.
-00:10:43 That's what I'm trying to do with the CommCode project. Just have something that's very fun to
+00:10:43 That's what I'm trying to do with the CalmCode project. Just have something that's very fun to
 00:10:47 maintain, but also something that people can actually have a good look at.

@@ -366,41 +366,41 @@
 00:14:08 Wow.
-00:14:09 So Psyched Lego, which is a somewhat popular project that I maintain, there's another
+00:14:09 So Scikit Lego, which is a somewhat popular project that I maintain, there's another
 00:14:12 collaborator on that now, Francesco. Basically everyone who has made a serious contribution is
 00:14:17 also just invited to add a line to the poem. So it's just little things like that. That's what
-00:14:23 today I learned. It's very easy to sort of share. Psyched Lego, by the way, I'm going to brag about
+00:14:23 today I learned. It's very easy to sort of share. Scikit Lego, by the way, I'm going to brag about
 00:14:28 that. It got a million downloads, got a million downloads now. So that happened two weeks ago.
 00:14:32 So super proud of that.
-00:14:34 What is Psyched Lego?
+00:14:34 What is Scikit Lego?
-00:14:35 Psyched Learn has all sorts of components and you've got regression models, classification
+00:14:35 Scikit Learn has all sorts of components and you've got regression models, classification
 00:14:40 models, pre-processing utilities, and you name it. And I, at some point, just noticed that there's a
 00:14:44 couple of these Lego bricks that I really like to use and I didn't feel like rewriting them for
-00:14:48 every single client I had. Psyched Lego just started out as a place for me and another maintainer
+00:14:48 every single client I had. Scikit-Lego just started out as a place for me and another maintainer
 00:14:54 just put stuff that we like to use. We didn't take the project that serious until other people did.
-00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Psyched Learn,
+00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Scikit Learn,
 00:15:07 because it's such a mature project. There's a couple of these experimental things that can't
-00:15:11 really go into Psyched Learn, but if people can convince us that it's a fun thing to maintain,
+00:15:11 really go into Scikit Learn, but if people can convince us that it's a fun thing to maintain,
 00:15:15 we will gladly put it in here. That's kind of the goal of the library.
-00:15:18 Awesome. So kind of thinking of the building blocks of Psyched Learn as Lego blocks.
+00:15:18 Awesome. So kind of thinking of the building blocks of Scikit Learn as Lego blocks.
-00:15:24 Psyched Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
+00:15:24 Scikit Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
 00:15:27 library contributes a couple of more experimental ones. It's such a place right now that they can't

@@ -520,11 +520,11 @@
 00:20:10 June for the audio channel. So it depends how you consumed it. So two to three months ago. Anyway,
-00:20:15 we talked more about LLMs, not so much Spacey, even though she's behind it. So give people a
+00:20:15 we talked more about LLMs, not so much Spacy, even though she's behind it. So give people a
-00:20:21 sense of what is Spacey. We just talked about Scikit-Learn and the types of problems it solves.
+00:20:21 sense of what is Spacy. We just talked about Scikit-Learn and the types of problems it solves.
-00:20:26 What about Spacey? - There's a couple of stories that could be told about it, but
+00:20:26 What about Spacy? - There's a couple of stories that could be told about it, but
 00:20:29 one way to maybe think about it is that in Python, we've always had tools that could do NLP. We also

@@ -538,11 +538,11 @@
 00:20:55 Lego bricks and it was definitely kind of useful, but it wasn't necessarily a coherent pipeline.
-00:20:59 And one way to, I think, historically describe Spacey, it was like a very honest, good attempt to
+00:20:59 And one way to, I think, historically describe Spacy, it was like a very honest, good attempt to
 00:21:06 make a pipeline for all these different NLP components that kind of clicked together.
-00:21:10 And the first component inside of Spacey that made it popular was basically a tokenizer,
+00:21:10 And the first component inside of Spacy that made it popular was basically a tokenizer,
 00:21:15 something that can take text and split it up into separate words. And basically that's a
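As a rough sketch of that tokenizer step (a blank pipeline carries only the tokenizer; the sample sentence is invented):

    import spacy

    nlp = spacy.blank("en")  # tokenizer only, no trained components
    doc = nlp("Let's split this sentence up, shall we?")
    print([token.text for token in doc])
    # roughly: ['Let', "'s", 'split', 'this', 'sentence', 'up', ',', 'shall', 'we', '?']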

@@ -586,7 +586,7 @@
 00:22:56 to do. - Oh, waving hand emoji, step one. - Yeah, exactly. You're giving me ideas. This is gonna
-00:23:02 happen. - But anyway, but back to SpaceGuy, I suppose. This is sort of the origin story. The
+00:23:02 happen. - But anyway, but back to Spacy guy, I suppose. This is sort of the origin story. The
 00:23:07 tokenization was the first sort of problem that they tackled. And then very quickly, they also

@@ -604,7 +604,7 @@
 00:23:41 also just happens to be the most popular verb in the English language. So if you're just going to
-00:23:45 match the string Go, you're simply not going to get there. Spacey was also one of the, I would
+00:23:45 match the string Go, you're simply not going to get there. Spacy was also one of the, I would
 00:23:50 say first projects that offered pretty good pre-trained free models that people could just

@@ -632,7 +632,7 @@
 00:24:43 where you can see all sorts of plugins that people made. But the core and like the main way I still
-00:24:47 like to think about Spacey, it is a relatively lightweight because a lot of it is implemented
+00:24:47 like to think about Spacy, it is a relatively lightweight because a lot of it is implemented
 00:24:51 in Cython. Pipeline for NLP projects. And again, like the main thing that people like to use it

@@ -650,7 +650,7 @@
 00:25:23 I would argue, interesting hobby projects as well that are just more for fun, I guess.
-00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacey's been around so much, some of those plugins are a
+00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacy's been around so much, some of those plugins are a
 00:25:33 bit dated now. Like you can definitely imagine a project that got started five years ago.

@@ -660,15 +660,15 @@
 00:25:45 simple example here, just to give people a sense of, you know, maybe some, what does it look like
-00:25:50 to write code with Spacey? I mean, got to be a little careful talking code on audio formats,
+00:25:50 to write code with Spacy? I mean, got to be a little careful talking code on audio formats,
 00:25:54 but what's the program? We can do it. I think we can manage. I mean, the first thing you typically
 00:25:58 do is you just call import Spacey and that's pretty straightforward, but then you got to load
 00:26:04 a model and there's kind of two ways of doing it. Like one thing you could do is you could say
-00:26:09 Spacey dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
+00:26:09 Spacy dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
 00:26:13 or you can have a blank English model. And that's the model that will only carry the tokenizer and
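The two loading styles described here look roughly like this (assuming spaCy 3.x, and that a pre-trained pipeline such as en_core_web_sm has been downloaded):

    import spacy

    nlp_blank = spacy.blank("nl")       # blank Dutch pipeline: tokenizer only
    nlp = spacy.load("en_core_web_sm")  # pre-trained pipeline: tagger, parser, NER, ...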

@@ -684,7 +684,7 @@
 00:26:41 But that's going to do all the heavy lifting. And then you get an object that can take text
-00:26:45 and then turn that into a structured document. That's the entry point into Spacey.
+00:26:45 and then turn that into a structured document. That's the entry point into Spacy.
 00:26:50 I see. So what you might do with a web scraping with beautiful soup or something,
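And the entry point itself, as a small sketch (the sentence is invented; en_core_web_sm assumed as above):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Talk Python is a podcast about the Python ecosystem.")

    # The Doc is the structured view of the raw text, a bit like what a
    # parsed soup object is to raw HTML.
    for token in doc:
        print(token.text, token.pos_, token.lemma_)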

@@ -804,7 +804,7 @@
 00:31:18 transcripts, and then lets you search them and do other things like that. And as part of that,
-00:31:22 I used spacey or was that weird? You spacey because building a little lightweight custom
+00:31:22 I used spacy or was that weird? You spacy because building a little lightweight custom
 00:31:30 search engine, I said, all right, well, if somebody searches for a plural thing or the
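A hedged sketch of that lemma trick for lightweight search, so plurals and conjugations still match (pipeline name assumed as before):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def lemmas(text: str) -> set[str]:
        # Normalize words to their dictionary form, e.g. "transcripts"
        # -> "transcript", so a plural query still hits the singular.
        return {token.lemma_.lower() for token in nlp(text)}

    # Overlapping lemmas suggest a match between query and document.
    print(lemmas("searching old transcripts") & lemmas("We search each transcript."))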

@@ -914,23 +914,23 @@
 00:35:46 that I do think is probably the most useful. You can just go that extra step further than just
-00:35:51 basic string matching and spacey out of the box has a lot of sensible defaults that you don't
+00:35:51 basic string matching and spacy out of the box has a lot of sensible defaults that you don't
 00:35:56 have to think about. And there's for sure also like pretty good models on hugging face that you
 00:36:00 can go ahead and download for free. But typically those models are like kind of like one trick
 00:36:04 ponies. That's not always the case, but they are usually trained for like one task in mind.
-00:36:09 And the cool feeling that spacey just gives you is that even though it might not be the best,
+00:36:09 And the cool feeling that spacy just gives you is that even though it might not be the best,
 00:36:13 most performant model, it will be fast enough usually. And it will also just be in just enough
 00:36:18 in general. Yeah. And it doesn't have the heavy, heavy weight overloading. It's definitely a
 00:36:24 megabytes instead of gigabytes. If you, if you play your cards, right. Yes. So I see the word
-00:36:29 token in here on spacey and I know number of tokens in LLMs is like sort of how much memory or
+00:36:29 token in here on spacy and I know number of tokens in LLMs is like sort of how much memory or
 00:36:37 context can they keep in mind? Are those the same things or they just happen to have the same word?
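One way to see the difference behind that question: spaCy's tokens are word-level, while LLM tokenizers usually work on subword pieces. A small comparison sketch, assuming the tiktoken package for the LLM side:

    import spacy
    import tiktoken

    text = "Tokenization is surprisingly subtle."

    nlp = spacy.blank("en")
    print([t.text for t in nlp(text)])                  # word-level tokens

    enc = tiktoken.get_encoding("cl100k_base")
    print([enc.decode([i]) for i in enc.encode(text)])  # subword pieces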

@@ -984,7 +984,7 @@
 00:38:34 about is I want to go back to this, getting started with spacey and NLP course that you created
-00:38:39 and talk through one of the, the pri let's say the primary demo dataset technique that you talked
+00:38:39 and talk through one of the, the pre let's say the primary demo dataset technique that you talked
 00:38:46 about in the course. And that would be to go and take nine years of transcripts for the podcast.

@@ -1012,7 +1012,7 @@
 00:39:49 replacements that say that phrase always with that capitalization always leads to the correct
-00:39:55 version. And then a sink and a wait, oh no, it's a space sink where like you wash your hands.
+00:39:55 version. And then a sink and a wait, oh no, it's a spacy sink where like you wash your hands.
 00:40:01 You're like, no, no, no, no, no. So there's a whole bunch of that that gets blasted on top

@@ -1024,7 +1024,7 @@
 00:40:21 scikit-learn, you know, the well-known machine learning package, it always gets translated into
-00:40:26 psychic learn. But that's an interesting aspect of like, you know, that the text that goes in is not
+00:40:26 scikit- learn. But that's an interesting aspect of like, you know, that the text that goes in is not
 00:40:33 necessarily perfect, but I was impressed. It is actually pretty darn good. There are some weird

@@ -1042,13 +1042,13 @@
 00:41:04 incredibly cool. Right. You have one function you call that will read nine years of text and return
-00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacey
+00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacy
 00:41:13 is made, if you're from scikit-learn, this sounds a bit surprising because in scikit-learn land,
 00:41:18 you are typically used to the fact that you do batching and stuff that's factorized and
-00:41:22 NumPy. And that's sort of the way you would do it. But spacey actually has a small preference
+00:41:22 NumPy. And that's sort of the way you would do it. But spacy actually has a small preference
 00:41:25 to using generators. And the whole thinking is that in natural language problems, you are

@@ -1096,7 +1096,7 @@
 00:43:14 So that's the first thing that I usually end up doing when I'm doing something with spacey,
-00:43:17 just get it into a generator. Spacey can batch the stuff for you, such as it's still nice and quick,
+00:43:17 just get it into a generator. Spacy can batch the stuff for you, such as it's still nice and quick,
 00:43:22 and you can do things in parallel even, but you think in generators a bit more than you do in
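That generator-plus-batching pattern, as a minimal sketch (the file path just reuses this transcript for illustration; the pipeline name is assumed as before):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def lines(path):
        # Lazily yield one line at a time; nine years of transcripts
        # never have to sit in memory all at once.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.strip()

    # nlp.pipe consumes the generator in batches (and can parallelize
    # with n_process) while still yielding one Doc at a time.
    for doc in nlp.pipe(lines("transcripts/477-spacy-nlp.txt"), batch_size=64):
        pass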

@@ -1112,11 +1112,11 @@
 00:43:53 It is definitely different. When you're a data scientist, you're usually used to,
-00:43:57 "Oh, it's a Pana's data frame. Everything's a Pana's data frame. I wake up and I brush my teeth
+00:43:57 "Oh, it's a Panda's data frame. Everything's a Panda's data frame. I wake up and I brush my teeth
-00:44:01 with a Pana's data frame." But in spacey land, that's the first thing you do notice. It's not
+00:44:01 with a Panda's data frame." But in spacy land, that's the first thing you do notice. It's not
-00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacey,
+00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacy,
 00:44:11 there's a little library called Seriously, that's for serialization. And one of the things that it
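The library transcribed as "Seriously" here is Explosion's srsly package; a small sketch of its generator-friendly JSONL reading (the file name is invented):

    import srsly

    # read_jsonl yields one dict per line instead of loading the whole
    # file, which fits spaCy's generator-first style.
    for record in srsly.read_jsonl("transcripts.jsonl"):
        print(record["text"])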

@@ -1146,7 +1146,7 @@
 00:45:18 And it turned out that you can actually catch a whole bunch of these Python projects by just
-00:45:22 taking the spacey product model, like the standard NER model, I think in the medium pipeline. And
+00:45:22 taking the spacy product model, like the standard NER model, I think in the medium pipeline. And
 00:45:27 you would just tell it like, "Hey, find me all the products." And of course it's not a perfect
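A hedged sketch of that trick with the medium pipeline's pre-trained NER (en_core_web_md must be downloaded first; the sentence is invented, and the matches are imperfect by design):

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("We built the dashboard with Flask and deployed it on Heroku.")

    # Keep only spans the pre-trained NER labeled as products.
    print([ent.text for ent in doc.ents if ent.label_ == "PRODUCT"])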

@@ -1188,7 +1188,7 @@
 00:46:54 I talk about how to do a, how to structure an NLP project. But at the end, I also talk about
-00:46:58 these large language models and things you can do with that. And I use OpenAI. That's the thing I
+00:46:58 these large language models and things you can do with that. And I use Open AI. That's the thing I
 00:47:03 use. But there's also this new tool called Glee NER. You can find it on the Hugging Face. It's

@@ -1330,7 +1330,7 @@
 00:52:20 but I can definitely imagine if you were really interested in doing something with like Python
-00:52:24 tools, I would probably start with the Python bites one looking, thinking out loud. Maybe.
+00:52:24 tools, I would probably start with the Python bytes one looking, thinking out loud. Maybe.
 00:52:29 Yeah, that's a good idea. It's a good idea. The first step is that this is like publicly

@@ -1464,7 +1464,7 @@
 00:57:30 that to what I get out of an LLM. When those two models disagree, something interesting is usually
-00:57:35 happening because the LLM model is pretty good and the spacey model is pretty good. But when they
+00:57:35 happening because the LLM model is pretty good and the spacy model is pretty good. But when they
 00:57:39 disagree, then I'm probably dealing with either a model that can be improved or data point that's
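The disagreement check itself can be as simple as a set difference; a minimal sketch with placeholder entity lists (in practice one set would come from the spaCy pipeline and one from an LLM prompt):

    def disagreements(spacy_ents: set[str], llm_ents: set[str]) -> set[str]:
        # Entities only one model found are the interesting cases: either
        # a model that can be improved or an odd data point worth a look.
        return spacy_ents ^ llm_ents

    print(disagreements({"spaCy", "Flask"}, {"spaCy", "Heroku"}))
    # {'Flask', 'Heroku'}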

@@ -1540,7 +1540,7 @@
 01:00:30 Anyway, let's go ahead and wrap this thing up. Like people are interested in NLP,
-01:00:34 spacey, maybe beyond like what in that space and what else do you want to leave people with?
+01:00:34 spacey, maybe beyond like what in that spacy and what else do you want to leave people with?
 01:00:39 - I guess the main thing is just approach everything with curiosity. And if you're

@@ -1607,4 +1607,3 @@
 01:03:14 at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening.
 01:03:20 I really appreciate it. Now get out there and write some Python code.
-
