Merged
97 changes: 48 additions & 49 deletions transcripts/477-spacy-nlp.txt
@@ -96,7 +96,7 @@

00:03:55 I did operations research, which is this sort of applied subfield of math.

-00:03:59 That's very much a optimization problem, kind of Solvee kind of thing. So,
+00:03:59 That's very much a optimization problem, kind of Solved kind of thing. So,

00:04:04 traveling salesman problem, that kind of stuff.

@@ -128,7 +128,7 @@

00:04:57 SymPy is cool. Google has OR tools, which is also a pretty easy starting point. And there's

-00:05:01 also another package called CVXpy, which is all about convex optimization problems. And that's
+00:05:01 also another package called CVXPY, which is all about convex optimization problems. And that's

00:05:07 very scikit-learn friendly as well, by the way, if you're into that. If you're an operations

@@ -144,15 +144,15 @@

00:05:35 we talked about CalmCode, which is a cool project that you've got going on. We'll talk about just a

-00:05:40 moment through the Python Bytes stuff. And then through ExplosionAI and Spacey and all that,
+00:05:40 moment through the Python Bytes stuff. And then through Explosion AI and Spacy and all that,

00:05:47 we actually teamed up to do a course that you wrote called Getting Started with NLP and Spacy,

-00:05:52 which is over at TalkByThon, which is awesome. A lot of projects you got going on. Some of the
+00:05:52 which is over at Talk Python, which is awesome. A lot of projects you got going on. Some of the

00:05:56 ideas that we're going to talk about here, and we'll dive into them as we get into the topics,

-00:06:01 come from your course on TalkByThon. I'll put the link in the show notes. People will definitely
+00:06:01 come from your course on Talk Python. I'll put the link in the show notes. People will definitely

00:06:04 want to check that out. But yeah, tell us a little bit more about the stuff you got going on. Like

@@ -242,7 +242,7 @@

00:09:15 YouTube channel. So definitely feel free to have a look at that. Yeah, I'll link that in the show

-00:09:19 notes. Okay. As you said, that was in the CommCode YouTube account. The CommCode is more courses than
+00:09:19 notes. Okay. As you said, that was in the CalmCode YouTube account. The CalmCode is more courses than

00:09:28 it is keyboards, right? Yes, definitely. So it kind of started as a COVID project. I kind of

@@ -256,11 +256,11 @@

00:09:53 position where, hey, let's just celebrate this project. So there's a collaborator helping me

-00:09:56 out now. We are also writing a book that's on behalf of the CommCode brand. Like if you click,
+00:09:56 out now. We are also writing a book that's on behalf of the CalmCode brand. Like if you click,

00:10:01 people can't see, I suppose, but... It's linked right on the homepage though. Yeah.

-00:10:05 Yeah. So when you click it, like commcode.io/book, the book is titled Data Science Fiction.
+00:10:05 Yeah. So when you click it, like calmcode.io/book, the book is titled Data Science Fiction.

00:10:10 The whole point of the book is just, these are anecdotes that people have told me while

@@ -278,7 +278,7 @@

00:10:38 it. I do have fun writing it is what I will say, but that's also like courses and stuff like this.

-00:10:43 That's what I'm trying to do with the CommCode project. Just have something that's very fun to
+00:10:43 That's what I'm trying to do with the CalmCode project. Just have something that's very fun to

00:10:47 maintain, but also something that people can actually have a good look at.

@@ -366,41 +366,41 @@

00:14:08 Wow.

-00:14:09 So Psyched Lego, which is a somewhat popular project that I maintain, there's another
+00:14:09 So Scikit Lego, which is a somewhat popular project that I maintain, there's another

00:14:12 collaborator on that now, Francesco. Basically everyone who has made a serious contribution is

00:14:17 also just invited to add a line to the poem. So it's just little things like that. That's what

-00:14:23 today I learned. It's very easy to sort of share. Psyched Lego, by the way, I'm going to brag about
+00:14:23 today I learned. It's very easy to sort of share. Scikit Lego, by the way, I'm going to brag about

00:14:28 that. It got a million downloads, got a million downloads now. So that happened two weeks ago.

00:14:32 So super proud of that.

-00:14:34 What is Psyched Lego?
+00:14:34 What is Scikit Lego?

-00:14:35 Psyched Learn has all sorts of components and you've got regression models, classification
+00:14:35 Scikit Learn has all sorts of components and you've got regression models, classification

00:14:40 models, pre-processing utilities, and you name it. And I, at some point, just noticed that there's a

00:14:44 couple of these Lego bricks that I really like to use and I didn't feel like rewriting them for

-00:14:48 every single client I had. Psyched Lego just started out as a place for me and another maintainer
+00:14:48 every single client I had. Scikit-Lego just started out as a place for me and another maintainer

00:14:54 just put stuff that we like to use. We didn't take the project that serious until other people did.

-00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Psyched Learn,
+00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Scikit Learn,

00:15:07 because it's such a mature project. There's a couple of these experimental things that can't

-00:15:11 really go into Psyched Learn, but if people can convince us that it's a fun thing to maintain,
+00:15:11 really go into Scikit Learn, but if people can convince us that it's a fun thing to maintain,

00:15:15 we will gladly put it in here. That's kind of the goal of the library.

-00:15:18 Awesome. So kind of thinking of the building blocks of Psyched Learn as Lego blocks.
+00:15:18 Awesome. So kind of thinking of the building blocks of Scikit Learn as Lego blocks.

-00:15:24 Psyched Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
+00:15:24 Scikit Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this

00:15:27 library contributes a couple of more experimental ones. It's such a place right now that they can't
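The Lego-brick idea described here can be sketched with plain scikit-learn (the data and component choices below are illustrative, not from the episode): anything with the fit/transform or fit/predict contract snaps into a Pipeline, and a package like scikit-lego simply ships extra bricks that honor the same contract.

```python
# A minimal sketch of scikit-learn's "lego brick" design: components that
# follow the fit/transform (or fit/predict) contract compose into a Pipeline.
# Extension packages such as scikit-lego add extra bricks with the same shape.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing brick
    ("model", LogisticRegression()),  # estimator brick
])
pipe.fit(X, y)
print(pipe.predict([[1.5, 1.5], [8.5, 8.5]]))  # one easy point per class
```

Swapping in an experimental brick from scikit-lego would just mean replacing one of the pipeline steps; the surrounding code stays the same.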

@@ -520,11 +520,11 @@

00:20:10 June for the audio channel. So it depends how you consumed it. So two to three months ago. Anyway,

-00:20:15 we talked more about LLMs, not so much Spacey, even though she's behind it. So give people a
+00:20:15 we talked more about LLMs, not so much Spacy, even though she's behind it. So give people a

-00:20:21 sense of what is Spacey. We just talked about Scikit-Learn and the types of problems it solves.
+00:20:21 sense of what is Spacy. We just talked about Scikit-Learn and the types of problems it solves.

-00:20:26 What about Spacey? - There's a couple of stories that could be told about it, but
+00:20:26 What about Spacy? - There's a couple of stories that could be told about it, but

00:20:29 one way to maybe think about it is that in Python, we've always had tools that could do NLP. We also

@@ -538,11 +538,11 @@

00:20:55 Lego bricks and it was definitely kind of useful, but it wasn't necessarily a coherent pipeline.

-00:20:59 And one way to, I think, historically describe Spacey, it was like a very honest, good attempt to
+00:20:59 And one way to, I think, historically describe Spacy, it was like a very honest, good attempt to

00:21:06 make a pipeline for all these different NLP components that kind of clicked together.

-00:21:10 And the first component inside of Spacey that made it popular was basically a tokenizer,
+00:21:10 And the first component inside of Spacy that made it popular was basically a tokenizer,

00:21:15 something that can take text and split it up into separate words. And basically that's a

@@ -586,7 +586,7 @@

00:22:56 to do. - Oh, waving hand emoji, step one. - Yeah, exactly. You're giving me ideas. This is gonna

-00:23:02 happen. - But anyway, but back to SpaceGuy, I suppose. This is sort of the origin story. The
+00:23:02 happen. - But anyway, but back to Spacy guy, I suppose. This is sort of the origin story. The

00:23:07 tokenization was the first sort of problem that they tackled. And then very quickly, they also

@@ -604,7 +604,7 @@

00:23:41 also just happens to be the most popular verb in the English language. So if you're just going to

-00:23:45 match the string Go, you're simply not going to get there. Spacey was also one of the, I would
+00:23:45 match the string Go, you're simply not going to get there. Spacy was also one of the, I would

00:23:50 say first projects that offered pretty good pre-trained free models that people could just

@@ -632,7 +632,7 @@

00:24:43 where you can see all sorts of plugins that people made. But the core and like the main way I still

-00:24:47 like to think about Spacey, it is a relatively lightweight because a lot of it is implemented
+00:24:47 like to think about Spacy, it is a relatively lightweight because a lot of it is implemented

00:24:51 in Cython. Pipeline for NLP projects. And again, like the main thing that people like to use it

@@ -650,7 +650,7 @@

00:25:23 I would argue, interesting hobby projects as well that are just more for fun, I guess.

-00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacey's been around so much, some of those plugins are a
+00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacy's been around so much, some of those plugins are a

00:25:33 bit dated now. Like you can definitely imagine a project that got started five years ago.

@@ -660,15 +660,15 @@

00:25:45 simple example here, just to give people a sense of, you know, maybe some, what does it look like

-00:25:50 to write code with Spacey? I mean, got to be a little careful talking code on audio formats,
+00:25:50 to write code with Spacy? I mean, got to be a little careful talking code on audio formats,

00:25:54 but what's the program? We can do it. I think we can manage. I mean, the first thing you typically

00:25:58 do is you just call import spacy and that's pretty straightforward, but then you got to load

00:26:04 a model and there's kind of two ways of doing it. Like one thing you could do is you could say

-00:26:09 Spacey dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
+00:26:09 Spacy dot blank, and then you give it a name of a language. So you can have a blank Dutch model,

00:26:13 or you can have a blank English model. And that's the model that will only carry the tokenizer and

@@ -684,7 +684,7 @@

00:26:41 But that's going to do all the heavy lifting. And then you get an object that can take text

-00:26:45 and then turn that into a structured document. That's the entry point into Spacey.
+00:26:45 and then turn that into a structured document. That's the entry point into Spacy.
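The entry point described here can be sketched in a few lines. The blank pipeline is runnable anywhere spaCy is installed; the loaded pipeline (`en_core_web_sm` is one of spaCy's standard English models) needs a separate download first, so it is shown as a comment:

```python
import spacy

# Option 1: a blank pipeline -- just the tokenizer and language rules.
nlp = spacy.blank("en")

# Option 2 (requires `python -m spacy download en_core_web_sm` beforehand):
# nlp = spacy.load("en_core_web_sm")  # adds tagger, parser, NER, ...

# The nlp object turns raw text into a structured Doc.
doc = nlp("Talk Python is a podcast about Python.")
print([token.text for token in doc])
```

Either way, the resulting `doc` is the structured document the transcript mentions: a sequence of tokens carrying whatever annotations the loaded components produced.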

00:26:50 I see. So what you might do with a web scraping with beautiful soup or something,

@@ -804,7 +804,7 @@

00:31:18 transcripts, and then lets you search them and do other things like that. And as part of that,

-00:31:22 I used spacey or was that weird? You spacey because building a little lightweight custom
+00:31:22 I used spacy or was that weird? You spacy because building a little lightweight custom

00:31:30 search engine, I said, all right, well, if somebody searches for a plural thing or the

@@ -914,23 +914,23 @@

00:35:46 that I do think is probably the most useful. You can just go that extra step further than just

-00:35:51 basic string matching and spacey out of the box has a lot of sensible defaults that you don't
+00:35:51 basic string matching and spacy out of the box has a lot of sensible defaults that you don't

00:35:56 have to think about. And there's for sure also like pretty good models on hugging face that you

00:36:00 can go ahead and download for free. But typically those models are like kind of like one trick

00:36:04 ponies. That's not always the case, but they are usually trained for like one task in mind.

-00:36:09 And the cool feeling that spacey just gives you is that even though it might not be the best,
+00:36:09 And the cool feeling that spacy just gives you is that even though it might not be the best,

00:36:13 most performant model, it will be fast enough usually. And it will also just be in just enough

00:36:18 in general. Yeah. And it doesn't have the heavy, heavy weight overloading. It's definitely a

00:36:24 megabytes instead of gigabytes. If you, if you play your cards, right. Yes. So I see the word

-00:36:29 token in here on spacey and I know number of tokens in LLMs is like sort of how much memory or
+00:36:29 token in here on spacy and I know number of tokens in LLMs is like sort of how much memory or

00:36:37 context can they keep in mind? Are those the same things or they just happen to have the same word?
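The two senses of "token" raised here can be put side by side. spaCy's tokenizer produces linguistic, word-level tokens (note how it splits the contraction below), while LLM tokenizers emit statistical subword pieces, so the counts generally differ. A small sketch of the spaCy side:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Don't overthink tokenization.")

# spaCy gives linguistic tokens: the contraction splits on word boundaries.
print([t.text for t in doc])  # ['Do', "n't", 'overthink', 'tokenization', '.']

# An LLM's BPE tokenizer would instead emit subword ids, possibly splitting
# "tokenization" into pieces like "token"/"ization" -- same word, different
# notion of "token", so the two counts are not directly comparable.
```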

@@ -984,7 +984,7 @@

00:38:34 about is I want to go back to this, getting started with spacy and NLP course that you created

-00:38:39 and talk through one of the, the pri let's say the primary demo dataset technique that you talked
+00:38:39 and talk through one of the, the pre let's say the primary demo dataset technique that you talked

00:38:46 about in the course. And that would be to go and take nine years of transcripts for the podcast.

@@ -1012,7 +1012,7 @@

00:39:49 replacements that say that phrase always with that capitalization always leads to the correct

-00:39:55 version. And then a sink and a wait, oh no, it's a space sink where like you wash your hands.
+00:39:55 version. And then a sink and a wait, oh no, it's a spacy sink where like you wash your hands.

00:40:01 You're like, no, no, no, no, no. So there's a whole bunch of that that gets blasted on top

@@ -1024,7 +1024,7 @@

00:40:21 scikit-learn, you know, the well-known machine learning package, it always gets translated into

-00:40:26 psychic learn. But that's an interesting aspect of like, you know, that the text that goes in is not
+00:40:26 scikit- learn. But that's an interesting aspect of like, you know, that the text that goes in is not

00:40:33 necessarily perfect, but I was impressed. It is actually pretty darn good. There are some weird

@@ -1042,13 +1042,13 @@

00:41:04 incredibly cool. Right. You have one function you call that will read nine years of text and return

-00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacey
+00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacy

00:41:13 is made, if you're from scikit-learn, this sounds a bit surprising because in scikit-learn land,

00:41:18 you are typically used to the fact that you do batching and stuff that's factorized and

-00:41:22 NumPy. And that's sort of the way you would do it. But spacey actually has a small preference
+00:41:22 NumPy. And that's sort of the way you would do it. But spacy actually has a small preference

00:41:25 to using generators. And the whole thinking is that in natural language problems, you are

@@ -1096,7 +1096,7 @@

00:43:14 So that's the first thing that I usually end up doing when I'm doing something with spacy,

-00:43:17 just get it into a generator. Spacey can batch the stuff for you, such as it's still nice and quick,
+00:43:17 just get it into a generator. Spacy can batch the stuff for you, such as it's still nice and quick,

00:43:22 and you can do things in parallel even, but you think in generators a bit more than you do in
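The generator-first style described here can be sketched as follows (the transcript lines are made up for illustration): `nlp.pipe` consumes any iterable lazily and batches under the hood, and its `n_process` parameter opts into parallelism.

```python
import spacy

nlp = spacy.blank("en")

def transcript_lines():
    # Stand-in for lazily reading years of transcripts, line by line.
    yield "We talked about NLP pipelines."
    yield "spaCy prefers generators over big arrays."

# nlp.pipe streams Docs back in batches; nothing is loaded all at once.
for doc in nlp.pipe(transcript_lines(), batch_size=64):  # n_process=2 to parallelize
    print(len(doc), "tokens:", doc.text)
```

This is the contrast with scikit-learn's batch-everything-into-NumPy habit: the data stays a stream from disk all the way through the pipeline.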

@@ -1112,11 +1112,11 @@

00:43:53 It is definitely different. When you're a data scientist, you're usually used to,

-00:43:57 "Oh, it's a Pana's data frame. Everything's a Pana's data frame. I wake up and I brush my teeth
+00:43:57 "Oh, it's a Panda's data frame. Everything's a Panda's data frame. I wake up and I brush my teeth

-00:44:01 with a Pana's data frame." But in spacey land, that's the first thing you do notice. It's not
+00:44:01 with a Panda's data frame." But in spacy land, that's the first thing you do notice. It's not

-00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacey,
+00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacy,

00:44:11 there's a little library called srsly (pronounced "seriously"), that's for serialization. And one of the things that it

@@ -1146,7 +1146,7 @@

00:45:18 And it turned out that you can actually catch a whole bunch of these Python projects by just

-00:45:22 taking the spacey product model, like the standard NER model, I think in the medium pipeline. And
+00:45:22 taking the spacy product model, like the standard NER model, I think in the medium pipeline. And

00:45:27 you would just tell it like, "Hey, find me all the products." And of course it's not a perfect
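The entity workflow described here relies on a downloaded statistical pipeline (e.g. `en_core_web_md`), which can't be assumed in a quick sketch, so the example below swaps in spaCy's rule-based EntityRuler on a blank model to exercise the same `doc.ents` API; the PRODUCT patterns are invented for illustration.

```python
import spacy

# With a downloaded pipeline you would use its statistical NER directly:
#   nlp = spacy.load("en_core_web_md")
# Here a rule-based stand-in on a blank model shows the same ents API.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "spaCy"},
    {"label": "PRODUCT", "pattern": "scikit-learn"},
])

doc = nlp("We compared spaCy with scikit-learn on a small dataset.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

The statistical model would generalize to product names never seen in a pattern list, which is exactly why the transcript's "find me all the products" trick works without enumerating them.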

@@ -1188,7 +1188,7 @@

00:46:54 I talk about how to do a, how to structure an NLP project. But at the end, I also talk about

-00:46:58 these large language models and things you can do with that. And I use OpenAI. That's the thing I
+00:46:58 these large language models and things you can do with that. And I use Open AI. That's the thing I

00:47:03 use. But there's also this new tool called GLiNER. You can find it on the Hugging Face. It's

@@ -1330,7 +1330,7 @@

00:52:20 but I can definitely imagine if you were really interested in doing something with like Python

-00:52:24 tools, I would probably start with the Python bites one looking, thinking out loud. Maybe.
+00:52:24 tools, I would probably start with the Python bytes one looking, thinking out loud. Maybe.

00:52:29 Yeah, that's a good idea. It's a good idea. The first step is that this is like publicly

@@ -1464,7 +1464,7 @@

00:57:30 that to what I get out of an LLM. When those two models disagree, something interesting is usually

-00:57:35 happening because the LLM model is pretty good and the spacey model is pretty good. But when they
+00:57:35 happening because the LLM model is pretty good and the spacy model is pretty good. But when they

00:57:39 disagree, then I'm probably dealing with either a model that can be improved or data point that's

@@ -1540,7 +1540,7 @@

01:00:30 Anyway, let's go ahead and wrap this thing up. Like people are interested in NLP,

-01:00:34 spacey, maybe beyond like what in that space and what else do you want to leave people with?
+01:00:34 spacey, maybe beyond like what in that spacy and what else do you want to leave people with?

01:00:39 - I guess the main thing is just approach everything with curiosity. And if you're

@@ -1607,4 +1607,3 @@
01:03:14 at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening.

01:03:20 I really appreciate it. Now get out there and write some Python code.