Commit ce3cdaf

Merge pull request #161 from manikanta-hitunik-com/patch-497175
Update 477-spacy-nlp.txt
2 parents d04b15b + 5ca20dc commit ce3cdaf

File tree

1 file changed: +48 −49 lines


transcripts/477-spacy-nlp.txt

Lines changed: 48 additions & 49 deletions
@@ -96,7 +96,7 @@
 00:03:55 I did operations research, which is this sort of applied subfield of math.
-00:03:59 That's very much a optimization problem, kind of Solvee kind of thing. So,
+00:03:59 That's very much a optimization problem, kind of Solved kind of thing. So,
 00:04:04 traveling salesman problem, that kind of stuff.

@@ -128,7 +128,7 @@
 00:04:57 SymPy is cool. Google has OR tools, which is also a pretty easy starting point. And there's
-00:05:01 also another package called CVXpy, which is all about convex optimization problems. And that's
+00:05:01 also another package called CVXPY, which is all about convex optimization problems. And that's
 00:05:07 very scikit-learn friendly as well, by the way, if you're into that. If you're an operations
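For readers who want a concrete feel for the CVXPY package mentioned in this hunk, a minimal sketch of a convex problem (the data is invented for illustration): a non-negative least-squares fit.

    import cvxpy as cp
    import numpy as np

    # Invented data: find non-negative weights w so that X @ w approximates y.
    X = np.random.rand(20, 3)
    y = np.random.rand(20)

    w = cp.Variable(3)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(X @ w - y)), [w >= 0])
    problem.solve()
    print(w.value)  # the optimal non-negative weights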

@@ -144,15 +144,15 @@
 00:05:35 we talked about CalmCode, which is a cool project that you've got going on. We'll talk about just a
-00:05:40 moment through the Python Bytes stuff. And then through ExplosionAI and Spacey and all that,
+00:05:40 moment through the Python Bytes stuff. And then through Explosion AI and Spacy and all that,
 00:05:47 we actually teamed up to do a course that you wrote called Getting Started with NLP and Spacey,
-00:05:52 which is over at TalkByThon, which is awesome. A lot of projects you got going on. Some of the
+00:05:52 which is over at Talk Python, which is awesome. A lot of projects you got going on. Some of the
 00:05:56 ideas that we're going to talk about here, and we'll dive into them as we get into the topics,
-00:06:01 come from your course on TalkByThon. I'll put the link in the show notes. People will definitely
+00:06:01 come from your course on Talk Python. I'll put the link in the show notes. People will definitely
 00:06:04 want to check that out. But yeah, tell us a little bit more about the stuff you got going on. Like

@@ -242,7 +242,7 @@
 00:09:15 YouTube channel. So definitely feel free to have a look at that. Yeah, I'll link that in the show
-00:09:19 notes. Okay. As you said, that was in the CommCode YouTube account. The CommCode is more courses than
+00:09:19 notes. Okay. As you said, that was in the CalmCode YouTube account. The CalmCode is more courses than
 00:09:28 it is keyboards, right? Yes, definitely. So it kind of started as a COVID project. I kind of

@@ -256,11 +256,11 @@
 00:09:53 position where, hey, let's just celebrate this project. So there's a collaborator helping me
-00:09:56 out now. We are also writing a book that's on behalf of the CommCode brand. Like if you click,
+00:09:56 out now. We are also writing a book that's on behalf of the CalmCode brand. Like if you click,
 00:10:01 people can't see, I suppose, but... It's linked right on the homepage though. Yeah.
-00:10:05 Yeah. So when you click it, like commcode.io/book, the book is titled Data Science Fiction.
+00:10:05 Yeah. So when you click it, like calmcode.io/book, the book is titled Data Science Fiction.
 00:10:10 The whole point of the book is just, these are anecdotes that people have told me while

@@ -278,7 +278,7 @@
 00:10:38 it. I do have fun writing it is what I will say, but that's also like courses and stuff like this.
-00:10:43 That's what I'm trying to do with the CommCode project. Just have something that's very fun to
+00:10:43 That's what I'm trying to do with the CalmCode project. Just have something that's very fun to
 00:10:47 maintain, but also something that people can actually have a good look at.

@@ -366,41 +366,41 @@
 00:14:08 Wow.
-00:14:09 So Psyched Lego, which is a somewhat popular project that I maintain, there's another
+00:14:09 So Scikit Lego, which is a somewhat popular project that I maintain, there's another
 00:14:12 collaborator on that now, Francesco. Basically everyone who has made a serious contribution is
 00:14:17 also just invited to add a line to the poem. So it's just little things like that. That's what
-00:14:23 today I learned. It's very easy to sort of share. Psyched Lego, by the way, I'm going to brag about
+00:14:23 today I learned. It's very easy to sort of share. Scikit Lego, by the way, I'm going to brag about
 00:14:28 that. It got a million downloads, got a million downloads now. So that happened two weeks ago.
 00:14:32 So super proud of that.
-00:14:34 What is Psyched Lego?
+00:14:34 What is Scikit Lego?
-00:14:35 Psyched Learn has all sorts of components and you've got regression models, classification
+00:14:35 Scikit Learn has all sorts of components and you've got regression models, classification
 00:14:40 models, pre-processing utilities, and you name it. And I, at some point, just noticed that there's a
 00:14:44 couple of these Lego bricks that I really like to use and I didn't feel like rewriting them for
-00:14:48 every single client I had. Psyched Lego just started out as a place for me and another maintainer
+00:14:48 every single client I had. Scikit-Lego just started out as a place for me and another maintainer
 00:14:54 just put stuff that we like to use. We didn't take the project that serious until other people did.
-00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Psyched Learn,
+00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Scikit Learn,
 00:15:07 because it's such a mature project. There's a couple of these experimental things that can't
-00:15:11 really go into Psyched Learn, but if people can convince us that it's a fun thing to maintain,
+00:15:11 really go into Scikit Learn, but if people can convince us that it's a fun thing to maintain,
 00:15:15 we will gladly put it in here. That's kind of the goal of the library.
-00:15:18 Awesome. So kind of thinking of the building blocks of Psyched Learn as Lego blocks.
+00:15:18 Awesome. So kind of thinking of the building blocks of Scikit Learn as Lego blocks.
-00:15:24 Psyched Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
+00:15:24 Scikit Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
 00:15:27 library contributes a couple of more experimental ones. It's such a place right now that they can't

@@ -520,11 +520,11 @@
 00:20:10 June for the audio channel. So it depends how you consumed it. So two to three months ago. Anyway,
-00:20:15 we talked more about LLMs, not so much Spacey, even though she's behind it. So give people a
+00:20:15 we talked more about LLMs, not so much Spacy, even though she's behind it. So give people a
-00:20:21 sense of what is Spacey. We just talked about Scikit-Learn and the types of problems it solves.
+00:20:21 sense of what is Spacy. We just talked about Scikit-Learn and the types of problems it solves.
-00:20:26 What about Spacey? - There's a couple of stories that could be told about it, but
+00:20:26 What about Spacy? - There's a couple of stories that could be told about it, but
 00:20:29 one way to maybe think about it is that in Python, we've always had tools that could do NLP. We also

@@ -538,11 +538,11 @@
 00:20:55 Lego bricks and it was definitely kind of useful, but it wasn't necessarily a coherent pipeline.
-00:20:59 And one way to, I think, historically describe Spacey, it was like a very honest, good attempt to
+00:20:59 And one way to, I think, historically describe Spacy, it was like a very honest, good attempt to
 00:21:06 make a pipeline for all these different NLP components that kind of clicked together.
-00:21:10 And the first component inside of Spacey that made it popular was basically a tokenizer,
+00:21:10 And the first component inside of Spacy that made it popular was basically a tokenizer,
 00:21:15 something that can take text and split it up into separate words. And basically that's a
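As a rough sketch of that tokenizer step (a blank pipeline carries only the tokenizer; the sample sentence is invented):

    import spacy

    nlp = spacy.blank("en")  # tokenizer only, no trained components
    doc = nlp("Let's split this sentence up, shall we?")
    print([token.text for token in doc])
    # roughly: ['Let', "'s", 'split', 'this', 'sentence', 'up', ',', 'shall', 'we', '?']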

@@ -586,7 +586,7 @@
 00:22:56 to do. - Oh, waving hand emoji, step one. - Yeah, exactly. You're giving me ideas. This is gonna
-00:23:02 happen. - But anyway, but back to SpaceGuy, I suppose. This is sort of the origin story. The
+00:23:02 happen. - But anyway, but back to Spacy guy, I suppose. This is sort of the origin story. The
 00:23:07 tokenization was the first sort of problem that they tackled. And then very quickly, they also

@@ -604,7 +604,7 @@
 00:23:41 also just happens to be the most popular verb in the English language. So if you're just going to
-00:23:45 match the string Go, you're simply not going to get there. Spacey was also one of the, I would
+00:23:45 match the string Go, you're simply not going to get there. Spacy was also one of the, I would
 00:23:50 say first projects that offered pretty good pre-trained free models that people could just

@@ -632,7 +632,7 @@
 00:24:43 where you can see all sorts of plugins that people made. But the core and like the main way I still
-00:24:47 like to think about Spacey, it is a relatively lightweight because a lot of it is implemented
+00:24:47 like to think about Spacy, it is a relatively lightweight because a lot of it is implemented
 00:24:51 in Cython. Pipeline for NLP projects. And again, like the main thing that people like to use it

@@ -650,7 +650,7 @@
 00:25:23 I would argue, interesting hobby projects as well that are just more for fun, I guess.
-00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacey's been around so much, some of those plugins are a
+00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacy's been around so much, some of those plugins are a
 00:25:33 bit dated now. Like you can definitely imagine a project that got started five years ago.

@@ -660,15 +660,15 @@
 00:25:45 simple example here, just to give people a sense of, you know, maybe some, what does it look like
-00:25:50 to write code with Spacey? I mean, got to be a little careful talking code on audio formats,
+00:25:50 to write code with Spacy? I mean, got to be a little careful talking code on audio formats,
 00:25:54 but what's the program? We can do it. I think we can manage. I mean, the first thing you typically
 00:25:58 do is you just call import Spacey and that's pretty straightforward, but then you got to load
 00:26:04 a model and there's kind of two ways of doing it. Like one thing you could do is you could say
-00:26:09 Spacey dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
+00:26:09 Spacy dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
 00:26:13 or you can have a blank English model. And that's the model that will only carry the tokenizer and
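The two loading styles described here look roughly like this (assuming spaCy 3.x, and that a pre-trained pipeline such as en_core_web_sm has been downloaded):

    import spacy

    nlp_blank = spacy.blank("nl")       # blank Dutch pipeline: tokenizer only
    nlp = spacy.load("en_core_web_sm")  # pre-trained pipeline: tagger, parser, NER, ...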

@@ -684,7 +684,7 @@
 00:26:41 But that's going to do all the heavy lifting. And then you get an object that can take text
-00:26:45 and then turn that into a structured document. That's the entry point into Spacey.
+00:26:45 and then turn that into a structured document. That's the entry point into Spacy.
 00:26:50 I see. So what you might do with a web scraping with beautiful soup or something,
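And the entry point itself, as a small sketch (the sentence is invented; en_core_web_sm assumed as above):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Talk Python is a podcast about the Python ecosystem.")

    # The Doc is the structured view of the raw text, a bit like what a
    # parsed soup object is to raw HTML.
    for token in doc:
        print(token.text, token.pos_, token.lemma_)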

@@ -804,7 +804,7 @@
 00:31:18 transcripts, and then lets you search them and do other things like that. And as part of that,
-00:31:22 I used spacey or was that weird? You spacey because building a little lightweight custom
+00:31:22 I used spacy or was that weird? You spacy because building a little lightweight custom
 00:31:30 search engine, I said, all right, well, if somebody searches for a plural thing or the
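A hedged sketch of that lemma trick for lightweight search, so plurals and conjugations still match (pipeline name assumed as before):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def lemmas(text: str) -> set[str]:
        # Normalize words to their dictionary form, e.g. "transcripts"
        # -> "transcript", so a plural query still hits the singular.
        return {token.lemma_.lower() for token in nlp(text)}

    # Overlapping lemmas suggest a match between query and document.
    print(lemmas("searching old transcripts") & lemmas("We search each transcript."))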

@@ -914,23 +914,23 @@
 00:35:46 that I do think is probably the most useful. You can just go that extra step further than just
-00:35:51 basic string matching and spacey out of the box has a lot of sensible defaults that you don't
+00:35:51 basic string matching and spacy out of the box has a lot of sensible defaults that you don't
 00:35:56 have to think about. And there's for sure also like pretty good models on hugging face that you
 00:36:00 can go ahead and download for free. But typically those models are like kind of like one trick
 00:36:04 ponies. That's not always the case, but they are usually trained for like one task in mind.
-00:36:09 And the cool feeling that spacey just gives you is that even though it might not be the best,
+00:36:09 And the cool feeling that spacy just gives you is that even though it might not be the best,
 00:36:13 most performant model, it will be fast enough usually. And it will also just be in just enough
 00:36:18 in general. Yeah. And it doesn't have the heavy, heavy weight overloading. It's definitely a
 00:36:24 megabytes instead of gigabytes. If you, if you play your cards, right. Yes. So I see the word
-00:36:29 token in here on spacey and I know number of tokens in LLMs is like sort of how much memory or
+00:36:29 token in here on spacy and I know number of tokens in LLMs is like sort of how much memory or
 00:36:37 context can they keep in mind? Are those the same things or they just happen to have the same word?
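One way to see the difference behind that question: spaCy's tokens are word-level, while LLM tokenizers usually work on subword pieces. A small comparison sketch, assuming the tiktoken package for the LLM side:

    import spacy
    import tiktoken

    text = "Tokenization is surprisingly subtle."

    nlp = spacy.blank("en")
    print([t.text for t in nlp(text)])                  # word-level tokens

    enc = tiktoken.get_encoding("cl100k_base")
    print([enc.decode([i]) for i in enc.encode(text)])  # subword pieces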

@@ -984,7 +984,7 @@
 00:38:34 about is I want to go back to this, getting started with spacey and NLP course that you created
-00:38:39 and talk through one of the, the pri let's say the primary demo dataset technique that you talked
+00:38:39 and talk through one of the, the pre let's say the primary demo dataset technique that you talked
 00:38:46 about in the course. And that would be to go and take nine years of transcripts for the podcast.

@@ -1012,7 +1012,7 @@
 00:39:49 replacements that say that phrase always with that capitalization always leads to the correct
-00:39:55 version. And then a sink and a wait, oh no, it's a space sink where like you wash your hands.
+00:39:55 version. And then a sink and a wait, oh no, it's a spacy sink where like you wash your hands.
 00:40:01 You're like, no, no, no, no, no. So there's a whole bunch of that that gets blasted on top

@@ -1024,7 +1024,7 @@
 00:40:21 scikit-learn, you know, the well-known machine learning package, it always gets translated into
-00:40:26 psychic learn. But that's an interesting aspect of like, you know, that the text that goes in is not
+00:40:26 scikit- learn. But that's an interesting aspect of like, you know, that the text that goes in is not
 00:40:33 necessarily perfect, but I was impressed. It is actually pretty darn good. There are some weird

@@ -1042,13 +1042,13 @@
 00:41:04 incredibly cool. Right. You have one function you call that will read nine years of text and return
-00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacey
+00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacy
 00:41:13 is made, if you're from scikit-learn, this sounds a bit surprising because in scikit-learn land,
 00:41:18 you are typically used to the fact that you do batching and stuff that's factorized and
-00:41:22 NumPy. And that's sort of the way you would do it. But spacey actually has a small preference
+00:41:22 NumPy. And that's sort of the way you would do it. But spacy actually has a small preference
 00:41:25 to using generators. And the whole thinking is that in natural language problems, you are

@@ -1096,7 +1096,7 @@
 00:43:14 So that's the first thing that I usually end up doing when I'm doing something with spacey,
-00:43:17 just get it into a generator. Spacey can batch the stuff for you, such as it's still nice and quick,
+00:43:17 just get it into a generator. Spacy can batch the stuff for you, such as it's still nice and quick,
 00:43:22 and you can do things in parallel even, but you think in generators a bit more than you do in
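That generator-plus-batching pattern, as a minimal sketch (the file path just reuses this transcript for illustration; the pipeline name is assumed as before):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def lines(path):
        # Lazily yield one line at a time; nine years of transcripts
        # never have to sit in memory all at once.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.strip()

    # nlp.pipe consumes the generator in batches (and can parallelize
    # with n_process) while still yielding one Doc at a time.
    for doc in nlp.pipe(lines("transcripts/477-spacy-nlp.txt"), batch_size=64):
        pass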

@@ -1112,11 +1112,11 @@
 00:43:53 It is definitely different. When you're a data scientist, you're usually used to,
-00:43:57 "Oh, it's a Pana's data frame. Everything's a Pana's data frame. I wake up and I brush my teeth
+00:43:57 "Oh, it's a Panda's data frame. Everything's a Panda's data frame. I wake up and I brush my teeth
-00:44:01 with a Pana's data frame." But in spacey land, that's the first thing you do notice. It's not
+00:44:01 with a Panda's data frame." But in spacy land, that's the first thing you do notice. It's not
-00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacey,
+00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacy,
 00:44:11 there's a little library called Seriously, that's for serialization. And one of the things that it
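The library transcribed as "Seriously" here is Explosion's srsly package; a small sketch of its generator-friendly JSONL reading (the file name is invented):

    import srsly

    # read_jsonl yields one dict per line instead of loading the whole
    # file, which fits spaCy's generator-first style.
    for record in srsly.read_jsonl("transcripts.jsonl"):
        print(record["text"])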

@@ -1146,7 +1146,7 @@
 00:45:18 And it turned out that you can actually catch a whole bunch of these Python projects by just
-00:45:22 taking the spacey product model, like the standard NER model, I think in the medium pipeline. And
+00:45:22 taking the spacy product model, like the standard NER model, I think in the medium pipeline. And
 00:45:27 you would just tell it like, "Hey, find me all the products." And of course it's not a perfect
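A hedged sketch of that trick with the medium pipeline's pre-trained NER (en_core_web_md must be downloaded first; the sentence is invented, and the matches are imperfect by design):

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("We built the dashboard with Flask and deployed it on Heroku.")

    # Keep only spans the pre-trained NER labeled as products.
    print([ent.text for ent in doc.ents if ent.label_ == "PRODUCT"])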

@@ -1188,7 +1188,7 @@
 00:46:54 I talk about how to do a, how to structure an NLP project. But at the end, I also talk about
-00:46:58 these large language models and things you can do with that. And I use OpenAI. That's the thing I
+00:46:58 these large language models and things you can do with that. And I use Open AI. That's the thing I
 00:47:03 use. But there's also this new tool called Glee NER. You can find it on the Hugging Face. It's

@@ -1330,7 +1330,7 @@
 00:52:20 but I can definitely imagine if you were really interested in doing something with like Python
-00:52:24 tools, I would probably start with the Python bites one looking, thinking out loud. Maybe.
+00:52:24 tools, I would probably start with the Python bytes one looking, thinking out loud. Maybe.
 00:52:29 Yeah, that's a good idea. It's a good idea. The first step is that this is like publicly

@@ -1464,7 +1464,7 @@
 00:57:30 that to what I get out of an LLM. When those two models disagree, something interesting is usually
-00:57:35 happening because the LLM model is pretty good and the spacey model is pretty good. But when they
+00:57:35 happening because the LLM model is pretty good and the spacy model is pretty good. But when they
 00:57:39 disagree, then I'm probably dealing with either a model that can be improved or data point that's
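The disagreement check itself can be as simple as a set difference; a minimal sketch with placeholder entity lists (in practice one set would come from the spaCy pipeline and one from an LLM prompt):

    def disagreements(spacy_ents: set[str], llm_ents: set[str]) -> set[str]:
        # Entities only one model found are the interesting cases: either
        # a model that can be improved or an odd data point worth a look.
        return spacy_ents ^ llm_ents

    print(disagreements({"spaCy", "Flask"}, {"spaCy", "Heroku"}))
    # {'Flask', 'Heroku'}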

@@ -1540,7 +1540,7 @@
 01:00:30 Anyway, let's go ahead and wrap this thing up. Like people are interested in NLP,
-01:00:34 spacey, maybe beyond like what in that space and what else do you want to leave people with?
+01:00:34 spacey, maybe beyond like what in that spacy and what else do you want to leave people with?
 01:00:39 - I guess the main thing is just approach everything with curiosity. And if you're

@@ -1607,4 +1607,3 @@
 01:03:14 at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening.
 01:03:20 I really appreciate it. Now get out there and write some Python code.
-
