|
96 | 96 |
|
97 | 97 | 00:03:55 I did operations research, which is this sort of applied subfield of math.
|
98 | 98 |
|
99 |
| -00:03:59 That's very much a optimization problem, kind of Solvee kind of thing. So, |
| 99 | +00:03:59 That's very much an optimization problem, kind of a solver kind of thing. So,
100 | 100 |
|
101 | 101 | 00:04:04 traveling salesman problem, that kind of stuff.
|
102 | 102 |
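A toy brute-force sketch of the traveling salesman idea in Python — the 4-city distance matrix is invented for illustration, and real OR work reaches for a proper solver instead:

```python
from itertools import permutations

# Toy symmetric distance matrix between 4 cities (illustrative numbers).
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

def tour_length(tour):
    # Sum each leg, wrapping from the last city back to the first.
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

# Exhaustive search only works at toy scale; that's why solvers exist.
best = min(permutations(range(len(dist))), key=tour_length)
print(best, tour_length(best))
```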
|
|
128 | 128 |
|
129 | 129 | 00:04:57 SymPy is cool. Google has OR tools, which is also a pretty easy starting point. And there's
|
130 | 130 |
|
131 |
| -00:05:01 also another package called CVXpy, which is all about convex optimization problems. And that's |
| 131 | +00:05:01 also another package called CVXPY, which is all about convex optimization problems. And that's |
132 | 132 |
|
133 | 133 | 00:05:07 very scikit-learn friendly as well, by the way, if you're into that. If you're an operations
|
134 | 134 |
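A minimal CVXPY sketch of the kind of convex problem just mentioned — the variables, target, and budget below are invented for illustration:

```python
import cvxpy as cp
import numpy as np

# Two non-negative decision variables with a simple budget constraint.
x = cp.Variable(2, nonneg=True)
objective = cp.Minimize(cp.sum_squares(x - np.array([3.0, 4.0])))
constraints = [cp.sum(x) <= 5]

prob = cp.Problem(objective, constraints)
prob.solve()  # picks an installed convex solver
print(prob.value, x.value)
```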
|
|
144 | 144 |
|
145 | 145 | 00:05:35 we talked about CalmCode, which is a cool project that you've got going on. We'll talk about just a
|
146 | 146 |
|
147 |
| -00:05:40 moment through the Python Bytes stuff. And then through ExplosionAI and Spacey and all that, |
| 147 | +00:05:40 moment through the Python Bytes stuff. And then through Explosion AI and spaCy and all that,
148 | 148 |
|
149 | 149 | 00:05:47 we actually teamed up to do a course that you wrote called Getting Started with NLP and spaCy,
|
150 | 150 |
|
151 |
| -00:05:52 which is over at TalkByThon, which is awesome. A lot of projects you got going on. Some of the |
| 151 | +00:05:52 which is over at Talk Python, which is awesome. A lot of projects you got going on. Some of the |
152 | 152 |
|
153 | 153 | 00:05:56 ideas that we're going to talk about here, and we'll dive into them as we get into the topics,
|
154 | 154 |
|
155 |
| -00:06:01 come from your course on TalkByThon. I'll put the link in the show notes. People will definitely |
| 155 | +00:06:01 come from your course on Talk Python. I'll put the link in the show notes. People will definitely |
156 | 156 |
|
157 | 157 | 00:06:04 want to check that out. But yeah, tell us a little bit more about the stuff you got going on. Like
|
158 | 158 |
|
|
242 | 242 |
|
243 | 243 | 00:09:15 YouTube channel. So definitely feel free to have a look at that. Yeah, I'll link that in the show
|
244 | 244 |
|
245 |
| -00:09:19 notes. Okay. As you said, that was in the CommCode YouTube account. The CommCode is more courses than |
| 245 | +00:09:19 notes. Okay. As you said, that was in the CalmCode YouTube account. The CalmCode is more courses than |
246 | 246 |
|
247 | 247 | 00:09:28 it is keyboards, right? Yes, definitely. So it kind of started as a COVID project. I kind of
|
248 | 248 |
|
|
256 | 256 |
|
257 | 257 | 00:09:53 position where, hey, let's just celebrate this project. So there's a collaborator helping me
|
258 | 258 |
|
259 |
| -00:09:56 out now. We are also writing a book that's on behalf of the CommCode brand. Like if you click, |
| 259 | +00:09:56 out now. We are also writing a book that's on behalf of the CalmCode brand. Like if you click, |
260 | 260 |
|
261 | 261 | 00:10:01 people can't see, I suppose, but... It's linked right on the homepage though. Yeah.
|
262 | 262 |
|
263 |
| -00:10:05 Yeah. So when you click it, like commcode.io/book, the book is titled Data Science Fiction. |
| 263 | +00:10:05 Yeah. So when you click it, like calmcode.io/book, the book is titled Data Science Fiction. |
264 | 264 |
|
265 | 265 | 00:10:10 The whole point of the book is just, these are anecdotes that people have told me while
|
266 | 266 |
|
|
278 | 278 |
|
279 | 279 | 00:10:38 it. I do have fun writing it is what I will say, but that's also like courses and stuff like this.
|
280 | 280 |
|
281 |
| -00:10:43 That's what I'm trying to do with the CommCode project. Just have something that's very fun to |
| 281 | +00:10:43 That's what I'm trying to do with the CalmCode project. Just have something that's very fun to |
282 | 282 |
|
283 | 283 | 00:10:47 maintain, but also something that people can actually have a good look at.
|
284 | 284 |
|
|
366 | 366 |
|
367 | 367 | 00:14:08 Wow.
|
368 | 368 |
|
369 |
| -00:14:09 So Psyched Lego, which is a somewhat popular project that I maintain, there's another |
| 369 | +00:14:09 So scikit-lego, which is a somewhat popular project that I maintain, there's another
370 | 370 |
|
371 | 371 | 00:14:12 collaborator on that now, Francesco. Basically everyone who has made a serious contribution is
|
372 | 372 |
|
373 | 373 | 00:14:17 also just invited to add a line to the poem. So it's just little things like that. That's what
|
374 | 374 |
|
375 |
| -00:14:23 today I learned. It's very easy to sort of share. Psyched Lego, by the way, I'm going to brag about |
| 375 | +00:14:23 today I learned. It's very easy to sort of share. scikit-lego, by the way, I'm going to brag about
376 | 376 |
|
377 | 377 | 00:14:28 that. It got a million downloads, got a million downloads now. So that happened two weeks ago.
|
378 | 378 |
|
379 | 379 | 00:14:32 So super proud of that.
|
380 | 380 |
|
381 |
| -00:14:34 What is Psyched Lego? |
| 381 | +00:14:34 What is scikit-lego?
382 | 382 |
|
383 |
| -00:14:35 Psyched Learn has all sorts of components and you've got regression models, classification |
| 383 | +00:14:35 scikit-learn has all sorts of components and you've got regression models, classification
384 | 384 |
|
385 | 385 | 00:14:40 models, pre-processing utilities, and you name it. And I, at some point, just noticed that there's a
|
386 | 386 |
|
387 | 387 | 00:14:44 couple of these Lego bricks that I really like to use and I didn't feel like rewriting them for
|
388 | 388 |
|
389 |
| -00:14:48 every single client I had. Psyched Lego just started out as a place for me and another maintainer |
| 389 | +00:14:48 every single client I had. scikit-lego just started out as a place for me and another maintainer
390 | 390 |
|
391 | 391 | 00:14:54 just put stuff that we like to use. We didn't take the project that serious until other people did.
|
392 | 392 |
|
393 |
| -00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give a example. But it's really just, there's a bunch of stuff that Psyched Learn, |
| 393 | +00:14:59 Like I actually got an email from a data engineer that works at Lego, just to give an example. But it's really just, there's a bunch of stuff that scikit-learn,
394 | 394 |
|
395 | 395 | 00:15:07 because it's such a mature project. There's a couple of these experimental things that can't
|
396 | 396 |
|
397 |
| -00:15:11 really go into Psyched Learn, but if people can convince us that it's a fun thing to maintain, |
| 397 | +00:15:11 really go into scikit-learn, but if people can convince us that it's a fun thing to maintain,
398 | 398 |
|
399 | 399 | 00:15:15 we will gladly put it in here. That's kind of the goal of the library.
|
400 | 400 |
|
401 |
| -00:15:18 Awesome. So kind of thinking of the building blocks of Psyched Learn as Lego blocks. |
| 401 | +00:15:18 Awesome. So kind of thinking of the building blocks of scikit-learn as Lego blocks.
402 | 402 |
|
403 |
| -00:15:24 Psyched Learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this |
| 403 | +00:15:24 scikit-learn, you could look at it already, has a whole bunch of Lego bricks. It's just that this
404 | 404 |
|
405 | 405 | 00:15:27 library contributes a couple of more experimental ones. It's such a place right now that they can't
|
406 | 406 |
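To make the Lego-brick idea concrete, here is a sketch of the kind of scikit-learn-compatible component such a library collects: an illustrative custom transformer following the fit/transform contract, not code taken from scikit-lego itself.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip every feature to percentile bounds learned during fit."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Learned attributes get a trailing underscore, per sklearn convention.
        self.lo_ = np.percentile(X, self.lower, axis=0)
        self.hi_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X), self.lo_, self.hi_)
```

Because it honors the contract, a brick like this drops straight into a scikit-learn Pipeline next to any other component.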
|
|
520 | 520 |
|
521 | 521 | 00:20:10 June for the audio channel. So it depends how you consumed it. So two to three months ago. Anyway,
|
522 | 522 |
|
523 |
| -00:20:15 we talked more about LLMs, not so much Spacey, even though she's behind it. So give people a |
| 523 | +00:20:15 we talked more about LLMs, not so much spaCy, even though she's behind it. So give people a
524 | 524 |
|
525 |
| -00:20:21 sense of what is Spacey. We just talked about Scikit-Learn and the types of problems it solves. |
| 525 | +00:20:21 sense of what is spaCy. We just talked about scikit-learn and the types of problems it solves.
526 | 526 |
|
527 |
| -00:20:26 What about Spacey? - There's a couple of stories that could be told about it, but |
| 527 | +00:20:26 What about spaCy? - There's a couple of stories that could be told about it, but
528 | 528 |
|
529 | 529 | 00:20:29 one way to maybe think about it is that in Python, we've always had tools that could do NLP. We also
|
530 | 530 |
|
|
538 | 538 |
|
539 | 539 | 00:20:55 Lego bricks and it was definitely kind of useful, but it wasn't necessarily a coherent pipeline.
|
540 | 540 |
|
541 |
| -00:20:59 And one way to, I think, historically describe Spacey, it was like a very honest, good attempt to |
| 541 | +00:20:59 And one way to, I think, historically describe spaCy, it was like a very honest, good attempt to
542 | 542 |
|
543 | 543 | 00:21:06 make a pipeline for all these different NLP components that kind of clicked together.
|
544 | 544 |
|
545 |
| -00:21:10 And the first component inside of Spacey that made it popular was basically a tokenizer, |
| 545 | +00:21:10 And the first component inside of spaCy that made it popular was basically a tokenizer,
546 | 546 |
|
547 | 547 | 00:21:15 something that can take text and split it up into separate words. And basically that's a
|
548 | 548 |
|
|
586 | 586 |
|
587 | 587 | 00:22:56 to do. - Oh, waving hand emoji, step one. - Yeah, exactly. You're giving me ideas. This is gonna
|
588 | 588 |
|
589 |
| -00:23:02 happen. - But anyway, but back to SpaceGuy, I suppose. This is sort of the origin story. The |
| 589 | +00:23:02 happen. - But anyway, back to spaCy, I suppose. This is sort of the origin story. The
590 | 590 |
|
591 | 591 | 00:23:07 tokenization was the first sort of problem that they tackled. And then very quickly, they also
|
592 | 592 |
|
|
604 | 604 |
|
605 | 605 | 00:23:41 also just happens to be the most popular verb in the English language. So if you're just going to
|
606 | 606 |
|
607 |
| -00:23:45 match the string Go, you're simply not going to get there. Spacey was also one of the, I would |
| 607 | +00:23:45 match the string Go, you're simply not going to get there. spaCy was also one of the, I would
608 | 608 |
|
609 | 609 | 00:23:50 say first projects that offered pretty good pre-trained free models that people could just
|
610 | 610 |
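A sketch of the "Go" point: a pretrained pipeline tags parts of speech, which is what separates the language from the verb (assumes en_core_web_sm is installed; exact tags vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
doc = nlp("We rewrote the backend in Go, so now we go faster.")

for tok in doc:
    if tok.text.lower() == "go":
        # Expect something PROPN-like for the language, VERB for the verb.
        print(tok.text, tok.pos_)
```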
|
|
632 | 632 |
|
633 | 633 | 00:24:43 where you can see all sorts of plugins that people made. But the core and like the main way I still
|
634 | 634 |
|
635 |
| -00:24:47 like to think about Spacey, it is a relatively lightweight because a lot of it is implemented |
| 635 | +00:24:47 like to think about spaCy, it is a relatively lightweight, because a lot of it is implemented
636 | 636 |
|
637 | 637 | 00:24:51 in Cython, pipeline for NLP projects. And again, like the main thing that people like to use it
|
638 | 638 |
|
|
650 | 650 |
|
651 | 651 | 00:25:23 I would argue, interesting hobby projects as well that are just more for fun, I guess.
|
652 | 652 |
|
653 |
| -00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because Spacey's been around so much, some of those plugins are a |
| 653 | +00:25:28 Yeah. But there's a lot. I mean, one thing I will say, because spaCy's been around so long, some of those plugins are a
654 | 654 |
|
655 | 655 | 00:25:33 bit dated now. Like you can definitely imagine a project that got started five years ago.
|
656 | 656 |
|
|
660 | 660 |
|
661 | 661 | 00:25:45 simple example here, just to give people a sense of, you know, maybe some, what does it look like
|
662 | 662 |
|
663 |
| -00:25:50 to write code with Spacey? I mean, got to be a little careful talking code on audio formats, |
| 663 | +00:25:50 to write code with spaCy? I mean, got to be a little careful talking code on audio formats,
664 | 664 |
|
665 | 665 | 00:25:54 but what's the program? We can do it. I think we can manage. I mean, the first thing you typically
|
666 | 666 |
|
667 | 667 | 00:25:58 do is you just call import spacy and that's pretty straightforward, but then you got to load
|
668 | 668 |
|
669 | 669 | 00:26:04 a model and there's kind of two ways of doing it. Like one thing you could do is you could say
|
670 | 670 |
|
671 |
| -00:26:09 Spacey dot blank, and then you give it a name of a language. So you can have a blank Dutch model, |
| 671 | +00:26:09 spaCy dot blank, and then you give it a name of a language. So you can have a blank Dutch model,
672 | 672 |
|
673 | 673 | 00:26:13 or you can have a blank English model. And that's the model that will only carry the tokenizer and
|
674 | 674 |
|
|
684 | 684 |
|
685 | 685 | 00:26:41 But that's going to do all the heavy lifting. And then you get an object that can take text
|
686 | 686 |
|
687 |
| -00:26:45 and then turn that into a structured document. That's the entry point into Spacey. |
| 687 | +00:26:45 and then turn that into a structured document. That's the entry point into spaCy.
688 | 688 |
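The two loading styles just described, sketched out (assumes the small English model has been downloaded):

```python
import spacy

# Option 1: a blank pipeline that carries only the tokenizer.
nlp = spacy.blank("en")  # or spacy.blank("nl") for Dutch

# Option 2: a full pretrained pipeline.
nlp = spacy.load("en_core_web_sm")

# The nlp object is the entry point: text in, structured Doc out.
doc = nlp("Talk Python To Me is a podcast about Python.")
print([token.text for token in doc])
```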
|
689 | 689 | 00:26:50 I see. So what you might do with web scraping with Beautiful Soup or something,
|
690 | 690 |
|
|
804 | 804 |
|
805 | 805 | 00:31:18 transcripts, and then lets you search them and do other things like that. And as part of that,
|
806 | 806 |
|
807 |
| -00:31:22 I used spacey or was that weird? You spacey because building a little lightweight custom |
| 807 | +00:31:22 I used spaCy. Or was that weird? I used spaCy because, building a little lightweight custom
808 | 808 |
|
809 | 809 | 00:31:30 search engine, I said, all right, well, if somebody searches for a plural thing or the
|
810 | 810 |
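A sketch of that lemma trick for a lightweight search engine — the lemma attribute normalizes plurals and verb forms; the example strings are invented:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmas(text):
    # Lowercased lemmas make "episodes" and "episode" compare equal.
    return {tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha}

query = lemmas("python episodes")
title = lemmas("An episode about Python")
print(query & title)  # overlapping lemmas count as a match
```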
|
|
914 | 914 |
|
915 | 915 | 00:35:46 that I do think is probably the most useful. You can just go that extra step further than just
|
916 | 916 |
|
917 |
| -00:35:51 basic string matching and spacey out of the box has a lot of sensible defaults that you don't |
| 917 | +00:35:51 basic string matching and spaCy out of the box has a lot of sensible defaults that you don't
918 | 918 |
|
919 | 919 | 00:35:56 have to think about. And there's for sure also like pretty good models on Hugging Face that you
|
920 | 920 |
|
921 | 921 | 00:36:00 can go ahead and download for free. But typically those models are like kind of like one trick
|
922 | 922 |
|
923 | 923 | 00:36:04 ponies. That's not always the case, but they are usually trained for like one task in mind.
|
924 | 924 |
|
925 |
| -00:36:09 And the cool feeling that spacey just gives you is that even though it might not be the best, |
| 925 | +00:36:09 And the cool feeling that spaCy just gives you is that even though it might not be the best,
926 | 926 |
|
927 | 927 | 00:36:13 most performant model, it will be fast enough usually. And it will also just be good enough
|
928 | 928 |
|
929 | 929 | 00:36:18 in general. Yeah. And it doesn't have the heavy, heavyweight overloading. It's definitely
|
930 | 930 |
|
931 | 931 | 00:36:24 megabytes instead of gigabytes, if you play your cards right. Yes. So I see the word
|
932 | 932 |
|
933 |
| -00:36:29 token in here on spacey and I know number of tokens in LLMs is like sort of how much memory or |
| 933 | +00:36:29 token in here on spaCy and I know number of tokens in LLMs is like sort of how much memory or
934 | 934 |
|
935 | 935 | 00:36:37 context can they keep in mind? Are those the same things or they just happen to have the same word?
|
936 | 936 |
|
|
984 | 984 |
|
985 | 985 | 00:38:34 about is I want to go back to this, getting started with spaCy and NLP course that you created
|
986 | 986 |
|
987 |
| -00:38:39 and talk through one of the, the pri let's say the primary demo dataset technique that you talked |
| 987 | +00:38:39 and talk through one of the, the pri-, let's say the primary demo dataset technique that you talked
988 | 988 |
|
989 | 989 | 00:38:46 about in the course. And that would be to go and take nine years of transcripts for the podcast.
|
990 | 990 |
|
|
1012 | 1012 |
|
1013 | 1013 | 00:39:49 replacements that say that phrase always with that capitalization always leads to the correct
|
1014 | 1014 |
|
1015 |
| -00:39:55 version. And then a sink and a wait, oh no, it's a space sink where like you wash your hands. |
| 1015 | +00:39:55 version. And then "a sink" and "a wait," oh no, it's a sink where like you wash your hands.
1016 | 1016 |
|
1017 | 1017 | 00:40:01 You're like, no, no, no, no, no. So there's a whole bunch of that that gets blasted on top
|
1018 | 1018 |
|
|
1024 | 1024 |
|
1025 | 1025 | 00:40:21 scikit-learn, you know, the well-known machine learning package, it always gets translated into
|
1026 | 1026 |
|
1027 |
| -00:40:26 psychic learn. But that's an interesting aspect of like, you know, that the text that goes in is not |
| 1027 | +00:40:26 "psychic learn." But that's an interesting aspect of like, you know, that the text that goes in is not
1028 | 1028 |
|
1029 | 1029 | 00:40:33 necessarily perfect, but I was impressed. It is actually pretty darn good. There are some weird
|
1030 | 1030 |
|
|
1042 | 1042 |
|
1043 | 1043 | 00:41:04 incredibly cool. Right. You have one function you call that will read nine years of text and return
|
1044 | 1044 |
|
1045 |
| -00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spacey |
| 1045 | +00:41:10 it line by line. This is the thing that people don't always recognize, but the way that spaCy
1046 | 1046 |
|
1047 | 1047 | 00:41:13 is made, if you're from scikit-learn, this sounds a bit surprising because in scikit-learn land,
|
1048 | 1048 |
|
1049 | 1049 | 00:41:18 you are typically used to the fact that you do batching and stuff that's vectorized in
|
1050 | 1050 |
|
1051 |
| -00:41:22 NumPy. And that's sort of the way you would do it. But spacey actually has a small preference |
| 1051 | +00:41:22 NumPy. And that's sort of the way you would do it. But spaCy actually has a small preference
1052 | 1052 |
|
1053 | 1053 | 00:41:25 to using generators. And the whole thinking is that in natural language problems, you are
|
1054 | 1054 |
|
|
1096 | 1096 |
|
1097 | 1097 | 00:43:14 So that's the first thing that I usually end up doing when I'm doing something with spaCy,
|
1098 | 1098 |
|
1099 |
| -00:43:17 just get it into a generator. Spacey can batch the stuff for you, such as it's still nice and quick, |
| 1099 | +00:43:17 just get it into a generator. spaCy can batch the stuff for you, such that it's still nice and quick,
1100 | 1100 |
|
1101 | 1101 | 00:43:22 and you can do things in parallel even, but you think in generators a bit more than you do in
|
1102 | 1102 |
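The generator preference in practice: nlp.pipe consumes a stream, batches internally, and can fan out over processes (the file name below is a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def transcript_lines():
    # Stream lines from disk instead of loading nine years of text at once.
    with open("transcripts.txt", encoding="utf-8") as fh:
        for line in fh:
            yield line.strip()

# Batching (and optional multiprocessing) happens inside nlp.pipe.
for doc in nlp.pipe(transcript_lines(), batch_size=100, n_process=2):
    print(len(doc))
```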
|
|
1112 | 1112 |
|
1113 | 1113 | 00:43:53 It is definitely different. When you're a data scientist, you're usually used to,
|
1114 | 1114 |
|
1115 |
| -00:43:57 "Oh, it's a Pana's data frame. Everything's a Pana's data frame. I wake up and I brush my teeth |
| 1115 | +00:43:57 "Oh, it's a pandas data frame. Everything's a pandas data frame. I wake up and I brush my teeth
1116 | 1116 |
|
1117 |
| -00:44:01 with a Pana's data frame." But in spacey land, that's the first thing you do notice. It's not |
| 1117 | +00:44:01 with a pandas data frame." But in spaCy land, that's the first thing you do notice. It's not
1118 | 1118 |
|
1119 |
| -00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spacey, |
| 1119 | +00:44:06 everything is a data frame, actually. In fact, some of the tools that I've used inside of spaCy,
1120 | 1120 |
|
1121 | 1121 | 00:44:11 there's a little library called srsly, pronounced "seriously," that's for serialization. And one of the things that it
|
1122 | 1122 |
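A quick sketch of srsly's JSONL helpers, which fit the streaming style described here (file name and rows invented):

```python
import srsly

rows = [{"text": "episode one"}, {"text": "episode two"}]
srsly.write_jsonl("lines.jsonl", rows)

# read_jsonl yields one dict per line, lazily — generator-friendly.
for row in srsly.read_jsonl("lines.jsonl"):
    print(row["text"])
```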
|
|
1146 | 1146 |
|
1147 | 1147 | 00:45:18 And it turned out that you can actually catch a whole bunch of these Python projects by just
|
1148 | 1148 |
|
1149 |
| -00:45:22 taking the spacey product model, like the standard NER model, I think in the medium pipeline. And |
| 1149 | +00:45:22 taking the spaCy product model, like the standard NER model, I think in the medium pipeline. And
1150 | 1150 |
|
1151 | 1151 | 00:45:27 you would just tell it like, "Hey, find me all the products." And of course it's not a perfect
|
1152 | 1152 |
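The product-spotting idea, sketched (assumes the medium English pipeline; as noted, the output is imperfect by design):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # python -m spacy download en_core_web_md
doc = nlp("We moved from Flask to FastAPI and deployed everything with Docker.")

# Keep only spans the pretrained NER tagged as products.
products = [ent.text for ent in doc.ents if ent.label_ == "PRODUCT"]
print(products)  # expect some hits and some misses
```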
|
|
1188 | 1188 |
|
1189 | 1189 | 00:46:54 I talk about how to do a, how to structure an NLP project. But at the end, I also talk about
|
1190 | 1190 |
|
1191 |
| -00:46:58 these large language models and things you can do with that. And I use OpenAI. That's the thing I |
| 1191 | +00:46:58 these large language models and things you can do with that. And I use OpenAI. That's the thing I
1192 | 1192 |
|
1193 | 1193 | 00:47:03 use. But there's also this new tool called GLiNER. You can find it on Hugging Face. It's
|
1194 | 1194 |
|
|
1330 | 1330 |
|
1331 | 1331 | 00:52:20 but I can definitely imagine if you were really interested in doing something with like Python
|
1332 | 1332 |
|
1333 |
| -00:52:24 tools, I would probably start with the Python bites one looking, thinking out loud. Maybe. |
| 1333 | +00:52:24 tools, I would probably start with the Python Bytes one, looking, thinking out loud. Maybe.
1334 | 1334 |
|
1335 | 1335 | 00:52:29 Yeah, that's a good idea. It's a good idea. The first step is that this is like publicly
|
1336 | 1336 |
|
|
1464 | 1464 |
|
1465 | 1465 | 00:57:30 that to what I get out of an LLM. When those two models disagree, something interesting is usually
|
1466 | 1466 |
|
1467 |
| -00:57:35 happening because the LLM model is pretty good and the spacey model is pretty good. But when they |
| 1467 | +00:57:35 happening because the LLM model is pretty good and the spaCy model is pretty good. But when they
1468 | 1468 |
|
1469 | 1469 | 00:57:39 disagree, then I'm probably dealing with either a model that can be improved or data point that's
|
1470 | 1470 |
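A sketch of the disagreement trick — the LLM side is a deliberately hypothetical stub, since any provider would do:

```python
import spacy

nlp = spacy.load("en_core_web_md")

def llm_entities(text):
    # Hypothetical stub: call your LLM of choice and return entity strings.
    return set()

def spacy_entities(text):
    return {ent.text for ent in nlp(text).ents}

for line in ["Guido van Rossum created Python at CWI."]:
    if spacy_entities(line) != llm_entities(line):
        # Disagreement between the two models: route to a human for review.
        print("review:", line)
```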
|
|
1540 | 1540 |
|
1541 | 1541 | 01:00:30 Anyway, let's go ahead and wrap this thing up. Like people are interested in NLP,
|
1542 | 1542 |
|
1543 |
| -01:00:34 spacey, maybe beyond like what in that space and what else do you want to leave people with? |
| 1543 | +01:00:34 spaCy, maybe beyond, like what in that space, and what else do you want to leave people with?
1544 | 1544 |
|
1545 | 1545 | 01:00:39 - I guess the main thing is just approach everything with curiosity. And if you're
|
1546 | 1546 |
|
|
1607 | 1607 | 01:03:14 at talkpython.fm/youtube. This is your host, Michael Kennedy. Thanks so much for listening.
|
1608 | 1608 |
|
1609 | 1609 | 01:03:20 I really appreciate it. Now get out there and write some Python code.
|
1610 |
| - |