|
170 | 170 |
|
171 | 171 | 00:05:08 Overall, I think we actually ended up getting better performance for the most part.
|
172 | 172 |
|
173 |
| -00:05:13 And the biggest thing was just us having the confidence to move forward very quickly without that, like, in the back of our mind, like, where's the next sext fault coming from? |
| 173 | +00:05:13 And the biggest thing was just us having the confidence to move forward very quickly without that, like, in the back of our mind, like, where's the next seg fault coming from? |
174 | 174 |
|
175 | 175 | 00:05:25 Yeah.
|
176 | 176 |
|
|
202 | 202 |
|
203 | 203 | 00:05:54 Most of our team are also new to Rust.
|
204 | 204 |
|
205 |
| -00:05:59 Or they're new when they join LandCB. |
| 205 | +00:05:59 Or they're new when they join LanceDB. |
206 | 206 |
|
207 | 207 | 00:06:02 And I think what helps is that most of them were also already proficient in C++, right?
|
208 | 208 |
|
|
300 | 300 |
|
301 | 301 | 00:08:29 Even for something like Apache Arrow, like PyArrow kind of API, it's the standard for in-memory data.
|
302 | 302 |
|
303 |
| -00:08:38 But the ChatGPG still makes up APIs for that. |
| 303 | +00:08:38 But ChatGPT still makes up APIs for that. |
304 | 304 |
|
305 | 305 | 00:08:42 And if you're looking at in terms of like the effect on the developer community,
|
306 | 306 |
|
|
404 | 404 |
|
405 | 405 | 00:12:38 Like, do we have documents on this or whatever?
|
406 | 406 |
|
407 |
| -00:12:40 It's probably just like, here's a 175 email email thread. |
| 407 | +00:12:40 It's probably just like, here's a 175-email thread. |
408 | 408 |
|
409 | 409 | 00:12:43 Like, no, no, thanks.
|
410 | 410 |
|
411 | 411 | 00:12:45 This isn't going to help.
|
412 | 412 |
|
413 |
| -00:12:46 Let's talk LanceDB and multimodal data. |
| 413 | +00:12:46 Let's talk LanceDB and multimodal data. |
414 | 414 |
|
415 | 415 | 00:12:49 And I guess maybe that's the place before we get into the details of Lance.
|
416 | 416 |
|
|
428 | 428 |
|
429 | 429 | 00:13:33 And there's lots of use cases for a tool that can help those users extract insights and then possibly train models on top of those on top of that kind of data.
|
430 | 430 |
|
431 |
| -00:13:45 And so if you look at data by by volume, if you look at your average like TPCH data table, it's like what, like 150 bytes or something like that per row. |
| 431 | +00:13:45 And so if you look at data by volume, if you look at your average like TPCH data table, it's like what, like 150 bytes or something like that per row. |
432 | 432 |
|
433 | 433 | 00:13:55 And then embeddings, you know, if you just look at the previous generation of OpenAI, it's like, you know, 25 times that.
|
434 | 434 |
|
|
464 | 464 |
|
465 | 465 | 00:15:29 Yes.
|
466 | 466 |
|
467 |
| -00:15:29 Honestly, I think nuclear is it's worth considering if if you rather than coal or whatever. |
| 467 | +00:15:29 Honestly, I think nuclear, it's worth considering rather than coal or whatever. |
468 | 468 |
|
469 | 469 | 00:15:35 But still, that's a whole different discussion.
|
470 | 470 |
|
471 | 471 | 00:15:37 We don't need to go down that hole right now.
|
472 | 472 |
|
473 |
| -00:15:38 Maybe maybe later at the end. |
| 473 | +00:15:38 Maybe later at the end. |
474 | 474 |
|
475 | 475 | 00:15:40 Who knows?
|
476 | 476 |
|
|
546 | 546 |
|
547 | 547 | 00:17:27 Create your Sentry account now at talkpython.fm/sentry.
|
548 | 548 |
|
549 |
| -00:17:32 And if you sign up with the code talkpython, all capital, no spaces, it's good for two free |
| 549 | +00:17:32 And if you sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free |
550 | 550 |
|
551 | 551 | 00:17:38 months of Sentry's business plan, which will give you up to 20 times as many monthly events
|
552 | 552 |
|
|
666 | 666 |
|
667 | 667 | 00:22:37 The core interface input output is Arrow.
|
668 | 668 |
|
669 |
| -00:22:39 And then with LAN CB, there's on the output end, you can convert results to pandas data frames or polars data frames. |
| 669 | +00:22:39 And then with LanceDB, there's on the output end, you can convert results to pandas data frames or polars data frames. |
670 | 670 |
|
671 | 671 | 00:22:47 And then on the input end to, I think natively, we can take pandas data frames as batches of input, but we convert that to arrow tables.
|
672 | 672 |
|
|
708 | 708 |
|
709 | 709 | 00:24:22 Yeah.
|
710 | 710 |
|
711 |
| -00:24:23 I think MinIO published their own integration with LAN CB as an example as well. |
| 711 | +00:24:23 I think MinIO published their own integration with LanceDB as an example as well. |
712 | 712 |
|
713 | 713 | 00:24:28 Oh, did they?
|
714 | 714 |
|
|
758 | 758 |
|
759 | 759 | 00:25:46 So this is our commercial offering.
|
760 | 760 |
|
761 |
| -00:25:48 I'm just calling it LansCB Enterprise. |
| 761 | +00:25:48 I'm just calling it LanceDB Enterprise. |
762 | 762 |
|
763 |
| -00:25:50 And essentially, it's one distributed system that gives you a huge scale, low latency, high throughput, a system that can be backed by the same Lans data. |
| 763 | +00:25:50 And essentially, it's one distributed system that gives you a huge scale, low latency, high throughput, a system that can be backed by the same Lance data. |
764 | 764 |
|
765 | 765 | 00:26:03 It's just a file in S3.
|
766 | 766 |
|
|
828 | 828 |
|
829 | 829 | 00:29:29 What's the workflow?
|
830 | 830 |
|
831 |
| -00:29:30 With the Lansi B Enterprise, like the system that's running, you can just keep adding data to it. |
| 831 | +00:29:30 With the LanceDB Enterprise, like the system that's running, you can just keep adding data to it. |
832 | 832 |
|
833 | 833 | 00:29:34 The indexing and all of that is automatic once you've configured it properly, which is just, you know, here's the schema and like create indices on these columns, right?
|
834 | 834 |
|
|
844 | 844 |
|
845 | 845 | 00:30:13 So this is where the open data layer comes in.
|
846 | 846 |
|
847 |
| -00:30:15 So if you have a large data set and you have, whether it's Spark or Ray, you can use those large distributed systems to write data directly to S3 in the, in Lansi open source format. |
| 847 | +00:30:15 So if you have a large data set and you have, whether it's Spark or Ray, you can use those large distributed systems to write data directly to S3 in the Lance open source format. |
848 | 848 |
|
849 | 849 | 00:30:27 And then the system actually picks it up from object store and takes care of the indexing and compaction and all of that.
|
850 | 850 |
|
|
960 | 960 |
|
961 | 961 | 00:34:55 And instead, that open data layer is super important to plug into the rest of their existing ecosystem.
|
962 | 962 |
|
963 |
| -00:35:02 And with Lance, unlike Parquet or JSON or WebDataSet, all of their multi-bono data can live in one place so that they can do search and retrieval on the vectors. |
| 963 | +00:35:02 And with Lance, unlike Parquet or JSON or WebDataset, all of their multimodal data can live in one place so that they can do search and retrieval on the vectors. |
964 | 964 |
|
965 | 965 | 00:35:13 They can run SQL.
|
966 | 966 |
|
|
1140 | 1140 |
|
1141 | 1141 | 00:41:53 There's a translation layer between Pydantic and Arrow schemas.
|
1142 | 1142 |
|
1143 |
| -00:41:56 And so I think for a lot of our Python users, it's much easier to think in terms of pedantic objects as the data model rather than manually dealing with the Pyero API to create a schema. |
| 1143 | +00:41:56 And so I think for a lot of our Python users, it's much easier to think in terms of Pydantic objects as the data model rather than manually dealing with the PyArrow API to create a schema. |
1144 | 1144 |
|
1145 | 1145 | 00:42:08 We saw lots of issues where users are like, well, how do I create a fixed size list?
|
1146 | 1146 |
|
|
1152 | 1152 |
|
1153 | 1153 | 00:42:27 There's lots of stuff that is just much easier to think of in terms of Python types and Python objects.
|
1154 | 1154 |
|
1155 |
| -00:42:33 I love the Python, the pydantic integration there. |
| 1155 | +00:42:33 I love the Python, the Pydantic integration there. |
1156 | 1156 |
|
1157 | 1157 | 00:42:36 That's super cool.
|
1158 | 1158 |
|
1159 | 1159 | 00:42:37 Well, the data layer used for my course's website and the podcast website and stuff is all based on Beanie and Mongo, which is async.
|
1160 | 1160 |
|
1161 |
| -00:42:45 Basically, you're writing async queries against pydantic models, which is, it's a real, got the validation, but you're not writing directly to the database and just random dictionaries. |
| 1161 | +00:42:45 Basically, you're writing async queries against Pydantic models, which is, it's a real, got the validation, but you're not writing directly to the database and just random dictionaries. |
1162 | 1162 |
|
1163 | 1163 | 00:42:55 And who knows if it stays consistent.
|
1164 | 1164 |
|
|
1196 | 1196 |
|
1197 | 1197 | 00:44:11 I know we've loved working with it for a long time.
|
1198 | 1198 |
|
1199 |
| -00:44:14 If any listeners are out there that's interested in just messing with like pedantic and arrow, so that translation layer, we'd love to get some help as well. |
| 1199 | +00:44:14 If any listeners are out there that's interested in just messing with like Pydantic and Arrow, so that translation layer, we'd love to get some help as well. |
1200 | 1200 |
|
1201 | 1201 | 00:44:24 So for some of the more like complicated nested types out there, so where that's like lists of lists or lists of fixed-width lists and dictionaries and that kind of thing, the translation layer is incomplete.
|
1202 | 1202 |
|
|
1208 | 1208 |
|
1209 | 1209 | 00:45:00 Right, right.
|
1210 | 1210 |
|
1211 |
| -00:45:01 And then maybe like either pedantic or maybe arrow, one of the two projects can own that translation layer. |
| 1211 | +00:45:01 And then maybe like either Pydantic or maybe Arrow, one of the two projects can own that translation layer. |
1212 | 1212 |
|
1213 | 1213 | 00:45:07 I think that would be really great for the, for the ecosystem.
|
1214 | 1214 |
|
|
1224 | 1224 |
|
1225 | 1225 | 00:45:27 We wanted to make it so that it's really familiar for people who've worked with databases and data frame engines.
|
1226 | 1226 |
|
1227 |
| -00:45:34 So the main workload for Lansi B open source is that search API. |
| 1227 | +00:45:34 So the main workload for LanceDB open source is that search API. |
1228 | 1228 |
|
1229 | 1229 | 00:45:38 So when you, you can say table.search, you can pass in the vector, the query vector, and then you can call .limit to say how many results we want.
|
1230 | 1230 |
|
|
1248 | 1248 |
|
1249 | 1249 | 00:46:30 Right.
|
1250 | 1250 |
|
1251 |
| -00:46:31 You would just maybe want to take your pedantic and say, turn that into JSON and hand it over to the next agent or something like that. |
| 1251 | +00:46:31 You would just maybe want to take your Pydantic model and say, turn that into JSON and hand it over to the next agent or something like that. |
1252 | 1252 |
|
1253 | 1253 | 00:46:37 If we sort of move forward in that example a little bit.
|
1254 | 1254 |
|
1255 | 1255 | 00:46:40 Yeah.
|
1256 | 1256 |
|
1257 | 1257 | 00:46:40 In addition to setting up the schema, there's also the embedding API that's really interesting so that when you create the schema.
|
1258 | 1258 |
|
1259 |
| -00:46:49 So here's an example where we can create the schema using pedantic. |
| 1259 | +00:46:49 So here's an example where we can create the schema using Pydantic. |
1260 | 1260 |
|
1261 | 1261 | 00:46:54 So in this block that I've declared the class Words, which is a Lance model.
|
1262 | 1262 |
|
|
1272 | 1272 |
|
1273 | 1273 | 00:47:26 We're using ADA two, but new models have been released since.
|
1274 | 1274 |
|
1275 |
| -00:47:30 And I can use the pedantic annotations to say, hey, the text field is the source field for that function. |
| 1275 | +00:47:30 And I can use the Pydantic annotations to say, hey, the text field is the source field for that function. |
1276 | 1276 |
|
1277 | 1277 | 00:47:36 So it's text string equals func dot source field.
|
1278 | 1278 |
|
|
1292 | 1292 |
|
1293 | 1293 | 00:48:11 Yeah, just takes care of calling the OpenAI API on your behalf and then adding the vectors before, you know, adding the whole batch to the table.
|
1294 | 1294 |
|
1295 |
| -00:48:21 So open AI was was our first one, but there's dozens of compatible embedding models. |
| 1295 | +00:48:21 So OpenAI was our first one, but there's dozens of compatible embedding models. |
1296 | 1296 |
|
1297 | 1297 | 00:48:27 So pretty much anything you can pull off Hugging Face.
|
1298 | 1298 |
|
|
1302 | 1302 |
|
1303 | 1303 | 00:48:43 Exactly. Yeah.
|
1304 | 1304 |
|
1305 |
| -00:48:43 So a lot of the the ones you can pull off hugging face can just run locally and it exposes the options. |
| 1305 | +00:48:43 So a lot of the ones you can pull off Hugging Face can just run locally and it exposes the options. |
1306 | 1306 |
|
1307 |
| -00:48:51 So if you do have for your Mac mini or or even your MacBook laptop, there are options. |
| 1307 | +00:48:51 So if you do have for your Mac mini or even your MacBook laptop, there are options. |
1308 | 1308 |
|
1309 | 1309 | 00:48:57 There are lots of Hugging Face models where you can specify MPS to actually make it run a little bit faster.
|
1310 | 1310 |
|
|
1543 | 1543 | 00:57:55 Now get out there and write some Python code.
|
1544 | 1544 |
|
1545 | 1545 | 00:57:57 We'll see you next time.
|
1546 |
| - |