Commit 78a7fb3

Update 488-lancedb.txt
1 parent ec281aa commit 78a7fb3

1 file changed: +28 −29


transcripts/488-lancedb.txt

Lines changed: 28 additions & 29 deletions
@@ -170,7 +170,7 @@
 
 00:05:08 Overall, I think we actually ended up getting better performance for the most part.
 
-00:05:13 And the biggest thing was just us having the confidence to move forward very quickly without that, like, in the back of our mind, like, where's the next sext fault coming from?
+00:05:13 And the biggest thing was just us having the confidence to move forward very quickly without that, like, in the back of our mind, like, where's the next sect fault coming from?
 
 00:05:25 Yeah.
 
@@ -202,7 +202,7 @@
 
 00:05:54 Most of our team are also new to Rust.
 
-00:05:59 Or they're new when they join LandCB.
+00:05:59 Or they're new when they join LanceDB.
 
 00:06:02 And I think what helps is that most of them were also already proficient in C++, right?
 
@@ -300,7 +300,7 @@
 
 00:08:29 Even for something like Apache Arrow, like PyArrow kind of API, it's the standard for in-memory data.
 
-00:08:38 But the ChatGPG still makes up APIs for that.
+00:08:38 But the ChatGPT still makes up APIs for that.
 
 00:08:42 And if you're looking at in terms of like the effect on the developer community,
 
@@ -404,13 +404,13 @@
 
 00:12:38 Like, do we have documents on this or whatever?
 
-00:12:40 It's probably just like, here's a 175 email email thread.
+00:12:40 It's probably just like, here's a 175 email thread.
 
 00:12:43 Like, no, no, thanks.
 
 00:12:45 This isn't going to help.
 
-00:12:46 Let's talk LanceDB and multimodal data.
+00:12:46 Let's talk Lance DB and multimodal data.
 
 00:12:49 And I guess maybe that's the place before we get into the details of Lance.
 
@@ -428,7 +428,7 @@
 
 00:13:33 And there's lots of use cases for a tool that can help those users extract insights and then possibly train models on top of those on top of that kind of data.
 
-00:13:45 And so if you look at data by by volume, if you look at your average like TPCH data table, it's like what, like 150 bytes or something like that per row.
+00:13:45 And so if you look at data by volume, if you look at your average like TPCH data table, it's like what, like 150 bytes or something like that per row.
 
 00:13:55 And then embeddings, you know, if you just look at the previous generation of open AI, it's like, you know, 25 times that.
 
@@ -464,13 +464,13 @@
 
 00:15:29 Yes.
 
-00:15:29 Honestly, I think nuclear is it's worth considering if if you rather than coal or whatever.
+00:15:29 Honestly, I think nuclear is it's worth considering if you rather than coal or whatever.
 
 00:15:35 But still, that's a whole different discussion.
 
 00:15:37 We don't need to go down that hole right now.
 
-00:15:38 Maybe maybe later at the end.
+00:15:38 Maybe later at the end.
 
 00:15:40 Who knows?
 
@@ -546,7 +546,7 @@
 
 00:17:27 Create your Sentry account now at talkpython.fm/sentry.
 
-00:17:32 And if you sign up with the code talkpython, all capital, no spaces, it's good for two free
+00:17:32 And if you sign up with the code TALKPYTHON, all capital, no spaces, it's good for two free
 
 00:17:38 months of Sentry's business plan, which will give you up to 20 times as many monthly events
 
@@ -666,7 +666,7 @@
 
 00:22:37 The core interface input output is arrow.
 
-00:22:39 And then with LAN CB, there's on the output end, you can convert results to pandas data frames or polars data frames.
+00:22:39 And then with LANCE DB, there's on the output end, you can convert results to pandas data frames or polars data frames.
 
 00:22:47 And then on the input end to, I think natively, we can take pandas data frames as batches of input, but we convert that to arrow tables.
 
@@ -708,7 +708,7 @@
 
 00:24:22 Yeah.
 
-00:24:23 I think MinIO published their own integration with LAN CB as an example as well.
+00:24:23 I think MinIO published their own integration with LANCE DB as an example as well.
 
 00:24:28 Oh, did they?
 
@@ -758,9 +758,9 @@
 
 00:25:46 So this is our commercial offering.
 
-00:25:48 I'm just calling it LansCB Enterprise.
+00:25:48 I'm just calling it LanceDB Enterprise.
 
-00:25:50 And essentially, it's one distributed system that gives you a huge scale, low latency, high throughput, a system that can be backed by the same Lans data.
+00:25:50 And essentially, it's one distributed system that gives you a huge scale, low latency, high throughput, a system that can be backed by the same LanceDB data.
 
 00:26:03 It's just a file in S3.
 
@@ -828,7 +828,7 @@
 
 00:29:29 What's the workflow?
 
-00:29:30 With the Lansi B Enterprise, like the system that's running, you can just keep adding data to it.
+00:29:30 With the LanceDB Enterprise, like the system that's running, you can just keep adding data to it.
 
 00:29:34 The indexing and all of that is automatic once you've configured it properly, which is just, you know, here's the schema and like create indices on these columns, right?
 
@@ -844,7 +844,7 @@
 
 00:30:13 So this is where the open data layer comes in.
 
-00:30:15 So if you have a large data set and you have, whether it's Spark or Ray, you can use those large distributed systems to write data directly to S3 in the, in Lansi open source format.
+00:30:15 So if you have a large data set and you have, whether it's Spark or Ray, you can use those large distributed systems to write data directly to S3 in the, in LanceDB open source format.
 
 00:30:27 And then the system actually picks it up from object store and takes care of the indexing and compaction and all of that.
 
@@ -960,7 +960,7 @@
 
 00:34:55 And instead, that open data layer is super important to plug into the rest of their existing ecosystem.
 
-00:35:02 And with Lance, unlike Parquet or JSON or WebDataSet, all of their multi-bono data can live in one place so that they can do search and retrieval on the vectors.
+00:35:02 And with Lance, unlike Parquet or JSON or Web Dataset, all of their multi-bono data can live in one place so that they can do search and retrieval on the vectors.
 
 00:35:13 They can run SQL.
 
@@ -1140,7 +1140,7 @@
 
 00:41:53 There's a translation layer between pedantic and arrow schemas.
 
-00:41:56 And so I think for a lot of our Python users, it's much easier to think in terms of pedantic objects as the data model rather than manually dealing with the Pyero API to create a schema.
+00:41:56 And so I think for a lot of our Python users, it's much easier to think in terms of Pydantic objects as the data model rather than manually dealing with the Py Arrow API to create a schema.
 
 00:42:08 We saw lots of issues where users are like, well, how do I create a fixed size list?
 
@@ -1152,13 +1152,13 @@
 
 00:42:27 There's lots of stuff that is just much easier to think of in terms of Python types and Python objects.
 
-00:42:33 I love the Python, the pydantic integration there.
+00:42:33 I love the Python, the Pydantic integration there.
 
 00:42:36 That's super cool.
 
 00:42:37 Well, the data layer used for my course's website and the podcast website and stuff is all based on Beanie and Mongo, which is async.
 
-00:42:45 Basically, you're writing async queries against pydantic models, which is, it's a real, got the validation, but you're not writing directly to the database and just random dictionaries.
+00:42:45 Basically, you're writing async queries against Pydantic models, which is, it's a real, got the validation, but you're not writing directly to the database and just random dictionaries.
 
 00:42:55 And who knows if it stays consistent.
 
@@ -1196,7 +1196,7 @@
 
 00:44:11 I know we've loved working with it for a long time.
 
-00:44:14 If any listeners are out there that's interested in just messing with like pedantic and arrow, so that translation layer, we'd love to get some help as well.
+00:44:14 If any listeners are out there that's interested in just messing with like Pydantic and arrow, so that translation layer, we'd love to get some help as well.
 
 00:44:24 So for some of the more like complicated nested types out there, so where that's like lists of lists or lists of fixed with lists and dictionaries and that kind of thing, the translation layer is incomplete.
 
@@ -1208,7 +1208,7 @@
 
 00:45:00 Right, right.
 
-00:45:01 And then maybe like either pedantic or maybe arrow, one of the two projects can own that translation layer.
+00:45:01 And then maybe like either Pydantic or maybe arrow, one of the two projects can own that translation layer.
 
 00:45:07 I think that would be really great for the, for the ecosystem.
 
@@ -1224,7 +1224,7 @@
 
 00:45:27 We wanted to make it so that it's really familiar for people who've worked with databases and data frame engines.
 
-00:45:34 So the main workload for Lansi B open source is that search API.
+00:45:34 So the main workload for LanceDB open source is that search API.
 
 00:45:38 So when you, you can say table.search, you can pass in the vector, the query vector, and then you can call .limit to say how many results we want.
 
@@ -1248,15 +1248,15 @@
 
 00:46:30 Right.
 
-00:46:31 You would just maybe want to take your pedantic and say, turn that into JSON and hand it over to the next agent or something like that.
+00:46:31 You would just maybe want to take your Pydantic and say, turn that into JSON and hand it over to the next agent or something like that.
 
 00:46:37 If we sort of move forward in that example a little bit.
 
 00:46:40 Yeah.
 
 00:46:40 In addition to setting up the schema, there's also the embedding API that's really interesting so that when you create the schema.
 
-00:46:49 So here's an example where we can create the schema using pedantic.
+00:46:49 So here's an example where we can create the schema using Pydantic.
 
 00:46:54 So in this block that I've declared the class words, which is a Lance model.
 
@@ -1272,7 +1272,7 @@
 
 00:47:26 We're using ADA two, but new models have been released since.
 
-00:47:30 And I can use the pedantic annotations to say, hey, the text field is the source field for that function.
+00:47:30 And I can use the Pydantic annotations to say, hey, the text field is the source field for that function.
 
 00:47:36 So it's text string equals func dot source field.
 
@@ -1292,7 +1292,7 @@
 
 00:48:11 Yeah, just takes care of calling the open AI API on your behalf and then adding the vectors before, you know, adding the whole batch to the table.
 
-00:48:21 So open AI was was our first one, but there's dozens of compatible embedding models.
+00:48:21 So open AI was our first one, but there's dozens of compatible embedding models.
 
 00:48:27 So pretty much anything you can pull off hugging face.
 
@@ -1302,9 +1302,9 @@
 
 00:48:43 Exactly. Yeah.
 
-00:48:43 So a lot of the the ones you can pull off hugging face can just run locally and it exposes the options.
+00:48:43 So a lot of the ones you can pull off hugging face can just run locally and it exposes the options.
 
-00:48:51 So if you do have for your Mac mini or or even your MacBook laptop, there are options.
+00:48:51 So if you do have for your Mac mini or even your MacBook laptop, there are options.
 
 00:48:57 There are lots of hugging face models where you can specify NPS to actually make it run a little bit faster.
 
@@ -1543,4 +1543,3 @@
 00:57:55 Now get out there and write some Python code.
 
 00:57:57 We'll see you next time.
-