@@ -1452,44 +1452,23 @@ region_lang.value_counts("region", normalize=True)
1452
1452
1453
1453
+++
1454
1454
1455
- ## Apply functions across multiple columns with ` apply `
1455
+ ## Apply functions across multiple columns
1456
1456
1457
- ### Apply a function to each column with ` apply `
1458
-
1459
- An alternative to aggregating on a data frame
1460
- for applying a function to many columns is the ` apply ` method.
1461
- Let's again find the maximum value of each column of the
1462
- ` region_lang ` data frame, but using ` apply ` with the ` max ` function this time.
1463
- We focus on the two arguments of ` apply ` :
1464
- the function that you would like to apply to each column, and the ` axis ` along
1465
- which the function will be applied (` 0 ` for columns, ` 1 ` for rows).
1466
- Note that ` apply ` does not have an argument
1467
- to specify * which* columns to apply the function to.
1468
- Therefore, we will use the ` loc[] ` before calling ` apply `
1469
- to choose the columns for which we want the maximum.
1470
-
1471
- ``` {code-cell} ipython3
1472
- region_lang.loc[:, "most_at_home":"most_at_work"].apply(max)
1473
- ```
1474
-
1475
- We can use ` apply ` for much more than summary statistics.
1476
- Sometimes we need to apply a function to many columns in a data frame.
1477
- For example, we would need to do this when converting units of measurements across many columns.
1478
- We illustrate such a data transformation in {numref}` fig:mutate-across ` .
1479
-
1480
- +++ {"tags": [ ] }
1457
+ Computing summary statistics is not the only situation in which we need
1458
+ to apply a function across columns in a data frame. There are two other
1459
+ common wrangling tasks that require the application of a function across columns.
1460
+ The first is when we want to apply a transformation, such as a conversion of measurement units, to multiple columns.
1461
+ We illustrate such a data transformation in {numref}` fig:mutate-across ` ; note that it does not
1462
+ change the shape of the data frame.
1481
1463
1482
1464
``` {figure} img/wrangling/summarize.005.jpeg
1483
1465
:name: fig:mutate-across
1484
1466
:figclass: figure
1485
1467
1486
- `apply` is useful for applying functions across many columns. The darker, top row of each table represents the column headers.
1468
+ A transformation applied across many columns. The darker, top row of each table represents the column headers.
1487
1469
```
1488
1470
1489
- +++
1490
-
1491
- For example,
1492
- imagine that we wanted to convert all the numeric columns
1471
+ For example, imagine that we wanted to convert all the numeric columns
1493
1472
in the ` region_lang ` data frame from ` int64 ` type to ` int32 ` type
1494
1473
using the ` astype ` method.
1495
1474
When we revisit the ` region_lang ` data frame,
@@ -1503,88 +1482,52 @@ region_lang
1503
1482
``` {index} pandas.DataFrame; apply, pandas.DataFrame; loc[]
1504
1483
```
1505
1484
1506
- To accomplish such a task, we can use ` apply ` .
1507
- As we did above,
1508
- we again use ` loc[] ` to specify the columns
1509
- as well as the ` apply ` to specify the function we want to apply on these columns.
1510
- Now, we need a way to tell ` apply ` what function to perform to each column
1511
- so that we can convert them from ` int64 ` to ` int32 ` . We will use what is called
1512
- a ` lambda ` function in python; ` lambda ` functions are just regular functions,
1513
- except that you don't need to give them a name.
1514
- That means you can pass them as an argument into ` apply ` easily!
1515
- Let's consider a simple example of a ` lambda ` function that
1516
- multiplies a number by two.
1517
- ``` {code-cell} ipython3
1518
- lambda x: 2*x
1519
- ```
1520
- We define a ` lambda ` function in the following way. We start with the syntax ` lambda ` , which is a special word
1521
- that tells Python "what follows is
1522
- a function." Following this, we then state the name of the arguments of the function.
1523
- In this case, we just have one argument named ` x ` . After the list of arguments, we put a
1524
- colon ` : ` . And finally after the colon are the instructions: take the value provided and multiply it by 2.
1525
- Let's call our shiny new ` lambda ` function with the argument ` 2 ` (so the output should be ` 4 ` ).
1526
- Just like a regular function, we pass its argument between parentheses ` () ` symbols.
1527
- ``` {code-cell} ipython3
1528
- (lambda x: 2*x)(2)
1529
- ```
1485
+ We can simply call the ` astype ` method on the desired range of columns, which we select with ` loc[] ` .
1530
1486
1531
- ``` {note}
1532
- Because we didn't give the `lambda` function a name, we have to surround it with
1533
- parentheses too if we want to call it. Otherwise, if we wrote something like `lambda x: 2*x(2)`, Python would get confused
1534
- and think that `(2)` was part of the instructions that comprise the `lambda` function.
1535
- As long as we don't want to call the `lambda` function ourselves, we don't need those parentheses. For example,
1536
- we can pass a `lambda` function as an argument to `apply` without any parentheses.
1537
- ```
1538
-
1539
- Returning to our example, let's use ` apply ` to convert the columns ` "mother_tongue":"lang_known" `
1540
- to ` int32 ` . To accomplish this we create a ` lambda ` function that takes one argument---a single column
1541
- of the data frame, which we will name ` col ` ---and apply the ` astype ` method to it.
1542
- Then the ` apply ` method will use that ` lambda ` function on every column we specify via ` loc[] ` .
1543
1487
``` {code-cell} ipython3
1544
- region_lang_nums = region_lang.loc[:, "mother_tongue":"lang_known"].apply(lambda col: col. astype("int32") )
1488
+ region_lang_nums = region_lang.loc[:, "mother_tongue":"lang_known"].astype("int32")
1545
1489
region_lang_nums.info()
1546
1490
```
1547
- You can now see that the columns from ` mother_tongue ` to ` lang_known ` are type ` int32 ` .
1548
- You can also see that ` apply ` returns a data frame with the same number of columns and rows
1549
- as the input data frame. The only thing ` apply ` does is use the ` lambda ` function argument
1550
- on each of the specified columns.
1491
+ You can now see that the columns from ` mother_tongue ` to ` lang_known ` are type ` int32 ` ,
1492
+ and that we have obtained a data frame with the same number of columns and rows
1493
+ as the input data frame.
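
The same pattern can be tried on a small, self-contained example. The sketch below assumes a made-up data frame named `toy_lang` (a hypothetical stand-in for `region_lang`, with invented values) and checks that converting a range of columns with `astype` changes only their type, not the shape of the result.

```python
import pandas as pd

# Hypothetical stand-in for region_lang; the values are invented for illustration.
toy_lang = pd.DataFrame({
    "region": ["A", "B", "C"],
    "mother_tongue": [5, 10, 15],
    "most_at_home": [1, 2, 3],
    "most_at_work": [0, 4, 2],
    "lang_known": [20, 30, 40],
})

# Select a label-based range of columns with loc[] (both endpoints are included).
selected = toy_lang.loc[:, "mother_tongue":"lang_known"]

# astype converts every selected column in one call.
toy_nums = selected.astype("int32")

print(toy_nums.dtypes)
print(toy_nums.shape == selected.shape)
```

Because `astype` is a data frame method, no explicit loop over columns is needed here.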
1551
1494
1552
- ### Apply a function row-wise with ` apply `
1553
-
1554
- What if you want to apply a function across columns but within one row?
1555
- We illustrate such a data transformation in {numref}` fig:rowwise ` .
1556
-
1557
- +++ {"tags": [ ] }
1495
+ The second situation occurs when you want to apply a function across columns within each individual
1496
+ row, i.e., * row-wise* . This operation, illustrated in {numref}` fig:rowwise ` ,
1497
+ will produce a single column whose entries summarize each row in the original data frame;
1498
+ this new column can be added back into the original data.
1558
1499
1559
1500
``` {figure} img/wrangling/summarize.004.jpeg
1560
1501
:name: fig:rowwise
1561
1502
:figclass: figure
1562
1503
1563
- `apply` is useful for applying functions across columns within one row . The
1504
+ A function applied row-wise across a data frame, producing a new column . The
1564
1505
darker, top row of each table represents the column headers.
1565
1506
```
1566
1507
1567
- +++
1568
-
1569
- For instance, suppose we want to know the maximum value between ` mother_tongue ` ,
1570
- and ` lang_known ` for each language and region
1571
- in the ` region_lang_nums ` data set.
1508
+ For example, suppose we want to know the maximum value among the columns from ` mother_tongue `
1509
+ to ` lang_known ` for each language and region in the ` region_lang_nums ` data set.
1572
1510
In other words, we want to apply the ` max ` function * row-wise.*
1573
- In order to tell ` apply ` that we want to work row-wise (as opposed to acting on each column
1511
+ In order to tell ` max ` that we want to work row-wise (as opposed to acting on each column
1574
1512
individually, which is the default behavior), we just specify the argument ` axis=1 ` .
1575
- For example, in the case of the ` max ` function, this tells Python that we would like
1576
- the ` max ` within each row of the input, as opposed to being applied on each column.
1577
1513
1578
1514
``` {code-cell} ipython3
1579
- region_lang_nums.apply( max, axis=1)
1515
+ region_lang_nums.max( axis=1)
1580
1516
```
1581
1517
1582
- We see that we get a column, which is the maximum value between ` mother_tongue ` ,
1583
- ` most_at_home ` , ` most_at_work ` and ` lang_known ` for each language
1584
- and region. It is often the case that we want to include a column result
1585
- from using ` apply ` row-wise as a new column in the data frame, so that we can make
1518
+ We see that we obtain a series containing the maximum value among ` mother_tongue ` ,
1519
+ ` most_at_home ` , ` most_at_work ` and ` lang_known ` for each row in the data frame. It
1520
+ is often the case that we want to include the result
1521
+ of a row-wise operation as a new column in the data frame, so that we can make
1586
1522
plots or continue our analysis. To make this happen,
1587
- we will use ` assign ` to create a new column. This is discussed in the next section.
1523
+ we will use column assignment or the ` assign ` function to create a new column.
1524
+ This is discussed in the next section.
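
To make the effect of the `axis` argument concrete, here is a minimal sketch on a made-up data frame (`toy_nums` and its values are assumptions for illustration, not the real `region_lang_nums` data). It contrasts the default column-wise maximum with the row-wise maximum.

```python
import pandas as pd

# Hypothetical numeric data frame; the values are invented for illustration.
toy_nums = pd.DataFrame({
    "mother_tongue": [5, 10, 15],
    "most_at_home": [1, 2, 30],
    "most_at_work": [0, 40, 2],
    "lang_known": [20, 30, 25],
})

# Default (axis=0): one maximum per column, indexed by column name.
col_max = toy_nums.max()

# axis=1: one maximum per row, indexed like the rows of the data frame.
row_max = toy_nums.max(axis=1)

print(col_max)
print(row_max)
```

The row-wise result has one entry per row, which is why it can later be attached to the data frame as a new column.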
1525
+
1526
+ ``` {note}
1527
+ While `pandas` provides many methods (like `max`, `astype`, etc.) that can be applied to a data frame,
1528
+ sometimes you may want to apply your own function to multiple columns in a data frame. In this case
1529
+ you can use the more general [`apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method.
1530
+ ```
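
As a rough illustration of the `apply` method mentioned in the note, the sketch below passes a custom function to `apply`, both column-wise and row-wise. The data frame `toy_nums` and the helper `value_range` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical numeric data frame; the values are invented for illustration.
toy_nums = pd.DataFrame({
    "mother_tongue": [5, 10, 15],
    "most_at_home": [1, 2, 30],
    "most_at_work": [0, 40, 2],
    "lang_known": [20, 30, 25],
})


def value_range(values):
    """Hypothetical helper: largest value minus smallest value in a series."""
    return values.max() - values.min()


# By default, apply passes each column to the function (axis=0).
col_ranges = toy_nums.apply(value_range)

# With axis=1, apply passes each row to the function instead.
row_ranges = toy_nums.apply(value_range, axis=1)

print(col_ranges)
print(row_ranges)
```

When a built-in method such as `max` or `astype` already does the job, calling it directly is usually simpler; `apply` is mainly useful for functions that `pandas` does not provide.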
1588
1531
1589
1532
(pandas-assign)=
1590
1533
## Modifying and adding columns