Commit 710dceb

remove apply from wrangling
1 parent 4e3ee87 commit 710dceb

1 file changed: +35 -92 lines changed

source/wrangling.md

Lines changed: 35 additions & 92 deletions
Original file line number | Diff line number | Diff line change
@@ -1452,44 +1452,23 @@ region_lang.value_counts("region", normalize=True)
14521452

14531453
+++
14541454

1455-
## Apply functions across multiple columns with `apply`
1455+
## Apply functions across multiple columns
14561456

1457-
### Apply a function to each column with `apply`
1458-
1459-
An alternative to aggregating on a data frame
1460-
for applying a function to many columns is the `apply` method.
1461-
Let's again find the maximum value of each column of the
1462-
`region_lang` data frame, but using `apply` with the `max` function this time.
1463-
We focus on the two arguments of `apply`:
1464-
the function that you would like to apply to each column, and the `axis` along
1465-
which the function will be applied (`0` for columns, `1` for rows).
1466-
Note that `apply` does not have an argument
1467-
to specify *which* columns to apply the function to.
1468-
Therefore, we will use the `loc[]` before calling `apply`
1469-
to choose the columns for which we want the maximum.
1470-
1471-
```{code-cell} ipython3
1472-
region_lang.loc[:, "most_at_home":"most_at_work"].apply(max)
1473-
```
1474-
1475-
We can use `apply` for much more than summary statistics.
1476-
Sometimes we need to apply a function to many columns in a data frame.
1477-
For example, we would need to do this when converting units of measurements across many columns.
1478-
We illustrate such a data transformation in {numref}`fig:mutate-across`.
1479-
1480-
+++ {"tags": []}
1457+
Computing summary statistics is not the only situation in which we need
1458+
to apply a function across columns in a data frame. There are two other
1459+
common wrangling tasks that require the application of a function across columns.
1460+
The first is when we want to apply a transformation, such as a conversion of measurement units, to multiple columns.
1461+
We illustrate such a data transformation in {numref}`fig:mutate-across`; note that it does not
1462+
change the shape of the data frame.
14811463

14821464
```{figure} img/wrangling/summarize.005.jpeg
14831465
:name: fig:mutate-across
14841466
:figclass: figure
14851467
1486-
`apply` is useful for applying functions across many columns. The darker, top row of each table represents the column headers.
1468+
A transformation applied across many columns. The darker, top row of each table represents the column headers.
14871469
```
14881470

1489-
+++
1490-
1491-
For example,
1492-
imagine that we wanted to convert all the numeric columns
1471+
For example, imagine that we wanted to convert all the numeric columns
14931472
in the `region_lang` data frame from `int64` type to `int32` type
14941473
using the `.astype` function.
14951474
When we revisit the `region_lang` data frame,
@@ -1503,88 +1482,52 @@ region_lang
15031482
```{index} pandas.DataFrame; apply, pandas.DataFrame; loc[]
15041483
```
15051484

1506-
To accomplish such a task, we can use `apply`.
1507-
As we did above,
1508-
we again use `loc[]` to specify the columns
1509-
as well as the `apply` to specify the function we want to apply on these columns.
1510-
Now, we need a way to tell `apply` what function to perform to each column
1511-
so that we can convert them from `int64` to `int32`. We will use what is called
1512-
a `lambda` function in python; `lambda` functions are just regular functions,
1513-
except that you don't need to give them a name.
1514-
That means you can pass them as an argument into `apply` easily!
1515-
Let's consider a simple example of a `lambda` function that
1516-
multiplies a number by two.
1517-
```{code-cell} ipython3
1518-
lambda x: 2*x
1519-
```
1520-
We define a `lambda` function in the following way. We start with the syntax `lambda`, which is a special word
1521-
that tells Python "what follows is
1522-
a function." Following this, we then state the name of the arguments of the function.
1523-
In this case, we just have one argument named `x`. After the list of arguments, we put a
1524-
colon `:`. And finally after the colon are the instructions: take the value provided and multiply it by 2.
1525-
Let's call our shiny new `lambda` function with the argument `2` (so the output should be `4`).
1526-
Just like a regular function, we pass its argument between parentheses `()` symbols.
1527-
```{code-cell} ipython3
1528-
(lambda x: 2*x)(2)
1529-
```
1485+
We can simply call the `.astype` function to apply it across the desired range of columns.
15301486

1531-
```{note}
1532-
Because we didn't give the `lambda` function a name, we have to surround it with
1533-
parentheses too if we want to call it. Otherwise, if we wrote something like `lambda x: 2*x(2)`, Python would get confused
1534-
and think that `(2)` was part of the instructions that comprise the `lambda` function.
1535-
As long as we don't want to call the `lambda` function ourselves, we don't need those parentheses. For example,
1536-
we can pass a `lambda` function as an argument to `apply` without any parentheses.
1537-
```
1538-
1539-
Returning to our example, let's use `apply` to convert the columns `"mother_tongue":"lang_known"`
1540-
to `int32`. To accomplish this we create a `lambda` function that takes one argument---a single column
1541-
of the data frame, which we will name `col`---and apply the `astype` method to it.
1542-
Then the `apply` method will use that `lambda` function on every column we specify via `loc[]`.
15431487
```{code-cell} ipython3
1544-
region_lang_nums = region_lang.loc[:, "mother_tongue":"lang_known"].apply(lambda col: col.astype("int32"))
1488+
region_lang_nums = region_lang.loc[:, "mother_tongue":"lang_known"].astype("int32")
15451489
region_lang_nums.info()
15461490
```
1547-
You can now see that the columns from `mother_tongue` to `lang_known` are type `int32`.
1548-
You can also see that `apply` returns a data frame with the same number of columns and rows
1549-
as the input data frame. The only thing `apply` does is use the `lambda` function argument
1550-
on each of the specified columns.
1491+
You can now see that the columns from `mother_tongue` to `lang_known` are type `int32`,
1492+
and that we have obtained a data frame with the same number of columns and rows
1493+
as the input data frame.
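
As a minimal sketch of the same idea, the block below runs the conversion on a small made-up data frame. The name `toy` and all of its values are hypothetical stand-ins for `region_lang`, used only so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for region_lang; the values are made up for illustration.
toy = pd.DataFrame({
    "region": ["A", "B", "C"],
    "mother_tongue": [5, 10, 15],
    "most_at_home": [1, 2, 3],
    "most_at_work": [0, 4, 2],
    "lang_known": [7, 12, 20],
})

# Select the range of numeric columns by label, then convert them all at once.
selected = toy.loc[:, "mother_tongue":"lang_known"]
toy_nums = selected.astype("int32")

# Every selected column is now int32, and the shape is unchanged.
print(toy_nums.dtypes)
print(toy_nums.shape == selected.shape)
```

Note that `astype` returns a new data frame here; `toy` itself still holds the original column types.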
15511494

1552-
### Apply a function row-wise with `apply`
1553-
1554-
What if you want to apply a function across columns but within one row?
1555-
We illustrate such a data transformation in {numref}`fig:rowwise`.
1556-
1557-
+++ {"tags": []}
1495+
The second situation occurs when you want to apply a function across columns within each individual
1496+
row, i.e., *row-wise*. This operation, illustrated in {numref}`fig:rowwise`,
1497+
will produce a single column whose entries summarize each row in the original data frame;
1498+
this new column can be added back into the original data.
15581499

15591500
```{figure} img/wrangling/summarize.004.jpeg
15601501
:name: fig:rowwise
15611502
:figclass: figure
15621503
1563-
`apply` is useful for applying functions across columns within one row. The
1504+
A function applied row-wise across a data frame, producing a new column. The
15641505
darker, top row of each table represents the column headers.
15651506
```
15661507

1567-
+++
1568-
1569-
For instance, suppose we want to know the maximum value between `mother_tongue`,
1570-
and `lang_known` for each language and region
1571-
in the `region_lang_nums` data set.
1508+
For example, suppose we want to know the maximum value across the columns from `mother_tongue`
1509+
to `lang_known` for each language and region in the `region_lang_nums` data set.
15721510
In other words, we want to apply the `max` function *row-wise.*
1573-
In order to tell `apply` that we want to work row-wise (as opposed to acting on each column
1511+
In order to tell `max` that we want to work row-wise (as opposed to acting on each column
15741512
individually, which is the default behavior), we just specify the argument `axis=1`.
1575-
For example, in the case of the `max` function, this tells Python that we would like
1576-
the `max` within each row of the input, as opposed to being applied on each column.
15771513

15781514
```{code-cell} ipython3
1579-
region_lang_nums.apply(max, axis=1)
1515+
region_lang_nums.max(axis=1)
15801516
```
15811517

1582-
We see that we get a column, which is the maximum value between `mother_tongue`,
1583-
`most_at_home`, `most_at_work` and `lang_known` for each language
1584-
and region. It is often the case that we want to include a column result
1585-
from using `apply` row-wise as a new column in the data frame, so that we can make
1518+
We see that we obtain a series containing the maximum value between `mother_tongue`,
1519+
`most_at_home`, `most_at_work` and `lang_known` for each row in the data frame. It
1520+
is often the case that we want to include a column result
1521+
from a row-wise operation as a new column in the data frame, so that we can make
15861522
plots or continue our analysis. To make this happen,
1587-
we will use `assign` to create a new column. This is discussed in the next section.
1523+
we will use column assignment or the `assign` function to create a new column.
1524+
This is discussed in the next section.
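
As a small preview, here is a sketch on a made-up data frame that contrasts the default column-wise `max` with the row-wise version, and then stores the row-wise result as a new column. The name `toy_nums`, its values, and the new column name `maximum` are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical numeric data frame standing in for region_lang_nums.
toy_nums = pd.DataFrame({
    "mother_tongue": [5, 10, 15],
    "most_at_home": [1, 2, 3],
    "most_at_work": [0, 4, 2],
    "lang_known": [7, 12, 20],
})

# Default (axis=0): one maximum per column.
col_max = toy_nums.max()

# axis=1: one maximum per row, returned as a series aligned with the row index.
row_max = toy_nums.max(axis=1)

# Store the row-wise result as a new column; "maximum" is an illustrative name.
toy_with_max = toy_nums.assign(maximum=row_max)
print(toy_with_max)
```

Because `row_max` shares its index with `toy_nums`, each value lands in the row it was computed from.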
1525+
1526+
```{note}
1527+
While `pandas` provides many methods (like `max`, `astype`, etc.) that can be applied to a data frame,
1528+
sometimes you may want to apply your own function to multiple columns in a data frame. In this case
1529+
you can use the more general [`apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method.
1530+
```
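
As a rough sketch of that more general route, the block below passes a user-defined function to `apply`, first column-wise and then row-wise. The function `value_range` and the data frame `toy_nums` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical numeric data frame; the values are made up for illustration.
toy_nums = pd.DataFrame({
    "mother_tongue": [5, 10, 15],
    "most_at_home": [1, 2, 3],
    "most_at_work": [0, 4, 2],
    "lang_known": [7, 12, 20],
})


def value_range(values):
    """Return the difference between the largest and smallest value in a series."""
    return values.max() - values.min()


# Default (axis=0): the function receives each column as a series.
col_ranges = toy_nums.apply(value_range)

# axis=1: the function receives each row as a series.
row_ranges = toy_nums.apply(value_range, axis=1)

print(col_ranges)
print(row_ranges)
```

When a built-in method such as `max` already does the job, calling it directly is usually simpler than going through `apply`.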
15881531

15891532
(pandas-assign)=
15901533
## Modifying and adding columns
