Commit ebcdd02

[Edit] Pandas DataFrame: .groupby() (#7409)
* [Edit] Pandas DataFrame: .groupby()
* updated faqs based on PAA
1 parent 773fd42 commit ebcdd02

File tree

1 file changed

+107
-36
lines changed
  • content/pandas/concepts/dataframe/terms/groupby


content/pandas/concepts/dataframe/terms/groupby/groupby.md

Lines changed: 107 additions & 36 deletions
@@ -12,59 +12,130 @@ CatalogContent:
1212
- 'paths/data-science'
1313
---
1414

15-
The **`.groupby()`** function groups a [`DataFrame`](https://www.codecademy.com/resources/docs/pandas/dataframe) using a mapper or a series of columns and returns a [`GroupBy`](https://www.codecademy.com/resources/docs/pandas/groupby) object. A range of methods, as well as custom functions, can be applied to `GroupBy` objects in order to combine or transform large amounts of data in these groups.
15+
The Pandas DataFrame **`.groupby()`** function groups a `DataFrame` using a mapper or a series of columns and returns a [`GroupBy`](https://www.codecademy.com/resources/docs/pandas/groupby) object. A range of methods, as well as custom functions, can be applied to `GroupBy` objects in order to combine or transform large amounts of data in these groups.
1616

17-
## Syntax
17+
## Pandas `.groupby()` Syntax
1818

1919
```pseudo
20-
dataframevalue.groupby(by, axis, level, as_index, sort, group_keys, observed, dropna)
20+
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
2121
```
2222

23-
`.groupby()` uses the following parameters:
23+
**Parameters:**
2424

2525
- `by`: If a dictionary or `Series` is passed, the values will determine groups. If a list or [ndarray](https://www.codecademy.com/resources/docs/numpy/ndarray) with the same length as the selected axis is passed, the values will be used to form groups. A label or list of labels can be used to group by a particular column or columns.
26-
- `axis`: Split along rows (0 or "index") or columns (1 or "columns"). Default value is 0.
27-
- `level`: If the axis is a `MultiIndex`, group by a particular level or levels. Value is int or level name, or sequence of them. Default value is `None`.
28-
- `as_index`: Boolean value. `True` returns group labels as an index in aggregated output, and `False` returns labels as `DataFrame` columns. Default value is `True`.
29-
- `sort`: Boolean value. `True` sorts the group keys. Default value is `True`.
30-
- `group_keys`: Boolean value. Add group keys to index when calling apply. Default value is `True`.
31-
- `observed`: Boolean value. If `True`, only show observed values for categorical groupers, otherwise show all values. Default value is `False`.
32-
- `dropna`: Boolean value. If `True`, drop groups whose keys contain `NA` values. If `False`, `NA` will be used as a key for those groups. Default value is `True`.
26+
- `axis`: Split along rows (`0` or `"index"`) or columns (`1` or `"columns"`).
27+
- `level`: If the axis is a `MultiIndex`, group by a particular level or levels. Value is an integer or level name, or a sequence of them.
28+
- `as_index`: Boolean value. `True` returns group labels as an index in aggregated output, and `False` returns labels as `DataFrame` columns.
29+
- `sort`: Boolean value. `True` sorts the group keys.
30+
- `group_keys`: Boolean value. If `True`, add group keys to the index when calling `.apply()`.
31+
- `observed`: Boolean value. If `True`, only show observed values for categorical groupers, otherwise show all values.
32+
- `dropna`: Boolean value. If `True`, drop groups whose keys contain `NA` values. If `False`, `NA` will be used as a key for those groups.
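As a minimal sketch of how a few of these parameters change the result, the snippet below compares the default behavior with `as_index=False`, `sort=False`, and `dropna=False`. The sample data and variable names are hypothetical and chosen only for illustration; the `None` entry stands in for a missing group key.

```python
import pandas as pd

# Hypothetical sample data; the None entry acts as a missing (NA) group key
df = pd.DataFrame({
    "Region": ["West", "East", None, "East"],
    "Sales": [200, 250, 400, 300],
})

# Defaults: group keys are sorted, become the index, and NA keys are dropped
default_totals = df.groupby("Region")["Sales"].sum()

# as_index=False returns 'Region' as a regular column instead of the index
flat_totals = df.groupby("Region", as_index=False)["Sales"].sum()

# sort=False keeps groups in order of first appearance;
# dropna=False keeps the NA key as its own group
all_totals = df.groupby("Region", sort=False, dropna=False)["Sales"].sum()

print(default_totals)
print(flat_totals)
print(all_totals)
```

With the defaults, the row with the missing region is excluded, so only `East` and `West` appear. Passing `dropna=False` adds a third group for the missing key.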
3333

34-
## Example
34+
## Example 1: Group by Single Column Using `.groupby()`
3535

36-
This example uses `.groupby()` on a `DataFrame` to produce some aggregate results.
36+
This example uses `.groupby()` to group the data by a single column:
3737

3838
```py
3939
import pandas as pd
4040

41-
df = pd.DataFrame({'Key' : ['A', 'A', 'A', 'B', 'B', 'C'],
42-
'Value' : [15., 23., 17., 5., 8., 12.]})
43-
print(df, end='\n\n')
41+
data = {
42+
'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East'],
43+
'Sales': [250, 200, 300, 400, 150, 500, 100]
44+
}
4445

45-
print(df.groupby(['Key'], as_index=False).mean(), end='\n\n')
46+
df = pd.DataFrame(data)
4647

47-
print(df.groupby(['Key'], as_index=False).sum())
48+
result = df.groupby('Region')['Sales'].sum()
49+
50+
print(result)
51+
```
52+
53+
Here is the output:
54+
55+
```shell
56+
Region
57+
East 650
58+
South 900
59+
West 350
60+
Name: Sales, dtype: int64
61+
```
62+
63+
## Example 2: Group by Multiple Columns Using `.groupby()`
64+
65+
This example uses `.groupby()` to group the data by multiple columns:
66+
67+
```py
68+
import pandas as pd
69+
70+
data = {
71+
'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East'],
72+
'Product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
73+
'Sales': [250, 200, 300, 400, 150, 500, 100]
74+
}
75+
76+
df = pd.DataFrame(data)
77+
78+
result = df.groupby(['Region', 'Product'])['Sales'].sum()
79+
80+
print(result)
4881
```
4982

50-
This produces the following output:
83+
Here is the output:
5184

5285
```shell
53-
Key Value
54-
0 A 15.0
55-
1 A 23.0
56-
2 A 17.0
57-
3 B 5.0
58-
4 B 8.0
59-
5 C 12.0
60-
61-
Key Value
62-
0 A 18.333333
63-
1 B 6.500000
64-
2 C 12.000000
65-
66-
Key Value
67-
0 A 55.0
68-
1 B 13.0
69-
2 C 12.0
86+
Region Product
87+
East A 550
88+
B 100
89+
South A 500
90+
B 400
91+
West A 150
92+
B 200
93+
Name: Sales, dtype: int64
94+
```
95+
96+
## Codebyte Example: Using Aggregate Functions with Pandas `.groupby()`
97+
98+
This codebyte example uses `.groupby()` to group the data and then applies aggregate functions on the grouped data:
99+
100+
```codebyte/python
101+
import pandas as pd
102+
103+
data = {
104+
'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East'],
105+
'Product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
106+
'Sales': [250, 200, 300, 400, 150, 500, 100]
107+
}
108+
109+
df = pd.DataFrame(data)
110+
111+
result = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'max'])
112+
113+
print(result)
70114
```
115+
116+
## Frequently Asked Questions
117+
118+
### 1. When should I use `groupby` in Pandas?
119+
120+
Use `groupby` when you want to split data into groups, apply a function, and combine results. Common operations include computing aggregates like sum, mean, or count per category.
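The split-apply-combine pattern can be sketched as follows, reusing the sales data from the examples in this entry. This is one possible illustration rather than the only approach: `.agg()` returns one row per group, while `.transform()` returns a value for every original row. The `RegionTotal` column name is an assumption made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "West", "East", "South", "West", "South", "East"],
    "Sales": [250, 200, 300, 400, 150, 500, 100],
})

# Split: form one group of 'Sales' values per region
grouped = df.groupby("Region")["Sales"]

# Apply + combine (aggregate): one row per region
summary = grouped.agg(["sum", "mean", "count"])

# Apply + combine (transform): the group total is broadcast back to every row
df["RegionTotal"] = grouped.transform("sum")  # hypothetical column name

print(summary)
print(df)
```

Aggregation suits per-category reports, while a transform is useful when each row needs to be compared with its group, for example to compute a share of the regional total.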
121+
122+
### 2. Is Pandas `groupby` slow?
123+
124+
It can be slow for large datasets, especially if:
125+
126+
- You’re grouping by multiple columns.
127+
- The dataset doesn’t fit in memory.
128+
- You're applying custom Python functions instead of built-ins.
129+
130+
For most medium-sized tasks, it's fast enough. For massive data, look into more efficient libraries like Polars or Dask.
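To illustrate the last point in the list above, here is a small sketch that computes the same result with a built-in aggregation and with a custom Python function. The variable names `builtin_totals` and `custom_totals` are hypothetical. No timings are shown because they depend on the data and the environment, but the built-in form generally avoids calling a Python function once per group.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "West", "East", "South", "West", "South", "East"],
    "Sales": [250, 200, 300, 400, 150, 500, 100],
})

grouped = df.groupby("Region")["Sales"]

# Built-in aggregation referenced by name: handled by pandas' optimized code path
builtin_totals = grouped.agg("sum")

# Custom Python function: invoked once per group, which tends to be slower
custom_totals = grouped.agg(lambda s: s.sum())

print(builtin_totals)
print(custom_totals)
```

Both calls produce the same totals per region, so preferring the built-in name is usually a safe first optimization.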
131+
132+
### 3. Is Polars `groupby` faster than Pandas?
133+
134+
Often, yes. Polars is written in Rust and optimized for speed and parallelism. Its lazy API can also help with larger-than-memory data, which makes it a good fit for performance-critical data tasks.
135+
136+
Key difference in execution model:
137+
138+
- Pandas: mostly single-threaded.
139+
- Polars: multi-threaded, faster `groupby` and aggregation.
140+
141+
If performance is a bottleneck, switching to Polars is worth considering.
