Commit 606ed4c

Merge pull request #305 from rhaas80/gh-pages
_extras/guide.md: update to match episodes' contents
2 parents 9a7e9f6 + 404a11b commit 606ed4c

1 file changed

Lines changed: 61 additions & 82 deletions

_extras/guide.md

@@ -101,7 +101,7 @@ rev[2] = "apple-sauce"
101101
~~~
102102
{: .language-python}
103103

104-
## 01-starting-with-data
104+
## 02-starting-with-data
105105

106106
### Bug Note:
107107

@@ -111,27 +111,27 @@ Pandas < .18.1 has a bug where surveys_df['weight'].describe() may return a runt
111111

112112
* `surveys_df.columns`
113113

114-
column names (optional: show `surveys_df.columns[4] = "siteid"` The index is not mutable; recap of previous lesson. Adapting the name is done by `rename` function `surveys_df.rename(columns={"site_id": "siteid"})`)
114+
column names (optional: show `surveys_df.columns[4] = "plotid"` The index is not mutable; recap of previous lesson. Adapting the name is done by `rename` function `surveys_df.rename(columns={"plot_id": "plotid"})`)
115115

116116
* `surveys_df.head()`. Also, what does `surveys_df.head(15)` do?
117117

118-
Show first `N` lines
118+
`head()` shows the first 5 rows; `head(15)` shows the first 15 rows.
119119

120120
* `surveys_df.tail()`
121121

122-
Show last `N` lines
122+
Show last 5 lines
123123

124124
* `surveys_df.shape`. Take note of the output of the shape method. What format does it return the shape of the DataFrame in?
125125

126126
`type(surveys_df.shape)` -> `tuple`, of the form (number of rows, number of columns)
127127

128128
### Calculating Statistics Challenges
129129

130-
* Create a list of unique site ID's found in the surveys data. Call it `site_names`. How many unique sites are in the data? How many unique species are in the data?
130+
* Create a list of unique plot ID's found in the surveys data. Call it `plot_names`. How many unique plots are in the data? How many unique species are in the data?
131131

132-
`site_names = pd.unique(surveys_df["site_id"])` Number of unique site ID's: `site_names.size` or `len(site_names)`; Number of unique species in the data: `len(pd.unique(surveys_df["species"]))`
132+
`plot_names = pd.unique(surveys_df["plot_id"])` Number of unique plot ID's: `plot_names.size` or `len(plot_names)`; Number of unique species in the data: `len(pd.unique(surveys_df["species"]))`
133133

134-
* What is the difference between `len(site_names)` and `surveys_df['site_id'].nunique()`?
134+
* What is the difference between `len(plot_names)` and `surveys_df['plot_id'].nunique()`?
135135

136136
Both give the same result here, so they are alternative ways of counting the unique values. `nunique` combines the unique value extraction and the count in a single call.
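
The two approaches can differ when a column contains missing values. The following is a minimal sketch on hypothetical toy data (the `species` Series is made up for illustration and is not part of the surveys data): `pd.unique` keeps `NaN` as a distinct value, while `nunique` skips it unless `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (not the real surveys data).
species = pd.Series(["DM", "PB", "DM", np.nan, "PL"])

unique_values = pd.unique(species)   # distinct values, NaN included
n_with_nan = len(unique_values)      # NaN is counted as a value
n_without_nan = species.nunique()    # dropna=True by default, NaN is skipped

print(n_with_nan, n_without_nan)
```

For a column without missing values, such as `plot_id` in this lesson, both counts agree.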
137137

@@ -143,19 +143,19 @@ Both do result in the same output, making it alternative ways of getting the uni
143143

144144
* What happens when you group by two columns using the following syntax and then grab mean values?
145145

146-
The mean value for each combination of site and sex is calculated. Remark that the mean does not make sense for each variable, so you can specify this column-wise: e.g. I want to know the last survey year, median foot-length and mean weight for each site/sex combination:
146+
The mean value for each combination of plot and sex is calculated. Note that the mean is not meaningful for every variable, so you can specify the aggregation per column: e.g. to get the first survey year, median hindfoot length and mean weight for each plot/sex combination:
147147

148148
~~~
149-
surveys_df.groupby(['site_id','sex']).agg({"year": 'min',
149+
surveys_df.groupby(['plot_id','sex']).agg({"year": 'min',
150150
"hindfoot_length": 'median',
151151
"weight": 'mean'})
152152
~~~
153153
{: .language-python}
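
As a small illustration of per-column aggregation, here is a sketch on hypothetical toy data (the `toy` DataFrame and its values are invented; only the column names follow the lesson). Note that `'min'` on `year` returns the earliest survey year, whereas `'max'` would return the latest.

```python
import pandas as pd

# Hypothetical toy data using the lesson's column names.
toy = pd.DataFrame({
    "plot_id": [1, 1, 2],
    "sex": ["M", "M", "F"],
    "year": [1977, 1980, 1990],
    "hindfoot_length": [30.0, 34.0, 20.0],
    "weight": [40.0, 44.0, 10.0],
})

# One aggregation function per column, for each plot/sex combination.
result = toy.groupby(["plot_id", "sex"]).agg({"year": "min",
                                              "hindfoot_length": "median",
                                              "weight": "mean"})
print(result)
```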
154154

155-
* Summarize the weight values for each site in your data.
155+
* Summarize the weight values for each plot in your data.
156156

157157
~~~
158-
surveys_df.groupby(['site_id'])['weight'].describe()
158+
surveys_df.groupby(['plot_id'])['weight'].describe()
159159
~~~
160160
{: .language-python}
161161

@@ -165,14 +165,14 @@ surveys_df.groupby(['site_id'])['weight'].describe()
165165

166166
### Plotting Challenges
167167

168-
* Create a plot of the average weight across all species per site.
168+
* Create a plot of the average weight across all species per plot.
169169

170170
~~~
171-
surveys_df.groupby('site_id').mean()["weight"].plot(kind='bar')
171+
surveys_df.groupby('plot_id').mean()["weight"].plot(kind='bar')
172172
~~~
173173
{: .language-python}
174174

175-
![average weight across all species for each site](../fig/01_chall_bar_meanweight.png)
175+
![average weight across all species for each plot](../fig/01_chall_bar_meanweight.png)
176176

177177
* Create a plot of total males versus total females for the entire dataset.
178178

@@ -182,7 +182,7 @@ surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')
182182
{: .language-python}
183183
![total males versus total females for the entire dataset](../fig/01_chall_bar_totalsex.png)
184184

185-
## 02-index-slice-subset
185+
## 03-index-slice-subset
186186

187187
Tip: use `.head()` method throughout this lesson to keep your display neater for students. Encourage students to try with and without `.head()` to reinforce this useful tool and then to use it or not at their preference. For example, if a student worries about keeping up in pace with typing, let them know they can skip the `.head()`, but that you'll use it to keep more lines of previous steps visible.
188188

@@ -225,9 +225,9 @@ Tip: use `.head()` method throughout this lesson to keep your display neater for
225225
`surveys_df[(surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8)]`; when only interested in how many,
226226
the sum of `True` values could be used as well: `sum((surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8))`
227227

228-
* You can use the `isin` command in Python to query a DataFrame based upon a list of values as follows: `surveys_df[surveys_df['species_id'].isin([listGoesHere])]`. Use the `isin` function to find all sites that contain particular species in the surveys DataFrame. How many records contain these values?
228+
* You can use the `isin` command in Python to query a DataFrame based upon a list of values as follows: `surveys_df[surveys_df['species_id'].isin([listGoesHere])]`. Use the `isin` function to find all plots that contain particular species in the surveys DataFrame. How many records contain these values?
229229

230-
For example, using `PB` and `PL`: `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])]['site_id'].unique()` provides a list of the sites with these species involved. With `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])].shape` the number of records can be derived.
230+
For example, using `PB` and `PL`: `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])]['plot_id'].unique()` provides a list of the plots with these species involved. With `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])].shape` the number of records can be derived.
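
A minimal sketch of the same pattern on hypothetical toy data (the `toy` DataFrame is invented for illustration). Since `.shape` returns a `(rows, columns)` tuple, the number of records is its first element.

```python
import pandas as pd

# Hypothetical toy data using the lesson's column names.
toy = pd.DataFrame({
    "species_id": ["PB", "DM", "PL", "PB", "NL"],
    "plot_id": [2, 2, 3, 3, 1],
})

# Keep only rows whose species_id is in the given list.
selected = toy[toy["species_id"].isin(["PB", "PL"])]

plots = selected["plot_id"].unique()   # plots where these species occur
n_records = selected.shape[0]          # number of matching records

print(plots, n_records)
```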
231231

232232
* Create a query that finds all rows with a weight value greater than or equal to 0.
233233

@@ -255,14 +255,14 @@ print(len(new))
255255

256256
You can verify the number of NaN values with `sum(surveys_df['sex'].isnull())`, which equals the number of records that are neither female nor male.
257257

258-
* Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by site with male vs female values stacked for each site.
258+
* Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.
259259

260260
~~~
261261
# selection of the data with isin
262262
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) &
263-
surveys_df["weight"] > 0.][["sex", "weight", "site_id"]]
264-
# calculate the mean weight for each site id and sex combination:
265-
stack_selection = stack_selection.groupby(["site_id", "sex"]).mean().unstack()
263+
(surveys_df["weight"] > 0.)][["sex", "weight", "plot_id"]]
264+
# calculate the mean weight for each plot id and sex combination:
265+
stack_selection = stack_selection.groupby(["plot_id", "sex"]).mean().unstack()
266266
# and we can make a stacked bar plot from this:
267267
stack_selection.plot(kind='bar', stacked=True)
268268
~~~
@@ -271,21 +271,21 @@ stack_selection.plot(kind='bar', stacked=True)
271271
*Suggestion*: As we know the other values are all NaN values, we could also select all non-null values (just a preview, more on this in the next lesson):
272272
~~~
273273
stack_selection = surveys_df[(surveys_df['sex'].notnull()) &
274-
surveys_df["weight"] > 0.][["sex", "weight", "site_id"]]
274+
(surveys_df["weight"] > 0.)][["sex", "weight", "plot_id"]]
275275
~~~
276276
{: .language-python}
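
When combining conditions with `&`, as in the selections above, keep in mind that in Python `&` binds more tightly than comparison operators such as `>`, so each comparison needs its own parentheses. The sketch below uses hypothetical toy data (the `df` DataFrame is invented) and names the two masks separately, which avoids the issue altogether.

```python
import pandas as pd

# Hypothetical toy data using the lesson's column names.
df = pd.DataFrame({
    "sex": ["M", "F", None, "M"],
    "weight": [40.0, 0.0, 12.0, 7.0],
    "plot_id": [1, 1, 2, 2],
})

# Build each boolean mask on its own line, then combine them.
is_male_or_female = df["sex"].isin(["M", "F"])
is_positive = df["weight"] > 0.0
selection = df[is_male_or_female & is_positive][["sex", "weight", "plot_id"]]

# Equivalent one-liner: every comparison is wrapped in parentheses.
same_selection = df[(df["sex"].isin(["M", "F"])) & (df["weight"] > 0.0)]

print(selection)
```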
277277

278-
![average weight for each site per sex](../fig/02_chall_stack_levelissue.png)
278+
![average weight for each plot per sex](../fig/02_chall_stack_levelissue.png)
279279

280280
However, due to the `unstack` command, the legend header contains two levels. To remove the extra level, simplify the column naming:
281281
~~~
282282
stack_selection.columns = stack_selection.columns.droplevel()
283283
~~~
284284
{: .language-python}
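
To see what `droplevel` does to the columns, here is a small sketch on hypothetical toy data (the `df` DataFrame is invented for illustration). After `unstack`, the columns form a two-level index of (value column, sex); dropping the first level leaves only the sex labels.

```python
import pandas as pd

# Hypothetical toy data using the lesson's column names.
df = pd.DataFrame({
    "plot_id": [1, 1, 2, 2],
    "sex": ["F", "M", "F", "M"],
    "weight": [30.0, 40.0, 10.0, 20.0],
})

means = df.groupby(["plot_id", "sex"]).mean().unstack()
two_level_columns = list(means.columns)   # tuples of (column name, sex)

# Drop the outer level so only the sex labels remain as column names.
means.columns = means.columns.droplevel()
one_level_columns = list(means.columns)

print(two_level_columns, one_level_columns)
```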
285285

286-
![average weight for each site per sex](../fig/02_chall_stack_level.png)
286+
![average weight for each plot per sex](../fig/02_chall_stack_level.png)
287287

288-
## 03-data-types-and-format
288+
## 04-data-types-and-format
289289

290290
### Challenge - Changing Types
291291

@@ -297,9 +297,9 @@ surveys_df.isnull()
297297
If the students have trouble generating the output, or anything happens with that, there is a file
298298
called "sample output" that contains the data file they should generate.
299299

300-
## 04-merging-data
300+
## 05-merging-data
301301

302-
* In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the data into Python and combine the files to make one new data frame. Create a plot of average site weight by year grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.
302+
* In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the data into Python and combine the files to make one new data frame. Create a plot of average plot weight by year grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.
303303

304304
~~~
305305
# read the files:
@@ -332,65 +332,65 @@ merged_left = pd.merge(left=surveys_df,right=species_df, how='left', on="species
332332

333333
Then calculate and plot the distribution of:
334334

335-
**1. taxa per site** (number of species of each taxa per site):
335+
**1. taxa per plot** (number of species of each taxa per plot):
336336

337-
Species distribution (number of taxa for each site) can be derived as follows:
337+
Species distribution (number of taxa for each plot) can be derived as follows:
338338
~~~
339-
merged_left.groupby(["site_id"])["taxa"].nunique().plot(kind='bar')
339+
merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
340340
~~~
341341
{: .language-python}
342342

343-
![taxa per site](../fig/04_chall_ntaxa_per_site.png)
343+
![taxa per plot](../fig/04_chall_ntaxa_per_site.png)
344344

345-
*Suggestion*: It is also possible to plot the number of individuals for each taxa in each site (stacked bar chart):
345+
*Suggestion*: It is also possible to plot the number of individuals for each taxa in each plot (stacked bar chart):
346346
~~~
347-
merged_left.groupby(["site_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
347+
merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
348348
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))
349349
~~~
350350
{: .language-python}
351351
(the legend otherwise overlaps the bar plot)
352352

353-
![taxa per site](../fig/04_chall_taxa_per_site.png)
353+
![taxa per plot](../fig/04_chall_taxa_per_site.png)
354354

355-
**2. taxa by sex by site**:
355+
**2. taxa by sex by plot**:
356356
Replace the NaN values with `M|F` (these could equally be changed to 'x'):
357357
~~~
358358
merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
359359
~~~
360360
{: .language-python}
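
As an alternative worth mentioning to students, `fillna` gives the same result as the `.loc` assignment. This is a sketch on hypothetical toy data (the `toy` DataFrame is invented), not a replacement for the solution above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one record of undetermined sex.
toy = pd.DataFrame({"record_id": [1, 2, 3], "sex": ["M", np.nan, "F"]})

# Approach used in the guide: boolean mask with .loc
via_loc = toy.copy()
via_loc.loc[via_loc["sex"].isnull(), "sex"] = "M|F"

# Alternative: fill missing values directly
via_fillna = toy.copy()
via_fillna["sex"] = via_fillna["sex"].fillna("M|F")

print(via_fillna)
```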
361361

362-
Number of taxa for each site/sex combination:
362+
Number of taxa for each plot/sex combination:
363363
~~~
364-
ntaxa_sex_site= merged_left.groupby(["site_id", "sex"])["taxa"].nunique().reset_index(level=1)
365-
ntaxa_sex_site = ntaxa_sex_plot.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_plot.index)
364+
ntaxa_sex_site= merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
365+
ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
366366
ntaxa_sex_site.plot(kind="bar", legend=False)
367367
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
368368
fontsize='small', frameon=False)
369369
~~~
370370
{: .language-python}
371371

372-
![taxa per site per sex](../fig/04_chall_ntaxa_per_site_sex.png)
372+
![taxa per plot per sex](../fig/04_chall_ntaxa_per_site_sex.png)
373373

374374
*Suggestion (for discussion only)*:
375375

376-
The number of individuals for each taxa in each site per sex can be derived as well.
376+
The number of individuals for each taxa in each plot per sex can be derived as well.
377377

378378
~~~
379-
sex_taxa_site = merged_left.groupby(["site_id", "taxa", "sex"]).count()['record_id']
379+
sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
380380
sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
381381
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
382382
fontsize='small', frameon=False)
383383
~~~
384384
{: .language-python}
385385

386-
![taxa per site per sex](../fig/04_chall_sex_taxa_site_intro.png)
386+
![taxa per plot per sex](../fig/04_chall_sex_taxa_site_intro.png)
387387

388388
This is not the best plot choice, as it is hard to read. A first option to improve it is to make facets. However, pandas/matplotlib do not provide this by default. As a pure matplotlib example (`M|F` is for records without a defined sex):
389389

390390
~~~
391391
fig, axs = plt.subplots(3, 1)
392392
for sex, ax in zip(["M", "F", "M|F"], axs):
393-
sex_taxa_site[sex_taxa_plot["sex"] == sex].plot(kind='bar', ax=ax, legend=False)
393+
sex_taxa_site[sex_taxa_site["sex"] == sex].plot(kind='bar', ax=ax, legend=False)
394394
ax.set_ylabel(sex)
395395
if not ax.is_last_row():
396396
ax.set_xticks([])
@@ -400,26 +400,26 @@ axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
400400
~~~
401401
{: .language-python}
402402

403-
![taxa per site per sex](../fig/04_chall_sex_taxa_site.png)
403+
![taxa per plot per sex](../fig/04_chall_sex_taxa_site.png)
404404

405405
However, it would be better to link to [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) and [Altair](https://github.com/ellisonbg/altair) for this kind of multivariate visualisation.
406406

407-
* In the data folder, there is a site CSV that contains information about the type associated with each site. Use that data to summarize the number of sites by site type.
407+
* In the data folder, there is a plot CSV that contains information about the type associated with each plot. Use that data to summarize the number of plots by plot type.
408408

409409
~~~
410-
site_info = pd.read_csv("data/sites.csv")
411-
site_info.groupby("site_type").count()
410+
plot_info = pd.read_csv("data/plots.csv")
411+
plot_info.groupby("plot_type").count()
412412
~~~
413413
{: .language-python}
414414

415-
* Calculate a diversity index of your choice for control vs rodent exclosure sites. The index should consider both species abundance and number of species. You might choose the simple biodiversity index described here which calculates diversity as `the number of species in the site / the total number of individuals in the site = Biodiversity index.`
415+
* Calculate a diversity index of your choice for control vs rodent exclosure plots. The index should consider both species abundance and number of species. You might choose the simple biodiversity index described here which calculates diversity as `the number of species in the plot / the total number of individuals in the plot = Biodiversity index.`
416416

417417
~~~
418-
merged_site_type = pd.merge(merged_left, site_info, on='site_id')
419-
# For each site, get the number of species for each site
420-
nspecies_site = merged_site_type.groupby(["site_id"])["species"].nunique().rename("nspecies")
421-
# For each site, get the number of individuals
422-
nindividuals_site = merged_site_type.groupby(["site_id"]).count()['record_id'].rename("nindiv")
418+
merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
419+
# For each plot, get the number of species
420+
nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
421+
# For each plot, get the number of individuals
422+
nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
423423
# combine the two series
424424
diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
425425
# calculate the diversity index
@@ -435,10 +435,10 @@ plt.xlabel("Diversity index")
435435
~~~
436436
{: .language-python}
437437

438-
![taxa per site per sex](../fig/04_chall_diversity_index.png)
438+
![taxa per plot per sex](../fig/04_chall_diversity_index.png)
439439

440440

441-
## 05-loops-and-functions
441+
## 06-loops-and-functions
442442

443443
### Basic Loop Challenges
444444

@@ -640,20 +640,20 @@ def yearly_data_csv_writer(all_data, yearcolumn="year",
640640
~~~
641641
{: .language-python}
642642

643-
## 06-plotting-with-ggplot
643+
## 07-plotting-with-ggplot
644644

645645
If the students have trouble generating the output, or anything happens with that, there is a file
646646
called "sample output" that contains the data file they should have generated in lesson 3.
647647

648648
iPython notebooks for plotting can be viewed in the `_extras` folder
649649

650-
## 07-putting-it-all-together
650+
## 08-putting-it-all-together
651651

652652
Scientists often operate on mathematical equations. Being able to use them in their graphics has a
653653
lot of added value. Luckily, Matplotlib provides powerful tools for text control. One of them is the
654654
ability to use LaTeX mathematical notation, whenever text is used (you can learn more about LaTeX
655655
math notation here: https://en.wikibooks.org/wiki/LaTeX/Mathematics). To use mathematical notation,
656-
surround your text using the dollar sign ("$"). LaTeX uses the backslash character ("\") a lot.
656+
surround your text using the dollar sign ("$"). LaTeX uses the backslash character ("\\") a lot.
657657
Since backslash has a special meaning in the Python strings, you should replace all the
658658
LaTeX-related backslashes with two backslashes.
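
A raw string (prefix `r`) is an equivalent alternative to doubling the backslashes. The sketch below uses made-up label text to show that both spellings produce the same string, and what goes wrong without either: `\a` and `\b` are valid Python escape sequences, so they are silently turned into control characters.

```python
# Doubled backslashes and a raw string give the same text.
label_escaped = "$\\alpha > \\beta$"
label_raw = r"$\alpha > \beta$"

# Without escaping, "\a" becomes the bell character and "\b" a backspace.
label_broken = "$\alpha > \beta$"

print(label_escaped == label_raw)
print(repr(label_broken))
```

Either `label_escaped` or `label_raw` can be passed wherever Matplotlib accepts text, such as `plt.title` or `plt.xlabel`.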
659659

@@ -673,31 +673,10 @@ plt.show()
673673
~~~
674674
{: .language-python}
675675

676+
## 09-working-with-sql
676677

677-
[This page](https://matplotlib.org/users/mathtext.html) contains more information.
678+
FIXME
701679

702680

681+
[This page](https://matplotlib.org/users/mathtext.html) contains more information.
703682
