One of the last steps in establishing the interface between a module and the 'outside' is defining units metadata for .csv files.
This is the only file type we handle that has this issue, since netCDF files have units as standardized metadata.
Based on several tests, I believe the best approach is to add a second 'header' row that states the unit of each column, using No Unit to mark columns with text values, ratios, etc.
Reasoning:
- Removing a second header row is easy in pandas
- Fits well with the 'tidy' dataframe approach.
- It produces far less data overhead than adding an extra units column, especially for timeseries.
- calliope v0.7 is able to skip rows easily, so it avoids extra processing on the modelling side.
- Allows module devs to easily use pint for unit conversion, if they choose to.
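The overhead point can be checked with a rough sketch like the one below. The series name `demand`, the unit `MW`, and the row count are made up for illustration; the idea is only that a units header row costs one line, while a units column repeats the unit string on every row.

```python
import pandas as pd

# Hypothetical time series; name, unit and length are illustrative assumptions.
n_rows = 1000
values = pd.Series(range(n_rows), name="demand", dtype="float64")

# Option 1: units as a second header row (a fixed cost, independent of length).
with_header = values.to_frame()
with_header.columns = pd.MultiIndex.from_tuples(
    [("demand", "MW")], names=["attribute", "units"]
)
with_header.index.name = "index"
header_csv = with_header.to_csv()

# Option 2: units as an extra column (the unit string is written on every row).
with_column = values.to_frame().assign(units="MW")
with_column.index.name = "index"
column_csv = with_column.to_csv()

extra_bytes = len(column_csv) - len(header_csv)
print(f"The units column costs {extra_bytes} more bytes than the units header row")
```

The gap grows linearly with the number of rows, which is why the difference matters most for long timeseries.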
Examples
Picture this table:
attribute,country,vehicle_type,vehicle_subtype,carrier,year,TotalEnergyConsumption
units,No Unit,No Unit,No Unit,No Unit,years,ktoe
index,,,,,,
0,DEU,Powered two-wheelers,Gasoline engine,Gasoline,2000,476.0664153213859
1,DEU,Powered two-wheelers,Gasoline engine,BioGasoline,2000,0.0
2,DEU,Passenger cars,Gasoline engine,Gasoline,2000,29431.27094818872
3,DEU,Passenger cars,Gasoline engine,BioGasoline,2000,0.0
4,DEU,Passenger cars,Diesel oil engine,Diesel,2000,8255.653519490721
| attribute | country | vehicle_type | vehicle_subtype | carrier | year | TotalEnergyConsumption |
| --- | --- | --- | --- | --- | --- | --- |
| units | No Unit | No Unit | No Unit | No Unit | years | ktoe |
| index | | | | | | |
| 0 | DEU | Powered two-wheelers | Gasoline engine | Gasoline | 2000 | 476.0664153213859 |
| 1 | DEU | Powered two-wheelers | Gasoline engine | BioGasoline | 2000 | 0.0 |
| 2 | DEU | Passenger cars | Gasoline engine | Gasoline | 2000 | 29431.27094818872 |
| 3 | DEU | Passenger cars | Gasoline engine | BioGasoline | 2000 | 0.0 |
| 4 | DEU | Passenger cars | Diesel oil engine | Diesel | 2000 | 8255.653519490721 |
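For module developers wondering how to produce a file in this layout, here is a minimal sketch using only pandas. It assumes the units are kept in a plain dict (the `units` mapping below is hypothetical) and uses two rows and three columns from the example table to stay short.

```python
import pandas as pd

# Two rows and three columns taken from the example table.
data = pd.DataFrame(
    {
        "country": ["DEU", "DEU"],
        "year": [2000, 2000],
        "TotalEnergyConsumption": [476.0664153213859, 0.0],
    }
)

# Hypothetical mapping from column name to unit string.
units = {"country": "No Unit", "year": "years", "TotalEnergyConsumption": "ktoe"}

# Pair every column name with its unit to build the two-level header.
data.columns = pd.MultiIndex.from_tuples(
    [(name, units[name]) for name in data.columns],
    names=["attribute", "units"],
)
data.index.name = "index"

csv_text = data.to_csv()
print(csv_text)
```

Naming the column levels and the index is what makes pandas write the leading attribute, units and index labels seen in the raw file.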
Loading and removing the second header:
If you do not want to use any fancy libraries to handle units, cleaning the data is trivial:
import pandas as pd
# Read both header rows, then drop the units level to get plain column names.
data = pd.read_csv("tmp/test2.csv", header=[0, 1], index_col=0)
data.columns = data.columns.droplevel("units")
data.head()
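If the units should not be thrown away entirely, one option is to copy them into a dict before dropping the level. The sketch below inlines a shortened copy of the example file so it runs on its own; the `units` dict is an illustrative addition, not part of the proposal itself.

```python
import io

import pandas as pd

# Shortened copy of the example file, inlined so the sketch is self-contained.
csv_text = """attribute,country,year,TotalEnergyConsumption
units,No Unit,years,ktoe
index,,,
0,DEU,2000,476.0664153213859
1,DEU,2000,0.0
"""

data = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Each column label is an (attribute, units) pair, so dict() yields name -> unit.
units = dict(data.columns)

# Drop the units level to get a plain single-header dataframe.
data.columns = data.columns.droplevel("units")

print(units["TotalEnergyConsumption"])
print(list(data.columns))
```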

Automatic unit conversion with pint
If you want to be fancy (and lazy), you can just as easily use pint to do all the unit heavy lifting for you.
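import pandas as pd
import pint_pandas  # registers the .pint accessor on dataframes
data = pd.read_csv("tmp/test2.csv", header=[0, 1], index_col=0)
# level=-1 takes the units from the last header row
data = data.pint.quantify(level=-1)
data.head()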


data['TotalEnergyConsumption'].pint.to_base_units()
