Clean the Flora Danica dataset#
Script Summary#
The Flora Danica dataset contains more than 3000 .tiff files and a metadata file in xlsx format: Index_FloraDanica.xlsx. This notebook cleans the metadata and transforms it into a uniform, tidy format suitable for exploratory analysis.
Steps:
Load the Flora Danica metadata from an Excel file
Inspect the dataset structure and data types
Rename columns to create consistent naming conventions
Extract and standardize table numbers
Process author information and publication details
Clean issue numbers by removing path information
Extract taxonomic group information from structured data
Standardize copyright information
Process note fields and extract relevant information
Create a cleaned subset of the most relevant columns for downstream analysis
Export the cleaned dataset as a CSV file
Outputs: A cleaned and standardized CSV file (flora_danica_tidy_format.csv) ready for visualization and further analysis.
Dataset: You can find the metadata file on the library’s open access repository (LOAR).
Import the libraries#
import pandas as pd
import re
import requests
Make the metadata “tidy”#
The metadata comes from a data dump from the library’s digital collections and is not well suited for data analysis in its current format. At first glance the dataset appears well-structured, but a closer look at the stored values shows that several columns pack more than one piece of information into a single field. Cleaning the data makes that information available for analysis.
The following section examines the data.
# Load the file with Flora Danica metadata
print ('Loading data')
df = pd.read_excel(r'mekuni_flora_danica_data/Index_FloraDanica.xlsx')
print (f'Done. Dataframe shape: {df.shape}')
print ('Inspect the first two rows of data.')
df.head(2)
Loading data
Done. Dataframe shape: (3240, 10)
Inspect the first two rows of data.
| | Record Name | Titel | Opstilling | Lokalitet | Ophav | År | Note | Taxonomisk gruppe | Hæfte | Copyright |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | floradanica_0001.tif | Rubus Chamaemorus Linn. | Fol. Top. Bot. Danmark | Danmark\nNorge | Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... | 1761 | Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... | Digitale Samlinger: Digitale Samlinger: Billed... | Digitale Samlinger: Billeder:Særudgivelser:Flo... | Materialet er fri af ophavsret |
| 1 | floradanica_0002.tif | Pedicularis lapponica L. | Fol. Top. Bot. Danmark | Danmark\nNorge | Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... | 1761 | Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... | Digitale Samlinger: Digitale Samlinger: Billed... | Digitale Samlinger: Billeder:Særudgivelser:Flo... | Materialet er fri af ophavsret |
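The summary lists inspecting the dataset structure and data types as a step. A minimal sketch of that inspection follows, using a two-column toy frame built from values in the preview above rather than the real Excel file. The names `sample`, `dtypes` and `missing` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the loaded metadata (assumption: only two of the ten columns)
sample = pd.DataFrame({
    "Record Name": ["floradanica_0001.tif", "floradanica_0002.tif"],
    "År": [1761, 1761],
})

# Data type of each column
dtypes = sample.dtypes
print(dtypes)

# Number of missing values per column
missing = sample.isna().sum()
print(missing)
```

On the real dataframe the same two calls would be `df.dtypes` and `df.isna().sum()`.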
Rename columns#
The column names are messy, containing a mix of English and Danish. To rename the columns in the dataframe, the old names are mapped to new names and then the columns are renamed.
# Define the mapping from old column names to new column names
column_rename_mapping = {
"Record Name": "record_name",
"Titel": "title",
"Opstilling": "placement",
"Lokalitet": "location",
"Ophav": "author",
"År": "year",
"Note": "note",
"Taxonomisk gruppe": "taxonomic_group",
"Hæfte": "issue",
"Copyright": "copyright"
}
# Rename the columns using the mapping
df.rename(columns=column_rename_mapping, inplace=True)
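By default, `DataFrame.rename` silently ignores mapping keys that do not match any column, so a typo in the mapping would go unnoticed. The sketch below shows one possible check, run on a shortened toy mapping and frame. `sample`, `missing_keys` and `renamed` are hypothetical names used only for illustration.

```python
import pandas as pd

# Shortened version of the mapping, for illustration only
column_rename_mapping = {
    "Record Name": "record_name",
    "Titel": "title",
    "År": "year",
}

sample = pd.DataFrame({
    "Record Name": ["floradanica_0001.tif"],
    "Titel": ["Rubus Chamaemorus Linn."],
    "År": [1761],
})

# Keys in the mapping that do not exist as columns (should be empty)
missing_keys = set(column_rename_mapping) - set(sample.columns)
print(f"Mapping keys without a matching column: {missing_keys}")

# Rename without modifying the original frame
renamed = sample.rename(columns=column_rename_mapping)
print(list(renamed.columns))
```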
Table number: add information about the table numbers#
Each image in Flora Danica is marked with a table number, often at the top of the image. The dataset has 3240 rows, one per image, so a column with the table numbers can be created by taking the index numbers, which start at 0, and adding one.
# Modify the dataframe and add the column called table_no
print('Modify the dataframe and add the column called table_no')
df = df.reset_index().rename(columns={'index': 'table_no'})
df['table_no'] = df['table_no'] + 1
df.head(2)
Modify the dataframe and add the column called table_no
| | table_no | record_name | title | placement | location | author | year | note | taxonomic_group | issue | copyright |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | floradanica_0001.tif | Rubus Chamaemorus Linn. | Fol. Top. Bot. Danmark | Danmark\nNorge | Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... | 1761 | Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... | Digitale Samlinger: Digitale Samlinger: Billed... | Digitale Samlinger: Billeder:Særudgivelser:Flo... | Materialet er fri af ophavsret |
| 1 | 2 | floradanica_0002.tif | Pedicularis lapponica L. | Fol. Top. Bot. Danmark | Danmark\nNorge | Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... | 1761 | Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... | Digitale Samlinger: Digitale Samlinger: Billed... | Digitale Samlinger: Billeder:Særudgivelser:Flo... | Materialet er fri af ophavsret |
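Deriving the table number from the row index assumes that the rows are stored in plate order. The note values shown above begin with text such as "Flora Danica Hft. 1, Tab. 1", so one possible cross-check is to read the number after "Tab." and compare it with `table_no`. This is only a sketch: it assumes every note follows that pattern, which the preview does not confirm, and `sample`, `tab_from_note` and `mismatches` are illustrative names.

```python
import pandas as pd

# Toy rows based on the two previewed records (note text shortened)
sample = pd.DataFrame({
    "table_no": [1, 2],
    "note": [
        "Flora Danica Hft. 1, Tab. 1\n\nFigur 1",
        "Flora Danica Hft. 1, Tab. 2\n\nFigur 1",
    ],
})

# Pull the digits that follow "Tab." and convert them to numbers;
# notes without a match become NaN
tab_from_note = pd.to_numeric(
    sample["note"].str.extract(r"Tab\.\s*(\d+)", expand=False),
    errors="coerce",
)

# Rows where the number in the note disagrees with the index-based number
mismatches = sample[tab_from_note != sample["table_no"]]
print(f"Rows with differing table numbers: {len(mismatches)}")
```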
Column ‘issue’: Clean it and add a new column called ‘issue_st’#
The rows in the ‘issue’ column contain this long string: Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte: .
The string is a path that indicates the location in the Digital Collections. The path information is not needed. The data that is needed is what remains when the long string is removed.
This can be achieved by replacing the long string with an empty string. A function called clean_issue does this with Python's string method .replace(). The function is then applied to the ‘issue’ column, and the result is added to the dataframe in a new column called issue_st.
print (df_w_year_author.at[0,'issue'])
long_string = df_w_year_author.at[0,'issue']
clean_string = long_string.replace('Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:', '')
print(clean_string)
Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:Hft. 1
Hft. 1
def clean_issue(text_string_in):
    # Remove the path prefix and keep only the issue label
    text_string_out = text_string_in.replace('Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:', '')
    return text_string_out
df_w_year_author['issue_st'] = df_w_year_author['issue'].apply(clean_issue)
df_w_year_author.head(2)
| | table_no | record_name | title | placement | location | author | year | note | taxonomic_group | issue | copyright | author_st | issue_st |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | floradanica_0001.tif | Rubus Chamaemorus Linn. | Fol. Top. Bot. Danmark | Danmark\nNorge | Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... | 1761 | Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... | Digitale Samlinger: Digitale Samlinger: Billed... | Digitale Samlinger: Billeder:Særudgivelser:Flo... | Materialet er fri af ophavsret | G. C. Oeder | Hft. 1 |
| 1 | 2 | floradanica_0002.tif | Pedicularis lapponica L. | Fol. Top. Bot. Danmark | Danmark\nNorge | Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... | 1761 | Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... | Digitale Samlinger: Digitale Samlinger: Billed... | Digitale Samlinger: Billeder:Særudgivelser:Flo... | Materialet er fri af ophavsret | G. C. Oeder | Hft. 1 |
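The same cleaning can also be written with pandas' vectorized string methods, and a numeric issue number can be derived from labels such as "Hft. 1" for sorting. The sketch below is an alternative, not the notebook's method. It assumes every issue label contains a number, and the value "Hft. 12" and the column name `issue_no` are hypothetical.

```python
import pandas as pd

prefix = "Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:"

# Toy data: the first value is from the notebook, the second is made up
sample = pd.DataFrame({"issue": [prefix + "Hft. 1", prefix + "Hft. 12"]})

# Literal (non-regex) removal of the path prefix
sample["issue_st"] = sample["issue"].str.replace(prefix, "", regex=False)

# First run of digits in the label as a number; labels without digits become NaN
sample["issue_no"] = pd.to_numeric(
    sample["issue_st"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)
print(sample[["issue_st", "issue_no"]])
```

A numeric column sorts issue 2 before issue 12, which the text labels would not.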
Column name: taxonomic_group#
The values in the column are a bit messy.
They contain both information about the name of the collection and information about the taxonomic group to which the plant in the image belongs.
df_w_year_author.at[0, 'taxonomic_group']
'Digitale Samlinger: Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Taxonomisk gruppe:Karplanter'
Extract the relevant information and add it to a new column#
The actual information about taxonomy would be “Karplanter” (Vascular plants), while the rest of the text string can be considered noise.
The following section extracts the relevant data and adds it to a new column.
# Access a single value
S = df_w_year_author.at[0, 'taxonomic_group']
# Split on ':' and take the last element of the list (the information we actually want)
group_val = S.split(':')[-1]
group_val
'Karplanter'
A function wraps this step and is applied to the whole column, and the result is added as a new column. Inspecting the new column shows that most plants belong to the taxonomic group “Karplanter” (Vascular plants).
def get_taxonomy_data(S):
    # Keep only the text after the last ':' in the path
    group_val = S.split(':')[-1]
    return group_val
df_w_year_author['taxonomic_group_st'] = df_w_year_author['taxonomic_group'].apply(get_taxonomy_data)
# Inspect data in the new column
print (df_w_year_author['taxonomic_group_st'].value_counts())
taxonomic_group_st
Karplanter 2073
Svampe 391
Mosser 331
Alger 228
Laver 163
Slimsvampe 39
Ukendt 15
Taxonomisk gruppe 1
Lave 1
Name: count, dtype: int64
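The counts above include two labels that each occur once, "Lave" and "Taxonomisk gruppe". A possible follow-up is to map such values to a standard label with `Series.replace`. The mapping in the sketch below is a guess, not something the source confirms: it treats "Lave" as a variant of "Laver" and "Taxonomisk gruppe" as a missing value. Both rows should be checked against the records before anything is changed.

```python
import pandas as pd

# Toy column containing the two one-off labels from the counts above
groups = pd.Series(["Karplanter", "Laver", "Lave", "Taxonomisk gruppe", "Ukendt"])

# Hypothetical corrections (assumptions, to be verified against the data)
corrections = {
    "Lave": "Laver",
    "Taxonomisk gruppe": "Ukendt",
}

# Replace exact matches only; all other labels are left untouched
groups_clean = groups.replace(corrections)
counts = groups_clean.value_counts()
print(counts)
```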
Column copyright: modify text string#
The values in the “copyright” column are all the same. They consist of a text string that says “Materialet er fri af ophavsret” (The material is free of copyright).
The text is changed to the shorter “free”.
print (df_w_year_author.at[0, 'copyright'])
text_string = df_w_year_author.at[0, 'copyright']
new_text_string = text_string.replace('Materialet er fri af ophavsret', 'free')
print (new_text_string)
Materialet er fri af ophavsret
free
def clean_copyright(S):
    # Shorten the Danish copyright statement to 'free'
    new_text_string = S.replace('Materialet er fri af ophavsret', 'free')
    return new_text_string
df_w_year_author['copyright'] = df_w_year_author['copyright'].apply(clean_copyright)
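The claim that every row carries the same copyright text can be checked before the values are replaced. The sketch below lists the distinct values and then maps the known statement to "free", so that any unexpected value would show up as missing instead of passing through unchanged. The toy series and the names `distinct_values`, `copyright_short` and `unmapped` are illustrative.

```python
import pandas as pd

statement = "Materialet er fri af ophavsret"

# Toy column standing in for df_w_year_author['copyright']
copyright_col = pd.Series([statement] * 3)

# Distinct values actually present in the column
distinct_values = copyright_col.unique().tolist()
print(distinct_values)

# Map the known statement; anything not in the dict becomes NaN
copyright_short = copyright_col.map({statement: "free"})
unmapped = int(copyright_short.isna().sum())
print(f"Values without a mapping: {unmapped}")
```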
Column name: note#
Other columns in the dataset also hold several pieces of information in a single value. One example is the ‘note’ column.
The following section examines the values in the first row of this column.
print (f'\nNote:\n{df_w_year_author.at[0,"note"]}\n\n')
Note:
Flora Danica Hft. 1, Tab. 1
Figur 1
Latinsk navn: Rubus chamaemorus L.
Dansk slægtsnavn: Multebær
Dansk familienavn: Rosenfamilien
Latinsk familienavn: Rosaceae
Below, the values in the ‘note’ column are searched with regular expressions for specific labelled fields: the Latin family name, the Latin name, the Lange nomenclature entry (“Lange nomenklator”), and the Danish genus name, family name and species epithet. The parse_notes function extracts the data and returns the Latin name and the Lange entry as one combined value. The result is added as new columns to the original DataFrame.
# Function that finds the relevant data from the values in the 'note' column
print ('Find relevant data in the values in the "note" column')
def parse_notes(note):
    # Start with default values
    latin_family_name = None
    latin_name = None
    lange_nomenclature = None
    danish_genus_name = None
    danish_family_name = None
    danish_species_epithet = None
    # Regular expressions that find relevant information
    latin_family_name_pattern = r'Latinsk familienavn:\s*(.*)'
    latin_name_pattern = r'Latinsk navn:\s*(.*)'
    lange_nomenclature_pattern = r'Lange nomenklator:\s*(.*)'
    danish_genus_name_pattern = r'Dansk slægtsnavn:\s*(.*)'
    danish_family_name_pattern = r'Dansk familienavn:\s*(.*)'
    danish_species_epithet_pattern = r'Dansk artsepitet:\s*(.*)'
    # Search for relevant information
    latin_family_name_match = re.search(latin_family_name_pattern, note)
    if latin_family_name_match:
        latin_family_name = latin_family_name_match.group(1).strip()
    latin_name_match = re.search(latin_name_pattern, note)
    if latin_name_match:
        latin_name = latin_name_match.group(1).strip()
    lange_nomenclature_match = re.search(lange_nomenclature_pattern, note)
    if lange_nomenclature_match:
        lange_nomenclature = lange_nomenclature_match.group(1).strip()
    danish_genus_name_match = re.search(danish_genus_name_pattern, note)
    if danish_genus_name_match:
        danish_genus_name = danish_genus_name_match.group(1).strip()
    danish_family_name_match = re.search(danish_family_name_pattern, note)
    if danish_family_name_match:
        danish_family_name = danish_family_name_match.group(1).strip()
    danish_species_epithet_match = re.search(danish_species_epithet_pattern, note)
    if danish_species_epithet_match:
        danish_species_epithet = danish_species_epithet_match.group(1).strip()
    # Combine Latin name and Lange nomenclature:
    # fall back to the Lange entry when the Latin name is missing or '-'
    combined_latin_name = latin_name if latin_name not in (None, '-') else lange_nomenclature
    return latin_family_name, combined_latin_name, danish_genus_name, danish_family_name, danish_species_epithet
# Apply the function to the dataframe
parsed_data = df['note'].apply(parse_notes)
df_parsed = pd.DataFrame(parsed_data.tolist(), columns=['latin_family_name', 'latin_name', 'danish_genus_name', 'danish_family_name', 'danish_species_epithet'])
print ('Done')
# Add the columns parsed from the "note" column to the main dataframe
print ('Concatenate the data with the original dataframe')
concat_df = pd.concat([df_w_year_author,df_parsed], axis=1)
print (f'Done. Dataframe shape: {concat_df.shape}')
Find relevant data in the values in the "note" column
Done
Concatenate the data with the original dataframe
Done. Dataframe shape: (3242, 19)
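`pd.concat(..., axis=1)` aligns rows on index labels and keeps the union of both indexes, so two frames whose indexes differ produce more rows than either input. Whether that explains the 3242 rows reported above, compared with the 3240 rows loaded, is an assumption that cannot be confirmed from the output alone. The sketch below shows the effect on toy data, together with a vectorized alternative that uses `Series.str.extract` and keeps the original index. The second note, the index values and all variable names are hypothetical.

```python
import pandas as pd

# Toy notes with a non-default index (assumption, for illustration)
notes = pd.Series(
    [
        "Flora Danica Hft. 1, Tab. 1\n\nFigur 1\n"
        "Latinsk navn: Rubus chamaemorus L.\n"
        "Dansk slægtsnavn: Multebær\n"
        "Dansk familienavn: Rosenfamilien\n"
        "Latinsk familienavn: Rosaceae",
        "Flora Danica Hft. 1, Tab. 2",  # made-up note without labelled fields
    ],
    index=[10, 11],
)

# New column name -> label used in the note text
labels = {
    "latin_family_name": "Latinsk familienavn",
    "latin_name": "Latinsk navn",
    "danish_genus_name": "Dansk slægtsnavn",
    "danish_family_name": "Dansk familienavn",
}

# One extract per label; each result keeps the index of `notes`
parsed = pd.DataFrame({
    column: notes.str.extract(rf"{label}:\s*(.*)", expand=False).str.strip()
    for column, label in labels.items()
})

# Same index on both sides: the row count is unchanged
combined = pd.concat([notes.rename("note"), parsed], axis=1)
print(combined.shape)

# A fresh 0..n-1 index on one side: rows no longer line up and the union is kept
misaligned = pd.concat([notes.rename("note"), parsed.reset_index(drop=True)], axis=1)
print(misaligned.shape)
```

In the notebook's own approach, passing `index=parsed_data.index` when building `df_parsed` would be one way to keep the labels aligned.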
Create a “subset” and save it as a CSV file#
The dataset is now cleaner and better organized than before, which makes it easier to use for analysis and visualizations.
However, only some of the columns are needed for further work. A subset of the columns is therefore selected and saved as a CSV file.
subset_df = concat_df[['table_no', 'record_name', 'title', 'year', 'author_st', 'taxonomic_group_st', 'issue', 'latin_family_name', 'latin_name', 'danish_genus_name', 'danish_family_name', 'danish_species_epithet', 'copyright']]
subset_df.to_csv('mekuni_flora_danica_data/flora_danica_tidy_format.csv', index=False)
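After exporting, the file can be read back to confirm that the columns and row count survived the round trip. The sketch below writes a toy frame to a temporary directory instead of the project folder. `subset`, `out_path` and `reloaded` are illustrative names, and the explicit `encoding="utf-8"` simply states pandas' default.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy subset based on the two previewed records
subset = pd.DataFrame({
    "table_no": [1, 2],
    "title": ["Rubus Chamaemorus Linn.", "Pedicularis lapponica L."],
    "year": [1761, 1761],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "flora_danica_tidy_format.csv"
    # index=False keeps the row index out of the file
    subset.to_csv(out_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == subset.shape
same_columns = list(reloaded.columns) == list(subset.columns)
print(f"Same shape: {same_shape}, same columns: {same_columns}")
```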
Other studies#
The cleaned Flora Danica metadata invites several complementary lines of inquiry:
Analyze temporal patterns in plant documentation by examining publication dates and author contributions over time to understand the evolution of botanical knowledge.
Investigate taxonomic diversity by exploring the distribution of plant families, genera, and species across the collection.
Explore author contributions by examining which authors documented which types of plants and whether certain authors specialized in particular taxonomic groups.
Investigate geographic patterns if location data is available, to understand regional plant documentation.
Analyze the relationship between publication issues and taxonomic groups to identify organizational patterns in the collection.
Cross-reference the metadata with the actual TIFF images to create image-text analysis workflows.
Explore copyright and publication information to understand the historical context and accessibility of the botanical illustrations.
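As a starting point for the temporal and taxonomic questions above, a cross-tabulation counts the plates per year and taxonomic group. The sketch below uses a four-row toy frame. The 1764 rows and their group labels are made up for illustration, and only the column names follow the cleaned dataset.

```python
import pandas as pd

# Toy data: 1761 and "Karplanter" appear in the notebook, the rest is hypothetical
toy = pd.DataFrame({
    "year": [1761, 1761, 1764, 1764],
    "taxonomic_group_st": ["Karplanter", "Karplanter", "Svampe", "Karplanter"],
})

# Rows: publication year, columns: taxonomic group, cells: number of plates
plates_per_year_and_group = pd.crosstab(toy["year"], toy["taxonomic_group_st"])
print(plates_per_year_and_group)
```

On the exported CSV, the same call would take the `year` and `taxonomic_group_st` columns of the reloaded dataframe.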