Clean the Flora Danica dataset

Script Summary#

The Flora Danica dataset contains more than 3000 .tiff files together with metadata in an Excel file, Index_FloraDanica.xlsx. This notebook cleans the metadata and transforms it into a uniform, tidy format suitable for exploratory analysis.

Steps:

Load the Flora Danica metadata from an Excel file
Inspect the dataset structure and data types
Rename columns to create consistent naming conventions
Extract and standardize table numbers
Process author information and publication details
Clean issue numbers by removing path information
Extract taxonomic group information from structured data
Standardize copyright information
Process note fields and extract relevant information
Create a cleaned subset of the most relevant columns for downstream analysis
Export the cleaned dataset as a CSV file

Outputs: A cleaned and standardized CSV file (flora_danica_tidy_format.csv) ready for visualization and further analysis.

Dataset: You can find the metadata file on the library’s open access repository (LOAR).

Import the libraries#

import pandas as pd
import re

Make the metadata “tidy”#

The metadata comes from a data dump from the library’s digital collections, and in its current format it is not well suited for data analysis. At first glance the dataset appears well-structured, but a closer look at what each column actually stores shows that several columns pack multiple pieces of information into one value. Cleaning the data makes that information available for analysis.

The following section examines the data.

# Load the file with Flora Danica metadata
print('Loading data')
df = pd.read_excel(r'mekuni_flora_danica_data/Index_FloraDanica.xlsx')
print(f'Done. Dataframe shape: {df.shape}')
print('Inspect the first two rows of data.')

df.head(2)

Loading data

Done. Dataframe shape: (3240, 10)
Inspect the first two rows of data.

	Record Name	Titel	Opstilling	Lokalitet	Ophav	År	Note	Taxonomisk gruppe	Hæfte	Copyright
0	floradanica_0001.tif	Rubus Chamaemorus Linn.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret
1	floradanica_0002.tif	Pedicularis lapponica L.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret
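The step list mentions inspecting the structure and data types, which `head()` alone does not show. The following is a minimal sketch of such an inspection. The `summarize` helper and the two-row `sample` frame are stand-ins introduced here for illustration; on the real data the helper would be called on `df`.

```python
import pandas as pd

# Tiny stand-in for the real metadata (two rows, three of the ten columns)
sample = pd.DataFrame({
    "Record Name": ["floradanica_0001.tif", "floradanica_0002.tif"],
    "Titel": ["Rubus Chamaemorus Linn.", "Pedicularis lapponica L."],
    "År": [1761, 1761],
})


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of missing values, number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })


summary = summarize(sample)
print(summary)
```

A column with very few distinct values, or one whose dtype is text where numbers were expected, is usually a good candidate for closer inspection.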

Rename columns#

The column names are messy: they mix English and Danish and contain spaces and capital letters. A dictionary maps each old name to a new English name in lowercase with underscores, and the dataframe columns are then renamed with that mapping.

# Define the mapping from old column names to new column names
column_rename_mapping = {
    "Record Name": "record_name",
    "Titel": "title",
    "Opstilling": "placement",
    "Lokalitet": "location",
    "Ophav": "author",
    "År": "year",
    "Note": "note",
    "Taxonomisk gruppe": "taxonomic_group",
    "Hæfte": "issue",
    "Copyright": "copyright"
}

# Rename the columns using the mapping
df.rename(columns=column_rename_mapping, inplace=True)
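By default, `rename` silently ignores mapping keys that do not match any column, so a typo in an old name goes unnoticed. A minimal sketch of a stricter variant, using the `errors="raise"` option of `DataFrame.rename` on a small stand-in frame (the frame and the misspelled key are made up for illustration):

```python
import pandas as pd

# Stand-in frame with two of the original column names
sample = pd.DataFrame({
    "Record Name": ["floradanica_0001.tif"],
    "Titel": ["Rubus Chamaemorus Linn."],
})
mapping = {"Record Name": "record_name", "Titel": "title"}

# errors="raise" makes pandas fail if a key in the mapping is not a column
renamed = sample.rename(columns=mapping, errors="raise")
print(list(renamed.columns))

# A misspelled old name ("Title" instead of "Titel") now raises a KeyError
try:
    sample.rename(columns={"Title": "title"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```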

Table number: add information about the table numbers#

Each image in Flora Danica is marked with a table number, usually at the top of the plate. The dataset has 3240 records, and the first rows show that they follow the plate order (floradanica_0001.tif is Tab. 1, floradanica_0002.tif is Tab. 2). A column with table numbers can therefore be created by taking the index, which starts at 0, and adding one.

# Modify the dataframe and add the column called table_no  
print('Modify the dataframe and add the column called table_no')  
df = df.reset_index().rename(columns={'index': 'table_no'})  
df['table_no'] = df['table_no'] + 1  
df.head(2)

Modify the dataframe and add the column called table_no

	table_no	record_name	title	placement	location	author	year	note	taxonomic_group	issue	copyright
0	1	floradanica_0001.tif	Rubus Chamaemorus Linn.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret
1	2	floradanica_0002.tif	Pedicularis lapponica L.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret
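Deriving the table number from the row position assumes that no record is missing or out of order. One way to test that assumption is to compare the derived number with the "Tab. N" text in the note field. The sketch below does this on a three-row stand-in frame where the third note is deliberately inconsistent. It also assumes that every note contains a "Tab." entry, which may not hold for all records, for example the supplement plates.

```python
import pandas as pd

sample = pd.DataFrame({
    "table_no": [1, 2, 3],
    "note": [
        "Flora Danica Hft. 1, Tab. 1\n\nFigur 1",
        "Flora Danica Hft. 1, Tab. 2\n\nFigur 1",
        "Flora Danica Hft. 1, Tab. 4\n\nFigur 1",  # deliberately inconsistent
    ],
})

# Pull the digits after "Tab." out of each note; notes without a match become NaN
tab_in_note = pd.to_numeric(
    sample["note"].str.extract(r"Tab\.\s*(\d+)", expand=False),
    errors="coerce",
)

# Flag rows where the note states a different number than the derived one
mask = tab_in_note.notna() & (tab_in_note != sample["table_no"])
mismatches = sample.loc[mask, "table_no"].tolist()
print(mismatches)
```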

Column: Author#

All rows in the “Author” column contain identical information: the names and the birth and death dates of several of the botanists who were responsible for the publication work.

The uniform information looks like this:

‘Hornemann, Jens Wilken (6.3.1770-30.7.1841) botanist \nLange, Johan Martin Christian (20.3.1818-3.4.1898) botanist \nLiebmann, Frederik Michael (10.10.1813-29.10.1856) botanist \nMüller, Otto Frederik (2.3.1730-26.12.1784) botanist \nOeder, Georg Christian (3.2.1728-28.1.1791) botanist \nVahl, Martin (7.10.1749-24.12.1804) botanist’

To sort and filter the dataset by author, it is necessary to know which editor was responsible for the publication of each plate. Such a detailed overview can be found on this page:

Issue	Plate No.	Year	Publisher
1	1-60	1761	G. C. Oeder
2	61-120	1763	G. C. Oeder
3	121-180	1764	G. C. Oeder
4	181-240	1765	G. C. Oeder
5	241-300	1766	G. C. Oeder
6	301-360	1767	G. C. Oeder
7	361-420	1768	G. C. Oeder
8	421-480	1769	G. C. Oeder
9	481-540	1770	G. C. Oeder
10	541-600	1771	G. C. Oeder
11	601-660	1775	O. F. Müller
12	661-720	1777	O. F. Müller
13	721-780	1778	O. F. Müller
14	781-840	1780	O. F. Müller
15	841-900	1782	O. F. Müller
16	901-960	1787	M. Vahl
17	961-1020	1790	M. Vahl
18	1021-1080	1792	M. Vahl
19	1081-1140	1794	M. Vahl
20	1141-1200	1797	M. Vahl
21	1201-1260	1799	M. Vahl
22	1261-1320	1806	J. W. Hornemann
23	1321-1380	1808	J. W. Hornemann
24	1381-1440	1810	J. W. Hornemann
25	1441-1500	1813	J. W. Hornemann
26	1501-1560	1816	J. W. Hornemann
27	1561-1620	1818	J. W. Hornemann
28	1621-1680	1819	J. W. Hornemann
29	1681-1740	1821	J. W. Hornemann
30	1741-1800	1823	J. W. Hornemann
31	1801-1860	1825	J. W. Hornemann
32	1861-1920	1827	J. W. Hornemann
33	1921-1980	1829	J. W. Hornemann
34	1981-2040	1830	J. W. Hornemann
35	2041-2100	1832	J. W. Hornemann
36	2101-2160	1834	J. W. Hornemann
37	2161-2220	1836	J. W. Hornemann
38	2221-2280	1839	J. W. Hornemann
39	2281-2340	1840	J. W. Hornemann
40	2341-2400	1843	S. Drejer, J. F. Schouw & J. Vahl
41	2401-2460	1845	F. Liebmann
42	2461-2520	1849	F. Liebmann
43	2521-2580	1852	F. Liebmann
44	2581-2640	1858	Japetus Steenstrup & Johan Lange
45	2641-2700	1861	Johan Lange
46	2701-2760	1867	Johan Lange
47	2761-2820	1869	Johan Lange
48	2821-2880	1871	Johan Lange
49	2881-2940	1877	Johan Lange
50	2941-3000	1880	Johan Lange
51	3001-3060	1883	Johan Lange
Suppl 1	1-60	1853	F. Liebmann
Suppl 2	61-120	1865	Johan Lange
Suppl 3	121-180	1874	Johan Lange

Let’s add the information about tables and authors to each row in the dataset.

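Note that the last three rows describe the 180 plates in the supplements to Flora Danica, which are numbered 1-180 in the overview. In the code below they are placed after the main series, at the end of the table_no range.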

# Create a new dataframe with detailed information about how table numbers relate to author publications.


# Raw data string
data = """
table_no\tyear\tauthor
1-60 	1761 	G. C. Oeder
61-120 	1763 	G. C. Oeder
121-180 	1764 	G. C. Oeder
181-240 	1765 	G. C. Oeder
241-300 	1766 	G. C. Oeder
301-360 	1767 	G. C. Oeder
361-420 	1768 	G. C. Oeder
421-480 	1769 	G. C. Oeder
481-540 	1770 	G. C. Oeder
541-600 	1771 	G. C. Oeder
601-660 	1775 	O. F. Müller
661-720 	1777 	O. F. Müller
721-780 	1778 	O. F. Müller
781-840 	1780 	O. F. Müller
841-900 	1782 	O. F. Müller
901-960 	1787 	M. Vahl
961-1020 	1790 	M. Vahl
1021-1080 	1792 	M. Vahl
1081-1140 	1794 	M. Vahl
1141-1200 	1797 	M. Vahl
1201-1260 	1799 	M. Vahl
1261-1320 	1806 	J. W. Hornemann
1321-1380 	1808 	J. W. Hornemann
1381-1440 	1810 	J. W. Hornemann
1441-1500 	1813 	J. W. Hornemann
1501-1560 	1816 	J. W. Hornemann
1561-1620 	1818 	J. W. Hornemann
1621-1680 	1819 	J. W. Hornemann
1681-1740 	1821 	J. W. Hornemann
1741-1800 	1823 	J. W. Hornemann
1801-1860 	1825 	J. W. Hornemann
1861-1920 	1827 	J. W. Hornemann
1921-1980 	1829 	J. W. Hornemann
1981-2040 	1830 	J. W. Hornemann
2041-2100 	1832 	J. W. Hornemann
2101-2160 	1834 	J. W. Hornemann
2161-2220 	1836 	J. W. Hornemann
2221-2280 	1839 	J. W. Hornemann
2281-2340 	1840 	J. W. Hornemann
2341-2400 	1843 	S. Drejer, J. F. Schouw & J. Vahl
2401-2460 	1845 	F. Liebmann
2461-2520 	1849 	F. Liebmann
2521-2580 	1852 	F. Liebmann
2581-2640 	1858 	Japetus Steenstrup & Johan Lange
2641-2700 	1861 	Johan Lange
2701-2760 	1867 	Johan Lange
2761-2820 	1869 	Johan Lange
2821-2880 	1871 	Johan Lange
2881-2940 	1877 	Johan Lange
2941-3000 	1880 	Johan Lange
3001-3060 	1883 	Johan Lange
3061-3120 	1853 	F. Liebmann
3121-3180 	1865 	Johan Lange
3181-3240 	1874 	Johan Lange
"""

# Split the raw string into lines; the first line holds the column names
lines = data.strip().split('\n')
columns = lines[0].split('\t')
# Split each remaining line on runs of two or more whitespace characters
data_rows = [re.split(r'\s{2,}', line.strip()) for line in lines[1:]]
# Build a dataframe with one row per range of table numbers
detailed_info = pd.DataFrame(data_rows, columns=columns)

# Ensure that 'table_no' is treated as a string
detailed_info['table_no'] = detailed_info['table_no'].astype(str)

# List to store new dataframes
new_dfs = []

# Iterate through each row in the dataframe
for i, row in detailed_info.iterrows():
    # Parse the interval
    start, end = map(int, row['table_no'].split('-'))
    
    # Create index range
    indices = range(start, end + 1)
    
    # Create a new dataframe
    new_df = pd.DataFrame({
        'table_no': indices,
        'author_st': [row['author']] * len(indices)
    })
    
    # Add to the list
    new_dfs.append(new_df)

# Now new_dfs contains all the new dataframes
# and they can be combined into one large dataframe
detailed_info_df = pd.concat(new_dfs).reset_index(drop=True)

print ('New data about "author" has been added to the "author_st" column.\n')
# Combine datasets
df_w_year_author = pd.merge(df, detailed_info_df, how='left', on='table_no')
df_w_year_author.head(2)

New data about "author" has been added to the "author_st" column.

	table_no	record_name	title	placement	location	author	year	note	taxonomic_group	issue	copyright	author_st
0	1	floradanica_0001.tif	Rubus Chamaemorus Linn.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret	G. C. Oeder
1	2	floradanica_0002.tif	Pedicularis lapponica L.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret	G. C. Oeder
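A lookup table of ranges like this is easy to get subtly wrong: if two ranges share a number, the left merge duplicates that row, and if a number is missing from every range, the row gets no author. A minimal sketch of a check for both cases follows. The helper names and the example input are hypothetical and not part of the notebook.

```python
def parse_range(text):
    """Turn a string such as '61-120' into the tuple (61, 120)."""
    start, end = (int(part) for part in text.split("-"))
    return start, end


def find_range_problems(range_texts):
    """Return (overlaps, gaps) between consecutive ranges, in input order."""
    overlaps, gaps = [], []
    ranges = [parse_range(text) for text in range_texts]
    for (_, prev_end), (start, _) in zip(ranges, ranges[1:]):
        if start <= prev_end:
            # The new range starts before the previous one has ended
            overlaps.append((prev_end, start))
        elif start > prev_end + 1:
            # At least one number falls between the two ranges
            gaps.append((prev_end, start))
    return overlaps, gaps


# Hypothetical input with one overlap (120) and one gap (181 is missing)
example = ["1-60", "61-120", "120-180", "182-240"]
overlaps, gaps = find_range_problems(example)
print(overlaps)  # [(120, 120)]
print(gaps)      # [(180, 182)]
```

On the real data the function could be called with `detailed_info['table_no']`. As a second line of defence, `pd.merge` accepts `validate='many_to_one'`, which raises an error if the keys in the right-hand dataframe are not unique.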

Column ‘issue’: Clean it and add a new column called ‘issue_st’#

The values in the ‘issue’ column all start with this long string: Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:

The string is a path that indicates the location in the Digital Collections. The path information is not needed. The data that is needed is what remains when the long string is removed.

This can be achieved by replacing the long string with an empty string. A function called clean_issue does this with Python’s built-in string method .replace(). The function is then applied to the ‘issue’ column, and the result is added to the dataframe in a new column called issue_st.

print (df_w_year_author.at[0,'issue'])

long_string = df_w_year_author.at[0,'issue']

clean_string = long_string.replace('Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:', '')
print(clean_string)

Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:Hft.  1
Hft.  1

def clean_issue(text_string_in):
    text_string_out = text_string_in.replace('Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:', '')
    return text_string_out


df_w_year_author['issue_st'] = df_w_year_author['issue'].apply(clean_issue)
df_w_year_author.head(2)

	table_no	record_name	title	placement	location	author	year	note	taxonomic_group	issue	copyright	author_st	issue_st
0	1	floradanica_0001.tif	Rubus Chamaemorus Linn.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret	G. C. Oeder	Hft. 1
1	2	floradanica_0002.tif	Pedicularis lapponica L.	Fol. Top. Bot. Danmark	Danmark\nNorge	Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo...	1761	Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins...	Digitale Samlinger: Digitale Samlinger: Billed...	Digitale Samlinger: Billeder:Særudgivelser:Flo...	Materialet er fri af ophavsret	G. C. Oeder	Hft. 1
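The cleaned value is still text ("Hft.  1"), which sorts alphabetically rather than numerically. If a numeric issue number is wanted, it could be extracted as sketched below. The sample values are illustrative, and the sketch assumes that the label always has the form "Hft." followed by digits. Values in another form, which the supplements may use, would become NaN.

```python
import pandas as pd

# Illustrative values in the shape produced by clean_issue
issues = pd.Series(["Hft.  1", "Hft. 12", "Hft. 51"], name="issue_st")

# Capture the digits after "Hft." and convert them to numbers
issue_no = pd.to_numeric(
    issues.str.extract(r"Hft\.\s*(\d+)", expand=False),
    errors="coerce",
)
print(issue_no.tolist())
```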

Column name: taxonomic_group#

The values in the column are a bit messy.

They contain both information about the name of the collection and information about the taxonomic group to which the plant in the image belongs.

df_w_year_author.at[0, 'taxonomic_group']

'Digitale Samlinger: Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Taxonomisk gruppe:Karplanter'

Extract the relevant information and add it to a new column#

The actual information about taxonomy would be “Karplanter” (Vascular plants), while the rest of the text string can be considered noise.

The following section extracts the relevant data and adds it to a new column.

# Access a single value
S = df_w_year_author.at[0, 'taxonomic_group']
# Split on ':' and take the last element of the list (the information we actually want)
group_val = S.split(':')[-1]
group_val

'Karplanter'

A function is written to extract the data, and it is applied to add a new column. The value counts show that most plates belong to the taxonomic group “Karplanter” (Vascular plants).

def get_taxonomy_data(S):
    group_val = S.split(':')[-1]
    return group_val


df_w_year_author['taxonomic_group_st'] = df_w_year_author['taxonomic_group'].apply(get_taxonomy_data)
# Inspect data in the new column
print(df_w_year_author['taxonomic_group_st'].value_counts())

taxonomic_group_st
Karplanter           2073
Svampe                391
Mosser                331
Alger                 228
Laver                 163
Slimsvampe             39
Ukendt                 15
Taxonomisk gruppe       1
Lave                    1
Name: count, dtype: int64
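The counts contain two values that occur only once, "Lave" and "Taxonomisk gruppe". A minimal sketch of how such values could be normalized is shown below. It rests on two assumptions that the source does not confirm: that "Lave" is a misspelling of "Laver" (Lichens), and that "Taxonomisk gruppe" is the field label left over in a record where the group itself is missing.

```python
import pandas as pd

# Illustrative values, including the two singletons from the counts above
groups = pd.Series(["Karplanter", "Laver", "Lave", "Taxonomisk gruppe", "Ukendt"])

# Assumptions, not confirmed by the source: "Lave" means "Laver", and
# "Taxonomisk gruppe" marks a record without a group ("Ukendt" = unknown)
corrections = {"Lave": "Laver", "Taxonomisk gruppe": "Ukendt"}

groups_clean = groups.replace(corrections)
print(groups_clean.value_counts())
```

Before applying a correction like this to the real column, the affected records should be looked up by hand to see which group they belong to.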

Column copyright: modify text string#

The values in the “copyright” column are all the same. They consist of a text string that says “Materialet er fri af ophavsret” (The material is free of copyright).

The text is changed to the shorter “free”.

print (df_w_year_author.at[0, 'copyright'])

text_string = df_w_year_author.at[0, 'copyright']
new_text_string = text_string.replace('Materialet er fri af ophavsret', 'free')
print (new_text_string)

Materialet er fri af ophavsret
free

def clean_copyright(S):
    new_text_string  = S.replace('Materialet er fri af ophavsret', 'free')
    return new_text_string


df_w_year_author['copyright'] = df_w_year_author['copyright'].apply(clean_copyright)
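The statement that all values in a column are the same can be checked rather than read off the first rows. A minimal sketch, using a small stand-in frame with made-up rows:

```python
import pandas as pd

sample = pd.DataFrame({
    "copyright": ["Materialet er fri af ophavsret"] * 3,
    "year": [1761, 1761, 1763],
})

# A column with exactly one distinct value (missing values included)
# carries no information that distinguishes one row from another
constant_columns = [
    col for col in sample.columns
    if sample[col].nunique(dropna=False) == 1
]
print(constant_columns)
```

The same check would apply to the original ‘author’ column, which is also described as identical in every row.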

Column name: note#

Other columns in the dataset also pack several pieces of information into a single value. The ‘note’ column is one example.

The following section examines the values in the first row of this column.

print (f'\nNote:\n{df_w_year_author.at[0,"note"]}\n\n')

Note:
Flora Danica Hft. 1, Tab. 1

Figur 1
Latinsk navn: Rubus chamaemorus L.
Dansk slægtsnavn: Multebær
Dansk familienavn: Rosenfamilien
Latinsk familienavn: Rosaceae

Below, the ‘note’ column is searched with regular expressions for specific fields: the Latin family name, the Latin name, the Lange nomenclature (“Lange nomenklator”), and the Danish genus name, family name and species epithet. The parse_notes function extracts these fields. If the Latin name is given as ‘-’, the Lange nomenclature is used in its place. The result is added as new columns to the dataframe.

# Function that finds the relevant data from the values in the 'note' column
print ('Find relevant data in the values in the "note" column')
def parse_notes(note):
    # Start with default values
    latin_family_name = None
    latin_name = None
    lange_nomenclature = None
    danish_genus_name = None
    danish_family_name = None
    danish_species_epithet = None

    # Regular expressions that find relevant information
    latin_family_name_pattern = r'Latinsk familienavn:\s*(.*)'
    latin_name_pattern = r'Latinsk navn:\s*(.*)'
    lange_nomenclature_pattern = r'Lange nomenklator:\s*(.*)'
    danish_genus_name_pattern = r'Dansk slægtsnavn:\s*(.*)'
    danish_family_name_pattern = r'Dansk familienavn:\s*(.*)'
    danish_species_epithet_pattern = r'Dansk artsepitet:\s*(.*)'

    # Search for relevant information
    latin_family_name_match = re.search(latin_family_name_pattern, note)
    if latin_family_name_match:
        latin_family_name = latin_family_name_match.group(1).strip()

    latin_name_match = re.search(latin_name_pattern, note)
    if latin_name_match:
        latin_name = latin_name_match.group(1).strip()

    lange_nomenclature_match = re.search(lange_nomenclature_pattern, note)
    if lange_nomenclature_match:
        lange_nomenclature = lange_nomenclature_match.group(1).strip()

    danish_genus_name_match = re.search(danish_genus_name_pattern, note)
    if danish_genus_name_match:
        danish_genus_name = danish_genus_name_match.group(1).strip()

    danish_family_name_match = re.search(danish_family_name_pattern, note)
    if danish_family_name_match:
        danish_family_name = danish_family_name_match.group(1).strip()

    danish_species_epithet_match = re.search(danish_species_epithet_pattern, note)
    if danish_species_epithet_match:
        danish_species_epithet = danish_species_epithet_match.group(1).strip()

    # Combine Latin name and Lange nomenclature
    combined_latin_name = latin_name if latin_name != '-' else lange_nomenclature

    return latin_family_name, combined_latin_name, danish_genus_name, danish_family_name, danish_species_epithet

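# Guard: keep one row per table number, in case the merge above duplicated any rows
df_w_year_author = df_w_year_author.drop_duplicates(subset='table_no').reset_index(drop=True)

# Apply the function to the notes of the merged dataframe and keep its index,
# so that the parsed columns line up with the rows they were parsed from
parsed_data = df_w_year_author['note'].apply(parse_notes)
df_parsed = pd.DataFrame(parsed_data.tolist(), index=df_w_year_author.index, columns=['latin_family_name', 'latin_name', 'danish_genus_name', 'danish_family_name', 'danish_species_epithet'])
print('Done')

# Add the parsed columns to the merged dataframe
print('Concatenate the data with the original dataframe')
concat_df = pd.concat([df_w_year_author, df_parsed], axis=1)
print(f'Done. Dataframe shape: {concat_df.shape}')

Find relevant data in the values in the "note" column
Done
Concatenate the data with the original dataframe
Done. Dataframe shape: (3240, 19)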
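The field-by-field search can also be written more compactly by looping over a dictionary of labels. The sketch below is an alternative formulation, not the notebook's own function. It covers only four of the labels, and it leaves out the fallback to the Lange nomenclature. The sample note is the one printed above, and the index values 10 and 11 are arbitrary, chosen to show that the parsed columns follow the index of the source rows.

```python
import re
import pandas as pd

# Label in the note -> name of the new column (a subset, for illustration)
LABELS = {
    "Latinsk navn": "latin_name",
    "Latinsk familienavn": "latin_family_name",
    "Dansk slægtsnavn": "danish_genus_name",
    "Dansk familienavn": "danish_family_name",
}


def parse_note_fields(note):
    """Return one entry per known label; None where the label is absent."""
    fields = {}
    for label, column in LABELS.items():
        # The label must start a line; the value is the rest of that line
        pattern = rf"^{re.escape(label)}:[ \t]*(.*)$"
        match = re.search(pattern, note, flags=re.MULTILINE)
        fields[column] = match.group(1).strip() if match else None
    return fields


note = (
    "Flora Danica Hft. 1, Tab. 1\n\nFigur 1\n"
    "Latinsk navn: Rubus chamaemorus L.\n"
    "Dansk slægtsnavn: Multebær\n"
    "Dansk familienavn: Rosenfamilien\n"
    "Latinsk familienavn: Rosaceae"
)
source = pd.DataFrame({"note": [note, "Flora Danica Hft. 1, Tab. 2"]}, index=[10, 11])

# Build the parsed frame on the same index as the source, then join on it
parsed = pd.DataFrame(
    [parse_note_fields(text) for text in source["note"]],
    index=source.index,
)
combined = source.join(parsed)
print(combined.loc[10, "latin_name"])
```

Anchoring each pattern to the start of a line keeps "Latinsk navn" from matching inside another label, and reusing the source index avoids rows shifting against each other when the two frames are combined.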

Create a “subset” and save it as a CSV file#

The dataset is now cleaner and better organized than before, which makes it easier to use for analysis and visualizations.

However, only some of the columns are needed for further work. Therefore, a subset of the columns is selected and saved as a CSV file.

subset_df = concat_df[['table_no', 'record_name', 'title', 'year', 'author_st', 'taxonomic_group_st', 'issue', 'latin_family_name', 'latin_name', 'danish_genus_name', 'danish_family_name', 'danish_species_epithet', 'copyright']]
# A relative path with forward slashes works on Windows, macOS and Linux
subset_df.to_csv('mekuni_flora_danica_data/flora_danica_tidy_format.csv', index=False)
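After exporting, it can be worth reading the file back to confirm that nothing was lost, in particular Danish characters and empty fields. A minimal sketch of such a round-trip check follows. It writes a two-row stand-in frame to a temporary directory, so it does not touch the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for subset_df, with one non-ASCII value and one empty column
subset = pd.DataFrame({
    "table_no": [1, 2],
    "danish_genus_name": ["Multebær", "Troldurt"],
    "danish_species_epithet": [None, None],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "flora_danica_tidy_format.csv"
    subset.to_csv(csv_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(csv_path, encoding="utf-8")

# Compare shape, column order and a text value with non-ASCII characters
same_shape = reloaded.shape == subset.shape
same_columns = list(reloaded.columns) == list(subset.columns)
print(same_shape, same_columns, reloaded.loc[0, "danish_genus_name"])
```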

Other studies#

The cleaned Flora Danica metadata invites several complementary lines of inquiry:

Analyze temporal patterns in plant documentation by examining publication dates and author contributions over time to understand the evolution of botanical knowledge.
Investigate taxonomic diversity by exploring the distribution of plant families, genera, and species across the collection.
Explore author contributions by examining which authors documented which types of plants and whether certain authors specialized in particular taxonomic groups.
Investigate geographic patterns using the location column (for example “Danmark” and “Norge”) to understand regional plant documentation.
Analyze the relationship between publication issues and taxonomic groups to identify organizational patterns in the collection.
Cross-reference the metadata with the actual TIFF images to create image-text analysis workflows.
Explore copyright and publication information to understand the historical context and accessibility of the botanical illustrations.
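As a starting point for the question about authors and taxonomic groups, a cross-tabulation of the two cleaned columns could look like the sketch below. The four rows are made-up stand-ins, not values from the dataset. On the real data the two columns would come from the exported CSV file.

```python
import pandas as pd

# Made-up stand-in rows using the column names of the tidy dataset
tidy = pd.DataFrame({
    "author_st": ["G. C. Oeder", "G. C. Oeder", "Johan Lange", "Johan Lange"],
    "taxonomic_group_st": ["Karplanter", "Mosser", "Svampe", "Svampe"],
})

# Rows: authors; columns: taxonomic groups; cells: number of plates
counts = pd.crosstab(tidy["author_st"], tidy["taxonomic_group_st"])
print(counts)
```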