Clean the Flora Danica dataset

Script Summary

The Flora Danica dataset consists of more than 3,000 TIFF image files and a metadata file in xlsx format, Index_FloraDanica.xlsx. This notebook cleans the metadata and transforms it into a uniform, tidy format suitable for exploratory analysis.

Steps:

  • Load the Flora Danica metadata from an Excel file

  • Inspect the dataset structure and data types

  • Rename columns to create consistent naming conventions

  • Extract and standardize table numbers

  • Process author information and publication details

  • Clean issue numbers by removing path information

  • Extract taxonomic group information from structured data

  • Standardize copyright information

  • Process note fields and extract relevant information

  • Create a cleaned subset of the most relevant columns for downstream analysis

  • Export the cleaned dataset as a CSV file

Outputs: A cleaned and standardized CSV file (flora_danica_tidy_format.csv) ready for visualization and further analysis.

Dataset: You can find the metadata file on the library’s open access repository (LOAR).

Import the libraries

import pandas as pd
import re

Make the metadata “tidy”

The metadata comes from a data dump of the library’s digital collections and is not well suited to data analysis in its current format. At first glance the dataset appears well structured, but a closer look at what each column actually stores shows that considerably more can be extracted once the data is cleaned up.

The following section examines the data.

# Load the file with Flora Danica metadata
print('Loading data')
df = pd.read_excel(r'mekuni_flora_danica_data/Index_FloraDanica.xlsx')
print(f'Done. Dataframe shape: {df.shape}')
print('Inspect the first two rows of data.')

df.head(2)
Loading data
Done. Dataframe shape: (3240, 10)
Inspect the first two rows of data.
Record Name Titel Opstilling Lokalitet Ophav År Note Taxonomisk gruppe Hæfte Copyright
0 floradanica_0001.tif Rubus Chamaemorus Linn. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret
1 floradanica_0002.tif Pedicularis lapponica L. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret

Rename columns

The column names are a mix of English and Danish. A dictionary maps each old name to a new, lowercase English name, and the columns are then renamed in one step.

# Define the mapping from old column names to new column names
column_rename_mapping = {
    "Record Name": "record_name",
    "Titel": "title",
    "Opstilling": "placement",
    "Lokalitet": "location",
    "Ophav": "author",
    "År": "year",
    "Note": "note",
    "Taxonomisk gruppe": "taxonomic_group",
    "Hæfte": "issue",
    "Copyright": "copyright"
}

# Rename the columns using the mapping
df.rename(columns=column_rename_mapping, inplace=True)
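One thing to watch with a rename mapping is that DataFrame.rename, by default, ignores keys that match no column, so a misspelled or changed column name passes silently. The following is a minimal sketch of a pre-check. The helper unmatched_columns is hypothetical (it is not part of the notebook), and the column list is typed in by hand from the output shown earlier.

```python
def unmatched_columns(columns, mapping):
    """Return (mapping keys missing from columns, columns missing from mapping)."""
    columns = list(columns)
    missing = [old for old in mapping if old not in columns]
    unmapped = [col for col in columns if col not in mapping]
    return missing, unmapped


# Column names as they appear in the Excel file (copied by hand from the output)
excel_columns = [
    "Record Name", "Titel", "Opstilling", "Lokalitet", "Ophav",
    "År", "Note", "Taxonomisk gruppe", "Hæfte", "Copyright",
]

# The same mapping as in the notebook
column_rename_mapping = {
    "Record Name": "record_name",
    "Titel": "title",
    "Opstilling": "placement",
    "Lokalitet": "location",
    "Ophav": "author",
    "År": "year",
    "Note": "note",
    "Taxonomisk gruppe": "taxonomic_group",
    "Hæfte": "issue",
    "Copyright": "copyright",
}

missing, unmapped = unmatched_columns(excel_columns, column_rename_mapping)
print("Mapping keys without a column:", missing)
print("Columns without a mapping:", unmapped)
```

In the notebook, the check could be run as unmatched_columns(df.columns, column_rename_mapping) before renaming. Passing errors='raise' to DataFrame.rename is an alternative that turns an unmatched key into an error.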

Table number: add information about the table numbers

Each image in Flora Danica is marked with a table number, usually at the top of the image. There are 3240 different numbers. The rows appear in table order (row 0 is Tab. 1, row 1 is Tab. 2), so a column with the table numbers can be created by taking the zero-based index and adding one.

# Modify the dataframe and add the column called table_no  
print('Modify the dataframe and add the column called table_no')  
df = df.reset_index().rename(columns={'index': 'table_no'})  
df['table_no'] = df['table_no'] + 1  
df.head(2)
Modify the dataframe and add the column called table_no
table_no record_name title placement location author year note taxonomic_group issue copyright
0 1 floradanica_0001.tif Rubus Chamaemorus Linn. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret
1 2 floradanica_0002.tif Pedicularis lapponica L. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret
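The index-based numbering relies on the rows being in table order. As a cross-check, the number can also be read from the file name. This is a minimal sketch that assumes every record name follows the pattern floradanica_NNNN.tif seen in the first rows; table_no_from_record_name is a hypothetical helper, and the third sample name is made up to show the fallback.

```python
import re

# Assumed file-name pattern: 'floradanica_' + digits + '.tif' (or '.tiff')
RECORD_NAME_PATTERN = re.compile(r"floradanica_(\d+)\.tiff?$", re.IGNORECASE)


def table_no_from_record_name(record_name):
    """Return the number embedded in the file name, or None if the name does not match."""
    match = RECORD_NAME_PATTERN.search(record_name)
    return int(match.group(1)) if match else None


record_names = ["floradanica_0001.tif", "floradanica_0002.tif", "unexpected_name.tif"]
for position, name in enumerate(record_names):
    from_index = position + 1                      # the notebook's approach
    from_name = table_no_from_record_name(name)    # the cross-check
    print(name, from_index, from_name, from_index == from_name)
```

Rows where the two numbers disagree, or where the helper returns None, would be worth inspecting by hand before the table number is used as a merge key.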

Column: Author

Every row in the ‘author’ column contains the same value: the names and the birth and death dates of several of the botanists who were responsible for the publication.

The value looks like this:

‘Hornemann, Jens Wilken (6.3.1770-30.7.1841) botanist \nLange, Johan Martin Christian (20.3.1818-3.4.1898) botanist \nLiebmann, Frederik Michael (10.10.1813-29.10.1856) botanist \nMüller, Otto Frederik (2.3.1730-26.12.1784) botanist \nOeder, Georg Christian (3.2.1728-28.1.1791) botanist \nVahl, Martin (7.10.1749-24.12.1804) botanist’

To sort and filter the dataset by author, detailed information is needed about which author was responsible for the publication of each image. Such a detailed overview can be found on this page:

| Issue | Plate No. | Year | Publisher |
| --- | --- | --- | --- |
| 1 | 1-60 | 1761 | G. C. Oeder |
| 2 | 61-120 | 1763 | G. C. Oeder |
| 3 | 121-180 | 1764 | G. C. Oeder |
| 4 | 181-240 | 1765 | G. C. Oeder |
| 5 | 241-300 | 1766 | G. C. Oeder |
| 6 | 301-360 | 1767 | G. C. Oeder |
| 7 | 361-420 | 1768 | G. C. Oeder |
| 8 | 421-480 | 1769 | G. C. Oeder |
| 9 | 481-540 | 1770 | G. C. Oeder |
| 10 | 541-600 | 1771 | G. C. Oeder |
| 11 | 601-660 | 1775 | O. F. Müller |
| 12 | 661-720 | 1777 | O. F. Müller |
| 13 | 721-780 | 1778 | O. F. Müller |
| 14 | 781-840 | 1780 | O. F. Müller |
| 15 | 841-900 | 1782 | O. F. Müller |
| 16 | 901-960 | 1787 | M. Vahl |
| 17 | 961-1020 | 1790 | M. Vahl |
| 18 | 1021-1080 | 1792 | M. Vahl |
| 19 | 1081-1140 | 1794 | M. Vahl |
| 20 | 1141-1200 | 1797 | M. Vahl |
| 21 | 1201-1260 | 1799 | M. Vahl |
| 22 | 1261-1320 | 1806 | J. W. Hornemann |
| 23 | 1321-1380 | 1808 | J. W. Hornemann |
| 24 | 1381-1440 | 1810 | J. W. Hornemann |
| 25 | 1441-1500 | 1813 | J. W. Hornemann |
| 26 | 1501-1560 | 1816 | J. W. Hornemann |
| 27 | 1561-1620 | 1818 | J. W. Hornemann |
| 28 | 1621-1680 | 1819 | J. W. Hornemann |
| 29 | 1681-1740 | 1821 | J. W. Hornemann |
| 30 | 1741-1800 | 1823 | J. W. Hornemann |
| 31 | 1801-1860 | 1825 | J. W. Hornemann |
| 32 | 1861-1920 | 1827 | J. W. Hornemann |
| 33 | 1921-1980 | 1829 | J. W. Hornemann |
| 34 | 1981-2040 | 1830 | J. W. Hornemann |
| 35 | 2041-2100 | 1832 | J. W. Hornemann |
| 36 | 2101-2160 | 1834 | J. W. Hornemann |
| 37 | 2161-2220 | 1836 | J. W. Hornemann |
| 38 | 2221-2280 | 1839 | J. W. Hornemann |
| 39 | 2281-2340 | 1840 | J. W. Hornemann |
| 40 | 2341-2400 | 1843 | S. Drejer, J. F. Schouw & J. Vahl |
| 41 | 2401-2460 | 1845 | F. Liebmann |
| 42 | 2461-2520 | 1849 | F. Liebmann |
| 43 | 2521-2580 | 1852 | F. Liebmann |
| 44 | 2581-2640 | 1858 | Japetus Steenstrup & Johan Lange |
| 45 | 2641-2700 | 1861 | Johan Lange |
| 46 | 2701-2760 | 1867 | Johan Lange |
| 47 | 2761-2820 | 1869 | Johan Lange |
| 48 | 2821-2880 | 1871 | Johan Lange |
| 49 | 2881-2940 | 1877 | Johan Lange |
| 50 | 2941-3000 | 1880 | Johan Lange |
| 51 | 3001-3060 | 1883 | Johan Lange |
| Suppl 1 | 1-60 | 1853 | F. Liebmann |
| Suppl 2 | 61-120 | 1865 | Johan Lange |
| Suppl 3 | 121-180 | 1874 | Johan Lange |

Let’s add the information about tables and authors to each row in the dataset.

Note that the last three lines describe the 180 tables in the supplementary pages of Flora Danica.

# Create a new dataframe with detailed information about how table numbers relate to author publications.


# Raw data string
data = """
table_no\tyear\tauthor
1-60 	1761 	G. C. Oeder
61-120 	1763 	G. C. Oeder
121-180 	1764 	G. C. Oeder
181-240 	1765 	G. C. Oeder
241-300 	1766 	G. C. Oeder
301-360 	1767 	G. C. Oeder
361-420 	1768 	G. C. Oeder
421-480 	1769 	G. C. Oeder
481-540 	1770 	G. C. Oeder
541-600 	1771 	G. C. Oeder
601-660 	1775 	O. F. Müller
661-720 	1777 	O. F. Müller
721-780 	1778 	O. F. Müller
781-840 	1780 	O. F. Müller
841-900 	1782 	O. F. Müller
901-960 	1787 	M. Vahl
961-1020 	1790 	M. Vahl
1021-1080 	1792 	M. Vahl
1081-1140 	1794 	M. Vahl
1141-1200 	1797 	M. Vahl
1201-1260 	1799 	M. Vahl
1261-1320 	1806 	J. W. Hornemann
1321-1380 	1808 	J. W. Hornemann
1381-1440 	1810 	J. W. Hornemann
1441-1500 	1813 	J. W. Hornemann
1501-1560 	1816 	J. W. Hornemann
1561-1620 	1818 	J. W. Hornemann
1621-1680 	1819 	J. W. Hornemann
1681-1740 	1821 	J. W. Hornemann
1741-1800 	1823 	J. W. Hornemann
1801-1860 	1825 	J. W. Hornemann
1861-1920 	1827 	J. W. Hornemann
1921-1980 	1829 	J. W. Hornemann
1981-2040 	1830 	J. W. Hornemann
2041-2100 	1832 	J. W. Hornemann
2101-2160 	1834 	J. W. Hornemann
2161-2220 	1836 	J. W. Hornemann
2221-2280 	1839 	J. W. Hornemann
2281-2340 	1840 	J. W. Hornemann
2341-2400 	1843 	S. Drejer, J. F. Schouw & J. Vahl
2401-2460 	1845 	F. Liebmann
2461-2520 	1849 	F. Liebmann
2521-2580 	1852 	F. Liebmann
2581-2640 	1858 	Japetus Steenstrup & Johan Lange
2641-2700 	1861 	Johan Lange
2701-2760 	1867 	Johan Lange
2761-2820 	1869 	Johan Lange
2821-2880 	1871 	Johan Lange
2881-2940 	1877 	Johan Lange
2941-3000 	1880 	Johan Lange
3001-3060 	1883 	Johan Lange
3061-3120 	1853 	F. Liebmann
3121-3180 	1865 	Johan Lange
3181-3240 	1874 	Johan Lange
"""

# Split the data into lines; the first line holds the column names
lines = data.strip().split('\n')
columns = lines[0].split('\t')
# Split each data line into fields on runs of two or more whitespace characters
data_rows = [re.split(r'\s{2,}', line.strip()) for line in lines[1:]]
# Build a dataframe with one row per range of table numbers
detailed_info = pd.DataFrame(data_rows, columns=columns)

# Ensure that 'table_no' is treated as a string
detailed_info['table_no'] = detailed_info['table_no'].astype(str)

# List to store new dataframes
new_dfs = []

# Iterate through each row in the dataframe
for i, row in detailed_info.iterrows():
    # Parse the interval
    start, end = map(int, row['table_no'].split('-'))
    
    # Create index range
    indices = range(start, end + 1)
    
    # Create a new dataframe
    new_df = pd.DataFrame({
        'table_no': indices,
        'author_st': [row['author']] * len(indices)
    })
    
    # Add to the list
    new_dfs.append(new_df)

# Now new_dfs contains all the new dataframes
# and they can be combined into one large dataframe
detailed_info_df = pd.concat(new_dfs).reset_index(drop=True)

print ('New data about "author" has been added to the "author_st" column.\n')
# Combine datasets
df_w_year_author = pd.merge(df, detailed_info_df, how='left', on='table_no')
df_w_year_author.head(2)
New data about "author" has been added to the "author_st" column.
table_no record_name title placement location author year note taxonomic_group issue copyright author_st
0 1 floradanica_0001.tif Rubus Chamaemorus Linn. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret G. C. Oeder
1 2 floradanica_0002.tif Pedicularis lapponica L. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret G. C. Oeder
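Hand-typed ranges are easy to get wrong by one, and a table number that falls into two ranges produces a duplicated row in a left merge. The following is a minimal sketch of a check that every table number is covered exactly once. The helpers expand_ranges and coverage_problems are hypothetical, and the sample ranges are toy values, not the Flora Danica lookup.

```python
from collections import Counter


def expand_ranges(range_strings):
    """Expand strings such as '1-60' into a flat list of integers."""
    numbers = []
    for text in range_strings:
        start, end = (int(part) for part in text.split("-"))
        numbers.extend(range(start, end + 1))
    return numbers


def coverage_problems(range_strings, first, last):
    """Return (numbers covered more than once, numbers not covered at all)."""
    counts = Counter(expand_ranges(range_strings))
    duplicated = sorted(n for n, c in counts.items() if c > 1)
    missing = [n for n in range(first, last + 1) if n not in counts]
    return duplicated, missing


# Toy ranges: the second range overlaps the first at 60, and 121 is not covered
sample = ["1-60", "60-120", "122-180"]
duplicated, missing = coverage_problems(sample, 1, 180)
print("Covered more than once:", duplicated)
print("Not covered:", missing)
```

In the notebook, coverage_problems(detailed_info['table_no'], 1, 3240) should return two empty lists before the merge. As a second line of defence, pd.merge accepts validate='many_to_one', which raises an error if the merge keys in the right-hand dataframe are not unique.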

Column ‘issue’: Clean it and add a new column called ‘issue_st’

The rows in the ‘issue’ column contain this long string: Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:

The string is a path that indicates the location in the Digital Collections. The path is not needed; the useful data is what remains once the path is removed.

This is done by replacing the path with an empty string. A function called clean_issue is written using Python’s built-in string method .replace(). The function is then applied to the ‘issue’ column, and the result is added to the dataframe in a new column called issue_st.

print (df_w_year_author.at[0,'issue'])

long_string = df_w_year_author.at[0,'issue']

clean_string = long_string.replace('Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:', '')
print(clean_string)
Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:Hft.  1
Hft.  1
def clean_issue(text_string_in):
    # Remove the collection path so that only the issue label (e.g. 'Hft.  1') remains
    text_string_out = text_string_in.replace('Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:', '')
    return text_string_out


df_w_year_author['issue_st'] = df_w_year_author['issue'].apply(clean_issue)
df_w_year_author.head(2)
table_no record_name title placement location author year note taxonomic_group issue copyright author_st issue_st
0 1 floradanica_0001.tif Rubus Chamaemorus Linn. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 1\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret G. C. Oeder Hft. 1
1 2 floradanica_0002.tif Pedicularis lapponica L. Fol. Top. Bot. Danmark Danmark\nNorge Hornemann, Jens Wilken (6.3.1770-30.7.1841) bo... 1761 Flora Danica Hft. 1, Tab. 2\n\nFigur 1\nLatins... Digitale Samlinger: Digitale Samlinger: Billed... Digitale Samlinger: Billeder:Særudgivelser:Flo... Materialet er fri af ophavsret G. C. Oeder Hft. 1
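The cleaned label still contains the prefix "Hft." and irregular spacing, which makes numeric sorting awkward. As a possible next step, the issue number can be pulled out as an integer. This is a minimal sketch under the assumption that issue labels look like the "Hft.  1" example; how the supplements are labelled is not shown in the data excerpt, so the hypothetical helper issue_number returns None for anything that does not match.

```python
import re

# Assumed label format: 'Hft.' followed by optional whitespace and digits
ISSUE_NUMBER_PATTERN = re.compile(r"Hft\.\s*(\d+)")


def issue_number(issue_text):
    """Return the issue number as an int, or None if no 'Hft. <n>' label is found."""
    match = ISSUE_NUMBER_PATTERN.search(issue_text)
    return int(match.group(1)) if match else None


raw_value = "Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Hæfte:Hft.  1"
print(issue_number(raw_value))   # works on the raw path
print(issue_number("Hft.  1"))   # and on the cleaned label
print(issue_number("no label"))  # unmatched input gives None
```

Because the pattern searches anywhere in the string, the helper can be applied to either the ‘issue’ or the ‘issue_st’ column.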

Column name: taxonomic_group

The values in this column are messy: each one contains both the path of the collection and the taxonomic group to which the plant in the image belongs.

df_w_year_author.at[0, 'taxonomic_group']
'Digitale Samlinger: Digitale Samlinger: Billeder:Særudgivelser:Flora Danica:Taxonomisk gruppe:Karplanter'

Extract the relevant information and add it to a new column

The actual taxonomic information in this example is “Karplanter” (vascular plants); the rest of the string can be considered noise.

The following section extracts the relevant data and adds it to a new column.

# Access a single value
S = df_w_year_author.at[0, 'taxonomic_group']
# Split on ':' and take the last element of the list (the information we actually want)
group_val = S.split(':')[-1]
group_val
'Karplanter'

The same logic is wrapped in a function and applied to every row, and the result is stored in a new column. The value counts show that most plants belong to the taxonomic group “Karplanter” (vascular plants). They also show two values that occur only once, “Taxonomisk gruppe” and “Lave”, which look like irregularities in the source data.

def get_taxonomy_data(S):
    # The taxonomic group is the last element of the colon-separated path
    group_val = S.split(':')[-1]
    return group_val


df_w_year_author['taxonomic_group_st'] = df_w_year_author['taxonomic_group'].apply(get_taxonomy_data)
# Inspect data in the new column
print(df_w_year_author['taxonomic_group_st'].value_counts())
taxonomic_group_st
Karplanter           2073
Svampe                391
Mosser                331
Alger                 228
Laver                 163
Slimsvampe             39
Ukendt                 15
Taxonomisk gruppe       1
Lave                    1
Name: count, dtype: int64
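The counts include two values that occur only once, "Taxonomisk gruppe" and "Lave". A minimal sketch of how such values could be normalised follows. Both corrections are assumptions based only on the counts: that "Lave" is a misspelling of "Laver", and that "Taxonomisk gruppe" is the folder name left over when no group was recorded. They would need to be checked against the records themselves. TAXONOMY_FIXES and normalise_group are hypothetical names.

```python
# Assumed corrections; verify against the source records before relying on them
TAXONOMY_FIXES = {
    "Lave": "Laver",            # assumption: typo for 'Laver'
    "Taxonomisk gruppe": None,  # assumption: folder name only, no group recorded
}


def normalise_group(label):
    """Strip whitespace and apply the assumed corrections; other labels pass through."""
    label = label.strip()
    return TAXONOMY_FIXES.get(label, label)


for value in ["Karplanter", "Lave", "Taxonomisk gruppe", " Svampe "]:
    print(repr(value), "->", repr(normalise_group(value)))
```

Keeping the corrections in a dictionary makes each decision visible and easy to revise if inspection of the two records shows a different cause.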

Column name: note

Other columns also pack several pieces of information into a single value, for example the ‘note’ column.

The following section examines the value in the first row of this column.

print (f'\nNote:\n{df_w_year_author.at[0,"note"]}\n\n')
Note:
Flora Danica Hft. 1, Tab. 1

Figur 1
Latinsk navn: Rubus chamaemorus L.
Dansk slægtsnavn: Multebær
Dansk familienavn: Rosenfamilien
Latinsk familienavn: Rosaceae

Below, the ‘note’ column is searched with regular expressions for specific fields: the Latin name, the Latin family name, the Lange nomenclature, and the Danish genus name, family name and species epithet. The parse_notes function extracts these values; where the Latin name is given as “-”, the Lange nomenclature is used instead. The result is added as new columns to the dataframe.

# Function that finds the relevant data from the values in the 'note' column
print ('Find relevant data in the values in the "note" column')
def parse_notes(note):
    # Start with default values
    latin_family_name = None
    latin_name = None
    lange_nomenclature = None
    danish_genus_name = None
    danish_family_name = None
    danish_species_epithet = None

    # Regular expressions that find relevant information
    latin_family_name_pattern = r'Latinsk familienavn:\s*(.*)'
    latin_name_pattern = r'Latinsk navn:\s*(.*)'
    lange_nomenclature_pattern = r'Lange nomenklator:\s*(.*)'
    danish_genus_name_pattern = r'Dansk slægtsnavn:\s*(.*)'
    danish_family_name_pattern = r'Dansk familienavn:\s*(.*)'
    danish_species_epithet_pattern = r'Dansk artsepitet:\s*(.*)'

    # Search for relevant information
    latin_family_name_match = re.search(latin_family_name_pattern, note)
    if latin_family_name_match:
        latin_family_name = latin_family_name_match.group(1).strip()

    latin_name_match = re.search(latin_name_pattern, note)
    if latin_name_match:
        latin_name = latin_name_match.group(1).strip()

    lange_nomenclature_match = re.search(lange_nomenclature_pattern, note)
    if lange_nomenclature_match:
        lange_nomenclature = lange_nomenclature_match.group(1).strip()

    danish_genus_name_match = re.search(danish_genus_name_pattern, note)
    if danish_genus_name_match:
        danish_genus_name = danish_genus_name_match.group(1).strip()

    danish_family_name_match = re.search(danish_family_name_pattern, note)
    if danish_family_name_match:
        danish_family_name = danish_family_name_match.group(1).strip()

    danish_species_epithet_match = re.search(danish_species_epithet_pattern, note)
    if danish_species_epithet_match:
        danish_species_epithet = danish_species_epithet_match.group(1).strip()

    # Use the Latin name, unless it is the placeholder '-', in which case fall back to the Lange nomenclature
    combined_latin_name = latin_name if latin_name != '-' else lange_nomenclature

    return latin_family_name, combined_latin_name, danish_genus_name, danish_family_name, danish_species_epithet

# Guard against duplicate table numbers from the merge, so there is exactly one row per image
df_w_year_author = df_w_year_author.drop_duplicates(subset='table_no').reset_index(drop=True)

# Apply the function to the 'note' column of the merged dataframe
parsed_data = df_w_year_author['note'].apply(parse_notes)
df_parsed = pd.DataFrame(parsed_data.tolist(), columns=['latin_family_name', 'latin_name', 'danish_genus_name', 'danish_family_name', 'danish_species_epithet'], index=df_w_year_author.index)
print('Done')

# Add the parsed columns to the merged dataframe (rows are aligned on the index)
print('Concatenate the data with the original dataframe')
concat_df = pd.concat([df_w_year_author, df_parsed], axis=1)
print(f'Done. Dataframe shape: {concat_df.shape}')
Find relevant data in the values in the "note" column
Done
Concatenate the data with the original dataframe
Done. Dataframe shape: (3240, 19)
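The six search-and-assign blocks in parse_notes all follow the same pattern, so the labels can also be kept in a dictionary and handled in a loop. The following is a minimal sketch of that variant, not the notebook's implementation; NOTE_FIELDS and parse_note_fields are hypothetical names. It differs in one detail: it matches only spaces and tabs after the colon, because \s also matches a newline and would let an empty field capture the following line.

```python
import re

# Output key -> label as it appears in the note text
NOTE_FIELDS = {
    "latin_name": "Latinsk navn",
    "latin_family_name": "Latinsk familienavn",
    "lange_nomenclature": "Lange nomenklator",
    "danish_genus_name": "Dansk slægtsnavn",
    "danish_family_name": "Dansk familienavn",
    "danish_species_epithet": "Dansk artsepitet",
}


def parse_note_fields(note):
    """Return a dict with one entry per field; fields not found are None."""
    result = {}
    for key, label in NOTE_FIELDS.items():
        # '.' does not match a newline, so the capture stops at the end of the line
        match = re.search(rf"{re.escape(label)}:[ \t]*(.*)", note)
        result[key] = match.group(1).strip() if match else None
    return result


# The note of the first row, as printed earlier in the notebook
sample_note = (
    "Flora Danica Hft. 1, Tab. 1\n\n"
    "Figur 1\n"
    "Latinsk navn: Rubus chamaemorus L.\n"
    "Dansk slægtsnavn: Multebær\n"
    "Dansk familienavn: Rosenfamilien\n"
    "Latinsk familienavn: Rosaceae"
)

fields = parse_note_fields(sample_note)
for key, value in fields.items():
    print(f"{key}: {value}")
```

Returning a dictionary rather than a tuple means the column names travel with the values, so a list of these dictionaries can be passed to pd.DataFrame without a separate columns argument.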

Create a “subset” and save it as a CSV file

The dataset is now cleaner and better organized than before, which makes it easier to use for analysis and visualization.

However, only some of the columns are needed for further work. A subset of the columns is therefore selected and saved as a CSV file.

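# Keep the cleaned columns needed downstream (issue_st is the cleaned version of issue)
subset_df = concat_df[['table_no', 'record_name', 'title', 'year', 'author_st', 'taxonomic_group_st', 'issue_st', 'latin_family_name', 'latin_name', 'danish_genus_name', 'danish_family_name', 'danish_species_epithet', 'copyright']]
# A forward-slash relative path works on Windows, macOS and Linux alike
subset_df.to_csv('mekuni_flora_danica_data/flora_danica_tidy_format.csv', index=False)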
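Two things can go wrong at the export step: the target folder may not exist, and Danish characters such as æ, ø and å can be damaged if the file is later read with a different encoding. The sketch below shows, using only the standard library, how a path can be built portably, how the folder can be created if needed, and how a read-back check confirms the round trip. The helpers write_rows and read_rows, the file name demo.csv and the single sample row are illustrative assumptions.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, fieldnames, rows):
    """Write rows (a list of dicts) to a UTF-8 CSV file, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_rows(path):
    """Read a UTF-8 CSV file back into a list of dicts."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


rows = [{"table_no": "1", "danish_genus_name": "Multebær"}]

# A temporary folder keeps the demonstration from touching real data
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "mekuni_flora_danica_data" / "demo.csv"
    out_path = write_rows(target, ["table_no", "danish_genus_name"], rows)
    round_trip = read_rows(out_path)

print(round_trip == rows)
```

With pandas, the same idea amounts to calling Path(...).parent.mkdir(parents=True, exist_ok=True) before to_csv, and comparing the result of pd.read_csv with the exported dataframe.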

Other studies

The cleaned Flora Danica metadata invites several complementary lines of inquiry:

  • Analyze temporal patterns in plant documentation by examining publication dates and author contributions over time to understand the evolution of botanical knowledge.

  • Investigate taxonomic diversity by exploring the distribution of plant families, genera, and species across the collection.

  • Explore author contributions by examining which authors documented which types of plants and whether certain authors specialized in particular taxonomic groups.

  • Investigate geographic patterns if location data is available, to understand regional plant documentation.

  • Analyze the relationship between publication issues and taxonomic groups to identify organizational patterns in the collection.

  • Cross-reference the metadata with the actual TIFF images to create image-text analysis workflows.

  • Explore copyright and publication information to understand the historical context and accessibility of the botanical illustrations.
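As a starting point for the question of which authors documented which taxonomic groups, a cross-tabulation of the two cleaned columns may be enough. This is a minimal sketch with made-up rows; only the column names author_st and taxonomic_group_st are taken from the notebook, and the counts shown say nothing about the real collection.

```python
import pandas as pd

# Made-up rows for illustration; in practice the frame would come from the exported CSV
tidy = pd.DataFrame({
    "author_st": ["G. C. Oeder", "G. C. Oeder", "M. Vahl", "M. Vahl", "M. Vahl"],
    "taxonomic_group_st": ["Karplanter", "Karplanter", "Svampe", "Karplanter", "Mosser"],
})

# One row per author, one column per taxonomic group, cells hold the number of plates
counts = pd.crosstab(tidy["author_st"], tidy["taxonomic_group_st"])
print(counts)
```

For the real data, tidy could be loaded with pd.read_csv from the CSV file written in the previous step.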