Text Mining Trykkefrihedens Skrifter#

Script Summary#

This notebook replicates the R text mining analysis for Trykkefrihedens Skrifter (Freedom of the Press Writings) using Python. The analysis explores historical Danish texts to identify linguistic patterns, word frequencies, and contextual usage of keywords.

Steps:

  • Load the structured dataset and project-specific stopwords

  • Tokenize text using regex patterns to extract words

  • Remove stopwords (Danish, German, and project-specific) to focus on meaningful content

  • Perform word frequency analysis on the entire dataset

  • Conduct filtered analysis focusing on specific series and volumes (Række 1, Bind 5+6)

  • Perform keyword-in-context (KWIC) analysis to examine how specific words are used in context

  • Export results to CSV files for further analysis

Outputs: Word frequency tables, filtered analysis results, KWIC tables, and CSV files ready for visualization or further linguistic study.


Install and Import Dependencies#

If running for the first time, uncomment and run the pip cell below.

# !pip install pandas numpy nltk matplotlib seaborn

import pandas as pd
import numpy as np
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns

# Download NLTK data if not already present
# ('stopwords' is used below; 'punkt' is only needed if you
# switch from regex tokenization to NLTK's tokenizers)
nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

Load Data#

Check your data files: you need the main text data and the project-specific stopwords as CSV files.

  • tfs_structured.csv — Main text data

  • tfs_stopord.csv — Project-specific stopwords

You can find them in the corresponding repository: kb-dk/cultural-heritage-data-guides-Python

Here we load the .csv files from our project folder.

The file tfs_structured.csv has 14 columns; the ones used in this analysis are:

  • refnr: Reference number

  • række: Series number

  • bind: Volume number

  • side: Page number

  • content: Text content

# Path to your data files
main_data_path = 'tfs_structured.csv'
stopwords_path = 'tfs_stopord.csv'

# Load main data
try:
    tfs = pd.read_csv(main_data_path)
    print(f'Data loaded. Shape: {tfs.shape}')
except FileNotFoundError:
    print(f'File not found: {main_data_path}')
    tfs = None

# Load project-specific stopwords
try:
    stopord_tfs = pd.read_csv(stopwords_path)['word'].astype(str).str.lower().tolist()
    print(f'Loaded {len(stopord_tfs)} project-specific stopwords.')
except Exception as e:
    print(f'Could not load project-specific stopwords: {e}')
    stopord_tfs = []
Data loaded. Shape: (28133, 14)
Loaded 60 project-specific stopwords.
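
Before tokenizing, it is worth a quick look at the frame to confirm the expected columns are present. A small sanity check (not part of the original R workflow):

# List all columns and preview the five used below
if tfs is not None:
    print(tfs.columns.tolist())
    display(tfs[['refnr', 'række', 'bind', 'side', 'content']].head())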

Tokenization with Regex#

We tokenize the text using the regex pattern \b\S+\b: it matches runs of non-whitespace characters bounded by word boundaries, so punctuation at the token edges is stripped while internal characters such as hyphens are kept. Digit tokens (e.g. 2, 4) survive, which is why they appear in the frequency tables below.
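
A quick illustration of what the pattern keeps and drops (the sample string is invented for demonstration):

sample = 'Trykke-Friheden, Anno 1771: om Bondens Jord!'
print(re.findall(r'\b\S+\b', sample.lower()))
# ['trykke-friheden', 'anno', '1771', 'om', 'bondens', 'jord']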

def regex_tokenize(text):
    # Lowercase the text and extract word-like tokens; punctuation at
    # token edges is dropped, internal hyphens are kept
    return re.findall(r'\b\S+\b', str(text).lower())

def tokenize_text(df, text_column='content'):
    # Build a tidy frame with one row per token, carrying along the
    # metadata columns needed for the filtered analyses below
    tokens_list = []
    for idx, row in df.iterrows():
        tokens = regex_tokenize(row[text_column])
        for token in tokens:
            tokens_list.append({
                'refnr': row.get('refnr', idx),
                'række': row.get('række', None),
                'bind': row.get('bind', None),
                'side': row.get('side', None),
                'word': token
            })
    return pd.DataFrame(tokens_list)

if tfs is not None:
    tfs_tidy = tokenize_text(tfs)
    print(f'Tokenization complete. Total tokens: {len(tfs_tidy)}')
Tokenization complete. Total tokens: 4430589
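
Row-wise iterrows is slow on roughly 28,000 pages. A vectorized alternative using pandas' str.findall and explode builds the same tidy frame much faster (a sketch, assuming the column names above):

# findall yields a list of tokens per row; explode turns each list
# into one row per token
tfs_tidy_fast = (
    tfs[['refnr', 'række', 'bind', 'side']]
    .assign(word=tfs['content'].astype(str).str.lower().str.findall(r'\b\S+\b'))
    .explode('word')
    .dropna(subset=['word'])
    .reset_index(drop=True)
)
print(len(tfs_tidy_fast))  # should match the token count above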

Stopword Handling#

We use the Danish and German stopword lists from NLTK (the collection also contains German-language writings, hence the German list) and add the project-specific stopwords loaded above, which cover, among other things, historical spellings absent from NLTK's modern Danish list.

# Danish and German stopwords
stopord_da = set(stopwords.words('danish'))
stopord_de = set(stopwords.words('german'))
stopord_tfs_set = set(stopord_tfs)

print(f'Danish stopwords: {len(stopord_da)}')
print(f'German stopwords: {len(stopord_de)}')
Danish stopwords: 94
German stopwords: 232
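
NLTK's Danish list contains modern spellings only, so historical forms such as paa and saa (modern på and så) presumably have to come from the project-specific list. A quick membership check:

# Historical orthography is not covered by NLTK's modern Danish list
for w in ['på', 'paa', 'saa']:
    print(w, '| NLTK:', w in stopord_da, '| project list:', w in stopord_tfs_set)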

Word Frequency Analysis (All Data)#

We count the frequency of each word, first including all words, then after removing stopwords.

# All words
if tfs is not None:
    word_counts_all = tfs_tidy['word'].value_counts().reset_index()
    word_counts_all.columns = ['word', 'count']
    display(word_counts_all.head(20))

    # Remove Danish, German, and project-specific stopwords in one pass
    all_stopwords = stopord_da | stopord_de | stopord_tfs_set
    tfs_clean = tfs_tidy[~tfs_tidy['word'].isin(all_stopwords)]
    word_counts_clean = tfs_clean['word'].value_counts().reset_index()
    word_counts_clean.columns = ['word', 'count']
    display(word_counts_clean.head(20))
    word   count
0     og  155742
1     at  104241
2      i   84537
3    den   64810
4     de   63648
5    det   60065
6    som   59769
7    til   55546
8     en   53309
9     er   52174
10   for   45422
11    af   44548
12   der   35973
13   jeg   34870
14  ikke   33852
15   paa   32586
16   med   31378
17   saa   31248
18   han   30741
19   har   28383
      word  count
0      gud   5764
1      tid   4627
2     mere   4425
3     imod   4417
4     vore   4303
5    store   4268
6     folk   4144
7     stor   4069
8    intet   4050
9      aar   3934
10       2   3837
11   andet   3739
12       4   3670
13     bør   3607
14     ret   3443
15    vare   3409
16  første   3407
17  aldrig   3406
18    guds   3341
19  landet   3331
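
matplotlib and seaborn are imported above but not otherwise used; a quick bar chart of the cleaned frequencies (a sketch, the styling choices are arbitrary):

# Plot the 15 most frequent words after stopword removal
top = word_counts_clean.head(15)
fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=top, x='count', y='word', ax=ax)
ax.set_title('Most frequent words, stopwords removed')
plt.tight_layout()
plt.show()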

Filtered Analysis: Række 1, Bind 5+6#

We focus on texts from Række 1 (series 1), Bind 5 and 6 (volumes 5 and 6), which concern Landøkonomi (agricultural economy).

if tfs is not None:
    filter_mask = (tfs_clean['række'] == 1) & (tfs_clean['bind'].isin([5, 6]))
    tfs_filtered = tfs_clean[filter_mask]
    word_counts_filtered = tfs_filtered['word'].value_counts().reset_index()
    word_counts_filtered.columns = ['word', 'count']
    display(word_counts_filtered.head(20))
        word  count
0     bonden    620
1       mere    482
2      intet    417
3        aar    375
4       folk    366
5     landet    352
6        tid    336
7      andet    334
8      penge    296
9      nytte    282
10    mindre    277
11  bønderne    277
12     bedre    276
13    bønder    275
14     bonde    263
15      stor    258
16     deraf    254
17      kand    250
18     store    246
19      imod    246
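
Raw counts are hard to compare between a subset and the full corpus. One common normalization, not part of the original workflow, is frequency per 10,000 tokens:

# Normalize counts so the subset is comparable to the full corpus
word_counts_filtered['per_10k'] = (
    word_counts_filtered['count'] / len(tfs_filtered) * 10_000
).round(2)
display(word_counts_filtered.head(10))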

Keyword-in-Context (KWIC) Analysis#

We examine how the keyword ‘jord’ (‘soil, land’) is used, with a context window of four words on each side. Note that the function below restricts the search to Række 1, Bind 5+6 by default.

def keyword_in_context(df, keyword, window_size=4, filter_række=1, filter_bind=(5, 6)):
    # Collect every occurrence of `keyword` with up to `window_size`
    # words of context on each side, restricted to the given series
    # and volumes (a tuple default avoids the mutable-default pitfall)
    results = []
    filtered = df[(df['række'] == filter_række) & (df['bind'].isin(filter_bind))]
    for idx, row in filtered.iterrows():
        # Tokenize the raw page text so that word order is preserved
        tokens = regex_tokenize(row['content'])
        for i, token in enumerate(tokens):
            if token == keyword:
                start = max(0, i - window_size)
                end = min(len(tokens), i + window_size + 1)
                left = ' '.join(tokens[start:i])
                right = ' '.join(tokens[i+1:end])
                results.append({
                    'docid': f"{row.get('refnr', idx)}- side {row.get('side', '')}",
                    'left_context': left,
                    'keyword': token,
                    'right_context': right
                })
    return pd.DataFrame(results)

if tfs is not None:
    kwic_df = keyword_in_context(tfs, 'jord', window_size=4)
    display(kwic_df.head(10))
            docid                   left_context keyword               right_context
0  1.5.1- side 39         enhver bonde havde sin    jord        for sig selv afdeelt
1  1.5.1- side 40             han skal giøde sin    jord        ei bortføre mere end
2  1.5.1- side 40             naar han havde sin    jord     for sig selv beliggende
3  1.5.2- side 33              nu med steen tang    jord                           c
4  1.5.2- side 38      overlade ham min eyendoms    jord             og grund for at
5  1.5.2- side 50       falde paa den allerbeste    jord   i markerne andres derimod
6  1.5.2- side 50                at han fik mere    jord            i sit maal fordi
7  1.5.2- side 50        megen og slet ufrugtbar    jord          for den som forhen
8  1.5.2- side 50        maaske svare den slette    jord             kan ved flid og
9  1.5.2- side 51  etableres paa den dyrkværdige    jord     som skulde anvendes til
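
The same function works for any token the tokenizer produces; for example, with a wider window and a keyword taken from the filtered frequency table above:

# KWIC for 'bonden' ('the peasant') with 6 words of context per side
kwic_bonden = keyword_in_context(tfs, 'bonden', window_size=6)
print(f'{len(kwic_bonden)} occurrences found')
display(kwic_bonden.head(5))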

Save Results#

You can save the results to CSV files for further analysis.

if tfs is not None:
    word_counts_clean.to_csv('results_word_frequencies_all.csv', index=False)
    word_counts_filtered.to_csv('results_word_frequencies_filtered.csv', index=False)
    kwic_df.to_csv('results_kwic_jord.csv', index=False)
    print('Results saved to CSV files.')
Results saved to CSV files.

Other Studies#

The Trykkefrihedens Skrifter dataset invites several complementary lines of inquiry beyond the current text mining workflow:

  • Expand keyword-in-context analysis to explore additional terms of historical or thematic interest and compare usage patterns across different volumes.

  • Analyze temporal changes in vocabulary by comparing word frequencies and linguistic features across different publication periods.

  • Investigate genre-specific patterns by examining differences in vocabulary between different types of writings in the collection.

  • Apply topic modeling techniques (e.g., LDA) to discover latent themes and topics across the corpus (see the sketch after this list).

  • Explore semantic networks by analyzing co-occurrence patterns of key terms and concepts.

  • Compare linguistic features with other historical Danish text collections to identify distinctive characteristics of freedom of press writings.

  • Analyze the relationship between word frequency and contextual usage to identify both common and specialized terminology.

  • Investigate author-specific patterns if author information is available to explore individual writing styles.
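
As a starting point for the topic modeling suggestion above, here is a minimal sketch using scikit-learn (an extra dependency not used elsewhere in this notebook). Treating each page as a document, and the number of topics and vocabulary size used here, are assumptions to tune, not choices made by the original analysis:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each page is one document; reuse the combined stopword list
docs = tfs['content'].astype(str).tolist()
vectorizer = CountVectorizer(lowercase=True,
                             token_pattern=r'\b\S+\b',
                             stop_words=list(all_stopwords),
                             max_features=5000)
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)

# Print the eight highest-weighted words per topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(f'Topic {k}: {" ".join(top)}')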