Textmining Trykkefrihedens Skrifter#
Script Summary#
This notebook replicates the R text mining analysis for Trykkefrihedens Skrifter (Freedom of the Press Writings) using Python. The analysis explores historical Danish texts to identify linguistic patterns, word frequencies, and contextual usage of keywords.
Steps:
1. Load the structured dataset and the project-specific stopwords
2. Tokenize the text using a regex pattern to extract words
3. Remove stopwords (Danish, German, and project-specific) to focus on meaningful content
4. Perform a word frequency analysis on the entire dataset
5. Conduct a filtered analysis focusing on a specific series and volumes (Række 1, Bind 5 and 6)
6. Perform a keyword-in-context (KWIC) analysis to examine how specific words are used in context
7. Export the results to CSV files for further analysis
Outputs: Word frequency tables, filtered analysis results, KWIC tables, and CSV files ready for visualization or further linguistic study.
Install and Import Dependencies#
If running for the first time, uncomment and run the pip cell below.
# !pip install pandas numpy nltk matplotlib seaborn
import pandas as pd
import numpy as np
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns
# Download NLTK data if not already present
nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
Load Data#
Check your data files: You need the main text data and project-specific stopwords as csv files.
- tfs_structured.csv — Main text data
- tfs_stopord.csv — Project-specific stopwords
You can find them in the corresponding repository: kb-dk/cultural-heritage-data-guides-Python
Or you can load them through these links:
https://raw.githubusercontent.com/maxodsbjerg/TextMiningTFS/refs/heads/main/data/tfs_structured.csv
https://raw.githubusercontent.com/maxodsbjerg/TextMiningTFS/refs/heads/main/data/tfs_stopord.csv
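If the files are not available locally, pandas can read them directly from the raw GitHub URLs above (an internet connection is required). If you load them this way, skip the local-loading cell further down, since that cell sets tfs to None when the local file is missing.

# Alternative: load both files straight from the raw GitHub URLs
base_url = 'https://raw.githubusercontent.com/maxodsbjerg/TextMiningTFS/refs/heads/main/data/'
tfs = pd.read_csv(base_url + 'tfs_structured.csv')
stopord_tfs = pd.read_csv(base_url + 'tfs_stopord.csv')['word'].astype(str).str.lower().tolist()
print(f'Data loaded. Shape: {tfs.shape}; project-specific stopwords: {len(stopord_tfs)}')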
Here we load the .csv files from our project folder.
The Data Structure in tfs_structured.csv is:
- refnr: Reference number
- række: Series number
- bind: Volume number
- side: Page number
- content: Text content
# Path to your data files
main_data_path = 'tfs_structured.csv'
stopwords_path = 'tfs_stopord.csv'
# Load main data
try:
    tfs = pd.read_csv(main_data_path)
    print(f'Data loaded. Shape: {tfs.shape}')
except FileNotFoundError:
    print(f'File not found: {main_data_path}')
    tfs = None
# Load project-specific stopwords
try:
    stopord_tfs = pd.read_csv(stopwords_path)['word'].astype(str).str.lower().tolist()
    print(f'Loaded {len(stopord_tfs)} project-specific stopwords.')
except Exception as e:
    print(f'Could not load project-specific stopwords: {e}')
    stopord_tfs = []
Data loaded. Shape: (28133, 14)
Loaded 60 project-specific stopwords.
Tokenization with Regex#
We tokenize the text using the regex pattern \b\S+\b, which captures runs of non-whitespace characters bounded by word boundaries, so punctuation attached to the start or end of a word is stripped. The text is lowercased before matching.
def regex_tokenize(text):
    # Lowercase the text and extract word-like tokens; surrounding punctuation is dropped
    return re.findall(r'\b\S+\b', str(text).lower())

def tokenize_text(df, text_column='content'):
    # Build a tidy DataFrame with one row per token, keeping the metadata columns
    tokens_list = []
    for idx, row in df.iterrows():
        tokens = regex_tokenize(row[text_column])
        for token in tokens:
            tokens_list.append({
                'refnr': row.get('refnr', idx),
                'række': row.get('række', None),
                'bind': row.get('bind', None),
                'side': row.get('side', None),
                'word': token
            })
    return pd.DataFrame(tokens_list)

if tfs is not None:
    tfs_tidy = tokenize_text(tfs)
    print(f'Tokenization complete. Total tokens: {len(tfs_tidy)}')
Tokenization complete. Total tokens: 4430589
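To see what the pattern does in practice, here is a quick check on an invented sentence (not taken from the corpus): punctuation attached to a word is dropped and everything is lowercased.

# Quick check of the tokenizer on an invented example sentence
sample = 'Enhver Bonde havde sin Jord, for sig selv afdeelt.'
print(regex_tokenize(sample))
# ['enhver', 'bonde', 'havde', 'sin', 'jord', 'for', 'sig', 'selv', 'afdeelt']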
Stopword Handling#
We use Danish and German stopwords from NLTK, and add project-specific stopwords.
# Danish and German stopwords
stopord_da = set(stopwords.words('danish'))
stopord_de = set(stopwords.words('german'))
stopord_tfs_set = set(stopord_tfs)
print(f'Danish stopwords: {len(stopord_da)}')
print(f'German stopwords: {len(stopord_de)}')
Danish stopwords: 94
German stopwords: 232
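If you prefer to filter in one pass, the three stopword lists can be merged into a single set first; the cells below keep the separate steps, so this is only an optional shortcut.

# Optional: merge the Danish, German, and project-specific stopwords into one set
all_stopwords = stopord_da | stopord_de | stopord_tfs_set
print(f'Combined stopwords: {len(all_stopwords)}')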
Word Frequency Analysis (All Data)#
We count the frequency of each word, first including all words, then after removing stopwords.
# All words
if tfs is not None:
    word_counts_all = tfs_tidy['word'].value_counts().reset_index()
    word_counts_all.columns = ['word', 'count']
    display(word_counts_all.head(20))

    # Remove stopwords
    tfs_clean = tfs_tidy[~tfs_tidy['word'].isin(stopord_da)]
    tfs_clean = tfs_clean[~tfs_clean['word'].isin(stopord_de)]
    tfs_clean = tfs_clean[~tfs_clean['word'].isin(stopord_tfs_set)]
    word_counts_clean = tfs_clean['word'].value_counts().reset_index()
    word_counts_clean.columns = ['word', 'count']
    display(word_counts_clean.head(20))
| | word | count |
|---|---|---|
| 0 | og | 155742 |
| 1 | at | 104241 |
| 2 | i | 84537 |
| 3 | den | 64810 |
| 4 | de | 63648 |
| 5 | det | 60065 |
| 6 | som | 59769 |
| 7 | til | 55546 |
| 8 | en | 53309 |
| 9 | er | 52174 |
| 10 | for | 45422 |
| 11 | af | 44548 |
| 12 | der | 35973 |
| 13 | jeg | 34870 |
| 14 | ikke | 33852 |
| 15 | paa | 32586 |
| 16 | med | 31378 |
| 17 | saa | 31248 |
| 18 | han | 30741 |
| 19 | har | 28383 |
| | word | count |
|---|---|---|
| 0 | gud | 5764 |
| 1 | tid | 4627 |
| 2 | mere | 4425 |
| 3 | imod | 4417 |
| 4 | vore | 4303 |
| 5 | store | 4268 |
| 6 | folk | 4144 |
| 7 | stor | 4069 |
| 8 | intet | 4050 |
| 9 | aar | 3934 |
| 10 | 2 | 3837 |
| 11 | andet | 3739 |
| 12 | 4 | 3670 |
| 13 | bør | 3607 |
| 14 | ret | 3443 |
| 15 | vare | 3409 |
| 16 | første | 3407 |
| 17 | aldrig | 3406 |
| 18 | guds | 3341 |
| 19 | landet | 3331 |
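matplotlib and seaborn were imported above but have not been used yet; a quick horizontal bar chart of the top words in word_counts_clean can be drawn as sketched below (the figure size and colour are arbitrary choices).

# Bar chart of the 20 most frequent words after stopword removal
if tfs is not None:
    top20 = word_counts_clean.head(20)
    plt.figure(figsize=(8, 6))
    sns.barplot(data=top20, x='count', y='word', color='steelblue')
    plt.title('Most frequent words (stopwords removed)')
    plt.xlabel('Count')
    plt.ylabel('')
    plt.tight_layout()
    plt.show()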
Filtered Analysis: Række 1, Bind 5+6#
We focus on texts from Række 1 (series 1), Bind 5 and 6, which contain the writings on Landøkonomi (rural economy).
if tfs is not None:
    filter_mask = (tfs_clean['række'] == 1) & (tfs_clean['bind'].isin([5, 6]))
    tfs_filtered = tfs_clean[filter_mask]
    word_counts_filtered = tfs_filtered['word'].value_counts().reset_index()
    word_counts_filtered.columns = ['word', 'count']
    display(word_counts_filtered.head(20))
| | word | count |
|---|---|---|
| 0 | bonden | 620 |
| 1 | mere | 482 |
| 2 | intet | 417 |
| 3 | aar | 375 |
| 4 | folk | 366 |
| 5 | landet | 352 |
| 6 | tid | 336 |
| 7 | andet | 334 |
| 8 | penge | 296 |
| 9 | nytte | 282 |
| 10 | mindre | 277 |
| 11 | bønderne | 277 |
| 12 | bedre | 276 |
| 13 | bønder | 275 |
| 14 | bonde | 263 |
| 15 | stor | 258 |
| 16 | deraf | 254 |
| 17 | kand | 250 |
| 18 | store | 246 |
| 19 | imod | 246 |
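The raw counts above are hard to compare with the full-corpus table because the filtered subset is much smaller. One simple remedy, sketched below, is to express each count as a share of the subset's cleaned tokens; this step is not part of the original workflow.

# Relative frequencies: each word's share of the cleaned tokens in the subset
if tfs is not None:
    rel_freq = word_counts_filtered.copy()
    rel_freq['share'] = rel_freq['count'] / len(tfs_filtered)
    display(rel_freq.head(10))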
Keyword-in-Context (KWIC) Analysis#
We analyze the context of the keyword ‘jord’ (land, soil) within a window of four words on either side, again restricted to Række 1, Bind 5 and 6.
def keyword_in_context(df, keyword, window_size=4, filter_række=1, filter_bind=[5, 6]):
    # Collect every occurrence of the keyword with window_size words of context on each side
    results = []
    filtered = df[(df['række'] == filter_række) & (df['bind'].isin(filter_bind))]
    for idx, row in filtered.iterrows():
        tokens = regex_tokenize(row['content'])
        for i, token in enumerate(tokens):
            if token == keyword:
                start = max(0, i - window_size)
                end = min(len(tokens), i + window_size + 1)
                left = ' '.join(tokens[start:i])
                right = ' '.join(tokens[i+1:end])
                results.append({
                    'docid': f"{row.get('refnr', idx)}- side {row.get('side', '')}",
                    'left_context': left,
                    'keyword': token,
                    'right_context': right
                })
    return pd.DataFrame(results)

if tfs is not None:
    kwic_df = keyword_in_context(tfs, 'jord', window_size=4)
    display(kwic_df.head(10))
| | docid | left_context | keyword | right_context |
|---|---|---|---|---|
| 0 | 1.5.1- side 39 | enhver bonde havde sin | jord | for sig selv afdeelt |
| 1 | 1.5.1- side 40 | han skal giøde sin | jord | ei bortføre mere end |
| 2 | 1.5.1- side 40 | naar han havde sin | jord | for sig selv beliggende |
| 3 | 1.5.2- side 33 | nu med steen tang | jord | c |
| 4 | 1.5.2- side 38 | overlade ham min eyendoms | jord | og grund for at |
| 5 | 1.5.2- side 50 | falde paa den allerbeste | jord | i markerne andres derimod |
| 6 | 1.5.2- side 50 | at han fik mere | jord | i sit maal fordi |
| 7 | 1.5.2- side 50 | megen og slet ufrugtbar | jord | for den som forhen |
| 8 | 1.5.2- side 50 | maaske svare den slette | jord | kan ved flid og |
| 9 | 1.5.2- side 51 | etableres paa den dyrkværdige | jord | som skulde anvendes til |
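The KWIC table can also be summarised: counting which words occur most often in the left and right contexts gives a rough list of collocates for ‘jord’. The sketch below builds on kwic_df and removes the same stopwords; it is an addition, not part of the original R analysis.

# Rough collocation counts: the most frequent words in the contexts around 'jord'
if tfs is not None and not kwic_df.empty:
    context_words = []
    for col in ['left_context', 'right_context']:
        for text in kwic_df[col]:
            context_words.extend(text.split())
    context_counts = Counter(
        w for w in context_words
        if w not in stopord_da and w not in stopord_de and w not in stopord_tfs_set
    )
    print(context_counts.most_common(15))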
Save Results#
You can save the results to CSV files for further analysis.
if tfs is not None:
    word_counts_clean.to_csv('results_word_frequencies_all.csv', index=False)
    word_counts_filtered.to_csv('results_word_frequencies_filtered.csv', index=False)
    kwic_df.to_csv('results_kwic_jord.csv', index=False)
    print('Results saved to CSV files.')
Results saved to CSV files.
Other studies#
The Trykkefrihedens Skrifter dataset invites several complementary lines of inquiry beyond the current text mining workflow:
- Expand the keyword-in-context analysis to explore additional terms of historical or thematic interest and compare usage patterns across volumes.
- Analyze temporal changes in vocabulary by comparing word frequencies and linguistic features across publication periods.
- Investigate genre-specific patterns by examining differences in vocabulary between the different types of writings in the collection.
- Apply topic modeling techniques (e.g., LDA) to discover latent themes and topics across the corpus (see the sketch after this list).
- Explore semantic networks by analyzing co-occurrence patterns of key terms and concepts.
- Compare linguistic features with other historical Danish text collections to identify distinctive characteristics of the freedom of the press writings.
- Analyze the relationship between word frequency and contextual usage to identify both common and specialized terminology.
- Investigate author-specific patterns, if author information is available, to explore individual writing styles.
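As a starting point for the topic modeling suggestion above, the sketch below uses scikit-learn, which is not installed in the pip cell at the top, so it would need pip install scikit-learn. The number of topics, the feature limit, and the random seed are arbitrary example values, not recommendations.

# Minimal LDA sketch on the page-level texts (assumes scikit-learn is available)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

if tfs is not None:
    combined_stopwords = list(stopord_da | stopord_de | stopord_tfs_set)
    vectorizer = CountVectorizer(stop_words=combined_stopwords, max_features=5000)
    dtm = vectorizer.fit_transform(tfs['content'].astype(str))
    lda = LatentDirichletAllocation(n_components=10, random_state=42)
    lda.fit(dtm)
    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_terms = [terms[j] for j in topic.argsort()[-10:][::-1]]
        print(f'Topic {i}: {", ".join(top_terms)}')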