Textmining Trykkefrihedens Skrifter#
Script Summary#
This notebook replicates the R text mining analysis for Trykkefrihedens Skrifter (Freedom of the Press Writings) using Python. The analysis explores historical Danish texts to identify linguistic patterns, word frequencies, and contextual usage of keywords.
Steps:
1. Load the structured dataset and the project-specific stopwords
2. Tokenize the text using a regex pattern to extract words
3. Remove stopwords (Danish, German, and project-specific) to focus on meaningful content
4. Perform a word frequency analysis on the entire dataset
5. Conduct a filtered analysis focusing on a specific series and volumes (Række 1, Bind 5 and 6)
6. Perform a keyword-in-context (KWIC) analysis to examine how specific words are used in context
7. Export the results to CSV files for further analysis
Outputs: Word frequency tables, filtered analysis results, KWIC tables, and CSV files ready for visualization or further linguistic study.
Install and Import Dependencies#
If running for the first time, uncomment and run the pip cell below.
# !pip install pandas numpy nltk matplotlib seaborn
import pandas as pd
import numpy as np
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns
# Download NLTK data if not already present
nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
Load Data#
Check your data files: You need the main text data and project-specific stopwords as csv files.
- tfs_structured.csv — Main text data
- tfs_stopord.csv — Project-specific stopwords
You can find them in the corresponding repository: kb-dk/cultural-heritage-data-guides-Python
Or you can load them through these links:
https://raw.githubusercontent.com/maxodsbjerg/TextMiningTFS/refs/heads/main/data/tfs_structured.csv
https://raw.githubusercontent.com/maxodsbjerg/TextMiningTFS/refs/heads/main/data/tfs_stopord.csv
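If the files are not available locally, pandas can read them directly from the raw GitHub URLs above (an internet connection is required). If you load them this way, skip the local-loading cell further down, since that cell sets tfs to None when the local file is missing.

# Alternative: load both files straight from the raw GitHub URLs
base_url = 'https://raw.githubusercontent.com/maxodsbjerg/TextMiningTFS/refs/heads/main/data/'
tfs = pd.read_csv(base_url + 'tfs_structured.csv')
stopord_tfs = pd.read_csv(base_url + 'tfs_stopord.csv')['word'].astype(str).str.lower().tolist()
print(f'Data loaded. Shape: {tfs.shape}; project-specific stopwords: {len(stopord_tfs)}')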
Here we load the .csv files from our project folder.
The Data Structure in tfs_structured.csv is:
- refnr: Reference number
- række: Series number
- bind: Volume number
- side: Page number
- content: Text content
# Path to your data files
main_data_path = 'tfs_structured.csv'
stopwords_path = 'tfs_stopord.csv'
# Load main data
try:
    tfs = pd.read_csv(main_data_path)
    print(f'Data loaded. Shape: {tfs.shape}')
except FileNotFoundError:
    print(f'File not found: {main_data_path}')
    tfs = None
# Load project-specific stopwords
try:
    stopord_tfs = pd.read_csv(stopwords_path)['word'].astype(str).str.lower().tolist()
    print(f'Loaded {len(stopord_tfs)} project-specific stopwords.')
except Exception as e:
    print(f'Could not load project-specific stopwords: {e}')
    stopord_tfs = []
Data loaded. Shape: (28133, 14)
Loaded 60 project-specific stopwords.
Tokenization with Regex#
We tokenize the text using the regex pattern \b\S+\b, which captures runs of non-whitespace characters bounded by word boundaries, so punctuation attached to the start or end of a word is stripped. The text is lowercased before matching.
def regex_tokenize(text):
    # Lowercase the text and extract word-like tokens; surrounding punctuation is dropped
    return re.findall(r'\b\S+\b', str(text).lower())

def tokenize_text(df, text_column='content'):
    # Build a tidy DataFrame with one row per token, keeping the metadata columns
    tokens_list = []
    for idx, row in df.iterrows():
        tokens = regex_tokenize(row[text_column])
        for token in tokens:
            tokens_list.append({
                'refnr': row.get('refnr', idx),
                'række': row.get('række', None),
                'bind': row.get('bind', None),
                'side': row.get('side', None),
                'word': token
            })
    return pd.DataFrame(tokens_list)

if tfs is not None:
    tfs_tidy = tokenize_text(tfs)
    print(f'Tokenization complete. Total tokens: {len(tfs_tidy)}')
Tokenization complete. Total tokens: 4430589
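To see what the pattern does in practice, here is a quick check on an invented sentence (not taken from the corpus): punctuation attached to a word is dropped and everything is lowercased.

# Quick check of the tokenizer on an invented example sentence
sample = 'Enhver Bonde havde sin Jord, for sig selv afdeelt.'
print(regex_tokenize(sample))
# ['enhver', 'bonde', 'havde', 'sin', 'jord', 'for', 'sig', 'selv', 'afdeelt']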
Stopword Handling#
We use Danish and German stopwords from NLTK, and add project-specific stopwords.
# Danish and German stopwords
stopord_da = set(stopwords.words('danish'))
stopord_de = set(stopwords.words('german'))
stopord_tfs_set = set(stopord_tfs)
print(f'Danish stopwords: {len(stopord_da)}')
print(f'German stopwords: {len(stopord_de)}')
Danish stopwords: 94
German stopwords: 232
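If you prefer to filter in one pass, the three stopword lists can be merged into a single set first; the cells below keep the separate steps, so this is only an optional shortcut.

# Optional: merge the Danish, German, and project-specific stopwords into one set
all_stopwords = stopord_da | stopord_de | stopord_tfs_set
print(f'Combined stopwords: {len(all_stopwords)}')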
Word Frequency Analysis (All Data)#
We count the frequency of each word, first including all words, then after removing stopwords.
# All words
if tfs is not None:
    word_counts_all = tfs_tidy['word'].value_counts().reset_index()
    word_counts_all.columns = ['word', 'count']
    display(word_counts_all.head(20))

    # Remove stopwords
    tfs_clean = tfs_tidy[~tfs_tidy['word'].isin(stopord_da)]
    tfs_clean = tfs_clean[~tfs_clean['word'].isin(stopord_de)]
    tfs_clean = tfs_clean[~tfs_clean['word'].isin(stopord_tfs_set)]
    word_counts_clean = tfs_clean['word'].value_counts().reset_index()
    word_counts_clean.columns = ['word', 'count']
    display(word_counts_clean.head(20))
| | word | count |
|---|---|---|
| 0 | og | 155742 |
| 1 | at | 104241 |
| 2 | i | 84537 |
| 3 | den | 64810 |
| 4 | de | 63648 |
| 5 | det | 60065 |
| 6 | som | 59769 |
| 7 | til | 55546 |
| 8 | en | 53309 |
| 9 | er | 52174 |
| 10 | for | 45422 |
| 11 | af | 44548 |
| 12 | der | 35973 |
| 13 | jeg | 34870 |
| 14 | ikke | 33852 |
| 15 | paa | 32586 |
| 16 | med | 31378 |
| 17 | saa | 31248 |
| 18 | han | 30741 |
| 19 | har | 28383 |
| | word | count |
|---|---|---|
| 0 | gud | 5764 |
| 1 | tid | 4627 |
| 2 | mere | 4425 |
| 3 | imod | 4417 |
| 4 | vore | 4303 |
| 5 | store | 4268 |
| 6 | folk | 4144 |
| 7 | stor | 4069 |
| 8 | intet | 4050 |
| 9 | aar | 3934 |
| 10 | 2 | 3837 |
| 11 | andet | 3739 |
| 12 | 4 | 3670 |
| 13 | bør | 3607 |
| 14 | ret | 3443 |
| 15 | vare | 3409 |
| 16 | første | 3407 |
| 17 | aldrig | 3406 |
| 18 | guds | 3341 |
| 19 | landet | 3331 |
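matplotlib and seaborn were imported above but have not been used yet; a quick horizontal bar chart of the top words in word_counts_clean can be drawn as sketched below (the figure size and colour are arbitrary choices).

# Bar chart of the 20 most frequent words after stopword removal
if tfs is not None:
    top20 = word_counts_clean.head(20)
    plt.figure(figsize=(8, 6))
    sns.barplot(data=top20, x='count', y='word', color='steelblue')
    plt.title('Most frequent words (stopwords removed)')
    plt.xlabel('Count')
    plt.ylabel('')
    plt.tight_layout()
    plt.show()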
Filtered Analysis: Række 1, Bind 5+6#
We focus on texts from Række 1 (series 1), Bind 5 and 6, which contain the writings on Landøkonomi (rural economy).
if tfs is not None:
    filter_mask = (tfs_clean['række'] == 1) & (tfs_clean['bind'].isin([5, 6]))
    tfs_filtered = tfs_clean[filter_mask]
    word_counts_filtered = tfs_filtered['word'].value_counts().reset_index()
    word_counts_filtered.columns = ['word', 'count']
    display(word_counts_filtered.head(20))
| | word | count |
|---|---|---|
| 0 | bonden | 620 |
| 1 | mere | 482 |
| 2 | intet | 417 |
| 3 | aar | 375 |
| 4 | folk | 366 |
| 5 | landet | 352 |
| 6 | tid | 336 |
| 7 | andet | 334 |
| 8 | penge | 296 |
| 9 | nytte | 282 |
| 10 | mindre | 277 |
| 11 | bønderne | 277 |
| 12 | bedre | 276 |
| 13 | bønder | 275 |
| 14 | bonde | 263 |
| 15 | stor | 258 |
| 16 | deraf | 254 |
| 17 | kand | 250 |
| 18 | store | 246 |
| 19 | imod | 246 |
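The raw counts above are hard to compare with the full-corpus table because the filtered subset is much smaller. One simple remedy, sketched below, is to express each count as a share of the subset's cleaned tokens; this step is not part of the original workflow.

# Relative frequencies: each word's share of the cleaned tokens in the subset
if tfs is not None:
    rel_freq = word_counts_filtered.copy()
    rel_freq['share'] = rel_freq['count'] / len(tfs_filtered)
    display(rel_freq.head(10))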
Keyword-in-Context (KWIC) Analysis#
We analyze the context of the keyword ‘jord’ (land, soil) within a window of four words on either side, again restricted to Række 1, Bind 5 and 6.
def keyword_in_context(df, keyword, window_size=4, filter_række=1, filter_bind=[5, 6]):
    # Collect every occurrence of the keyword with window_size words of context on each side
    results = []
    filtered = df[(df['række'] == filter_række) & (df['bind'].isin(filter_bind))]
    for idx, row in filtered.iterrows():
        tokens = regex_tokenize(row['content'])
        for i, token in enumerate(tokens):
            if token == keyword:
                start = max(0, i - window_size)
                end = min(len(tokens), i + window_size + 1)
                left = ' '.join(tokens[start:i])
                right = ' '.join(tokens[i+1:end])
                results.append({
                    'docid': f"{row.get('refnr', idx)}- side {row.get('side', '')}",
                    'left_context': left,
                    'keyword': token,
                    'right_context': right
                })
    return pd.DataFrame(results)

if tfs is not None:
    kwic_df = keyword_in_context(tfs, 'jord', window_size=4)
    display(kwic_df.head(10))
| | docid | left_context | keyword | right_context |
|---|---|---|---|---|
| 0 | 1.5.1- side 39 | enhver bonde havde sin | jord | for sig selv afdeelt |
| 1 | 1.5.1- side 40 | han skal giøde sin | jord | ei bortføre mere end |
| 2 | 1.5.1- side 40 | naar han havde sin | jord | for sig selv beliggende |
| 3 | 1.5.2- side 33 | nu med steen tang | jord | c |
| 4 | 1.5.2- side 38 | overlade ham min eyendoms | jord | og grund for at |
| 5 | 1.5.2- side 50 | falde paa den allerbeste | jord | i markerne andres derimod |
| 6 | 1.5.2- side 50 | at han fik mere | jord | i sit maal fordi |
| 7 | 1.5.2- side 50 | megen og slet ufrugtbar | jord | for den som forhen |
| 8 | 1.5.2- side 50 | maaske svare den slette | jord | kan ved flid og |
| 9 | 1.5.2- side 51 | etableres paa den dyrkværdige | jord | som skulde anvendes til |
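The KWIC table can also be summarised: counting which words occur most often in the left and right contexts gives a rough list of collocates for ‘jord’. The sketch below builds on kwic_df and removes the same stopwords; it is an addition, not part of the original R analysis.

# Rough collocation counts: the most frequent words in the contexts around 'jord'
if tfs is not None and not kwic_df.empty:
    context_words = []
    for col in ['left_context', 'right_context']:
        for text in kwic_df[col]:
            context_words.extend(text.split())
    context_counts = Counter(
        w for w in context_words
        if w not in stopord_da and w not in stopord_de and w not in stopord_tfs_set
    )
    print(context_counts.most_common(15))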
Save Results#
You can save the results to CSV files for further analysis.
if tfs is not None:
    word_counts_clean.to_csv('results_word_frequencies_all.csv', index=False)
    word_counts_filtered.to_csv('results_word_frequencies_filtered.csv', index=False)
    kwic_df.to_csv('results_kwic_jord.csv', index=False)
    print('Results saved to CSV files.')
Results saved to CSV files.
Other studies#
The Trykkefrihedens Skrifter dataset invites several complementary lines of inquiry beyond the current text mining workflow:
- Expand the keyword-in-context analysis to explore additional terms of historical or thematic interest and compare usage patterns across volumes.
- Analyze temporal changes in vocabulary by comparing word frequencies and linguistic features across publication periods.
- Investigate genre-specific patterns by examining differences in vocabulary between the different types of writings in the collection.
- Apply topic modeling techniques (e.g., LDA) to discover latent themes and topics across the corpus (see the sketch after this list).
- Explore semantic networks by analyzing co-occurrence patterns of key terms and concepts.
- Compare linguistic features with other historical Danish text collections to identify distinctive characteristics of the freedom of the press writings.
- Analyze the relationship between word frequency and contextual usage to identify both common and specialized terminology.
- Investigate author-specific patterns, if author information is available, to explore individual writing styles.
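As a starting point for the topic modeling suggestion above, the sketch below uses scikit-learn, which is not installed in the pip cell at the top, so it would need pip install scikit-learn. The number of topics, the feature limit, and the random seed are arbitrary example values, not recommendations.

# Minimal LDA sketch on the page-level texts (assumes scikit-learn is available)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

if tfs is not None:
    combined_stopwords = list(stopord_da | stopord_de | stopord_tfs_set)
    vectorizer = CountVectorizer(stop_words=combined_stopwords, max_features=5000)
    dtm = vectorizer.fit_transform(tfs['content'].astype(str))
    lda = LatentDirichletAllocation(n_components=10, random_state=42)
    lda.fit(dtm)
    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_terms = [terms[j] for j in topic.argsort()[-10:][::-1]]
        print(f'Topic {i}: {", ".join(top_terms)}')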