Bigram Analysis of Trykkefrihedens Skrifter#

Script Summary#

This notebook performs bigram (word pair) analysis on Trykkefrihedens Skrifter in Python, inspired by the R workflow. Bigram analysis identifies frequently co-occurring word pairs to reveal linguistic patterns and phrase structures in historical Danish texts.

Steps:

  • Load the structured dataset and project-specific stopwords

  • Create bigrams (word pairs) for each text in the dataset

  • Count the frequency of each bigram across the corpus

  • Filter out bigrams containing stopwords to focus on meaningful word combinations

  • Search for specific bigrams matching patterns (e.g., words starting with ‘gud’)

  • Visualize frequent bigrams as network graphs showing word relationships

  • Export results to CSV files and save network visualizations

Outputs: Bigram frequency tables, filtered bigram results, pattern-matched bigrams, network graph visualizations, and CSV files for further analysis.


Install and Import Dependencies#

If you’re missing any packages, uncomment and run the pip cell below.

# !pip install pandas numpy nltk matplotlib seaborn networkx

import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lakj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

Load Data#

Check your data files: you need the main text data and the project-specific stopwords as CSV files.

  • tfs_structured.csv — Main text data

  • tfs_stopord.csv — Project-specific stopwords

You can find them (and download them directly) in the corresponding repository: kb-dk/cultural-heritage-data-guides-Python

Here we load the .csv files from our project folder.

The data structure in tfs_structured.csv includes:

  • refnr: Reference number

  • række: Series number

  • bind: Volume number

  • side: Page number

  • content: Text content

With the data in the right folder, we load the main data and the stopwords from the CSV files.

main_data_path = 'tfs_structured.csv'
stopwords_path = 'tfs_stopord.csv'

try:
    tfs = pd.read_csv(main_data_path)
    print(f'Data loaded. Shape: {tfs.shape}')
except FileNotFoundError:
    print(f'File not found: {main_data_path}')
    tfs = None

try:
    stopord_tfs = pd.read_csv(stopwords_path)['word'].astype(str).str.lower().tolist()
    print(f'Loaded {len(stopord_tfs)} project-specific stopwords.')
except Exception as e:
    print(f'Could not load project-specific stopwords: {e}')
    stopord_tfs = []

# Optional extra stopwords list from URL (can be added if desired)
# stopord_url = "https://gist.githubusercontent.com/maxodsbjerg/4d1e3b1081ebba53a8d2c3aae2a1a070/raw/e1f63b4c81c15bb58a54a2f94673c97d75fe6a74/stopord_18.csv"
# stopord_extra = pd.read_csv(stopord_url)['word'].astype(str).str.lower().tolist()
# stopord_tfs += stopord_extra
Data loaded. Shape: (28133, 14)
Loaded 60 project-specific stopwords.
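
Before moving on, it can be useful to check that the columns described above are actually present. A minimal sanity check, based on the column names listed in this section:

# Verify that the columns used later in this notebook exist in the loaded data
expected_cols = {'refnr', 'række', 'bind', 'side', 'content'}
if tfs is not None:
    missing = expected_cols - set(tfs.columns)
    print(f'Missing columns: {missing}' if missing else 'All expected columns are present.')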

Formation of Bigrams#

We create bigrams (word pairs) for each text.

def regex_tokenize(text):
    # Simple tokenization: lowercase the text and extract word-like tokens;
    # can be extended if needed
    return re.findall(r'\b\S+\b', str(text).lower())

def make_bigrams(tokens):
    return [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

# Create bigrams for all rows
if tfs is not None:
    bigrams_list = []
    for idx, row in tfs.iterrows():
        tokens = regex_tokenize(row['content'])
        bigrams = make_bigrams(tokens)
        for bg in bigrams:
            bigrams_list.append({'refnr': row.get('refnr', idx),
                                 'række': row.get('række', None),
                                 'bind': row.get('bind', None),
                                 'side': row.get('side', None),
                                 'word1': bg[0],
                                 'word2': bg[1]})

    tfs_bigrams = pd.DataFrame(bigrams_list)
    print(f'Number of bigrams: {len(tfs_bigrams)}')
    display(tfs_bigrams.head())
else:
    print('Data not loaded. Please ensure tfs_structured.csv is available.')
    tfs_bigrams = pd.DataFrame()
Number of bigrams: 4402472
refnr række bind side word1 word2
0 1.1.1 1 1 1 philopatreias trende
1 1.1.1 1 1 1 trende anmærkninger
2 1.1.1 1 1 1 anmærkninger i
3 1.1.1 1 1 1 i om
4 1.1.1 1 1 1 om de
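
To see what the two helper functions do, here is a quick check on a short made-up Danish phrase (the example string is ours, not taken from the corpus):

sample = 'trende anmærkninger om de nye skrifter'
print(regex_tokenize(sample))
print(make_bigrams(regex_tokenize(sample)))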

Count Bigrams#

We count the frequency of each bigram (word pair).

bigram_counts = tfs_bigrams.groupby(['word1', 'word2']).size().reset_index(name='count')
bigram_counts = bigram_counts.sort_values('count', ascending=False)
display(bigram_counts.head(20))
word1 word2 count
1417738 til at 10099
335398 det er 6434
507506 for at 6155
130779 at de 5393
444015 er det 3924
132588 at han 3605
759583 i det 3584
759543 i den 3010
759924 i en 3003
130867 at det 2853
1071406 og at 2728
1328585 som de 2636
1074505 og det 2563
130821 at den 2497
61524 af de 2487
1328912 som en 2445
1081091 og i 2432
137823 at være 2378
133386 at jeg 2354
1074337 og de 2252
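
Building the full 4.4-million-row bigram table is convenient because we reuse it for filtering below, but if you only need the raw counts, a streaming approach with collections.Counter avoids holding one row per bigram occurrence in memory. A minimal sketch of that alternative:

from collections import Counter

# Count bigrams directly, without materialising one DataFrame row per occurrence
bigram_counter = Counter()
for text in tfs['content']:
    tokens = regex_tokenize(text)
    bigram_counter.update(zip(tokens, tokens[1:]))

print(bigram_counter.most_common(5))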

Remove Bigrams Where One of the Words is a Stopword#

We remove bigrams where either word1 or word2 is a stopword (Danish, German, or project-specific).

stopord_da = set(stopwords.words('danish'))
stopord_de = set(stopwords.words('german'))
stopord_all = stopord_da | stopord_de | set(stopord_tfs)

bigrams_filtered = tfs_bigrams[
    (~tfs_bigrams['word1'].isin(stopord_all)) &
    (~tfs_bigrams['word2'].isin(stopord_all))
]
print(f'Number of bigrams after stopword filtering: {len(bigrams_filtered)}')
Number of bigrams after stopword filtering: 966952
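
If you spot corpus-specific noise words in the results, stopord_all is an ordinary Python set and can be extended before re-running the filtering step. A small example (the two words here are just illustrations, adjust them to your corpus):

# Extend the combined stopword set with custom words, then re-run the filter above
stopord_all |= {'thi', 'udi'}  # illustrative additions, not part of the original lists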

Count Filtered Bigrams#

We now count bigrams without stopwords.

bigram_counts_filtered = bigrams_filtered.groupby(['word1', 'word2']).size().reset_index(name='count')
bigram_counts_filtered = bigram_counts_filtered.sort_values('count', ascending=False)
display(bigram_counts_filtered.head(20))
word1 word2 count
358413 kiøbenhavn 1771 477
552523 skilling stor 353
9418 1771 trykt 318
560772 slet intet 305
358686 kiøbenhavn trykt 275
9615 1772 trykt 272
524686 s v 266
461193 o s 266
340844 intet andet 262
381451 lang tid 251
358414 kiøbenhavn 1772 228
368694 kort tid 220
364293 kong christian 218
200040 findes tilkiøbs 207
191 1 ark 206
349143 junior philopatreias 200
839 1/2 ark 195
392811 ligesaa lidet 182
23815 4 skilling 176
581354 stor 1 175

Search for Specific Bigrams (e.g., where word2 matches a pattern)#

We can filter bigrams where word2 matches a specific pattern, e.g., words starting with ‘gud’.

# Words starting with 'gud'; \b ensures the match begins at a word boundary
pattern = r'\bgud[a-zæø]*'
match_bigrams = bigrams_filtered[bigrams_filtered['word2'].str.contains(pattern, regex=True)]
match_counts = match_bigrams.groupby(['word1', 'word2']).size().reset_index(name='count')
match_counts = match_counts.sort_values('count', ascending=False)
display(match_counts.head(20))
word1 word2 count
1523 o gud 95
1145 imod gud 83
1935 store gud 47
2011 takke gud 38
816 frygte gud 35
993 herre gud 32
1195 jordens guder 31
1148 imod guds 31
913 gode gud 29
1791 sande gud 28
1794 sande guds 21
821 frygter gud 21
531 elske gud 21
2042 tiene gud 19
241 bede gud 19
1217 kiende gud 18
1706 retfærdige gud 17
159 almægtige gud 16
937 gudernes gud 16
1787 sand gudsfrygt 16

Visualization as Network Graph#

We can visualize the most frequent bigrams as a network graph.

# Keep only bigrams that occur more than the threshold (here 8 times)
threshold = 8
graph_data = match_counts[match_counts['count'] > threshold]

G = nx.DiGraph()
for _, row in graph_data.iterrows():
    G.add_edge(row['word1'], row['word2'], weight=row['count'])

fig = plt.figure(figsize=(12, 8))  # keep a handle so the figure can be saved later
pos = nx.spring_layout(G, k=0.5, seed=42)  # fixed seed for a reproducible layout
edges = G.edges()
weights = [G[u][v]['weight'] for u,v in edges]
nx.draw(G, pos, with_labels=True, node_color='lightgreen', edge_color=weights, width=2.0, edge_cmap=plt.cm.Greens, arrows=True)
plt.title('Network Graph of Frequent Bigrams')
plt.show()
[Figure: network graph of frequent bigrams]

Save Graph and Results#

You can save the results as CSV files and the network graph as a PNG image.

import os

bigram_counts_filtered.to_csv('results_bigrams_filtered.csv', index=False)
match_counts.to_csv('results_bigrams_match.csv', index=False)

# Save the network figure via the handle from the visualization cell; calling
# plt.savefig() here would save a new, empty figure because the plot has already been shown
os.makedirs('graphics', exist_ok=True)
fig.savefig('graphics/bigram_network.png', bbox_inches='tight', dpi=150)
print('Results and graph saved.')
Results and graph saved.

Other Studies#

The bigram analysis of Trykkefrihedens Skrifter can be extended with several complementary approaches:

  • Expand pattern matching to explore other semantic domains (e.g., political terms, social concepts, religious language) and identify distinctive phrase patterns.

  • Analyze temporal changes in bigram usage by comparing frequent word pairs across different publication periods.

  • Investigate trigram and n-gram patterns to explore longer phrase structures and multi-word expressions.

  • Apply network analysis techniques to identify central nodes and communities in the bigram network (a minimal sketch follows this list).

  • Compare bigram patterns with other historical Danish text collections to identify genre-specific or period-specific linguistic features.

  • Explore the relationship between bigram frequency and semantic coherence to identify meaningful versus coincidental co-occurrences.

  • Analyze bigram patterns in specific volumes or series to identify thematic or stylistic variations.

  • Integrate bigram analysis with topic modeling to understand how word pairs contribute to broader thematic structures.
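
As a starting point for the network-analysis suggestion above, here is a minimal sketch using NetworkX's built-in degree centrality on the bigram graph G from the visualization section (assuming that cell has been run):

# Which words are connected to the most other words in the bigram graph?
centrality = nx.degree_centrality(G)
top_nodes = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:10]
for word, score in top_nodes:
    print(f'{word}: {score:.3f}')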