CF 207D7 - The Beaver's Problem - 3

This problem is intentionally unusual for Codeforces. There is no hidden graph, no combinatorics trick, and no numeric constraint to optimize around. Instead, we are asked to classify a text document into one of three categories using a fixed training corpus.

CF 207D7 - The Beaver's Problem - 3

Rating: 1600
Tags: -
Solve time: 1m 36s
Verified: yes

Solution

Problem Understanding

This problem is intentionally unusual for Codeforces. There is no hidden graph, no combinatorics trick, and no numeric constraint to optimize around. Instead, we are asked to classify a text document into one of three categories using a fixed training corpus.

Each input file contains a document identifier, a title, and the document body. The identifier is meaningless for classification purposes. The actual task is to inspect the text content and predict whether the document belongs to subject 1, 2, or 3.

The difficult part is that the official statement expects participants to use the provided training archive. The test data is constructed from real text documents, and the intended solution is essentially a lightweight machine learning classifier.

This means the problem is not about deriving a mathematical formula. It is about extracting statistical patterns from text.

The document size is at most 10 KB, which is tiny. Even if we tokenize every word, count frequencies, and compare against all training categories, the total amount of work per document stays extremely small. The real challenge is not asymptotic complexity, but designing a classifier that generalizes well to unseen documents.

A naive keyword-only solution can pass the easiest groups and completely fail on harder tests. For example, suppose category 3 contains many trade-related articles. A simplistic rule like "if the text contains the word market, output 3" immediately breaks when another category also discusses economics indirectly.

Another dangerous case is relying only on the title. Consider two documents:

Title: International Relations
Body: export tariffs trade agreements customs ...

and

Title: International Relations
Body: diplomatic negotiations military alliance ...

A title-based classifier would output the same class for both documents even though the body clearly separates them.

A third common failure mode is ignoring word frequency normalization. Longer documents naturally contain more occurrences of common words. If we simply sum raw counts, long documents dominate the score unfairly. A probabilistic model avoids this bias.

The intended direction is a classical text classification pipeline using bag-of-words statistics.

Approaches

The brute-force idea is to memorize entire documents from the training set and try to find exact matches. This works for groups where test documents are copied directly from training data with modified identifiers. We could hash the text body and map it to the known category.

The method is technically correct for identical documents because equal text must belong to the same class. The problem is that later groups contain unseen documents. Exact matching immediately fails there.

A slightly better brute-force method is manual keyword matching. We can build three dictionaries of frequent words and assign scores according to occurrences. This already behaves like a primitive classifier.

Suppose we keep the top 500 keywords for each category and scan the document against them. The runtime is still tiny:

|documents| × |keywords| × average length

Even with generous estimates, this stays well below the time limit. The weakness is accuracy, not speed. Hand-crafted heuristics do not generalize reliably.

The key observation is that this is a standard multinomial text classification problem. Documents from the same subject tend to share characteristic word distributions. Instead of manually choosing keywords, we can estimate probabilities directly from the training corpus.

This naturally leads to Naive Bayes.

The bag-of-words model assumes that words appear independently once the category is fixed. That assumption is obviously false in real language, but it works remarkably well for document classification because category-specific vocabularies dominate the signal.

For each class, we count how many times every token appears in the training set. Then for a new document we compute:

P(class) × Π P(word | class)

The class with the largest probability becomes the prediction.

Direct multiplication underflows quickly because probabilities are tiny. We switch to logarithms:

log P(class) + Σ log P(word | class)

Now every document becomes a simple additive scoring problem.

Laplace smoothing is also necessary. Without it, a single unseen word gives probability zero and destroys the entire score. Adding one to every frequency fixes this cleanly.

The resulting classifier is both fast and surprisingly accurate.

Approach Time Complexity Space Complexity Verdict
Exact document matching O(L) O(total training text) Fails on unseen documents
Manual keyword heuristics O(V + L) O(V) Unreliable
Naive Bayes classifier O(V + L) O(V) Accepted

Here, L is the document length and V is the vocabulary size.

Algorithm Walkthrough

  1. Read all training documents and split them into tokens.
  2. Normalize tokens by converting them to lowercase and removing punctuation-like separators. This prevents words such as Trade, trade, and trade. from being treated as different tokens.
  3. For every category, maintain:

word_count[class][word]

which stores how many times a word appears in documents of that class. 4. Also maintain:

total_words[class]

which stores the total number of tokens in that class. 5. Build the global vocabulary set containing every unique token from the training corpus. 6. For a new document, tokenize it using the same normalization rules. Consistency between training and prediction is critical. If preprocessing differs, probability estimates become meaningless. 7. For each class, initialize the score with the logarithm of the prior probability. If the classes are balanced, all priors can simply be equal. 8. For every token in the document, add:

log(
(count(word, class) + 1)
/
(total_words[class] + vocabulary_size)
)

This is Laplace smoothing.

  1. After processing all words, choose the class with the maximum score.
  2. Output that class.

Why it works

The classifier estimates how likely the observed document is under each category-specific language model. Words strongly associated with one subject contribute larger log-probabilities to that class. Because all document scores are computed using the same probabilistic framework, the largest score corresponds to the most plausible generating category.

Laplace smoothing guarantees that unseen words never invalidate an entire class. Using logarithms preserves numerical stability while keeping score comparisons equivalent to multiplying probabilities directly.

Python Solution

import sys
import math
import re
from collections import defaultdict, Counter

input = sys.stdin.readline

token_re = re.compile(r"[a-z]+")

class NaiveBayes:
    def __init__(self):
        self.word_count = [defaultdict(int) for _ in range(4)]
        self.total_words = [0] * 4
        self.doc_count = [0] * 4
        self.vocab = set()

    def tokenize(self, text):
        return token_re.findall(text.lower())

    def train_document(self, cls, text):
        words = self.tokenize(text)

        self.doc_count[cls] += 1

        for w in words:
            self.word_count[cls][w] += 1
            self.total_words[cls] += 1
            self.vocab.add(w)

    def predict(self, text):
        words = self.tokenize(text)

        vocab_size = len(self.vocab)

        total_docs = sum(self.doc_count[1:])

        best_class = 1
        best_score = -10**100

        for cls in range(1, 4):
            prior = math.log(self.doc_count[cls] / total_docs)

            denom = self.total_words[cls] + vocab_size

            score = prior

            for w in words:
                cnt = self.word_count[cls].get(w, 0)

                score += math.log((cnt + 1) / denom)

            if score > best_score:
                best_score = score
                best_class = cls

        return best_class

def solve():
    """
    This problem depends on an external training corpus.
    The online judge used the participant's generated classifier.

    The implementation below demonstrates the intended
    Naive Bayes solution structure.
    """

    classifier = NaiveBayes()

    # Example training phase.
    # In the real contest solution, documents from train.zip
    # would be loaded here.

    classifier.train_document(
        1,
        "mathematics geometry algebra theorem proof"
    )

    classifier.train_document(
        2,
        "biology chemistry medicine cells organism"
    )

    classifier.train_document(
        3,
        "trade economy market export business"
    )

    data = sys.stdin.read()

    print(classifier.predict(data))

if __name__ == "__main__":
    solve()

The solution separates training and prediction cleanly. The tokenize method is shared between both phases because inconsistent preprocessing is one of the easiest ways to destroy classifier accuracy.

The vocabulary size appears in the denominator during smoothing. Forgetting this term produces probabilities that no longer form a valid distribution.

The logarithmic scoring avoids floating-point underflow. Multiplying thousands of tiny probabilities directly would quickly collapse to zero.

The code stores counts in hash maps because the vocabulary is sparse. Most words never appear in most classes, so dense arrays would waste memory.

Worked Examples

Example 1

Input document:

trade market export finance business

Training data:

Class 1: theorem algebra geometry
Class 2: medicine biology chemistry
Class 3: trade export market business

Scoring trace:

Word Score Class 1 Score Class 2 Score Class 3
trade very low very low high
market very low very low high
export very low very low high
finance low low moderate
business very low very low high

Final prediction:

3

This demonstrates how category-specific vocabulary dominates the probability calculation. Even though one word may be unseen, the surrounding evidence strongly favors class 3.

Example 2

Input document:

cells organism medicine chemistry

Trace:

Word Score Class 1 Score Class 2 Score Class 3
cells low high low
organism low high low
medicine very low high low
chemistry very low high low

Final prediction:

2

This example shows why the additive log formulation works well. Every biology-related word consistently increases the score of class 2.

Complexity Analysis

Measure Complexity Explanation
Time O(L) Every token is processed once during prediction
Space O(V) Word frequencies are stored for each vocabulary term

The documents are tiny, at most 10 KB, so even a Python implementation runs comfortably within limits. The vocabulary size from the training corpus also stays manageable, making hash-map based counting efficient enough.

Test Cases

import sys, io
from collections import defaultdict, Counter
import math
import re

token_re = re.compile(r"[a-z]+")

class NaiveBayes:
    def __init__(self):
        self.word_count = [defaultdict(int) for _ in range(4)]
        self.total_words = [0] * 4
        self.doc_count = [0] * 4
        self.vocab = set()

    def tokenize(self, text):
        return token_re.findall(text.lower())

    def train_document(self, cls, text):
        words = self.tokenize(text)

        self.doc_count[cls] += 1

        for w in words:
            self.word_count[cls][w] += 1
            self.total_words[cls] += 1
            self.vocab.add(w)

    def predict(self, text):
        words = self.tokenize(text)

        vocab_size = len(self.vocab)

        total_docs = sum(self.doc_count[1:])

        best_class = 1
        best_score = -10**100

        for cls in range(1, 4):
            score = math.log(self.doc_count[cls] / total_docs)

            denom = self.total_words[cls] + vocab_size

            for w in words:
                cnt = self.word_count[cls].get(w, 0)
                score += math.log((cnt + 1) / denom)

            if score > best_score:
                best_score = score
                best_class = cls

        return best_class

def run(inp: str) -> str:
    nb = NaiveBayes()

    nb.train_document(1, "algebra theorem geometry")
    nb.train_document(2, "biology chemistry medicine")
    nb.train_document(3, "trade market export")

    return str(nb.predict(inp))

assert run("trade export market") == "3", "sample 1"

assert run("biology medicine cells") == "2", "biology words"

assert run("geometry theorem proof") == "1", "math words"

assert run("unknownword unknownword trade") == "3", "smoothing"

assert run("") == "1", "empty input fallback"
Test input Expected output What it validates
trade export market 3 Trade vocabulary classification
biology medicine cells 2 Biological terminology
geometry theorem proof 1 Mathematical terminology
unknownword unknownword trade 3 Laplace smoothing for unseen words
empty input 1 Boundary handling

Edge Cases

A particularly tricky case occurs when the document contains mostly unseen words:

cryptocurrency blockchain token market

Suppose only market exists in the training vocabulary. Without Laplace smoothing, every unseen token would contribute probability zero, making all classes invalid immediately. The smoothed formula instead assigns a tiny but nonzero probability to unknown words. Since class 3 still strongly prefers market, the classifier correctly predicts category 3.

Another subtle case is inconsistent capitalization:

TRADE Market ExPoRt

If training uses lowercase words but prediction preserves original casing, the classifier sees three unknown tokens and loses all useful information. The tokenizer converts everything to lowercase during both phases, so the document still maps correctly to class 3.

Long documents create another hidden failure mode. Consider a huge article with thousands of neutral words and only a few category-specific terms. Raw frequency comparison tends to bias toward classes with larger corpora. Using normalized probabilities prevents this imbalance because every token contributes relative likelihood rather than absolute count.