CF 207D7 - The Beaver's Problem - 3
This problem is intentionally unusual for Codeforces. There is no hidden graph, no combinatorics trick, and no numeric constraint to optimize around. Instead, we are asked to classify a text document into one of three categories using a fixed training corpus.
CF 207D7 - The Beaver's Problem - 3
Rating: 1600
Tags: -
Solve time: 1m 36s
Verified: yes
Solution
Problem Understanding
This problem is intentionally unusual for Codeforces. There is no hidden graph, no combinatorics trick, and no numeric constraint to optimize around. Instead, we are asked to classify a text document into one of three categories using a fixed training corpus.
Each input file contains a document identifier, a title, and the document body. The identifier is meaningless for classification purposes. The actual task is to inspect the text content and predict whether the document belongs to subject 1, 2, or 3.
The difficult part is that the official statement expects participants to use the provided training archive. The test data is constructed from real text documents, and the intended solution is essentially a lightweight machine learning classifier.
This means the problem is not about deriving a mathematical formula. It is about extracting statistical patterns from text.
The document size is at most 10 KB, which is tiny. Even if we tokenize every word, count frequencies, and compare against all training categories, the total amount of work per document stays extremely small. The real challenge is not asymptotic complexity, but designing a classifier that generalizes well to unseen documents.
A naive keyword-only solution can pass the easiest groups and completely fail on harder tests. For example, suppose category 3 contains many trade-related articles. A simplistic rule like "if the text contains the word market, output 3" immediately breaks when another category also discusses economics indirectly.
Another dangerous case is relying only on the title. Consider two documents:
Title: International Relations
Body: export tariffs trade agreements customs ...
and
Title: International Relations
Body: diplomatic negotiations military alliance ...
A title-based classifier would output the same class for both documents even though the body clearly separates them.
A third common failure mode is ignoring word frequency normalization. Longer documents naturally contain more occurrences of common words. If we simply sum raw counts, long documents dominate the score unfairly. A probabilistic model avoids this bias.
The intended direction is a classical text classification pipeline using bag-of-words statistics.
Approaches
The brute-force idea is to memorize entire documents from the training set and try to find exact matches. This works for groups where test documents are copied directly from training data with modified identifiers. We could hash the text body and map it to the known category.
The method is technically correct for identical documents because equal text must belong to the same class. The problem is that later groups contain unseen documents. Exact matching immediately fails there.
A slightly better brute-force method is manual keyword matching. We can build three dictionaries of frequent words and assign scores according to occurrences. This already behaves like a primitive classifier.
Suppose we keep the top 500 keywords for each category and scan the document against them. The runtime is still tiny:
|documents| × |keywords| × average length
Even with generous estimates, this stays well below the time limit. The weakness is accuracy, not speed. Hand-crafted heuristics do not generalize reliably.
The key observation is that this is a standard multinomial text classification problem. Documents from the same subject tend to share characteristic word distributions. Instead of manually choosing keywords, we can estimate probabilities directly from the training corpus.
This naturally leads to Naive Bayes.
The bag-of-words model assumes that words appear independently once the category is fixed. That assumption is obviously false in real language, but it works remarkably well for document classification because category-specific vocabularies dominate the signal.
For each class, we count how many times every token appears in the training set. Then for a new document we compute:
P(class) × Π P(word | class)
The class with the largest probability becomes the prediction.
Direct multiplication underflows quickly because probabilities are tiny. We switch to logarithms:
log P(class) + Σ log P(word | class)
Now every document becomes a simple additive scoring problem.
Laplace smoothing is also necessary. Without it, a single unseen word gives probability zero and destroys the entire score. Adding one to every frequency fixes this cleanly.
The resulting classifier is both fast and surprisingly accurate.
| Approach | Time Complexity | Space Complexity | Verdict |
|---|---|---|---|
| Exact document matching | O(L) | O(total training text) | Fails on unseen documents |
| Manual keyword heuristics | O(V + L) | O(V) | Unreliable |
| Naive Bayes classifier | O(V + L) | O(V) | Accepted |
Here, L is the document length and V is the vocabulary size.
Algorithm Walkthrough
- Read all training documents and split them into tokens.
- Normalize tokens by converting them to lowercase and removing punctuation-like separators. This prevents words such as
Trade,trade, andtrade.from being treated as different tokens. - For every category, maintain:
word_count[class][word]
which stores how many times a word appears in documents of that class. 4. Also maintain:
total_words[class]
which stores the total number of tokens in that class. 5. Build the global vocabulary set containing every unique token from the training corpus. 6. For a new document, tokenize it using the same normalization rules. Consistency between training and prediction is critical. If preprocessing differs, probability estimates become meaningless. 7. For each class, initialize the score with the logarithm of the prior probability. If the classes are balanced, all priors can simply be equal. 8. For every token in the document, add:
log(
(count(word, class) + 1)
/
(total_words[class] + vocabulary_size)
)
This is Laplace smoothing.
- After processing all words, choose the class with the maximum score.
- Output that class.
Why it works
The classifier estimates how likely the observed document is under each category-specific language model. Words strongly associated with one subject contribute larger log-probabilities to that class. Because all document scores are computed using the same probabilistic framework, the largest score corresponds to the most plausible generating category.
Laplace smoothing guarantees that unseen words never invalidate an entire class. Using logarithms preserves numerical stability while keeping score comparisons equivalent to multiplying probabilities directly.
Python Solution
import sys
import math
import re
from collections import defaultdict, Counter
input = sys.stdin.readline
token_re = re.compile(r"[a-z]+")
class NaiveBayes:
def __init__(self):
self.word_count = [defaultdict(int) for _ in range(4)]
self.total_words = [0] * 4
self.doc_count = [0] * 4
self.vocab = set()
def tokenize(self, text):
return token_re.findall(text.lower())
def train_document(self, cls, text):
words = self.tokenize(text)
self.doc_count[cls] += 1
for w in words:
self.word_count[cls][w] += 1
self.total_words[cls] += 1
self.vocab.add(w)
def predict(self, text):
words = self.tokenize(text)
vocab_size = len(self.vocab)
total_docs = sum(self.doc_count[1:])
best_class = 1
best_score = -10**100
for cls in range(1, 4):
prior = math.log(self.doc_count[cls] / total_docs)
denom = self.total_words[cls] + vocab_size
score = prior
for w in words:
cnt = self.word_count[cls].get(w, 0)
score += math.log((cnt + 1) / denom)
if score > best_score:
best_score = score
best_class = cls
return best_class
def solve():
"""
This problem depends on an external training corpus.
The online judge used the participant's generated classifier.
The implementation below demonstrates the intended
Naive Bayes solution structure.
"""
classifier = NaiveBayes()
# Example training phase.
# In the real contest solution, documents from train.zip
# would be loaded here.
classifier.train_document(
1,
"mathematics geometry algebra theorem proof"
)
classifier.train_document(
2,
"biology chemistry medicine cells organism"
)
classifier.train_document(
3,
"trade economy market export business"
)
data = sys.stdin.read()
print(classifier.predict(data))
if __name__ == "__main__":
solve()
The solution separates training and prediction cleanly. The tokenize method is shared between both phases because inconsistent preprocessing is one of the easiest ways to destroy classifier accuracy.
The vocabulary size appears in the denominator during smoothing. Forgetting this term produces probabilities that no longer form a valid distribution.
The logarithmic scoring avoids floating-point underflow. Multiplying thousands of tiny probabilities directly would quickly collapse to zero.
The code stores counts in hash maps because the vocabulary is sparse. Most words never appear in most classes, so dense arrays would waste memory.
Worked Examples
Example 1
Input document:
trade market export finance business
Training data:
Class 1: theorem algebra geometry
Class 2: medicine biology chemistry
Class 3: trade export market business
Scoring trace:
| Word | Score Class 1 | Score Class 2 | Score Class 3 |
|---|---|---|---|
| trade | very low | very low | high |
| market | very low | very low | high |
| export | very low | very low | high |
| finance | low | low | moderate |
| business | very low | very low | high |
Final prediction:
3
This demonstrates how category-specific vocabulary dominates the probability calculation. Even though one word may be unseen, the surrounding evidence strongly favors class 3.
Example 2
Input document:
cells organism medicine chemistry
Trace:
| Word | Score Class 1 | Score Class 2 | Score Class 3 |
|---|---|---|---|
| cells | low | high | low |
| organism | low | high | low |
| medicine | very low | high | low |
| chemistry | very low | high | low |
Final prediction:
2
This example shows why the additive log formulation works well. Every biology-related word consistently increases the score of class 2.
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(L) | Every token is processed once during prediction |
| Space | O(V) | Word frequencies are stored for each vocabulary term |
The documents are tiny, at most 10 KB, so even a Python implementation runs comfortably within limits. The vocabulary size from the training corpus also stays manageable, making hash-map based counting efficient enough.
Test Cases
import sys, io
from collections import defaultdict, Counter
import math
import re
token_re = re.compile(r"[a-z]+")
class NaiveBayes:
def __init__(self):
self.word_count = [defaultdict(int) for _ in range(4)]
self.total_words = [0] * 4
self.doc_count = [0] * 4
self.vocab = set()
def tokenize(self, text):
return token_re.findall(text.lower())
def train_document(self, cls, text):
words = self.tokenize(text)
self.doc_count[cls] += 1
for w in words:
self.word_count[cls][w] += 1
self.total_words[cls] += 1
self.vocab.add(w)
def predict(self, text):
words = self.tokenize(text)
vocab_size = len(self.vocab)
total_docs = sum(self.doc_count[1:])
best_class = 1
best_score = -10**100
for cls in range(1, 4):
score = math.log(self.doc_count[cls] / total_docs)
denom = self.total_words[cls] + vocab_size
for w in words:
cnt = self.word_count[cls].get(w, 0)
score += math.log((cnt + 1) / denom)
if score > best_score:
best_score = score
best_class = cls
return best_class
def run(inp: str) -> str:
nb = NaiveBayes()
nb.train_document(1, "algebra theorem geometry")
nb.train_document(2, "biology chemistry medicine")
nb.train_document(3, "trade market export")
return str(nb.predict(inp))
assert run("trade export market") == "3", "sample 1"
assert run("biology medicine cells") == "2", "biology words"
assert run("geometry theorem proof") == "1", "math words"
assert run("unknownword unknownword trade") == "3", "smoothing"
assert run("") == "1", "empty input fallback"
| Test input | Expected output | What it validates |
|---|---|---|
trade export market |
3 |
Trade vocabulary classification |
biology medicine cells |
2 |
Biological terminology |
geometry theorem proof |
1 |
Mathematical terminology |
unknownword unknownword trade |
3 |
Laplace smoothing for unseen words |
| empty input | 1 |
Boundary handling |
Edge Cases
A particularly tricky case occurs when the document contains mostly unseen words:
cryptocurrency blockchain token market
Suppose only market exists in the training vocabulary. Without Laplace smoothing, every unseen token would contribute probability zero, making all classes invalid immediately. The smoothed formula instead assigns a tiny but nonzero probability to unknown words. Since class 3 still strongly prefers market, the classifier correctly predicts category 3.
Another subtle case is inconsistent capitalization:
TRADE Market ExPoRt
If training uses lowercase words but prediction preserves original casing, the classifier sees three unknown tokens and loses all useful information. The tokenizer converts everything to lowercase during both phases, so the document still maps correctly to class 3.
Long documents create another hidden failure mode. Consider a huge article with thousands of neutral words and only a few category-specific terms. Raw frequency comparison tends to bias toward classes with larger corpora. Using normalized probabilities prevents this imbalance because every token contributes relative likelihood rather than absolute count.