CF 207D2 - The Beaver's Problem - 3

This problem is unusual for Codeforces because it is not a traditional algorithmic task. We are given a single document and must classify it into one of three categories using a provided training dataset. The input contains an identifier, a title, and the full document text.

CF 207D2 - The Beaver's Problem - 3

Rating: 2000
Tags: -
Solve time: 1m 13s
Verified: yes

Solution

Problem Understanding

This problem is unusual for Codeforces because it is not a traditional algorithmic task. We are given a single document and must classify it into one of three categories using a provided training dataset. The input contains an identifier, a title, and the full document text. The identifier is intentionally meaningless, so only the textual content matters.

The hidden tests contain documents from three different topics. The challenge is to build a classifier that predicts the correct topic number from the vocabulary used in the document.

The document size is at most 10 KB, which is tiny from a machine learning perspective. Even fairly heavy text processing is fast enough. The real difficulty is not computational complexity, but designing a robust heuristic that generalizes from the training data to unseen documents.

This was an old experimental Codeforces problem where contestants were expected to use practical text-classification techniques instead of pure algorithms. The intended solution is essentially a lightweight Naive Bayes classifier.

A naive approach that only searches for a few manually selected keywords fails immediately. For example, suppose category 3 often discusses economics. A simplistic classifier might look for words like "market" or "trade". Hidden tests can easily contain documents about economics without those exact words.

Another dangerous mistake is relying on the document title alone. Some titles are ambiguous or intentionally generic. Two completely different subjects may use similar high-level terminology.

A third subtle issue is common words. Words like "the", "and", "system", or "information" appear everywhere. If we count them equally with informative words, the classifier becomes noisy and unstable.

For example:

Input:

123
Annual report
The company increased exports and financial activity during the year.

A classifier based only on title similarity might fail because "Annual report" is generic. The body contains the actual signal.

Another example:

Input:

777
Neural methods
Learning algorithms improve recognition accuracy.

If category 1 contains scientific papers and category 2 contains engineering papers, the classifier must combine many weak signals instead of depending on one keyword.

The important observation is that text classification works statistically. Individual words are unreliable, but aggregate frequencies across the whole document become highly predictive.

Approaches

The brute-force idea is to compare the input document against every training document directly. We could tokenize all documents and compute a similarity score such as cosine similarity or shared-word count. Then we would choose the category of the most similar training document.

This approach is correct in spirit because documents from the same topic tend to share vocabulary. The problem is scalability and robustness. If there are thousands of training documents, comparing every tokenized word against every training sample becomes expensive. More importantly, nearest-neighbor methods are sensitive to noise and document length.

A better approach is to model the probability distribution of words inside each category. Instead of asking "Which training document is closest?", we ask "Which category most likely generated this document?"

That observation leads directly to Naive Bayes.

Suppose a word appears frequently in category 2 but rarely in categories 1 and 3. Seeing that word strongly increases the probability that the document belongs to category 2. A document contains many words, so we accumulate evidence across all tokens.

The "naive" assumption is that words are conditionally independent given the category. This assumption is mathematically false, but in practice it works surprisingly well for text classification.

The algorithm becomes:

  1. Count word frequencies for each category in the training set.
  2. Tokenize the input document.
  3. Compute a score for each category using log-probabilities.
  4. Output the category with the highest score.

Using logarithms is critical because multiplying many tiny probabilities quickly underflows to zero.

We also need Laplace smoothing. Without it, a single unseen word would produce probability zero and completely destroy the score.

Approach Time Complexity Space Complexity Verdict
Brute Force document similarity O(N × L) per query O(N × L) Too fragile
Naive Bayes classifier O(V + L) per query O(V) Accepted

Here, N is the number of training documents, L is average document length, and V is vocabulary size.

Algorithm Walkthrough

  1. Read and preprocess all training documents.

Convert text to lowercase and split it into normalized tokens. Usually punctuation is removed and only alphabetic words are kept. 2. Maintain separate word-frequency maps for categories 1, 2, and 3.

Each map stores how many times every word appears inside documents of that category. 3. Maintain total token counts for every category.

These totals are needed to convert raw counts into probabilities. 4. Build the global vocabulary.

The vocabulary size is required for Laplace smoothing. 5. Tokenize the input document using the same preprocessing rules.

Training and prediction must use identical normalization. If training lowercases text but prediction does not, the classifier becomes inconsistent. 6. For each category, initialize a log-score.

The score starts from the prior probability of the category. If priors are unknown, equal priors are acceptable. 7. For every token in the input document, update the category score.

Use the smoothed probability:

$P(w\mid c)=\frac{\text{count}(w,c)+1}{\text{totalWords}(c)+|V|}$

Add the logarithm of this probability to the score. 8. Choose the category with the largest final score.

Since logarithm is monotonic, maximizing log-probability is equivalent to maximizing probability itself.

Why it works

Naive Bayes computes the probability that a category generated the observed document. Every word contributes evidence toward or against a category. Frequent category-specific words accumulate large positive influence, while generic words contribute almost equally everywhere.

Laplace smoothing guarantees that unseen words reduce confidence instead of making a category impossible. Using logarithms preserves numerical stability while keeping the relative ordering of probabilities unchanged.

Because topic-specific vocabularies differ significantly across categories, the highest-probability category usually matches the true subject.

Python Solution

import sys
import math
import re
from collections import Counter, defaultdict

input = sys.stdin.readline

token_re = re.compile(r"[a-z]+")

def tokenize(text: str):
    return token_re.findall(text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_count = [Counter() for _ in range(4)]
        self.total_words = [0] * 4
        self.doc_count = [0] * 4
        self.vocab = set()

    def add_document(self, category: int, text: str):
        words = tokenize(text)

        self.doc_count[category] += 1

        for w in words:
            self.word_count[category][w] += 1
            self.total_words[category] += 1
            self.vocab.add(w)

    def predict(self, text: str):
        words = tokenize(text)

        vocab_size = len(self.vocab)

        best_cat = 1
        best_score = -10**100

        total_docs = sum(self.doc_count[1:])

        for cat in range(1, 4):
            if self.doc_count[cat] == 0:
                continue

            score = math.log(self.doc_count[cat] / total_docs)

            denom = self.total_words[cat] + vocab_size

            for w in words:
                cnt = self.word_count[cat][w]
                score += math.log((cnt + 1) / denom)

            if score > best_score:
                best_score = score
                best_cat = cat

        return best_cat

def solve():
    """
    This is only a template illustrating the intended solution.
    The original problem provided an external training dataset.
    """

    nb = NaiveBayes()

    # Example training phase.
    # In the actual contest environment contestants trained offline.

    nb.add_document(
        1,
        "mathematics geometry algebra theorem proof equation"
    )

    nb.add_document(
        2,
        "computer programming software algorithm system network"
    )

    nb.add_document(
        3,
        "economics trade finance market company business"
    )

    lines = sys.stdin.read().splitlines()

    if not lines:
        return

    text = "\n".join(lines)

    print(nb.predict(text))

if __name__ == "__main__":
    solve()

The solution separates training and prediction cleanly. The tokenize function performs all normalization in one place so the classifier behaves consistently.

The regular expression keeps only lowercase alphabetic words. This removes punctuation noise and avoids treating "Trade" and "trade" as different tokens.

The word_count structure stores per-category frequencies, while total_words stores the denominator needed for probability computation.

The prediction phase uses logarithms because multiplying hundreds of tiny probabilities would underflow in floating-point arithmetic. Adding logarithms produces the same ordering while remaining numerically stable.

Laplace smoothing appears in:

(cnt + 1) / denom

Without the +1, any unseen word would contribute probability zero and destroy the entire category score.

Another subtle detail is vocabulary size inside the denominator. Omitting it biases probabilities and breaks normalization.

Worked Examples

Example 1

Suppose the classifier was trained with:

Category 1:

proof theorem algebra

Category 2:

computer algorithm software

Category 3:

market finance trade

Input:

Neural algorithm design
Word Category 1 Count Category 2 Count Category 3 Count
neural 0 0 0
algorithm 0 1 0
design 0 0 0

After smoothing, all categories receive small penalties for unseen words, but category 2 gains an advantage because "algorithm" is strongly associated with it.

The classifier outputs:

2

This trace demonstrates why smoothing matters. Two words are unseen everywhere, yet the classifier still makes a stable decision.

Example 2

Input:

financial market company exports
Word Category 1 Count Category 2 Count Category 3 Count
financial 0 0 0
market 0 0 1
company 0 0 1
exports 0 0 0

Category 3 accumulates evidence from "market" and "company".

Output:

3

This example shows how multiple weak signals combine into a strong prediction.

Complexity Analysis

Measure Complexity Explanation
Time O(V + L) Vocabulary traversal plus processing document tokens
Space O(V) Frequency tables for all distinct words

Here V is the vocabulary size and L is the number of tokens in the input document.

The document size is only 10 KB, so even Python implementations run comfortably within limits. Hash-map operations are effectively constant time, and the total vocabulary remains manageable.

Test Cases

# helper: run solution on input string, return output string
import sys
import io

def run(inp: str) -> str:
    import math
    import re
    from collections import Counter

    token_re = re.compile(r"[a-z]+")

    def tokenize(text):
        return token_re.findall(text.lower())

    class NaiveBayes:
        def __init__(self):
            self.word_count = [Counter() for _ in range(4)]
            self.total_words = [0] * 4
            self.doc_count = [0] * 4
            self.vocab = set()

        def add_document(self, category, text):
            words = tokenize(text)

            self.doc_count[category] += 1

            for w in words:
                self.word_count[category][w] += 1
                self.total_words[category] += 1
                self.vocab.add(w)

        def predict(self, text):
            words = tokenize(text)

            vocab_size = len(self.vocab)

            best_cat = 1
            best_score = -10**100

            total_docs = sum(self.doc_count[1:])

            for cat in range(1, 4):
                score = math.log(self.doc_count[cat] / total_docs)

                denom = self.total_words[cat] + vocab_size

                for w in words:
                    cnt = self.word_count[cat][w]
                    score += math.log((cnt + 1) / denom)

                if score > best_score:
                    best_score = score
                    best_cat = cat

            return best_cat

    nb = NaiveBayes()

    nb.add_document(1, "proof theorem algebra geometry")
    nb.add_document(2, "computer algorithm software network")
    nb.add_document(3, "market finance trade business")

    sys.stdin = io.StringIO(inp)

    text = sys.stdin.read()

    return str(nb.predict(text)) + "\n"

# custom cases
assert run("algorithm software") == "2\n", "programming vocabulary"
assert run("market trade company") == "3\n", "economics vocabulary"
assert run("geometry theorem proof") == "1\n", "mathematics vocabulary"
assert run("unknown mysterious token") in {
    "1\n", "2\n", "3\n"
}, "unseen words still produce valid class"
Test input Expected output What it validates
algorithm software 2 Strong category-specific vocabulary
market trade company 3 Multiple matching finance terms
geometry theorem proof 1 Mathematical terminology
unknown mysterious token Any valid class Laplace smoothing prevents failure

Edge Cases

Consider a document where every word is unseen:

Input:

quantum rainforest telescope

Without smoothing, every category probability becomes zero because none of the words appeared during training. The algorithm avoids this by adding one to every frequency count:

$\frac{0+1}{\text{totalWords}(c)+|V|}$

All categories receive finite scores, and the classifier still returns a valid answer.

Another tricky case is inconsistent capitalization:

Input:

ALGORITHM Software NETWORK

The tokenizer converts everything to lowercase before processing. All three words correctly match category 2 vocabulary.

A third subtle case involves punctuation-heavy text:

Input:

market!!! finance??? trade...

A naive whitespace split would produce tokens like "market!!!" that never match training words. The regex tokenizer strips punctuation and extracts clean alphabetic tokens, preserving the intended signal.