Index Construction¶
Text Preprocessing¶
How to construct the index?
- Depends on: what type of search (and: analytics) do we want to support?
For now: think about Boolean queries over single search terms:
- (Corona OR Covid) AND vaccination
From Byte Stream to Index Terms¶
Tokenization¶
Initial heuristic for English (or Danish, German, ...):
- split on whitespace
- delete punctuation characters
But:
-
some things should perhaps not be split:
-
May also need to split where there is no white space:
No golden bullet! Lots of language or domain-specific rules and heuristics.
Normalization¶
Transform distinct ’equivalent’ tokens into one normalized form. E.g.:
- write all in lower case:
This
\tothis
- use non-hyphenated forms:
anti-discriminatory
\toantidiscriminatory
- delete periods:
U.S.A
\toUSA
(orusa
)
Have to balance:
- more normalization:
- smaller index
- more permissive search: user searching for ’U.S.A’ also receives results containing ’usa’
- less normalization:
- supports more specific search: users searching for ’C.A.T.’ don’t receive results for ’cat’. (try Google vs. Bing on this one!)
Stop word removal¶
Stop words: very frequent words that are not semantically descriptive:
- the, a, this, and, of, that, ...
Stop list: list of stop words that are removed. E.g. containing 15-200 stop words.
Problem: stop words may become important as part of an expression:
- “To be or not to be”
Stemming¶
Similar to normalization: reduce different grammatical variants to their common underlying word “stem”:
Inflectional Forms:
- learn, learns, learned \to learn
Word types:
- organize, organizer, organization \to organ
but also
- organ, organizer, organic \to organ
Term
The strings that result from stemming are the terms that will be included in the index
Porters Stemming Algorithm¶
Set of rules for iteratively reducing/removing suffixes of words. E.g.
Only the most specific (longest suffix) rule is applied. Other applicable rules are ignored
- rule SS \to SS is not redundant!
Rules may be conditioned on what comes before the suffix: $$ (m>1)\ \text{EMENT} \to $$ only applies if the part before suffix EMENT has more than 1 syllable
Putting it all together¶
Putting things together, a text is trandformed into a sequence of terms:
Corpus and Vocabulary
Corpus: collection of documents (e.g., all web-pages that have been crawled)
Vocabulary: all terms that appear in the corpus
- We now want to build an index that allows to retrieve documents in the corpus according to (search-) terms in the vocabulary.
Inverted Index¶
Term-document incidence matrix¶
The occurrence/absence of terms in documents can be represented by a term-document incidence matrix:
0: term does not appear in document
1: term appears in document
Inverted Index¶
Putting the terms in the rows, and reducing incidence (bit-)vectors to sorted lists of postings (= IDs or references of documents):
- This is really just an index; nothing “inverted” about it
- This is only a schematic picture; the actual implementation will be a dictionary data structure with terms as keys
Boolean Retrieval¶
Simple Boolean Queries¶
Single Word Search
Search "Education"
- Apply normalization and stemming: Education \to educ
- Retrieve postings for key “educ” from inverted index:
- 2, 15, 7529, 43875, 884903
Boolean AND
Search: “Education AND Book”
- Retrieve postings for “educ” and “book”
- Construct intersection using the merge algorithm
Merge Algorithm¶
Other Boolean Operators¶
OR
term_1 OR term_2
Variation of merge algorithm:
- append all postings from two lists to output, without duplicates (very similar to merge step in merge sort).
NOT
NOT term
In principle simple:
- construct sorted list of document keys that are not in the list of term. But: output intractably large
AND NOT
term_1 AND NOT term_2
Variation of merge algorithm:
- append to output all postings from term_1 list that are not found in term_2 list
Phrase Queries¶
Most users don’t want to bother with properly constructed Boolean queries. More popular are phrase queries:
"Graduation ceremony at Aalborg University"
After pre-processing (but no stopword removal!):
"graduat ceremoni at aalborg univers"
N-gram Index¶
- Create dictionary over N-grams of terms instead of dictionary over terms (e.g. N=2 or N=3)
- Search for
- graduat ceremoni AND ceremoni at AND at aalborg AND aalborg univers
- Can give false positives
Positional Index¶
Augment the inverted index with
- for each term: {\color{blue}{\text{document frequency}}} (=length of posting list for the term)
- for each posting: {\color{red}{\text{number}}} and {\color{green}{\text{list}}} of positions at which term appears in document
Size
- T: total number of terms (stop words removed) in the corpus
- K: average number of occurrences of a given term in a document
Then:
- space for positional index: \Theta(T)
- space for non-positional index: \Theta(T/K)
In practice: positional index has 30% to 50% the size of the raw corpus
Answering Phrase Queries¶
Query: graduat ceremoni at
Step 1
retrieve documents for Boolean query graduat AND ceremoni AND at (can be optimized by merging posting lists with low document frequency count first)
- obtain for each retrieved document the position lists (e.g. document 5):
Step 2
apply variation of merge algorithm to find consecutive occurrences of the terms
- Can also be adjusted to deal with proximity queries:
The two terms are required to occur within 5 positions of each other (any order)