Empty file added -collection
Empty file.
Empty file added -generator
Empty file.
Empty file added -index
Empty file.
Empty file added -input
Empty file.
Empty file added -threads
Empty file.
46 changes: 22 additions & 24 deletions docs/experiments-msmarco-passage.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,21 @@
# Anserini: BM25 Baselines for MS MARCO Passage Ranking

This page contains instructions for running BM25 baselines on the [MS MARCO *passage* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *document* ranking task](experiments-msmarco-doc.md).
This page contains instructions for running BM25 baselines on the [MS MARCO _passage_ ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO _document_ ranking task](experiments-msmarco-doc.md).
This exercise will require a machine with >8 GB RAM and >15 GB free disk space.
If you're using a Windows machine, equivalent commands are provided alongside the Unix-like (Linux/macOS) commands.

If you're a Waterloo student traversing the [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), [start here](start-here.md
).
If you're a Waterloo student traversing the [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), [start here](start-here.md).
In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell;
that's what I call [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming).
Instead, really try to understand what's going on.


**Learning outcomes** for this guide, building on previous steps in the onboarding path:

+ Be able to use Anserini to build a Lucene inverted index on the MS MARCO passage collection.
+ Be able to use Anserini to perform a batch retrieval run on the MS MARCO passage collection with the dev queries.
+ Be able to evaluate the retrieved results above.
+ Understand the MRR metric.
- Be able to use Anserini to build a Lucene inverted index on the MS MARCO passage collection.
- Be able to use Anserini to perform a batch retrieval run on the MS MARCO passage collection with the dev queries.
- Be able to evaluate the retrieved results above.
- Understand the MRR metric.

What's Anserini?
Well, it's the repo that you're in right now.
Expand All @@ -33,8 +31,7 @@ That is, most things done with Anserini can be "translated" into Elasticsearch q
## Data Prep

In this guide, we're just going through the mechanical steps of data prep.
To better understand what you're actually doing, go through the [start here](start-here.md
) guide.
To better understand what you're actually doing, go through the [start here](start-here.md) guide.
The guide contains the exact same instructions, but provides more detailed explanations.

We're going to use the repository's root directory as the working directory.
Expand Down Expand Up @@ -80,8 +77,8 @@ The output queries file `collections/msmarco-passage/queries.dev.small.tsv` shou

In building a retrieval system, there are generally two phases:

+ In the **indexing** phase, an indexer takes the document collection (i.e., corpus) and builds an index, which is a data structure that supports efficient retrieval.
+ In the **retrieval** (or **search**) phase, the retrieval system returns a ranked list given a query _q_, with the aid of the index constructed in the previous phase.
- In the **indexing** phase, an indexer takes the document collection (i.e., corpus) and builds an index, which is a data structure that supports efficient retrieval.
- In the **retrieval** (or **search**) phase, the retrieval system returns a ranked list given a query _q_, with the aid of the index constructed in the previous phase.

(There's also a training phase when we start to discuss models that _learn_ from data, but we're not there yet.)
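The two phases can be sketched with a toy in-memory example. This is purely illustrative: Anserini builds a real Lucene inverted index with positions, doc vectors, and BM25 scoring, not the bare dictionary and term-overlap ranking shown here.

```python
from collections import defaultdict

# Toy corpus: doc_id -> text (a stand-in for the MS MARCO passages).
corpus = {
    "d1": "what is the capital of france",
    "d2": "paris is the capital of france",
    "d3": "bm25 is a ranking function",
}

# Indexing phase: map each term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        index[term].add(doc_id)

# Retrieval phase: rank documents by how many query terms they contain
# (a real system would use a weighted score like BM25 instead).
def search(query):
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

print(search("capital of france"))  # ['d1', 'd2']
```

The point of the indexing phase is exactly this precomputation: at query time we only touch the postings for the query terms, never the whole collection.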

Expand All @@ -102,13 +99,13 @@ bin/run.sh io.anserini.index.IndexCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 9 -storePositions -storeDocvectors -storeRaw
```

For Windows:

```bash
bin\run.bat io.anserini.index.IndexCollection -collection JsonCollection -input collections\msmarco-passage\collection_jsonl -index indexes\msmarco-passage\lucene-index-msmarco -generator DefaultLuceneDocumentGenerator -threads 9 -storePositions -storeDocvectors -storeRaw
```



In this case, Lucene creates what is known as an **inverted index**.

Upon completion, we should have an index with 8,841,823 documents.
Expand All @@ -117,7 +114,6 @@ On the new MacBook Pro M3 Laptop, if you only have 8GB memory, you might encount
finishes. This is likely caused by the JVM allocating more memory than is available on the system, causing excessive memory swapping without active
garbage collection. To mitigate this issue, you may need to modify `run.sh` to change the `-Xms` option to 2GB and `-Xmx` to 6GB.


## Retrieval

In the above step, we've built the inverted index.
Expand All @@ -132,7 +128,9 @@ bin/run.sh io.anserini.search.SearchCollection \
-parallelism 4 \
-bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000
```

For Windows:

```bash
bin\run.bat io.anserini.search.SearchCollection -index indexes\msmarco-passage\lucene-index-msmarco -topics collections\msmarco-passage\queries.dev.small.tsv -topicReader TsvInt -output runs\run.msmarco-passage.dev.small.tsv -format msmarco -parallelism 4 -bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000
```
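If you want to poke at the run file programmatically, a minimal reader looks like the sketch below. It assumes the `-format msmarco` output is tab-separated lines of query id, passage id, and rank; check a few lines of your own run file to confirm the layout before relying on it.

```python
# Sketch: read a run file assumed to contain tab-separated
# <query_id> <doc_id> <rank> lines, one hit per line.
def read_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, docid, rank = line.strip().split("\t")
            # Hits for a query are assumed to appear in rank order.
            run.setdefault(qid, []).append((docid, int(rank)))
    return run
```

With the ranked lists in a plain dict, the sanity checks and metrics below are easy to reproduce by hand.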
Expand Down Expand Up @@ -191,8 +189,7 @@ $ grep 7187158 collections/msmarco-passage/collection.tsv
In this case, the document (hit) seems relevant.
That is, it contains information that addresses the information need.
So here, the retrieval system "did well".
Remember that this document was indeed marked relevant in the qrels, as we saw in the [start here](start-here.md
) guide.
Remember that this document was indeed marked relevant in the qrels, as we saw in the [start here](start-here.md) guide.

As an additional sanity check, run the following:

Expand Down Expand Up @@ -224,8 +221,7 @@ QueriesRanked: 6980

(Yea, the number of digits of precision is a bit... excessive)

Remember from the [start here](start-here.md
) guide that with relevance judgments (qrels), we can automatically evaluate the retrieval system output (i.e., the run).
Remember from the [start here](start-here.md) guide that with relevance judgments (qrels), we can automatically evaluate the retrieval system output (i.e., the run).

The final ingredient is a metric, i.e., how to quantify the "quality" of a ranked list.
Here, we're using a metric called MRR, or mean reciprocal rank.
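The metric is simple enough to compute by hand. In the sketch below, the `run` and `qrels` dicts are hypothetical stand-ins for the real run and qrels files, just to show the arithmetic:

```python
# MRR@10: for each query, take the reciprocal of the rank of the first
# relevant document in the top 10 (0 if none appears), then average
# over all queries.
def mrr_at_10(run, qrels):
    total = 0.0
    for qid, ranked_docs in run.items():
        for rank, docid in enumerate(ranked_docs[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

run = {"q1": ["d3", "d7", "d2"], "q2": ["d5", "d9"]}
qrels = {"q1": {"d7"}, "q2": {"d1"}}
print(mrr_at_10(run, qrels))  # (1/2 + 0) / 2 = 0.25
```

Note how unforgiving the metric is: a query whose first relevant passage lands at rank 11 contributes exactly zero.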
Expand Down Expand Up @@ -329,22 +325,23 @@ There are five different sets of 10k samples (using the `shuf` command).
We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization).
In separate trials, we optimized for:

+ recall@1000, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
+ MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).
- recall@1000, since Anserini output serves as input to downstream rerankers (e.g., based on BERT), and we want to maximize the number of relevant documents the rerankers have to work with;
- MRR@10, for the case where Anserini output is directly presented to users (i.e., no downstream reranking).

It turns out that optimizing for MRR@10 and MAP yields the same settings.

Here's the comparison between the Anserini default and optimized parameters:

| Setting | MRR@10 | MAP | Recall@1000 |
|:------------------------------------------------|-------:|-------:|------------:|
| :---------------------------------------------- | -----: | -----: | ----------: |
| Default (`k1=0.9`, `b=0.4`) | 0.1840 | 0.1926 | 0.8526 |
| Optimized for recall@1000 (`k1=0.82`, `b=0.68`) | 0.1874 | 0.1957 | 0.8573 |
| Optimized for MRR@10/MAP (`k1=0.60`, `b=0.62`) | 0.1892 | 0.1972 | 0.8555 |

As mentioned above, the BM25 run with `k1=0.82`, `b=0.68` corresponds to the entry "BM25 (Lucene8, tuned)" dated 2019/06/26 on the [MS MARCO Passage Ranking Leaderboard](https://microsoft.github.io/MSMARCO-Passage-Ranking-Submissions/leaderboard/).
The BM25 run with default parameters `k1=0.9`, `b=0.4` roughly corresponds to the entry "BM25 (Anserini)" dated 2019/04/10 (but Anserini was using Lucene 7.6 at the time).
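To see what the tuned parameters actually control, here is the textbook BM25 term weight as a sketch. Lucene's implementation differs in details (its IDF formulation and its encoding of document lengths are not identical to this), so treat it as an illustration of the roles of `k1` and `b`, not as Anserini's exact scoring code:

```python
import math

# Textbook BM25 weight of one query term in one document.
#   tf: term frequency in the document   df: document frequency of the term
#   N: collection size   dl: document length   avgdl: average document length
def bm25_term_weight(tf, df, N, dl, avgdl, k1=0.82, b=0.68):
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    # b controls how strongly scores are normalized by document length;
    # k1 controls how quickly repeated occurrences of a term saturate.
    norm = k1 * (1 - b + b * dl / avgdl)
    return idf * tf * (k1 + 1) / (tf + norm)
```

With `b=0`, document length is ignored entirely; with larger `k1`, additional occurrences of a term keep increasing the score for longer before saturating. The tuning above is just a search over this two-dimensional space.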


## Reproduction Log[*](reproducibility.md)

+ Results reproduced by [@ronakice](https://github.com/ronakice) on 2019-08-12 (commit [`5b29d16`](https://github.com/castorini/anserini/commit/5b29d1654abc5e8a014c2230da990ab2f91fb340))
Expand Down Expand Up @@ -542,5 +539,6 @@ The BM25 run with default parameters `k1=0.9`, `b=0.4` roughly corresponds to th
+ Results reproduced by [@sherloc512](https://github.com/sherloc512) on 2024-12-04 (commit [`9e55b1c`](https://github.com/castorini/anserini/commit/9e55b1c97fced46530dac1f78975d19635ffaf7a))
+ Results reproduced by [@zdann15](https://github.com/zdann15) on 2024-12-04 (commit [`9d311b4`](https://github.com/castorini/anserini/commit/9d311b4409a9ff3d79b01910178eaec3931f0abe))
+ Results reproduced by [@Alireza-Zwolf](https://github.com/Alireza-Zwolf) on 2024-12-15 (commit [`c7dff5f`](https://github.com/castorini/anserini/commit/c7dff5f8417905612ad9f97e85012440e9e16087))
+ Results reproduced by [@Linsen-gao-457](https://github.com/Linsen-gao-457) on 2024-12-17 (commit [a86484a6](https://github.com/castorini/anserini/commit/a86484a6e99a7a97966c423d230ad05279b24508))
+ Results reproduced by [@Linsen-gao-457](https://github.com/Linsen-gao-457) on 2024-12-17 (commit [`a86484a`](https://github.com/castorini/anserini/commit/a86484a6e99a7a97966c423d230ad05279b24508))
+ Results reproduced by [@vincent-4](https://github.com/vincent-4) on 2024-12-20 (commit [`c619dc8`](https://github.com/castorini/anserini/commit/c619dc8d9ab28298251964053a927906e9957f51))

12 changes: 6 additions & 6 deletions docs/regressions/regressions-backgroundlinking18.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,21 +45,21 @@ After indexing has completed, you should be able to perform retrieval as follows
```
bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v2/ \
-topics tools/topics-and-qrels/topics.backgroundlinking18.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking18.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v2.bm25.topics.backgroundlinking18.txt \
-backgroundLinking -backgroundLinking.k 100 -bm25 -hits 100 &

bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v2/ \
-topics tools/topics-and-qrels/topics.backgroundlinking18.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking18.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking18.txt \
-backgroundLinking -backgroundLinking.k 100 -bm25 -rm3 -hits 100 &

bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v2/ \
-topics tools/topics-and-qrels/topics.backgroundlinking18.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking18.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking18.txt \
-backgroundLinking -backgroundLinking.dateFilter -backgroundLinking.k 100 -bm25 -rm3 -hits 100 &
Expand All @@ -68,11 +68,11 @@ bin/run.sh io.anserini.search.SearchCollection \
Evaluation can be performed using `trec_eval`:

```
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking18.txt runs/run.wapo.v2.bm25.topics.backgroundlinking18.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking18.txt runs/run.wapo.v2.bm25.topics.backgroundlinking18.txt

bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking18.txt runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking18.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking18.txt runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking18.txt

bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking18.txt runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking18.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking18.txt runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking18.txt
```

## Effectiveness
Expand Down
12 changes: 6 additions & 6 deletions docs/regressions/regressions-backgroundlinking19.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,21 +45,21 @@ After indexing has completed, you should be able to perform retrieval as follows
```
bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v2/ \
-topics tools/topics-and-qrels/topics.backgroundlinking19.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking19.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v2.bm25.topics.backgroundlinking19.txt \
-backgroundLinking -backgroundLinking.k 100 -bm25 -hits 100 &

bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v2/ \
-topics tools/topics-and-qrels/topics.backgroundlinking19.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking19.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking19.txt \
-backgroundLinking -backgroundLinking.k 100 -bm25 -rm3 -hits 100 &

bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v2/ \
-topics tools/topics-and-qrels/topics.backgroundlinking19.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking19.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking19.txt \
-backgroundLinking -backgroundLinking.dateFilter -backgroundLinking.k 100 -bm25 -rm3 -hits 100 &
Expand All @@ -68,11 +68,11 @@ bin/run.sh io.anserini.search.SearchCollection \
Evaluation can be performed using `trec_eval`:

```
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking19.txt runs/run.wapo.v2.bm25.topics.backgroundlinking19.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking19.txt runs/run.wapo.v2.bm25.topics.backgroundlinking19.txt

bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking19.txt runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking19.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking19.txt runs/run.wapo.v2.bm25+rm3.topics.backgroundlinking19.txt

bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking19.txt runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking19.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking19.txt runs/run.wapo.v2.bm25+rm3+df.topics.backgroundlinking19.txt
```

## Effectiveness
Expand Down
12 changes: 6 additions & 6 deletions docs/regressions/regressions-backgroundlinking20.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,21 +45,21 @@ After indexing has completed, you should be able to perform retrieval as follows
```
bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v3/ \
-topics tools/topics-and-qrels/topics.backgroundlinking20.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking20.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v3.bm25.topics.backgroundlinking20.txt \
-backgroundLinking -backgroundLinking.k 100 -bm25 -hits 100 &

bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v3/ \
-topics tools/topics-and-qrels/topics.backgroundlinking20.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking20.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v3.bm25+rm3.topics.backgroundlinking20.txt \
-backgroundLinking -backgroundLinking.k 100 -bm25 -rm3 -hits 100 &

bin/run.sh io.anserini.search.SearchCollection \
-index indexes/lucene-index.wapo.v3/ \
-topics tools/topics-and-qrels/topics.backgroundlinking20.txt \
-topics tools\topics-and-qrels\topics.backgroundlinking20.txt \
-topicReader BackgroundLinking \
-output runs/run.wapo.v3.bm25+rm3+df.topics.backgroundlinking20.txt \
-backgroundLinking -backgroundLinking.dateFilter -backgroundLinking.k 100 -bm25 -rm3 -hits 100 &
Expand All @@ -68,11 +68,11 @@ bin/run.sh io.anserini.search.SearchCollection \
Evaluation can be performed using `trec_eval`:

```
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking20.txt runs/run.wapo.v3.bm25.topics.backgroundlinking20.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking20.txt runs/run.wapo.v3.bm25.topics.backgroundlinking20.txt

bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking20.txt runs/run.wapo.v3.bm25+rm3.topics.backgroundlinking20.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking20.txt runs/run.wapo.v3.bm25+rm3.topics.backgroundlinking20.txt

bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools/topics-and-qrels/qrels.backgroundlinking20.txt runs/run.wapo.v3.bm25+rm3+df.topics.backgroundlinking20.txt
bin/trec_eval -c -M1000 -m map -c -M1000 -m ndcg_cut.5 tools\topics-and-qrels\qrels.backgroundlinking20.txt runs/run.wapo.v3.bm25+rm3+df.topics.backgroundlinking20.txt
```

## Effectiveness
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ After indexing has completed, you should be able to perform retrieval as follows
```
bin/run.sh io.anserini.search.SearchFlatDenseVectors \
-index indexes/lucene-flat-int8.beir-v1.0.0-arguana.bge-base-en-v1.5/ \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.gz \
-topics tools\topics-and-qrels\topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.gz \
-topicReader JsonStringVector \
-output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt \
-hits 1000 -removeQuery -threads 16 &
Expand All @@ -62,9 +62,9 @@ bin/run.sh io.anserini.search.SearchFlatDenseVectors \
Evaluation can be performed using `trec_eval`:

```
bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m ndcg_cut.10 tools\topics-and-qrels\qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.100 tools\topics-and-qrels\qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.1000 tools\topics-and-qrels\qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-flat-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
```

## Effectiveness
Expand Down