
Conversation

@JJGreen0
Contributor

@JJGreen0 JJGreen0 commented Aug 21, 2025

Speed Comparison:

- [HashMap] Adds: 1,000,000 ops in 0.265 s (3.8 Mops/s)
- [HashMap] L2 scale: n=198,675 in 0.083 s
- [HashMap] L1 scale: n=198,675 in 0.086 s
- [HashMap] Prune to k=200: in 0.119 s
- [fastutil] Adds: 1,000,000 ops in 0.157 s (6.4 Mops/s)
- [fastutil] L2 scale: n=198,675 in 0.038 s
- [fastutil] L1 scale: n=198,675 in 0.053 s
- [fastutil] Prune to k=200: in 0.150 s
  • Adds: time to apply many addFeatureValue(term, +1.0f) updates.
  • L2/L1 scale: time to normalize all feature weights to unit L2/L1 norm.
  • Prune: time to keep the top k largest-weight features.
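The three timed operations can be sketched on a plain Java `HashMap<String, Float>`. This is a minimal stand-in for the benchmarked class, not Anserini's actual `FeatureVector` implementation; the method names mirror the ones discussed here but the bodies are illustrative:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class FeatureVectorBench {
    // Accumulate a weight for a term (the "Adds" benchmark).
    static void addFeatureValue(Map<String, Float> weights, String term, float value) {
        weights.merge(term, value, Float::sum);
    }

    // Scale all weights so the vector has unit L2 norm (the "L2 scale" benchmark).
    static void scaleToUnitL2Norm(Map<String, Float> weights) {
        double sumSq = 0.0;
        for (float v : weights.values()) sumSq += (double) v * v;
        double norm = Math.sqrt(sumSq);
        if (norm > 0) weights.replaceAll((t, v) -> (float) (v / norm));
    }

    // Keep only the k largest-weight features (the "Prune" benchmark).
    static Map<String, Float> pruneToSize(Map<String, Float> weights, int k) {
        Map<String, Float> pruned = new HashMap<>();
        weights.entrySet().stream()
            .sorted(Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder()))
            .limit(k)
            .forEach(e -> pruned.put(e.getKey(), e.getValue()));
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, Float> w = new HashMap<>();
        addFeatureValue(w, "a", 3.0f);
        addFeatureValue(w, "b", 4.0f);
        scaleToUnitL2Norm(w);                            // norm is 5, so a=0.6, b=0.8
        System.out.println(w.get("a") + " " + w.get("b"));
        System.out.println(pruneToSize(w, 1).keySet());  // keeps the larger weight, "b"
    }
}
```

Swapping the `HashMap` for a fastutil `Object2FloatOpenHashMap` changes only the map type; the per-operation cost difference is what the numbers above measure.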

@lintool
Member

lintool commented Aug 21, 2025

So, it's slower... question is, does it have any end-to-end impact?

@JJGreen0 can you find out where this class is being used in retrieval and assess the e2e impact? IIRC, this is being used in relevance feedback?

@JJGreen0
Contributor Author

JJGreen0 commented Aug 22, 2025

Yes, this class is used in relevance feedback, via the RM3 and Rocchio rerankers.

  • RM3: src/main/java/io/anserini/rerank/lib/Rm3Reranker.java
    • Core calls: FeatureVector.fromTerms(...), addFeatureValue(...), pruneToSize(...), scaleToUnitL1Norm(), interpolate(...),
      iteration over features to build the Lucene query.
  • Rocchio: src/main/java/io/anserini/rerank/lib/RocchioReranker.java
    • Core calls: same FeatureVector methods as RM3, plus repeated getValue(...) lookups in the feedback aggregation.
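For reference, the `interpolate(...)` step named above combines the original query vector with the feedback vector under a mixing weight alpha. A minimal sketch of standard RM3-style interpolation follows; this is the textbook formulation, assumed here, not copied from Anserini's code:

```java
import java.util.HashMap;
import java.util.Map;

public class Interpolate {
    // alpha * queryWeight + (1 - alpha) * feedbackWeight for every term
    // appearing in either vector (standard RM3-style interpolation).
    static Map<String, Float> interpolate(Map<String, Float> query,
                                          Map<String, Float> feedback,
                                          float alpha) {
        Map<String, Float> out = new HashMap<>();
        for (Map.Entry<String, Float> e : query.entrySet())
            out.merge(e.getKey(), alpha * e.getValue(), Float::sum);
        for (Map.Entry<String, Float> e : feedback.entrySet())
            out.merge(e.getKey(), (1 - alpha) * e.getValue(), Float::sum);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> q = Map.of("apple", 1.0f);
        Map<String, Float> fb = Map.of("apple", 0.5f, "fruit", 0.5f);
        // apple: 0.5*1.0 + 0.5*0.5 = 0.75; fruit: 0.5*0.5 = 0.25
        System.out.println(interpolate(q, fb, 0.5f));
    }
}
```

The point for this PR is that interpolation is just a pair of map traversals, so its cost scales with the same map-operation overhead measured in the microbenchmark.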

First-pass retrieval is not affected.
Indexing and the Lucene scorer are also unaffected.

The end-to-end slowdowns shouldn't be large, because the affected methods (add, L2/L1 scale, prune) aren't the dominant costs compared to sorting and reading term info. Also, the number of map operations per query is far smaller than the operation counts in the speed tests above, which translates to sub-millisecond differences per query for the FeatureVector portion.
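A rough back-of-envelope supports the sub-millisecond claim. The assumed workload of a few thousand map operations per feedback query is illustrative, not measured:

```java
public class PerQueryDelta {
    public static void main(String[] args) {
        // From the microbenchmark: 1,000,000 adds in 0.265 s (HashMap)
        // vs 0.157 s (fastutil).
        double perOpDeltaNs = (0.265 - 0.157) / 1_000_000 * 1e9;  // ~108 ns per add

        // Illustrative assumption: ~5,000 map operations per feedback query.
        double perQueryDeltaMs = perOpDeltaNs * 5_000 / 1e6;      // ~0.5 ms

        System.out.printf("~%.0f ns/op, ~%.1f ms/query%n", perOpDeltaNs, perQueryDeltaMs);
    }
}
```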

@lintool
Member

lintool commented Aug 23, 2025

@JJGreen0 Can you run some e2e tests with a corpus you already have access to, like MS MARCO?
https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-dl19-passage.md

Both RM3 and Rocchio are used there.

Measure e2e latency? And make sure both implementations give the same results?

@JJGreen0
Contributor Author

Goal: Check if replacing fastutil maps with standard Java HashMaps slows end-to-end query-time when using relevance feedback (RM3/Rocchio).

Corpus: MS MARCO V1 passage (index already on disk). Topics: TREC DL’19 passage (43 queries).

  • A/B comparison: Timed two builds on the same index and topics:
    • Pre-change build (fastutil).
    • Post-change build (Java HashMap).

Methodology: Ran the exact same SearchCollection commands for each pipeline, warmed the runs, and captured wall-clock time with /usr/bin/time.

Results:

- Post-change (Java HashMap):
    - BM25: 2.82 s
    - +RM3: 3.08 s
    - +Rocchio: 2.98–3.13 s (2.98 s warmed)
- Pre-change (fastutil):
    - BM25: 2.84 s
    - +RM3: 3.08–3.09 s
    - +Rocchio: 3.09–3.27 s (≈3.09 s warmed)
- Delta (post – pre):
    - +RM3: ~0.0 ms/query
    - +Rocchio: ~−2.6 ms/query (faster)

No meaningful end-to-end latency increase. The differences are likely within normal run-to-run variance (only one timed run per configuration, so the sample size is small, but the trend is clear).
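The per-query deltas above follow directly from the warmed wall-clock times over the 43 DL'19 queries:

```java
public class DeltaPerQuery {
    public static void main(String[] args) {
        int queries = 43;  // TREC DL'19 passage topics

        // Warmed wall-clock times from the runs above, in seconds (post - pre).
        double rm3DeltaMs = (3.08 - 3.08) / queries * 1000;      // ~0.0 ms/query
        double rocchioDeltaMs = (2.98 - 3.09) / queries * 1000;  // ~-2.6 ms/query

        System.out.printf("RM3: %.1f ms/query, Rocchio: %.1f ms/query%n",
            rm3DeltaMs, rocchioDeltaMs);
    }
}
```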

@lintool lintool self-requested a review September 16, 2025 11:07
@lintool lintool merged commit 7cbc02e into castorini:master Sep 16, 2025
1 check passed