jitterRANN: Fast Nearest Neighbour Search with Random Tie Breaking

This is a “fork” of RANN v2.5 which adds random tie breaking to the standard search type with kd trees. It is as if we added some jitter (i.e. some small amount of noise) to the data, hence the name of this package. Note though that this implementation uses an exact algorithm to break ties at random.

Example

data <- data.frame(
  x = c(rep(c(1, 2, 3), each = 5)),
  y = c(rep(c(1, 2, 3), each = 5))
)

query <- data.frame(
  x = rep(1, 10),
  y = rep(1, 10)
)

print(data)

##    x y
## 1  1 1
## 2  1 1
## 3  1 1
## 4  1 1
## 5  1 1
## 6  2 2
## 7  2 2
## 8  2 2
## 9  2 2
## 10 2 2
## 11 3 3
## 12 3 3
## 13 3 3
## 14 3 3
## 15 3 3

print(query)

##    x y
## 1  1 1
## 2  1 1
## 3  1 1
## 4  1 1
## 5  1 1
## 6  1 1
## 7  1 1
## 8  1 1
## 9  1 1
## 10 1 1

The RANN packages returns for each of the 10 identical query points the exact same results:

RANN::nn2(data, query, k = 8)

## $nn.idx
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
##  [1,]    5    4    3    2    1    6    7    8
##  [2,]    5    4    3    2    1    6    7    8
##  [3,]    5    4    3    2    1    6    7    8
##  [4,]    5    4    3    2    1    6    7    8
##  [5,]    5    4    3    2    1    6    7    8
##  [6,]    5    4    3    2    1    6    7    8
##  [7,]    5    4    3    2    1    6    7    8
##  [8,]    5    4    3    2    1    6    7    8
##  [9,]    5    4    3    2    1    6    7    8
## [10,]    5    4    3    2    1    6    7    8
## 
## $nn.dists
##       [,1] [,2] [,3] [,4] [,5]     [,6]     [,7]     [,8]
##  [1,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [2,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [3,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [4,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [5,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [6,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [7,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [8,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [9,]    0    0    0    0    0 1.414214 1.414214 1.414214
## [10,]    0    0    0    0    0 1.414214 1.414214 1.414214

While this is technically correct, many algorithms will run into issues with such results. E.g., I originally created jitterRANN for the use in a Predictive Mean Matching implementation. Using RANN here could result in severely biased imputations.

Note that it is not sufficient to permute points (with the same distances) afterwards, since we would still completely lack points 9 and 10, which are just as good as points 6 to 8.

Here is what jitterRANN returns:

set.seed(156)
jitterRANN::nn2(data, query, k = 8)

## $nn.idx
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
##  [1,]    2    5    4    1    3   10    9    7
##  [2,]    2    3    4    1    5   10    9    8
##  [3,]    5    4    1    3    2    7   10    8
##  [4,]    1    3    5    4    2   10    7    9
##  [5,]    3    5    1    2    4   10    9    7
##  [6,]    3    4    2    5    1    6    8   10
##  [7,]    4    2    1    3    5    6    7   10
##  [8,]    1    3    4    2    5    7    6    9
##  [9,]    1    2    3    5    4    6    9    8
## [10,]    1    5    4    3    2    6    7   10
## 
## $nn.dists
##       [,1] [,2] [,3] [,4] [,5]     [,6]     [,7]     [,8]
##  [1,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [2,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [3,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [4,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [5,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [6,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [7,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [8,]    0    0    0    0    0 1.414214 1.414214 1.414214
##  [9,]    0    0    0    0    0 1.414214 1.414214 1.414214
## [10,]    0    0    0    0    0 1.414214 1.414214 1.414214

Let’s verify the results a bit more by looking at the relative distribution of the neighbours:

query <- data.frame(
  x = rep(1, 100000),
  y = rep(1, 100000)
)

res <- jitterRANN::nn2(data, query, k = 8)

for (j in 1:8) {
  cat("\n", j, ". neighbour:\n", sep = "")
  print(prop.table(table(res[["nn.idx"]][, j])))
}

## 
## 1. neighbour:
## 
##       1       2       3       4       5 
## 0.19834 0.19999 0.20258 0.19963 0.19946 
## 
## 2. neighbour:
## 
##       1       2       3       4       5 
## 0.20239 0.19955 0.19877 0.19959 0.19970 
## 
## 3. neighbour:
## 
##       1       2       3       4       5 
## 0.20079 0.20071 0.20001 0.20102 0.19747 
## 
## 4. neighbour:
## 
##       1       2       3       4       5 
## 0.20112 0.19919 0.19794 0.19983 0.20192 
## 
## 5. neighbour:
## 
##       1       2       3       4       5 
## 0.19736 0.20056 0.20070 0.19993 0.20145 
## 
## 6. neighbour:
## 
##       6       7       8       9      10 
## 0.20142 0.19948 0.19920 0.19932 0.20058 
## 
## 7. neighbour:
## 
##       6       7       8       9      10 
## 0.19733 0.20234 0.20103 0.20076 0.19854 
## 
## 8. neighbour:
## 
##       6       7       8       9      10 
## 0.20095 0.19841 0.19918 0.20045 0.20101

Performance

If there are no ties to break, both implementations are equally fast:

data <- data.frame(
  x = runif(1e5),
  y = runif(1e5)
)

query <- data.frame(
  x = runif(1e3),
  y = runif(1e3)
)

bench <- microbenchmark::microbenchmark(
  RANN = RANN::nn2(data, query, k = 10),
  jitterRANN = jitterRANN::nn2(data, query, k = 10),
  times = 10
)

withr::with_options(
  list(digits = 3),
  print(bench)
)

## Unit: milliseconds
##        expr  min   lq mean median   uq  max neval
##        RANN 75.9 76.6 77.4   77.2 77.8 80.1    10
##  jitterRANN 76.1 76.3 78.3   77.5 79.1 85.2    10

If there is a high number of ties, this has a measurable performance impact, but I found it still acceptable also for practical use cases with big data:

data <- data.frame(
  x = rep(runif(1e3), 1e2),
  y = rep(runif(1e3), 1e2)
)

query <- data.frame(
  x = runif(1e3),
  y = runif(1e3)
)

bench <- microbenchmark::microbenchmark(
  RANN = RANN::nn2(data, query, k = 10),
  jitterRANN = jitterRANN::nn2(data, query, k = 10),
  times = 10
)

withr::with_options(
  list(digits = 3),
  print(bench)
)

## Unit: milliseconds
##        expr   min    lq  mean median    uq   max neval
##        RANN  64.6  66.2  67.6   68.3  68.8  69.9    10
##  jitterRANN 128.3 130.2 136.2  132.3 141.3 157.3    10

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github/workflows		.github/workflows
R		R
inst		inst
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS		NEWS
README.Rmd		README.Rmd
README.md		README.md
jitterRANN.Rproj		jitterRANN.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

jitterRANN: Fast Nearest Neighbour Search with Random Tie Breaking

Example

Performance

About

Uh oh!

Releases

Packages

Languages

akersting/jitterRANN

Folders and files

Latest commit

History

Repository files navigation

jitterRANN: Fast Nearest Neighbour Search with Random Tie Breaking

Example

Performance

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages