Introducing a new config to ignore nulls while computing String Cardinality #12345

somu-imply · 2022-03-17T18:42:58Z

Introduced a new config flag, druid.generic.ignoreNullsForStringCardinality, which defaults to false. When it is set to true, nulls in a string column are not counted towards cardinality.

For example, given this testTable column:

stringVal
null
null
null
abc
abc
abc

In the default case (when the config is set to false), the query
select COUNT(DISTINCT stringVal) from testTable returns 2, because null is counted as a value.

When the config is set to true, the same query ignores the nulls and returns 1.
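The two results above can be reproduced with a small standalone sketch. It does not use Druid at all: the class and method names are hypothetical, and an exact HashSet stands in for whatever collector the cardinality aggregator actually uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Druid code.
final class StringCardinalityExample
{
  // Counts distinct values; when ignoreNulls is true, null entries are skipped.
  static int distinctCount(List<String> column, boolean ignoreNulls)
  {
    Set<String> seen = new HashSet<>();
    for (String value : column) {
      if (value == null && ignoreNulls) {
        continue;
      }
      seen.add(value); // HashSet permits a single null element
    }
    return seen.size();
  }

  public static void main(String[] args)
  {
    List<String> stringVal = Arrays.asList(null, null, null, "abc", "abc", "abc");
    System.out.println(distinctCount(stringVal, false)); // flag off: null counts as a value
    System.out.println(distinctCount(stringVal, true));  // flag on: only "abc" is counted
  }
}
```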

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added documentation for new or modified features or behaviors.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added or updated version, license, or notice information in licenses.yaml
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- added integration tests.
- been tested in a test Druid cluster.

suneet-s · 2022-03-17T23:15:54Z

Added Design Review because it introduces a feature flag.

clintropolis · 2022-03-18T03:48:48Z

core/src/main/java/org/apache/druid/common/config/NullHandling.java

  }

+  /**
+   * whether nulls should be counted during String cardinality


I think this needs to clarify that it only applies to the built-in cardinality aggregator, and also that the empty string isn't counted as a value either, since null and empty string are equivalent in default mode.

jihoonson · 2022-03-18

I wonder if it should be a constructor parameter of the aggregator rather than a system property. That would make it clear which aggregator is affected.

docs/configuration/index.md

clintropolis · 2022-03-18T03:54:15Z

...d/query/aggregation/cardinality/types/StringCardinalityAggregatorColumnSelectorStrategy.java

+    if (NullHandling.ignoreNullsForStringCardinality()) {
+      //check and do not count nulls for Strings
+      for (int i = 0, rowSize = row.size(); i < rowSize; i++) {
+        int index = row.get(i);
+        final String value = dimSelector.lookupName(index);
+        if (value != null) {
+          addStringToCollector(collector, value);
+        }
+      }
+    } else {
+      //count everything
+      for (int i = 0, rowSize = row.size(); i < rowSize; i++) {
+        int index = row.get(i);
+        final String value = dimSelector.lookupName(index);
+        addStringToCollector(collector, value);
+      }


this doesn't really seem like the right place to do this, rather this should be done in addStringToCollector? Then it could be something like

if (s != null || (NullHandling.replaceWithDefault() && !NullHandling.ignoreNullsForStringCardinality())) { ...

Also, I don't think this setting should apply when druid.generic.useDefaultValueForNull=false, which should behave in an SQL compatible manner. Anyway, if it was pushed into there then it would fix all of the string cardinality aggregators I think?

Also, it seems like hashRow isn't wired up to this either, shouldn't this setting control the behavior of all strings with the cardinality aggregator in all of its modes?
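A minimal sketch of the suggestion to push the check into addStringToCollector, using the condition quoted above. This is not the Druid implementation: the class is hypothetical, the two booleans stand in for NullHandling.replaceWithDefault() and NullHandling.ignoreNullsForStringCardinality(), and a HashSet stands in for the real collector.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the aggregator helper; not the actual Druid class.
final class CollectorSketch
{
  private final boolean replaceWithDefault;              // stands in for NullHandling.replaceWithDefault()
  private final boolean ignoreNullsForStringCardinality; // stands in for the new flag
  private final Set<String> collector = new HashSet<>(); // exact set instead of the real collector

  CollectorSketch(boolean replaceWithDefault, boolean ignoreNullsForStringCardinality)
  {
    this.replaceWithDefault = replaceWithDefault;
    this.ignoreNullsForStringCardinality = ignoreNullsForStringCardinality;
  }

  // Single choke point: every caller gets the same null policy.
  void addStringToCollector(String s)
  {
    if (s != null || (replaceWithDefault && !ignoreNullsForStringCardinality)) {
      collector.add(s);
    }
  }

  int cardinality()
  {
    return collector.size();
  }
}
```

With the check in one place, a null is only ever added in default-value mode with the flag off, so the SQL-compatible mode is unaffected by the flag in this sketch.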

clintropolis · 2022-03-18T03:55:52Z

core/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java

+          "false"
+      ));
+    } else {
+      this.ignoreNullsForStringCardinality = ignoreNullsForStringCardinality;


I think this should be force set to false when useDefaultValuesForNull is false, since we don't want a way to make the results not compatible with SQL.
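One way this could look, as a sketch only. The class below is hypothetical and is not NullValueHandlingConfig; it just shows the flag being forced to false whenever useDefaultValuesForNull is false.

```java
// Hypothetical config holder; the field names mirror the discussion, the class does not exist in Druid.
final class NullConfigSketch
{
  private final boolean useDefaultValuesForNull;
  private final boolean ignoreNullsForStringCardinality;

  NullConfigSketch(boolean useDefaultValuesForNull, boolean ignoreNullsForStringCardinality)
  {
    this.useDefaultValuesForNull = useDefaultValuesForNull;
    // Honor the flag only in default-value mode; in SQL-compatible mode it is forced off.
    this.ignoreNullsForStringCardinality = useDefaultValuesForNull && ignoreNullsForStringCardinality;
  }

  boolean isUseDefaultValuesForNull()
  {
    return useDefaultValuesForNull;
  }

  boolean isIgnoreNullsForStringCardinality()
  {
    return ignoreNullsForStringCardinality;
  }
}
```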

clintropolis · 2022-03-18T04:00:52Z

.../src/test/java/org/apache/druid/query/aggregation/cardinality/CardinalityAggregatorTest.java

    }
-    Assert.assertEquals(NullHandling.replaceWithDefault() ? 7.0 : 6.0, (Double) valueAggregatorFactory.finalizeComputation(agg.get()), 0.05);
-    Assert.assertEquals(NullHandling.replaceWithDefault() ? 7L : 6L, rowAggregatorFactoryRounded.finalizeComputation(agg.get()));
+    if (NullHandling.ignoreNullsForStringCardinality()) {


it would be useful to have a method for tests on NullHandling that allows explicitly configuring this value so you can test both modes (to be ultra safe, wrap it in a try/finally that reverts the value, so that if anything goes wrong it doesn't leave static config in a funny state)
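A sketch of what such a test hook might look like. Everything here is an assumption for illustration: the class, the static field, and the withIgnoreNulls helper are hypothetical and are not part of NullHandling.

```java
// Hypothetical static config with a test-only override; not the real NullHandling class.
final class StaticFlagSketch
{
  private static boolean ignoreNullsForStringCardinality = false;

  static boolean ignoreNullsForStringCardinality()
  {
    return ignoreNullsForStringCardinality;
  }

  // Test-only: runs body with the flag set to value, then restores the previous value
  // even if body throws, so static state does not leak between tests.
  static void withIgnoreNulls(boolean value, Runnable body)
  {
    final boolean previous = ignoreNullsForStringCardinality;
    ignoreNullsForStringCardinality = value;
    try {
      body.run();
    }
    finally {
      ignoreNullsForStringCardinality = previous;
    }
  }
}
```

A test could then call the helper once with true and once with false and assert the expected cardinality inside each body.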

somu-imply · 2022-03-18T17:25:34Z

Thanks for the comments. Working on it

clintropolis

overall lgtm 👍

my only comments are nit-picking to make docs a bit clearer, so will go ahead and approve

core/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java

docs/configuration/index.md

jihoonson · 2022-03-29T16:10:10Z

@somu-imply the doc doesn't seem clear to me. Is @clintropolis's comment in #12345 (comment) addressed? The reader should be able to tell exactly when the new config takes effect and what the exact effect is.

somu-imply · 2022-03-29T16:43:45Z

Hi @jihoonson, the docs have been updated to indicate that it applies only to the built-in cardinality aggregator on string columns.

jihoonson

Thanks @somu-imply. LGTM

jihoonson · 2022-03-29T21:31:29Z

Merging this PR. The Travis failure seems flaky, as it passed for 808269a and there is only a doc change between it and the last commit.

…nality (apache#12345)

* Counting nulls in String cardinality with a config
* Adding tests for the new config
* Wrapping the vectorize part to allow backward compatibility
* Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes
* Updating testcase and code
* Adding null handling test to improve coverage
* Checkstyle fix
* Adding 1 more change in docs
* Making docs clearer

somu-imply added 3 commits March 17, 2022 10:49

Counting nulls in String cardinality with a config

bde318b

Adding tests for the new config

c398b4c

Wrapping the vectorize part to allow backward compatibility

04abfec

suneet-s added the Design Review, Area - Querying, and Bug labels Mar 17, 2022

clintropolis reviewed Mar 18, 2022

somu-imply added 4 commits March 21, 2022 13:27

Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes

ef0b53a

Updating testcase and code

87b2ce2

Adding null handling test to improve coverage

f2622a3

Checkstyle fix

553e37b

clintropolis approved these changes Mar 28, 2022

Adding 1 more change in docs

808269a

somu-imply force-pushed the cardinality_nulls branch from 8943376 to 808269a Compare March 29, 2022 16:42

Making docs clearer

4af91de

jihoonson approved these changes Mar 29, 2022

jihoonson merged commit a1ea658 into apache:master Mar 29, 2022

abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022

abhishekagarwal87 mentioned this pull request Aug 1, 2023

Improve the script to find missing backports #14723

Merged

2 tasks
