Skip to content

Conversation

somu-imply
Copy link
Contributor

@somu-imply somu-imply commented Mar 17, 2022

Introduced a new config flag druid.generic.ignoreNullsForStringCardinality which is by default set to false. When set to true the nulls in a string column are not counted towards cardinality.

For example in this testTable column

stringVal
null
null
null
abc
abc
abc

In the default case (when the config is set to false) the query
select COUNT(DISTINCT stringVal) from testTable will return a value of 2

When set to true the same query will ignore the nulls and give a value of 1.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@suneet-s
Copy link
Contributor

Added Design Review because it introduces a feature flag.

}

/**
* whether nulls should be counted during String cardinality
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this needs to clarify that it only applies to the built-in cardinality aggregator, and also clarify that it means the empty string isn't counted as a value either, since null and empty string are equivalent in default mode

Copy link
Contributor

@jihoonson jihoonson Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it should be rather a constructor parameter of the aggregator than a system property. It would make it clear what aggregator is impacted.

Comment on lines 91 to 106
if (NullHandling.ignoreNullsForStringCardinality()) {
//check and do not count nulls for Strings
for (int i = 0, rowSize = row.size(); i < rowSize; i++) {
int index = row.get(i);
final String value = dimSelector.lookupName(index);
if (value != null) {
addStringToCollector(collector, value);
}
}
} else {
//count everything
for (int i = 0, rowSize = row.size(); i < rowSize; i++) {
int index = row.get(i);
final String value = dimSelector.lookupName(index);
addStringToCollector(collector, value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't really seem like the right place to do this, rather this should be done in addStringToCollector? Then it could be something like

if (s != null || (NullHandling.replaceWithDefault() && !NullHandling.ignoreNullsForStringCardinality())) {
...

Also, I don't think this setting should apply when druid.generic.useDefaultValueForNull=false, which should behave in an SQL compatible manner. Anyway, if it was pushed into there then it would fix all of the string cardinality aggregators I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems like hashRow isn't wired up to this either, shouldn't this setting control the behavior of all strings with the cardinality aggregator in all of its modes?

"false"
));
} else {
this.ignoreNullsForStringCardinality = ignoreNullsForStringCardinality;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be force set to false when useDefaultValuesForNull is false, since we don't want a way to make the results not compatible with SQL.

}
Assert.assertEquals(NullHandling.replaceWithDefault() ? 7.0 : 6.0, (Double) valueAggregatorFactory.finalizeComputation(agg.get()), 0.05);
Assert.assertEquals(NullHandling.replaceWithDefault() ? 7L : 6L, rowAggregatorFactoryRounded.finalizeComputation(agg.get()));
if (NullHandling.ignoreNullsForStringCardinality()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be useful to have a method for tests on NullHandling that allows explicitly configuring this value so you can test both modes (to be ultra safe, in a try/finally to revert it in case anything happens it doesn't leave static config in a funny state)

@somu-imply
Copy link
Contributor Author

Thanks for the comments. Working on it

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm 👍

my only comments are nit-picking to make docs a bit clearer, so will go ahead and approve

@jihoonson
Copy link
Contributor

@somu-imply the doc doesn't seem clear to me. Is @clintropolis's comment in #12345 (comment) addressed? The reader should be able to tell exactly when the new config takes effect and what the exact effect is.

@somu-imply
Copy link
Contributor Author

hi @jihoonson the docs have been updated to indicate it works only for built-in cardinality on string columns

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @somu-imply. LGTM

@jihoonson
Copy link
Contributor

Merging this PR. The Travis failure seems flaky as it passed for 808269a and there is only doc change between it and the last commit.

@jihoonson jihoonson merged commit a1ea658 into apache:master Mar 29, 2022
TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
…nality (apache#12345)

* Counting nulls in String cardinality with a config

* Adding tests for the new config

* Wrapping the vectorize part to allow backward compatibility

* Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes

* Updating testcase and code

* Adding null handling test to improve coverage

* Checkstyle fix

* Adding 1 more change in docs

* Making docs clearer
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants