-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Introducing a new config to ignore nulls while computing String Cardinality #12345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Added |
} | ||
|
||
/** | ||
* whether nulls should be counted during String cardinality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this needs to clarify that it only applies to the built-in cardinality aggregator, and also clarify that it means the empty string isn't counted as a value either, since null and empty string are equivalent in default mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it should be rather a constructor parameter of the aggregator than a system property. It would make it clear what aggregator is impacted.
if (NullHandling.ignoreNullsForStringCardinality()) { | ||
//check and do not count nulls for Strings | ||
for (int i = 0, rowSize = row.size(); i < rowSize; i++) { | ||
int index = row.get(i); | ||
final String value = dimSelector.lookupName(index); | ||
if (value != null) { | ||
addStringToCollector(collector, value); | ||
} | ||
} | ||
} else { | ||
//count everything | ||
for (int i = 0, rowSize = row.size(); i < rowSize; i++) { | ||
int index = row.get(i); | ||
final String value = dimSelector.lookupName(index); | ||
addStringToCollector(collector, value); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't really seem like the right place to do this, rather this should be done in addStringToCollector
? Then it could be something like
if (s != null || (NullHandling.replaceWithDefault() && !NullHandling.ignoreNullsForStringCardinality())) {
...
Also, I don't think this setting should apply when druid.generic.useDefaultValueForNull=false
, which should behave in an SQL compatible manner. Anyway, if it was pushed into there then it would fix all of the string cardinality aggregators I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it seems like hashRow
isn't wired up to this either, shouldn't this setting control the behavior of all strings with the cardinality aggregator in all of its modes?
"false" | ||
)); | ||
} else { | ||
this.ignoreNullsForStringCardinality = ignoreNullsForStringCardinality; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be force set to false when useDefaultValuesForNull
is false, since we don't want a way to make the results not compatible with SQL.
} | ||
Assert.assertEquals(NullHandling.replaceWithDefault() ? 7.0 : 6.0, (Double) valueAggregatorFactory.finalizeComputation(agg.get()), 0.05); | ||
Assert.assertEquals(NullHandling.replaceWithDefault() ? 7L : 6L, rowAggregatorFactoryRounded.finalizeComputation(agg.get())); | ||
if (NullHandling.ignoreNullsForStringCardinality()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be useful to have a method for tests on NullHandling
that allows explicitly configuring this value so you can test both modes (to be ultra safe, in a try/finally to revert it in case anything happens it doesn't leave static config in a funny state)
Thanks for the comments. Working on it |
…e proper position, handling hasRow() and hasValue() changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm 👍
my only comments are nit-picking to make docs a bit clearer, so will go ahead and approve
core/src/main/java/org/apache/druid/common/config/NullValueHandlingConfig.java
Show resolved
Hide resolved
@somu-imply the doc doesn't seem clear to me. Is @clintropolis's comment in #12345 (comment) addressed? The reader should be able to tell exactly when the new config takes effect and what the exact effect is. |
8943376
to
808269a
Compare
hi @jihoonson the docs have been updated to indicate it works only for built-in cardinality on string columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @somu-imply. LGTM
Merging this PR. The Travis failure seems flaky as it passed for 808269a and there is only doc change between it and the last commit. |
…nality (apache#12345) * Counting nulls in String cardinality with a config * Adding tests for the new config * Wrapping the vectorize part to allow backward compatibility * Adding different tests, cleaning the code and putting the check at the proper position, handling hasRow() and hasValue() changes * Updating testcase and code * Adding null handling test to improve coverage * Checkstyle fix * Adding 1 more change in docs * Making docs clearer
Introduced a new config flag
druid.generic.ignoreNullsForStringCardinality
which is by default set tofalse
. When set to true the nulls in a string column are not counted towards cardinality.For example in this
testTable
columnIn the default case (when the config is set to
false
) the queryselect COUNT(DISTINCT stringVal) from testTable
will return a value of 2When set to
true
the same query will ignore the nulls and give a value of 1.This PR has: