RFC: System Parameters on Meta #34
Conversation
What if some system configs are required on cluster start, e.g. the hummock URL?
Initialize it once the parameters are fetched from Meta, or ...
It works for worker types other than Meta, but what about Meta itself? How should these configs be initialized, maybe via command-line options?
How about setting it before any compute node joins the cluster?

Regarding the decision of where to put an argument: if, e.g., the argument relates to a RW instance, how do we decide whether to put it into the SQL interface or the web console? Or will we always offer both options? Same for process configs: when do we use a CLI arg and when the .toml file?
IIUC it's a layered config:
When some configs are frequently used and modified, we can provide them in the upper-layer interface for ease of use.
rfcs/0034-system-parameters.md (outdated)
> system --> cloud(Web Console)
>
> ### System Configs
I think we need to clarify: Are system configs distinct from config files? From my understanding of the motivation it's yes. A related question is: If so, the default values for them (i.e., the values at startup) cannot be changed, right? 🤔
FWIW, since you referred to PG, in PG they are the same set of configs:
> These variables can be set using the SET statement, by editing the postgresql.conf configuration file, through the PGOPTIONS environment variable (when using libpq or a libpq-based application), or through command-line flags when starting the postgres server.
Things are easier for single-node systems like PG/MySQL. Here we need to either 1) move all configs to the meta service, or 2) divide them into 2 sets: local and global.
I chose 2 because local configs seem impossible to avoid. For example, --memory must be a local parameter.
xxchan
left a comment
The ideas generally LGTM, but need some clarifications as commented.
The SQL interface and the web console control exactly the same set of parameters. The web console is just an optional GUI. CLI args are a subset of ...
Related draft PR: risingwavelabs/risingwave#7273
Co-authored-by: Noel Kwan <[email protected]>
This PR was drafted before this RFC, so it only includes a very small part of it. Essentially, for demo purposes, it moves one config ...
> ### Mutability
>
> - Process configs must be **immutable**, because there is no user-interface to change it.
I'm wondering about the case where a user has been running some workload for a while and realizes that a lower/higher parallelism would better utilize their resources.
In this scenario, parallelism is a process-level config and is immutable.
What should the user do if they want to adjust it? Do they have to start a new cluster?
Or is this not a realistic scenario?
> What should the user do if they want to adjust it? Do they have to start a new cluster?

In that case, they can restart (or start a new one, which is effectively equivalent) the CN process with a new parallelism config. No need to restart the whole cluster.
Parallelism is actually a very special case compared to other parameters. Let me tell the whole story in my mind.

Overall, we have:
- A process-level config to control the parallelism of one particular CN node.
- The system parameter `parallelism` as the default parallelism for new streaming jobs. The default value is `0` or `'auto'`, which implies that new streaming jobs are supposed to use all available parallel units in all compute nodes.
  - Sure, it can be session-level as well.

When the system boots up:
1. Compute nodes register themselves to the Meta node when starting up, including their parallelism, which is set by the config `parallelism` or defaults to the number of physical CPU cores.
2. The Meta service aggregates these numbers to get the total parallelism, or what we call the total number of parallel units.
   - Note: A "parallel unit" is basically a worker corresponding to one CPU core, but it's logical and is not strictly bound to a CPU core.
3. If the system parameter `parallelism` is set to `0` or `'auto'`, we will use the "total parallelism" from step 2 as the effective value to plan/schedule/run streaming job actors. Otherwise, the user-set value will be used.

If the cluster changes (e.g. new nodes join or existing nodes leave), it only affects the "total parallelism" in step 2. As a result, existing streaming jobs won't be affected and they still use their original parallelism. This is expected, because we expect the K8s operator or a human operator to redistribute the existing streaming actors instead of doing it automatically.
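For illustration, here is a minimal Rust sketch of that resolution logic. The names (`SystemParams`, `effective_parallelism`) are made up for this example and are not taken from the actual codebase:

```rust
/// Hypothetical system parameter: 0 means 'auto', i.e. "use all parallel units".
struct SystemParams {
    parallelism: u32,
}

/// Step 2: Meta aggregates the parallelism reported by each compute node at registration time.
fn total_parallel_units(node_parallelisms: &[u32]) -> u32 {
    node_parallelisms.iter().sum()
}

/// Step 3: effective parallelism used to plan/schedule a new streaming job.
fn effective_parallelism(params: &SystemParams, node_parallelisms: &[u32]) -> u32 {
    if params.parallelism == 0 {
        // 'auto': use everything the registered compute nodes offer.
        total_parallel_units(node_parallelisms)
    } else {
        // An explicit user-set value wins.
        params.parallelism
    }
}

fn main() {
    let params = SystemParams { parallelism: 0 };
    // Two compute nodes registered with 4 and 8 parallel units respectively.
    assert_eq!(effective_parallelism(&params, &[4, 8]), 12);
}
```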
Basically, is it correct to say that stream_parallelism is only a meaningful config for dev/on-prem, to set a stream-job-level default, but those using the cloud (and even the cloud team) should never have to care about it, due to the operator's ability to auto-redistribute?
Btw, I'm quite unclear: do we currently have an API for scaling individual fragments? Is this API an SQL API, or is it an HTTP/gRPC endpoint on meta?
Also, is there a way for the user to hint the starting parallelism? For instance, even if the cloud may auto-distribute, it is unclear whether it is better to start with a small or a high parallelism for a given stream job. The user may know better.
So something like `create materialized view (stream_job_parallelism_hint=3) ...` when the user thinks the stream job requires small throughput.
It seems similar to the session variable stream_parallelism, but to me it is unclear what this variable does if we are going to auto-scale (does it override the autoscaling, etc.), whereas a hint is at least clearer in that the parallelism may be modified by the system or by an external operator.
> Btw, I'm quite unclear, but do we currently have an API for scaling individual fragments?

Yes. See the design doc at https://singularity-data.quip.com/RubDA4a51oS0/Low-level-Scaling-Interface-on-Meta-Service

> So something like `create materialized view (stream_job_parallelism_hint=3) ...` when the user thinks the stream job requires small throughput.

That's exactly what the session variable streaming_parallelism does. See risingwavelabs/risingwave#7370.

> but to me it is unclear what this variable does if we are going to auto scale, whereas a hint is at least clearer in that the parallelism may be modified by the system or by an external operator.

I understand this seems somewhat inconsistent. In the current design, streaming_parallelism only applies to the subsequent CREATE MVIEW commands in the same session. The syntax you proposed LGTM as well. After the MV is created, the only way to manipulate its parallelism is via the Scaling API (the doc above).

The reason is that we consider scaling an advanced operational feature that should be part of the cloud platform instead of the database kernel, especially the policy, i.e. how and when to do scaling. Hence, we divide the feature "scaling" into 2 parts: 1) the cloud platform makes decisions about how & when, and then 2) the kernel simply executes these directives. That's the design philosophy behind that API. However, the only exception is creating an MView, during which the meta must decide the parallelism without the cloud platform engaged. In that case, I think the best we can do is to use stream_parallelism for each fragment.
Can we say that local or process configs are immutable during runtime to avoid confusion? Because users can always kill and restart a process with different config values.
> because there is no user-interface to change it.
That won't be true if the cloud portal supports updating with a reboot.
For process configs, should we have a separate config file (template) for each type of node, or add more sections to the single config file? Something like:

```toml
# Below sections may apply to multiple worker types, but the user does not need to know the details.
[server]
...
[meta]
...
[storage]
...
# The sections with the name of the worker type should apply ONLY to that worker type.
[compute_node]
# Put the entries from `ComputeNodeOpts` here
[compactor]
# Put the entries from `CompactorOpts` here
```

I was just reviewing the command line arguments for each type of node and realized they differ a lot. If we put all those arguments into one config file without adding more sections, it will be very cluttered, and the subset of configs that take effect varies from worker type to worker type.
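For what it's worth, here is a rough sketch of how such a sectioned file could be read, assuming the `serde` and `toml` crates; the struct and field names are illustrative only, not the real config schema:

```rust
use serde::Deserialize;

/// Hypothetical layout mirroring the sections above; each worker type reads
/// its own section plus the shared ones, and missing sections fall back to defaults.
#[derive(Deserialize, Default)]
#[serde(default)]
struct RwConfig {
    server: ServerConfig,
    compute_node: ComputeNodeConfig,
    compactor: CompactorConfig,
}

#[derive(Deserialize, Default)]
#[serde(default)]
struct ServerConfig {
    heartbeat_interval_ms: u32,
}

#[derive(Deserialize, Default)]
#[serde(default)]
struct ComputeNodeConfig {
    parallelism: u32,
}

#[derive(Deserialize, Default)]
#[serde(default)]
struct CompactorConfig {
    max_concurrent_task_number: u32,
}

fn main() {
    let text = r#"
        [server]
        heartbeat_interval_ms = 1000
        [compute_node]
        parallelism = 8
    "#;
    let cfg: RwConfig = toml::from_str(text).expect("invalid config");
    assert_eq!(cfg.server.heartbeat_interval_ms, 1000);
    assert_eq!(cfg.compute_node.parallelism, 8);
    // Unspecified sections fall back to (derived) defaults.
    assert_eq!(cfg.compactor.max_concurrent_task_number, 0);
}
```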
Such a brilliant solution!
LGTM for adding more sections. Multiple files may be troublesome for deployment.
> ### System Parameters
>
> What we need is a centralized source-of-truth of cluster configurations. Obviously, the best choice for RisingWave is the Meta Service and the underlying etcd.
I'm afraid it's not a good idea to persist the configuration to the meta store... Here are several reasons:

- The meta store is not designed to be directly manipulated by the user. If one has corrupted some configuration with `SET` and finds that the cluster does not boot anymore, they can do nothing about it.
- Maintaining persisted configuration and consistency with the `clap` definition can be really hard, especially for version compatibility (some famous HTAP database has had a lot of production accidents due to this).
- It seems not convincing to persist configuration like `data_directory`... What if the user really moves the data? Note that the cluster will immediately enter recovery mode on startup, without giving the user a chance to set a new `data_directory`.
Hmmm. These problems do exist, but persisting parameters (instead of losing all changes on reboot) is also a must-have feature for a distributed system.

> Meta store is not designed to be directly manipulated by the user. If one has corrupted some configuration with SET and finds that the cluster does not boot anymore, he can do nothing about this.

We need to equip the operation tool with the ability to fix it. For example, support a new command like `risectl set <parameter> to <value>` to manipulate etcd directly.

> Maintaining persisted configuration and consistency with the `clap` definition can be really hard, especially for version compatibility (some famous HTAP database has had a lot of production accidents due to this).

`clap` definitions must refer to the CLI arguments, right? They are unrelated to system parameters in most cases.

> It seems not convincing to persist configuration like data_directory... what if the user really moves the data? Note that the cluster will immediately enter recovery mode on startup, without giving the user a chance to set a new data_directory.

By "persist" I mean recoverable after rebooting. It can be mutable, as described in the section "Mutability".
> `clap` definitions must refer to the CLI arguments, right?

Or `serde`? 🥺 If we take a config file as the initialization of these parameters, then we need a separate definition of their schema in the code (or we can make it a dynamic HashMap).

> It can be mutable, as described in section "mutability".

Yes, but it cannot be mutated without a normally running cluster or the risectl mentioned above. As the old data_directory is invalid, we cannot survive the recovery phase.

> but persisting parameters (instead of losing all changes on reboot) is also a must-have feature for a distributed system.

I'm thinking: if we treat all SETs as temporary and just a way for users to give it a try, users can always write down the desired config in a new toml file and reboot the cluster to apply it. 🤔 I cannot remember well, but I guess this is also the Flink way?
> Or `serde`? 🥺 If we take a config file as the initialization of these parameters, then we need a separate definition of their schema in the code (or we can make it a dynamic HashMap).

In my mind, the system parameters are all simple strings or integers, so se/deserialization won't be a big issue.

> Yes, but it cannot be mutated without a normally running cluster or the risectl mentioned above. As the old data_directory is invalid, we cannot survive the recovery phase.

Perhaps it also needs that operational tool: "We need to equip the operation tool with the ability to fix it" (mentioned in a previous comment).
Anyway, data_directory is very special in many aspects: it's required to be set; it can hardly be changed while RisingWave is running; it needs to be mutated offline. Things may get a little dirty for it.

> I'm thinking: if we treat all SETs as temporary and just a way for users to give it a try, users can always write down the desired config in a new toml file and reboot the cluster to apply it. 🤔 I cannot remember well, but I guess this is also the Flink way?

Hmmm. Interesting. It's indeed a popular way in big data systems.
That's basically a centralized but non-persistent config store. It makes sense, but since we have a built-in metadata database, why not just persist it? I think in most cases this looks better (except data_dir, I know 😇).
This acts kind of like a mobile app, and we need to provide several mechanisms to make it work, like entering an "emergency mode" and providing fixing utilities if it frequently fails, or guiding users to contact customer service for help. 😄
I do agree we should clarify the difference between "process" configs and "cluster" configs, but I'm still doubting the possible issues and complexities of persisting them. I feel like we are taking on (possibly unnecessary) responsibility that could otherwise belong to the user or DevOps. 😂
IMHO ConfigMaps may actually be a good solution here. I am not sure how we would make this work in the risedev environment. Btw: under the hood, CMs, like all K8s objects, are also stored in etcd.

> Meta store is not designed to be directly manipulated by the user. If one has corrupted some configuration with SET and finds that the cluster does not boot anymore, he can do nothing about this.

Do you think it would help to version the configurations? Maybe this would make it simpler to fall back to a previous version if an update to the settings broke something. etcd supports a --prefix operation, so you could retrieve all values belonging to a setting independent of the timestamp, like this:

```
etcdctl put some_setting_23_01_27_13_22 val1  # original value
etcdctl put some_setting_23_01_28_09_00 val1  # updated value
etcdctl get some_setting --prefix             # retrieve all values
some_setting_23_01_27_13_22 val1
some_setting_23_01_28_09_00 val1
```
> Do you think it would help to version the configurations?

IMHO it's worth a separate RFC 🫣
Sounds good. But I believe first we would have to decide on how to store the configs, before deciding if/how we want to version them.
+1. If we really want to persist them to the meta store, we should carefully think of a way to store them. For example, we should introduce something like the "field number" in protobuf, so that we can handle backward compatibility more easily.
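To make the idea concrete, here is a small hedged sketch of what persisting parameters as (stable id, value) pairs could look like; the ids and parameter names below are made up for illustration:

```rust
/// Persist parameters keyed by a stable numeric id instead of by name,
/// so renaming a parameter in a later version does not break old snapshots.
/// Like protobuf field numbers, ids must never be reused or renumbered.
const BARRIER_INTERVAL_MS: u32 = 1;
const CHECKPOINT_FREQUENCY: u32 = 2;

fn decode(persisted: &[(u32, String)]) -> Vec<(&'static str, String)> {
    persisted
        .iter()
        .filter_map(|(id, value)| {
            let name = match *id {
                BARRIER_INTERVAL_MS => "barrier_interval_ms",
                CHECKPOINT_FREQUENCY => "checkpoint_frequency",
                // Unknown id: probably written by a newer version; skip instead of failing.
                _ => return None,
            };
            Some((name, value.clone()))
        })
        .collect()
}

fn main() {
    let stored = vec![(1, "250".to_owned()), (99, "future-param".to_owned())];
    let decoded = decode(&stored);
    assert_eq!(decoded, vec![("barrier_interval_ms", "250".to_owned())]);
}
```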
Well, I totally agree with having centralized system-wide parameters, no matter where they are stored, because it's painful for the operator to dispatch those global vars like data_directory now 👍. IMO, they can be stored in either:

- etcd, as a part of the metadata, because we may want to change them afterward, or
- a local and consistent file mounted on the meta nodes, so that the meta node can be the single source of truth, as all of the other components contact meta before they get started.
> Similar to other system metadata e.g. catalog,
>
> - Changes to global parameters should be propagated to all nodes in the cluster, ideally in a deterministic order
> - When a node reboots or recovers, the latest snapshot of parameters should be passed from Meta Service.
Could you please elaborate more on the initialization of the "snapshot" (in the meta store)? For example:

- Do we still have a configuration file for the cluster? If so, do we only persist the configuration on the first run, or do we also detect changes to the configuration file on restarts?
- Do we persist the default values of the entries that do not appear in the `.toml`? We should be cautious about default values that are "computed" instead of constants, as persisting them may be confusing.
Just to avoid misunderstanding: the "snapshot" here refers to the current system parameters in Meta Service at the beginning of a subscription, just like the "snapshot" of the catalog or Hummock versions.
Answers to the questions:

> Do we still have a configuration file for the cluster? If so, do we only persist the configuration on the first run, or do we also detect changes to the configuration file on restarts?

Basically not. It may be useful for a few cases like #34 (comment), but generally, I don't think we need such a design.

> Do we persist the default values of the entries that do not appear in the `.toml`? We should be cautious about default values that are "computed" instead of constants, as persisting them may be confusing.

Computed values (like the parallelism described in #34 (comment)) should be stored as a placeholder, e.g. 'auto' or 0. Does this solve your concern?
In this way, either persisting default values or not can work well, but I still prefer not to persist default values, because that allows us to tune the defaults in a higher-version RisingWave for better performance or whatever.
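A tiny sketch of that lookup order, under the assumption that only user-set values are persisted and the running binary supplies the rest from its built-in defaults; the parameter names and default values are illustrative only:

```rust
use std::collections::HashMap;

/// Built-in defaults compiled into the current RisingWave version.
/// These can change across releases because they are never persisted.
fn builtin_default(key: &str) -> Option<&'static str> {
    match key {
        "parallelism" => Some("auto"), // placeholder meaning "use all parallel units"
        "barrier_interval_ms" => Some("250"),
        _ => None,
    }
}

/// Resolve the effective value: persisted user override first, then the built-in default.
fn resolve(persisted: &HashMap<String, String>, key: &str) -> Option<String> {
    persisted
        .get(key)
        .cloned()
        .or_else(|| builtin_default(key).map(str::to_owned))
}

fn main() {
    let mut persisted = HashMap::new();
    persisted.insert("barrier_interval_ms".to_owned(), "1000".to_owned());

    // User-set value wins over the compiled-in default.
    assert_eq!(resolve(&persisted, "barrier_interval_ms").as_deref(), Some("1000"));
    // Never persisted, so a newer binary is free to ship a different default.
    assert_eq!(resolve(&persisted, "parallelism").as_deref(), Some("auto"));
}
```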
> Basically not. It may be useful for a few cases like #34 (comment), but generally, I don't think we need such a design.

Does that mean we need to manually initialize this on the first run, before issuing any streaming jobs? That sounds not that friendly for development and playground purposes.
(Though if it's only for the state store, we can make an exception for it to be passed in via a meta CLI parameter as the default value.)

> but I still prefer not to persist default values

Sounds like a reasonable choice! But we should still be cautious when upgrading, in case some users rely on the previous default values.
> Does it mean that we need to manually initialize this on the first run, before issuing any streaming jobs? It sounds not that friendly for development and playground purposes.

IIUC, we use MinIO for development and playground purposes now, and that's why we can have a default value for data_directory. We can keep this behavior unchanged.
But for production deployments or the cloud platform, a unique S3 bucket must be manually specified for sure.
Not only the data_directory, but also the state_store_url which tells which state store backend we should use. 🤔
I don't think the data_directory will be a big problem as long as it's not mutable and the deployment tool enforces the value. What really matters is how we keep the caches on various nodes updated and how to make sure things do not break when a parameter change happens.
> - Global configs might be **immutable or mutable**; the mutable ones can be further divided into those that take effect immediately and those that need a reboot.
>   - **Immutable**, e.g. `data_directory`: should be rejected upon SET
>   - **Mutable & need reboot**, e.g. `heartbeat_interval_ms`
>   - **Mutable & immediately**, e.g. `batch_chunk_size`
I guess there are more things to consider in defining "immediately". As Flink says:

> Because options are read at different point in time when performing operations, it is recommended to set configuration options early after instantiating a table environment.

If we only read the configuration at scheduling time (of batch queries...), then it sounds not that "immediate" 😄 but more like a session variable that only affects future queries. If we read it continuously, we may need to consider whether it can lead to unexpected behavior if the config can be modified at any time.
Exactly. The boundary seems obscure for these cases. I just collectively call them "immediately" to distinguish them from the need-reboot ones.
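To make the classification concrete, here is a small sketch of how the three mutability classes could be modeled; the enum and the parameter-to-class mapping are illustrative, not the actual implementation:

```rust
/// How a system parameter may be changed at runtime.
#[derive(Debug, PartialEq)]
enum Mutability {
    /// Rejected on SET, e.g. `data_directory`.
    Immutable,
    /// Accepted, but only takes effect after the affected nodes reboot,
    /// e.g. `heartbeat_interval_ms`.
    MutableNeedReboot,
    /// Accepted and picked up by future operations (queries / scheduling),
    /// e.g. `batch_chunk_size`.
    MutableImmediate,
}

fn mutability(param: &str) -> Mutability {
    match param {
        "data_directory" => Mutability::Immutable,
        "heartbeat_interval_ms" => Mutability::MutableNeedReboot,
        _ => Mutability::MutableImmediate,
    }
}

fn main() {
    // A SET on an immutable parameter should be rejected by the frontend/meta.
    assert_eq!(mutability("data_directory"), Mutability::Immutable);
}
```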
Is it a good idea to put configs that are bound to be different among nodes, like the host address, in the same config file 🤔?
Then use CLI args instead, if that makes you feel better 🤣
I agree CLI args are more suitable. So these configs shouldn't be written in the template config file, although users can surely add them in their own files, right? Because there's no appropriate default value.
> Similar to other system metadata e.g. catalog,
>
> - Changes to global parameters should be propagated to all nodes in the cluster, ideally in a deterministic order
Considering the nature of distributed systems, there will be cases where multiple values of the same parameter exist on various nodes, no matter how we optimize the notification mechanism. Will we introduce things like what is described in the Online Schema Change paper to ensure consistency and atomicity?
Do you mean Online, Asynchronous Schema Change in F1 by Rae et al.? As I understand this summary, they basically use 2PC to change schemas, right?
Maybe we could rely on etcd (with or without abstracting it via a K8s CM), since it offers strong consistency? A 2PC using etcd could look like this, maybe:
- Client issues a settings change against the system
- etcd changes the configs
- Pods are alerted about the config change (if we use a CM, K8s would do that for us)
- Meta (or some other component) asks each pod if it is ready to apply the changes. If all reply yes, then we apply.
Just a suggestion :)
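For illustration, a rough prepare/commit sketch of this suggestion in Rust; the trait and method names are made up, and a real implementation would presumably go through the notification mechanism discussed above (gRPC) rather than in-process calls:

```rust
/// One worker node's view of a pending parameter change (hypothetical interface).
trait ConfigParticipant {
    /// Phase 1: can this node apply the new value? (validation only, no side effects)
    fn prepare(&self, key: &str, value: &str) -> bool;
    /// Phase 2: actually apply it.
    fn commit(&mut self, key: &str, value: &str);
}

/// Meta-side driver: only commit if every node acknowledged the prepare phase.
fn apply_change(nodes: &mut [Box<dyn ConfigParticipant>], key: &str, value: &str) -> bool {
    if nodes.iter().all(|n| n.prepare(key, value)) {
        for n in nodes.iter_mut() {
            n.commit(key, value);
        }
        true
    } else {
        // Someone said no: abort, and every node keeps its old value.
        false
    }
}

struct DummyNode {
    name: &'static str,
}

impl ConfigParticipant for DummyNode {
    fn prepare(&self, _key: &str, value: &str) -> bool {
        // Pretend validation: reject empty values.
        !value.is_empty()
    }
    fn commit(&mut self, key: &str, value: &str) {
        println!("{}: applied {key} = {value}", self.name);
    }
}

fn main() {
    let mut nodes: Vec<Box<dyn ConfigParticipant>> = vec![
        Box::new(DummyNode { name: "compute-1" }),
        Box::new(DummyNode { name: "frontend-1" }),
    ];
    assert!(apply_change(&mut nodes, "batch_chunk_size", "1024"));
}
```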
Yep, it's the exact paper that I referred to.

> Maybe we could rely on etcd (with or without abstracting it via a K8s CM), since it offers strong consistency?
🤔 Sounds like a feasible solution.
rfcs/0034-system-parameters.md (outdated)
> Let's rethink the current design of CLI options and config files. ([#5676](https://github.com/risingwavelabs/risingwave/issues/5676)) Given the existence of system parameters, some existing parameters may need to be moved to system parameters, like `data_directory` as mentioned before.
>
> We propose to consider CLI options as a way to override config files (i.e. in terms of priority, CLI options -> config file -> defaults); as a result, the items in CLI options are a subset of the config file. The reasons for this include
Shall we also consider supporting env vars as one of the config sources? In containerized deployments, it is way easier to set an env var than update a mounted file, not to mention that RisingWave is a distributed system that makes config dispatching a mess: it's hard to maintain a lot of local config files.
> [...] not to mention that RisingWave is a distributed system that makes config dispatching a mess: it's hard to maintain a lot of local config files.

IMHO K8s ConfigMaps may help here, since dispatch and updates would happen automatically. Maybe it would be a good alternative to combine CMs with env vars? A local env var could overwrite the CM, and you could thus customise the global configs for every pod.
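A hedged sketch of the proposed precedence (CLI flag > env var > config file > default), assuming the `clap` crate with its `derive` and `env` features enabled; the flag and variable names are examples only, not the actual options:

```rust
use clap::Parser;

/// Sketch of the layering: clap already covers the first two layers
/// (command-line flag, then environment variable) for each option.
#[derive(Parser, Debug)]
struct Opts {
    /// Overrides the value from the config file if given on the command line
    /// or via the RW_LISTEN_ADDR environment variable (names are examples only).
    #[arg(long, env = "RW_LISTEN_ADDR")]
    listen_addr: Option<String>,

    #[arg(long, env = "RW_CONFIG_PATH", default_value = "risingwave.toml")]
    config_path: String,
}

fn main() {
    let opts = Opts::parse();
    // If neither the flag nor the env var is set, fall back to the config file,
    // and finally to a compiled-in default.
    let listen_addr = opts
        .listen_addr
        .unwrap_or_else(|| load_from_config_file(&opts.config_path));
    println!("listen_addr = {listen_addr}");
}

/// Placeholder for reading the value out of the TOML config file.
fn load_from_config_file(_path: &str) -> String {
    "127.0.0.1:5688".to_owned()
}
```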
> ## Discussions
>
> 1. Do we need database parameters, global (instance-wide) parameters, or both?
🤔 Are there real cases, or is this just a proposal? Adding levels will make the configuration complicated. I'd say that instance-wide and object-specific (table, source, mv) parameters should be enough.
In PG there's only one set of variables, and they can be set at different levels (session, role, database, instance).

> object-specific (table, source, mv) parameters

But this seems similar to database parameters. IMHO we didn't consider such things in the RFC? 🤔
> But this seems similar to database parameters. IMHO we didn't consider such things in the RFC? 🤔
You're right. Let's be PG-similar.
Not decided yet. But anyway we can add support for database parameters after the completion of global parameters.
Shall we store the parameters in etcd in a simple, readable manner? Such as plain key-values in human-readable strings (e.g. ...). In this way, the global parameters can be modified more easily, which is one of the benefits of using a config file, while also getting consistency and replication among multiple nodes.
+1
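A minimal sketch of what plain, human-readable key-values in etcd could look like, assuming the `etcd-client` and `tokio` crates and a made-up `system_params/` key prefix:

```rust
use etcd_client::{Client, GetOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Store each parameter as a human-readable key/value pair
    // (the "system_params/" prefix is just an example layout).
    client.put("system_params/barrier_interval_ms", "250", None).await?;
    client.put("system_params/checkpoint_frequency", "10", None).await?;

    // Anyone, including an operator with etcdctl, can read them back directly.
    let resp = client
        .get("system_params/", Some(GetOptions::new().with_prefix()))
        .await?;
    for kv in resp.kvs() {
        println!("{} = {}", kv.key_str()?, kv.value_str()?);
    }
    Ok(())
}
```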
rfcs/0034-system-parameters.md (outdated)
> ### Session Parameters
>
> PostgreSQL allows users to set parameters for the current session as well.
>
> ```sql
> SET parameter TO value / 'value' / DEFAULT;
> ```
>
> References:
>
> - [PostgreSQL: Documentation: 15: 20.1. Setting Parameters](https://www.postgresql.org/docs/current/config-setting.html)
>
> Previously, we had already introduced [session parameters in RisingWave](https://github.com/risingwavelabs/risingwave/blob/53f7e0db772ac7e51773791bb8301624ed763ae8/src/common/src/session_config/mod.rs#L265). Some of them should be system parameters, like `batch_enable_sort_agg` or `query_mode`. Postgres **enforces** that session parameters must also be system parameters; we can follow this rule as well.
Session parameters seem to conflict with the definition of system parameters, since the values can vary from session to session (on different fe nodes). So I think session parameters should be out of scope for this RFC.
Gun9niR
left a comment
🚀
Feedback: this really helps.
And I guess we need this for Ops.
Preview: https://github.com/risingwavelabs/rfcs/blob/eric/system-parameter/rfcs/0034-system-parameters.md
Related risingwavelabs/risingwave#7203