Conversation

@JTischbein
Contributor

Follow-up to PR #18012 (comment).

To enable Direct IO model reading by default on Linux and Windows while keeping --mmap as the default on macOS, this PR adds an additional flag for enabling/disabling Direct IO. The flag defaults to true and takes precedence over the mmap parameter: if --direct-io is set and Direct IO is available, --mmap is disabled; if --no-direct-io is set or Direct IO is not available (e.g. on macOS), the specified mmap value is used.
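
In pseudocode, the resolution works roughly like this (a minimal sketch with illustrative names, not the PR's actual variables):

// Sketch of the flag resolution described above; names are illustrative.
bool resolve_use_mmap(bool direct_io_requested, bool direct_io_available, bool mmap_requested) {
    if (direct_io_requested && direct_io_available) {
        // --direct-io (the default) wins: load via Direct IO, mmap is disabled
        return false;
    }
    // --no-direct-io was passed, or the platform (e.g. macOS) lacks Direct IO:
    // fall back to whatever --mmap / --no-mmap asked for
    return mmap_requested;
}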

@ggerganov
Member

I think you need this:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 1355eea95..2db2115a0 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -918,8 +918,7 @@ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
         GGML_ASSERT(cur->data != nullptr);
         GGML_ASSERT(w.idx < files.size());
         const auto & file = files.at(w.idx);
-        file->seek(w.offs, SEEK_SET);
-        file->read_raw(cur->data, ggml_nbytes(cur));
+        file->read_raw_at(cur->data, ggml_nbytes(cur), w.offs);
     }
 
     if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {

Probably need to assert that llama_file::read_raw is never used with direct io?
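
For example, something along these lines, assuming llama_file keeps track of whether it was opened for direct I/O (a sketch only, not tested):

// hypothetical guard inside llama_file::read_raw(); assumes the object knows
// whether it was opened with O_DIRECT / FILE_FLAG_NO_BUFFERING
void llama_file::read_raw(void * ptr, size_t len) const {
    // direct I/O needs aligned offsets and sizes, so the plain streaming read
    // must never be called on a direct-I/O handle
    GGML_ASSERT(!has_direct_io() && "read_raw must not be used with direct I/O");
    // ... existing buffered implementation ...
}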

@JTischbein
Contributor Author

Thanks for the hint, changed that.

I think an assert would not work, as read_raw is still needed in its current form. Would you suggest renaming read_raw_at to read_raw (using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).
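
Roughly what I have in mind, as an interface sketch (signatures are illustrative only):

struct llama_file {
    // renamed from read_raw_at: reads at the current position via tell()
    // instead of taking an explicit offset, so it can be used safely anywhere
    void read_raw(void * ptr, size_t len) const;

    // the current read_raw, renamed; only called from the tensor loop in
    // load_all_data
    void read_raw_direct(void * ptr, size_t len) const;
};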

@ggerganov
Member

Would you suggest renaming read_raw_at to read_raw (using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).

Ok. Would we even need read_raw_direct in this case? If we still need it for some reason, then maybe call it read_raw_unsafe so as not to overload the word "direct" with more meanings.


Also some suggestions that I have not tested, but should at least convey what I mean:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 2db2115a0..ae0c698be 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -508,8 +508,11 @@ llama_model_loader::llama_model_loader(
     files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
     contexts.emplace_back(ctx);
 
-    // Disable mmap in case Direct I/O is enabled and available
-    if (use_direct_io && files.at(0)->has_direct_io()) {
+    // check if direct io is enabled and supported
+    use_direct_io = use_direct_io && files.back()->has_direct_io();
+
+    if (use_direct_io && use_mmap) {
+        LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
         use_mmap = false;
     }
 
@@ -581,6 +584,10 @@ llama_model_loader::llama_model_loader(
             files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
             contexts.emplace_back(ctx);
 
+            if (use_direct_io && !files.back()->has_direct_io()) {
+                throw std::runtime_error(format("unexpected: direct I/O is not supported for split file %s", fname_split.c_str()));
+            }
+
             // Save tensors data offset info of the shard.
             for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                 std::string tensor_name = std::string(cur->name);
@@ -722,6 +729,7 @@ llama_model_loader::llama_model_loader(
     }
 
     this->use_mmap = use_mmap;
+    this->use_direct_io = use_direct_io;
     this->check_tensors = check_tensors;
     this->no_alloc = no_alloc;
 }
diff --git a/src/llama-model-loader.h b/src/llama-model-loader.h
index de06b5283..6f15115ce 100644
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -70,6 +70,7 @@ struct llama_model_loader {
     size_t   n_bytes    = 0;
 
     bool use_mmap = false;
+    bool use_direct_io = false;
     bool check_tensors;
     bool no_alloc;
 
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index cf0c39475..502859d2e 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2337,7 +2337,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
 
     const bool use_mmap_buffer = true;
 
-    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
+    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
+            __func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");
 
     // build a list of buffer types for the CPU and GPU devices
     pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);

@askmyteapot

Just an FYI: #18012 broke loading with mmap disabled on Windows.

@JTischbein
Contributor Author

@askmyteapot Thank you for the hint. The issue was that I used off_t, which is a signed 32-bit long on Windows. The fix is in this PR.
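
For illustration (not code from the PR), this is the kind of truncation that happens when a 32-bit off_t meets a tensor offset past 2 GiB:

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t offs = 3ull * 1024 * 1024 * 1024; // tensor offset at 3 GiB
    const int32_t narrowed = (int32_t) offs;         // what a 32-bit off_t would hold
    std::printf("offset %llu truncated to %d\n",
                (unsigned long long) offs, narrowed); // prints a negative value
    return 0;
}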

@JTischbein
Contributor Author

@ggerganov Should I separate the Windows fix from these changes into a new PR?

@ggerganov
Member

Yes, I would like to take an extra look at the changes here, so better to fix the Windows issue in a separate PR in the meantime. Thanks

@NeoZhangJianyu
Collaborator

What are the parameters to load the model file in this PR?
Here is my understanding, please correct me if it's wrong:
--no-mmap -ndio
--no-mmap -dio
--mmap

When must a user use -ndio?

I think the parameters are a little complex.
Here is my suggestion:
We should keep only --mmap and --no-mmap.
In the case of --no-mmap, the code should detect and smartly switch between direct I/O (dio) and non-direct I/O (ndio) on Windows/Linux/macOS.
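
For example, something like this (just to illustrate the idea, not tested):

// hypothetical selection with only --mmap / --no-mmap exposed to the user
enum class load_mode { MMAP, DIRECT_IO, BUFFERED };

load_mode pick_load_mode(bool use_mmap, bool direct_io_supported) {
    if (use_mmap) {
        return load_mode::MMAP;
    }
    // with --no-mmap, prefer direct I/O where the platform supports it
    // (Linux/Windows) and fall back to buffered reads otherwise (e.g. macOS)
    return direct_io_supported ? load_mode::DIRECT_IO : load_mode::BUFFERED;
}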

@ehoogeveen-medweb

IIUC, the implementation in this PR currently requires passing --no-direct-io in order to enable mmap, and --no-direct-io --no-mmap to disable both. I think it should also recognize that if a user passes just --mmap, they want to disable Direct IO and use mmap (as the two options are mutually exclusive).

In other words, --mmap should imply --no-direct-io (and --direct-io should imply --no-mmap, although that doesn't matter with the current logic). Aside from that case, I think the logic is reasonable assuming that preferring Direct IO over mmap is the way to go.

@JTischbein
Copy link
Contributor Author

I agree with @ehoogeveen-medweb; explicitly specifying --mmap now disables Direct IO. Handling model loading with only --mmap and --no-mmap does not work with the current implementation, as we need separate code paths for loading via mmap, read() and std::fread().

Now there are three ways (roughly sketched after the list):

  • Default (implicitly -dio --mmap): load via Direct IO and, if it is not available, fall back to mmap
  • Explicitly specifying --mmap: load via mmap
  • Explicitly specifying --no-mmap -ndio: load via std::fread()
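
As a simplified sketch of the resulting dispatch (the actual loader code is more involved; read_raw_at is the name from the diff earlier in the thread):

// simplified sketch of the three load paths; not the exact loader code
void load_tensor(llama_file & file, void * dst, size_t len, size_t offs,
                 bool use_mmap, bool use_direct_io) {
    if (use_mmap) {
        // fallback / explicit --mmap: the tensor data comes from the
        // memory mapping, nothing to read here
        return;
    }
    if (use_direct_io) {
        // default on Linux/Windows: unbuffered positioned read
        file.read_raw_at(dst, len, offs);
        return;
    }
    // --no-mmap -ndio: plain buffered read via std::fread()
    file.seek(offs, SEEK_SET);
    file.read_raw(dst, len);
}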
