-
Notifications
You must be signed in to change notification settings - Fork 43
8385310: [CRaC] CRaC should fail gracefully on a second checkpoint attempt #317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: crac
Are you sure you want to change the base?
Changes from all commits
234d49f
4c236e2
6573d05
097d1dd
985ee64
b6223e4
1c88fd3
4323aac
dd687ef
d4b365b
7f1313a
dc1487e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,54 @@ | ||||||||||||||
| /* | ||||||||||||||
| * Copyright (c) 2026, Azul Systems, Inc. All rights reserved. | ||||||||||||||
| * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. | ||||||||||||||
| * | ||||||||||||||
| * This code is free software; you can redistribute it and/or modify it | ||||||||||||||
| * under the terms of the GNU General Public License version 2 only, as | ||||||||||||||
| * published by the Free Software Foundation. | ||||||||||||||
| * | ||||||||||||||
| * This code is distributed in the hope that it will be useful, but WITHOUT | ||||||||||||||
| * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or | ||||||||||||||
| * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License | ||||||||||||||
| * version 2 for more details (a copy is included in the LICENSE file that | ||||||||||||||
| * accompanied this code). | ||||||||||||||
| * | ||||||||||||||
| * You should have received a copy of the GNU General Public License version | ||||||||||||||
| * 2 along with this work; if not, write to the Free Software Foundation, | ||||||||||||||
| * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. | ||||||||||||||
| * | ||||||||||||||
| * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA | ||||||||||||||
| * or visit www.oracle.com if you need additional information or have any | ||||||||||||||
| * questions. | ||||||||||||||
| */ | ||||||||||||||
| #ifndef CRLIB_CHECKPOINT_AVAILABILITY_H | ||||||||||||||
| #define CRLIB_CHECKPOINT_AVAILABILITY_H | ||||||||||||||
|
|
||||||||||||||
| #include "crlib.h" | ||||||||||||||
|
|
||||||||||||||
| #ifdef __cplusplus | ||||||||||||||
| extern "C" { | ||||||||||||||
| #endif | ||||||||||||||
|
|
||||||||||||||
| #define CRLIB_EXTENSION_CHECKPOINT_AVAILABILITY_NAME "checkpointable data" | ||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||
| #define CRLIB_EXTENSION_CHECKPOINTABLE(api) \ | ||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should be |
||||||||||||||
| CRLIB_EXTENSION(api, crlib_checkpoint_availability_t, CRLIB_EXTENSION_CHECKPOINT_CRLIB_CHECKPOINT_AVAILABILITY_NAME) | ||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A smoke test for the feature would reveal typos like this
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| typedef int crlib_checkpointable_status_t; | ||||||||||||||
| #define CRLIB_CHECKPOINTABLE_NEVER 0 // engine will never accept another checkpoint | ||||||||||||||
| #define CRLIB_CHECKPOINTABLE_NOT_YET 1 // not now; may become ready later | ||||||||||||||
| #define CRLIB_CHECKPOINTABLE_READY 2 // checkpoint can proceed | ||||||||||||||
|
Comment on lines
+37
to
+39
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nitpicks, checkpoint can be accepted to restore from but that is not what was meant here
Suggested change
|
||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
|
Comment on lines
+40
to
+41
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||
| // API for obtaining information about chackpointable status. | ||||||||||||||
| struct crlib_checkpoint_availability { | ||||||||||||||
| crlib_extension_t header; | ||||||||||||||
|
|
||||||||||||||
| crlib_checkpointable_status_t (*get_checkpointable_status)(crlib_conf_t *); | ||||||||||||||
| }; | ||||||||||||||
| typedef const struct crlib_checkpoint_availability crlib_checkpoint_availability_t; | ||||||||||||||
|
|
||||||||||||||
| #ifdef __cplusplus | ||||||||||||||
| } // extern "C | ||||||||||||||
| #endif | ||||||||||||||
|
|
||||||||||||||
| #endif // CRLIB_CHECKPOINT_AVAILABILITY_H | ||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,5 +1,5 @@ | ||||||||||||||||||
| /* | ||||||||||||||||||
| * Copyright (c) 2023, Azul Systems, Inc. All rights reserved. | ||||||||||||||||||
| * Copyright (c) 2023, 2026, Azul Systems, Inc. All rights reserved. | ||||||||||||||||||
| * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. | ||||||||||||||||||
| * | ||||||||||||||||||
| * This code is free software; you can redistribute it and/or modify it | ||||||||||||||||||
|
|
@@ -80,6 +80,24 @@ jlong crac::uptime_since_restore() { | |||||||||||||||||
| return os::javaTimeNanos() - _restore_start_nanos; | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| jint crac::checkpointable_status() { | ||||||||||||||||||
| if (crac::_engine->prepare_checkpoint_availability_api() == CracEngine::ApiStatus::OK) { | ||||||||||||||||||
| crlib_checkpointable_status_t status = crac::_engine->get_checkpointable_status(); | ||||||||||||||||||
| if (status == CRLIB_CHECKPOINTABLE_NEVER) { | ||||||||||||||||||
| if (crac::_generation > 1) { | ||||||||||||||||||
|
Comment on lines
+84
to
+87
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||
| return status; | ||||||||||||||||||
| } else { | ||||||||||||||||||
| // Engine doesn't support checkpoint after restore, | ||||||||||||||||||
| // but it's a first checkpoint. So return READY. | ||||||||||||||||||
| return CRLIB_CHECKPOINTABLE_READY; | ||||||||||||||||||
|
Comment on lines
+90
to
+92
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should not override what the engine reports, it is very unintuitive |
||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
| return status; | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| return CRLIB_CHECKPOINTABLE_READY; | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| void VM_Crac::print_resources(const char* msg, ...) { | ||||||||||||||||||
| if (CRaCPrintResourcesOnCheckpoint) { | ||||||||||||||||||
| va_list ap; | ||||||||||||||||||
|
|
||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| /* | ||
| * Copyright (c) 2023, Azul Systems, Inc. All rights reserved. | ||
| * Copyright (c) 2023, 2026, Azul Systems, Inc. All rights reserved. | ||
| * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. | ||
| * | ||
| * This code is free software; you can redistribute it and/or modify it | ||
|
|
@@ -57,6 +57,7 @@ class crac: AllStatic { | |
|
|
||
| static jlong restore_start_time(); | ||
| static jlong uptime_since_restore(); | ||
| static jint checkpointable_status(); | ||
|
antonvoznia marked this conversation as resolved.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nitpick: since we have |
||
|
|
||
| static jlong monotonic_time_offset() { | ||
| return _javaTimeNanos_offset; | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -38,6 +38,7 @@ | |||||||||||||||||||||||||||
| #include <sys/wait.h> | ||||||||||||||||||||||||||||
| #include <unistd.h> | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| #include "crlib/crlib_checkpoint_availability.h" | ||||||||||||||||||||||||||||
| #include "crlib/crlib_description.h" | ||||||||||||||||||||||||||||
| #include "crlib/crlib_restore_data.h" | ||||||||||||||||||||||||||||
| #include "crlib/crlib_user_data.h" | ||||||||||||||||||||||||||||
|
|
@@ -495,6 +496,18 @@ static char* get_relative_file(const char* rel) { | |||||||||||||||||||||||||||
| return buf; | ||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| static crlib_checkpointable_status_t get_checkpointable_status(crlib_conf_t * conf) { | ||||||||||||||||||||||||||||
| if (!conf->direct_map()) { | ||||||||||||||||||||||||||||
| return CRLIB_CHECKPOINTABLE_READY; | ||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||
|
Comment on lines
+500
to
+502
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Now that I think of it, I do not think this would actually work. On restore there are two instances of the engine: (1) the restoring one and (2) the one being restored. Options are not propagated from (1) to (2) this way. I think you would need to inspect the memory of the process to see if it comes from a mmaped file to make this work.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't do any introspection, that's not in line with what CRIU does. The restoring engine should rather transfer data to checkpointed engine (we do that for env and system properties, though the implementation passes only the PID number and the 'data' is actually transferred at JVM level)
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FYI currently the restoring JVM does not record crac/src/hotspot/share/runtime/arguments.cpp Lines 2698 to 2710 in aabad2d
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Making the engine parse
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I looked at the code, it seems I could use
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
can we extend the purpose of the mechanism? To avoid creating similar things a code generation for no reason
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe Anyway, relaying
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Refreshed my knowledge on user-data. It is used to add arbitrary data to checkpoint image, in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. as @TimPushkin suggested at the first message, does it make sense to look at process memory mapping in
Do you guys think the approach makes sense or not?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I've said before, I don't think that criuengine should do any introspection. If you need to pass data from restoring VM to restored one on JVM level, this would go to Right now the implementation works this way:
The
This approach will require modifications to engines that don't support arbitrary length of restore data, but I believe it is architecturally correct approach. |
||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| // With direct_map, restored memory pages are mmap'd directly from the | ||||||||||||||||||||||||||||
| // checkpoint image file rather than copied into anonymous memory. A second | ||||||||||||||||||||||||||||
| // checkpoint would overwrite that same image, corrupting the live mappings | ||||||||||||||||||||||||||||
| // of the running process. CRIU has no safe way to re-checkpoint in this mode. | ||||||||||||||||||||||||||||
|
Comment on lines
+504
to
+507
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to clarify: you can change |
||||||||||||||||||||||||||||
| return CRLIB_CHECKPOINTABLE_NEVER; | ||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| const char* criuengine::get_criu() { | ||||||||||||||||||||||||||||
| char* criu = _criu_location; | ||||||||||||||||||||||||||||
| if (criu == nullptr) { | ||||||||||||||||||||||||||||
|
|
@@ -974,8 +987,17 @@ static crlib_user_data_t user_data_extension = { | |||||||||||||||||||||||||||
| destroy_user_data, | ||||||||||||||||||||||||||||
| }; | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| static crlib_checkpoint_availability_t checkpoint_availability_extension = { | ||||||||||||||||||||||||||||
| { | ||||||||||||||||||||||||||||
| CRLIB_EXTENSION_CHECKPOINT_AVAILABILITY_NAME, | ||||||||||||||||||||||||||||
| sizeof(checkpoint_availability_extension) | ||||||||||||||||||||||||||||
| }, | ||||||||||||||||||||||||||||
| get_checkpointable_status, | ||||||||||||||||||||||||||||
| }; | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| static const crlib_extension_t* extensions[] = { | ||||||||||||||||||||||||||||
| &restore_data_extension.header, | ||||||||||||||||||||||||||||
| &checkpoint_availability_extension.header, | ||||||||||||||||||||||||||||
| &image_constraints_extension.header, | ||||||||||||||||||||||||||||
|
Comment on lines
+1000
to
1001
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Image constraints are always used on restore (which is expected to happen more often than checkpoint) so would be better to place the new extension beneath it. Same in |
||||||||||||||||||||||||||||
| &image_score_extension.header, | ||||||||||||||||||||||||||||
| &user_data_extension.header, | ||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| /* | ||
| * Copyright (c) 2026, Azul Systems, Inc. All rights reserved. | ||
| * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. | ||
| * | ||
| * This code is free software; you can redistribute it and/or modify it | ||
| * under the terms of the GNU General Public License version 2 only, as | ||
| * published by the Free Software Foundation. Oracle designates this | ||
| * particular file as subject to the "Classpath" exception as provided | ||
| * by Oracle in the LICENSE file that accompanied this code. | ||
| * | ||
| * This code is distributed in the hope that it will be useful, but WITHOUT | ||
| * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or | ||
| * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License | ||
| * version 2 for more details (a copy is included in the LICENSE file that | ||
| * accompanied this code). | ||
| * | ||
| * You should have received a copy of the GNU General Public License version | ||
| * 2 along with this work; if not, write to the Free Software Foundation, | ||
| * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. | ||
| * | ||
| * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA | ||
| * or visit www.oracle.com if you need additional information or have any | ||
| * questions. | ||
| */ | ||
|
|
||
| package jdk.internal.crac.mirror; | ||
|
|
||
| public enum CheckpointableStatus { | ||
| NEVER_AFTER_RESTORE, READY_LATER, READY; | ||
|
|
||
| public static CheckpointableStatus fromCode(int code) { | ||
| CheckpointableStatus[] values = values(); | ||
| if (code < 0 || code >= values.length) { | ||
| throw new IllegalArgumentException("Unknown CheckpointableStatus code: " + code); | ||
| } | ||
| return values[code]; | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| /* | ||
| * Copyright (c) 2017, 2025, Azul Systems, Inc. All rights reserved. | ||
| * Copyright (c) 2017, 2026, Azul Systems, Inc. All rights reserved. | ||
| * Copyright (c) 2021, Oracle and/or its affiliates. All rights reserved. | ||
| * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. | ||
| * | ||
|
|
@@ -57,6 +57,7 @@ public class Core { | |
|
|
||
| private static final long JCMD_STREAM_NULL = 0; | ||
| private static native Object[] checkpointRestore0(int[] fdArr, Object[] objArr, boolean dryRun, long jcmdStream); | ||
| private static native int getCheckpointableStatus0(); | ||
| private static final Object checkpointRestoreLock = new Object(); | ||
| private static boolean checkpointInProgress = false; | ||
|
|
||
|
|
@@ -272,6 +273,26 @@ public static void checkpointRestore() throws | |
| private static void checkpointRestore(long jcmdStream) throws | ||
| CheckpointException, | ||
| RestoreException { | ||
| int status_code = getCheckpointableStatus0(); | ||
| switch (CheckpointableStatus.fromCode(status_code)) { | ||
| case NEVER_AFTER_RESTORE -> { | ||
| CheckpointException ex = new CheckpointException(); | ||
| ex.addSuppressed(new IllegalStateException("Current engine doesn't support second checkpoint after restore.")); | ||
| throw ex; | ||
| } | ||
| case READY_LATER -> { | ||
| CheckpointException ex = new CheckpointException(); | ||
| ex.addSuppressed(new IllegalStateException("CRaC cannot commit checkpoint right now, try later.")); | ||
| throw ex; | ||
| } | ||
| case READY -> { | ||
| // fall through to checkpoint logic below | ||
| } | ||
| default -> { | ||
| System.err.printf("Engine returned unknown checkpointable status code: %d. Proceeding with checkpoint.%n", status_code); | ||
|
TimPushkin marked this conversation as resolved.
|
||
| } | ||
| } | ||
|
Comment on lines
+276
to
+294
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought we wanted to make this a public interface? I'm not sure if we should completely block checkpoint based on the current status since it can change until we actually reach calling the engine: resource processing below can take a long time (resources are free to block the thread).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'm not sure I caught the point.
I could move this verification into synchronized block
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The race is between the Java code and the engine so synchronizing only on Java side would not help. I still think we should leave calling this up to the user, while
|
||
|
|
||
| final List<String> newArguments; | ||
|
|
||
| // checkpointRestoreLock protects against the simultaneous | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me "checkpoint is available" means that there is an existing checkpoint image from which we can restore, while the extension is for checking it is currently possible/permitted to create a checkpoint image.