-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Add a TurnDiffTracker to create a unified diff for an entire turn #1770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
SummaryAdds Notes
ReviewNice feature – delivers running diffs and keeps core compile-clean. A few suggestions:
Otherwise, looks solid – great work! |
@@ -525,6 +527,11 @@ pub struct PatchApplyEndEvent { | |||
pub success: bool, | |||
} | |||
|
|||
#[derive(Debug, Clone, Deserialize, Serialize)] | |||
pub struct TurnDiffEvent { | |||
pub unified_diff: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this would be easier to work with programmatically if this were keyed by path, more like changes
in PatchApplyBeginEvent
. Maybe for a full add or a full delete for an individual file, we still want the unified diff, but it's nice to have added/modified/removed metadata for each path so it's easy to build a compact summary for the diff (maybe with +/- line counts)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What guarantees, if any, can we make about the paths in the unified_diff
: will they all be absolute paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't currently need this because we parse the whole unified diff. Could we add it if/when we need it?
@@ -1471,6 +1528,7 @@ fn maybe_run_with_user_profile(params: ExecParams, sess: &Session) -> ExecParams | |||
async fn handle_container_exec_with_params( | |||
params: ExecParams, | |||
sess: &Session, | |||
turn_diff_tracker: &mut TurnDiffTracker, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what would happen if we wanted to support parallel tool calls at one point. This would be a problem, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you think it would break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because only one tool call could take ownership of TurnDiffTracker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can solve this then. Maybe a Mutex or a channel or something, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though also, if we introduce a struct TurnContext
as mentioned above, that may also force the move to Mutex
. But yes, does not have to be done in this PR.
/// 1. Create a temp directory to store baseline snapshots of files when they are first seen. | ||
/// 2. When a path is first observed, copy its current contents into the baseline dir if it exists on disk. | ||
/// For new additions, do not create a baseline file so that diffs are shown as proper additions (using /dev/null). | ||
/// 3. Keep a stable internal filename (uuid + same extension) per external path for path rewrite in diffs. | ||
/// 4. To compute the aggregated unified diff, compare each baseline snapshot to the current file on disk using | ||
/// `git diff --no-index` and rewrite paths to external paths. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like https://crates.io/crates/similar or something would be a cleaner approach than maintaining a tmp dir; everything could be in-memory and there would be no need to rewrite anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, gave it a shot and I think it's cleaner.
let id = Uuid::new_v4().to_string(); | ||
match path.extension().and_then(|e| e.to_str()) { | ||
Some(ext) if !ext.is_empty() => format!("{id}.{ext}"), | ||
_ => id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ => id, | |
None => id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still unclear why ext
is added.
fn rewrite_diff_paths(&self, diff: &str) -> String { | ||
let mut out = String::new(); | ||
for line in diff.lines() { | ||
if let Some(rest) = line.strip_prefix("diff --git ") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets indented pretty far, so maybe it's worth moving to a helper function that takes (&mut String, temp_name_to_current_external, temp_name_to_baseline_external)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fully removed in the new version
fs::write(&file, "foo\n").unwrap(); | ||
acc.update_and_get_unified_diff().unwrap(); | ||
let first = acc.unified_diff.clone().unwrap(); | ||
assert!(first.contains("+foo")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of contains()
checks, can these all be full assert_eq!()
checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't trivial because it included the TempDir path (I originally tried to do it that way). If I switch to similar
, I can change this though!
|
||
#[cfg(test)] | ||
mod tests { | ||
#![allow(clippy::unwrap_used)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test that paths with spaces work as intended?
Ok(()) | ||
} | ||
|
||
/// Recompute the aggregated unified diff by comparing all baseline snapshots against |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand this comment? I don't have a great mental model of the structure you're trying to set up for the ultimate git diff
call.
I want to understand why this isn't something simpler like diff -u backed-up-file current-file
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version I'm about to push uses similar and does it all in-memory
b366916
to
89fba63
Compare
89fba63
to
7d354d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a lot of comments, but they are generally small, so please fix and merge!
}; | ||
} | ||
|
||
Ok(()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a ?
or a place where the Err
variant is constructed, so does this need to return Result
?
changes, | ||
}), | ||
}) => { | ||
let _ = turn_diff_tracker.on_patch_begin(&changes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this doesn't have to return Result
, then let _
can go away, of course, but depending on what sort of Err
we expect, perhaps we should at least warn!()
or error!()
?
@@ -392,8 +402,10 @@ impl Session { | |||
let _ = self.tx_event.send(event).await; | |||
} | |||
|
|||
async fn notify_exec_command_end( | |||
#[allow(clippy::too_many_arguments)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should maybe introduce a struct in a follow-up PR.
@@ -1163,6 +1193,7 @@ async fn run_task(sess: Arc<Session>, sub_id: String, input: Vec<InputItem>) { | |||
|
|||
async fn run_turn( | |||
sess: &Session, | |||
turn_diff_tracker: &mut TurnDiffTracker, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll probably want a struct TurnContext
or somesuch in the near future.
|
||
const ZERO_OID: &str = "0000000000000000000000000000000000000000"; | ||
const REGULAR_MODE: &str = "100644"; | ||
#[cfg(unix)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though even on Windows, this has to be readable (and preserved?) in a Git tree object, no?
fn symlink_blob_bytes(path: &Path) -> Option<Vec<u8>> { | ||
use std::os::unix::ffi::OsStrExt; | ||
let target = std::fs::read_link(path).ok()?; | ||
Some(target.as_os_str().as_bytes().to_vec()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One interesting operating system fact is that the contents of a symlink do not have to be a path to a file: you can just use it for arbitrary data storage. (As such, I think the max number of bytes you can store in a symlink is PATH_MAX
, though.)
I knew of one project that did this to save a system call because readlink()
is one system call but open()
plus read()
for a regular file is two?
fn git_blob_sha1_hex_bytes(data: &[u8]) -> String { | ||
// Git blob hash is sha1 of: "blob <len>\0<data>" | ||
let header = format!("blob {}\0", data.len()); | ||
use sha1::Digest; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning this type is slightly stronger since you don't have to verify the integrity of the String
contents elsewhere.
let mode = file_mode_for_path(&file).unwrap_or_else(|| REGULAR_MODE.to_string()); | ||
let right_oid = git_blob_sha1_hex("foo\n"); | ||
format!( | ||
"diff --git a/<TMP>/a.txt b/<TMP>/a.txt\nnew file mode {mode}\nindex {ZERO_OID}..{right_oid}\n--- /dev/null\n+++ b/<TMP>/a.txt\n@@ -0,0 +1 @@\n+foo\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r#
for better readability?
// No file existed initially; create only dest | ||
fs::write(&dest, "hello\n").unwrap(); | ||
let diff = acc.get_unified_diff().unwrap().unwrap(); | ||
assert!(diff.contains("new file mode")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't assert_eq!()
here?
This lets us show an accumulating diff across all patches in a turn. Refer to the docs for TurnDiffTracker for implementation details.
There are multiple ways this could have been done and this felt like the right tradeoff between reliability and completeness:
Pros
Cons