-
Notifications
You must be signed in to change notification settings - Fork 238
common: util: stop using binary mode in SetContents #2371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common: util: stop using binary mode in SetContents #2371
Conversation
I'm not really sure this fixes the issue but seems reasonable (?) Let's see if the CI produces binaries so that we can test it on Windows. Open for suggestions if there is a better way :) |
mmh, I rather would like file operations behave correctly (output exactly what we want) and rather do choices of what to output in content we generate. |
I guess another potential fix would be changing Edit: Unless I'm missing something, the "non inplace" path directly outputs to |
You try to fix the case of having CRLF output on Windows, but that is also not necessarily a good default (most developers who have to work on Windows prefer to use LF these days). So I am thinking of it more in terms of configurability. Starting with a flag, probably somewhere around here Then it is possible to fix whatever place outputs newlines and also test on any platform that everything works as it should if forcing the one or other output in unit tests and observe that this is indeed working. |
We probably want to fix that as well to one or the other output once we have the flag. |
Makes sense. I had no idea people worked around this CRLF issue in Windows, thats why I assumed the "fix" would be to stop using binary mode.
Chasing around every
This should be easier to implement |
I suspect there is only really one or two places where newlines are emitted. There are a few reasons why I want the internal representation to be a faithful representation of all the bytes in the input file not line-ending modified. One of the things that Verible does is to keep track of the position in the file by having substrings into that file; if a file with one-line-ending is converted into a content with a different line-ending in memory, then the positions we report are somewhat off to the actual positions in files ( I suspect this is was the issue with #1950 but I couldn't really test on Windows). Once someone implements memory mapping on Windows ( https://github.com/chipsalliance/verible/blob/master/verible/common/util/file-util.cc#L231 ) we will also have the full binary representation in memory on read. On parsing we alerady correctly identify newlines in any shape ( https://github.com/chipsalliance/verible/blob/master/verible/verilog/parser/verilog.lex#L130 is doing that). That is responsible for essentially counting lines and keep track of relative distances from the beginning of the file. When emitting text in formatting, we need to make sure we emit newlines as desired. And that is where I suspect it really is one or two places where that is happening. I suspect there might be a discrepancy if we copy multiple lines verbatim from the input, e.g. some multi-line comment. If the input is CRLF, it will stay CRLF even if we want something else; maybe we already handle this case correctly; this corner case however would be actually benign when on Windows with CRLF output requirement as it does not change things; only if the user actually wants the output to be CRLF -> LF. A new issue just came in #2383 which looks like it is in the same ballpark of problems. Thanks for working on this, Ignacio! |
Thank you for the deep dive! I'll see if I can get away with rubber duck debugging + GitHub CI. If it gets too messy I'll try my luck with Bazel on Windows :) |
c22750b
to
d566ba4
Compare
Sorry for the delay, I had a complicated week 😆. I'm not entirely happy with the result but it might be good enough |
d566ba4
to
2e4d875
Compare
Sorry, also busy week, I hope to get to it to review till Friday. The chipsalliance folks have added a CLA to sign, you probably have to go through that process so that things can be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I think what we need now is a test that creates to input with \n
and with \r\n
line endings, and then we run the formatter with line terminator style CR
and CRLF
and observe that whatever the input, the output corresponds to the chosen form.
(and possibly that inplace and from stdout behaves the same)
Pseudocode
for input_lineend in '\n' '\r\n'` ; do
create_input_with_line_endings($input_lineend) > input.v
for output_lineend in CR CRLF ; do
verible-verilog-format --line_terminator=$output_lineend input.v > out-stream.v
verible-verilog-format --line_terminator=$output_lineend -i input.v
verify_that_line_endings_are($output_lineend, input.v)
# also streamed output and in-place edited input should be the same
cmp input.v out-stream.v
fail_if_$?_of_command_before_is_not_zero
done
done
case LineTerminatorStyle::kCRLF: | ||
return stream << "\r\n"; | ||
} | ||
__builtin_unreachable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, did you have to do that ? The compiler should be able to figure that out by itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all. I just added it out of caution (a new enum case might be defined without updating the print blah blah) I can drop it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modern compilers will automatically warn if there is a new enum that is not handled, so no need (In fact, I always avoid adding a default
case for these enum-switch/case to automatically have the compiler fine them for me).
(I am more worried of __builtin_unreachable()
being very specific to gcc and clang and might stump other compilers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll drop it, no worries. I added it out of habit because where I work people absolutely ignore compiler warnings 😆
@cezchitos it would be great if you could test this :)
I just realized my change breaks comment formatting:
Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this Edit: This issue seems to go away with #2370 (comment) Great news :D |
Thanks for reviewing. I'll work on those tests! |
I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well? |
Thank you very much! Yeah, it seems we need your change. We have two options:
|
I think 2. would be the best. Unfortunately I am not familiar at all with the lexer, I just did the modifications in #2370 (comment) more or less by trial and error. Is the '.lex' actually generated by something else or is that patch enough? |
The patch in #2370 (comment) looks fishy though. It looks like it makes a character class that contains If anything, I'd expected rounded parenthesis around If the patch with the square-bracket works, I'd say it is probably accidental and it should be properly tested and explained by the flex documentation. From how I read it it looks like it should not work. |
I tested: as expected with the square brackets, the lexer unit test will fail (and a bunch of other tests). If I had a guess, it might be probably due to the overly simplistic |
What is the current status here ? |
It seemed a bit too easy to be true. Ok, so let's just forget my lexer change proposal.
Since my lexer 'hack' didn't work I think there still needs to be a change for this particular case (comment formatting problem) before this PR can be merged, otherwise it will be in an (arguably) worse state (for a Windows user point of view) until it is fixed. I am happy to continue testing if there are any changes. |
Sorry about the delay, things got pretty busy for me unexpectedly... I'll try to write those tests we discussed and investigate the issue over the weekend :) Edit: I'm not familiar with diff --git i/verible/verilog/parser/verilog.lex w/verible/verilog/parser/verilog.lex
index f90216cc1955..cd35b7d22d47 100644
--- i/verible/verilog/parser/verilog.lex
+++ w/verible/verilog/parser/verilog.lex
@@ -294,7 +294,10 @@ PragmaEndProtected {Pragma}{Space}+protect{Space}+end_protected
yymore();
}
{LineTerminator} {
- yyless(yyleng-1); /* return \n to input stream */
+ if(yytext[yyleng-2] == '\r')
+ yyless(yyleng-2); /* return \n to input stream */
+ else
+ yyless(yyleng-1); /* return \n to input stream */
UpdateLocation();
yy_pop_state();
return TK_EOL_COMMENT; The error I'm getting seems to happen only inside commented blocks (and only when going from |
Instead of explicitly checking against "-" rely on IsStdin for better readability.
… by verible Newlines `\n` are transformed into `\r\n` by default when running in Windows systems. This is NOT what we want. For more information see 13e0cd0 (If non-mmap: use stdio reading of files instead of std::stream, 2024-10-02) This was already fixed for inplace formatting because the output file is opened in binary mode, but the non-inplace path was left untouched. Set stdout to binary mode too to make the behaviour consistent.
a3d993e
to
57818be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, just a few smallish comments.
I am wondering if this will be working for the 'formatting' feature in the language server as well (around here https://github.com/chipsalliance/verible/blob/master/verible/verilog/tools/ls/verible-lsp-adapter.cc#L277 ).
I don't know if windows-based editors even expects some sort of CR or CRLF in formatted output in the language server or if they auto-convert anyway. But worth checking with the spec and if we emit the right thing there.
return kLineTerminatorStyleStringMap; | ||
} | ||
|
||
std::ostream &operator<<(std::ostream &stream, LineTerminatorStyle style) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmh, I don't know if I like the auto-conversion from the enum to a string.
When I saw the code in comment-controls.cc I was surprised about printing an enum and then found this operator<<
. So it didn't pass the "least surprise" test :)
So this might be too clever and probably should just have regular function. Something like
EmitLineTerminator(LineTerminatorStyle style, std::ostream &stream);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I hadn't heard about the "least surprise" before :)
I still need to keep the operator<<
to make the enum-flags.h
"machinery" happy.
verible/verilog/parser/verilog.lex
Outdated
@@ -294,7 +294,11 @@ PragmaEndProtected {Pragma}{Space}+protect{Space}+end_protected | |||
yymore(); | |||
} | |||
{LineTerminator} { | |||
yyless(yyleng-1); /* return \n to input stream */ | |||
// HACK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an explanation that the next person looking at this can well understand what is going on here ?
We allow \r
-terminated lines in LineTerminator
, looks like it makes sense to see if this is handled correctly in the lexer test (e.g. with two lines, separated by \r\r
). There is a chance that there could be a situation where it is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I'll add a brief comment here.
About the cornercase youre mentioning: I think this is run only if we're in <IN_EOL_COMMENT>
, that's why I figured we were fine doing yyleng-2
.
Also, if we encounter an \r
we'll exit the <IN_EOL_COMMENT>
state, so I don't think we can get to this snippet with \r\r
.
kLF, | ||
// Carriage return + Line Feed `\r\n` (DOS Style) | ||
kCRLF, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow-up, we should consider here some sort of 'auto' setting that then maybe looks at some statistics of the input newline-types it has seen and choose the corresponding output type. Not here, follow-up for later is sufficient.
|
||
for original_newline in LF CRLF; do | ||
if [[ "$original_newline" == "CRLF" ]]; then | ||
unix2dos ${MY_INPUT_FILE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the dependency on unix2dos
and dos2unix
might cause some trouble on various machines that have to install these first to run the test successfully (case in point: the places where you have to add these in the verible-ci.yml
).
Maybe some simple awk '{printf("%s\n", $0);}'
and awk '{printf("%s\r\n", $0);}'
(or something with sed
).
(we already use sed
and awk
in some other tests, so at least it would not widen the dependencies on the system).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I figured out how to skip the dependency. I also thought it was a bit unnecesary but I wanted to ensure my hack worked on the CI as I don't have access to Windows/Mac machines.
300e454
to
c636aa9
Compare
c636aa9
to
0190db8
Compare
Precise control of file contents is required. This makes verible handle files in `binary` mode, effectively disabling the platform-specific hooks that for example translate \n into \r\n for DOS systems. Let's introduce the line_terminator flag into BasicFormatStyle so that users can decide which line terminator they want to use. Current options are \n (LF) or \r\n (CRLF). LF is the default configuration.
DOS newlines (\r\n) leak the \r character into TK_EOL_COMMENT token. If the formatter changes the newline mode from UNIX into DOS, the original and the formatted version will be lexically different, triggering an error inside the VerifyFormatting
0190db8
to
feaaa5a
Compare
Thanks a lot Ignacio. Merged. |
Thank you for having such a great codebase and your reviews! Edit: I meant to edit the PR name/description as my original approach wasnt the final implementation, but I forgot. Hopefully nobody looks at the merge commit itself. Sorry about that! |
Issue #2370 points out that inplace formatting in Windows converts
\r\n
into\n
.This happens because opening files in binary mode explicitly disables the otherwise implicit translation of
\n
into\r\n
for Windows.