[Solved] Convert commonmark links to Headings with spaces to GitHub flavored markdown.
Edit
My question was very badly written, but the new title reflects the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466, @tuna, @learnbyexample) I was able to find a solution for my files, so thank you guys!
For those who randomly come across this post, here are 3 possible ways to achieve the desired result.
#! /bin/bash
files="/home/USER/projects/test.md"
# Collect every "](...)" link target that does not start with https
mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
# Keep only the part from the first # onwards
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
while IFS= read -r line; do
    #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
    dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
    sed -i "s/$line/${dashlink}/" "$files"
    #Puts everything to lowercase after a hashtag
    lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
    sed -i "s/$dashlink/${lowercaselink}/" "$files"
    #Replaces spaces (%20) with - in markdown links after a hashtag
    spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
    sed -i "s/$lowercaselink/${spacelink}/" "$files"
done <<<"$mdlinks2"
sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
I'm in need of some assistance with string manipulation using sed and regex. I spent a whole day on trial and error and searching the web for a solution, but it's way beyond my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand!
From everything I gathered around the web, it seems to be a rather complicated regex and sed substitution. Here we go!
What am I trying to achieve?
I have a lot of markdown guides I want to host on a self-hosted Forgejo instance. However, classic markdown links are not the same as the ones on GitHub/Forgejo...
Convert the following string:
[Some text](#Header%20Linking%20MARKDOWN.md)
Into
[Some text](#header-linking-markdown.md)
As you can see, these are the requirements:
Pattern: [Some text](#link%20to%20header.md)
Only edit what's between parentheses
Replace space (%20) with -
Everything as lowercase
Links are sometimes in nested parentheses
e.g. (look here [Some text](#link%20to%20header.md))
Do not change a line that begins with https (external links)
While everything is probably a bit complex as a whole the trickiest part is probably the nested parentheses :/
What I tried
The furthest I got was the following:
sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase
sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -
These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don't work with nested parentheses. Also, this would change every %20 occurrence in the file.
The closest solution I found on Stack Overflow looks similar, but I wasn't able to make it fit my needs. My lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.
I would appreciate any help, even if a change of tool is needed. However, I'm more interested in the learning process, so a script or CLI alternative is very appreciated :) Actually, any help is appreciated :D
This is more of a general suggestion: if you use Regular Expression, use https://regex101.com. It provides syntax highlighting, explains the syntax and allows you to test your regexes.
Additionally, I think that sd is way more intuitive than sed.
Yeah, bare-bones regex was probably a mistake. However, a friendly user gave me a step-by-step guide on how to achieve my goal; the resulting bash script is the one shown in the edited post at the top of this thread.
If you know a better way to achieve similar results I'm very open for every new lead and learn something new !
Honestly, I'd be looking at doing this in any other language that has a Markdown library to parse these. You're doing this on "hard mode" with sed. There are probably already a ton of Python tools out there that do this.
I have thought of a python script and looked a bit around but couldn't find something satisfactory. Also I'm a tiny bit more versed in bash/CLI than with python... Even though that's very arguable !
I looked through the Github repo and at first glance I have no idea how this could do the job, again I probably have to dig a bit deeper and understand what this is actually doing !
sed ':loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;t loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'
example file
[Some text](#Header%20Linking%20MARKDOWN.md)
(#Should%20stay%20as%20is.md)
Text surrounding [a link](readme.md#Other%20Page). Cool
Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)
Do [NOT](https://example.com/URL%20Should%20Be%20Untouched.html) CHANGE%20 [hyperlinks](http://example.com/No%20Touchy.html)
but it doesn't work if you have a http link and markdown link in the same line, and doesn't work with [escaped \] square brackets](#and-escaped-\)-parenthesis) in the link
Effectively, your regex is very close as a one-liner, I'm pretty impressed! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments...). There are 2 things missing from your beautiful and complex regex:
Numbering with dots also needs a dash in between (actually, I think every special character, like a space or a dot, is converted to a dash)
FROM
---------------
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
TO
---------------
[Link with numbers](readme.md#1-3-this-is-another-test)
The part before the hashtag needs to keep its original form (it links to a real file)
FROM
---------------
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)
TO
---------------
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)
Sorry for the trouble, I wasn't aware of all the GitHub-Flavored Markdown syntax :/ I got a very cool script working perfectly together with another user, but if you want to modify your regex and try to solve the issue in pure regex, feel free :) I'm very curious what it could look like (god, regex is so obscure and at the same time it has some beauty in it!)
I did it!! It also handles the case where an external link and internal link are on the same line :D
sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
Here is my annotated file
# Begin loop
:l;
# Bisect first link in pattern space into pattern space and append to hold space
# Example: `text [label](file#fragment)'
# Pattern space: `file#fragment)'
# Hold space: `text [label]('
# Steps:
# 1. Strategically insert \n
# 1a. If this fails, branch out
# 2. Append to hold space (this creates two \n's. It feels weird for the
# first iteration, but that's ok)
# 3. Copy hold space to pattern space, remove first \n, then trim off
# everything past the second \n
# 4. Swap pattern/hold, and trim off everything up to and incl the last \n
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
Te;
H;
g; s/\n//; s/\n.*//;
x; s/.*\n//;
# Modify only if it is an internal link
/^https?:/! {
# Add hyphens
:h;
s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
th;
# Make lowercase
s/(#[^)]*\))/\L\1/;
};
# "conditional" branch so it checks the next conditional again
tl;
# Exit: join pattern space to hold space, then move to pattern space.
# Since the loop uses H instead of h, have to make sure hold space is empty
:e;
H;
z;
x; s/\n//;
# use a loop to iteratively replace the %20 with -, since doing s/%20/-/g would replace too much. we loop until it cant substitute any more
# label for looping
:loop;
# skip the following substitute command if the line contains an http link in markdown format
/\[[^]]*\](http/!
# capture each part of the link, and join it together with -
s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;
# if the substitution made a change, loop again, otherwise break
t loop;
# convert all insides to the link lowercase if the line doesnt contain an http link
/\[[^]]*\](http/!
# this is outside the loop rather than in the s command above because if the link doesnt contain %20 at all then it won't convert to lowercase
s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g
I've got a sed regex that should work, just writing up a breakdown of the whole command so anyone interested can follow what it does. Will post in a bit.
Okay, here's the command and a breakdown. I broke down every part of the command, not because I think you are dumb, but because reading these can be complicated and confusing. Additionally, detailed breakdowns like these have helped me in the past.
The command:
sed -ri 's|]\(#.+\)|\L&|; s|%20|-|g' /path/to/somefile
The breakdown:
sed - calls sed
-r - allows for the use of extended regular expressions
-i - edit the file given as an argument at the end of the command in place (note, when the flags are combined the i must follow the r: GNU sed reads -ir as -i with a backup suffix of "r", so the extended regular expressions would not be enabled)
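To see why the flag order matters, here is a small sketch that runs both spellings on a throwaway file. It assumes GNU sed; the temporary directory and the file name demo.txt are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU sed. demo.txt is a hypothetical throwaway file.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
printf 'aaa\n' > "$workdir/demo.txt"

# -ir is parsed as -i with the backup suffix "r". The pattern is then a
# basic regex, where + is a literal plus sign, so nothing matches and a
# backup copy named demo.txtr is left next to the file.
sed -ir 's|a+|b|' "$workdir/demo.txt"
after_ir="$(cat "$workdir/demo.txt")"

# -ri turns on extended regex first, then edits in place without a backup.
sed -ri 's|a+|b|' "$workdir/demo.txt"
after_ri="$(cat "$workdir/demo.txt")"

echo "after -ir: $after_ir"
echo "after -ri: $after_ri"
ls "$workdir"
```

Writing the flags separately (-r -i, or -E -i) avoids the ambiguity altogether.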
Now the regex, piece by piece. This command has two substitution expressions, to break the goals down into manageable chunks.
Expression one is to convert the markdown links to lowercase. That expression is:
's|]\(#.+\)|\L&|;
The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don't have to explicitly ignore the https as much as we just have to match all links starting with #. Here's the breakdown:
' - begins the entire expression set. If you had to match the ' character in your expression you would begin the expression set with " instead of '.
s| - invoking find and replace (substitution). Note, I'm using the | as a separator instead of the / for easier readability. In sed, you can use just about any separator you want in your syntax
]\(# - This is how we find the link we want to work on. In markdown, every link is preceded by ]( to indicate a closing of the link text and the opening of the actual url. In the expression, the ( is preceded by a \ because it is a special regex character. So \( tells sed to find an actual opening parenthesis character. Finally the # will be the first character of the markdown links we want to convert to lowercase, as indicated by your example. The inclusion of the # ensures no https links will be caught up in the processing.
.+ - this bit has two parts, . and +. These are two special regex characters. The . tells sed to find any character at all and the + tells it to find the preceding character one or more times. In the case of .+, it's telling sed to find one or more of any characters. You might think this will eat ALL of the text in the document and make it all lowercase, but it will not, because sed works on one line at a time and because of the next part of the regex.
\) - this tells sed to find a closing parenthesis. Like the opening parenthesis, it is a special regex character and needs to be escaped with the backslash to tell sed to find an actual closing parenthesis character. This is what stops the command from converting the entire document to lowercase, because when you combine the previous bit with this bit like so .+\), you're telling sed to find one or more of any character followed by a closing parenthesis. Note that .+ is greedy, so the match runs to the last closing parenthesis on the line, not the first.
| - This tells sed we're done looking for text to match. The next bits are about how to modify/replace that text
\L - This tells sed to convert the given text to all lowercase
& - This is the given text to modify. In this case the & is a special metacharacter that tells sed to modify the entire pattern matched in the matching portion of the expression. So when the & is preceded by the \L, this tells sed: take everything that was matched in the pattern matching expression and convert it to lowercase.
; - this tells sed that this is the end of the first expression, and that more are coming.
So all together, what this first expression does is: Find a closing bracket followed by an opening parentheses followed by a pound/hash symbol followed by one or more of any characters until finding a closing parentheses. Then convert that entire chunk of text to lowercase. Because symbols don't have case you can just convert the entire matched pattern to lowercase. If there were specific parts that had to be kept case sensitive, then you'd have to match and modify more precisely.
The next expression is pretty easy, UNLESS any of your https links also include the string %20:
If no https links contain the %20 string, then this will do the trick:
s|%20|-|g'
s| - again opens the expression, telling sed we're looking to substitute/modify text
%20 - tells sed to find exactly the character sequence %20
| - ends the pattern matching portion of the expression
- - tells sed to replace the matched pattern with the exact character -
| - tells sed that's the end of the modification instructions
g - tells sed to do this globally throughout the document. In other words, to find all occurrences of the string %20 and replace them with the string -
' - tells sed that is the end of the expression(s) to be evaluated.
So all together, what this expression does is: Within the given document, find every occurrence of a percent symbol followed by the number two followed by the number zero and replace them with the dash character.
/path/to/somefile - tells sed what file to work on.
Part of using regex is understanding the contents of your own text, and with the information and examples given, this should work. However, if the markdown links have different formatting patterns, or as mentioned any of the https links have the %20 string in them, or other text in the document might falsely match, then you'd have to provide more information to get a more nuanced regex to match.
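To try the command out without touching a file, here is a small sketch that applies the same two expressions to standard input. It assumes GNU sed (\L is a GNU extension); the function name lower_and_hyphenate and the sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU sed. Same two expressions as the command above, but
# reading stdin instead of editing a file in place.
lower_and_hyphenate() {
  sed -r 's|]\(#.+\)|\L&|; s|%20|-|g'
}

# First line: an internal link that should be converted.
# Second line: an external link, to show the %20 caveat mentioned above.
printf '%s\n' \
  '[Some text](#Header%20Linking%20MARKDOWN.md)' \
  '[NOT](https://example.com/URL%20Untouched.html)' |
  lower_and_hyphenate
```

The first line comes out as [Some text](#header-linking-markdown.md). The second line keeps its case, but its %20 is still turned into a hyphen, which is the limitation described above for https links that contain %20.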
Edit: clarified the use of the & metacharacter.
Edit 2: clarified that the + metacharacter indicates finding the preceding character (or character set) one or more times.
As I see, you've already got an answer on how to convert text to lower case, so I'll just show you how to replace all occurrences of %20 with -. You need to repeat the substitution until no more matches are found. For such iteration you need to branch to a label. Below is a sed script with comments.
:subst # label
s/(\[[^]]+\]\([^)#]*#[^)]*)%20([^)]*\))/\1-\2/ # replace the first occurrence of `%20` in the URL fragment
t subst # go to the `subst` label if the substitution took place
However there are some cases when this script will fail, e. g. if there is an escaped ] character in the link text. You cannot avoid such mistakes using only simple regexps, you need a full featured markdown parser for this.
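The loop above can be wrapped in a small shell function to experiment with. This is only a sketch, assuming GNU sed with -E; the function name hyphenate_fragment is made up, and like the script above it makes no attempt to skip external links or to handle escaped brackets.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU sed. Replaces %20 with - only after the # of a
# markdown link target, one occurrence per pass, until nothing changes.
hyphenate_fragment() {
  sed -E '
    # Label to jump back to.
    :subst
    # Replace the last remaining %20 inside the fragment of the first
    # link that still has one.
    s/(\[[^]]+\]\([^)#]*#[^)]*)%20([^)]*\))/\1-\2/
    # If the substitution changed something, try again.
    t subst
  '
}

printf '%s\n' '[Link](Another%20File.md#1.2%20Just%20a%20test)' |
  hyphenate_fragment
```

The %20 in the file name before the # is left alone because [^)#]* has to run all the way up to the # before the rest of the pattern can match.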
Thank you very much for taking your time and trying to help me with comments and all !
you need a full featured markdown parser for this.
Do you mean something like pandoc? Someone pointed me to it and it seems it can convert to GitHub-Flavored Markdown! Thanks for the pointer, I will give it a try to see how it works out with my actual script :)
Sorry for the very late response! The working bash script another user helped me put together is the one shown in the edited post at the top of this thread.
NB: global substitution s///g is not applicable here because you need to perform new substitutions in a substituted text. Both sed regexp syntaxes (basic and extended) don't support lookarounds that could solve this issue.
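A quick way to see the problem is to run a single pass with the g flag. The sketch below assumes GNU sed; the function name single_pass_g and the sample link are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU sed. One substitution pass with the g flag: each
# match swallows the whole link, so only one %20 per link gets replaced.
single_pass_g() {
  sed -E 's/(\[[^]]+\]\([^)#]*#[^)]*)%20([^)]*\))/\1-\2/g'
}

printf '%s\n' '[Link](#A%20B%20C)' | single_pass_g
```

Only the last %20 is replaced here, because the first capture group greedily takes everything up to it and g resumes searching after the end of the match, where nothing is left to find.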
basically, matching #this%20is%20LIKELY%20a%20link.md
as opposed to matching whole markdown link
lowercasing that entire match,
then on a search matching stuff that looks like that, replace the %20 with a hyphen (combined into a single sed command). this only fails when an http link falls within the same line as a markdown hyperlink
Hello :) Sorry for the late response! I was busy working it out with another user. Out of curiosity I gave your sed regex a try, but there seems to be a missing ( somewhere. I tried to fix the issue, but your regex is way beyond my capabilities. If you are a sed/regex fanatic and want to give it another try, feel free :) Right now I have a solution, found with another user, that works great; if you are interested, it is the bash script shown in the edited post at the top of this thread.
It's not very elegant, but it does the job... While working on it with another very friendly user I came across other things I hadn't thought of, like:
Converting 1.2 to 1-2 (e.g. [Just a placeholder](#1.2%20Just%20a%20link%20to%20header))
Linking to another markdown file (e.g. [Just a placeholder](Another%20File.md#1.2%20Just%20a%20link%20to%20header))
The link to the file before the # needs to keep its original form (e.g. [Just a placeholder](Another%20File.md#1-2-just-a-link-to-header))
Well, I think that bare-bones sed/regex wasn't the right tool, but in a bash script it does exactly what I'm expecting :)
Hello :) Sorry to ping you, I just gave pandoc a try but it doesn't work, and I had to dig a bit further into the web to find out why!
Links to headings with spaces are not specified by CommonMark, and each tool implements a different approach... Most replace spaces with hyphens, others use URL encoding (%20). So even though pandoc looks awesome, it doesn't work for my use case (or did I miss something? Feel free to comment).
[Just a test](#Just a test)
[Just a link](https://mylink/%20with%20space.com)
[External link](Readme.md#JUST%20a%20test)
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test)
Hey I just did a quick web search and found this. I haven't used the tool specifically before. However I recommend either searching the web for a similar tool or using a chatgpt like tool to create a python script that'll achieve your end result. Sed and regex are cool and useful, but they're only going to make it more difficult to achieve what you need.
Thanks for the pointer, I wasn't aware pandoc was able to do that :/ It seems it can convert to GitHub-Flavored Markdown! I have to give it a try :) Still, I learned a lot from another user about regex/sed and Perl :)
Sorry for the late response... I was busy with another user :S My English is so bad I'm not able to respond to everyone at the same time... Whatever...
I tried your Perl regex substitution and it effectively does what I asked for in my post, so thank you very much for your help! However, I missed a few use cases where your regex breaks... But that's on me, your command works as expected!
[Link with numbers](Another%20Markdown%20file.md#1.3%20this%20is%20another%20test.md)
The part before the hashtag needs to keep its original form (even with %20) because it links to a markdown file directly and not a header (hope that's comprehensible?). It took me a lot of time with another user, and we ended up with a wrapped-up script that does everything; it is the bash script shown in the edited post at the top of this thread.
If you are motivated, you can still improve your regex if you want :) I'm kinda curious whether it's possible with a one-liner! Thanks again for your help and sorry for the late response!
This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).