Talk.Require Space After Bullet Proposal

For Creole 0.4 I'd like to bring up the issue of spaces after the bullets. The current (0.3) draft and previous specs have this ugly special case:

About unordered lists and bold: a line starting with ** (including optional whitespace before and afterwards), immediately following an unordered list element a line above, will be treated as a nested unordered list element. Otherwise it will be treated as the beginning of bold text. Also note that bold and/or italics cannot span lines in a list.

I think it's ugly and complicates the parser needlessly. Also, many wikis already have very similar list markup, just without this special case -- making them accept both Creole and native markup at the same time would require some sort of hack (I can't even imagine it currently).

One possible way of getting rid of that special case while still keeping list markup distinguishable from bold markup is to require a space after the bullet.

Now, this is a different case than with space before the bullet. There are wiki engines that don't allow space before the bullet, and those that require it -- making it optional is really the only way to make them agree.

On the other hand, no wiki engine I know prohibits the space after the bullet. Some require it.

Moreover, putting a space after most punctuation characters is a tradition, and for many people -- a reflex. I can see nothing unnatural in requiring it -- and it simplifies the parsers and the specs -- making Creole both easier to implement and to teach.
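
To illustrate how small the space-required rule could be in a parser, here is a minimal Python sketch. The names {{{LIST_ITEM}}} and {{{classify_line}}} are made up for this illustration, and the pattern encodes only the proposal discussed here (optional leading whitespace, one or more {{{*}}} or {{{#}}}, then at least one space), not any published Creole grammar.

```python
import re

# Assumed rule (the proposal under discussion, not spec text):
# optional leading whitespace, a run of '*' or '#', then at least one space.
LIST_ITEM = re.compile(r"^[ \t]*([*#]+)[ \t]+(.*)$")


def classify_line(line):
    """Return (kind, depth, content) for a single line, with no context."""
    m = LIST_ITEM.match(line)
    if m:
        # Depth is simply the number of bullet characters.
        return ("list", len(m.group(1)), m.group(2))
    # Anything else is left for the inline parser (bold, pragmas, etc.).
    return ("text", 0, line)


if __name__ == "__main__":
    for sample in ["** nested item", "**bold line**", "#pragma", "# numbered"]:
        print(repr(sample), "->", classify_line(sample))
```

With this rule, {{{**bold line**}}} and {{{#pragma}}} fall through to ordinary text without the parser having to look at neighbouring lines.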

By the way, there is a (pretty ugly) hack to get a bold line even if the above special case is removed (remove the single space):

{{{
{{{}} }**bold line**
}}}
-- [RadomirDopieralski], 2006-12-14

Why not accept both (asterisks and dashes)? And it goes with the unofficial [Goals] {{{Rule of least surprise}}} and some others...

-- [EricChartre], 2006-12-28

Regarding the possible ambiguity of the asterisks: there is none (for the parser, anyway) if the spec does not allow bold text to span multiple lines and requires that bold text end at some point with {{{**}}}. Also, I __don't__ think that a user would ever, on purpose, do something like:

{{{
** is this text bold
** or are these just two second-level list items
}}}

meaning 

{{{
<strong> is this text bold<br />
</strong> or are these just two second-level list items
}}}


However, the parser must do a look-ahead or a two-level parsing...

-- [EricChartre], 2006-12-28

I don't think there is any ambiguity in the example given above. I believe the asterisks signify strong, as it seems illogical to start a sub-list directly.

And the following would be considered list items.
{{{
* List
** SubItem 1
** SubItem 2
}}}

-- [JaredWilliams], 2006-12-30

Yes, the problem is rather with these examples:
{{{
**foo**bar**baz
**one**two
}}}

They could be parsed as:
----
__foo__bar__baz__
__one__two
----
or
----
** foo__bar__baz
** one__two__
----
or
----
** foo__bar__baz
__one__two
----
You can't really decide without unbounded lookahead -- and that's a serious problem if you need to use an off-the-shelf parsing algorithm or parser framework -- this rules out most of the extensible, plugin-based wiki engines.

You can't just make list or bold the default here -- because there are popular use cases for both:

__Paragraph titles__ are often integrated in the paragraph, like in this example. They are traditionally distinguished by making them bold. Italics won't do.

* multilevel lists
** can contain __bold__ fragments

Really, I think that requiring a space after the list bullets is a simple and effective solution. And it also removes the conflict with {{{#pragma}}} and {{{# numbered list}}} for many wiki engines.

-- RadomirDopieralski, 2006-12-30

I have my parser doing this

{{{
**foo**bar**baz
**one**two
}}}
is
{{{<div><p>
<strong>foo</strong>bar<strong>baz</strong>one<strong>two</strong>
</p></div>}}}
But
{{{
*list
**foo**bar**baz
**one**two
}}}
is
{{{<div><ul><li>list<ul>
   <li>foo<strong>bar</strong>baz</li>
   <li>one<strong>two</strong></li>
</ul></li></ul></div>}}}

Which I think covers it.

-- [JaredWilliams], 2006-12-30

How does it look in regular expressions? Something like:
{{{
(?=\n\s*\*+\s*.*)\n\s*\*+\s*(.*)
}}}
as an additional rule for the lists? Or did you just write your own algorithm and remember the state between the lines?

-- RadomirDopieralski, 2006-12-30

I don't use regular expressions. 

But here is the algorithm in PHP in any case, called when the parser has seen {{{\n[-*#]}}}, with $i holding the position of the {{{[-*#]}}}.

{{{
/*
 * $text is the creole text
 * $i is the current position in $text
 * $l is the strlen($text)
 * $doc is the DOM Document
 * $node is the current position in the DOM Document
 * $listMap = array('-' => 'ul', '*' => 'ul', '#' => 'ol');
 */

// Traverse up the DOM tree, from our current position, looking for open lists.
$lists = array();
for($n = $node; $n; $n = $n->parentNode)
	if ($n->nodeName == 'ol' || $n->nodeName == 'ul')
		array_unshift($lists, $n);

// See how many lists we can match... from the $text 
$j = 0;
while (isset($text[$i + $j], $lists[$j], $listMap[$text[$i + $j]])
		&& $listMap[$text[$i + $j]] == $lists[$j]->nodeName)
	++$j;

// See how many list markers left...
$k = strspn($text, '-#*', $i + $j);
switch ($k)
{
	case 1:
		// Going a level deeper..
		if (isset($lists[$j - 1]))
			$node = $lists[$j - 1]->lastChild;
		else if ($j == 0 && $node->nodeName == 'li')
			$node = $node->parentNode;

		// Create UL or OL...
		$node = $this->insertElement($node, $listMap[$text[$i + $j]]);

		$node = $node->appendChild($doc->createElement('li'));
		$i += $j + $k;
		break;

	case 0:
		// List item of the most recent open list.
		$node = $this->insertElement($lists[$j - 1], 'li');
		$i += $j;
		break;

	default:
		// Horizontal line...
		if (strspn($text, '-', $i) >= 4)
		{
			$this->insertElement($node, 'hr');
			$i += $j + $k;
		}
		break;
}
}}}

So {{{**foo**bar**baz}}} doesn't get recognised as a list, since $k = 2, and is left alone for the inline parser to interpret as <strong>. But with {{{*list\n**foo**bar**baz}}}, $k = 1 for both lines.
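
For readers who don't read PHP, the counting step above can be sketched in Python roughly as follows. This is a simplification under stated assumptions: the DOM walk is replaced by a plain list of the currently open list types (outermost first), and {{{count_markers}}} is a name invented here. Only the computation of $j and $k is reproduced, not the DOM manipulation.

```python
# Same mapping as $listMap in the PHP fragment.
LIST_MAP = {"-": "ul", "*": "ul", "#": "ol"}


def count_markers(line, open_lists):
    """Return (j, k) for a line that starts with list markers.

    open_lists is an assumed stand-in for the DOM walk: the types of the
    currently open lists, outermost first, e.g. ["ul", "ol"].
    """
    # j: how many leading markers match lists that are already open.
    j = 0
    while (j < len(line) and j < len(open_lists)
           and LIST_MAP.get(line[j]) == open_lists[j]):
        j += 1
    # k: how many further marker characters follow those.
    k = 0
    while j + k < len(line) and line[j + k] in "-#*":
        k += 1
    return j, k


if __name__ == "__main__":
    print(count_markers("**foo**bar**baz", []))      # no list open
    print(count_markers("**foo**bar**baz", ["ul"]))  # one list open
```

In this reading, k == 1 opens a list one level deeper, k == 0 adds an item to the innermost matched list, and anything else is left to the inline parser (or treated as a horizontal rule).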

-- [JaredWilliams], 2006-12-30

----

As I've mentioned in [Raph's 0.4 recommendations], I'm in favor of using trailing whitespace to disambiguate second level list bullets from bold. It's simple and easy to understand. I am not in favor of "magic" algorithms to resolve the ambiguity. I think that non-local algorithms are especially undesirable for bullet lists, because they're often rearranged by cutting and pasting. Requiring trailing whitespace is also NotNew.

From what I can tell in the above tangled discussion, it's also Radomir's favored solution. It seems to me we should be able to reach consensus on this issue fairly easily. Am I off base?

-- [RaphLevien], 2007-01-07

Indeed, I was in favor of that solution, but after this discussion I think that both can be considered fairly equivalent. I still slightly prefer the added whitespace -- it has the advantage of being easier to explain, and it also fixes the {{{#pragma}}} conflict in many wikis.

Raph, I'm not really in any way more "core" than you are -- the fact that I dominated RecentChanges recently is a coincidence. On the other hand, I'd really like Creole to be designed in an OpenProcess, while minimising arbitrary decisions and [bikeshedding|http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/misc.html#BIKESHED-PAINTING]. That's why I want every possible difference of opinion discussed, even if it seems a no-brainer.

So then, I have listed some advantages I perceive in requiring these whitespaces. During the discussion, some alternative solutions have been brought up, most of them pretty much acceptable. Now I just miss one thing: ''Is there anything important __against__ the requirement of whitespace after list bullets, other than the desire to have as free and unrestricted format as possible?''

-- RadomirDopieralski, 2007-01-08


I don't see anything against a required space after list bullets __except__ for end-user freedom. Personally, I never put a space after the bullet because that feels like it types faster and allows my thoughts to flow more smoothly. (silly maybe)

I'd like to point out once more that there is no ambiguity between bold and second level bullet items here precisely because of the "ugly special case". Also, it is relatively simple to parse {{{**}}} so that it is always interpreted as bold __except__ when at the beginning of the line __and__ preceded by a first level bullet item. At least if you are parsing using something like flex; I'm not sure about it when using regular expressions though.
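
As a rough illustration of that context-dependent rule, here is a small Python sketch. {{{BULLET}}} and {{{classify_lines}}} are hypothetical names, and the rule is simplified on purpose: a single leading asterisk always starts a list item, while a run of two or more asterisks counts as a list item only when the line directly above was itself a list item.

```python
import re

# Leading whitespace, then a run of asterisks (no space required after them).
BULLET = re.compile(r"^\s*(\*+)")


def classify_lines(lines):
    """Label each line 'list' or 'text' using the previous line as context."""
    kinds = []
    prev_depth = 0  # bullet depth of the previous line, 0 if it was not a list item
    for line in lines:
        m = BULLET.match(line)
        depth = len(m.group(1)) if m else 0
        if depth >= 2 and prev_depth == 0:
            # No list item directly above: leading '**' begins bold text.
            depth = 0
        kinds.append("list" if depth else "text")
        prev_depth = depth
    return kinds


if __name__ == "__main__":
    pair = ["**foo**bar**baz", "**one**two"]
    print(classify_lines(pair))
    print(classify_lines(["*list"] + pair))
```

The same two lines come out as text on their own and as list items when placed under {{{*list}}}, which is the non-local behaviour being debated in this thread.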

-- MartijnVanDerKleijn, 2007-01-11

No problem when parsing using regular expressions, according to my experience.

-- MicheleTomaiuolo, 2007-01-11

Another argument for requiring a space after bullets is that Creole should represent a minimal common set of rules shared by other wiki dialects, which all wiki engines should interpret correctly. Right? So I think requiring a space makes it simpler for engines to handle Creole. The stricter, the better.

If engines relax this constraint, well, it's an extension and it's allowed.

OT now, but this could also apply to titles of subsections, for example. If we say that trailing equal signs are required, it would make it simpler for existing engines to interpret Creole. It would be a single case, and not two.

-- MicheleTomaiuolo, 2007-01-31

This particular version was published on 31-Jan-2007 23:56 by MicheleTomaiuolo.