Better Diffs comparison 2023-06-09


1 Methodology

Wiki pages were selected at random (API:Random), and two revisions of each page were then selected at random. The HTML of the two-column diff between these two revisions was taken from the respective wiki.
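As a rough illustration of the sampling step, the sketch below builds query parameters for the MediaWiki Action API (list=random for API:Random, prop=revisions to list a page's revisions) and picks two distinct revision IDs. It is not the script used for this comparison: the function names, the limits, and the choice to return the pair ordered by revision ID are all assumptions made for the example, and no network request is sent.

```python
import random


def random_page_params(limit=1):
    """Query parameters for action=query&list=random (API:Random)."""
    return {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnlimit": limit,
    }


def revision_list_params(title, limit=50):
    """Query parameters listing revision IDs of one page (prop=revisions)."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids",
        "rvlimit": limit,
    }


def pick_two_revisions(revision_ids, rng=random):
    """Pick two distinct revision IDs at random.

    Returns None when the page has fewer than two revisions.
    Assumption: the pair is returned sorted by ID, lower ID first.
    """
    if len(revision_ids) < 2:
        return None
    first, second = sorted(rng.sample(list(revision_ids), 2))
    return first, second
```

Passing a seeded `random.Random` instance as `rng` makes a sample reproducible, which helps when a particular diff needs to be revisited.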

The wikitext from both revisions is POSTed to https://wikidiff2-demo.wmcloud.org/demo.php four times with different parameters:

  1. changeThreshold = 0.2; maxSplitSize = 3 (default)
  2. changeThreshold = 0.3; maxSplitSize = 3
  3. changeThreshold = 0.2; maxSplitSize = 1
  4. changeThreshold = 0.3; maxSplitSize = 1

This returns the HTML of a two-column diff.
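The four requests can be sketched as follows. The URL and the parameter values come from the list above; the form field names (`text1`, `text2`, and the parameter names used as field names) are assumptions, since the interface of demo.php is not described here. The sketch only builds the requests and does not send them.

```python
from urllib.parse import urlencode
from urllib.request import Request

DEMO_URL = "https://wikidiff2-demo.wmcloud.org/demo.php"

# The four parameter sets from the methodology.
PARAMETER_SETS = {
    1: {"changeThreshold": 0.2, "maxSplitSize": 3},  # default
    2: {"changeThreshold": 0.3, "maxSplitSize": 3},
    3: {"changeThreshold": 0.2, "maxSplitSize": 1},
    4: {"changeThreshold": 0.3, "maxSplitSize": 1},
}


def build_request(old_text, new_text, set_number):
    """Build (but do not send) a POST request for one parameter set."""
    # "text1" and "text2" are hypothetical field names.
    fields = {"text1": old_text, "text2": new_text}
    fields.update(PARAMETER_SETS[set_number])
    body = urlencode(fields).encode("utf-8")
    return Request(DEMO_URL, data=body, method="POST")


def build_all_requests(old_text, new_text):
    """One request per parameter set, keyed by set number."""
    return {n: build_request(old_text, new_text, n) for n in PARAMETER_SETS}
```

Each request could then be sent with `urllib.request.urlopen` and the response body kept as the diff HTML for that parameter set.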

Parameter set 1 is the configuration we are planning to release wikidiff2 with.

Parameter set 2 is included because there was some debate over whether we should change the value of changeThreshold.

Parameter set 3 should, in theory, disable the new paragraph split code. It is included to detect any changes to the diff algorithm that are not due to the way we are now handling paragraph splits.

Parameter set 4 is included for completeness.

The HTML from the wiki is then compared, after some normalisation, with the HTML returned for each of the four parameter sets.
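The comparison step might look like the sketch below. The exact normalisation applied is not specified here, so this example assumes a simple one (collapsing whitespace runs and dropping whitespace between tags); both function names are hypothetical.

```python
import re


def normalise(html):
    """Assumed normalisation: collapse whitespace, drop gaps between tags."""
    text = re.sub(r"\s+", " ", html)
    text = re.sub(r">\s+<", "><", text)
    return text.strip()


def match_pattern(wiki_html, demo_htmls):
    """Return one boolean per parameter set: does it match the wiki HTML?"""
    reference = normalise(wiki_html)
    return tuple(normalise(html) == reference for html in demo_htmls)
```

For one pair of revisions, the result is a tuple of four booleans, one for each parameter set in the order listed above.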

2 Results

The table below shows whether or not the HTML from https://wikidiff2-demo.wmcloud.org/demo.php matches the HTML from the wiki for each of the four parameter sets, and the total number of diffs showing each combination:

Case  #1  #2  #3  #4  Total
1     T   T   T   T    4785
2     T   F   T   F     409
3     F   T   F   T      27
4     F   F   T   T      55
5     F   F   T   F      13
6     F   F   F   T       1
7     F   F   F   F      14

T = match, F = did not match. Columns #1 to #4 correspond to the four parameter sets listed in the Methodology.
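The table can be summarised mechanically. The sketch below encodes the seven cases, sums the totals, counts how many diffs each parameter set matched, and applies a simple heuristic for which parameter appears to influence the result: a parameter is flagged if changing it alone flips the match at least once. That heuristic and all names are my own illustration, not part of the original analysis.

```python
# Case number -> (match pattern for parameter sets 1..4, total)
RESULTS = {
    1: ((True, True, True, True), 4785),
    2: ((True, False, True, False), 409),
    3: ((False, True, False, True), 27),
    4: ((False, False, True, True), 55),
    5: ((False, False, True, False), 13),
    6: ((False, False, False, True), 1),
    7: ((False, False, False, False), 14),
}

# Total number of diffs compared.
total = sum(count for _, count in RESULTS.values())

# Number of diffs matched by each parameter set.
matches_per_set = [
    sum(count for pattern, count in RESULTS.values() if pattern[i])
    for i in range(4)
]


def sensitivity(pattern):
    """Return (threshold_matters, split_size_matters) for one pattern.

    Sets 1/2 and 3/4 differ only in changeThreshold;
    sets 1/3 and 2/4 differ only in maxSplitSize.
    """
    m1, m2, m3, m4 = pattern
    threshold_matters = (m1 != m2) or (m3 != m4)
    split_size_matters = (m1 != m3) or (m2 != m4)
    return threshold_matters, split_size_matters


for case, (pattern, count) in RESULTS.items():
    share = 100 * count / total
    print(case, sensitivity(pattern), count, f"{share:.1f}%")
```

The totals sum to 5304 diffs, of which Case 1 (all four parameter sets match the wiki) accounts for about 90%. Under this heuristic, Cases 2 and 3 are flagged for changeThreshold only, Case 4 for maxSplitSize only, and Cases 5 and 6 for both, which is consistent with the per-case observations in the subsections of this section.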

2.1 Case 2

The most common difference is where changeThreshold = 0.2 matches the current implementation but changeThreshold = 0.3 does not.

There were too many for me to review them all (yet). Of the ones I have reviewed, the majority show changeThreshold = 0.3 producing an arguably better diff, because it does not match up lines which probably should not be matched.

Examples where changeThreshold = 0.3 is better:

Examples where changeThreshold = 0.3 is worse:

Example which is debatable:

2.2 Case 3

These seem to be cases where the value of changeThreshold is affecting the match, rather than the paragraph splitting algorithm.

In the majority of these cases, I would argue that the default parameters are worse than the current implementation, because lines are being matched which probably should not be matched.

Examples where default parameters are worse than current:

Examples which are debatable:

2.3 Case 4

These seem to be cases where paragraphs are being split.

In the majority of cases, the default parameters are better than the current implementation, because they recognise a paragraph split. In these cases the value of changeThreshold does not seem to matter.

Examples where default parameters are better than current:

Examples which are debatable:

2.4 Case 5

These are cases where paragraphs are split but where the value of changeThreshold also matters.

Of the ones I have reviewed, roughly half the time changeThreshold = 0.2 is equivalent to 0.3 and half the time 0.3 is better.

Examples where default parameters are arguably best:

Examples where default and changeThreshold = 0.3 are equivalent and best:

Examples where changeThreshold = 0.3 is best:

2.5 Case 6

There is only one example of this case. Here, with maxSplitSize = 1, the value of changeThreshold seems to matter.

2.6 Case 7

These are cases where the new implementation does not match the current implementation, regardless of the values of the parameters.

These are a mixed bag in terms of whether the new implementation is better or not.

3 Why isn't this counted as a paragraph split?

4 Code

Author: Dominic Walden

Created: 2023-06-09 Fri 13:13
