Yet Another Joomla! Search Result Hijack Dissected: Breakthrough

If you’ve not been following, it’s a great time to catch up on recent developments.  The plot remains the same: more Levitra spam, and a site that is still not completely clean.  To summarize the previous methods of removal:

1. Validate changes to files by date.  Compare changed files against a known-good copy.  Again, the development GitHub repository proves invaluable as a validation resource.

2. Using the compared code, make sense of the attack, and search the entire server for instances of similar malicious code.  This is always a great time to add a few IP deny statements based upon your findings, such as blocks on other countries’ IP space.

3. Armed with dates, dump PHP error logs and IIS logs, and check application and other system-managed logs using automation such as grep on Linux and findstr on Windows (unless you are of the Cygwin persuasion, or use the Windows Subsystem for Linux).  This step alone surfaces errors from lazy, hacked, or bad code, and often points to the right file.

4. Throw the whole codebase at the site.  Run a differential tool; they are quick and effective.  Read logs.  Read more logs, ad infinitum.
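
Steps 1 and 4 boil down to comparing the live site against a known-good copy.  A minimal Python sketch of that comparison follows, assuming you have a clean checkout of the same Joomla! version on disk.  The function names and paths are hypothetical, and any dedicated diff tool does the same job.

```python
import hashlib
from pathlib import Path


def tree_hashes(root):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_trees(site_dir, reference_dir):
    """Return (modified, added) relative paths, judged against the reference tree."""
    site = tree_hashes(site_dir)
    reference = tree_hashes(reference_dir)
    # Present in both trees, but the contents differ.
    modified = sorted(name for name in site if name in reference and site[name] != reference[name])
    # Present on the site only; for core directories that is suspicious too.
    added = sorted(name for name in site if name not in reference)
    return modified, added
```

Calling compare_trees("C:/inetpub/site", "C:/temp/joomla-clean") (both paths are placeholders) returns two sorted lists.  Hashing contents rather than trusting timestamps matters here, because an attacker can reset file dates.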

At this point, no matter the search, I wasn’t finding the base64_decode I knew was in there somewhere.  The payload simply had to be encoded in the files rather than stored in the database: I briefly rolled a test db over the environment, and the spam was still there.  I searched the entire code base for the raw text contained in the site as Google saw it.  I didn’t get the same text in my source, so I knew more of the same was hiding.  All the remaining file dates matched the day of distribution.

Re-examining how base64 works, I soon realized the very text the spammer was using could be used against him.  Rather than search the site in increasingly complex ways, I decided to get stupid simple about it.

Base64 has a property that makes the encoded value of a character depend on its neighbors: every three input bytes are packed into four output characters, so the same text encodes differently depending on where it starts within a three-byte group.  This is why you can’t simply base64_encode your text and get the same value the malicious code has stored in the files.  Or can you?
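
To see the neighbor effect concretely, here is a small Python sketch.  The sample sentence is a stand-in made up for illustration, not the actual payload from the site.

```python
import base64

# Hypothetical sample text, standing in for the spammer's hidden markup.
sample = b"Buy Levitra (Vardenafil) Online"
keyword = b"Vardenafil"

encoded_sample = base64.b64encode(sample).decode("ascii")
# Strip the "=" padding: padding only ever appears at the very end of a blob.
naive_needle = base64.b64encode(keyword).decode("ascii").rstrip("=")

print(encoded_sample)
print(naive_needle)
# The keyword starts at byte 13 of the sample. 13 is not a multiple of 3, so
# its bits land in different output characters than when it is encoded alone.
print(naive_needle in encoded_sample)
```

The last line prints False: the keyword is in the decoded text, but its stand-alone encoding is nowhere in the encoded blob.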

In this instance I could confirm that my own web client got spam-free pages, even though Google was still seeing spam.  If only I could view the site as Google views it.  I can, but that requires effort and spoofing IP addresses, which we WILL go over, but not today, please.  The easy way is to log in to Google Webmaster Tools and ask it to show you the site.

After you log in and select your site, click on site health at the left, and then Fetch as Google.  This is where I could still see large portions of text that were probably encoded contiguously.  Using that assumption, I went back to the drawing board.  My first draft was horribly complex.  I figured I could encode a portion of the text with every possible character on either side.  Then I realized it would take an impractical amount of time to calculate all the permutations for (Vardenafil), the longest contiguous chunk of text you could reasonably expect to be encoded as one string.

If you break it down, the fewest characters you can expect this to work with is 7, and the likelihood of false positives is very high.  If you increase your odds by searching for a longer string, you’ll likely get only true positives.  I chose 12 characters, allowing 4 characters of precision.  You’ll see why:

(Vardenafil)    KFZhcmRlbmFmaWwp
    denafil)    ZGVuYWZpbCk=
   rdenafil     cmRlbmFmaWw=
  ardenafi      YXJkZW5hZmk=
 Vardenaf       VmFyZGVuYWY=
(Vardena        KFZhcmRlbmE=
    denafil)    ZGVuYWZpbCkg    (trailing space in input)
   rdenafil)    cmRlbmFmaWwp
  ardenafil     YXJkZW5hZmls
 Vardenafi      VmFyZGVuYWZp
(Vardenaf       KFZhcmRlbmFm
 (Vardena       IChWYXJkZW5h    (leading space in input)

Encoded and shifted, you get the idea in my head as I initially saw it, but to put it in a more succinct order:

(Vardenafil)    KFZhcmRlbmFmaWwp
(Vardenaf       KFZhcmRlbmFm
(Vardena        KFZhcmRlbmE=
   rdenafil)        cmRlbmFmaWwp
   rdenafil         cmRlbmFmaWw=

 (Vardena       IChWYXJkZW5h        (leading space in input)
  ardenafil         YXJkZW5hZmls
  ardenafi          YXJkZW5hZmk=

 Vardenaf       VmFyZGVuYWY=
 Vardenafi      VmFyZGVuYWZp
    denafil)        ZGVuYWZpbCk=
    denafil)        ZGVuYWZpbCkg    (trailing space in input)

If you look very carefully, depending upon what’s next to dena (the 4 characters of precision I speak of), you see the beginning and ending characters change.  Of the characters that stay put, though, there are only 3 possible output combinations for dena, based on how many characters come before it, or whether it’s at the very beginning.  Note that certain inputs were padded with a space, as I could be reasonably certain a space would sit on either side of the text.  The table is there to give you a visual; the theory works with fewer pieces.  Here’s the actual script I decided to run, and the results:

findstr /S VuYWZpbCk *.*
findstr /S RlbmFmaWw *.*
findstr /S JkZW5hZmk *.*
findstr /S FyZGVuYWY *.*
findstr /S ZhcmRlbmE *.*
findstr /S hcmRlbmFm *.*
findstr /S ZGVuYWZpbCkg *.*
findstr /S cmRlbmFmaWwp *.*
findstr /S YXJkZW5hZmls *.*
findstr /S VmFyZGVuYWZp *.*
findstr /S KFZhcmRlbmFm *.*
findstr /S IChWYXJkZW5h *.*

Run this, and start looking hard at any results.  I just compared the found file to the reference version again, and away the issue went.

findstr /S Vardenafil *.*
findstr /S VuYWZpbCk *.*
findstr /S RlbmFmaWw *.*
libraries\joomla\utilities\utility.php:bnRlbnQgTWFuYWdlbWVudCIgLz4KICA8dGl0bGU+Q
nV5IExldml0cmEgKFZhcmRlbmFmaWwpIE9u
libraries\joomla\utilities\utility.php:IFF1b3RlczwvYT4gZm9yIHNpbGRlbmFmaWwgY2l0c
mF0ZSwgd2hpY2ggRWxlY3Ryb25pYyBjaWdh
findstr /S JkZW5hZmk *.*
findstr /S FyZGVuYWY *.*
findstr /S ZhcmRlbmE *.*
findstr /S hcmRlbmFm *.*
libraries\joomla\utilities\utility.php:bnRlbnQgTWFuYWdlbWVudCIgLz4KICA8dGl0bGU+Q
nV5IExldml0cmEgKFZhcmRlbmFmaWwpIE9u
libraries\joomla\utilities\utility.php:ZXZpdHJhfHZhcmRlbmFmaWx+aScsICRrZXl3b3JkK
SkgKSB7CgkJCQkJaGVhZGVyKCdMb2NhdGlv
findstr /S ZGVuYWZpbCkg *.*
findstr /S cmRlbmFmaWwp *.*
libraries\joomla\utilities\utility.php:bnRlbnQgTWFuYWdlbWVudCIgLz4KICA8dGl0bGU+Q
nV5IExldml0cmEgKFZhcmRlbmFmaWwpIE9u
findstr /S YXJkZW5hZmls *.*
findstr /S VmFyZGVuYWZp *.*
findstr /S KFZhcmRlbmFm *.*
libraries\joomla\utilities\utility.php:bnRlbnQgTWFuYWdlbWVudCIgLz4KICA8dGl0bGU+Q
nV5IExldml0cmEgKFZhcmRlbmFmaWwpIE9u
findstr /S IChWYXJkZW5h *.*

You’ll notice roughly every third search hit, and the hits cover all three subtle variants on a similar theme.  If you line the encodings up as I did in the second table, you can pick out the three subsets that look similar and search for them.  My inventive method paid off, as the site returned clean results when I tested via Google Webmaster Tools.
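
The search strings can also be derived mechanically instead of by eye.  The Python sketch below is one way to do it, not the script I ran: it encodes the keyword at each of the three possible alignments and keeps only the output characters that depend on the keyword alone.  The names base64_needles and find_encoded are made up for this example, and the needles it produces are a little shorter than my hand-picked ones because it refuses to guess at neighboring characters.

```python
import base64
import os


def base64_needles(keyword):
    """Return one neighbor-independent search string per base64 alignment."""
    needles = []
    for offset in range(3):
        # Dummy bytes push the keyword to byte offset 0, 1 or 2 within a group.
        padded = b"\x00" * offset + keyword
        encoded = base64.b64encode(padded).decode("ascii")
        start = (8 * offset + 5) // 6   # skip characters touched by the dummy bytes
        end = (8 * len(padded)) // 6    # stop before characters that need following bytes
        needles.append(encoded[start:end])
    return needles


def find_encoded(root, keyword):
    """Yield (path, needle) for each file under root that contains a needle."""
    needles = [needle.encode("ascii") for needle in base64_needles(keyword)]
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            for needle in needles:
                if needle in data:
                    yield path, needle.decode("ascii")
```

For example, list(find_encoded(".", b"Vardenafil")) scans the current directory tree.  Two caveats: the match is case-sensitive, and a blob that is stored with line breaks in the middle of the keyword will be missed.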

--- Begin Update ---
This has been my most popular blog post, and in honor of its 12-year anniversary, below is an implementation of the sliding window encoder, to allow you to experiment with base64 encoded search on your own.

Sliding Window Encoder
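
For readers who want something to run locally, here is a minimal Python sketch of the same idea.  It is an illustration under my own assumptions (the function name, the fixed window width, ASCII input), and it simply base64-encodes every window of a given width so you can compare the outputs line by line.

```python
import base64


def sliding_window_encode(text, width):
    """Yield (offset, window, encoded) for every width-character window of text."""
    data = text.encode("ascii")
    for offset in range(len(data) - width + 1):
        window = data[offset:offset + width]
        yield offset, window.decode("ascii"), base64.b64encode(window).decode("ascii")


for offset, window, encoded in sliding_window_encode("(Vardenafil)", 8):
    # Indent each window by its offset so the shift is visible.
    print((" " * offset + window).ljust(16) + encoded)
```

Each window is encoded on its own, so short windows end in "=" padding.  Strip the padding and the first and last couple of characters before searching, since those are the ones that change with the neighbors.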

--- End Update ---