Monday 27 December 2021

[re:educate] Introduction to Malicious Document Analysis

This article is part of our #re:educate program, in which we empower cybersecurity students and beginners to share their understanding of anything related to offensive security. For more information, refer to this link: RE:HACK - #re:educate

Author: KH Lai
University: UOW Malaysia KDU

Following are the sections which will be discussed by KH Lai in his article:

  1. Introduction to Malicious Document Analysis

  2. Document Analysis Methodology

  3. Malware Sample

  4. Maldoc Analysis Tools

  5. Maldoc Analysis with VirusTotal

  6. Analyse Maldocs Manually

  7. References

Introduction to Malicious Document Analysis


Malicious documents, or “Maldocs” as the industry calls them, are typically benign-looking documents, most often Microsoft Office files, into which adversaries have embedded malicious code. The aim is to gain access to or manipulate the victim’s system or, in some cases, to trick victims into giving away credentials through phishing attacks. The creator of a Maldoc can embed executables (.exe) or malicious code into Microsoft Office documents such as (.docx), (.docm), (.xls), etc., as well as into other document formats such as (.pdf). Microsoft Office documents have various features that adversaries can abuse to embed malicious code and objects that act as a launcher for the malware, without raising suspicion when the document is presented to a person who is not tech-savvy or security-aware. In the next section, we will go through some of the Microsoft Office document formats which are important in Maldoc analysis.


Macros are sets of commands used to automate tasks within Microsoft Office documents. They are written as Visual Basic for Applications (VBA) sub-procedures that wrap the actual VBA code and execute it. Documents that carry these macros are called macro-enabled documents. According to Microsoft, macros are a powerful way to automate common tasks in Microsoft Office and can make people more productive. As good as that sounds, macros are also a dangerous component to have. Adversaries can either modify the properties of an existing element (e.g. a cell) of an Excel spreadsheet or create a new element (e.g. a button) and set its properties to name the sub-procedure that runs when the user interacts with it. Sub-procedures given one of the special names that start with “Auto” are treated by Microsoft Office as automatic macros. Automatic macros, as the name suggests, run automatically on events such as opening the document, which is why they are dangerous. Only a limited set of “Auto” names can be used, and they are as follows:

Macro Name | Description
AutoExec | This macro executes when the program starts or a global template is being loaded
AutoNew | This macro executes when a new document is created
AutoOpen | This macro executes when an existing document is opened
AutoClose | This macro executes when a document is closed
AutoExit | This macro executes when the program is closed or a global template is unloaded


OLE, short for Object Linking and Embedding, is a mechanism in Microsoft Office that allows users to link or embed different types of objects in one document, for example an Excel sheet in a Word document. These objects are rendered by different components of Microsoft Office but displayed in a single document. Such documents, which consist of different types of objects, are called compound documents. Documents in the OLE format can often carry embedded macros, so an OLE document should be treated as potentially able to execute macros. File extensions with this format are usually (.doc), (.ppt), (.xls), etc. There are 2 versions of OLE, which are OLE1 and OLE2.

OLE1 relies on the Dynamic Data Exchange (DDE) protocol, one of the inter-process communication (IPC) mechanisms supported by Windows. DDE is text-based and works much like the clipboard: it pulls text from other documents and displays it. OLE1 extended the DDE protocol to support more types of objects by using a binary format. The most abused DDE feature is the “DDEAUTO” field, which performs automatic data exchange. Adversaries have leveraged this feature to execute PowerShell scripts from remote sources.

OLE2 works on top of Windows COM (Component Object Model), another IPC mechanism that is still widely used on the latest Windows versions. COM follows a client-server model. The COM server exposes a number of methods/functions that are described by its interface, written in Interface Definition Language (IDL). The COM client locates the correct COM server by its UUID, obtains a pointer to the interface’s virtual table, which holds pointers to the methods, and then calls the target methods through it.


OOXML, which stands for Office Open XML, is a zipped format used for representing Word documents, Excel spreadsheets, PowerPoint presentations, etc. Documents that use this format usually have the extension (.docx), (.xlsx), (.pptx), etc. These files can usually be unzipped, and they do not support macros unless the file extension is (.docm), (.xlsm), (.pptm), etc., which indicates that the file is macro-enabled. Once unzipped, an OOXML file must contain a [Content_Types].xml file at the root of the package. This file lists the content types of all the components in the package. Moreover, every OOXML package also contains relationship parts that define the relationships between the parts of the package and to external resources. This separates the relationships from the content and makes it easy to change relationships without changing the sources that reference the targets. These relationship files typically have the (.rels) extension.
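Because an OOXML file is a ZIP package, its parts can be listed with a few lines of Python. The following is a minimal sketch, not a full parser. It assumes the VBA project is stored in a part named vbaProject.bin (the conventional name), and the demo package it builds is a tiny hypothetical stand-in rather than a valid Word document.

```python
import io
import zipfile

# Conventional name of the part that stores VBA code (assumption for this sketch).
MACRO_PART_SUFFIX = "vbaProject.bin"


def inspect_ooxml(data: bytes) -> dict:
    """Summarise the parts of an OOXML package held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = package.namelist()
    return {
        "has_content_types": "[Content_Types].xml" in names,
        "relationship_parts": [n for n in names if n.endswith(".rels")],
        "has_vba_project": any(n.endswith(MACRO_PART_SUFFIX) for n in names),
    }


def build_demo_package() -> bytes:
    """Build a small stand-in package (hypothetical, not a real document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/vbaProject.bin", b"\x00")
    return buffer.getvalue()


summary = inspect_ooxml(build_demo_package())
print(summary)
```

For a real sample, the bytes would be read from the file on disk inside an isolated analysis machine and passed to inspect_ooxml in the same way.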

There are different types of file formats that Microsoft Office supports. These are some of the notable ones:

Microsoft Word

Extensions | Name of File Format | Description
.doc | Word 97-2003 Document | Binary file format for Word 97-2003
.docx | Word Document | The default XML file format for Word
.docm | Word Macro-Enabled Document | The XML-based and macro-enabled file format for Word. Also stores Visual Basic for Applications (VBA) macro code
.dot | Word 97-2003 Template | Template for Word 97-2003
.dotx | Word Template | Template for creating recent versions of Word. Also does not contain macros
.dotm | Word Macro-Enabled Template | Template for creating recent versions of Word. Contains macros
.xml | Word XML Document | The XML file format supported in Word. Note: The versions 2003 of .xml word files are different from newer versions

Microsoft Excel

Extensions | Name of File Format | Description
.xls | Excel 97-Excel 2003 Workbook | Binary file format for Excel 97-2003
.xlsx | Excel Workbook | The default XML-based file format for Excel. Does not store VBA code and macros sheets
.xlsx | Strict Open XML Spreadsheet | An ISO strict version of the Excel Workbook file format
.xlsm | Excel Macro-Enabled Workbook | XML-based and macro-enabled file format for Excel. Stores VBA code and macro sheets
.csv | CSV (MS-DOS/Macintosh) | Saves a workbook as a comma-delimited text file for use on the MS-DOS/Macintosh operating system
.xml | XML Data | XML Data format. Note: XML spreadsheet 2003 file formats are different from others
.dif | Data Interchange Format | Saves only the active sheet

Microsoft PowerPoint

Extensions | Name of File Format | Description
.ppt | PowerPoint 97-2003 Presentation | The default PowerPoint 97 to Office PowerPoint 2003 format
.pptx | PowerPoint Presentation | The default PowerPoint XML-based file format
.pptx | Strict Open XML Presentation | An ISO strict version of the PowerPoint Presentation file format
.pptm | PowerPoint Macro-Enabled Presentation | A presentation that contains VBA code
.pot | PowerPoint 97-2003 Template | A template for PowerPoint 97 to Office PowerPoint 2003 presentations
.potx | PowerPoint Template | A template for creating new PowerPoint presentations
.potm | PowerPoint Macro-Enabled Template | A template for creating new PowerPoint presentations that contains macros
.pps | PowerPoint 97-2003 Show | Presentation slide show
.ppsx | PowerPoint Show | Presentation slide show
.ppsm | PowerPoint Macro-Enabled Show | Slideshow which contains macros
.xml | PowerPoint XML Presentation | The XML format that is supported in PowerPoint

Document-based Malware Analysis Methodology

The general approach to document analysis is as follows:

  • Analyse documents to find suspicious components. For example, suspicious file extensions, suspicious scripts and functions, or any embedded components.

    • Identify the file format of the malicious document and plan the appropriate tools to use for that case.

  • Locate the embedded components such as macros, scripts, shellcodes, objects, etc.

  • Extract the malicious component/payload from the file under inspection.

    • Oftentimes, malware authors obfuscate the malware/payload to reduce the chance of detection. If so, it is necessary to deobfuscate the payload before examining it.

  • Understand the malware: what it is intended to do and what its next stage is.
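The format identification step in the list above can be sketched by checking the first bytes of the file. This is a rough illustration only: it relies on the well-known signatures of OLE compound files and ZIP archives, and the function name and return strings are made up for this example.

```python
# Signature of an OLE compound file (.doc, .xls, .ppt and similar).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header; OOXML packages start with it.
ZIP_MAGIC = b"PK\x03\x04"


def guess_container(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(OLE_MAGIC):
        return "OLE compound file"
    if header.startswith(ZIP_MAGIC):
        return "ZIP archive (possibly OOXML)"
    return "unknown"


print(guess_container(OLE_MAGIC + b"\x00" * 8))
print(guess_container(b"PK\x03\x04" + b"\x00" * 8))
```

The result only narrows down the choice of tools. A password-protected OOXML file, for instance, is itself wrapped in an OLE container, so the first bytes alone do not settle what is inside.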

Malware Sample

In this article, we will be using two malware samples obtained from two different places. The samples are listed below:

Hash of Maldoc:

MD5 - fb5ed444ddc37d748639f624397cff2a

SHA1 - 3c1a4c0744203d2d08a23f4a9de10a1b593e7763

SHA256 - 0ff0692939044528e396512689cbb6ccee6d4ef14712b27c1efd832a00e24818

Additional Information

Obtained From:

Hash of Maldoc:

MD5 - 26ca42bf5845cefe41189add76de5a3c

SHA1 - 706301fc19042ffcab697775c30fe7dd9db4c5a6

SHA256 - 27a9cc271bc3e1ad5304e6123d8c21078f6bf19824717975288deae932fc76fc

Additional Information

From: Wargames MY 2021
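Before analysing a sample, it is worth confirming that the downloaded file matches the published hashes. The sketch below computes the three digests listed above with Python's hashlib. The helper names are hypothetical, and the file bytes are assumed to have been read inside an isolated analysis machine.

```python
import hashlib


def digests(data: bytes) -> dict:
    """Return the MD5, SHA1 and SHA256 hex digests of the given bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def matches_published(data: bytes, expected_sha256: str) -> bool:
    """Compare the SHA256 of the data with a published value, ignoring case."""
    return digests(data)["sha256"] == expected_sha256.strip().lower()
```

For example, matches_published(sample_bytes, "0ff0692939044528e396512689cbb6ccee6d4ef14712b27c1efd832a00e24818") should return True only if sample_bytes holds the first sample unmodified.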

Maldoc Analysis Tools

There are many useful document analysis tools out there for us to use. Here is a compilation of useful tools.


Maldoc Analysis With VirusTotal

From the downloaded file, we extracted the malware sample and it is a .bin file.

One of the fastest ways to get information about a Maldoc is to upload it to VirusTotal. We will upload our malware sample to VirusTotal to obtain some information.

We can see that VirusTotal identifies the file type of the malware and tells us that it is a Microsoft Excel spreadsheet file. We can also get some other information like connected domains, IP addresses and possible URLs embedded within the maldoc.

To provide another example, we can use a Maldoc obtained from Wargames MY 2021. There was a Maldoc in the form of a Microsoft Word document that contained malware. We can try uploading the file to VirusTotal to see what information we can get from it.

SHA256 Hash of file:

Based on the image above, we can see that VirusTotal detects embedded URLs within the file, which point to a dropper site and a malware executable.
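The article uses the VirusTotal website, but an existing report can also be looked up by hash through the VirusTotal v3 REST API. The following is a minimal sketch under that assumption. It needs a personal API key (not provided here), the function names are hypothetical, and it looks up an existing report instead of uploading the file.

```python
import json
import urllib.request

VT_FILES_ENDPOINT = "https://www.virustotal.com/api/v3/files/"


def build_report_request(file_hash: str, api_key: str) -> urllib.request.Request:
    """Prepare a GET request for the report of a file identified by its hash."""
    return urllib.request.Request(
        VT_FILES_ENDPOINT + file_hash,
        headers={"x-apikey": api_key},
    )


def fetch_report(file_hash: str, api_key: str) -> dict:
    """Send the request and decode the JSON report (requires network access)."""
    request = build_report_request(file_hash, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling fetch_report with the SHA256 of a sample and a valid key returns the report as a dictionary. Looking up by hash also avoids sharing the sample itself with a third party.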

Analyse Maldocs Manually

Another way to analyse malicious documents is to do it manually using open source tools. Instead of uploading to a free malware analysis website such as VirusTotal, we can obtain even more details about the malware by analysing it manually. All the following tools and commands were run on Parrot OS, a Linux-based operating system.

Step 1: Analysing Maldoc for suspicious components

We will be analysing the Maldoc which can be obtained here.

Hash of Maldoc:

MD5 - fb5ed444ddc37d748639f624397cff2a

First and foremost, let’s start by using the file command to check what type of file we are looking at.

The output states that it is a Microsoft Excel spreadsheet. Since VirusTotal flagged the file as malicious when we uploaded it earlier, it is likely to contain some form of macro malware. The first thing that we can do is use tools to inspect macros hidden within the file.

In this demonstration, we will be using oletools which can be obtained from here. oletools is a package of Python tools used to analyse Microsoft OLE files. The package consists of many different Python tools that can be used for different purposes. We will first use mraptor, which ships with oletools. mraptor, or “MacroRaptor”, is a Python script that parses OLE and XML files to detect malicious macros. Let’s start with mraptor because it produces results quickly.
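To give a feel for how a macro triage tool of this kind reaches a verdict, here is a much-simplified sketch of the idea behind mraptor: a macro is flagged when it runs automatically (A) and also writes to disk (W) or executes something (X). The keyword lists below are a small illustrative subset chosen for this example and are not mraptor's actual patterns.

```python
import re

# Illustrative keyword subsets only; the real tool uses far more patterns.
AUTO_EXEC = re.compile(
    r"\b(?:Auto_?(?:Open|Close|Exec|New|Exit)|Document_Open|Workbook_Open)\b",
    re.IGNORECASE,
)
WRITE = re.compile(r"\b(?:URLDownloadToFile|FileCopy|SaveToFile)\b", re.IGNORECASE)
EXECUTE = re.compile(r"\b(?:Shell|ShellExecute|CreateObject)\b", re.IGNORECASE)


def triage(vba_source: str) -> tuple:
    """Return a flag string such as 'A-X' and whether the macro looks suspicious."""
    checks = (("A", AUTO_EXEC), ("W", WRITE), ("X", EXECUTE))
    flags = "".join(
        letter if pattern.search(vba_source) else "-" for letter, pattern in checks
    )
    suspicious = flags[0] == "A" and (flags[1] == "W" or flags[2] == "X")
    return flags, suspicious


sample_macro = 'Sub AutoOpen()\n    Shell "calc.exe"\nEnd Sub'
print(triage(sample_macro))
```

A heuristic like this only sees VBA source text, so a file that hides its logic elsewhere can come back clean even when it is malicious.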

It found 0 macros, so mraptor does not give us enough to go on here. We can try another tool, olevba, to get more information. olevba is another Python script that parses OLE and XML files to extract Microsoft Office macros and display them in cleartext. It is also capable of deobfuscating and analysing malicious macros if present. We will start by using olevba with the -a argument to display only the analysis results.

The output lists a few things, so let’s go through each one. AutoExec marks a macro that executes automatically when the file is opened. This is very dangerous because unaware people may open a file that is sent to them without knowing that it will automatically run an executable file, commonly (.exe) malware, that may attempt to gain access to the victim’s machine. As seen from the screenshot above, the tool detected some form of executable file embedded within the document. It also detected hex-encoded strings, which tells us that there may be some obfuscated content inside, and it found malicious macros. All of the detected components are flagged as suspicious.
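Hex-encoded strings are a common, simple form of obfuscation. The sketch below shows one way such strings could be spotted and decoded. The regular expression and the printable-ASCII filter are assumptions made for this illustration, not olevba's actual logic, and the sample line is made up.

```python
import re

# Runs of at least four hex byte pairs, bounded by non-word characters.
HEX_RUN = re.compile(r"\b(?:[0-9A-Fa-f]{2}){4,}\b")


def decode_hex_runs(text: str) -> list:
    """Decode hex runs that turn into printable ASCII; ignore the rest."""
    decoded = []
    for match in HEX_RUN.finditer(text):
        raw = bytes.fromhex(match.group())
        if all(32 <= byte < 127 for byte in raw):
            decoded.append(raw.decode("ascii"))
    return decoded


# Hypothetical macro line, not taken from the sample.
print(decode_hex_runs('payload = "636d642e657865"'))
```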

Step 2: Locate suspicious components

Let’s dig deeper and look into the file for macros. By using the following command with the -c argument, we can output the list of all available macros inside the file.

We found a total of 9 sheets in this Excel file: 6 of them are macro sheets and are hidden, and 3 are visible to the user. The first 2 sheets look particularly suspicious.


  • SOCWNEScLLxkLhtJ

  • OHqYbvYcqmWjJJjs

Many functions embedded within the file link to the first sheet, SOCWNEScLLxkLhtJ, and all of the output seems obfuscated. Let’s look back at the metadata shown for the malware sample, which can be obtained by using exiftool (available here). exiftool is a script that runs a set of Perl modules that allow us to read/write metadata from a given file. The metadata tells us that the file is password protected, which means that it is encrypted, so we have to find a way to decrypt it. The author’s post mentions something called “VelvetSweatshop”. We will explain and find out what exactly it is.

VelvetSweatshop is a default encryption password used by Microsoft Excel spreadsheets. So why is this a threat? Under normal circumstances, when a sender wants to protect an Excel file, the file is encrypted symmetrically with a password, and the receiver must enter that password to open it. In a typical scenario, a malicious actor sends an encrypted file to a victim and includes the password in the phishing email. The victim uses the provided password to open the file, and once the file is open, embedded components such as macros and scripts run, compromising the victim’s machine. Attackers send encrypted files in order to evade typical detection systems.

This is where the VelvetSweatshop default password comes into play. When an encrypted Excel file is opened, Microsoft Excel first tries the default password to decrypt it. If the file was encrypted with the default password, Excel opens it without generating any warnings or dialogs. If not, Excel prompts the user for a password. This is an issue because an attacker who encrypts the file with the default password gets a file that Excel decrypts and opens automatically, running any malicious macros and scripts embedded within it, without the victim ever being asked for a password.
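The behaviour described above can be modelled as a short decision flow. This is only a toy model of the described logic, not Excel's actual code. The two callables are hypothetical stand-ins: one reports whether a password decrypts the file, the other represents the password dialog.

```python
from typing import Callable

DEFAULT_PASSWORD = "VelvetSweatshop"


def open_encrypted_workbook(
    decrypts_with: Callable[[str], bool],
    prompt_user: Callable[[], str],
) -> str:
    """Model the order in which passwords are tried when a workbook is opened."""
    # The default password is tried first, silently.
    if decrypts_with(DEFAULT_PASSWORD):
        return "opened without prompt"
    # Only if that fails is the user asked for a password.
    supplied = prompt_user()
    return "opened after prompt" if decrypts_with(supplied) else "refused"


# A file encrypted with the default password never triggers the prompt.
print(open_encrypted_workbook(lambda pw: pw == DEFAULT_PASSWORD, lambda: "unused"))
```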

Additional Note:

There is another variant of this default password technique which affects PowerPoint files. According to Binary Document Write Protection Method 3 in the Microsoft documentation, binary documents in the (.ppt) file format must store write protection passwords in the file in plaintext. When a binary document uses this method, it should NOT be encrypted with the following encryption:

  • ECMA-376 encryption

  • CryptoAPI RC4 encryption

  • RC4 encryption

  • XOR obfuscation

When the document is encrypted but the user did not supply an encryption password, the default password must be the following:


Note that this hexadecimal value, which is equal to the ASCII string /01Hannes Ruescher/01, was not tested at the time of writing. It may or may not work as stated in the Microsoft documentation.
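The relationship between an ASCII string and its hexadecimal form can be checked with a few lines of Python. This sketch simply encodes the string quoted above and converts it back. Like the note above, it says nothing about whether the password works in practice.

```python
ascii_password = "/01Hannes Ruescher/01"

# Encode each character as its ASCII byte and show the bytes as hex pairs.
encoded = ascii_password.encode("ascii")
hex_form = " ".join(f"{byte:02x}" for byte in encoded)
print(hex_form)

# Converting the hex pairs back yields the original string.
round_trip = bytes.fromhex(hex_form).decode("ascii")
print(round_trip)
```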

Step 3: Extract Malicious Components

Let’s decrypt the file using the default password to see what can be obtained.
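One possible way to perform the decryption from Python is the third-party msoffcrypto-tool package. This is an assumption made for illustration, since the article does not name that package, and it has to be installed separately. The function name and the paths passed to it are hypothetical.

```python
DEFAULT_PASSWORD = "VelvetSweatshop"


def decrypt_with_default(src_path: str, dst_path: str) -> None:
    """Write a decrypted copy of src_path to dst_path using the default password."""
    # Third-party dependency (assumed installed: pip install msoffcrypto-tool).
    import msoffcrypto

    with open(src_path, "rb") as encrypted, open(dst_path, "wb") as decrypted:
        office_file = msoffcrypto.OfficeFile(encrypted)
        office_file.load_key(password=DEFAULT_PASSWORD)
        office_file.decrypt(decrypted)
```

The decrypted copy can then be passed to the same analysis tools as before. As with every other step, this should be run only inside an isolated analysis machine.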

Based on the output, we can see many suspicious components within the maldoc.

Now we have even more information about the maldoc. There are many components here. First of all, the tool detected multiple functions such as ShellExecute and Shell32, which can execute a malicious payload. There is also URLDownloadToFile, which can download files from a given link. Most importantly, it found 5 IOCs. IOC, which stands for Indicator of Compromise, is a term used in Digital Forensics and Incident Response (DFIR) for evidence of the activity of a specific threat, in this case malware, and it can be used further for threat attribution. Getting back to the output, it found some URLs, which we can assume are where the malware is hosted. It also found two executables.
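IOCs such as URLs and executable names can be pulled out of decoded macro text with simple pattern matching. The following is a rough sketch with deliberately loose regular expressions. The sample text uses a made-up placeholder URL and path, not the indicators found in the real sample.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
EXE_PATTERN = re.compile(r"\b[\w.-]+\.exe\b", re.IGNORECASE)


def extract_iocs(text: str) -> dict:
    """Collect unique URLs and executable file names found in the text."""
    urls = sorted(set(URL_PATTERN.findall(text)))
    executables = sorted({name.lower() for name in EXE_PATTERN.findall(text)})
    return {"urls": urls, "executables": executables}


# Hypothetical decoded macro text used only for demonstration.
sample_output = (
    'URLDownloadToFile 0, "http://example.invalid/files/payload.exe", '
    '"C:\\Temp\\payload.exe"'
)
print(extract_iocs(sample_output))
```

Indicators collected this way should be treated as leads to verify, since loose patterns also match harmless strings.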

Step 4: Understanding The Malware

The main part of the malware is shown above. It is intended to work as follows: when the Excel spreadsheet is opened, it creates a directory, downloads an executable file (.exe) from a URL, and then runs that file using a shell.