Dissertation | PhD Thesis

Reverse Engineering of Unknown Binary Network Protocols

Protocol reverse engineering (PRE) is the process of inferring the message syntax, message type, and grammar of an unknown network protocol, forwhich no specification is available. The understanding of these aspects of a protocol can be achieved by Dynamic Entity Analysis (DEA), Dynamic Traffic Analysis (DTA), and Static Traffic Analysis (STA). DEA observes the dynamic program flow analysis of the communicating software. DTA actively probes the input of communicating software entities and interprets their responses. STA is solely based on observing traffic traces of a communication using that protocol.
We survey related work that proposes solutions for the three types of PRE. By reimplementing existing approaches, we confirm that DEA is tedious and error prone. DTA is rarely discussed due to its complex challenges, which we discuss. STA, finally follows a sequence of common tasks as we show by deriving a common process from previous work. Human-readable, textual protocols received considerable attention while tightly packed binary protocols are only rudimentarily covered by previous work. Likewise, the inference of a protocol’s grammar is decently solved already. Therefore, our scope is the automation of the analysis of the message syntax, their content data type and the message types of proprietary binary network protocols. In our model, the process to reverse engineer a protocol specification from traces consists of preparation, data collection and preprocessing, feature extraction, message format inference, message type identification, semantic deduction, behavior model reconstruction, and processing of results.
Our key contributions are that (1) we propose NemeSYS, an analytical method to efficiently determine field boundaries of messages for feature extraction. Its key task, segmentation, is based on finding common properties of byte sub-sequences, so-called segments, of each individual message. NemePCA refines this segmentation to infer the message formats. We devised a novel measure to determine the correctness of a message format: the Format Match Score (FMS). (2) Furthermore, we introduce the concept of continuous segment similarity measured by the novel Canberra-Ulm dissimilarity. It provides a means of comparing arbitrary segments to each other, independent of their length. Using this segment similarity, we propose two solutions: (3) NemeFTR, an algorithm for the semantic deduction of field data according to their segment similarity and statistical properties, and (4)NemeTYL, our approach for message type identification by which we achieve a robust clustering of whole messages. For this, we combine the continuous segment similarity measure with sequence alignment. The classes of messages relate to specific functions in the protocol state machine and thus our output can be used as input for existing grammar inference methods.
Our proposed methods promote the ability to infer a protocol specification, without access to the program flow or source code of an entity. The knowledge gained from these analysis steps can be used for network analysis and security-relevant tasks, like vulnerability testing by fuzzing, the setup of honeypots, analyzing botnets, and automated network modeling.