Static Analysis of String Encoders and Decoders
Presented by Loris D. Antoni, this study delves into the complexities of string encoders and decoders, highlighting their importance, challenges faced, and the need for static analysis to ensure correctness. The research covers topics such as BASE64 encoding, equivalence checking, function composition, and code analysis. Through detailed examination and verification, the aim is to enhance the understanding and application of these fundamental components in text processing.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Static Analysis of String Encoders and Decoders Presented By: Loris D Antoni Joint work with: Margus Veanes
Motivation String Encoders and Decoders are Ubiquitous: transformation from Unicode text files in the Internet to in-memory representation of text Hard to write: they use unintuitive logic in order to enable efficiency Hard to verify: big state space, alphabets are very big (216elements). Previous techniques blow up for small decoders. 2
A simple example: BASE64 encoder Text content M a n Bytes 77 97 110 Bit Pattern 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0 Index 19 22 5 46 Base64 Encoded T W F u 3 Bytes 4 Base64 characters Decoder similar (every 4 encodes 3) Uses bit manipulations to be efficient How do we model it and prove it correct? 3
What Properties do we check? Encoder, Decoder denoted by E,D E o D = I D o E = I dom(E) = bytes dom(D) = Base64 bytes We need Equivalence checking Function Composition (our model should be closed under composition) 4
Bek code program base64encode(input){ return iter(x in input)[q:=0;r:=0;]{ case (x>0xFF): raise InvalidCharacter; case (q==0): yield (base64(x>>2)); q:=1; r:=(x&3)<<4; case (q==1): yield (base64(r|(x>>4))); q:=2; r:=(x&0xF)<<2; case (q==2): yield (base64((r|(x>>6))), base64(x&0x3F)); q:=0; r:=0; end case (q==1): yield (base64(r),'=','='); end case (q==2): yield (base64(r),'='); }; } How do we analyze this code? 5
Trust me! It is tricky! [12/12/12 11:35:49 PM] Margus Veanes: I think it is doable, smth that is like ([A-Z2-7]{4}... )* [12/12/12 11:35:57 PM] Loris D'Antoni: ok ill try [12/12/12 11:36:22 PM] Margus Veanes: then you can ry to see the difference compared to the domain of the decoder [12/12/12 11:37:42 PM] Loris D'Antoni: it seems that also on this counterex it doesn't work [12/12/12 11:37:43 PM] Loris D'Antoni: DP2A==== [12/12/12 11:37:50 PM] Loris D'Antoni: which maybe it's a bad one in this sense [12/12/12 11:37:52 PM] Loris D'Antoni: ill check now [12/12/12 11:40:45 PM] Margus Veanes: actually the domain of the decoder looks wrong, it allows 8 and 9 [12/12/12 11:40:46 PM] Margus Veanes: http://www.rise4fun.com/Bek/Cy3 [12/12/12 11:40:58 PM] Loris D'Antoni: yeh i fixed that in my version COUPLE OF HACKS LATER [12/13/12 12:24:02 AM] Loris D'Antoni: ok, found bug and fixed it, now proved them correct. Will work on others tomorrow. Was very silly but hard to spot [12/13/12 12:24:35 AM] Margus Veanes: ... this is why the analysis we can do is useful :-) [12/13/12 12:24:45 AM] Loris D'Antoni: yeh i was mapping [12/13/12 12:24:46 AM] Loris D'Antoni: 26..31 ==> 2..7 [12/13/12 12:24:58 AM] Loris D'Antoni: instead of 26..31 ==> '2'..'7' Brief DEMO 6
Attempt 1: Finite Transducers Decidable equivalence and closure under composition 28 edges out of every state and 216 states n / [TWFu] a / [] M / [] M Ma .. .. .. Finite set of states Each transition reads an input symbol and outputs a sequence of symbols Mapping from strings into strings Blue state (final), for which the mapping is defined 7
Attempt 2: Symbolic Finite Transducers [POPL12] x. x== a / [ x. x>>4, ] x. x== M / [ x. x>>2] M Ma Supports symbolic updates such as bit-vectors .. .. Guards are predicates over any decidable theory instead of single characters Output is a function of the input In this case uses theory of bit-vectors Better reflects implementation operations Analysis is still decidable (equivalence, composition) We did not improve much: still state explosion 8
Attempt 3: Symbolic Transducers [POPL12] True / [r|(x>>6), x&0x3F], r := 0 Registers Registers True / [x>>2], r := (x&3)<<4 True / [r|(x>>4)], r := (x&0xF)<<2 0 1 2 Register can store values and is updated in transitions Inputs and outputs can inspect and use register value Logic is the same as for implementation!! No state explosion!! Closed under sequential composition Analysis (equivalence) is undecidable in general We need a way to eliminate the registers 9
Register Elimination: the nave way x / [r|(x>>6), x&0x3F], r := 0 ST x / [x>>2], r := (x&3)<<4 x / [r|(x>>4)], r := (x&0xF)<<2 0 1 2 Via enumeration: State Explosion, but automatic Can do analysis, but very slow SFT n / [(((((M&3)<<4)|(a>>4))&0xF)<<2)|(n>>6), n&0x3F] Doesn t work if alphabet infinite: waste of Symbolic analysis We need a Better model a / [((M&3)<<4)|(a>>4)] M / [M>>2] M Ma .. .. 10
A simple example: BASE64 Text content M a n Byte 77 97 110 Bit Pattern 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 0 Index 19 22 5 46 Base64 Encoded T W F u 3 Bytes 4 Base64 characters Decoder similar (every 4 encodes 3) Uses bit manipulations to be efficient How do we model it and prove it correct? 11
Read sequences of symbols Extended Symbolic Finite Transducers [x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F] Output is a function of all the 3 symbols 0 No state explosion Analysis can be done for several interesting cases (in particular for encoders) But, how do we pass from STs to ESFTs? 12
Register Elimination: the good way 1/2 x / [r|(x>>6), x&0x3F], r := 0 ST x / [x>>2], r := (x&3)<<4 x / [r|(x>>4)], r := (x&0xF)<<2 0 1 2 2 ESFT [x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r:=0 0 1 x / [x>>2], r := (x&3)<<4 13
Register Elimination: the good way 2/2 [x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r:=0 0 1 1 x / [x>>2], r := (x&3)<<4 Fast and supports infinite alphabets [x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F] Not always possible, but works for encoders/decoders 0 14
Composition of ESFTs Uses ST closure under composition Use of registers to remember values Register elimination ESFT E ST E ST E oD ESFT EoD ESFT D ST D Not closed in general 15
Equivalence Semi-Decision Procedure First we check equivalence on domain intersection (hard) (x).True/ [x] ( (x1,x2).True)/[x1,x2] 0 1 We build a product transducer (x1,x2).True/([x1,x2],[x1,x2]) 0,1 then we check domain equivalence (easier in this case). 16
Unicode Case Study We analyzed UTF8 to UTF16 encoder (E) and decoder (D) Test Running Time Dom(E) = UTF16 47 ms Dom(EoD) = UTF16 109 ms Dom(D) = UTF8 156 ms Dom(DoE) = UTF8 320 ms EoD=Identity (naive) Complete analysis in less than a second 82,000 ms DoE=Identity (naive) 134,000 ms EoD=Identity (new algorithm) 123 ms DoE=Identity (new algorithm) 215 ms 17
Result Summary ESFTs a new transducer model for representing encoders and decoders A new register elimination algorithm from ST to ESFTs, independent from input alphabet Correctness analysis of real programs: Unicode, Base64 encoders and decoders Automatic Javascript code generationof the verified code Check it out http://rise4fun.com/Bek/ Transducers are cool!! 18
Future Work Understand the theory of ESFTs (coming soon) Composition closure, equivalence Extend the model to tree transformations Widely used in NLP Analyze more complex scenarios List manipulating programs 19
Thank you Loris D Antoni lorisdan@cis.upenn.edu Questions? 20