Cortex A8 Pipeline Architecture

The Cortex A8 processor core by ARM Holdings features a 14-stage integer pipeline and a 10-stage NEON pipeline. Its deeper pipeline splits the classic fetch, decode, and execute stages into sub-stages, with separate execute stages for shift, ALU and flags, saturation, and writeback. It is also superscalar: two instructions can execute at the same time.

Uploaded by hsop on Mar 06, 2025

Presentation Transcript

ARM Cortex A8 Pipeline EE126 Wei Wang

Cortex A8 is a processor core designed by ARM Holdings. Applications: Apple A4, Samsung Exynos 3110. What's the pipeline architecture in Cortex A8? A deeper pipeline and a superscalar pipeline.

Deeper Pipeline
Why does it break one cycle into several cycles?
[Figure: the classic IF, ID, EXE stages are each split into sub-stages: IF into F0 F1 F2, ID into D0 D1 D2 D3 D4, EXE into E0 E1 E2 E3 E4 E5.]
In a pipeline, the clock speed is limited by the longest stage: the cycle time is set to the delay of that stage. In the deeper pipeline, each new sub-stage takes less time, so the cycle time is shorter. The shorter cycle lets the clock run faster, so more instructions complete per unit of time, even though a single instruction still passes through every sub-stage.
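The cycle-time argument above can be checked with a small calculation. This is only a sketch: the stage delays are made-up numbers, the function name `pipeline_time` is hypothetical, and the model ignores stalls and the latch overhead that every extra stage adds in real hardware.

```python
def pipeline_time(stage_delays_ns, n_instructions):
    """Total time (ns) for n instructions on an ideal pipeline with no stalls."""
    cycle = max(stage_delays_ns)            # clock period is set by the slowest stage
    n_stages = len(stage_delays_ns)
    cycles = n_stages + n_instructions - 1  # fill the pipe, then one result per cycle
    return cycle * cycles

# Assumed delays for a 3-stage pipeline (IF, ID, EXE), 14 ns of work in total.
shallow = [3.0, 5.0, 6.0]
# The same 14 ns of work cut into 14 equal sub-stages of 1 ns each.
deep = [sum(shallow) / 14] * 14

print(pipeline_time(shallow, 1000))  # 6 ns cycle * 1002 cycles
print(pipeline_time(deep, 1000))     # 1 ns cycle * 1013 cycles
```

Under these assumptions the deep pipeline finishes 1000 instructions roughly six times sooner, almost entirely because of the shorter cycle.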

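Superscalar Pipeline
It is a form of instruction-level parallelism: more than one instruction is issued per cycle, which makes it faster than a normal pipeline.
[Figure: a simple 4-stage pipeline (IF, ID, EX, WB) next to a superscalar pipeline in which instructions 0 to 9 flow through in pairs.]
Two instructions are executed at the same time.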
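As a rough illustration of the gain, the sketch below counts cycles for the ten instructions (0 to 9) in the figure, once with one instruction issued per cycle and once with two. It assumes every instruction is independent and nothing stalls; `cycles_needed` is a hypothetical helper, not part of any tool.

```python
import math

def cycles_needed(n_instructions, n_stages=4, issue_width=1):
    """Cycles to finish n independent instructions on an ideal pipeline."""
    issue_groups = math.ceil(n_instructions / issue_width)  # instructions enter in groups
    return n_stages + issue_groups - 1                      # last group still needs n_stages cycles

print(cycles_needed(10, issue_width=1))  # scalar 4-stage pipeline
print(cycles_needed(10, issue_width=2))  # two instructions per cycle
```

With these assumptions the scalar pipeline needs 13 cycles and the dual-issue pipeline needs 8.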

Cortex A8 Pipeline
Main architecture: 14-stage integer pipeline, 10-stage NEON pipeline.
[Figure: pipeline block diagram.
Integer pipeline stages: Instruction Fetch (F0 F1 F2), Instruction Decode (D0 D1 D2 D3 D4), Instruction Execute and Load/Store (E0 E1 E2 E3 E4 E5), ending in integer register writeback. Blocks: Architecture Register File, ALU Pipeline 0, MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0/1.
NEON pipeline stages: M0 M1 M2 M3 N1 N2 N3 N4 N5 N6, ending in NEON register writeback. Blocks: NEON Instruction Decode, NEON Register File, Integer ALU Pipeline, Integer MUL Pipeline, Integer Shift Pipeline, Non-IEEE FP ADD Pipeline, Non-IEEE FP MUL Pipeline, IEEE FP Engine, Load/Store Permute Pipeline, Load/Store Data Queue, NEON Store Data.]

Execution stages: 6-stage pipeline (E0 E1 E2 E3 E4 E5).
Two symmetric ALU pipelines, a multiply pipeline, and an address generator for load and store.
[Figure: Instruction Execute and Load/Store block. The Architecture Register File feeds two ALU pipelines (Shift, ALU + Flags, Sat, BP Update, WB), the multiply pipeline (MUL 1, MUL 2, MUL 3, ACC, WB), and the load/store pipeline (AGU, Load/Store Pipeline, WB), ending in integer register writeback.]
1. For the ALU pipeline:
E0 access register file;
E1 shift if needed;
E2 ALU function;
E3 complete saturation if needed;
E4 change in control flow;
E5 write back to register file.
2. For the MUL pipeline:
E1-E3 implement multiply;
E4 perform addition (accumulate);
E5 write back.
The pipeline has extensive support for key forwarding paths. Result data is forwarded from the outputs of the shift, ALU, and MUL stages as soon as it is produced, so intermediate execution-stage results can be forwarded. This is unlike a simple pipeline, where only the final execution-stage result can be forwarded.
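Why forwarding from intermediate stages matters can be shown with a toy timing model. The sketch below is an assumption-laden simplification, not the Cortex A8's actual forwarding logic: it supposes a result can be used in the cycle after the stage that produces it, and that a dependent ALU instruction needs its operand at the start of E2. The name `stall_cycles` and its parameters are hypothetical.

```python
STAGES = ["E0", "E1", "E2", "E3", "E4", "E5"]

def stall_cycles(produce_stage, need_stage, issue_gap=1):
    """Bubbles a dependent instruction must wait in this toy model.

    produce_stage: stage at whose end the producer's result can be forwarded
    need_stage:    stage at whose start the consumer needs the operand
    issue_gap:     cycles between issuing the producer and the consumer
    """
    available = STAGES.index(produce_stage) + 1    # first cycle the value can be used
    needed = issue_gap + STAGES.index(need_stage)  # cycle in which the consumer wants it
    return max(0, available - needed)

# ALU result forwarded straight from the ALU stage (E2) to the next instruction's E2.
print(stall_cycles("E2", "E2"))
# Same pair of instructions if only the final stage (E5) could forward.
print(stall_cycles("E5", "E2"))
```

In this model, intermediate forwarding lets back-to-back dependent ALU instructions run with no bubbles, while final-stage-only forwarding costs three.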

Deep pipelines and superscalar pipelines give good performance. Why not keep increasing the number of sub-stages and parallel instructions? What are the limitations?

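Data Dependency
No data dependency: MUL t3,t2,t1; ADD t6,t5,t4; The two instructions can execute at the same time.
Data dependency: MUL t3,t2,t1; ADD t6,t3,t4; The ADD reads t3, which the MUL writes, so bubbles are inserted in front of the ADD.
Solution: Stall the adder until the multiplier has finished.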
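The stall rule can be sketched as a tiny in-order issue model. Everything here is illustrative: the latencies in `LATENCY` are assumed values rather than Cortex A8 figures, and `issue_cycles` is a hypothetical helper that issues at most two instructions per cycle and holds an instruction back until its source registers are ready.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dst src1 src2")

LATENCY = {"MUL": 3, "ADD": 1}  # assumed result latencies in cycles

def issue_cycles(program, latency=LATENCY, width=2):
    """Return the cycle in which each instruction issues (in order, with stalls)."""
    ready = {}           # register -> first cycle its new value can be read
    cycle, used = 0, 0   # current issue cycle and slots already used in it
    issued = []
    for ins in program:
        if used == width:                    # no issue slot left this cycle
            cycle, used = cycle + 1, 0
        earliest = max(ready.get(ins.src1, 0), ready.get(ins.src2, 0))
        if earliest > cycle:                 # source not ready: insert bubbles
            cycle, used = earliest, 0
        issued.append(cycle)
        used += 1
        ready[ins.dst] = cycle + latency[ins.op]
    return issued

independent = [Instr("MUL", "t3", "t2", "t1"), Instr("ADD", "t6", "t5", "t4")]
dependent = [Instr("MUL", "t3", "t2", "t1"), Instr("ADD", "t6", "t3", "t4")]

print(issue_cycles(independent))  # both issue in cycle 0
print(issue_cycles(dependent))    # the ADD waits for t3
```

With the assumed 3-cycle multiply, the dependent ADD issues in cycle 3 instead of cycle 0.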

Output dependency: MUL t3,t2,t1; ADD t3,t4,t5; An output dependency occurs if two instructions executing in parallel write to the same location. An error occurs if the second instruction completes its write before the first one, because the location then ends up holding the first instruction's result.

Antidependency: MUL t3,t2,t1; ADD t2,t4,t5; An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs.
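The kinds of dependency discussed in these slides can be told apart mechanically by comparing destination and source registers. The sketch below does that for a pair of instructions written as (op, dst, src1, src2) tuples. The function name `classify` and the labels RAW, WAW, and WAR (the common shorthand for true data, output, and anti dependencies) are choices made for this example.

```python
def classify(first, second):
    """Return the dependencies of `second` on the earlier instruction `first`."""
    _, dst1, *srcs1 = first
    _, dst2, *srcs2 = second
    kinds = set()
    if dst1 in srcs2:
        kinds.add("RAW")  # true data dependency: second reads what first writes
    if dst1 == dst2:
        kinds.add("WAW")  # output dependency: both write the same register
    if dst2 in srcs1:
        kinds.add("WAR")  # antidependency: second overwrites an operand of first
    return kinds

mul = ("MUL", "t3", "t2", "t1")
print(classify(mul, ("ADD", "t6", "t3", "t4")))  # {'RAW'}
print(classify(mul, ("ADD", "t3", "t4", "t5")))  # {'WAW'}
print(classify(mul, ("ADD", "t2", "t4", "t5")))  # {'WAR'}
print(classify(mul, ("ADD", "t6", "t5", "t4")))  # set()
```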

Solution for the output dependency and antidependency: use another register.
Output dependency: MUL t3,t2,t1; ADD t3,t4,t5; becomes MUL t3,t2,t1; ADD t6,t4,t5;
Antidependency: MUL t3,t2,t1; ADD t2,t4,t5; becomes MUL t3,t2,t1; ADD t6,t4,t5;
Alternative way to handle dependencies: the compiler can generate instruction sequences with fewer dependencies.
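"Use another register" can be automated by giving every written value a fresh register name. The following is a minimal sketch of that idea under stated assumptions: the `p0`, `p1`, ... names and the `rename` function are invented for illustration, and nothing here claims to describe what the Cortex A8 or any particular compiler actually does.

```python
from itertools import count

def rename(program):
    """Give each destination a fresh register; later reads follow the new name."""
    mapping = {}                           # original register -> latest fresh name
    fresh = (f"p{i}" for i in count())     # hypothetical pool of unused registers
    renamed = []
    for op, dst, src1, src2 in program:
        s1 = mapping.get(src1, src1)       # read the most recent version of each source
        s2 = mapping.get(src2, src2)
        new_dst = next(fresh)
        mapping[dst] = new_dst
        renamed.append((op, new_dst, s1, s2))
    return renamed

output_dep = [("MUL", "t3", "t2", "t1"), ("ADD", "t3", "t4", "t5")]
anti_dep = [("MUL", "t3", "t2", "t1"), ("ADD", "t2", "t4", "t5")]
true_dep = [("MUL", "t3", "t2", "t1"), ("ADD", "t6", "t3", "t4")]

print(rename(output_dep))  # the two writes now target different registers
print(rename(anti_dep))    # the ADD no longer overwrites the MUL's operand
print(rename(true_dep))    # the ADD still reads the MUL's result
```

Renaming removes the output dependency and the antidependency, but the true data dependency remains, so that case still needs a stall or forwarding.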