Monday, May 25, 2020

SVD file optimization


Optimizing SVD files

SVD files, what are they?


This webpage has a good overview, after all, they penned the format:

https://www.keil.com/pack/doc/CMSIS/SVD/html/index.html

Basically, it's an XML file that describes an SoC from peripherals to registers to individual fields. It is typically sucked in by a debugger to give a meaningful view into an SoC. It has also been used as an input to some very useful tools, such as...

svd2ada

When I started looking at Ada for embedded ARM hacking some years back, AdaCore had an early library, Ada_Drivers_Library. It had drivers for all the peripherals in some STM32F4 series parts. How it did this was interesting: underneath the drivers was a description of the HW in a bunch of .ads files that were... automatically generated. AdaCore had written a tool, svd2ada, that would parse the SVD file provided by the vendor, in this case STMicroelectronics, and produce a detailed specification of each peripheral in the part along with record types for each register and its enclosed fields. This was quite eye-opening to me, having waded through many vendors' C .h files festooned with #define masks that describe the encoding of fields in regs in a flat, non-hierarchical fashion.

Example

Here is a peripheral from the STM32L562, the on-the-fly decryption engine. We see the name of the peripheral, a description and the all-important base address:

<peripheral>
      <name>OTFDEC1</name>
      <description>On-The-Fly Decryption engine</description>
      <groupName>OTFDEC</groupName>
      <baseAddress>0x420C5000</baseAddress>

Let's look at a register from this peripheral:
        <register>
          <name>R1CFGR</name>
          <displayName>R1CFGR</displayName>
          <description>OTFDEC region x configuration
          register</description>
          <addressOffset>0x20</addressOffset>
          <size>0x20</size>

and another:
        <register>
          <name>R2CFGR</name>
          <displayName>R2CFGR</displayName>
          <description>OTFDEC region x configuration
          register</description>
          <addressOffset>0x50</addressOffset>
          <size>0x20</size>

See a pattern? Each register name has a number embedded in it, and the offset rises by 0x30 bytes from one to the next (0x20, then 0x50), far more than the register's 32-bit (4-byte) size... we will look at that observation later.

Now traditionally, svd2ada would process this exactly as it appears in the XML, and stm32_svd-otfdec.ads would look like this:

      R1CFGR      at 16#20# range 0 .. 31;
      R1STARTADDR at 16#24# range 0 .. 31;
      R1ENDADDR   at 16#28# range 0 .. 31;
      R1NONCER0   at 16#2C# range 0 .. 31;
      R1NONCER1   at 16#30# range 0 .. 31;
      R1KEYR0     at 16#34# range 0 .. 31;
      R1KEYR1     at 16#38# range 0 .. 31;
      R1KEYR2     at 16#3C# range 0 .. 31;
      R1KEYR3     at 16#40# range 0 .. 31;

...
      R4CFGR      at 16#B0# range 0 .. 31;
      R4STARTADDR at 16#B4# range 0 .. 31;
      R4ENDADDR   at 16#B8# range 0 .. 31;
      R4NONCER0   at 16#BC# range 0 .. 31;
      R4NONCER1   at 16#C0# range 0 .. 31;
      R4KEYR0     at 16#C4# range 0 .. 31;
      R4KEYR1     at 16#C8# range 0 .. 31;
      R4KEYR2     at 16#CC# range 0 .. 31;
      R4KEYR3     at 16#D0# range 0 .. 31;

Now, I think we can see there is a lot of commonality in those reg groups. So the question becomes: is there a more compact way to describe the layout of these repetitive groupings?

Well, it turns out, there is. SVD files have some other nomenclature that permits descriptions of this type of repetitive grouping. The terms SVD uses are cluster and dim (or <cluster> and <dim> in XML). These terms allow specification of such groups in a form that is indexable by software. Fortunately for us, svd2ada already supports <cluster> and <dim>. Super news, if only there was a way to automatically emit these compressive constructs so we don't have to sift through 500k XML files performing hand edits.

svdopt.rb

If there was a tool that could parse the SVD file, identify those groupings and rewrite the SVD file with those changes, then the indexable records would take a more compact form, which should reduce the amount of code needed in an Ada driver to work with the peripheral. Taking the example above, if left unchanged, you would have code to handle R1CFGR... R2... R3... R4 where, realistically, RxCFGR would do if you had an array of records. So a tool was crafted. It accepts an SVD file as input and produces an SVD file as output. It tries to be automatic in its processing, but we will get to special cases later.

For now let's look at the definition for R1CFGR which we saw above. The cluster below describes all the regs in the group. Observe that R is taken as the cluster name, as all the RxY regs use R as the lead-in to the element. The <cluster> has a dim of 4, which matches the HW description. There is a new field I added, <dimOffset>, that shows 1. Given that SVD files are C oriented, they assume arrays begin at 0. Well, Ada doesn't have to do that, and neither do the reference manual and the vendor SVD. They start this register group at 1, so... we have a syntax to allow that also. Observe also that this <cluster> has arrays embedded inside each element: the RxKEYRy and RxNONCERy values. The tool correctly identifies these from the description and emits embedded <dim> elements accordingly.

<cluster>
  <dim>4</dim>
  <dimIncrement>0x30</dimIncrement>
  <dimOffset>1</dimOffset>
  <name>R[%s]</name>
  <addressOffset>0x20</addressOffset>
  <register>
    <name>CFGR</name>
    <displayName>CFGR</displayName>
    <description>OTFDEC region x configuration register</description>
    <addressOffset>0x0</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      ...
    </fields>
  </register>
  <register>
    <name>STARTADDR</name>
    <displayName>STARTADDR</displayName>
    <description>OTFDEC region x start address register</description>
    <addressOffset>0x4</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
    </fields>
  </register>
  <register>
    <name>ENDADDR</name>
    <displayName>ENDADDR</displayName>
    <description>OTFDEC region x end address register</description>
    <addressOffset>0x8</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      <field>
        <name>REGx_END_ADDR</name>
        <description>Region AXI end address</description>
        <bitOffset>0</bitOffset>
        <bitWidth>32</bitWidth>
      </field>
    </fields>
  </register>
  <register>
    <dim>2</dim>
    <dimIncrement>4</dimIncrement>
    <dimOffset>0</dimOffset>
    <name>NONCER[%s]</name>
    <addressOffset>0xc</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      <field>
        <name>REGx_NONCE</name>
        <description>REGx_NONCE</description>
        <bitOffset>0</bitOffset>
        <bitWidth>32</bitWidth>
      </field>
    </fields>
  </register>
  <register>
    <dim>4</dim>
    <dimIncrement>4</dimIncrement>
    <dimOffset>0</dimOffset>
    <name>KEYR[%s]</name>
    <addressOffset>0x14</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      <field>
        <name>REGx_KEY</name>
        <description>REGx_KEY</description>
        <bitOffset>0</bitOffset>
        <bitWidth>32</bitWidth>
      </field>
    </fields>
  </register>
</cluster>
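
How might a tool arrive at the <dim>, <dimIncrement> and <dimOffset> values above? Here is a minimal Ruby sketch of one way to detect such a group. It is not svdopt.rb's actual code: the function name detect_cluster and the [name, offset] input format are made up for illustration, and it only handles the easy case where every name has the shape prefix, index, remainder (as in R1CFGR).

```ruby
# Hypothetical sketch, not the real svdopt.rb logic.
# regs: array of [name, byte_offset] pairs taken from one peripheral.
def detect_cluster(regs)
  by_index = Hash.new { |h, k| h[k] = [] }
  prefix = nil
  regs.each do |name, offset|
    m = name.match(/\A([A-Za-z]+)(\d+)([A-Za-z].*)\z/)
    next unless m
    prefix ||= m[1]               # first matching prefix wins
    next unless m[1] == prefix
    by_index[m[2].to_i] << [m[3], offset]
  end

  indexes = by_index.keys.sort
  return nil if indexes.size < 2
  return nil unless indexes == (indexes.first..indexes.last).to_a

  # Lowest offset of each instance, and the stride between instances.
  bases = indexes.map { |i| by_index[i].map { |_, off| off }.min }
  strides = bases.each_cons(2).map { |a, b| b - a }.uniq
  return nil unless strides.size == 1

  # Every instance must hold the same members at the same relative offsets.
  layouts = indexes.zip(bases).map do |i, base|
    by_index[i].map { |member, off| [member, off - base] }.sort
  end
  return nil unless layouts.uniq.size == 1

  { name: "#{prefix}[%s]", dim: indexes.size, dim_increment: strides.first,
    dim_offset: indexes.first, address_offset: bases.first,
    members: layouts.first }
end

# First three registers of each OTFDEC region, offsets as in the .ads listing.
regs = (1..4).flat_map do |x|
  base = 0x20 + 0x30 * (x - 1)
  [["R#{x}CFGR", base], ["R#{x}STARTADDR", base + 4], ["R#{x}ENDADDR", base + 8]]
end
p detect_cluster(regs)
```

For this input the sketch reports a dim of 4, an increment of 0x30, a starting index of 1 and a base offset of 0x20, the same values as the <cluster> header above. Anything that does not fit the simple shape returns nil, which is where special-case handling would have to take over.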

svd2ada result

Well, what do we get from the above description? Do <cluster> and <dim> improve svd2ada .ads generation?

   type OTFDEC_Peripheral is record
...
      R   : aliased R_Clusters;

   for R_Cluster use record
      CFGR      at 16#0# range 0 .. 31;
      STARTADDR at 16#4# range 0 .. 31;
      ENDADDR   at 16#8# range 0 .. 31;
      NONCER    at 16#C# range 0 .. 63;
      KEYR      at 16#14# range 0 .. 127;
   end record;

   type R_Clusters is array (1 .. 4) of R_Cluster;

Here the description is far more compact and will produce less code, as the driver need only access the elements as an array rather than through a case statement. I would also wager the code will be clearer, as it matches how the reference manual treats replicated elements. The RMs write the shorthand for the address computation like so:

OTFDEC region x configuration register(OTFDEC_RxCFGR)
Address offset: 0x20 + 0x30 * (x -1) (x = 1 to 4)
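
The same arithmetic can be written as a one-line helper. This is only a sketch: element_offset is a name picked for illustration, and its parameters mirror the <addressOffset>, <dimIncrement> and <dimOffset> values from the cluster shown earlier (remember <dimOffset> is my own addition, not standard SVD).

```ruby
# Byte offset of element `index` of a <dim> array or cluster.
# Hypothetical helper; parameter names follow the SVD element names.
def element_offset(address_offset, dim_increment, dim_offset, index)
  address_offset + dim_increment * (index - dim_offset)
end

# OTFDEC_RxCFGR: 0x20 + 0x30 * (x - 1), x = 1 to 4
(1..4).each do |x|
  puts format('R%dCFGR at 0x%02X', x, element_offset(0x20, 0x30, 1, x))
end
```

This prints 0x20, 0x50, 0x80 and 0xB0, which agrees with the R1CFGR, R2CFGR and R4CFGR offsets in the flat listing.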

Special cases

What fun would programming be without a myriad of special cases and arcane detail to deal with? As with any good programming problem, there are loads of these issues. Let's take a look at some of them.

cluster naming

The example I presented above was a good one, in that the tool can make an educated guess that the cluster name is... R. Can we get so lucky that this 'rule' holds for all such groupings? Sadly, no. Let's see another case from the DMA controller. For each of the 8 channels, 5 registers make up the cluster:

0x00000008 CCR1
0x0000000c CNDTR1
0x00000010 CPAR1
0x00000014 CM0AR1
0x00000018 CM1AR1
...
0x00000094 CCR8
0x00000098 CNDTR8
0x0000009c CPAR8
0x000000a0 CM0AR8
0x000000a4 CM1AR8

Here we see an interesting layout. The cluster index is the last digit. A possible array index is the first digit, as in CM0AR1, CM1AR1. More disturbingly, there is no implied grouping in the names other than that they all start with a C (not too meaningful). These are really DMA channel regs in a group that is dimensioned 1..8. So we need to help the tool a) identify this issue and b) allow a naming of this grouping. I have a syntax on the cmdline of the tool that permits this naming. It requires some help from the user to place the rename. For this group, it looks like this:

~/ruby/svdopt.rb -C  DMA1:8:CH,DMA2:8:CH ...other options...

This says that when working on peripheral DMA1 or DMA2, at offset 8, use CH as the cluster name.
In the output you then get this:

<cluster>
  <dim>8</dim>
  <dimIncrement>0x14</dimIncrement>
  <dimOffset>1</dimOffset>
  <name>CH[%s]</name>
  <addressOffset>0x8</addressOffset>
  <register>
    <name>CCRx</name>
    <displayName>CCRx</displayName>
    <description>channel x configuration register</description>
    <addressOffset>0x0</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      ...
    </fields>
  </register>
  <register>
    <name>CNDTRx</name>
    <displayName>CNDTRx</displayName>
    <description>channel x number of data
    register</description>
    <addressOffset>0x4</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      ...
    </fields>
  </register>
  <register>
    <name>CPARx</name>
    <displayName>CPARx</displayName>
    <description>channel x peripheral address
    register</description>
    <addressOffset>0x8</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      ...
    </fields>
  </register>
  <register>
    <dim>2</dim>
    <dimIncrement>4</dimIncrement>
    <dimOffset>0</dimOffset>
    <name>ARy</name>
    <addressOffset>0xc</addressOffset>
    <size>0x20</size>
    <access>read-write</access>
    <resetValue>0x00000000</resetValue>
    <fields>
      ...
    </fields>
  </register>
</cluster>

Finally svd2ada yields this:

   type DMA_Peripheral is record
...
      CH    : aliased CH_Clusters;
...
   end record;

and

   for CH_Cluster use record
      CCRx   at 16#0# range 0 .. 31;
      CNDTRx at 16#4# range 0 .. 31;
      CPARx  at 16#8# range 0 .. 31;
      ARy    at 16#C# range 0 .. 63;
   end record;

   type CH_Clusters is array (1 .. 8) of CH_Cluster;
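
As an aside on the -C option used above, a spec such as DMA1:8:CH,DMA2:8:CH is easy to turn into a lookup table. The sketch below is an illustration, not the option handling in svdopt.rb: parse_cluster_names is a hypothetical name, and reading the offset as hexadecimal is an assumption on my part for this example.

```ruby
# Hypothetical parser for a spec like "DMA1:8:CH,DMA2:8:CH".
# Returns { [peripheral, offset] => cluster_name }.
def parse_cluster_names(spec)
  spec.split(',').each_with_object({}) do |entry, table|
    periph, offset, name = entry.split(':')
    raise ArgumentError, "bad entry: #{entry}" unless periph && offset && name
    # Assumption: offsets are given in hex, with or without a 0x prefix.
    table[[periph, Integer(offset, 16)]] = name
  end
end

names = parse_cluster_names('DMA1:8:CH,DMA2:8:CH')
puts names[['DMA1', 0x8]]   # cluster name to use for the group at DMA1 + 0x8
```

Keying the table on the peripheral name plus the group's starting offset means a detected group only has to look itself up to find out whether the user supplied a better name than the guessed one.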


split fields

Yes, sounds bad, and it is. Let's take a look at AES from the same SoC:

0x00000010 KEYR0
0x00000014 KEYR1
0x00000018 KEYR2
0x0000001c KEYR3
0x00000020 IVR0
0x00000024 IVR1
0x00000028 IVR2
0x0000002c IVR3
0x00000030 KEYR4
0x00000034 KEYR5
0x00000038 KEYR6
0x0000003c KEYR7

Take a look at that reg layout. It looks like when they did the HW it only supported 128-bit AES. Who needs more than 128 bits, they thought? Well, time moves on and now 256-bit AES keys are commonplace. But what of legacy code that uses 128-bit keys and expects the IV to be right after the key? Well, let's just stuff the rest of the key after the IV, which leaves a hole in the middle of KEYR if you were to look at it as a contiguous array 0..7.

How svdopt processes regs

Internally, svdopt looks at regs like so:

['KEYR', :x] or ['KEYR', 1] ... ['KEYR', 7] etc.

A natural grouping of KEYR above would be to ID it as an array 0..7. There is a safety check in svdopt to ensure that, from one numbered element to the next, the address gap between elements equals the base register size. At KEYR4 this discontinuity is detected and some messy logic takes over to split the array into 2 new arrays.
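
The gap check itself is simple to sketch. The Ruby below is a simplified illustration rather than the messy logic in svdopt.rb: split_runs is a hypothetical name, and the input is assumed to be [index, byte_offset] pairs for one register family with 4-byte registers.

```ruby
# Hypothetical sketch: split a numbered register family into contiguous runs.
# elems: array of [index, byte_offset]; reg_bytes: register size in bytes.
def split_runs(elems, reg_bytes = 4)
  elems.sort_by { |index, _| index }
       .slice_when { |(i, a), (j, b)| j != i + 1 || b - a != reg_bytes }
       .to_a
end

# KEYR0..KEYR7 from the AES listing above.
keyr = [[0, 0x10], [1, 0x14], [2, 0x18], [3, 0x1c],
        [4, 0x30], [5, 0x34], [6, 0x38], [7, 0x3c]]

runs = split_runs(keyr)
runs.each_with_index do |run, n|
  # Only add an A/B/... suffix when a split actually happened.
  suffix = runs.size > 1 ? ('A'.ord + n).chr : ''
  first_index, first_offset = run.first
  puts format('KEYR%s: dim=%d dimOffset=%d addressOffset=0x%x',
              suffix, run.size, first_index, first_offset)
end
```

For the AES registers this yields two runs of four, one starting at index 0 and offset 0x10 and one starting at index 4 and offset 0x30.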

The rewrite looks like this. Basically, the array becomes 2 new ones, KEYRA and KEYRB. Observe that KEYRB starts at index 4, as you would expect.

<register>
  <dim>4</dim>
  <dimIncrement>4</dimIncrement>
  <dimOffset>0</dimOffset>
  <name>KEYRA[%s]</name>
  <addressOffset>0x10</addressOffset>
  <size>0x20</size>
  <access>read-write</access>
  <resetValue>0x00000000</resetValue>
  <fields>
    <field>
      <name>KEY</name>
      <description>Cryptographic key, bits[31:0]</description>
      <bitOffset>0</bitOffset>
      <bitWidth>32</bitWidth>
    </field>
  </fields>
</register>
<register>
  <dim>4</dim>
  <dimIncrement>4</dimIncrement>
  <dimOffset>0</dimOffset>
  <name>IVR[%s]</name>
  <addressOffset>0x20</addressOffset>
  <size>0x20</size>
  <access>read-write</access>
  <resetValue>0x00000000</resetValue>
  <fields>
    <field>
      <name>IVI</name>
      <description>initialization vector register (LSB IVR
      [31:0])</description>
      <bitOffset>0</bitOffset>
      <bitWidth>32</bitWidth>
    </field>
  </fields>
</register>
<register>
  <dim>4</dim>
  <dimIncrement>4</dimIncrement>
  <dimOffset>4</dimOffset>
  <name>KEYRB[%s]</name>
  <addressOffset>0x30</addressOffset>
  <size>0x20</size>
  <access>read-write</access>
  <resetValue>0x00000000</resetValue>
  <fields>
    <field>
      <name>KEY</name>
      <description>Cryptographic key, bits
      [159:128])</description>
      <bitOffset>0</bitOffset>
      <bitWidth>32</bitWidth>
    </field>
  </fields>
</register>

The svd2ada output then is:

   for AES_Peripheral use record
      CR    at 16#0# range 0 .. 31;
      SR    at 16#4# range 0 .. 31;
      DINR  at 16#8# range 0 .. 31;
      DOUTR at 16#C# range 0 .. 31;
      KEYRA at 16#10# range 0 .. 127;
      IVR   at 16#20# range 0 .. 127;
      KEYRB at 16#30# range 0 .. 127;
      SUSPR at 16#40# range 0 .. 255;
   end record;

   type KEYRA_Registers is array (0 .. 3) of HAL.UInt32;

   type IVR_Registers is array (0 .. 3) of HAL.UInt32;

   type KEYRB_Registers is array (4 .. 7) of HAL.UInt32;

conclusion

This problem was quite nasty, as you never know where the cluster index or the array index is. There is inconsistency in how the vendor may choose to write the cluster candidate, and in whether that was ever a consideration. In some cases the array index comes first:

CM0AR8

In my tool that gets broken out as:

['CM', 0, 'AR', 8] and also: ['CM', :x, 'AR', :y]

The bookkeeping needs to be flexible and must not assume that :x or :y above is an array or cluster index just by its placement. Only through analysis of the regs can this be deduced.
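
To make the two forms above concrete, here is a small Ruby sketch of how a name could be broken out. The function names tokenize and pattern are mine for this illustration and are not taken from svdopt.rb.

```ruby
# Split a register name into alternating text and integer tokens.
def tokenize(name)
  name.scan(/\d+|\D+/).map { |t| t.match?(/\A\d+\z/) ? t.to_i : t }
end

# Replace each number with a placeholder symbol, in order of appearance.
# Raises IndexError if a name carries more than three numbers.
def pattern(name)
  placeholders = %i[x y z]
  n = -1
  tokenize(name).map { |t| t.is_a?(Integer) ? placeholders.fetch(n += 1) : t }
end

p tokenize('CM0AR8')   # ["CM", 0, "AR", 8]
p pattern('CM0AR8')    # ["CM", :x, "AR", :y]

# Registers that share a pattern are candidates for the same array or cluster.
names = %w[CM0AR1 CM1AR1 CM0AR8 CM1AR8 CCR1 CCR8]
p names.group_by { |r| pattern(r) }.transform_values(&:size)
```

Note that the placeholders say nothing about which number is the cluster index and which is the array index. That still has to come from looking at how the offsets move as each number changes.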

The tool status now is experimental. I am still evaluating it and will be checking it in soon.





Monday, December 23, 2019

Ada on a CM33

The CM33

About 2 years ago, ARM released the CM33 to silicon designers. It is designed for IoT, but it has a myriad of options that the silicon designer (SD) can choose from. At its core it's just another CM series processor, not unlike a CM4/CM4F/CM7. What differentiates it for an SD are the optional Main Extension and Security Extension (aka TrustZone). When both of these options are in place and selected by the end user, the CPU looks at one moment like two CM4Fs, one secure and one not secure. Of course, the program counter can only be in one place at one time, so whilst it may indeed look like two cores, only one is really there, and it is in the management of the view of the SoC from each of these two spaces that the true colour of the CPU shows.

For this discussion of Ada on a CM33, we will assume the CPU starts in TrustZone mode + Main Extension. This means that there are two SysTick timers, one for secure (S) and one for non-secure (NS). Only some of the CPU registers are banked. R0-R12, LR and PC are not banked. Banking takes place for the SPs: for the usual MSP & PSP there are now variants selected by the context the CPU is running in, so we have MSP_NS, PSP_NS & MSP_S, PSP_S. There are also some SP limit regs, also banked. A myriad of internal peripherals are also banked, the E000ED00 - space for example. S always has access to the NS banked world, but NS has hardware- and software-reduced views of memory and registers.

There are many complexities in a CM33. As an example, let's look at exception handling. If it's an S originating exception to an S handler, no issues, the usual frame is maintained. Similarly for an NS originating exception to an NS handler. If however it was an S originating exception to an NS handler, there is the potential of a leak of S register info via this asynchronous 'peek' into the now stopped S side. ARM was clever here and I believe must have given their CPU designers quite a design challenge. The idea is to push a *big* frame onto the S stack, then zero all the registers and arrive at the NS handler. There is major LR magic going on in a CM33. It has more bits now to indicate S and NS frame info.

ARM does not do all the work for you in HW here, btw. Once your program is running, let's say it's executing in S and wants to call over to code in NS: this is doable. There is a new instr, BLXNS, that allows you to call to an NS entry point. For the converse, an NS program calling an S function, there are limits in place. The entry point must be an SG instruction and the memory space must be marked Non-secure-callable (NSC). After that, the usual veneer code can be used. For the first example, S->NS, it is the job of SW to wipe out the S registers before the BLXNS. So there is some boilerplate to do that in assembly; the func is __gnu_cmse_nonsecure_call. New C compiler options trigger this automatically (-mcmse in GCC; CMSE stands for Cortex-M Security Extensions).

Ada on a CM33

So, building on AdaCore's ARM offering via the Ada_Drivers_Library and Ravenscar profiles, we can consider how to get software running on a CM33. The first thing to think about with a port of Ada to a CM33 is the use case being contemplated. How will a CM33 improve on the already perfectly usable Ada footprint in the CM world? I.e. what is to be gained by moving Ada to a CM33? Are there threats that the security modes of a CM33 would help to close?

Secure Booting Ada

One item that looks promising is secure boot. Secure boot is the establishment of a root of trust in a system, starting from power-on reset and extending into runtime by offering a secure interface to NS components. That interface lets them securely utilize keys and other high-value assets that are protected by the root of trust and only indirectly visible to the NS side. For example, take encrypting a packet with a network key. The key, on the S side, is used with HW to perform the encryption and the result is placed in NS RAM. At no time did the NS side have access to the key, the HW or any algorithm associated with the encryption. So this looks quite promising: we can establish a root of trust and then pass control to NS whilst offering a set of S side APIs to the NS side.

Non-secure Ada

The secure side, after establishing the root of trust, passes control to the NS side. From there, user code can take over and run much as before in the CM4 days, but it does so under the watchful eye of SoC hardware that has prepared the memory and peripherals in such a way as to sandbox the NS side and keep it from overstepping any boundary that secure boot establishes.

Two programs

The way my Ada port to a CM33 works is to have two somewhat independent programs running on the same CPU. The basic layout is S:glue:NS, where S and NS are two Ada programs. The glue is not written in Ada but is a mix of C and assembly, and it is tasked with joining S and NS bidirectionally so that each can call over to the other in a secure fashion.

Here are the memory maps for S and NS. 

Memory region         Used Size  Region Size  %age Used
           flash:          0 GB       512 KB      0.00%
          sram12:       56512 B       256 KB     21.56%
           sglue:          0 GB        60 KB      0.00%
          nsglue:          32 B         4 KB      0.78%
              ns:          0 GB       128 KB      0.00%

Memory region         Used Size  Region Size  %age Used
           flash:          0 GB       512 KB      0.00%
          sram12:          0 GB       256 KB      0.00%
           sglue:          0 GB        60 KB      0.00%
          nsglue:          0 GB         4 KB      0.00%
              ns:       81840 B       128 KB     62.44%

Currently this is being tested on an STM32L552 Nucleo board with TZEN=1 (that is an almost irreversible OTP option bit that, once programmed, gives the core both the Main Extension and the Security Extension). By default an STM32L552 Nucleo ships looking a lot like a fancy CM4F with none of the magic CM33 options activated.

Step One

Step one is a debugger. We want to use arm-eabi-gdb, but what to connect it to? I did a port of openocd to ARMv8-M to achieve this. I have added 3 CM33 CPUs:

1) LPC55S69 (lpc55xx.cfg)
2) nRF5340 (nRF5340.cfg)
3) STM32L552 (st_nucleo_l552.cfg)

Only the ST openocd port supports flashing, but at the moment I only use SRAM for my experiments, as otherwise I am sure I would have worn out the flash in the part by now.

The code is up on github:

Hard bugs

There are some very interesting and very tough bugs that can come from this type of work. Days can disappear as the clues are fleeting. One particularly bad one caused me a detour of two or more days whilst I got trace going:


Via trace captured on a Saleae and then massaged into a Pulseview ITM trace, I could see where the issue was. There are two SysTicks now and two Ada programs. The Ravenscar runtimes will periodically context switch on either side after a number of SysTicks has been reached. In a single core world, this is fine. However, in a CM33 the PC can be over on either side when the exception appears. The CPU handles this fine; it's the runtime software that has to be made wise to this dual origination.

I found that a crash would occur any time the PC was on the other side when the exception appeared for the side in question. The magic LR value indicates where the saved registers are. The problem comes when the runtime decides to context switch using a PendSV: that new direction left the other side in the lurch, as its exception frame was not honoured and control was stolen over to the excepting side. The solution I have at the moment that fixes the crash is to XOR between the security states: only context switch when the exception and context are on the same side. This can still lead to starvation if both sides choose to sit in WFI instructions. It doesn't crash, but the allocation of CPU time is not equitable between the two sides. For now, I think there is no need to WFI on the S side; instead it can be used as a deterministic API handler.

Other bugs can be frustrating security issues where the SAU status indicates faults due to misaligned calls from NS to S, or calls from S to NS when the NS side is not marked NS. Some of these security fails can really open up a rabbit hole of odd registers that need to be flipped to allow S->NS or NS->S.
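
The "same side" test mentioned above can be modelled in a few lines. This is a model of the decision only, written in Ruby for brevity; the real check lives in the runtime's exception code and may well look different. It assumes the ARMv8-M EXC_RETURN layout in which bit 0 (ES) gives the security state the exception was taken to and bit 6 (S) says whether the interrupted context was stacked on the Secure stack. The name same_side? and the sample values are illustrative.

```ruby
# Model only: decide from an EXC_RETURN (LR) value whether the exception
# and the interrupted context are in the same security state.
# Assumed bit positions (ARMv8-M EXC_RETURN): ES = bit 0, S = bit 6.
ES_BIT = 0
S_BIT  = 6

def same_side?(exc_return)
  es = (exc_return >> ES_BIT) & 1   # state the exception was taken to
  s  = (exc_return >> S_BIT) & 1    # state whose stack holds the frame
  (es ^ s).zero?                    # XOR is 0 when both agree
end

# Illustrative values where only the two bits of interest vary.
[0xFFFFFFBC, 0xFFFFFFFD, 0xFFFFFFBD, 0xFFFFFFFC].each do |lr|
  action = same_side?(lr) ? 'context switch allowed' : 'skip the switch'
  puts format('LR=0x%08X: %s', lr, action)
end
```

The first two values have ES and S equal, so a switch would be allowed; the last two have them differing, which is the case that used to crash.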

Complexity

The 3 designs I have looked at, the NXP LPC55S69, Nordic nRF5340 and STM32L552, are some of the most complex microcontrollers I have ever worked with. A CM4F/CM7 can already be a handful depending on what you are trying to do. Now take a CM4F and add security via TrustZone: the SAU (security attribution unit), possibly an IDAU (implementation defined attribution unit), and the MPU (now also changed in ARMv8-M, and banked in S and NS). Via these units you begin to divvy up the RAM into regions S & NS can work with. Next, if you need to share SoC peripherals, you begin with two views of each peripheral, the S view and the NS view. The Ada_Drivers_Library needed to be taught how to handle this new view. Typically each of these SoCs has S and NS views of a peripheral picked by the top nibble of the address, for example:

   GPIOA_Base : constant System.Address :=
     System'To_Address (16#42020000#);

   SEC_GPIOA_Base : constant System.Address :=
     System'To_Address (16#52020000#);

In the Ada driver we have the device auto choose the address based on execution context:

   function S_NS_Periph (Addr : System.Address) return System.Address
     with Inline;

   GPIO_A : aliased GPIO_Port with Import, Volatile, Address => S_NS_Periph (GPIOA_Base);

It should be mentioned that this is vendor specific: each vendor can choose how they do S/NS mappings. This just adds to the complexity.
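
For the ST mapping shown above, the two GPIOA addresses differ in a single bit of the top nibble. The sketch below models that selection in Ruby for brevity. It is not the body of S_NS_Periph: the constant and function names are illustrative, and since how the driver detects the current security state is not covered here, the state is simply passed in as a flag.

```ruby
# Model of an ST-style S/NS peripheral alias: the secure alias is the
# non-secure address with bit 28 set (0x4202_0000 -> 0x5202_0000).
# Names are hypothetical; the mapping is vendor specific.
SECURE_ALIAS_BIT = 0x1000_0000

def s_ns_periph(addr, secure)
  secure ? (addr | SECURE_ALIAS_BIT) : (addr & ~SECURE_ALIAS_BIT)
end

GPIOA_BASE = 0x4202_0000
puts format('NS view: 0x%08X', s_ns_periph(GPIOA_BASE, false))
puts format('S view:  0x%08X', s_ns_periph(GPIOA_BASE, true))
```

The appeal of doing this in one place is that the rest of the driver can keep using a single base address constant per peripheral.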

Example

An example has been prepared that cycles the 3 user LEDs on the Nucleo board. The LED toggle is done by a secure function which is called by a non-secure function.



Sunday, June 16, 2019

An Ada Client & Server on the STM32WB55


Ada WB55 Client & Server

Following on from the last blog posting, I now have a client and server implementation in Ada. Mine is not quite as fancy as ST's. ST's allows role reversal, where the client can run on the larger board (MB1355C) and the server on the USB dongle (MB1293C). I only support the server on the MB1355C and the client on the MB1293C.

Getting the code & building

You will need gnat2018 or gnat2019 from AdaCore:

https://www.adacore.com/download

Once that's installed, you will need some library code and the STM32 dir I use:

git clone https://github.com/morbos/Ada_Drivers_Library.git
git clone https://github.com/morbos/embedded-runtimes.git
git clone https://github.com/morbos/STM32.git
mv embedded-runtimes Ada_Drivers_Library
cd STM32/WB/WB55/cli_serv_wb55
make

Once make finishes you get 2 ELF32 files in the obj/Debug dir.

admin@ubuntu_1604:/tmp/STM32/WB/WB55/cli_serv_wb55$ make
rm -f obj/Debug/client_wb55x
(export LOADER=ROM_WB55x; gprbuild client_wb55x.gpr)
Link
   [link]         client_wb55x.adb
Memory region         Used Size  Region Size  %age Used
           flash:       93008 B         1 MB      8.87%
           sram1:      102792 B       192 KB     52.28%
          sram2a:          0 GB        32 KB      0.00%
          sram2b:          0 GB        32 KB      0.00%
(cd obj/Debug; arm-eabi-objdump -d client_wb55x >client_wb55x.lst; arm-eabi-objdump -s client_wb55x >client_wb55x.dmp; arm-eabi-gcc-nm -an client_wb55x >client_wb55x.nm; arm-eabi-objcopy -Obinary client_wb55x client_wb55x.bin)
rm -f obj/Debug/server_wb55x
(export LOADER=ROM_WB55x; gprbuild server_wb55x.gpr)
Link
   [link]         server_wb55x.adb
Memory region         Used Size  Region Size  %age Used
           flash:       93752 B         1 MB      8.94%
           sram1:      108056 B       192 KB     54.96%
          sram2a:          0 GB        32 KB      0.00%
          sram2b:          0 GB        32 KB      0.00%
(cd obj/Debug; arm-eabi-objdump -d server_wb55x >server_wb55x.lst; arm-eabi-objdump -s server_wb55x >server_wb55x.dmp; arm-eabi-gcc-nm -an server_wb55x >server_wb55x.nm; arm-eabi-objcopy -Obinary server_wb55x server_wb55x.bin)

Each of these ELF32 files can be flashed on the board.

Openocd

To flash the MB1355C use the ST-Link USB connector (the silkscreen is on the bottom of the board).

You will need openocd-0.10.0 that is on my github.

Modify the st_nucleo_wb.tcl in the tcl/board dir. You want to make sure the v2-1 line is uncommented. 

From:
# source [find interface/stlink-v2-1.cfg]
source [find interface/stlink-v2.cfg]

To:

source [find interface/stlink-v2-1.cfg]
#source [find interface/stlink-v2.cfg]

Then you can attach:

root@pi3:~/openocd-0.10.0/tcl# ../src/openocd -f board/st_nucleo_wb55.cfg
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
        http://openocd.org/doc/doxygen/bugs.html
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
adapter speed: 2000 kHz
adapter_nsrst_delay: 100
none separate
none separate
Info : Unable to match requested speed 2000 kHz, using 1800 kHz
Info : Unable to match requested speed 2000 kHz, using 1800 kHz
Info : clock speed 1800 kHz
Info : STLINK v2 JTAG v32 API v2 SWIM v22 VID 0x0483 PID 0x374B
Info : using stlink api v2
Info : Target voltage: 3.268721
Info : stm32wb.cpu: hardware has 6 breakpoints, 4 watchpoints

Flashing & Debug

admin@ubuntu_1604:/.share/CACHEDEV1_DATA/Ada/STM32/WB/WB55/cli_serv_wb55$ arm-eabi-gdb obj/Debug/server_wb55x
GNU gdb (GDB) 8.3 for GNAT Community 2019 [rev=gdb-8.3-ref-194-g3fc1095]
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
See your support agreement for details of warranty and support.
If you do not have a current support agreement, then there is absolutely
no warranty for this version of GDB.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-pc-linux-gnu --target=arm-eabi".
Type "show configuration" for configuration details.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from obj/Debug/server_wb55x...
(gdb) target extended-remote 10.0.1.241:3333
Remote debugging using 10.0.1.241:3333
0x0800e02e in system.bb.board_support.interrupts.power_down ()
    at /.share/CACHEDEV1_DATA/Ada/STM32/WB/WB55/cli_serv_wb55/Ada_Drivers_Library/embedded-runtimes/base_runtimes/ravenscar-full/gnarl-common/s-bbbosu.adb:402
402          Asm ("wfi", Volatile => True);
(gdb) monitor reset halt
Unable to match requested speed 2000 kHz, using 1800 kHz
Unable to match requested speed 2000 kHz, using 1800 kHz
adapter speed: 1800 kHz
target halted due to debug-request, current mode: Thread 
xPSR: 0x01000000 pc: 0x08010968 msp: 0x2001a620
(gdb) load
Loading section .text, size 0x121d0 lma 0x8000000
Loading section .ARM.extab, size 0xe34 lma 0x80121d0
Loading section .ARM.exidx, size 0xe98 lma 0x8013004
Loading section .rodata, size 0x31e8 lma 0x8013ea0
Loading section .data, size 0x78c lma 0x8017088
Start address 0x8010968, load size 96272
Transfer rate: 22 KB/sec, 10696 bytes/write.
(gdb) monitor reset init
Unable to match requested speed 2000 kHz, using 1800 kHz
Unable to match requested speed 2000 kHz, using 1800 kHz
adapter speed: 1800 kHz
target halted due to debug-request, current mode: Thread 
xPSR: 0x01000000 pc: 0x08010968 msp: 0x2001a620
Unable to match requested speed 2000 kHz, using 1800 kHz
Unable to match requested speed 2000 kHz, using 1800 kHz
adapter speed: 1800 kHz
(gdb) b main
Breakpoint 1 at 0x800060a: file b__server_wb55x.adb, line 276.
(gdb) b __gnat_last_chance_handler
Breakpoint 2 at 0x800065a: file /.share/CACHEDEV1_DATA/Ada/STM32/WB/lch_sfp/led/last_chance_handler.adb, line 48.
(gdb) c
Continuing.
Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 1, main () at b__server_wb55x.adb:276
276       Ensure_Reference : aliased System.Address := Ada_Main_Program_Name'Address;
(gdb) 

To flash the MB1293C, attach using the other source line of st_nucleo_wb.tcl and follow a similar flashing process, except flash client_wb55x.


Thursday, June 13, 2019

Ada on the STM32WB

STM32WB

This is a new part family that merges an L series microcontroller with the BlueNRG-MS controller, now realized as a single die. The microcontroller can run at 64 MHz and has 1MB of flash and 256KB of RAM. The wireless portion (I say wireless since it's not limited to BT/BLE but also supports Thread and Zigbee) runs on a Cortex-M0+ controller and shares the top of the 1MB of flash for its (encrypted) FW. The M0+'s FW size seems to be about 256KB or so. The CM4F and CM0+ communicate by using some of the 256KB of RAM as a mailbox. This is supported by HW (IPCC) that signals interrupts from one side to the other when data is ready. In the past, as on the SensorTile, the BlueNRG-MS was connected by SPI and that was the transport; now it's done by mailbox and HW signalling.
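To make the mailbox idea concrete, here is a toy Python model of it. This is only a sketch of the concept: the class name, the channel count and the callback standing in for the interrupt are all my own assumptions for illustration, not the real IPCC register layout or ST's transport code.

```python
class Mailbox:
    """Toy model: shared buffers plus a per-channel 'occupied' flag."""

    def __init__(self, channels=2):
        self.buffers = [b""] * channels      # stands in for the shared RAM
        self.occupied = [False] * channels   # stands in for the HW channel flags
        self.handlers = [None] * channels    # stands in for the interrupt handlers

    def on_data(self, channel, handler):
        """Register the receiver's 'interrupt handler' for a channel."""
        self.handlers[channel] = handler

    def send(self, channel, payload):
        """Post a message; refuse if the previous one was not consumed yet."""
        if self.occupied[channel]:
            return False
        self.buffers[channel] = bytes(payload)
        self.occupied[channel] = True
        if self.handlers[channel] is not None:
            self.handlers[channel](channel)  # "interrupt" to the other core
        return True

    def receive(self, channel):
        """Read the message and free the channel for the next sender."""
        payload = self.buffers[channel]
        self.occupied[channel] = False
        return payload


# Usage: one side posts arbitrary bytes, the other consumes them in its handler.
mailbox = Mailbox()
received = []
mailbox.on_data(0, lambda ch: received.append(mailbox.receive(ch)))
mailbox.send(0, b"\x01\x02\x03")
```

The point of the flag is that the sender never overwrites a message the other side has not picked up yet, which is the job the hardware signalling does between the two cores.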

Ada

Can we do it again? Can we get Ada running usefully on a WB series part with the added burden of the new control over BT/BLE? I will save readers some time here and say yes, it's possible! Further, it's working. Here is some interesting info: Ada's package system cleanly separates modules from one another, so I was able to smoothly migrate the BT work over from the SensorTile to the WB just as I had envisioned. The trick was to recognize that ST was not going to re-invent the wheel; they would use 99% of the working BlueNRG-MS stack on the WB. That means opcodes are the same and events are the same. All the data structures that had been done for those were a drop-in. This saved me months of weekend dev time. Of course, there are some differences, but these are relatively minor wrt the BT messaging.

SVD

I carp about SVD files a bunch. I think they are the key to getting Ada going on a new target. ST has been good over the years at generating SVD files. Why should the WB be any different? Well, it is different. There is no SVD file on their website as of this writing (odd, since they released one for the complex STM32MP157 with hundreds of blocks). I need an SVD file, so what are we going to do to get Ada going on the WB55? Well, the reference manual defines all the regs... maybe... well, that's what I did: I cut and pasted all the reg defns into block txt files, bugs and all (yes, there are loads of datasheet bugs). Once all the HW blocks were assembled, I hacked up a Ruby script to convert them into XML fragments and assembled the whole shebang as the SVD file. It's part of the dir trees called out below. svd2ada is happy with the file, and I have been using it smoothly for bringup and for my usual technique of using SVD files to parse raw GDB dumps back to ASCII dotted format for easy diffing (see my blog entry on that: http://www.hrrzi.com/2017/07/arm-cortex-svd-files-lot-of-goodness.html ).
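For a flavour of the text-to-XML step, here is a minimal sketch in Python (the original script was Ruby and is not reproduced here). The input format, one "NAME bitOffset bitWidth" line per field, and the DEMO_CR register are hypothetical, made up for this example; only the element names (register, addressOffset, size, fields, field, bitOffset, bitWidth) come from the CMSIS-SVD format.

```python
import xml.etree.ElementTree as ET


def register_to_svd(name, offset, field_lines):
    """Build an SVD <register> element from simple text field definitions.

    Each entry of field_lines is assumed to look like "NAME bitOffset bitWidth".
    """
    reg = ET.Element("register")
    ET.SubElement(reg, "name").text = name
    ET.SubElement(reg, "addressOffset").text = hex(offset)
    ET.SubElement(reg, "size").text = "0x20"  # 32-bit registers assumed
    fields = ET.SubElement(reg, "fields")
    for line in field_lines:
        field_name, bit_offset, bit_width = line.split()[:3]
        field = ET.SubElement(fields, "field")
        ET.SubElement(field, "name").text = field_name
        ET.SubElement(field, "bitOffset").text = str(int(bit_offset))
        ET.SubElement(field, "bitWidth").text = str(int(bit_width))
    return reg


# Usage: a made-up control register with two fields.
fragment = register_to_svd("DEMO_CR", 0x20, ["EN 0 1", "MODE 4 2"])
print(ET.tostring(fragment, encoding="unicode"))
```

A real conversion also has to carry descriptions, access types and reset values, and cope with the datasheet bugs mentioned above, so treat this as the skeleton only.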

An update on the SVD topic: Pierre Le Corre of ST pointed out that the STM32CubeIDE has a subdir where the SVD file can be found. Indeed, it's there along with all of the F and L series parts.

STM32CubeIDE_1.0.1/STM32CubeIDE/plugins/com.st.stm32cube.ide.mcu.productdb.debug_1.0.0.201904021149/resources/cmsis/STMicroelectronics_CMSIS_SVD
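Whichever SVD file is used, the "dotted format" trick boils down to slicing a raw register value into its named fields. Here is a minimal Python sketch of that idea, not my actual tool. It assumes fields are described with bitOffset/bitWidth (real SVD files may use bitRange or lsb/msb instead), and the tiny embedded SVD with its DEMO peripheral is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down SVD used only for this example.
SAMPLE_SVD = """
<device><peripherals><peripheral>
  <name>DEMO</name>
  <registers><register>
    <name>CR</name>
    <fields>
      <field><name>EN</name><bitOffset>0</bitOffset><bitWidth>1</bitWidth></field>
      <field><name>MODE</name><bitOffset>4</bitOffset><bitWidth>2</bitWidth></field>
    </fields>
  </register></registers>
</peripheral></peripherals></device>
"""


def dotted_fields(svd_xml, peripheral, register, raw):
    """Return 'PERIPH.REG.FIELD = value' lines for one raw register value."""
    root = ET.fromstring(svd_xml)
    for periph in root.iter("peripheral"):
        if periph.findtext("name") != peripheral:
            continue
        for reg in periph.iter("register"):
            if reg.findtext("name") != register:
                continue
            lines = []
            for field in reg.iter("field"):
                offset = int(field.findtext("bitOffset"), 0)
                width = int(field.findtext("bitWidth"), 0)
                value = (raw >> offset) & ((1 << width) - 1)
                lines.append(
                    f"{peripheral}.{register}.{field.findtext('name')} = {value:#x}"
                )
            return lines
    return []  # peripheral or register not found


for line in dotted_fields(SAMPLE_SVD, "DEMO", "CR", 0x31):
    print(line)
# prints:
# DEMO.CR.EN = 0x1
# DEMO.CR.MODE = 0x3
```

One line per field is what makes the output easy to diff between two dumps.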

Hardware & Demo FW

ST sells a nucleo board eval kit for the WB55. It has two boards in it. A nucleo board and a USB dongle. Out of the box they have a client server demo. The bigger board is the server and the dongle is a client. The dongle on a button press scans and connects to the server. Once connected, the button (SW1) toggles the blue LED on the other board. So, each can flip the led on the other one. SW2 on the bigger board changes the rate the radio refreshes. In this mode, the LED takes a little longer to toggle.

Ada client

I crafted a workalike of the ST client that runs on the dongle. Here is the larger nucleo running ST's server code communing with my Ada client on the dongle.



The Ada client performs all the functions as stated in the readme ST provides:

 - The Peripheral device (BLE_p2pServer) starts advertising (during 1 minute), the green led blinks for each advertising event.
 - The Central device (BLE_p2pClient) starts scanning when pressing the User button (SW1) on the USB Dongle board. 
   - BLE_p2pClient blue led becomes on. 
   - Scan req takes about 5 seconds. *
   - Make sure BLE_p2pServer advertises, if not press reset button or switch off/on to restart advertising.
 - Then, it automatically connects to the BLE_p2pServer. 
   - Blue led turns off and green led starts blinking as on the MB1355C. Connection is done.
 - When pressing SW1 on a board, the blue led toggles on the other one.
   - The SW1 button can be pressed independently on the GATT Client or on the GATT Server.
 - When the server is located on a MB1355C, the connection interval can be modified from 50ms to 1s and vice-versa using SW2. 
 - The green led on the 2 boards blinks for each advertising event, it means quickly when 50ms and slowly when 1s. 
 - Passing from 50ms to 1s is instantaneous, but from 1s to 50ms takes around 10 seconds.
 - The SW1 event, switch on/off blue led, depends on the connection Interval event. 
   - So the delay from SW1 action and blue led change is more or less fast.

* I should say the Ada client scans faster since it abandons the scan when it finds the server.
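The client behaviour in the list above, including the early exit from scanning, can be summarised as a small state machine. The Python below is only a sketch of that reading of the readme, not the Ada client itself; the state and event names are invented for illustration, and what happens on a scan timeout is my assumption since the readme does not say.

```python
class P2PClient:
    """Toy state machine for the dongle (client) side of the demo."""

    def __init__(self):
        self.state = "IDLE"
        self.blue_led = False
        self.sent = []  # messages that would go out over BLE

    def handle(self, event):
        if self.state == "IDLE" and event == "SW1":
            # Button press starts the scan; blue LED on while scanning.
            self.state, self.blue_led = "SCANNING", True
        elif self.state == "SCANNING" and event == "SERVER_FOUND":
            # Abandon the scan as soon as the server is seen, then connect.
            self.state, self.blue_led = "CONNECTED", False
        elif self.state == "SCANNING" and event == "SCAN_TIMEOUT":
            # Assumption: fall back to idle if no server shows up.
            self.state, self.blue_led = "IDLE", False
        elif self.state == "CONNECTED" and event == "SW1":
            # Once connected, the button toggles the LED on the other board.
            self.sent.append("TOGGLE_REMOTE_LED")
        elif self.state == "CONNECTED" and event == "REMOTE_TOGGLE":
            # The other board's button toggles our blue LED.
            self.blue_led = not self.blue_led
        return self.state


# Usage: press SW1, find the server, then press SW1 again.
client = P2PClient()
for event in ["SW1", "SERVER_FOUND", "SW1"]:
    client.handle(event)
```

Reacting to the server-found event directly, rather than waiting out the whole scan window, is the design choice behind the faster scan noted above.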

Code

Order a WB55 nucleo board and get started with Ada running a BLE stack!

## Building on Linux: gnat2018 or gnat2019 needs to be installed first
git clone https://github.com/morbos/Ada_Drivers_Library.git
git clone https://github.com/morbos/embedded-runtimes.git
git clone https://github.com/morbos/STM32.git
mv embedded-runtimes Ada_Drivers_Library
cd STM32/WB/WB55/client_wb55
make

Flashing & Debugging

To flash the code to the USB dongle, openocd needs to be used. First there is a hookup:


Four wires from the ST Link V2.0 over to the USB dongle. If you zoom in a bit you can id the wires that need to go where.

Next you need my version of openocd:


I built it on a Raspberry Pi 3 like so:

./configure --enable-ftdi --enable-stlink --enable-ti-icdi --enable-jlink

Then make

Add other --enable-xyz's if you have other targets not called out above.

Finally run it. I usually cd to the tcl dir:

../src/openocd -f board/st_nucleo_wb55.cfg
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
adapter speed: 2000 kHz
adapter_nsrst_delay: 100
none separate
none separate
Info : Unable to match requested speed 2000 kHz, using 1800 kHz
Info : Unable to match requested speed 2000 kHz, using 1800 kHz
Info : clock speed 1800 kHz
Info : STLINK v2 JTAG v17 API v2 SWIM v4 VID 0x0483 PID 0x3748
Info : using stlink api v2
Info : Target voltage: 3.248645
Info : stm32wb.cpu: hardware has 6 breakpoints, 4 watchpoints

Future work

This is a work in progress. I plan to swap out the server on the larger nucleo board with an Ada version. Stay tuned.






Thursday, March 7, 2019

Shift on Mips-X

Mips-X


Mips-X, as described earlier on the blog, was a Stanford University grad project: a 32-bit RISC CPU with some unique features. For one, it had 2 delay slots for control change instructions (branches and jumps). I am not aware of any other processor that has that. One day we had a visit (not Mips-X related) from John Hennessy, Stanford Mips project faculty lead and ultimately university president, and I asked him, "Why two delay slots?" His paraphrased answer was, "It was a graduate project, we were just trying things out".

The Shifter

Mips-X had a barrel shifter and exposed it to the programmer via these opcodes:

asr    rSRC,rDST,#1..32
rotlb  rSRC1,rSRC2,rDST
rotlcb rSRC1,rSRC2,rDST
sh     rSRC1,rSRC2,rDST,#1..32


Via a combination of the above, all the needed shift operations could be done. Observe, though, that there is no variable shift, just fixed # shift values.

My Shift function

Now here is a good puzzle for the reader: parse my variable shift func from lsr.s.

r0 == 0 -- can be a src or dst
r24 is the code segment offset (allows for position independent code off of r24).
r4 is the value to be shifted.
r5 has the #<shift>
r2 is the result.
r31 is the return address

.text
.noreorg
shift_table:
        mov     r4,r2
        lsr     r4,r2,#1
        lsr     r4,r2,#2
        lsr     r4,r2,#3
        lsr     r4,r2,#4
        lsr     r4,r2,#5
        lsr     r4,r2,#6
        lsr     r4,r2,#7
        lsr     r4,r2,#8
        lsr     r4,r2,#9
        lsr     r4,r2,#10
        lsr     r4,r2,#11
        lsr     r4,r2,#12
        lsr     r4,r2,#13
        lsr     r4,r2,#14
        lsr     r4,r2,#15
        lsr     r4,r2,#16
        lsr     r4,r2,#17
        lsr     r4,r2,#18
        lsr     r4,r2,#19
        lsr     r4,r2,#20
        lsr     r4,r2,#21
        lsr     r4,r2,#22
        lsr     r4,r2,#23
        lsr     r4,r2,#24
        lsr     r4,r2,#25
        lsr     r4,r2,#26
        lsr     r4,r2,#27
        lsr     r4,r2,#28
        lsr     r4,r2,#29
        lsr     r4,r2,#30
        lsr     r4,r2,#31
.globl ___lshrsi3
___lshrsi3:
        nop
        add     r24,r5,r1
        jspci   r1,#shift_table,r0
        jspci   r31,#0,r0
        nop
        nop
.end


Look at the two jspci's above. A jspci in the delay slot of a jspci! What happens? Also observe the nop at function entry. Why is that there? Well, this func's caller could have had a LD of r5 in the second delay slot of its jspci. In that case, if the add were the first instruction, r5 would be stale, as LDs have a one-instruction hazard.


        jspci   r24,#___lshrsi3,r0
        nop
        ld      0[r29],r5
___lshrsi3:
        add     r24,r5,r1

That is a hazard as r5 is still in transit in the pipeline when the add goes to use it. Thus the nop.
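For anyone who wants to check their answer to the jspci puzzle, here is a toy Python model of the function. It is a sketch under one stated assumption: a jump redirects instruction fetch only after exactly two further instructions have been fetched, which is how I read the two-delay-slot rule. The word addresses (table at 0, function at 32, caller at 100) are made up, r24 is taken as 0, and load hazards and the link register write are not modelled.

```python
def lshr_via_table(value, count):
    """Toy model of ___lshrsi3. Returns (r2, list of executed addresses)."""
    SHIFT_TABLE, ENTRY, CALLER = 0, 32, 100  # hypothetical word addresses
    # Table entry n is "lsr r4,r2,#n"; entry 0 behaves like the mov.
    program = {SHIFT_TABLE + n: ("lsr", n) for n in range(32)}
    program.update({
        ENTRY + 0: ("nop",),
        ENTRY + 1: ("add",),                      # r1 = r24 + r5
        ENTRY + 2: ("jspci", "r1", SHIFT_TABLE),  # jump into the table
        ENTRY + 3: ("jspci", "r31", 0),           # return, in a delay slot
        ENTRY + 4: ("nop",),
        ENTRY + 5: ("nop",),
        CALLER: ("halt",),                        # stands in for the caller
    })
    regs = {"r1": 0, "r2": 0, "r4": value & 0xFFFFFFFF, "r5": count,
            "r24": 0, "r31": CALLER}
    pc, pending, trace = ENTRY, [], []
    while program[pc][0] != "halt":
        op = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        # Age the jumps already in flight; one whose two slots are used up
        # redirects the next fetch.
        for jump in pending:
            jump[0] -= 1
        if pending and pending[0][0] == 0:
            next_pc = pending.pop(0)[1]
        if op[0] == "add":
            regs["r1"] = regs["r24"] + regs["r5"]
        elif op[0] == "lsr":
            regs["r2"] = regs["r4"] >> op[1]
        elif op[0] == "jspci":
            pending.append([2, regs[op[1]] + op[2]])  # two delay slots to go
        pc = next_pc
    return regs["r2"], trace


result, trace = lshr_via_table(0x80000000, 4)
print(hex(result), trace)
# prints: 0x8000000 [32, 33, 34, 35, 36, 4]
```

Under this model the second jspci's delay slots are the nop that follows it and the first instruction at the target of the first jspci. So exactly one table entry executes before control lands back in the caller.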