Double precision float

Hello all, glad to see the forums up and running!

I am having a bit of trouble converting a double precision float in OptoControl R4.1a to a single precision float, and I am hoping someone has some tips for me. What I have is a float64 coming in via a serial port. For the float32’s I have coming in, it is easy enough with the Move32bits(Source, Float) procedure command. But for the float64’s, I am stumped. Any guidance would be greatly appreciated.

Yikes, my source data was not float64 after all. It was two individual float32’s… I was using the following method to convert a float64 to a float32. If I had actually had a float64 does this look correct?

Receive Buffer shifted into individual float64 components:
Sign as int32
Exponent as int32
Mantissa as int64

my_Float = Power(-1, Sign) * (1+Mantissa/Power(2,52)) * Power(2, (Exponent-1023))
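
For anyone who wants to try this formula away from the controller, here is a minimal Python sketch (run on a PC, not in OptoControl). It splits the 8 bytes of a double into the three fields listed above and applies the same formula. The helper names `split_float64` and `formula` are made up for illustration, the bytes are assumed to arrive big-endian, and the formula only covers normal numbers (exponent field 1 to 2046), not zero, subnormals, infinities, or NaN.

```python
import struct

def split_float64(raw: bytes):
    """Split 8 big-endian bytes into (sign, exponent, mantissa) fields."""
    bits = int.from_bytes(raw, "big")
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, mantissa

def formula(sign, exponent, mantissa):
    """The formula from the post; valid for normal numbers only."""
    return (-1) ** sign * (1 + mantissa / 2 ** 52) * 2.0 ** (exponent - 1023)

raw = struct.pack(">d", -118.625)        # stand-in for the receive buffer
fields = split_float64(raw)
value = formula(*fields)
print(fields, value)
```

Python floats are 64-bit, so the formula works there without loss; whether it survives on the controller depends on the data types available.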


To check your results, this web page here might be very useful…
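
If you would rather check results locally, a small Python sketch along these lines can print the bit patterns to compare against. This is only an illustration, not part of OptoControl; the helper names `float64_bits_of` and `float32_bits_of` are invented here, and big-endian byte order is an assumption.

```python
import struct

def float64_bits_of(value: float) -> int:
    """64-bit pattern of value stored as a double (big-endian)."""
    return int.from_bytes(struct.pack(">d", value), "big")

def float32_bits_of(value: float) -> int:
    """32-bit pattern of value stored as a single (big-endian).

    struct rounds to the nearest float32 and raises OverflowError
    if the value is too large for a 32-bit float.
    """
    return int.from_bytes(struct.pack(">f", value), "big")

print(hex(float64_bits_of(1.5)))   # 0x3ff8000000000000
print(hex(float32_bits_of(1.5)))   # 0x3fc00000
```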

Hi Chrismec,

You’ve probably long since moved on from this project, but we just recently had another customer asking about converting 64-bits worth of a float he was getting from a Modbus device.

Luckily for him, he had control of how big that float would get, so converting it to a float32 on the PAC side worked for his app. I’ll attach an importable chart (which could easily be converted into a subroutine, if you’d prefer one of those).

// This code takes a 64-bit value (which represents a 64-bit float) and, if it'll fit,
// shoves it into a 32-bit float. The size of the exponent is checked. The least 
// significant bits of the significand are just dropped, so it's up to the user
// to make sure the value coming in doesn't exceed the limits of a 32-bit float. 

// The 32-bit float format is: (see form 1755 on our website for more info)
//  1 bit 8 bits   23 bits
//  x     xxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx
//  Sign  Exponent Significand
// and a double (see also ) looks like this:
//  1 bit 11 bits     52 bits
//  x     xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//  Sign  Exponent    Significand/fraction
nSubError = 0; // this will be set to non-zero if something goes wrong in this block of code, 
// like -3 if the exponent is bigger than the 8 bits of exponent we have in a 32-bit float.
// If nn64BitValue is a 64-bit float, then the first bit will be the sign, followed by:
// 11 bits of exponent, and 52 bits of significand. Let's pick those pieces apart and 
// see if they'll fit in a 32-bit float, which is 1-bit sign, then:
// 8 bits exponent, and 23 bits of significand
// First, let's get that 11 bits of exponent from the 64 bits, by shoving those 52 bits of significand
// off to the right, then keeping only the low 11 bits (which drops the sign bit) by doing a logical and with 
// eleven set bits = binary: 11111111111 = 0x7FF hex
n64Exponent = (nn64BitValue >> 52) bitand 0x7FF;
// Because the 64-bit float has the exponent biased by 1023, we need to subtract 1023
// from it (to get the real exponent). 
// If the real exponent is < -126 or > 127 then it's no good for a 32-bit float. 
if ( ((n64Exponent - 1023) > 127) or 
     ((n64Exponent - 1023) < -126) ) then
  // exponent out of range, set an error flag or something here
  nSubError = -3;

  // Consider setting the float to a NAN also when indicating error. 
  // Set the float to a NAN (not a number), see also "IsFloatValid" or "Float Valid?" command
  // to check for a NAN

  /*  This command was added in version 9.2, might be handy in some cases
  fFloat = Int32ToFloatBits(0xFFFFFFFF);
  */
  // This command works w/old & new versions
  Move32Bits(0xFFFFFFFF, fFloat);
else
  // get the 64-bit significand by doing a logical and with 52 bits set = 0xFFFFFFFFFFFFF
  nn64Sig = nn64BitValue bitand 0xFFFFFFFFFFFFFi64;
  // Then dump the least significant 29 bits of precision, so we're just left with the 23
  // the 32-bit float can hold
  nn64Sig = (nn64Sig >> 29);
  // Because the 64-bit float has the exponent biased by 1023, we need to subtract 1023
  // from it (to get the real exponent). Then bias it by 127 for the 32-bit version
  // Then we shift the whole thing over so it's in the correct place for 32-bit, e.g.
  // 23 bits to the left so the right-most 23 bits will have our significand/fraction
  // Combine the exponent and significand/fraction into one 32-bit
  nInt32 = (( ( n64Exponent - 1023) + 127) << 23) + GetLowBitsOfInt64(nn64Sig);
  // don't forget the sign!
  if (BitTest(nn64BitValue, 63)) then 
    nInt32 = BitSet(nInt32, 31);     
  else
    nInt32 = BitClear(nInt32, 31);     
  endif
  // now we're ready to convert those 32-bits into a float 
  /*  This command was added in version 9.2, might be handy in some cases 
  fFloat = Int32ToFloatBits(nInt32);
  */
  // This command works w/old & new versions
  Move32Bits(nInt32, fFloat);

endif // our exponent was small enough to fit
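
To sanity-check the bit shuffling off the controller, the same steps can be sketched in Python. This is a rough port for illustration only, not the chart itself; the function names `bits_of_double` and `double_bits_to_float32_bits` are invented, and returning a (bits, error) pair is just a stand-in for nInt32 and nSubError.

```python
import struct

def bits_of_double(value: float) -> int:
    """64-bit pattern of a Python float, standing in for nn64BitValue."""
    return int.from_bytes(struct.pack(">d", value), "big")

def double_bits_to_float32_bits(bits64: int):
    """Return (float32 bit pattern, error code); error is -3 if the exponent won't fit."""
    real_exp = ((bits64 >> 52) & 0x7FF) - 1023   # remove the 1023 bias
    if real_exp > 127 or real_exp < -126:
        return 0xFFFFFFFF, -3                    # NaN pattern plus error flag
    sig = (bits64 & 0xFFFFFFFFFFFFF) >> 29       # keep the top 23 significand bits
    bits32 = ((real_exp + 127) << 23) | sig      # re-bias by 127 and combine
    if bits64 >> 63:                             # don't forget the sign
        bits32 |= 1 << 31
    return bits32, 0

print(double_bits_to_float32_bits(bits_of_double(-118.625)))  # (3270328320, 0), i.e. 0xC2ED4000
print(double_bits_to_float32_bits(bits_of_double(0.1)))       # (1036831948, 0), i.e. 0x3DCCCCCC
```

Two things show up when running this. Dropping the low 29 bits truncates rather than rounds, so 0.1 comes out as 0x3DCCCCCC where a rounding conversion gives 0x3DCCCCCD. And with the range check as written, a double of exactly 0.0 (exponent field 0) is reported as an error, so handle zero separately if your device can send it.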

Fun stuff!


Here’s a chart you can import into a PAC Control 8.0 Basic (or higher) strategy.

Hi chrismec,

If PAC Control supported 64-bit floats, that method you mentioned (#2 above) looks like how you might convert your int to a float. Then again, if we supported 64-bit floats we’d probably also have a command, like our (new in 9.2) Int32ToFloatBits or older Move32Bits commands which would do this for you. (Perhaps they’d be called Int64ToFloatBits and Move64Bits.)

However, since we do NOT have any built-in 64-bit floats, that formula you mention:

my_Float = Power(-1, Sign) * (1+Mantissa/Power(2,52)) * Power(2, (Exponent-1023))

would have a problem: your mantissa can use up to 52 bits, while our 32-bit floats carry only about 24 bits of precision, so dividing it by Power(2,52) throws away most of the fraction. Depending on the exponent, Power(2, (Exponent-1023)) can also exceed the range of a 32-bit float. Of course, if you love doing math and writing code like what I shared in this thread, there are options for dealing with this. But staying within the data types we support is much, much simpler and easier to maintain!
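
One way to see the issue on a PC is to push a full 52-bit mantissa through a 32-bit float in Python. This is only an illustration of float32 behavior in general, not of how the controller evaluates the expression; `to_float32` is a made-up helper.

```python
import struct

def to_float32(value: float) -> float:
    """Round-trip a Python float through a 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

mantissa = (1 << 52) - 1                  # largest possible 52-bit mantissa field
as_f32 = to_float32(float(mantissa))      # what a float32 can keep of it
print(mantissa)                           # 4503599627370495
print(int(as_f32))                        # 4503599627370496, i.e. exactly 2**52
print(as_f32 / to_float32(2.0 ** 52))     # 1.0, the fraction has been rounded away
```

The mantissa rounds up to 2^52, so the (1 + Mantissa/2^52) term becomes 2 instead of a value just under 2. That kind of silent loss is why picking the bits apart, as in the chart above, is the safer route.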