SNAP-PAC-R1 controller experiencing connection problems and apparent corruption

Hi,

I’ve been running a SNAP-PAC-R1 in my home automation setup for the last 7 years and it has been very reliable so far. Until last night that is…

The first symptom I noticed was that some of my NodeRed flows could no longer poll the R1. The “PAC read” nodes in the flows would sometimes show “Connection refused” or “ECONNRESET” when they were being refreshed periodically by an inject node.
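To timestamp these dropouts independently of NodeRed, a small TCP probe run from another machine on the same switch could help. The following is only a minimal sketch: the function names `probe` and `watch`, the IP address and the port are all assumptions and are not taken from the R1 or its NodeRed nodes.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Attempt a single TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Refused, reset and timed-out connections are all OSError subclasses;
        # the class name is enough to tell them apart in a log.
        return False, type(exc).__name__


def watch(host, port, interval=10.0, count=None):
    """Probe repeatedly and print a timestamped line on every state change."""
    last_ok = None
    attempts = 0
    while count is None or attempts < count:
        ok, detail = probe(host, port)
        if ok != last_ok:
            state = "UP" if ok else "DOWN"
            print(f"{datetime.now():%H:%M:%S} {host}:{port} {state} ({detail})")
            last_ok = ok
        attempts += 1
        if count is None or attempts < count:
            time.sleep(interval)
```

Calling something like `watch("192.168.1.50", 443)` (hypothetical address and port, to be replaced with the controller's IP and whichever port the flows actually poll) would print one line each time the port goes from reachable to unreachable or back, which makes it easy to compare against the controller's own message queue.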

The second symptom is that the R1 now frequently fails to connect to the FTP server it uses to log all sensor data to CSV files (temp, humidity, electrical voltage/current/VA, smoke detectors, leak detectors and heating loop information). I’ve verified that the FTP server is accessible, appears to be running normally and is not running out of storage space, but the R1 frequently returns the following error messages:
-52 (Invalid connection)
-36 (Invalid command or feature not implemented)
or
-449 (FTP: error while setting local port number that incoming data connections should use)

Thirdly, after rebooting the controller from PAC Manager I noticed that some of the startup messages I normally add to the queue appear to be garbled. For instance, when the unit powers up I check that all charts are indeed running, send an email and add this info to the queue like so:

// Start the chart; a status of 0 is treated as success
I_Status = StartChart(MODBUS_POLLING_LOOP);
if (I_Status == 0) then
  EMAIL_Body_str_arr[3] = "- MODBUS_POLLING chart started successfully" + CRLF;
  AddMessageToQueue(4, EMAIL_Body_str_arr[3]);
  DecrementVariable(Chart_Counter);
else
  EMAIL_Body_str_arr[3] = "- MODBUS_POLLING chart failed to start" + CRLF;
  AddMessageToQueue(16, EMAIL_Body_str_arr[3]);
endif

However, the information in the message queue appears like this (notice the stray, repeated fragments of the word “successfully”):

Device has powered up. (‘Powerup clear expected’ message received.)
Controller was restarted
SNTP clock sync successful!
Attempting to start 8 charts

  • MODBUS_POLLING chart started successfully
  • LEAK_DETECTION_LOOP chart started successfully
  • ALARMS_LOOP chart started successfully
    sfully
  • DATA_LOGGING_LOOP chart started successfully
  • HYDRONIC_HEATING_LOOP chart started successfully
  • INPUT_LOOP chart started successfully
    cessfully
  • MODBUS_LIGHTING chart started successfully
    ully
  • TIMERS chart started successfully
    ssfully
    ully
    All charts were started successfully
    sfully
    ully

The I/O part of the controller still seems to be working fine (lights are turning ON and OFF, switches are being read, sensors and heating loops are still running as normal etc) but I seem to be experiencing multiple failures with external communications and logging.

I’ve tried restarting the controller and erasing and re-uploading its program a few times, but the problems are still present. I just noticed there has been a firmware update since I last checked (I’m currently running R10.4c). I’ll try updating to see if this helps and post back, but if you have any suggestions or possible explanations for why this could be happening all of a sudden, I would appreciate it. I have a feeling this might be some kind of memory corruption issue…

Thanks!

Two things jump to mind.

  1. Check for 5.1v at the rack (after the diode).
  2. Check the network cable and look for other network issues, such as an upstream switch going bad or some device on the network that has just (as of last night) started spewing out broadcast packets (a broadcast storm).

Hi Beno,

Thanks for the Sunday reply. I wasn’t asking for the 24/7 emergency service level but it is much appreciated :)

So after my original post I installed the latest firmware update. After uploading my solution and verifying that /admin/creds and /admin/keys were still intact, I thought the problem had vanished, but FTP logging and NodeRed comms started failing again soon after.

I checked the input voltage on the backplane and it was 5.0v so I bumped it up until it read 5.1v after the diode. I suspect it has been like this since the original install though so I doubt this was actually the culprit.

I checked the network cable between the panel and the switch. I doubt the switch is at fault since all the other devices connected to it are working fine (router, wireless AP, the FTP, NodeRed and WWW servers, which are all Raspberry Pis, the computer I’m currently using, etc.).

One thing I hadn’t noticed previously is that the problem appears to be periodic. Looking at the message queue in PAC Control I can see the following:

14:05.44 Device has powered up
14:05.44 A few initial FTP upload failures
14:05.49 FTP upload failures stop and normal operation resumes
14:35.03 FTP failures reappear
14:36.09 FTP failures stop and normal operation resumes
15:06.24 FTP failures reappear
15:07.29 FTP failures stop and normal operation resumes
15:38.44 FTP failures reappear
15:39.49 FTP failures stop and normal operation resumes

So approximately every 30 minutes FTP connections start failing for about a minute. My logging chart normally uploads fresh data to the FTP server every minute.
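For anyone who wants to check the arithmetic, here is a rough sketch that works out the outage lengths and the spacing between them from the three periodic failures listed in the message queue. It assumes the timestamps read as HH:MM.SS and leaves out the short burst right after power-up; the function names are made up for illustration.

```python
from datetime import datetime

# (failures reappear, failures stop) pairs copied from the message queue.
# Assumption: the stamps are HH:MM.SS on the same day.
EVENTS = [
    ("14:35.03", "14:36.09"),
    ("15:06.24", "15:07.29"),
    ("15:38.44", "15:39.49"),
]


def parse(stamp):
    """Convert an HH:MM.SS stamp to a datetime (date part is irrelevant)."""
    return datetime.strptime(stamp, "%H:%M.%S")


def outage_durations(events):
    """Seconds between the start and the end of each failure window."""
    return [(parse(end) - parse(start)).total_seconds() for start, end in events]


def onset_gaps(events):
    """Seconds between the start of one failure window and the next."""
    starts = [parse(start) for start, _ in events]
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


def healthy_spans(events):
    """Seconds of normal operation between a recovery and the next failure."""
    return [
        (parse(events[i + 1][0]) - parse(events[i][1])).total_seconds()
        for i in range(len(events) - 1)
    ]


print(outage_durations(EVENTS))  # [66.0, 65.0, 65.0]
print(onset_gaps(EVENTS))        # [1881.0, 1940.0]
print(healthy_spans(EVENTS))     # [1815.0, 1875.0]
```

So each outage lasts 65 to 66 seconds, and the failures start roughly 31 to 32 minutes apart, with a little over 30 minutes of normal operation in between. Three samples are not many, so these numbers are only a starting point for matching the pattern against anything else on the network that runs on a similar schedule.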

I’ll sniff around with Wireshark to see if there are any periodic broadcasts happening, but this started late last night around 2 AM while everyone was asleep, so I can’t think of what may have changed around that time.

Thanks again!

Are you using a managed or unmanaged network switch? Did you restart the Pis too? Something slightly similar happened to me. I initially blamed an unmanaged switch, but in the end the culprit was a Wi-Fi AP connected to the switch. I wiped the AP and updated its firmware, and the issue came back; I then replaced the AP with a spare and the issue was gone. The AP was part of a Wi-Fi bridge to an R1, and I got all sorts of weird comm issues while this was happening. In a similar fashion, the issues always started 5 minutes after a reset, so I figured it was the internal electronics overheating.

Hi @jelectron

Thanks for the response. Since I posted my reply to @Beno and tweaked the 5V supply to the R1 I haven’t seen the problem again. I simultaneously disconnected another PLC which was being bench tested while connected to the same switch. I doubt it could have been the culprit but I haven’t had time to try reproducing the problem or perform an in-depth analysis since it was “fixed”…

Thanks,