Fixing Clamd stuck at 100% CPU

My postfix mail server was taking ages (well, minutes, but still, that is really ages in a computer world) to handle a single email with multiple processes taking a full CPU, all the time! What?

I got it solved (after a couple of hours making U-turns in dead ends) and as always, the answer was really trivial. But I know I’ll be losing another couple of hours next time I run a “yum update”, so here is another in the “let’s blog about it so I remember how to fix it next time” series!

The problem

After updating my CentOS 7 server, I noticed that clamd was taking 100% of a CPU core and, additionally, a number of clamscan processes were running, each also claiming most of a CPU core.

Clamd is the daemon process for clamav-server which is used by Amavis as anti-virus protection and should be a relatively light-weight process.
Clamscan is the “start up, read all the virus signatures in memory to scan a single item for viruses and exit” version that has been mostly replaced by clamd.

So a server using clamd should really never be running clamscan at all…

Looking at the postfix maillog under /var/log, things looked even worse. The log showed amavisd complaining that clamd was unresponsive. This immediately explained the presence of clamscan processes: amavisd will start up a clamscan for an email if it cannot reach clamd, as a fallback.

amavis[20362]: (20362-01) (!)connect to /var/run/clamd.amavisd/clamd.sock failed, attempt #1: Can't connect to a UNIX socket /var/run/clamd.amavisd/clamd.sock: Connection refused

amavis[20362]: (20362-01) (!)ClamAV-clamd av-scanner FAILED: run_av error: Too many retries to talk to /var/run/clamd.amavisd/clamd.sock (All attempts (1) failed connecting to /var/run/clamd.amavisd/clamd.sock) at (eval 134) line 659.\n

amavis[20362]: (20362-01) (!)WARN: all primary virus scanners failed, considering backups
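To get a feel for how often amavis cannot reach clamd, it can help to count these failures in the mail log. This is only a minimal sketch: the helper name count_clamd_failures is made up for illustration, and the default path /var/log/maillog is an assumption, so pass your own log file if yours lives elsewhere.

```shell
# Hypothetical helper: count log lines where amavis failed to connect to the clamd socket.
# The default path /var/log/maillog is an assumption; pass another file as $1.
count_clamd_failures() {
    log="${1:-/var/log/maillog}"
    # grep -c prints the number of matching lines (0 when there are none).
    # "|| true" hides grep's non-zero exit status for the zero-match case.
    grep -c 'clamd\.sock failed' "$log" || true
}
```

Calling count_clamd_failures without arguments reads the assumed default log. A number that keeps growing while mail comes in suggests clamd is still unreachable.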

The plot thickens

Sadly, this log does not tell us anything about why clamd is unresponsive so “strace” to the rescue!

Well, no…

Launching strace and attaching it to the running clamd process (using strace -p <pid>) showed nothing bad, except that it got killed all the time. The process started, got killed and got restarted again, etc… Every time getting a new process ID and breaking the strace.

I’m not going to paste the full strace output here because it is extremely long (containing loads and loads of read calls, which I believe is clamd reading all the virus signatures and setting up the in-memory state that allows it to process mails so quickly), but if your strace ends with

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fda413a1000
read(6, "443a6f443a6b3f3566382e66382e6638"..., 24576) = 24576
read(6, "be9568b7540ff765cff1550504000;55"..., 4096) = 4096
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fda41361000
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
+++ killed by SIGTERM +++

That is not good.

It should end with something like this:

munmap(0x7f60ec03c000, 262144) = 0
munmap(0x7f60ec07c000, 262144) = 0
munmap(0x7f60ec0bc000, 262144) = 0
munmap(0x7f60ec0fc000, 262144) = 0
munmap(0x7f60ec13c000, 262144) = 0
exit_group(0) = ?
+++ exited with 0 +++

But the auto-restarting was an important clue, especially when I noticed that the clamd process was killed and restarted exactly every 90 seconds….

That simply reeks of a dirty rotten timeout!
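If you want to confirm the restart rhythm without strace, watching the clamd process ID change over time is enough. The sketch below is just one way to do that; the function name watch_clamd_pid and its default interval and round count are assumptions for illustration.

```shell
# Hypothetical helper: print a timestamped line whenever the clamd PID changes.
# $1 = seconds between checks (assumed default 10), $2 = number of checks (assumed default 30).
watch_clamd_pid() {
    interval="${1:-10}"
    rounds="${2:-30}"
    last=""
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # -x matches the exact process name, -o picks the oldest match.
        pid=$(pgrep -o -x clamd || true)
        if [ "$pid" != "$last" ]; then
            echo "$(date '+%H:%M:%S') clamd pid: ${pid:-none}"
            last="$pid"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running watch_clamd_pid 5 60 checks every 5 seconds for 5 minutes. New PIDs showing up at a steady interval point to something restarting the daemon on a timer.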

The solution

SystemD, who saved us from “initscript hell” (I won’t mind if you don’t agree! 🙂 ) is a much more complex animal and one of the things it has is a “TimeoutStartSec” setting that tells SystemD how long to wait for a process to start. When that timeout has expired, it will kill the process and restart it, ad infinitum.

The default value is set in /etc/systemd/system.conf:

 #DefaultTimeoutStartSec=90s

So it looks like something in clamav changed, causing the startup time to be a lot longer now, clocking in at over 3 minutes.

And nice little SystemD really thinks that 90 seconds should be enough for anybody and promptly restarts it again, and again…. and again….

This keeps a new clamd process claiming a full CPU until it is almost ready, and causes amavisd to spawn clamscan processes to work around the unresponsive clamd daemon.
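To see which default start timeout is configured on a given machine, the relevant setting in system.conf is DefaultTimeoutStartSec. Here is a small sketch that reads it; the helper name default_start_timeout is made up, and it ignores any drop-in files under system.conf.d, so treat the result as a hint rather than the final word.

```shell
# Hypothetical helper: print the DefaultTimeoutStartSec configured in system.conf.
# $1 = config file (defaults to /etc/systemd/system.conf).
default_start_timeout() {
    conf="${1:-/etc/systemd/system.conf}"
    # Only uncommented lines count; a leading '#' means the built-in default applies.
    value=$(sed -n 's/^[[:space:]]*DefaultTimeoutStartSec[[:space:]]*=[[:space:]]*//p' "$conf" | tail -n 1)
    # 90s is systemd's built-in default when nothing is set explicitly.
    echo "${value:-90s}"
}
```

For the value that actually applies to the running unit, systemctl show clamd@amavisd -p TimeoutStartUSec should print the effective start timeout.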

The fix

Yep, this is why you came here. Finally! 🙂

Simply tell SystemD to wait a little longer for the clamd process to finish starting up. As I wrote earlier: trivial.

Edit the /lib/systemd/system/clamd@.service file and add the TimeoutStartSec setting to the [Service] section:

[Unit]
Description = clamd scanner (%i) daemon
Documentation=man:clamd(8) man:clamd.conf(5) https://www.clamav.net/documents/
# Check for database existence
# ConditionPathExistsGlob=@DBDIR@/main.{c[vl]d,inc}
# ConditionPathExistsGlob=@DBDIR@/daily.{c[vl]d,inc}
After = syslog.target nss-lookup.target network.target

[Service]
Type = forking
TimeoutStartSec = 10min
ExecStart = /usr/sbin/clamd -c /etc/clamd.d/%i.conf
Restart = on-failure

I set it to 10 minutes, which is longer than it strictly needs to be but I like a nice margin and I am happy with it. Feel free to experiment and find a shorter period that still works.

Note: As Thomas wrote in the comments below, according to the official systemd rule-book, you should not be modifying the files under /lib directly, but instead copy the file under /etc and make your local modifications there. So even though the above works, any update of clamd will overwrite your changes and you will need to apply them again unless you use the /etc file location. Please see my reply to Thomas’ comment below for more info about why I did it my way! 😉
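For those who prefer the /etc route, a drop-in file is a lighter alternative to copying the whole unit: it overrides only the one setting. The sketch below writes such a drop-in for a single instance. The helper name write_clamd_timeout_dropin, the file name timeout.conf and the instance name amavisd are assumptions; adjust them to your setup.

```shell
# Hypothetical helper: write a drop-in that raises the start timeout for one clamd instance.
# $1 = instance name (assumed "amavisd"), $2 = systemd config root (assumed /etc/systemd/system).
write_clamd_timeout_dropin() {
    instance="${1:-amavisd}"
    root="${2:-/etc/systemd/system}"
    dir="$root/clamd@$instance.service.d"
    mkdir -p "$dir"
    # Only the overridden setting goes here; everything else still comes from the packaged unit.
    printf '[Service]\nTimeoutStartSec=10min\n' > "$dir/timeout.conf"
    # Print the path so the caller can see what was written.
    echo "$dir/timeout.conf"
}
```

Run it as root without arguments to write under /etc/systemd/system, then run systemctl daemon-reload so SystemD picks up the new file. A package update should leave the drop-in alone, so remember to remove it once the slow startup is fixed upstream.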

The Cleanup

After this change, make sure to activate it.

First stop both amavisd and clamd@amavisd:

systemctl stop amavisd
systemctl stop clamd@amavisd

Do check that there are no clamd or clamscan processes running anymore (ps -ef | grep clam). If there are, either wait for them to go away or kill them.

Next, tell SystemD to reload the config and restart amavisd (which should start clamd for you):

systemctl daemon-reload
systemctl start amavisd

You should now see clamd running at 100% CPU again for about 3 to 4 minutes, after which it will detach from the startup process and happily play nice in the background.

Use the “top” command to see it right at the top for 3 to 4 minutes, after which it should disappear. Check with “ps -ef | grep clam” to confirm that the process is indeed still there!
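Instead of staring at top, you could also let a small loop poll SystemD until the unit reports itself as active. This is a sketch under assumptions: the helper name wait_until_active, the unit name clamd@amavisd and the default limits are illustrative, and it relies on the unit staying in the "activating" state until clamd has finished loading and detached, as described above.

```shell
# Hypothetical helper: poll until systemd reports the unit as active, or give up.
# $1 = unit (assumed clamd@amavisd), $2 = max seconds to wait, $3 = poll interval (must be > 0).
wait_until_active() {
    unit="${1:-clamd@amavisd}"
    limit="${2:-600}"
    step="${3:-10}"
    waited=0
    while [ "$waited" -le "$limit" ]; do
        # is-active prints the current state, e.g. "activating", "active" or "failed".
        if [ "$(systemctl is-active "$unit")" = "active" ]; then
            echo "$unit active after about ${waited}s"
            return 0
        fi
        sleep "$step"
        waited=$((waited + step))
    done
    echo "$unit still not active after ${limit}s" >&2
    return 1
}
```

Calling wait_until_active right after starting amavisd gives a rough startup time, which is handy if you want to pick a tighter TimeoutStartSec than 10 minutes.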

Now amavis should no longer be complaining about unresponsive clamd and no more clamscan processes should appear!

Get on with your day and enjoy that coffee.

16 thoughts on “Fixing Clamd stuck at 100% CPU”

chuck elliot says:

September 2, 2019 at 10:45 am

Good job! I had the same problem and your diagnosis was spot on. Thanks.

1. Jhon Masschelein says:
  
  September 2, 2019 at 11:39 am
  
  Excellent!
  Happy you were able to find this and I could help a little. 🙂
  
Andrew Luck says:

September 6, 2019 at 5:38 pm

Worked for me too. Thanks.

Taro Ich says:

September 11, 2019 at 10:58 am

Thank you so much!
My pc saved by you!

IT says:

September 23, 2019 at 11:45 pm

Thanks, your solution solved our problem as well.
A little note: after “systemctl daemon-reload” there was probably meant to be “systemctl start amavisd”.
Unless that was there to check if the reader is before that coffee 🙂

1. Jhon Masschelein says:
  
  September 25, 2019 at 9:22 am
  
  Cut&paste failed me there, good catch!
  
  I’ve updated the text now.
  Thanks for letting me know. 🙂
  
George Ray says:

October 16, 2019 at 5:35 pm

I have spent the last two days trying to resolve this exact issue and this worked like a charm. Thank you, you my good sir have my gratitude.

Martin says:

November 3, 2019 at 6:02 pm

Good find!

My clamd was just on the fence, taking between 80 and 105 seconds, so it sometimes took a long time (when restarted on the running system) before it worked again, sometimes it worked right away (usually on boot); the kind of behaviour for a bug to drive me up the walls…

So: thank you

David Means says:

November 16, 2019 at 8:19 pm

Many thanks for this analysis. Since I suffered an automatic upgrade of my CentOS 7 kernel and the consequent reboot, this has been bedeviling me for a day and a half. Should this be a bug report?

Marcos Souza says:

November 23, 2019 at 1:49 am

Finally a different and real solution for this problem!
This worked like a charm!
Thank you very much!

Thomas H Jones says:

December 4, 2019 at 9:09 pm

Nice solution. Small comment/question: shouldn’t you have, instead of modifying the `/lib/systemd/system/clamd@.service` file copied it to `/etc/systemd/system/clamd@.service` and made your modification there? My understanding of systemd is that localized modifications of packaged systemd service-definitions are usually supposed to go under `/etc/systemd/`.

1. Jhon Masschelein says:
  
  December 6, 2019 at 8:26 am
  
  Hi Thomas,
  
  Yes, you are right: your way is the right way to do that. The idea being that the files under /etc should not be overwritten when you install a new version of whatever package uses it. However in the past I’ve experienced that not all package builders respect that so I usually kinda don’t bother anymore… And you are absolutely right that this is a “bad” thing!
  
  However in this particular case, I purposefully chose to edit the file under /lib to ensure that when a new clamd gets released, it overwrites the changes I made. My hope is that by then this problem has been fixed at the source and my changes are no longer needed. And if they are not… well that’s why I wrote this thing. 😀
  
  But when people read this blog they should get the right way to do it and not suffer from my exotic mindset so I’ll add something to the blog over the weekend to make sure I don’t lead anyone astray. 🙂
  
  Thanks for calling me out on this! 😀 :thumbsup:
  
Rideout says:

December 15, 2019 at 12:23 am

Thanks! This fixed it for me. One of my servers kept crashing out of nowhere. I edited the timeout for clamd and it worked. After which I also ran an update for ClamAV / Clam Daemon, and rebooted.

I copied it to ‘/etc/systemd/system/clamd@.service’ where it remained after the update process.

Lewis says:

March 21, 2020 at 6:36 pm

I additionally had to create a swap file as even with 2GB of RAM clamd couldn’t start. My file is 4GB.

Robert Kupka says:

October 21, 2021 at 11:04 am

I would like to add some notes.
I think the primary culprit, why clamd stops responding is that freshclam and/or other signature update scripts (such as clamav-unofficial-sigs) issue a command to clamd to reload its updated databases. While clamd is working on reload (usually clamd restart), it is not listening on .sock, so amavis cannot use it and falls back to backup AV, which is clamscan. And that causes enormous swapping and eating up all available memory, because amavis spawns as many clamscan processes as it needs.
So, I propose another solution – disable backup AV (clamscan) in your amavisd.conf file.
This will cause temporary fail on AV scanner and AMAVIS will complain in the logs, but once the clamd process is restarted, amavis should work again.
On my server, restart of clamd usually takes 2-3 minutes (yeah, huge signature files). I can live 2-3 minutes without AV scanning of incoming emails. Amavis will still do full SPAM checking.

1. Scott says:
  
  May 7, 2024 at 12:09 am
  
  Robert, thanks. clamscan was kicking off for me, too, after the change. And if my problem was memory related, that only made things worse. Never thought to disable it as backup, but I agree, I can live without it briefly while clamd recovers.