Please excuse the wall of text that is about to assail you. I have been at this problem for several weeks and have read every Google result and TechNet article I can lay my hands on that seems in any way relevant. So here goes.
The scenario:
I have two servers, one in New Hampshire and the other in Maryland. Both are domain controllers. The one in New Hampshire is an SBS 2011 server; the one in Maryland is a regular 2008 R2 server. The two sites are linked by a pair of ZyWall 200 units that form an IPSec tunnel between them. SiteNH uses the subnet 10.0.0.0/24 and SiteMD uses the subnet 10.0.1.0/24. By the time the MD server was brought down to its final location, it had been unplugged for well beyond the Active Directory tombstone lifetime, so, as per recommendations I read here, I forcefully demoted it, did a metadata cleanup, rejoined it, and re-promoted it. After that we set up a DFS store between the two servers to replicate a particular directory's data. All this worked flawlessly for a little over 24 hours.
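As a quick sanity check on the addressing described above, the two site subnets can be tested for overlap and a given address mapped to its site. This is only a minimal sketch: the subnets and site names come from the post, but the `site_for` helper and the sample address in the comment are hypothetical and are not taken from either server.

```python
import ipaddress

# Subnets as described in the post; the dictionary itself is illustrative.
SITES = {
    "SiteNH": ipaddress.ip_network("10.0.0.0/24"),
    "SiteMD": ipaddress.ip_network("10.0.1.0/24"),
}


def subnets_overlap(sites):
    """Return True if any two site subnets share addresses."""
    nets = list(sites.values())
    return any(
        a.overlaps(b)
        for i, a in enumerate(nets)
        for b in nets[i + 1:]
    )


def site_for(address, sites=SITES):
    """Return the name of the site whose subnet contains address, or None."""
    ip = ipaddress.ip_address(address)
    for name, net in sites.items():
        if ip in net:
            return name
    return None


# Example call with a made-up address:
# site_for("10.0.1.25")
```

If a domain controller's address falls outside every subnet defined for its site, that would be worth ruling out before looking further at the KCC errors.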
The first problem that showed up was that DFS stopped replicating. It was logging the following error in the event log:
Event ID: 5002
The DFS Replication service encountered an error communicating with partner MAIN-SBS for replication group galaxy.local\galaxydfs\companymaryland.
At about the same time, a few other errors also showed up in the Active Directory event log:
Event ID: 1311
The Knowledge Consistency Checker (KCC) has detected problems with the following directory partition
Event ID: 1865
The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.
Event ID: 1566
All directory servers in the following site that can replicate the directory partition over this transport are currently unavailable.
While this was happening, I checked "repadmin /showrepl" and saw that all replication points had failed with "The remote procedure call failed and did not execute". A "repadmin /replicate main-md main-sbs DC=galaxy,DC=local /force" also failed with a similar error.
At this point I was thinking connectivity. However, from the Maryland server I could ping the NH server without a problem, which also meant I was getting DNS resolution. I was also able to reach the NH server with "rpcping -s main-sbs" without a problem, and client computers at the Maryland site could still access the NH server directly through "\\main-sbs". Checking the ZyWalls themselves didn't turn up anything either: the VPN was established and showed no problems over the last 24 hours. Yet no matter what I did, Active Directory would not replicate.
In an act of desperation I rebooted the Maryland server. Once it came back up I tried another manual replication, which, surprisingly, worked fine. DFS suddenly started replicating as well, and a few hours later everything was properly replicated and working perfectly. Since I hadn't found the actual cause, I figured this wasn't the end, and about 24 hours later it all stopped working again, in exactly the same way.
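The connectivity checks above (ping, rpcping, SMB access) show that some traffic crosses the tunnel, but they do not test every port a domain controller uses. A rough way to go one step further is to attempt plain TCP connections to the well-known ports. The sketch below is an illustration under assumptions, not a diagnostic tool: the port list is a typical subset rather than a complete one, the helper names are made up for this example, and it does not probe the dynamically assigned RPC port (49152-65535 by default on Server 2008 and later) that replication moves to after contacting the endpoint mapper on 135.

```python
import socket

# Typical TCP ports a domain controller listens on (illustrative subset).
DC_PORTS = {
    "DNS": 53,
    "Kerberos": 88,
    "RPC endpoint mapper": 135,
    "LDAP": 389,
    "SMB": 445,
}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_dc(host, ports=DC_PORTS):
    """Probe each named port and return a {name: reachable} mapping."""
    return {name: tcp_port_open(host, port) for name, port in ports.items()}


# Example call, run from the Maryland server while replication is failing
# ("main-sbs" is the NH server's name from the post):
# for name, ok in check_dc("main-sbs").items():
#     print(name, "open" if ok else "unreachable")
```

Comparing the results right after a reboot with the results once replication has stopped might show whether anything changes at the network level, though an open port alone does not prove the service behind it is healthy.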
At this point I'm a bit baffled. Directory Services seems to fall apart every 24 hours, and rebooting the remote server in Maryland fixes it for another 24 hours. When it stops working it complains of RPC communication failures, yet RPC still seems to be working fine, both across the VPN from clients and on the broken server itself via rpcping in either direction. The only other event of any interest is:
Event ID: 4005
The Windows logon process has unexpectedly terminated.
This usually starts showing up once the directory services stop working; I'm not sure whether it's a symptom or a cause. Googling for that event in conjunction with the directory services errors turned up nothing. The only other cause I've read about is a lack of memory or system resources, but that doesn't seem to be a problem on either server. The SBS server has 32 GB of memory, and though it tends to use most of it, there is usually at least 1-2 GB free. The 2008 R2 server has 16 GB of memory and usually has well over 12 GB of it free, as it does nothing more than host AD and DFS.
I expect there will be further questions or diagnostic requests, so please feel free to ask. At this point I'll try almost anything; I have just about run out of ideas for this one. Thanks in advance!
Justin Shea
LAN Network Connections, Inc.