When this issue was first discovered several months ago, nearly all domain controllers in the environment were “losing” their SRV records.
After extensive internal troubleshooting and research failed to turn up a remedy, a ticket was opened with Microsoft Support (SR# 111071153777276). After a couple of weeks of troubleshooting and information gathering, it was determined that a combination of inconsistent DC configuration, frequent DNS scavenging, Server 2003’s 24-hour refresh interval for SRV records, and a poorly designed site topology converged to cause the SRV records to disappear.
Microsoft Support engineers recommended implementing a group policy setting on all domain controllers to force re-registration of SRV records every 60 minutes, as opposed to Server 2003’s default of 24 hours, and standardizing the DC configuration across the domain.
Their recommendations, along with replication topology improvements, significantly reduced the number of domain controllers that “lose” their SRV records. However, approximately 10 domain controllers (20%) still lose their SRV records unless someone intervenes (running dcdiag or netdiag, restarting the Netlogon service, etc.). It has since been discovered that these DCs also lose their own reverse lookup PTR records.
The DC will create the SRV records on itself, but they get deleted at the next replication cycle. No replication errors are reported, and the DNS debug log shows the records being deleted at the same time replication occurs but doesn't give a reason.
All domain controllers are running Windows Server 2003 R2 SP2 Enterprise Edition, with a mix of 32-bit and 64-bit editions, and are relatively current with patches. Each DC/DNS server points to itself for primary DNS, and secondary DNS points to a remote AD DC/DNS server. The domain functional level is Windows 2000 native and the forest functional level is Windows 2000. The FSMO roles are split between two domain controllers at the datacenter. There's a DC at each site with 30 or more users. All DCs are global catalogs with the exception of the Infrastructure Master DC.
The WAN is a managed MPLS network with Cisco routers, with connections ranging from 1.5 Mbps T1 up to 100 Mbps fiber at the datacenter. All LAN hardware is “modern” Cisco 3750/2960/6513. Servers are connected to the LAN at 1 Gbps. We rarely have WAN or LAN issues.
“Domain.ad” is the only forward lookup zone; it is AD-integrated with dynamic updates allowed. Aging is set to 1 day no-refresh and 3 days refresh. There are AD-integrated reverse lookup zones for all subnets, with aging set to 1 day no-refresh and 2 days refresh. Scavenging runs on a 3-day cycle on only one DC/DNS server in the enterprise.
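To make the aging numbers concrete, here is a rough Python sketch of the arithmetic. It assumes the standard aging rule as I understand it: a record's timestamp cannot be refreshed during the no-refresh interval, and the record only becomes eligible for scavenging once no-refresh plus refresh has elapsed since its timestamp. The function name and the example timestamp are made up for illustration.

```python
from datetime import datetime, timedelta

def scavenge_eligible_at(timestamp, no_refresh_days, refresh_days):
    """Earliest time a dynamic record with this timestamp could be scavenged,
    assuming it is never refreshed again (no-refresh + refresh intervals)."""
    return timestamp + timedelta(days=no_refresh_days + refresh_days)

# Hypothetical time at which a record was last stamped.
stamp = datetime(2011, 7, 1, 12, 0)

forward = scavenge_eligible_at(stamp, no_refresh_days=1, refresh_days=3)
reverse = scavenge_eligible_at(stamp, no_refresh_days=1, refresh_days=2)

print("forward zone record eligible after:", forward)
print("reverse zone record eligible after:", reverse)
```

If that rule holds here, a record that is re-registered every 60 minutes should get a fresh timestamp shortly after each 1-day no-refresh window ends, so it should never come close to the 4-day (forward) or 3-day (reverse) threshold.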
There isn’t anything that stands out as different about these 10 domain controllers, other than that they lose their SRV and reverse PTR records.
On any one of these 10 DCs, if I change the primary DNS server to any DC/DNS server other than itself, it starts working just fine. The SRV records get created automatically by the 60-minute SRV record refresh interval, the reverse lookup pointer for the DC’s IP address gets created, and they survive replication for at least 48 hours. The problem only seems to occur when the DC is pointing to itself for DNS.
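For anyone who wants to compare what the affected DC's own DNS service returns against what another DC/DNS server returns, this is a small Python sketch that prints the nslookup commands to run. It only covers a handful of the well-known DC locator SRV names, not the full set Netlogon registers, and the function name, site name, and IP addresses are placeholders rather than values from my environment.

```python
import ipaddress

def dc_check_commands(domain, site, dc_ip, dns_servers):
    """Build nslookup commands that query a few DC locator SRV records and
    the DC's PTR record against each DNS server in turn."""
    # A small, non-exhaustive subset of the SRV names a DC registers.
    srv_names = [
        f"_ldap._tcp.{domain}",
        f"_ldap._tcp.{site}._sites.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
    ]
    # e.g. 10.1.2.3 -> 3.2.1.10.in-addr.arpa
    ptr_name = ipaddress.ip_address(dc_ip).reverse_pointer

    commands = []
    for server in dns_servers:
        for name in srv_names:
            commands.append(f"nslookup -type=SRV {name} {server}")
        commands.append(f"nslookup -type=PTR {ptr_name} {server}")
    return commands

# Placeholder values: the affected DC (pointing at itself) and one other DC.
cmds = dc_check_commands("domain.ad", "SiteA", "10.1.2.3", ["10.1.2.3", "10.0.0.10"])
for cmd in cmds:
    print(cmd)
```

Running the printed commands right after a registration and again after the next replication cycle should show which server's copy of the zone drops the records first.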
I’m at the point of rebuilding one of them from scratch to see if it fixes the problem, but I’d rather avoid that if possible. Any ideas would be appreciated. If more information is needed/desired, just let me know.
Thanks