TROUBLESHOOTING TELCO CIRCUITS

Chuck Ouimette, Network Systems Test Group
NSTG::OUIMETTE
15-FEB-1994

When troubleshooting problems with wide-area network components (routers, WAN bridges, etc.), the responsibility may fall upon your shoulders to *prove* that the problem is not in the router/bridge or other equipment under MVS contract. In many cases, this requires that you take an active hand in troubleshooting the telephone circuits.

TERMINOLOGY:

DSU/CSU - A single modem-like component used with a high speed digital telephone line. Like a modem, it translates between the format used on the phone line and the format used on the V.35 or RS 232 connection to the computer or router.

CSU - Channel Service Unit, required by telco for testing/conditioning. It is the portion of the DSU/CSU component that handles the telephone line. It may be packaged as a separate card/box, or included in the same package with the DSU. It includes some test abilities used by Telco, and other tests that can be done by the users.

DSU - Digital Service Unit, converts signals from V.35 to DS0/DS1 levels. It is the portion of the DSU/CSU component that handles interaction with the computer or router. It works in conjunction with the CSU, and may be packaged as a separate card/box. It also includes some test functions that can sometimes be used by Telco, and other tests that can be done by the users.

OCU - Office Channel Unit. A card in the telephone company's switching office that communicates directly with the CSU function of the DSU/CSU at the customer's site.

DACS - Digital Access Cross-connect System.

SMARTJACK - A telco-provided component that allows them to loop a line and adjust transmission characteristics. It also provides a connection point (jack) for the DSU/CSU.

POP - Point of Presence. The nearest switching office of a particular telco vendor.

CO - Central Office. A telephone company building where lines are connected to switches.
A2 - Line termination unit, usually owned by telco.

D4/ESF/AMI/B8ZS - Framing formats (D4, ESF) and line codes (AMI, B8ZS) that must be set the same on both DSU/CSUs and on the telephone line. The telco tester can find the line's settings in their line records.

PROBLEM ISOLATION.
As a general rule of thumb, the telco circuit is the culprit in the majority of apparent datalink WAN problems. This is because, in the typical WAN circuit, the majority of "fallible" components are in the Telco's domain. Whereas the DTE equipment and DSU/CSU are localized on each end, the telephony may span continents, and is subject to adverse weather and power spikes/outages across a wide geography, and to the occasional backhoe which inadvertently cuts the cables. While DTE's, DSU's, and modem cables do fail, the telephony is usually the weakest link.

Unfortunately, finger-pointing is relatively common when troubleshooting Telco WAN circuits. There are many reasons for this. A point-to-point circuit usually involves equipment from several different vendors. When one component in the circuit fails, it's not always easy to determine which vendor's equipment is at fault. Problems are frequently intermittent or data-sensitive, and may not be seen by the testing telco technician.

GATHERING INFORMATION:
When working with a Telco provider in troubleshooting a suspect WAN link, it is very important to collect as much problem symptom data as possible. Be explicit! The more you can isolate the failure mode and symptoms, the better the chance that Telco will be able to see and repair the problem, and the less likely you are to end up in a finger-pointing contest.

For example, if you call Telco and say "The circuit is bouncing, please check it out", they are likely to check it out with a brief test, find no problem, and turn it back over to you as "No Trouble Found".
If, however, you are able to say "I'm seeing a high CRC error rate at the Boston end of the circuit, but only when I run an all 0's test pattern. Mixed 1's & 0's or all 1's run fine, and the New York end never sees any errors. Please *stress test* the circuit with an all 0's pattern for at least an hour, towards Boston.", then you greatly increase Telco's chances of resolving the problem for you and your customer. If you can also say "I have already run local DSU loopbacks from the DTE's at both ends, and only see errors when going across the telco circuit", this gives the technician a good reason to *look* for problems on the circuit.

Run all the tests you can against the telco circuit, and make note of all successful as well as failed tests. Remember, whenever possible, check *both* ends of the circuit for power to the DSU/CSU & DTE, as well as proper DTE operation (i.e., is the router booted?).

TESTING WITH THE DSU/CSU:
Most DSU/CSU's offer various loopback options. Keep in mind that different vendors may implement loopbacks differently, or call them by different names. Consult the specific manual for the DSU/CSU you are using.

LOCAL LOOPBACK:
This function provides a data loopback towards the DTE equipment from the local DSU/CSU. It is used when the DTE equipment has the ability to generate test patterns (or probes, such as a "ping"), then read the returned data to check for errors. This tests the DTE, cable, and usually the DSU portion of the CSU/DSU. On the DSU/CSU, this function may be called "LDL" (Local Digital Loopback), "LAL" (Local Analog Loopback), or just "LL", depending on where the loopback is implemented in the DSU/CSU.

REMOTE LOOPBACK:
This function sends a message across the circuit to the remote DSU/CSU, instructing it to bring up a loopback towards the WAN circuit.
Test patterns can then be run from the local DTE, through the local DSU/CSU, across the WAN circuit to the remote DSU/CSU, where they are looped back through the WAN circuit and the local DSU/CSU to the DTE. This tests the largest percentage of the WAN connection. This function is usually called "RL" or "RDL".

UNIDIRECTIONAL OR BIDIRECTIONAL LOOPBACK:
Some DSU/CSU's implement loopbacks in only one direction at a time, i.e., a Local Digital Loopback will only loop data back to the DTE. Other loopbacks are bidirectional, such that an LDL will loop data back to the DTE and also loop back data coming in from the Telco side. Consult the individual DSU/CSU's manual for implementation specifics.

DSU/CSU INDICATORS:

ALARM RED:
This condition exists when the incoming signal (from the telco side) is out of frame (OOF) for at least 2.5 seconds. OOF is when 2 out of every 4 framing bits are in error. When this is detected, the DSU/CSU will transmit a "YELLOW ALARM" signal to the remote DSU/CSU.

ALARM YELLOW:
A specific pattern of 1's and 0's which indicates that the remote DSU/CSU is unable to synchronize on the received signal (OOF, see alarm red, above).

ALARM BLUE:
This alarm indicates that there is a loss of signal on the remote DSU/CSU's telco interface. It is normally signaled as an unframed all 1's pattern (also called AIS, Alarm Indication Signal).

TESTING PROCEDURES:
If available, run DTE tests against a local CSU loopback at each end, to verify the cabling & DTE equipment. Note that most DSU's run on internal clocking when in loopback mode; if there is a clocking mismatch between the DSU's and the circuit, local loops may work fine, while tests across the circuit may fail. In most cases, the CSU/DSU's should be optioned to take their clocking from the telco circuit.

Doublecheck your DSU/CSU optioning, and have telco verify the optioning of the circuit (B8ZS, AMI, D4, ESF, etc). An optioning mismatch produces errors, but it doesn't mean the circuit itself is dirty.

CSU/DSU's (as well as routers) will sometimes get into "brain dead" mode, and only a power cycle clears them.
It's worth a try before calling telco, as they may take several hours to respond.

CALLING TELCO:
Once you have gathered as much symptom information as you can, and you believe the problem is in the WAN circuit, call the circuit provider with your symptom & test information. When calling a Telco provider, have the following information ready:

   1. Your callback phone number
   2. The customer's company name
   3. The telco circuit ID # of the suspect circuit
   4. Exact symptom information: "Hard down", intermittent, or "noisy" circuit?
   5. The 2 end points of the circuit
   6. Customer contacts at each end, in case an onsite presence is required
   7. When telco can have the circuit for running "intrusive" testing. If it's "hard down", you will normally release it for testing immediately.

Be sure to get the Telco's trouble ticket #, as you will need it to get call status and updates.

If the call handler doesn't ask you for the two end point addresses of the circuit, ask them to verify them. Sometimes the customer may give you the wrong circuit ID #, or the call handler may make an error in typing it in. Unless you verify the circuit endpoints, Telco may bring down for testing another of your customer's circuits which is perfectly good, or even another customer's circuit!

They will also usually ask whether you will pre-authorize onsite dispatch if needed. Remember, if they go onsite and find no problem with their equipment, they will bill the customer for their time. If you do not pre-authorize this, they will call you back with their findings before going onsite. Unless you are absolutely positive that the problem is in the Telco circuit, you probably don't want to pre-authorize onsite dispatch.

For instance, there may be no one available to run DTE tests at the remote end, or even to verify power. If the circuit goes down, it may be easier to have telco check the circuit remotely prior to sending someone (either the customer or the local FSE) to the remote site to perform DTE testing.
The Telco check is free; but if they don't find a problem, it may be more cost- and time-effective to send a Field Engineer onsite to check the site than to have telco go onsite, find no problem, bill the customer, and then *still* need to dispatch an FSE.

WORKING WITH TELCO.
If telco finds no problem, but you still believe the problem is in the circuit, request that Telco insert loopbacks into their circuits at different demarcation points, and run your data test against these loopbacks. Have them place a loopback as close to you as possible (perhaps at the smartjack, if present), then step the loopback out as each section is successfully tested, testing more of the circuit as you go. You can also put the DSU/CSU into loopback towards the circuit, and let telco run test patterns against the loopback.

Note: sometimes, a Telco's "first level" support may not know how to insert these loopbacks in a "passive" mode for you to test against, without also running their own diagnostic patterns. If this is the case, request escalation to the next level of support.

TELCO ONSITE.
If no smartjack is present onsite, and Telco's tests run cleanly to the last OCU before the site, but DSU/CSU loop tests fail or are dirty, the problem may be anywhere between the last switching office & the CSU/DSU onsite. In this case, telco will usually request that the customer/FSE replace the DSU/CSU, check site premises wiring, etc. If desired, telco will dispatch to the site to continue troubleshooting. They will usually replace the existing DSU/CSU with one of their own, and if that runs cleanly, they may write a bill for the onsite call.

PREMISES WIRING.
If telco is able to test cleanly to their point of demarcation, and you are able to test cleanly to local loopbacks on the CSU/DSU, then there may be a problem with the site wiring. If the demarcation is the smartjack that the CSU/DSU plugs into, then the wiring from CSU/DSU to smartjack is usually a single cable.
But the demarcation point may be at a punchdown block where the wiring enters the building, and there may be several hundred feet of "private wiring" from there to the CSU/DSU.

ESCALATION.
If you don't feel comfortable with the technical expertise level of the person you're speaking with, request escalation to the next higher level of support. Remember, Telco is a *service* provider, and your customer has paid them money for a service (the circuit). If your customer isn't getting what they are paying for, it's time to escalate.

CONTRACT COMMITMENTS.
Different Telcos offer different service commitments. A lower priced service offering may take 4 hours to call you back after you log the call, with no status updates unless you call in for them, and no automatic escalations. A more expensive offering may offer a 1-hour callback, with hourly status updates from the Telco provider, and automatic escalation for extended outages.

"LOCAL" TELCOS.
Often, the major telco providers who sell the circuits to customers do not "own" the circuit from end to end. They usually only own the major trunks which make up the bulk of the circuit. At each end, however, the circuit from the major Telco's last switching office (POP) to the customer's site is often provided by a local telco company. It's not uncommon for the Telco to say "We can loop clean out to our last office, but take errors when we loop the smartjack on the customer premises. So the problem's in the local Telco's circuit, and we've dispatched a call to them".

Even though the problem may be the "local" Telco's, your service agreement is with the "major" telco, which owns responsibility for resolving the problem. The same escalation options are available: if you don't see a timely response, call back the major Telco and request that they escalate within the local telco.

MAJOR TRUNK OUTAGES.
Sometimes a customer will lose a circuit as the result of a major trunk outage.
A backhoe in Phoenix can accidentally cut through a Fiber-Optic cable carrying thousands of circuits, including your customer's, which runs from Los Angeles to New York. When this happens, all you can do is be patient. The telco is probably rerouting circuits as quickly as they can, and usually cannot give an ETR for individual circuits. Once the circuit is rerouted, there will usually be another brief disruption when the circuit is routed back onto its normal path, the repaired Fiber-Optic cable.

ONGOING PROBLEMS.
Request a "Class 1 circuit check" (different Telco providers may have different names for this check). For this check, the Telco provider will visually or electronically inspect all optionable components in the circuit to verify circuit optioning. Unless you explicitly request this, they will only refer to the circuit work orders to determine optioning. The work orders may not reflect reality. In some cases, Telco may accidentally re-option circuit components (ESF to D4, AMI to B8ZS, etc.), causing circuit problems.

ONSITE MEETS.
If the Telco provider is unable to resolve the problem on their own, or they deny that there is any problem with their circuits, a "meet" may be necessary. This is where all vendors involved meet onsite, often at both ends of a circuit.

CHRONIC OUTAGES.
Most providers have a criterion for designating certain circuit outages as "Chronic". Typically, this is 3 circuit outages within any 30 day period. Once a circuit is designated Chronic, the Telco provider may have high-level escalation options unavailable for normal outages, such as continuous circuit monitoring by a dedicated high-level "chronic team" assigned to the circuit. In order to get a circuit declared "chronic", be ready to provide the Telco with a history of relevant trouble ticket #'s.

RE-ENGINEERING THE CIRCUIT.
As a last resort, for an ongoing chronic circuit which the Telco has been unable to fix after an extended period of time, the customer can demand that the circuit be re-engineered. This means that the entire circuit, from point A to point B, will be installed on entirely different components: channel cards, DACS ports, wiring, everything. The hope is that the untraceable problem will be left behind if all components are new. The danger is that new problems may appear, the sort of "breaking in" problems possible with any newly installed circuit.
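
TRACKING TICKET HISTORY.
Before asking for a "chronic" designation or a re-engineered circuit, it helps to have the ticket history in hand. The following is a minimal sketch, not a Telco tool: it assumes you keep your own record of trouble ticket #'s and outage dates, and it applies the typical criterion quoted earlier (3 outages within any 30 day period). The function name, the ticket numbers, and the reading of "within 30 days" as "less than 30 days between the first and last outage" are all assumptions for illustration; check your provider's actual rule.

```python
from datetime import date, timedelta


def chronic_window(tickets, threshold=3, window_days=30):
    """Return the ticket IDs of the first `threshold` outages that fall
    inside one `window_days` period, or [] if no such run exists.

    `tickets` maps a trouble ticket ID to the date of the outage.
    The defaults (3 outages, 30 days) are the typical figures only;
    individual providers may use different numbers.
    """
    # Sort the outages by date so consecutive entries are closest in time.
    ordered = sorted(tickets.items(), key=lambda item: item[1])
    window = timedelta(days=window_days)

    for i in range(len(ordered) - threshold + 1):
        run = ordered[i:i + threshold]
        # Assumption: "within a 30 day period" means the first and last
        # outage of the run are fewer than 30 days apart.
        if run[-1][1] - run[0][1] < window:
            return [ticket_id for ticket_id, _ in run]
    return []


# Hypothetical ticket history for one circuit.
history = {
    "TT-1": date(1994, 1, 3),
    "TT-2": date(1994, 1, 17),
    "TT-3": date(1994, 1, 28),
    "TT-4": date(1994, 3, 15),
}
print(chronic_window(history))  # prints ['TT-1', 'TT-2', 'TT-3']
```

An empty list means the history on hand does not yet meet the assumed criterion. A non-empty list gives you the ticket #'s to quote when you call.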