Software repairs itself on the go
By Kimberly Patch, Technology Research News
Scientists have been working to make computers
more reliable since the first ones were built.
And as information like medical records and business transactions
is increasingly expected to be always accessible, fault tolerance and
ways to keep data from becoming corrupted have become important subjects
of research efforts around the globe.
Researchers from the Massachusetts Institute of Technology have
devised a way to repair data on-the-fly, allowing a system to keep on
computing after it has sustained an error that would normally require
a restart.
The researchers' method uses fewer computer resources than the
log-rollback-and-replay methods of data repair in use today. The approach
makes this type of fault tolerance less expensive and therefore potentially
available for a wider range of applications.
The challenge in repairing data on-the-fly is finding what needs
fixing. Key to the researchers' automatic data structure repair method
is a relatively simple way to find data errors.
Rather than looking for errors in computer code, the scheme calls
for examining abstract models of data that show relationships between
data objects. Using abstract representations eliminates much of the information-encoding
complexity usually involved, said Martin Rinard, an associate professor
of computer science at the Massachusetts Institute of Technology. Data
structure corruptions become apparent in the model in the same way health
problems like a broken bone or cancerous tissue show up in a graphical
representation of data like a CAT scan or MRI, he said.
The researchers' method includes a set of rules that translates
data structures into an abstract model and another set of rules that uses
the model to express key consistency properties, said Rinard.
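The two rule sets can be pictured with a small sketch. It is illustrative only and is not the researchers' specification language: the block table, the names build_model and check, and the two consistency properties are all assumptions made for the example.

```python
# Hypothetical concrete structure: a tiny block table, loosely file-system-like.
# Block 1 is claimed by two files and block 7 does not exist.
concrete = {
    "num_blocks": 4,
    "files": {"a.txt": [0, 1], "b.txt": [1, 7]},
}


def build_model(c):
    """First rule set (assumed): translate the concrete encoding into sets and relations."""
    blocks = set(range(c["num_blocks"]))
    owns = {(name, b) for name, bs in c["files"].items() for b in bs}
    return {"Block": blocks, "File": set(c["files"]), "owns": owns}


def check(model):
    """Second rule set (assumed): consistency properties stated over the model."""
    violations = []
    # Property 1: every owned block must be a real block.
    for f, b in sorted(model["owns"]):
        if b not in model["Block"]:
            violations.append(("dangling", f, b))
    # Property 2: no block may be owned by more than one file.
    owners = {}
    for f, b in sorted(model["owns"]):
        owners.setdefault(b, []).append(f)
    for b, fs in sorted(owners.items()):
        if len(fs) > 1:
            violations.append(("shared", b, tuple(fs)))
    return violations


violations = check(build_model(concrete))
```

In this sketch the checks never touch the raw encoding. They only ask questions of the sets and the owns relation, which is the sense in which the model hides encoding complexity.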
When the system finds an inconsistency, it carries out a sequence
of actions to fix the problem. The system can put pieces back together,
initialize and insert a missing element, or remove a corrupted part, said
Rinard. The system can also shift the values of variables to make them
self-consistent, he said.
The system can also carry out combinations of repairs, said Rinard.
"The algorithm may remove some corrupted parts, then fill the resulting
hole with newly allocated and initialized data structures," he said.
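A rough sketch of such a combined repair follows. It is not the researchers' algorithm: the block table, the repair function and the four consistency properties it enforces are hypothetical, chosen only to show removal, re-allocation and value adjustment in one pass.

```python
def repair(table):
    """Return a repaired copy of a hypothetical block table.

    Assumed consistency properties: every block index is in range, no block
    is owned by two files, every file owns at least one block, and
    used_count equals the number of owned blocks.
    """
    num_blocks = table["num_blocks"]
    files = {}
    taken = set()
    for name in sorted(table["files"]):
        kept = []
        for b in table["files"][name]:
            # Remove a corrupted part: drop out-of-range or already-owned blocks.
            if 0 <= b < num_blocks and b not in taken:
                taken.add(b)
                kept.append(b)
        files[name] = kept
    free = [b for b in range(num_blocks) if b not in taken]
    for name in sorted(files):
        # Fill the resulting hole with a newly allocated block, if one is free.
        # (If none is free, this sketch leaves the file empty.)
        if not files[name] and free:
            b = free.pop(0)
            taken.add(b)
            files[name].append(b)
    # Shift a variable's value so it agrees with the rest of the structure.
    return {"num_blocks": num_blocks, "files": files, "used_count": len(taken)}


damaged = {
    "num_blocks": 4,
    "files": {"a.txt": [0, 1], "b.txt": [1, 7]},
    "used_count": 9,
}
fixed = repair(damaged)
```

Here b.txt loses both of its bad references and receives a fresh block, so the structure is consistent again even though the file's original contents are not recovered.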
The method's big advantage is that it can keep the system going
even though there are errors that would normally bring it to a halt, said
Rinard.
Computer programs generally assume that their data structures satisfy certain
consistency properties. If the data structures get into a state that violates
those properties, the software may not be able to operate at all, bringing the system
to a halt. "In many cases even a small inconsistency in one part of the
data structures can completely prevent the software from correctly executing
any of the data, even data that is not corrupted," said Rinard.
The researchers' approach is different from the rollback-and-replay
methods in practical use today, said Rinard. These involve logging all
transactions, and, when a failure occurs, rolling back to a consistent
state and replaying the transactions to bring the data up to date.
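For contrast, rollback and replay can be sketched in a few lines. This is a generic illustration rather than any particular database's recovery code; the balance table, the (account, delta) transaction format and the names apply_txn and recover are assumptions.

```python
def apply_txn(state, txn):
    """Apply one hypothetical transaction (account, delta) to a balance table."""
    account, delta = txn
    new_state = dict(state)
    new_state[account] = new_state.get(account, 0) + delta
    return new_state


def recover(checkpoint, log):
    """Roll back to the last consistent checkpoint, then replay the logged transactions."""
    state = dict(checkpoint)
    for txn in log:
        state = apply_txn(state, txn)
    return state


checkpoint = {"alice": 100, "bob": 50}             # last known consistent state
log = [("alice", -30), ("bob", 30), ("alice", 5)]  # transactions logged since then
recovered = recover(checkpoint, log)
```

The cost is visible even at this scale: every transaction must be logged as it happens, and recovery time grows with the length of the log.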
Instead of rolling back when a failure or error occurs, the researchers'
method makes the inconsistent state consistent, said Rinard. This allows
a system to keep running despite errors that would ordinarily bring it
to a halt until a human operator could intervene. "In particular, if a
software error always makes a particular operation fail, the database
can never get past that operation," he said.
The method also decreases the software overhead of failure recovery.
The amount of code devoted to traditional failure recovery schemes is
fairly high -- a large percentage of the code that runs a database is
devoted to recovery, according to Rinard.
Data structure repair has traditionally been applied manually
-- and at a very high development cost -- to systems with extreme reliability
requirements, said Rinard. "One of the goals of our research is to automate
the technique to the extent that it becomes feasible to apply routinely
to all kinds of software," he said.
This is especially apt today, as computer systems rapidly proliferate
throughout the environment, including systems embedded in cars, buildings
and public places, said Rinard. "Many such systems must execute without
any human intervention," he said. "Our research may turn out to [enable]
the systems to detect and repair damage to themselves so that they can
continue to execute successfully without human intervention," he said.
The potential downside of such automation, however, is that an
error might destroy information or otherwise make the repaired data structures
inadequate, said Rinard. In some cases it might be better to let the system
fail, then wait for human intervention to bring it back up, he said.
The researchers' simulations show that in many situations it could
prove practical to fix errors automatically rather than use the more cumbersome
rollback method.
The data structure abstraction method "minimizes the amount of
effort required to apply data structure repair... and makes it much easier
to understand what the repair algorithm will produce," said Rinard. The
simulations also show, surprisingly, that most errors involve relatively
simple repair sequences, he said.
The researchers are currently looking at how the method works
in real-world applications and are working on improving its efficiency,
said Rinard.
Longer-term, the abstract approach could be used to make all kinds
of complex data manipulations much easier, said Rinard.
The method could be ready for practical use in two to four years,
said Rinard. The method can be applied in small steps to currently deployed
systems written in standard programming languages, he said. "It does not
require changes to existing development techniques, and can be applied
initially by a small group working within the context of a much larger
development effort," he said.
Rinard's research colleague was Brian Demsky. They presented the
work at the Association for Computing Machinery (ACM) Workshop on Algorithms
and Architectures for Self-Managing Systems in San Diego on June 11, 2003.
The research was funded by the National Science Foundation (NSF), the
Defense Advanced Research Projects Agency (DARPA) and the Singapore-MIT
Alliance (SMA).
Timeline: 2-4 years
Funding: Government, University
TRN Categories: Operating Systems
Story Type: News
Related Elements: Technical paper, "Automatic Data Structure
Repair for Self-healing Systems," presented at the Association for Computing
Machinery (ACM) Workshop on Algorithms and Architectures for Self-Managing
Systems in San Diego, June 11, 2003, and posted at tesla.hpl.hp.com/self-manage03/Finals/demsky.ps.
January 14/21, 2004