In the post Finishing the Maze – Adding Background Music, I mentioned there were two sorts of sound file that Game Maker could play: sound files (like WAV files, or compressed MP3 files) or MIDI files.
In this aside post, I just want to briefly review the principle of how analogue (continuously varying) sound recordings can be stored as digital files. The material is sourced in part from the OpenLearn units “Crossing the boundary – analogue universe, digital worlds” (in particular the section Crossing the boundary – Sound and music) and “Representing and manipulating data in computers” (in particular the section Representing sound).
Sound and music
Second only to vision, we rely on sound. Music delights us, noises warn us of impending danger, and communication through speech is at the centre of our human lives. We have countless reasons for wanting computers to reach out and take sounds across the boundary.
Sound is another analogue feature of the world. If you cry out, hit a piano key or drop a plate, then you set particles of air shaking – and any ears in the vicinity will interpret this tremor as sound. At first glance, the problem of capturing something as intangible as a vibration and taking it across the boundary seems even more intractable than capturing images. But we all know it can be done – so how is it done?
The best way into the problem is to consider in a little more detail what sound is. Probably the purest sound you can make is by vibrating a tuning fork. As the prongs of the fork vibrate backwards and forwards, particles of air move in sympathy with them. One way to visualise this movement is to draw a graph of how far an air particle moves backwards and forwards (we call this its displacement) as time passes. The graph (showing a typical wave form) will look like this:
Our particle of air moves backwards and forwards in the direction the sound is travelling. As shown in the previous figure, a cycle is the time between adjacent peaks (or troughs), and the number of cycles completed in a fixed time (usually a second) is known as the frequency. The amplitude of the wave (i.e. the maximum displacement of the line in the graph) determines how loud the sound is, while the frequency determines how low- or high-pitched the note sounds to us. Note, though, that the diagram is theoretical; in reality, the amplitude will decrease as the sound fades away.
A sound of high frequency is one that people hear as a high-pitched sound; a sound of low frequency is one that people hear as a low-pitched sound. Sound consists of air vibrations, and it is the rate at which the air vibrates that determines the frequency: a higher vibration rate is a higher frequency. So if the air vibrates at, say, 100 cycles per second then the frequency of the sound is said to be 100 cycles per second. The unit of 1 cycle per second is given the name ‘hertz’, abbreviated to ‘Hz’. Hence a frequency of 100 cycles per second is normally referred to as a frequency of 100 Hz.
Of course, a tuning fork is a very simple instrument, and so makes a very pure sound. Real instruments and real noises are much more complicated than this. An instrument like a clarinet would have a complex waveform, perhaps like the left hand graph (a) below, and the dropped plate would be a formless nightmare like right hand one (b).
Write down a few ideas about how we might go about transforming a waveform into numbers. This is a difficult question, so as a clue, why not look at how numbers may be used to encode images: Subsection 4.3 of the OpenLearn Unit Crossing the boundary – analogue universe, digital worlds.
In a way the answer is similar to the question on how to transform a picture into numbers (see Subsection 4.3 of the OpenLearn Unit Crossing the boundary – analogue universe, digital worlds). We have to find some way to split up the waveform. We split up images by dividing them into very small spaces (pixels). We can split a sound wave up by dividing it into very small time intervals.
What we can do is record what the sound wave is doing at small time intervals. Taking readings like this at time intervals is called sampling. The number of times per second we take a sample is called the sampling rate.
I’ll take the tuning fork example, set an interval of, say, 0.5 seconds and look at the state of the wave every 0.5 seconds, as shown below.
Reading off the amplitude of the wave at every sampling point (marked with dots), gives the following set of numbers:
+9.3, −3.1, −4.1, +8.2, −10.0, +4.0, +4.5
as far as I can judge. Now, if we plot a new graph of the waveform, using just these figures, we get the graph below.
The plateaux at each sample point represent the intervals between samples, where we have no information, and so assume that nothing happens. It looks pretty hopeless, but we’re on the right track.
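As a rough illustration of what "taking readings at time intervals" means, here is a minimal Python sketch that samples a pure sine wave every 0.5 seconds. The frequency (0.7 Hz), the peak amplitude (10.0) and the function name sample_wave are assumptions made up for the example; they are not read off the figure, so the numbers printed will not match the readings listed above.

```python
import math

def sample_wave(frequency_hz, amplitude, interval_s, duration_s):
    """Return (time, displacement) pairs read off a pure sine wave.

    The wave itself is an assumed stand-in for the tuning fork figure.
    """
    count = int(round(duration_s / interval_s)) + 1  # include the reading at t = 0
    return [
        (n * interval_s,
         amplitude * math.sin(2 * math.pi * frequency_hz * n * interval_s))
        for n in range(count)
    ]

# One reading every 0.5 seconds over 3 seconds gives 7 samples.
samples = sample_wave(frequency_hz=0.7, amplitude=10.0, interval_s=0.5, duration_s=3.0)
for t, value in samples:
    print(f"t = {t:.1f} s  displacement = {value:+.1f}")
```

Between one printed reading and the next we know nothing about the wave, which is exactly where the plateaux in the blocky graph come from.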
Self-Assessment Question (SAQ)
How can we improve on the blocky figure shown directly above?
The problem here is similar to one that may be encountered with a digitised bitmapped (pixelated) image. In that case we made the spatial division of the image finer by making the pixel size smaller. In this case we can make the temporal division of the waveform finer by making the sampling interval smaller.
So, let’s decrease the sampling interval by taking a reading of the amplitude every 0.1 second.
Once again, I’ll read the amplitude at each sampling point and plot them to a new graph, which is already starting to look a little bit more like the original waveform.
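To put a rough number on how much closer the finer sampling gets, the sketch below compares the original wave with the "blocky" version, where each reading is simply held until the next one arrives. The wave (0.7 Hz, amplitude 10.0) and the helper names wave_at and hold_error are assumptions for illustration only, and the error measure (mean absolute gap) is just one reasonable choice.

```python
import math

def wave_at(t, frequency_hz=0.7, amplitude=10.0):
    """Displacement of an assumed pure sine wave at time t (seconds)."""
    return amplitude * math.sin(2 * math.pi * frequency_hz * t)

def hold_error(interval_s, duration_s=3.0, substeps=100):
    """Mean absolute gap between the true wave and the held (plateau) value."""
    n_samples = round(duration_s / interval_s)
    total = 0.0
    for n in range(n_samples):
        held = wave_at(n * interval_s)          # the reading taken at this sample point
        for j in range(substeps):               # walk across the plateau that follows it
            t = (n + j / substeps) * interval_s
            total += abs(wave_at(t) - held)
    return total / (n_samples * substeps)

coarse = hold_error(0.5)   # one reading every 0.5 seconds
fine = hold_error(0.1)     # one reading every 0.1 seconds
print(f"average gap at 0.5 s: {coarse:.2f}")
print(f"average gap at 0.1 s: {fine:.2f}")
```

The average gap for the 0.1 second interval comes out smaller than for the 0.5 second interval, which matches what the two graphs suggest by eye.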
So how often must the sound be sampled? There is a rule called the sampling theorem which says that if the frequencies in the sound range from 0 to B Hz then, for a faithful representation, the sound must be sampled at a rate greater than 2B samples per second.
The human ear can detect frequencies in music up to around 20 kHz (that is, 20 000 Hz). What sampling rate is needed for a faithful digital representation of music? What is the time interval between successive samples?
20 kHz is 20 000 Hz, and so the B in the text above the question is 20 000. The sampling theorem therefore says that the music must be sampled at more than 2 × 20 000 samples per second, which is more than 40 000 samples per second.
If 40 000 samples are being taken each second, they must be 1/40 000 seconds apart. This is 0.000025 seconds, which is 0.025 milliseconds (thousandths of a second) or 25 microseconds (millionths of a second).
The answer shows the demands made on a computer if music is to be faithfully represented. Samples of the music must be taken at intervals of less than 25 microseconds. And each of those samples must be stored by the computer.
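The arithmetic in that answer can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function names are made up for the example.

```python
def minimum_sampling_rate(highest_frequency_hz):
    """Rate (samples per second) that the sampling must exceed: 2 x B."""
    return 2 * highest_frequency_hz

def sample_interval_us(sampling_rate_hz):
    """Time between successive samples, in microseconds."""
    return 1_000_000 / sampling_rate_hz

rate = minimum_sampling_rate(20_000)   # B = 20 kHz for music
print(rate)                            # 40000
print(sample_interval_us(rate))        # 25.0
```

Because the rate has to be greater than 40 000 samples per second, the interval between samples has to be a little less than the 25 microseconds printed here.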
If speech is to be represented then the demands can be less stringent, first because the frequency range of the human voice is smaller than that of music (up to only about 12 kHz) and second because speech is recognisable even when its frequency range is quite severely restricted. (For example, some digital telephone systems sample at only 8000 samples per second, thereby cutting out most of the higher-frequency components of the human voice, yet we can make sense of what the speaker on the other end of the phone says, and even recognise their voice.)
Five minutes of music is sampled at 40 000 samples per second, and each sample is encoded into 16 bits (2 bytes). How big will the resulting music file be?
Five minutes of speech is sampled at 8000 samples per second, and each sample is encoded into 16 bits (2 bytes). How big will the resulting speech file be?
5 minutes = 300 seconds. So there are 300 × 40 000 samples. Each sample occupies 2 bytes, making a file size of 300 × 40 000 × 2 bytes, which is 24 000 000 bytes – some 24 megabytes!
A sampling rate of 8000 per second will generate a fifth as many samples as a rate of 40 000 per second. So the speech file will ‘only’ be 4 800 000 bytes.
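The same sums can be written as a small Python function, which makes it easy to try other durations and sampling rates. The function name and the channels parameter are assumptions added for the example; the worked answers above are for a single (mono) channel and ignore any file header.

```python
def pcm_size_bytes(duration_s, sampling_rate_hz, bytes_per_sample, channels=1):
    """Size of the raw sample data: seconds x samples per second x bytes per sample."""
    return duration_s * sampling_rate_hz * bytes_per_sample * channels

five_minutes = 5 * 60  # 300 seconds

music = pcm_size_bytes(five_minutes, 40_000, 2)
speech = pcm_size_bytes(five_minutes, 8_000, 2)

print(music)            # 24000000
print(speech)           # 4800000
print(music // speech)  # 5, i.e. the speech file is a fifth of the size
```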
This process of sampling the waveform is very similar to the breaking up of a picture into pixels, except that, whereas we split the picture into tiny units of area, we are now breaking the waveform into units of time. In the case of the picture, making our pixels smaller increased the quality of the result; in the same way, making the time intervals at which we sample the waveform smaller will bring our encoding closer to the original sound. And just as it is impossible to make a perfect digital coding of an analogue picture, because we will always lose information between the pixels, so we will always lose information between the times we sample a waveform. We can never make a perfect digital representation of an analogue quantity.
Now we’ve sampled the waveform, what do we need to do next to encode the audio signal?
Remember that after we had divided an image into pixels, we then mapped each pixel to a number. We need to carry out the same process in the case of the waveform.
This mapping of samples (or pixels) to numbers is known as quantisation. Again, the faithfulness of the digital copy to the analogue original will depend on how large a range of numbers we make available. [If “8-bit” sampling is used, 256 different amplitudes can be measured. That is: 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 256 different levels.]
The human eye is an immensely discriminating instrument; the ear is less so. We are not generally able to detect pitch differences of less than a few hertz (1 hertz (Hz) is a frequency of one cycle per second). So sound wave samples are generally mapped to 16-bit numbers.
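Here is a minimal sketch of quantisation in Python, applied to the seven readings from the tuning fork example. It assumes the readings lie between −10.0 and +10.0 and maps that range evenly onto whole-number levels starting at 0; the function name quantise, the full_scale parameter and the even spacing are choices made for this example, not a description of how any particular audio format does it.

```python
def quantise(value, bits, full_scale=10.0):
    """Map a reading in [-full_scale, +full_scale] to a level 0 .. 2**bits - 1."""
    levels = 2 ** bits                                       # 256 for 8 bits, 65536 for 16
    clipped = max(-full_scale, min(full_scale, value))       # keep the reading in range
    fraction = (clipped + full_scale) / (2 * full_scale)     # 0.0 .. 1.0
    return min(levels - 1, int(fraction * levels))           # the top reading maps to the top level

readings = [+9.3, -3.1, -4.1, +8.2, -10.0, +4.0, +4.5]

print(2 ** 8, 2 ** 16)                      # 256 65536
print([quantise(r, 8) for r in readings])   # [247, 88, 75, 232, 0, 179, 185]
print([quantise(r, 16) for r in readings])
```

With 8 bits, each level covers a band of amplitudes roughly 0.08 wide on this scale; with 16 bits the bands are 256 times narrower, so much less is lost when a reading is rounded to its level.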
Copyright OpenLearn/The Open University, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Licence.
The WAV encoding that Game Maker can play back is based on the above principles. WAV files can be recorded at 8-bit, 16-bit or 24-bit resolution, using a sampling rate set between 8000 Hz and 48 000 Hz. If you calculate some file sizes for audio clips of different lengths at a variety of sampling rates and quantisation levels, you will see that WAV audio files can be quite big, even for short audio clips.
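If you want to see sampling and quantisation come together in an actual WAV file, Python's standard wave module can write one. The sketch below builds a one second, mono, 16-bit file in memory at 8000 samples per second; the 440 Hz tone and the function name tone_as_wav_bytes are assumptions chosen for the example, and it has nothing to do with how Game Maker itself reads the file.

```python
import io
import math
import struct
import wave

def tone_as_wav_bytes(frequency_hz=440.0, duration_s=1.0, sampling_rate_hz=8000):
    """Build a mono, 16-bit WAV file in memory holding a pure tone."""
    n_samples = int(duration_s * sampling_rate_hz)
    frames = bytearray()
    for n in range(n_samples):
        # Sampling: read the wave at time n / sampling_rate_hz.
        displacement = math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        # Quantisation: map -1.0 .. +1.0 to a signed 16-bit integer (little-endian).
        frames += struct.pack("<h", int(displacement * 32767))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)                 # mono
        wav_file.setsampwidth(2)                 # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate_hz)  # samples per second
        wav_file.writeframes(bytes(frames))
    return buffer.getvalue()

data = tone_as_wav_bytes()
print(len(data), "bytes for one second of sound")
```

One second at 8000 samples per second and 2 bytes per sample is 16 000 bytes of sample data, and the file is slightly larger than that because it also carries a short header describing the sampling rate and sample size. To save it to disk, write data to a file opened in binary mode.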
The MP3 format uses a similar approach to digitise the sound file in the first instance, but then reduces the size of the digital file by using a compression technique to encode the file again in another way. Compression effectively squashes the file size down so that it becomes smaller, but we shall not consider that here.