Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
In our earlier episodes, when it came to learning techniques, we almost always talked about
supervised learning.
This means that we give the algorithm a bunch of images, and some additional information,
for instance, that these images depict dogs or cats.
Then, the learning algorithm is exposed to new images that it had never seen before and
has to be able to classify them correctly.
It is kind of like a teacher sitting next to a student, providing supervision.
Then, the exam comes with new questions.
This is supervised learning, and as you have seen from more than 180 episodes of Two Minute
Papers, there is no doubt that it is an enormously successful field of research.
However, this means that we have to label our datasets, so we have to add some additional
information to every image we have.
This is a very laborious task, which is typically performed by researchers or through crowdsourcing,
both of which take a lot of funding and hundreds of work hours.
But if we think about it, we have a ton of videos on the internet; you always hear these
mind-melting new statistics about how many hours of video footage are uploaded to YouTube every
day.
Of course, we could hire all the employees in the world to annotate these videos frame
by frame to tell the algorithm that this is a guitar, this is an accordion, or a keyboard,
and we would still not be able to learn on most of what's uploaded.
But it would be so great to have an algorithm that can learn on unlabeled data.
Fortunately, there are learning techniques in the field of unsupervised learning, which
means that the algorithm is given a bunch of images, or any other media, and is instructed
to learn from it without any additional information.
There is no teacher to supervise the learning.
The algorithm learns by itself.
And in this work, the objective is to learn both visual and audio-related tasks in an
unsupervised manner.
So for instance, if we look at this layer of the visual subnetwork, we'll find neurons
that get very excited when they see, for instance, someone playing an accordion.
And each of the neurons in this layer corresponds to a different object class.
I surely have something like this for papers.
And here comes the Károly goes crazy part one: this technique not only classifies the
frames of the videos, but it also creates semantic heatmaps, which show us which part
of the image is responsible for the sounds that we hear.
This is insanity!
To accomplish this, they run a vision subnetwork on the video part and a separate audio subnetwork
to learn about the sounds, and at the last step, all this information is fused together.
And here comes the Károly goes crazy part two: this makes the network able to guess whether the
audio and the video stream correspond to each other.
It looks at a man with a fiddle, listens to a sound clip and will say whether the two
correspond to each other.
Wow!
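For the technically curious, here is a minimal sketch of that idea, assuming a standard two-stream setup in PyTorch: one small convolutional network embeds a video frame, another embeds an audio spectrogram, and a fusion head guesses whether the two belong together. The layer sizes and the CorrespondenceNet name are illustrative placeholders, not the exact architecture from the paper.

```python
# Hypothetical sketch of audio-visual correspondence learning.
# Not the paper's exact architecture; layer sizes are illustrative.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class CorrespondenceNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Vision subnetwork: embeds an RGB video frame.
        self.vision = nn.Sequential(
            conv_block(3, 32), conv_block(32, 64), conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, embed_dim),
        )
        # Audio subnetwork: embeds a 1-channel log-spectrogram.
        self.audio = nn.Sequential(
            conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, embed_dim),
        )
        # Fusion head: do this frame and this sound belong together?
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2),  # logits for "correspond" / "do not correspond"
        )

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.fusion(torch.cat([v, a], dim=1))

# The training signal needs no human labels: a positive pair is a frame and
# the audio from the same moment of the same video, while a negative pair
# mixes a frame with audio taken from a different video.
model = CorrespondenceNet()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 1, 128, 128))
```

The key design point is that the "labels" come for free from how the pairs are sampled, which is exactly why no annotation of the videos is needed.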
The audio subnetwork also learned the concept of human voices, the sound of water, wind,
music, live concerts, and much, much more.
And in case you are wondering whether it is any good at recognizing these sounds: yes, it is
remarkably close to human-level performance on sound classification.
And all of this is provided by two networks that were trained from scratch, and no supervision
is required.
We don't need to annotate these videos.
Nailed it.
And please don't get me wrong, it's not like DeepMind has suddenly invented unsupervised
learning, not at all.
This is a field that has been actively researched for decades; it's just that we rarely see
really punchy results like these.
Truly incredible work.
If you enjoyed this episode, and you feel that 8 of these videos a month is worth a
dollar, please consider supporting us on Patreon.
Details are available in the video description.
Thanks for watching and for your generous support, and I'll see you next time!